<?xml version="1.0"?>
<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en">
		<id>http://kb.linux-vs.org/api.php?action=feedcontributions&amp;feedformat=atom&amp;user=ZaphodB</id>
		<title>LVSKB - User contributions [en]</title>
		<link rel="self" type="application/atom+xml" href="http://kb.linux-vs.org/api.php?action=feedcontributions&amp;feedformat=atom&amp;user=ZaphodB"/>
		<link rel="alternate" type="text/html" href="http://kb.linux-vs.org/wiki/Special:Contributions/ZaphodB"/>
		<updated>2026-04-19T04:29:47Z</updated>
		<subtitle>User contributions</subtitle>
		<generator>MediaWiki 1.26.2</generator>

	<entry>
		<id>http://kb.linux-vs.org/wiki?title=Talk:IPv6_load_balancing&amp;diff=44407</id>
		<title>Talk:IPv6 load balancing</title>
		<link rel="alternate" type="text/html" href="http://kb.linux-vs.org/wiki?title=Talk:IPv6_load_balancing&amp;diff=44407"/>
				<updated>2017-09-20T17:34:33Z</updated>
		
		<summary type="html">&lt;p&gt;ZaphodB: Created page with &amp;quot;I found the service check and most likely your ICMPv6 issue as well to be caused by the way IPv6 source address selection works on linux. [http://www.davidc.net/networking/ipv...&amp;quot;&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;I found the service check issue, and most likely your ICMPv6 issue as well, to be caused by the way IPv6 source address selection works on Linux. [http://www.davidc.net/networking/ipv6-source-address-selection-linux] Basically, whenever heartbeat adds a service IP you cannot check IPVS/DR realservers anymore, because Linux will use the most recently added address unless you mark it as deprecated. -- [[ZaphodB]]&lt;/div&gt;</summary>
		<author><name>ZaphodB</name></author>	</entry>

	<entry>
		<id>http://kb.linux-vs.org/wiki?title=Building_Scalable_DNS_Cluster_using_LVS&amp;diff=5783</id>
		<title>Building Scalable DNS Cluster using LVS</title>
		<link rel="alternate" type="text/html" href="http://kb.linux-vs.org/wiki?title=Building_Scalable_DNS_Cluster_using_LVS&amp;diff=5783"/>
				<updated>2008-10-30T17:57:51Z</updated>
		
		<summary type="html">&lt;p&gt;ZaphodB: /* [http://doc.powerdns.com/built-in-recursor.html PowerDNS recursor] */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Introduction ==&lt;br /&gt;
&lt;br /&gt;
DNS (Domain Name System) is one of the primary Internet services; it maps human-friendly domain names to machine-friendly IP addresses. If many people use the DNS service (for example, subscribers using their ISP's DNS servers), a single DNS server can become a bottleneck and a single point of failure.&lt;br /&gt;
&lt;br /&gt;
A scalable DNS cluster can provide both scalability and high availability for the DNS service.&lt;br /&gt;
&lt;br /&gt;
The example below is about setting up a cluster for recursive DNS, but you can use the same method for authoritative DNS as well. Just remember that clients who use your cluster as a secondary nameservice would need to also-notify{} each of your realservers, not just the service-IP.&lt;br /&gt;
&lt;br /&gt;
== Architecture ==&lt;br /&gt;
&lt;br /&gt;
DNS is a simple service: there is no affinity between requests from the same client. DNS servers usually listen for queries on UDP port 53 and TCP port 53.&lt;br /&gt;
&lt;br /&gt;
LVS can simply load balance UDP port 53 and TCP port 53 among a set of DNS servers, and there is no need to set up any persistence options.&lt;br /&gt;
&lt;br /&gt;
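For illustration only (not part of the original setup), the same forwarding can be set up by hand with ipvsadm; a minimal sketch for direct routing, reusing the addresses from the example below:&lt;br /&gt;
 # sketch: UDP and TCP virtual services on the VIP, weighted round robin&lt;br /&gt;
 ipvsadm -A -u 194.97.173.124:53 -s wrr&lt;br /&gt;
 ipvsadm -A -t 194.97.173.124:53 -s wrr&lt;br /&gt;
 # one realserver per service, direct routing (-g), weight 1&lt;br /&gt;
 ipvsadm -a -u 194.97.173.124:53 -r 10.1.53.2:53 -g -w 1&lt;br /&gt;
 ipvsadm -a -t 194.97.173.124:53 -r 10.1.53.2:53 -g -w 1&lt;br /&gt;
&lt;br /&gt;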
== Configuration Example ==&lt;br /&gt;
&lt;br /&gt;
keepalived.conf:&lt;br /&gt;
 ! Balancer-Set for udp/53&lt;br /&gt;
 virtual_server 194.97.173.124 53 {&lt;br /&gt;
    delay_loop 10&lt;br /&gt;
    lb_algo wrr&lt;br /&gt;
    lb_kind DR&lt;br /&gt;
    protocol UDP&lt;br /&gt;
    ! persistence_timeout 1&lt;br /&gt;
    ! persistence_granularity 255.255.255.255&lt;br /&gt;
    ! eth1.105 -&amp;gt; kai eth1.105&lt;br /&gt;
    real_server 10.1.53.2 53 {&lt;br /&gt;
        weight 1&lt;br /&gt;
        MISC_CHECK {&lt;br /&gt;
            misc_path &amp;quot;/usr/bin/dig -b 10.1.53.1 a resolve.test.roka.net @10.1.53.2 +time=1 +tries=5 +fail &amp;gt; /dev/null&amp;quot;&lt;br /&gt;
            misc_timeout 6&lt;br /&gt;
        }&lt;br /&gt;
    }&lt;br /&gt;
    ! eth1.109 -&amp;gt; kai eth1.109&lt;br /&gt;
    real_server 10.3.53.2 53 {&lt;br /&gt;
        weight 1&lt;br /&gt;
        MISC_CHECK {&lt;br /&gt;
            misc_path &amp;quot;/usr/bin/dig -b 10.3.53.1 a resolve.test.roka.net @10.3.53.2 +time=1 +tries=5 +fail &amp;gt; /dev/null&amp;quot;&lt;br /&gt;
            misc_timeout 6&lt;br /&gt;
        }&lt;br /&gt;
    }&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
As you can dig (;-) we are using an A record with a low TTL to test the service, since this setup is a recursive DNS cluster. So far dig works fine with 44 real_servers configured on an otherwise idle Dual PIII 800.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
On real_server kai we use the following netfilter setup to be able to direct the traffic to different BIND processes on the same machine/MAC:&lt;br /&gt;
 #DNAT 194.97.173.124-&amp;gt;10.1.53.2 eth1.105&lt;br /&gt;
 iptables -t nat -A PREROUTING -i eth1.105 -s $net -d 194.97.173.124 -p tcp --dport 53 -j DNAT --to-destination 10.1.53.2:53&lt;br /&gt;
 iptables -t nat -A PREROUTING -i eth1.105 -s $net -d 194.97.173.124 -p udp --dport 53 -j DNAT --to-destination 10.1.53.2:53&lt;br /&gt;
 #DNAT 194.97.173.124-&amp;gt;10.3.53.2 eth1.109&lt;br /&gt;
 iptables -t nat -A PREROUTING -i eth1.109 -s $net -d 194.97.173.124 -p tcp --dport 53 -j DNAT --to-destination 10.3.53.2:53&lt;br /&gt;
 iptables -t nat -A PREROUTING -i eth1.109 -s $net -d 194.97.173.124 -p udp --dport 53 -j DNAT --to-destination 10.3.53.2:53&lt;br /&gt;
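&lt;br /&gt;
For comparison (not used in this setup, which needs the DNAT rules above to split traffic between two BIND processes), the usual LVS-DR realserver configuration is to put the VIP on a non-ARPing interface instead; a rough sketch, assuming a 2.6 kernel with the arp_ignore/arp_announce sysctls:&lt;br /&gt;
 # sketch: accept packets for the VIP locally without answering ARP for it&lt;br /&gt;
 ip addr add 194.97.173.124/32 dev lo label lo:0&lt;br /&gt;
 sysctl -w net.ipv4.conf.all.arp_ignore=1&lt;br /&gt;
 sysctl -w net.ipv4.conf.all.arp_announce=2&lt;br /&gt;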
&lt;br /&gt;
=== [http://www.isc.org/index.pl?/sw/bind/ BIND9] ===&lt;br /&gt;
&lt;br /&gt;
When I wrote this example we were using two BIND processes on the same machine, because BIND9 currently just runs faster when it is not threading. Here is something JINMEI Tatuya told me on the bind9-workers mailing list, which turned out to be very true:&lt;br /&gt;
 If you go with disabling threads, you may also want to enable&lt;br /&gt;
 &amp;quot;internal memory allocation&amp;quot;.  (I hear that) it should use memory more&lt;br /&gt;
 efficiently (and can make the server faster) but is disabled by&lt;br /&gt;
 default due to response-performance reasons in the threaded case.  You&lt;br /&gt;
 can enable this feature by adding the following line&lt;br /&gt;
&lt;br /&gt;
 #define ISC_MEM_USE_INTERNAL_MALLOC 1&lt;br /&gt;
&lt;br /&gt;
 just before the following part of bind9/lib/isc/mem.c:&lt;br /&gt;
&lt;br /&gt;
 #ifndef ISC_MEM_USE_INTERNAL_MALLOC&lt;br /&gt;
 #define ISC_MEM_USE_INTERNAL_MALLOC 0&lt;br /&gt;
 #endif&lt;br /&gt;
Try it and you will keep it. ;) &lt;br /&gt;
&lt;br /&gt;
The BIND 9.4 line makes use of this internal malloc by default now, but disabling threading will probably still free you from the hiccups some BIND9 users are experiencing.&lt;br /&gt;
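&lt;br /&gt;
As a rough sketch of the whole procedure (the --disable-threads switch is assumed to exist in your BIND9 version's configure script, and the source directory name is a placeholder):&lt;br /&gt;
 # add the #define described above near the top of lib/isc/mem.c, then:&lt;br /&gt;
 cd bind-9.x.y&lt;br /&gt;
 ./configure --disable-threads&lt;br /&gt;
 make&lt;br /&gt;
 make install&lt;br /&gt;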
&lt;br /&gt;
=== [http://doc.powerdns.com/built-in-recursor.html PowerDNS recursor] === &lt;br /&gt;
&lt;br /&gt;
This one is a recursive-only nameserver with very limited authoritative DNS capabilities. The author of this example now uses [http://doc.powerdns.com/built-in-recursor.html PowerDNS recursor] exclusively for his caching-only DNS cluster and is glad that, while giving roughly the same queries-per-second performance, it generates fewer SERVFAIL answers and is generally several times more robust than BIND9.&lt;br /&gt;
&lt;br /&gt;
=== Added redundancy via iBGP ===&lt;br /&gt;
&lt;br /&gt;
If you have more than one load balancer at different locations, and you can convince your local network engineer to let you speak BGP4+ to their routers, you can use Quagga with something like the following configuration to fail the service IP over to the second LB if the first one goes down:&lt;br /&gt;
 !&lt;br /&gt;
 router bgp 5430&lt;br /&gt;
  no synchronization&lt;br /&gt;
  bgp router-id a.b.c.d&lt;br /&gt;
  redistribute connected route-map benice&lt;br /&gt;
  neighbor c.d.e.f remote-as 5430&lt;br /&gt;
  neighbor c.d.e.f description ffm4-j2&lt;br /&gt;
  neighbor c.d.e.f send-community both&lt;br /&gt;
  neighbor c.d.e.f soft-reconfiguration inbound&lt;br /&gt;
  neighbor c.d.e.f route-map nixda in&lt;br /&gt;
  neighbor c.d.e.f route-map benice out&lt;br /&gt;
  neighbor d.c.f.e remote-as 5430&lt;br /&gt;
  neighbor d.c.f.e description ffm4-j&lt;br /&gt;
  neighbor d.c.f.e send-community both&lt;br /&gt;
  neighbor d.c.f.e soft-reconfiguration inbound&lt;br /&gt;
  neighbor d.c.f.e route-map nixda in&lt;br /&gt;
  neighbor d.c.f.e route-map benice out&lt;br /&gt;
  no auto-summary&lt;br /&gt;
 !&lt;br /&gt;
 access-list line permit 127.0.0.1/32 exact-match&lt;br /&gt;
 access-list line deny any&lt;br /&gt;
 !&lt;br /&gt;
 ip prefix-list cns-dus2 description dus2 high-metric eq low-preference&lt;br /&gt;
 ip prefix-list cns-dus2 seq 5 permit 194.97.173.125/32&lt;br /&gt;
 ip prefix-list cns-dus2 seq 10 deny any&lt;br /&gt;
 ip prefix-list cns-ffm4 description ffm4 low-metric eq high-preference&lt;br /&gt;
 ip prefix-list cns-ffm4 seq 5 permit 194.97.173.124/32&lt;br /&gt;
 ip prefix-list cns-ffm4 seq 10 deny any&lt;br /&gt;
 !&lt;br /&gt;
 route-map benice permit 10&lt;br /&gt;
  match ip address prefix-list cns-ffm4&lt;br /&gt;
  set local-preference 100&lt;br /&gt;
  set metric 0&lt;br /&gt;
 !&lt;br /&gt;
 route-map benice permit 20&lt;br /&gt;
  match ip address prefix-list cns-dus2&lt;br /&gt;
  set local-preference 100&lt;br /&gt;
  set metric 1&lt;br /&gt;
 !&lt;br /&gt;
 route-map nixda deny 10&lt;br /&gt;
 !&lt;br /&gt;
This is the LB at FFM4. Note that the metric at the DUS2 LB is just the other way around.&lt;br /&gt;
Here we talk to two core routers from each LB for extra redundancy.&lt;br /&gt;
You can also have an internal anycast ServiceIP if you use the same metric at both LBs and make sure they are attached to the same level of the router hierarchy, network-topology-wise. This way traffic gets shared between the two load balancers according to your network topology, which is of course most interesting for large dial-in ISPs.&lt;br /&gt;
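&lt;br /&gt;
To illustrate the &amp;quot;other way around&amp;quot; remark, the benice route-map on the DUS2 LB would look roughly like this (a sketch derived from the configuration above, not taken from the original setup):&lt;br /&gt;
 route-map benice permit 10&lt;br /&gt;
  match ip address prefix-list cns-dus2&lt;br /&gt;
  set local-preference 100&lt;br /&gt;
  set metric 0&lt;br /&gt;
 !&lt;br /&gt;
 route-map benice permit 20&lt;br /&gt;
  match ip address prefix-list cns-ffm4&lt;br /&gt;
  set local-preference 100&lt;br /&gt;
  set metric 1&lt;br /&gt;
 !&lt;br /&gt;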
&lt;br /&gt;
=== Problem ===&lt;br /&gt;
&lt;br /&gt;
dig does not return a non-zero exit code when it receives a SERVFAIL, but there are situations in which some BIND9 versions return SERVFAIL for every query, for example when they are out of memory. In a recursive DNS cluster we want to take such BIND processes out of service.&lt;br /&gt;
&lt;br /&gt;
==== Workaround ====&lt;br /&gt;
&lt;br /&gt;
Use the following Perl script as a wrapper for dig. It is quite ugly: Perl is an interpreted language and forking it is not much fun, so this consumes a lot of user CPU when executed every 6 seconds.&lt;br /&gt;
 #!/usr/bin/perl&lt;br /&gt;
 use strict;&lt;br /&gt;
 use warnings;&lt;br /&gt;
 # cmdline arguments: &amp;lt;FromIP&amp;gt; &amp;lt;Class&amp;gt; &amp;lt;QTYPE&amp;gt; &amp;lt;QNAME&amp;gt; &amp;lt;ToIP&amp;gt; &amp;lt;Times&amp;gt; &amp;lt;Tries&amp;gt; &amp;lt;ErrorMatch&amp;gt; &amp;lt;Transport&amp;gt;&lt;br /&gt;
 # /usr/bin/dig -b 10.5.53.1 IN A 2.0.0.127.my.test @10.5.53.2 +time=1 +tries=5 +fail&lt;br /&gt;
 if(&lt;br /&gt;
        ((defined $ARGV[0])&amp;amp;&amp;amp;($ARGV[0]=~/^\d+\.\d+\.\d+\.\d+$/))&lt;br /&gt;
        &amp;amp;&amp;amp;((defined $ARGV[1])&amp;amp;&amp;amp;($ARGV[1]=~/^(IN|CHAOS)$/))&lt;br /&gt;
        &amp;amp;&amp;amp;((defined $ARGV[2])&amp;amp;&amp;amp;($ARGV[2]=~/^(A|ANY|MX|PTR|SRV|TXT|AAAA|NS|CNAME|SOA)$/))&lt;br /&gt;
        &amp;amp;&amp;amp;((defined $ARGV[3])&amp;amp;&amp;amp;($ARGV[3]=~/^[A-Za-z0-9\-\.]+$/))&lt;br /&gt;
        &amp;amp;&amp;amp;((defined $ARGV[4])&amp;amp;&amp;amp;($ARGV[4]=~/^\d+\.\d+\.\d+\.\d+$/))&lt;br /&gt;
        &amp;amp;&amp;amp;((defined $ARGV[5])&amp;amp;&amp;amp;($ARGV[5]=~/^\d+$/))&lt;br /&gt;
        &amp;amp;&amp;amp;((defined $ARGV[6])&amp;amp;&amp;amp;($ARGV[6]=~/^\d+$/))&lt;br /&gt;
        &amp;amp;&amp;amp;((defined $ARGV[7])&amp;amp;&amp;amp;($ARGV[7]=~/^\S+$/))&lt;br /&gt;
        ) {&lt;br /&gt;
        my $transport=&amp;quot;notcp&amp;quot;;&lt;br /&gt;
        if((defined $ARGV[8])&amp;amp;&amp;amp;($ARGV[8]=~/^tcp$/i)) {&lt;br /&gt;
                $transport=&amp;quot;tcp&amp;quot;;&lt;br /&gt;
        } elsif ((defined $ARGV[8])&amp;amp;&amp;amp;($ARGV[8]=~/^udp$/i)) {&lt;br /&gt;
                $transport=&amp;quot;notcp&amp;quot;;&lt;br /&gt;
        }&lt;br /&gt;
        my (@res)=`/usr/bin/dig -b $ARGV[0] $ARGV[1] $ARGV[2] $ARGV[3] \@$ARGV[4] +time=$ARGV[5] +tries=$ARGV[6] +fail +$transport 2&amp;gt;&amp;amp;1`;&lt;br /&gt;
        my $return=$?;&lt;br /&gt;
        if(my $error=(map {/status:\s*($ARGV[7])/ ? $1 : ()} @res)[0]) {&lt;br /&gt;
                die(&amp;quot;$error&amp;quot;);&lt;br /&gt;
        } elsif ($return!=0) {&lt;br /&gt;
                die(&amp;quot;dig returned: \&amp;quot;$return\&amp;quot;&amp;quot;);&lt;br /&gt;
        } elsif ($return==0) {&lt;br /&gt;
                exit 0;&lt;br /&gt;
        } else {&lt;br /&gt;
                die(&amp;quot;error: \&amp;quot;$return\&amp;quot; HAS BAD VALUE!&amp;quot;);&lt;br /&gt;
        }&lt;br /&gt;
 } else {&lt;br /&gt;
        die(&amp;quot;dig-wrapper.pl &amp;lt;FromIP&amp;gt; &amp;lt;Class&amp;gt; &amp;lt;QTYPE&amp;gt; &amp;lt;QNAME&amp;gt; &amp;lt;ToIP&amp;gt; &amp;lt;Times&amp;gt; &amp;lt;Tries&amp;gt; &amp;lt;ErrorMatch&amp;gt; &amp;lt;Transport&amp;gt;&amp;quot;);&lt;br /&gt;
 }&lt;br /&gt;
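The corresponding MISC_CHECK entry would then call the wrapper instead of dig directly; a hypothetical example for the first realserver (the install path and the SERVFAIL error pattern are assumptions):&lt;br /&gt;
         MISC_CHECK {&lt;br /&gt;
             misc_path &amp;quot;/usr/local/bin/dig-wrapper.pl 10.1.53.1 IN A resolve.test.roka.net 10.1.53.2 1 5 SERVFAIL udp&amp;quot;&lt;br /&gt;
             misc_timeout 6&lt;br /&gt;
         }&lt;br /&gt;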
Ah yes, forgot to say: the Dual PIII 800 is not idling around anymore - it's busy running this script 44 times every 6 seconds, which accounts for roughly 12% user CPU and 5% system at a query rate of ~3600 q/s.&lt;br /&gt;
&lt;br /&gt;
==== Solution ====&lt;br /&gt;
&lt;br /&gt;
Use a patched version of dig that returns a non-zero exit code on SERVFAIL?&lt;br /&gt;
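&lt;br /&gt;
Alternatively, a plain shell wrapper can check the status field without the Perl overhead; a minimal sketch reusing the first realserver's check from the configuration above:&lt;br /&gt;
 #!/bin/sh&lt;br /&gt;
 # fail when dig itself errors out (timeout etc.) or the answer status is not NOERROR&lt;br /&gt;
 OUT=`/usr/bin/dig -b 10.1.53.1 a resolve.test.roka.net @10.1.53.2 +time=1 +tries=5 +fail` || exit 1&lt;br /&gt;
 echo &amp;quot;$OUT&amp;quot; | grep -q 'status: NOERROR' || exit 1&lt;br /&gt;
 exit 0&lt;br /&gt;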
&lt;br /&gt;
== Conclusion ==&lt;br /&gt;
&lt;br /&gt;
It still just works.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{{lvs-example-stub}}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[Category:LVS Examples|DNS]]&lt;/div&gt;</summary>
		<author><name>ZaphodB</name></author>	</entry>

	<entry>
		<id>http://kb.linux-vs.org/wiki?title=Talk:Building_Scalable_DNS_Cluster_using_LVS&amp;diff=5756</id>
		<title>Talk:Building Scalable DNS Cluster using LVS</title>
		<link rel="alternate" type="text/html" href="http://kb.linux-vs.org/wiki?title=Talk:Building_Scalable_DNS_Cluster_using_LVS&amp;diff=5756"/>
				<updated>2008-01-31T16:27:02Z</updated>
		
		<summary type="html">&lt;p&gt;ZaphodB: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Hi!&lt;br /&gt;
What about the storage of zone files? Where were they located? Thanks&lt;br /&gt;
&lt;br /&gt;
: Each DNS server can keep a copy of the zone files and changes can be synchronized through some tool, or each DNS server can access a shared network file system for the zone files. --[[User:Wensong|Wensong]] 23:15, 6 June 2007 (CST)&lt;br /&gt;
: Right, my example is from a recursive DNS (caching-only) setup. You can do just the same for authoritative DNS, but then you need to keep your realservers in sync, or you would end up with a VIP/ServiceIP telling you different things each time for the same request, since we used lb_algo wrr (weighted round robin) here. --[[User:ZaphodB|ZaphodB]] 17:26, 31 January 2008 (CET)&lt;/div&gt;</summary>
		<author><name>ZaphodB</name></author>	</entry>

	<entry>
		<id>http://kb.linux-vs.org/wiki?title=Building_Scalable_DNS_Cluster_using_LVS&amp;diff=4212</id>
		<title>Building Scalable DNS Cluster using LVS</title>
		<link rel="alternate" type="text/html" href="http://kb.linux-vs.org/wiki?title=Building_Scalable_DNS_Cluster_using_LVS&amp;diff=4212"/>
				<updated>2007-07-04T12:46:12Z</updated>
		
		<summary type="html">&lt;p&gt;ZaphodB: /* BIND9 */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Introduction ==&lt;br /&gt;
&lt;br /&gt;
DNS (Domain Name Service) is one of the primary Internet services, which is to map human-friendly domain names to machine-friendly IP address. If there are a lot of people using DNS service (for example, subscribers use ISP's DNS server), one DNS server might be becoming a bottleneck, and the server might fail.&lt;br /&gt;
&lt;br /&gt;
Scalable DNS cluster can help provide scalability and availability of DNS service.&lt;br /&gt;
&lt;br /&gt;
The Example below is about setting up a cluster for recursive DNS but you can just as well use the same method for authorative DNS as well. Just remember that clients who use your cluster as a secondary nameservice would need to also-notify{} each of your realservers, not just the service-IP.&lt;br /&gt;
&lt;br /&gt;
== Architecture ==&lt;br /&gt;
&lt;br /&gt;
DNS is a simple service, there is no affinity between requests from the same client. DNS usually listens for queries at UDP port 53 and TCP port 53.&lt;br /&gt;
&lt;br /&gt;
LVS can simply load balance UDP port 53 and TCP port 53 among a set of DNS servers, and there is no need to setup any persistence options.&lt;br /&gt;
&lt;br /&gt;
== Configuration Example ==&lt;br /&gt;
&lt;br /&gt;
keepalived.conf:&lt;br /&gt;
 ! Balancer-Set for udp/53&lt;br /&gt;
 virtual_server 194.97.173.124 53 {&lt;br /&gt;
    delay_loop 10&lt;br /&gt;
    lb_algo wrr&lt;br /&gt;
    lb_kind DR&lt;br /&gt;
    protocol UDP&lt;br /&gt;
    ! persistence_timeout 1&lt;br /&gt;
    ! persistence_granularity 255.255.255.255&lt;br /&gt;
    ! eth1.105 -&amp;gt; kai eth1.105&lt;br /&gt;
    real_server 10.1.53.2 53 {&lt;br /&gt;
        weight 1&lt;br /&gt;
        MISC_CHECK {&lt;br /&gt;
            misc_path &amp;quot;/usr/bin/dig -b 10.1.53.1 a resolve.test.roka.net @10.1.53.2 +time=1 +tries=5 +fail &amp;gt; /dev/null&amp;quot;&lt;br /&gt;
            misc_timeout 6&lt;br /&gt;
        }&lt;br /&gt;
    }&lt;br /&gt;
    ! eth1.109 -&amp;gt; kai eth1.109&lt;br /&gt;
    real_server 10.3.53.2 53 {&lt;br /&gt;
        weight 1&lt;br /&gt;
        MISC_CHECK {&lt;br /&gt;
            misc_path &amp;quot;/usr/bin/dig -b 10.3.53.1 a resolve.test.roka.net @10.3.53.2 +time=1 +tries=5 +fail &amp;gt; /dev/null&amp;quot;&lt;br /&gt;
            misc_timeout 6&lt;br /&gt;
        }&lt;br /&gt;
    }&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
As you can dig (;-) we are using an A record with a low TTL to test the service for this setup is a recursive DNS cluster. So far dig works fine with 44 real_servers configured on an idle Dual PIII 800.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
on real_server kai we use the following netfilter setup to be able to direct the traffic to different BIND processes on the same machine/mac:&lt;br /&gt;
 #DNAT 194.97.173.124-&amp;gt;10.1.53.2 eth1.105&lt;br /&gt;
 iptables -t nat -A PREROUTING -i eth1.105 -s $net -d 194.97.173.124 -p tcp --dport 53 -j DNAT --to-destination 10.1.53.2:53&lt;br /&gt;
 iptables -t nat -A PREROUTING -i eth1.105 -s $net -d 194.97.173.124 -p udp --dport 53 -j DNAT --to-destination 10.1.53.2:53&lt;br /&gt;
 #DNAT 194.97.173.124-&amp;gt;10.3.53.2 eth1.109&lt;br /&gt;
 iptables -t nat -A PREROUTING -i eth1.109 -s $net -d 194.97.173.124 -p tcp --dport 53 -j DNAT --to-destination 10.3.53.2:53&lt;br /&gt;
 iptables -t nat -A PREROUTING -i eth1.109 -s $net -d 194.97.173.124 -p udp --dport 53 -j DNAT --to-destination 10.3.53.2:53&lt;br /&gt;
&lt;br /&gt;
=== BIND9 ===&lt;br /&gt;
&lt;br /&gt;
When i wrote this example we were using two BIND processes on the same machine for BIND9 currently just runs faster when it is not threading. Here is something JINMEI Tatuya told me on the bind9-workers Mailinglist which turned out to be very true:&lt;br /&gt;
 If you go with disabling threads, you may also want to enable&lt;br /&gt;
 &amp;quot;internal memory allocation&amp;quot;.  (I hear that) it should use memory more&lt;br /&gt;
 efficiently (and can make the server faster) but is disabled by&lt;br /&gt;
 default due to response-performance reasons in the threaded case.  You&lt;br /&gt;
 can enable this feature by adding the following line&lt;br /&gt;
&lt;br /&gt;
 #define ISC_MEM_USE_INTERNAL_MALLOC 1&lt;br /&gt;
&lt;br /&gt;
 just before the following part of bind9/lib/isc/mem.c:&lt;br /&gt;
&lt;br /&gt;
 #ifndef ISC_MEM_USE_INTERNAL_MALLOC&lt;br /&gt;
 #define ISC_MEM_USE_INTERNAL_MALLOC 0&lt;br /&gt;
 #endif&lt;br /&gt;
Try it and you will keep it. ;) &lt;br /&gt;
&lt;br /&gt;
The BIND 9.4 line now makes use of this internal malloc library by default, but disabling threading will probably still free you from the hiccups some BIND9 users are experiencing.&lt;br /&gt;
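&lt;br /&gt;
As a minimal sketch of such a non-threaded build (the configure switch is assumed from BIND 9.2/9.3-era sources, so treat it as an illustration rather than exact instructions):&lt;br /&gt;
 # hypothetical build sketch: edit lib/isc/mem.c as quoted above, then&lt;br /&gt;
 # compile BIND9 without thread support&lt;br /&gt;
 cd bind9&lt;br /&gt;
 ./configure --disable-threads&lt;br /&gt;
 make&lt;br /&gt;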
&lt;br /&gt;
=== PowerDNS recursor === &lt;br /&gt;
&lt;br /&gt;
This one is a recursive-only nameserver with very limited authoritative DNS capabilities. The author of this example now uses PowerDNS recursor (v3.1.4) exclusively for his caching-only DNS cluster and is glad that, while giving roughly the same queries-per-second performance, it generates fewer SERVFAIL answers and is generally several times more robust than BIND9.&lt;br /&gt;
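&lt;br /&gt;
In the setup described above, each recursor instance only needs to listen on its own realserver address; a hypothetical minimal recursor.conf (option names as in pdns-recursor 3.x, not taken from the original configuration) could look like this:&lt;br /&gt;
 # hypothetical recursor.conf for the instance behind 10.1.53.2&lt;br /&gt;
 local-address=10.1.53.2&lt;br /&gt;
 local-port=53&lt;br /&gt;
 quiet=on&lt;br /&gt;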
&lt;br /&gt;
=== Added redundancy via iBGP ===&lt;br /&gt;
&lt;br /&gt;
If you have more than one load balancer at different locations and you can convince your local network engineer to let you speak BGP4+ to his routers, you can use quagga with something like the following configuration to fail the service IP over to the second LB if the first one goes down:&lt;br /&gt;
 !&lt;br /&gt;
 router bgp 5430&lt;br /&gt;
  no synchronization&lt;br /&gt;
  bgp router-id a.b.c.d&lt;br /&gt;
  redistribute connected route-map benice&lt;br /&gt;
  neighbor c.d.e.f remote-as 5430&lt;br /&gt;
  neighbor c.d.e.f description ffm4-j2&lt;br /&gt;
  neighbor c.d.e.f send-community both&lt;br /&gt;
  neighbor c.d.e.f soft-reconfiguration inbound&lt;br /&gt;
  neighbor c.d.e.f route-map nixda in&lt;br /&gt;
  neighbor c.d.e.f route-map benice out&lt;br /&gt;
  neighbor d.c.f.e remote-as 5430&lt;br /&gt;
  neighbor d.c.f.e description ffm4-j&lt;br /&gt;
  neighbor d.c.f.e send-community both&lt;br /&gt;
  neighbor d.c.f.e soft-reconfiguration inbound&lt;br /&gt;
  neighbor d.c.f.e route-map nixda in&lt;br /&gt;
  neighbor d.c.f.e route-map benice out&lt;br /&gt;
  no auto-summary&lt;br /&gt;
 !&lt;br /&gt;
 access-list line permit 127.0.0.1/32 exact-match&lt;br /&gt;
 access-list line deny any&lt;br /&gt;
 !&lt;br /&gt;
 ip prefix-list cns-dus2 description dus2 high-metric eq low-preference&lt;br /&gt;
 ip prefix-list cns-dus2 seq 5 permit 194.97.173.125/32&lt;br /&gt;
 ip prefix-list cns-dus2 seq 10 deny any&lt;br /&gt;
 ip prefix-list cns-ffm4 description ffm4 low-metric eq high-preference&lt;br /&gt;
 ip prefix-list cns-ffm4 seq 5 permit 194.97.173.124/32&lt;br /&gt;
 ip prefix-list cns-ffm4 seq 10 deny any&lt;br /&gt;
 !&lt;br /&gt;
 route-map benice permit 10&lt;br /&gt;
  match ip address prefix-list cns-ffm4&lt;br /&gt;
  set local-preference 100&lt;br /&gt;
  set metric 0&lt;br /&gt;
 !&lt;br /&gt;
 route-map benice permit 20&lt;br /&gt;
  match ip address prefix-list cns-dus2&lt;br /&gt;
  set local-preference 100&lt;br /&gt;
  set metric 1&lt;br /&gt;
 !&lt;br /&gt;
 route-map nixda deny 10&lt;br /&gt;
 !&lt;br /&gt;
This is the LB at FFM4. Note that the metrics at the DUS2 LB are just the other way around (see the sketch below).&lt;br /&gt;
Here we fancy talking to two core routers from each LB for extra redundancy.&lt;br /&gt;
You can also have an internal anycast service IP if you use the same metric at both LBs and make sure they are attached at the same level of the router network topology. This way traffic gets shared between the two load balancers according to your network topology, which is of course most interesting for large dial-in ISPs.&lt;br /&gt;
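&lt;br /&gt;
As an illustration of the swapped metrics (not taken from the original configuration), the corresponding route-map entries at the DUS2 LB might look roughly like this:&lt;br /&gt;
 ! hypothetical sketch of the DUS2 route-maps, with the metrics reversed&lt;br /&gt;
 ! relative to FFM4 so that 194.97.173.125 is preferred at DUS2&lt;br /&gt;
 route-map benice permit 10&lt;br /&gt;
  match ip address prefix-list cns-dus2&lt;br /&gt;
  set local-preference 100&lt;br /&gt;
  set metric 0&lt;br /&gt;
 !&lt;br /&gt;
 route-map benice permit 20&lt;br /&gt;
  match ip address prefix-list cns-ffm4&lt;br /&gt;
  set local-preference 100&lt;br /&gt;
  set metric 1&lt;br /&gt;
 !&lt;br /&gt;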
&lt;br /&gt;
=== Problem ===&lt;br /&gt;
&lt;br /&gt;
dig does not return a non-zero exit code when it receives a SERVFAIL, but there are situations in which some BIND9 versions return SERVFAIL for every query, for example when they are out of memory. For a recursive DNS cluster we want to take such BIND processes out of service.&lt;br /&gt;
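&lt;br /&gt;
To illustrate the problem (the session below is hypothetical, the exit code behaviour is the point): even when the realserver only answers SERVFAIL, the plain dig call used in misc_path exits 0, so keepalived keeps the realserver in service.&lt;br /&gt;
 # hypothetical session on the loadbalancer against a broken realserver:&lt;br /&gt;
 # the answer is SERVFAIL, yet dig still exits with status 0&lt;br /&gt;
 $ /usr/bin/dig -b 10.1.53.1 a resolve.test.roka.net @10.1.53.2 +time=1 +tries=5 +fail &amp;gt; /dev/null&lt;br /&gt;
 $ echo $?&lt;br /&gt;
 0&lt;br /&gt;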
&lt;br /&gt;
==== Workaround ====&lt;br /&gt;
&lt;br /&gt;
Use the following Perl script as a wrapper for dig. It is quite ugly: Perl is an interpreted language and forking it is not much fun, so this consumes a lot of user CPU when executed every 6 seconds.&lt;br /&gt;
 #!/usr/bin/perl&lt;br /&gt;
 use strict;&lt;br /&gt;
 use warnings;&lt;br /&gt;
 # cmdline arguments: &amp;lt;FromIP&amp;gt; &amp;lt;Class&amp;gt; &amp;lt;QTYPE&amp;gt; &amp;lt;QNAME&amp;gt; &amp;lt;ToIP&amp;gt; &amp;lt;Times&amp;gt; &amp;lt;Tries&amp;gt; &amp;lt;ErrorMatch&amp;gt; &amp;lt;Transport&amp;gt;&lt;br /&gt;
 # /usr/bin/dig -b 10.5.53.1 IN A 2.0.0.127.my.test @10.5.53.2 +time=1 +tries=5 +fail&lt;br /&gt;
 if(&lt;br /&gt;
        ((defined $ARGV[0])&amp;amp;&amp;amp;($ARGV[0]=~/^\d+\.\d+\.\d+\.\d+$/))&lt;br /&gt;
        &amp;amp;&amp;amp;((defined $ARGV[1])&amp;amp;&amp;amp;($ARGV[1]=~/^(IN|CHAOS)$/))&lt;br /&gt;
        &amp;amp;&amp;amp;((defined $ARGV[2])&amp;amp;&amp;amp;($ARGV[2]=~/^(A|ANY|MX|PTR|SRV|TXT|AAAA|NS|CNAME|SOA)$/))&lt;br /&gt;
        &amp;amp;&amp;amp;((defined $ARGV[3])&amp;amp;&amp;amp;($ARGV[3]=~/^[A-Za-z0-9\-\.]+$/))&lt;br /&gt;
        &amp;amp;&amp;amp;((defined $ARGV[4])&amp;amp;&amp;amp;($ARGV[4]=~/^\d+\.\d+\.\d+\.\d+$/))&lt;br /&gt;
        &amp;amp;&amp;amp;((defined $ARGV[5])&amp;amp;&amp;amp;($ARGV[5]=~/^\d+$/))&lt;br /&gt;
        &amp;amp;&amp;amp;((defined $ARGV[6])&amp;amp;&amp;amp;($ARGV[6]=~/^\d+$/))&lt;br /&gt;
        &amp;amp;&amp;amp;((defined $ARGV[7])&amp;amp;&amp;amp;($ARGV[7]=~/^\S+$/))&lt;br /&gt;
        ) {&lt;br /&gt;
        my $transport=&amp;quot;notcp&amp;quot;;&lt;br /&gt;
        if((defined $ARGV[8])&amp;amp;&amp;amp;($ARGV[8]=~/^tcp$/i)) {&lt;br /&gt;
                $transport=&amp;quot;tcp&amp;quot;;&lt;br /&gt;
        } elsif ((defined $ARGV[8])&amp;amp;&amp;amp;($ARGV[8]=~/^udp$/i)) {&lt;br /&gt;
                $transport=&amp;quot;notcp&amp;quot;;&lt;br /&gt;
        }&lt;br /&gt;
        my (@res)=`/usr/bin/dig -b $ARGV[0] $ARGV[1] $ARGV[2] $ARGV[3] \@$ARGV[4] +time=$ARGV[5] +tries=$ARGV[6] +fail +$transport 2&amp;gt;&amp;amp;1`;&lt;br /&gt;
        my $return=$?;&lt;br /&gt;
        if(my $error=(map {/status:\s*($ARGV[7])/ ? $1 : ()} @res)[0]) {&lt;br /&gt;
                die(&amp;quot;$error&amp;quot;);&lt;br /&gt;
        } elsif ($return!=0) {&lt;br /&gt;
                die(&amp;quot;dig returned: \&amp;quot;$return\&amp;quot;&amp;quot;);&lt;br /&gt;
        } elsif ($return==0) {&lt;br /&gt;
                exit 0;&lt;br /&gt;
        } else {&lt;br /&gt;
                die(&amp;quot;error: \&amp;quot;$return\&amp;quot; HAS BAD VALUE!&amp;quot;);&lt;br /&gt;
        }&lt;br /&gt;
 } else {&lt;br /&gt;
        die(&amp;quot;dig-wrapper.pl &amp;lt;FromIP&amp;gt; &amp;lt;Class&amp;gt; &amp;lt;QTYPE&amp;gt; &amp;lt;QNAME&amp;gt; &amp;lt;ToIP&amp;gt; &amp;lt;Times&amp;gt; &amp;lt;Tries&amp;gt; &amp;lt;ErrorMatch&amp;gt; &amp;lt;Transport&amp;gt;&amp;quot;);&lt;br /&gt;
 }&lt;br /&gt;
Ah yes, forgot to say: the Dual PIII 800 is not idling around anymore - it is busy running this script 44 times every 6 seconds, which accounts for roughly 12% user CPU and 5% system at a query rate of ~3600 q/s.&lt;br /&gt;
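&lt;br /&gt;
For completeness, a hypothetical MISC_CHECK entry calling the wrapper instead of plain dig could look like this (the install path and the SERVFAIL error match are assumptions, not part of the original configuration):&lt;br /&gt;
 ! hypothetical check for realserver 10.1.53.2, assuming the wrapper was&lt;br /&gt;
 ! installed as /usr/local/bin/dig-wrapper.pl&lt;br /&gt;
 MISC_CHECK {&lt;br /&gt;
     misc_path &amp;quot;/usr/local/bin/dig-wrapper.pl 10.1.53.1 IN A resolve.test.roka.net 10.1.53.2 1 5 SERVFAIL udp&amp;quot;&lt;br /&gt;
     misc_timeout 6&lt;br /&gt;
 }&lt;br /&gt;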
&lt;br /&gt;
==== Solution ====&lt;br /&gt;
&lt;br /&gt;
Use a patched version of dig?&lt;br /&gt;
&lt;br /&gt;
== Conclusion ==&lt;br /&gt;
&lt;br /&gt;
It still just works.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{{lvs-example-stub}}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[Category:LVS Examples|DNS]]&lt;/div&gt;</summary>
		<author><name>ZaphodB</name></author>	</entry>

	<entry>
		<id>http://kb.linux-vs.org/wiki?title=Building_Scalable_DNS_Cluster_using_LVS&amp;diff=579</id>
		<title>Building Scalable DNS Cluster using LVS</title>
		<link rel="alternate" type="text/html" href="http://kb.linux-vs.org/wiki?title=Building_Scalable_DNS_Cluster_using_LVS&amp;diff=579"/>
				<updated>2006-03-04T18:12:24Z</updated>
		
		<summary type="html">&lt;p&gt;ZaphodB: /* Configuration Example */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Introduction ==&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Architecture ==&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Configuration Example ==&lt;br /&gt;
&lt;br /&gt;
keepalived.conf:&lt;br /&gt;
 ! Balancer-Set for udp/53&lt;br /&gt;
 virtual_server 194.97.173.124 53 {&lt;br /&gt;
    delay_loop 10&lt;br /&gt;
    lb_algo wrr&lt;br /&gt;
    lb_kind DR&lt;br /&gt;
    protocol UDP&lt;br /&gt;
    ! persistence_timeout 1&lt;br /&gt;
    ! persistence_granularity 255.255.255.255&lt;br /&gt;
    ! eth1.105 -&amp;gt; kai eth1.105&lt;br /&gt;
    real_server 10.1.53.2 53 {&lt;br /&gt;
        weight 1&lt;br /&gt;
        MISC_CHECK {&lt;br /&gt;
            misc_path &amp;quot;/usr/bin/dig -b 10.1.53.1 a resolve.test.roka.net @10.1.53.2 +time=1 +tries=5 +fail &amp;gt; /dev/null&amp;quot;&lt;br /&gt;
            misc_timeout 6&lt;br /&gt;
        }&lt;br /&gt;
    }&lt;br /&gt;
    ! eth1.109 -&amp;gt; kai eth1.109&lt;br /&gt;
    real_server 10.3.53.2 53 {&lt;br /&gt;
        weight 1&lt;br /&gt;
        MISC_CHECK {&lt;br /&gt;
            misc_path &amp;quot;/usr/bin/dig -b 10.3.53.1 a resolve.test.roka.net @10.3.53.2 +time=1 +tries=5 +fail &amp;gt; /dev/null&amp;quot;&lt;br /&gt;
            misc_timeout 6&lt;br /&gt;
        }&lt;br /&gt;
    }&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
As you can dig (;-) we are using an A record with a low TTL to test the service for this setup is a recursive DNS cluster. So far dig works fine with 44 real_servers configured on an idle Dual PIII 800.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
on real_server kai we use the following netfilter setup to be able to direct the traffic to different BIND processes on the same machine/mac:&lt;br /&gt;
 #DNAT 194.97.173.124-&amp;gt;10.1.53.2 eth1.105&lt;br /&gt;
 iptables -t nat -A PREROUTING -i eth1.105 -s $net -d 194.97.173.124 -p tcp --dport 53 -j DNAT --to-destination 10.1.53.2:53&lt;br /&gt;
 iptables -t nat -A PREROUTING -i eth1.105 -s $net -d 194.97.173.124 -p udp --dport 53 -j DNAT --to-destination 10.1.53.2:53&lt;br /&gt;
 #DNAT 194.97.173.124-&amp;gt;10.3.53.2 eth1.109&lt;br /&gt;
 iptables -t nat -A PREROUTING -i eth1.109 -s $net -d 194.97.173.124 -p tcp --dport 53 -j DNAT --to-destination 10.3.53.2:53&lt;br /&gt;
 iptables -t nat -A PREROUTING -i eth1.109 -s $net -d 194.97.173.124 -p udp --dport 53 -j DNAT --to-destination 10.3.53.2:53&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
We are using two BIND processes on the same machine because BIND9 currently just runs faster when it is not threading. Here is something Jinmei Tatuya told me on the bind9-workers mailing list which turned out to be very true:&lt;br /&gt;
 If you go with disabling threads, you may also want to enable&lt;br /&gt;
 &amp;quot;internal memory allocation&amp;quot;.  (I hear that) it should use memory more&lt;br /&gt;
 efficiently (and can make the server faster) but is disabled by&lt;br /&gt;
 default due to response-performance reasons in the threaded case.  You&lt;br /&gt;
 can enable this feature by adding the following line&lt;br /&gt;
&lt;br /&gt;
 #define ISC_MEM_USE_INTERNAL_MALLOC 1&lt;br /&gt;
&lt;br /&gt;
 just before the following part of bind9/lib/isc/mem.c:&lt;br /&gt;
&lt;br /&gt;
 #ifndef ISC_MEM_USE_INTERNAL_MALLOC&lt;br /&gt;
 #define ISC_MEM_USE_INTERNAL_MALLOC 0&lt;br /&gt;
 #endif&lt;br /&gt;
Try it and you will keep it. ;) Btw. IIRC the upcoming BIND 9.4 line makes use of this new internal malloc library by default.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
If you have more than one load balancer at different locations and you can convince your local network engineer to let you speak BGP4+ to his routers, you can use quagga with something like the following configuration to fail over the service IP to the second LB if the first one goes down:&lt;br /&gt;
 !&lt;br /&gt;
 router bgp 5430&lt;br /&gt;
  no synchronization&lt;br /&gt;
  bgp router-id a.b.c.d&lt;br /&gt;
  redistribute connected route-map benice&lt;br /&gt;
  neighbor c.d.e.f remote-as 5430&lt;br /&gt;
  neighbor c.d.e.f description ffm4-j2&lt;br /&gt;
  neighbor c.d.e.f send-community both&lt;br /&gt;
  neighbor c.d.e.f soft-reconfiguration inbound&lt;br /&gt;
  neighbor c.d.e.f route-map nixda in&lt;br /&gt;
  neighbor c.d.e.f route-map benice out&lt;br /&gt;
  neighbor d.c.f.e remote-as 5430&lt;br /&gt;
  neighbor d.c.f.e description ffm4-j&lt;br /&gt;
  neighbor d.c.f.e send-community both&lt;br /&gt;
  neighbor d.c.f.e soft-reconfiguration inbound&lt;br /&gt;
  neighbor d.c.f.e route-map nixda in&lt;br /&gt;
  neighbor d.c.f.e route-map benice out&lt;br /&gt;
  no auto-summary&lt;br /&gt;
 !&lt;br /&gt;
 access-list line permit 127.0.0.1/32 exact-match&lt;br /&gt;
 access-list line deny any&lt;br /&gt;
 !&lt;br /&gt;
 ip prefix-list cns-dus2 description dus2 high-metric eq low-preference&lt;br /&gt;
 ip prefix-list cns-dus2 seq 5 permit 194.97.173.125/32&lt;br /&gt;
 ip prefix-list cns-dus2 seq 10 deny any&lt;br /&gt;
 ip prefix-list cns-ffm4 description ffm4 low-metric eq high-preference&lt;br /&gt;
 ip prefix-list cns-ffm4 seq 5 permit 194.97.173.124/32&lt;br /&gt;
 ip prefix-list cns-ffm4 seq 10 deny any&lt;br /&gt;
 !&lt;br /&gt;
 route-map benice permit 10&lt;br /&gt;
  match ip address prefix-list cns-ffm4&lt;br /&gt;
  set local-preference 100&lt;br /&gt;
  set metric 0&lt;br /&gt;
 !&lt;br /&gt;
 route-map benice permit 20&lt;br /&gt;
  match ip address prefix-list cns-dus2&lt;br /&gt;
  set local-preference 100&lt;br /&gt;
  set metric 1&lt;br /&gt;
 !&lt;br /&gt;
 route-map nixda deny 10&lt;br /&gt;
 !&lt;br /&gt;
This is the LB at FFM4. Note that the metric at the DUS2 LB is just the other way around.&lt;br /&gt;
Here we choose to talk to two core routers from each LB for extra redundancy.&lt;br /&gt;
You can also have an internal anycast ServiceIP if you use the same metric at both LBs and make sure they are attached to the same level of the router hierarchy, network-topology-wise. This way traffic gets shared between the two load balancers according to your network topology, which is of course most interesting for large dial-in ISPs.&lt;br /&gt;
&lt;br /&gt;
=== Problem ===&lt;br /&gt;
&lt;br /&gt;
dig does not return a non-zero exit code when it receives a SERVFAIL, but there are situations in which some BIND9 versions return SERVFAIL for every query, for example when they are out of memory. In a recursive DNS cluster we would want to take such BIND processes out of service.&lt;br /&gt;
&lt;br /&gt;
==== Workaround ====&lt;br /&gt;
&lt;br /&gt;
Use the following Perl script as a wrapper for dig. It is quite ugly: Perl is an interpreted language and forking it is not much fun, so the check consumes a lot of user CPU when executed every 6 seconds.&lt;br /&gt;
 #!/usr/bin/perl&lt;br /&gt;
 use strict;&lt;br /&gt;
 use warnings;&lt;br /&gt;
 # cmdline arguments: &amp;lt;FromIP&amp;gt; &amp;lt;Class&amp;gt; &amp;lt;QTYPE&amp;gt; &amp;lt;QNAME&amp;gt; &amp;lt;ToIP&amp;gt; &amp;lt;Times&amp;gt; &amp;lt;Tries&amp;gt; &amp;lt;ErrorMatch&amp;gt; &amp;lt;Transport&amp;gt;&lt;br /&gt;
 # /usr/bin/dig -b 10.5.53.1 IN A 2.0.0.127.my.test @10.5.53.2 +time=1 +tries=5 +fail&lt;br /&gt;
 if(&lt;br /&gt;
        ((defined $ARGV[0])&amp;amp;&amp;amp;($ARGV[0]=~/^\d+\.\d+\.\d+\.\d+$/))&lt;br /&gt;
        &amp;amp;&amp;amp;((defined $ARGV[1])&amp;amp;&amp;amp;($ARGV[1]=~/^(IN|CHAOS)$/))&lt;br /&gt;
        &amp;amp;&amp;amp;((defined $ARGV[2])&amp;amp;&amp;amp;($ARGV[2]=~/^(A|ANY|MX|PTR|SRV|TXT|AAAA|NS|CNAME|SOA)$/))&lt;br /&gt;
        &amp;amp;&amp;amp;((defined $ARGV[3])&amp;amp;&amp;amp;($ARGV[3]=~/^[A-Za-z0-9\-\.]+$/))&lt;br /&gt;
        &amp;amp;&amp;amp;((defined $ARGV[4])&amp;amp;&amp;amp;($ARGV[4]=~/^\d+\.\d+\.\d+\.\d+$/))&lt;br /&gt;
        &amp;amp;&amp;amp;((defined $ARGV[5])&amp;amp;&amp;amp;($ARGV[5]=~/^\d+$/))&lt;br /&gt;
        &amp;amp;&amp;amp;((defined $ARGV[6])&amp;amp;&amp;amp;($ARGV[6]=~/^\d+$/))&lt;br /&gt;
        &amp;amp;&amp;amp;((defined $ARGV[7])&amp;amp;&amp;amp;($ARGV[7]=~/^\S+$/))&lt;br /&gt;
        ) {&lt;br /&gt;
        my $transport=&amp;quot;notcp&amp;quot;;&lt;br /&gt;
        if((defined $ARGV[8])&amp;amp;&amp;amp;($ARGV[8]=~/^tcp$/i)) {&lt;br /&gt;
                $transport=&amp;quot;tcp&amp;quot;;&lt;br /&gt;
        } elsif ((defined $ARGV[8])&amp;amp;&amp;amp;($ARGV[8]=~/^udp$/i)) {&lt;br /&gt;
                $transport=&amp;quot;notcp&amp;quot;;&lt;br /&gt;
        }&lt;br /&gt;
        my (@res)=`/usr/bin/dig -b $ARGV[0] $ARGV[1] $ARGV[2] $ARGV[3] \@$ARGV[4] +time=$ARGV[5] +tries=$ARGV[6] +fail +$transport 2&amp;gt;&amp;amp;1`;&lt;br /&gt;
        my $return=$?;&lt;br /&gt;
        if(my $error=(map {/status:\s*($ARGV[7])/ ? $1 : ()} @res)[0]) {&lt;br /&gt;
                die(&amp;quot;$error&amp;quot;);&lt;br /&gt;
        } elsif ($return!=0) {&lt;br /&gt;
                die(&amp;quot;dig returned: \&amp;quot;$return\&amp;quot;&amp;quot;);&lt;br /&gt;
        } elsif ($return==0) {&lt;br /&gt;
                exit 0;&lt;br /&gt;
        } else {&lt;br /&gt;
                die(&amp;quot;error: \&amp;quot;$return\&amp;quot; HAS BAD VALUE!&amp;quot;);&lt;br /&gt;
        }&lt;br /&gt;
 } else {&lt;br /&gt;
        die(&amp;quot;dig-wrapper.pl &amp;lt;FromIP&amp;gt; &amp;lt;Class&amp;gt; &amp;lt;QTYPE&amp;gt; &amp;lt;QNAME&amp;gt; &amp;lt;ToIP&amp;gt; &amp;lt;Times&amp;gt; &amp;lt;Tries&amp;gt; &amp;lt;ErrorMatch&amp;gt; &amp;lt;Transport&amp;gt;&amp;quot;);&lt;br /&gt;
 }&lt;br /&gt;
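&lt;br /&gt;
For illustration only, a MISC_CHECK stanza wired to this wrapper might look roughly like the sketch below; the install path /usr/local/bin/dig-wrapper.pl and the SERVFAIL error match are assumptions, so adjust them to your setup. keepalived then takes the realserver out of service whenever the wrapper exits non-zero.&lt;br /&gt;
 MISC_CHECK {&lt;br /&gt;
     ! hypothetical path - args: FromIP Class QTYPE QNAME ToIP Times Tries ErrorMatch Transport&lt;br /&gt;
     misc_path &amp;quot;/usr/local/bin/dig-wrapper.pl 10.1.53.1 IN A resolve.test.roka.net 10.1.53.2 1 5 SERVFAIL udp&amp;quot;&lt;br /&gt;
     misc_timeout 6&lt;br /&gt;
 }&lt;br /&gt;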
Ah yes, forgot to say: the Dual PIII 800 is not idling around anymore - it's busy running this script 44 times every 6 seconds, which accounts for roughly 12% user CPU and 5% system CPU at a query rate of ~3600 q/s.&lt;br /&gt;
&lt;br /&gt;
==== Solution ====&lt;br /&gt;
&lt;br /&gt;
Use a patched version of dig that returns a non-zero exit code on SERVFAIL?&lt;br /&gt;
&lt;br /&gt;
== Conclusion ==&lt;br /&gt;
&lt;br /&gt;
It still just works.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{{lvs-example-stub}}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[Category:LVS Examples|DNS]]&lt;/div&gt;</summary>
		<author><name>ZaphodB</name></author>	</entry>

	<entry>
		<id>http://kb.linux-vs.org/wiki?title=FAQ&amp;diff=630</id>
		<title>FAQ</title>
		<link rel="alternate" type="text/html" href="http://kb.linux-vs.org/wiki?title=FAQ&amp;diff=630"/>
				<updated>2006-03-01T17:00:30Z</updated>
		
		<summary type="html">&lt;p&gt;ZaphodB: /* How do i get counters from ipvsadm in order to create graphs from? */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== General ==&lt;br /&gt;
=== What's LVS? ===&lt;br /&gt;
&lt;br /&gt;
LVS stands for Linux Virtual Server, which is a highly scalable and highly available server built on a cluster of real servers, with the [[load balancer]] running on the Linux operating system. Users interact as if it were a single virtual server.&lt;br /&gt;
&lt;br /&gt;
=== Is LVS software free? ===&lt;br /&gt;
&lt;br /&gt;
Yes! All LVS software is released under the [http://www.gnu.org/copyleft/gpl.html GNU General Public License (GPL)].&lt;br /&gt;
&lt;br /&gt;
=== Is there a FreeBSD port of LVS software? ===&lt;br /&gt;
&lt;br /&gt;
Yes, there is a FreeBSD port of IPVS, which supports the [[LVS/DR]] and [[LVS/TUN]] methods now. See [http://dragon.linux-vs.org/~dragonfly/htm/lvs_freebsd.htm the LVS On FreeBSD page] for more information.&lt;br /&gt;
&lt;br /&gt;
=== Does LVS cluster support Linux servers only? ===&lt;br /&gt;
&lt;br /&gt;
No, real servers in an LVS cluster can run almost any operating system, such as Linux, the BSDs, Solaris, and Windows. [[LVS/NAT]] can balance servers running any operating system with TCP/IP support, [[LVS/TUN]] requires servers that support IP tunneling, and [[LVS/DR]] requires servers that have a non-ARP device. Almost all modern operating systems support non-ARP devices.&lt;br /&gt;
&lt;br /&gt;
== Performance ==&lt;br /&gt;
&lt;br /&gt;
=== How is the concurrent processing performance of current LVS software? ===&lt;br /&gt;
&lt;br /&gt;
The ultimate performance of LVS depends on hardware that LVS runs on. An ordinary box with a single Pentium III processor and 100Mbps NIC card running [[LVS/DR]] can handle about 10,000 connections per second for web service. We have heard that a powerful box with good hardware and kernel tuning achieved 50,000 connections per second.&lt;br /&gt;
&lt;br /&gt;
=== Can LVS handle more than 1 million simultaneous connections? ===&lt;br /&gt;
&lt;br /&gt;
Yes, LVS can handle much more than 1 million simultaneous connections. One connection just costs 128 bytes in the LVS box, so an LVS box with 1G memory can handle more than 8 million simultaneous connections.&lt;br /&gt;
&lt;br /&gt;
== Setup ==&lt;br /&gt;
&lt;br /&gt;
=== How do I check to see if my kernel has IPVS enabled? ===&lt;br /&gt;
&lt;br /&gt;
Run &amp;quot;modprobe ip_vs&amp;quot; and check whether /proc/net/ip_vs exists. If it does, your kernel has [[IPVS]] enabled. You can also run &amp;quot;cat /proc/net/ip_vs&amp;quot; or &amp;quot;ipvsadm -Ln&amp;quot; to see the version number of [[IPVS]].&lt;br /&gt;
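&lt;br /&gt;
For example, the whole check boils down to:&lt;br /&gt;
 modprobe ip_vs&lt;br /&gt;
 cat /proc/net/ip_vs&lt;br /&gt;
 ipvsadm -Ln&lt;br /&gt;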
&lt;br /&gt;
== Statistics ==&lt;br /&gt;
&lt;br /&gt;
=== How do I get counters from ipvsadm in order to create graphs? ===&lt;br /&gt;
&lt;br /&gt;
The current kernel 2.6 version of ipvsadm (v1.24) supports&lt;br /&gt;
 ipvsadm --list --stats --numeric --exact&lt;br /&gt;
which gives you exact (non-abbreviated) counters for Connections, Packets and Bytes for each Service Address and Realserver.&lt;/div&gt;</summary>
		<author><name>ZaphodB</name></author>	</entry>

	<entry>
		<id>http://kb.linux-vs.org/wiki?title=FAQ&amp;diff=572</id>
		<title>FAQ</title>
		<link rel="alternate" type="text/html" href="http://kb.linux-vs.org/wiki?title=FAQ&amp;diff=572"/>
				<updated>2006-03-01T16:28:05Z</updated>
		
		<summary type="html">&lt;p&gt;ZaphodB: typo&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== General ==&lt;br /&gt;
=== What's LVS? ===&lt;br /&gt;
&lt;br /&gt;
LVS stands for Linux Virtual Server, which is a highly scalable and highly available server built on a cluster of real servers, with the [[load balancer]] running on the Linux operating system. Users interact as if it were a single virtual server.&lt;br /&gt;
&lt;br /&gt;
=== Is LVS software free? ===&lt;br /&gt;
&lt;br /&gt;
Yes! All LVS software is released under the [http://www.gnu.org/copyleft/gpl.html GNU General Public License (GPL)].&lt;br /&gt;
&lt;br /&gt;
=== Is there a FreeBSD port of LVS software? ===&lt;br /&gt;
&lt;br /&gt;
Yes, there is a FreeBSD port of IPVS, which supports the [[LVS/DR]] and [[LVS/TUN]] methods now. See [http://dragon.linux-vs.org/~dragonfly/htm/lvs_freebsd.htm the LVS On FreeBSD page] for more information.&lt;br /&gt;
&lt;br /&gt;
=== Does LVS cluster support Linux servers only? ===&lt;br /&gt;
&lt;br /&gt;
No, real servers in an LVS cluster can run almost any operating system, such as Linux, the BSDs, Solaris, and Windows. [[LVS/NAT]] can balance servers running any operating system with TCP/IP support, [[LVS/TUN]] requires servers that support IP tunneling, and [[LVS/DR]] requires servers that have a non-ARP device. Almost all modern operating systems support non-ARP devices.&lt;br /&gt;
&lt;br /&gt;
== Performance ==&lt;br /&gt;
&lt;br /&gt;
=== How is the concurrent processing performance of current LVS software? ===&lt;br /&gt;
&lt;br /&gt;
The ultimate performance of LVS depends on hardware that LVS runs on. An ordinary box with a single Pentium III processor and 100Mbps NIC card running [[LVS/DR]] can handle about 10,000 connections per second for web service. We have heard that a powerful box with good hardware and kernel tuning achieved 50,000 connections per second.&lt;br /&gt;
&lt;br /&gt;
=== Can LVS handle more than 1 million simultaneous connections? ===&lt;br /&gt;
&lt;br /&gt;
Yes, LVS can handle much more than 1 million simultaneous connections. One connection just costs 128 bytes in the LVS box, so an LVS box with 1G memory can handle more than 8 million simultaneous connections.&lt;br /&gt;
&lt;br /&gt;
== Setup ==&lt;br /&gt;
&lt;br /&gt;
=== How do I check to see if my kernel has IPVS enabled? ===&lt;br /&gt;
&lt;br /&gt;
Run &amp;quot;modprobe ip_vs&amp;quot; and check whether /proc/net/ip_vs exists. If it does, your kernel has [[IPVS]] enabled. You can also run &amp;quot;cat /proc/net/ip_vs&amp;quot; or &amp;quot;ipvsadm -Ln&amp;quot; to see the version number of [[IPVS]].&lt;br /&gt;
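&lt;br /&gt;
For example, the whole check boils down to:&lt;br /&gt;
 modprobe ip_vs&lt;br /&gt;
 cat /proc/net/ip_vs&lt;br /&gt;
 ipvsadm -Ln&lt;br /&gt;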
&lt;br /&gt;
== Statistics ==&lt;br /&gt;
&lt;br /&gt;
=== How do I get counters from ipvsadm in order to create graphs? ===&lt;br /&gt;
&lt;br /&gt;
 ipvsadm --list --stats --numeric --exact&lt;br /&gt;
gives you exact (non-abbreviated) counters for Connections, Packets and Bytes for each Service Address and Realserver.&lt;/div&gt;</summary>
		<author><name>ZaphodB</name></author>	</entry>

	<entry>
		<id>http://kb.linux-vs.org/wiki?title=FAQ&amp;diff=571</id>
		<title>FAQ</title>
		<link rel="alternate" type="text/html" href="http://kb.linux-vs.org/wiki?title=FAQ&amp;diff=571"/>
				<updated>2006-03-01T16:27:39Z</updated>
		
		<summary type="html">&lt;p&gt;ZaphodB: get counter with ipvsadm --list --stats --numeric --exact&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== General ==&lt;br /&gt;
=== What's LVS? ===&lt;br /&gt;
&lt;br /&gt;
LVS stands for Linux Virtual Server, which is a highly scalable and highly available server built on a cluster of real servers, with the [[load balancer]] running on the Linux operating system. Users interact as if it were a single virtual server.&lt;br /&gt;
&lt;br /&gt;
=== Is LVS software free? ===&lt;br /&gt;
&lt;br /&gt;
Yes! All LVS software is released under the [http://www.gnu.org/copyleft/gpl.html GNU General Public License (GPL)].&lt;br /&gt;
&lt;br /&gt;
=== Is there a FreeBSD port of LVS software? ===&lt;br /&gt;
&lt;br /&gt;
Yes, there is a FreeBSD port of IPVS, which supports the [[LVS/DR]] and [[LVS/TUN]] methods now. See [http://dragon.linux-vs.org/~dragonfly/htm/lvs_freebsd.htm the LVS On FreeBSD page] for more information.&lt;br /&gt;
&lt;br /&gt;
=== Does LVS cluster support Linux servers only? ===&lt;br /&gt;
&lt;br /&gt;
No, real servers in an LVS cluster can run almost any operating system, such as Linux, the BSDs, Solaris, and Windows. [[LVS/NAT]] can balance servers running any operating system with TCP/IP support, [[LVS/TUN]] requires servers that support IP tunneling, and [[LVS/DR]] requires servers that have a non-ARP device. Almost all modern operating systems support non-ARP devices.&lt;br /&gt;
&lt;br /&gt;
== Performance ==&lt;br /&gt;
&lt;br /&gt;
=== How is the concurrent processing performance of current LVS software? ===&lt;br /&gt;
&lt;br /&gt;
The ultimate performance of LVS depends on hardware that LVS runs on. An ordinary box with a single Pentium III processor and 100Mbps NIC card running [[LVS/DR]] can handle about 10,000 connections per second for web service. We have heard that a powerful box with good hardware and kernel tuning achieved 50,000 connections per second.&lt;br /&gt;
&lt;br /&gt;
=== Can LVS handle more than 1 million simultaneous connections? ===&lt;br /&gt;
&lt;br /&gt;
Yes, LVS can handle much more than 1 million simultaneous connections. One connection just costs 128 bytes in the LVS box, so an LVS box with 1G memory can handle more than 8 million simultaneous connections.&lt;br /&gt;
&lt;br /&gt;
== Setup ==&lt;br /&gt;
&lt;br /&gt;
=== How do I check to see if my kernel has IPVS enabled? ===&lt;br /&gt;
&lt;br /&gt;
Run &amp;quot;modprobe ip_vs&amp;quot; and check whether /proc/net/ip_vs exists. If it does, your kernel has [[IPVS]] enabled. You can also run &amp;quot;cat /proc/net/ip_vs&amp;quot; or &amp;quot;ipvsadm -Ln&amp;quot; to see the version number of [[IPVS]].&lt;br /&gt;
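&lt;br /&gt;
For example, the whole check boils down to:&lt;br /&gt;
 modprobe ip_vs&lt;br /&gt;
 cat /proc/net/ip_vs&lt;br /&gt;
 ipvsadm -Ln&lt;br /&gt;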
&lt;br /&gt;
== Statistics ==&lt;br /&gt;
&lt;br /&gt;
=== How do I get counters from ipvsadm in order to create graphs? ===&lt;br /&gt;
&lt;br /&gt;
 ipvsadm --list --stats --numeric --exact&lt;br /&gt;
gives you exact (non-abbreviated) counters for Connections, Packets and Bytes for each Service Address and Realserver.&lt;/div&gt;</summary>
		<author><name>ZaphodB</name></author>	</entry>

	<entry>
		<id>http://kb.linux-vs.org/wiki?title=Building_Scalable_DNS_Cluster_using_LVS&amp;diff=575</id>
		<title>Building Scalable DNS Cluster using LVS</title>
		<link rel="alternate" type="text/html" href="http://kb.linux-vs.org/wiki?title=Building_Scalable_DNS_Cluster_using_LVS&amp;diff=575"/>
				<updated>2006-03-01T16:16:42Z</updated>
		
		<summary type="html">&lt;p&gt;ZaphodB: /* Workaround */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Introduction ==&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Architecture ==&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Configuration Example ==&lt;br /&gt;
&lt;br /&gt;
keepalived.conf:&lt;br /&gt;
 ! Balancer-Set for udp/53&lt;br /&gt;
 virtual_server 194.97.173.124 53 {&lt;br /&gt;
    delay_loop 10&lt;br /&gt;
    lb_algo wrr&lt;br /&gt;
    lb_kind DR&lt;br /&gt;
    protocol UDP&lt;br /&gt;
    ! persistence_timeout 1&lt;br /&gt;
    ! persistence_granularity 255.255.255.255&lt;br /&gt;
    ! eth1.105 -&amp;gt; kai eth1.105&lt;br /&gt;
    real_server 10.1.53.2 53 {&lt;br /&gt;
        weight 1&lt;br /&gt;
        MISC_CHECK {&lt;br /&gt;
            misc_path &amp;quot;/usr/bin/dig -b 10.1.53.1 a resolve.test.roka.net @10.1.53.2 +time=1 +tries=5 +fail &amp;gt; /dev/null&amp;quot;&lt;br /&gt;
            misc_timeout 6&lt;br /&gt;
        }&lt;br /&gt;
    }&lt;br /&gt;
    ! eth1.109 -&amp;gt; kai eth1.109&lt;br /&gt;
    real_server 10.3.53.2 53 {&lt;br /&gt;
        weight 1&lt;br /&gt;
        MISC_CHECK {&lt;br /&gt;
            misc_path &amp;quot;/usr/bin/dig -b 10.3.53.1 a resolve.test.roka.net @10.3.53.2 +time=1 +tries=5 +fail &amp;gt; /dev/null&amp;quot;&lt;br /&gt;
            misc_timeout 6&lt;br /&gt;
        }&lt;br /&gt;
    }&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
As you can dig (;-) we are using an A record with a low TTL to test the service, since this setup is a recursive DNS cluster. So far dig works fine with 44 real_servers configured on an idle Dual PIII 800.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
On real_server kai we use the following netfilter setup to direct the traffic to different BIND processes on the same machine/MAC:&lt;br /&gt;
 #DNAT 194.97.173.124-&amp;gt;10.1.53.2 eth1.105&lt;br /&gt;
 iptables -t nat -A PREROUTING -i eth1.105 -s $net -d 194.97.173.124 -p tcp --dport 53 -j DNAT --to-destination 10.1.53.2:53&lt;br /&gt;
 iptables -t nat -A PREROUTING -i eth1.105 -s $net -d 194.97.173.124 -p udp --dport 53 -j DNAT --to-destination 10.1.53.2:53&lt;br /&gt;
 #DNAT 194.97.173.124-&amp;gt;10.3.53.2 eth1.109&lt;br /&gt;
 iptables -t nat -A PREROUTING -i eth1.109 -s $net -d 194.97.173.124 -p tcp --dport 53 -j DNAT --to-destination 10.3.53.2:53&lt;br /&gt;
 iptables -t nat -A PREROUTING -i eth1.109 -s $net -d 194.97.173.124 -p udp --dport 53 -j DNAT --to-destination 10.3.53.2:53&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
We are using two BIND processes on the same machine because BIND9 currently just runs faster when it is not threading. Here is something Jinmei Tatuya told me on the bind9-workers mailing list which turned out to be very true:&lt;br /&gt;
 If you go with disabling threads, you may also want to enable&lt;br /&gt;
 &amp;quot;internal memory allocation&amp;quot;.  (I hear that) it should use memory more&lt;br /&gt;
 efficiently (and can make the server faster) but is disabled by&lt;br /&gt;
 default due to response-performance reasons in the threaded case.  You&lt;br /&gt;
 can enable this feature by adding the following line&lt;br /&gt;
&lt;br /&gt;
 #define ISC_MEM_USE_INTERNAL_MALLOC 1&lt;br /&gt;
&lt;br /&gt;
 just before the following part of bind9/lib/isc/mem.c:&lt;br /&gt;
&lt;br /&gt;
 #ifndef ISC_MEM_USE_INTERNAL_MALLOC&lt;br /&gt;
 #define ISC_MEM_USE_INTERNAL_MALLOC 0&lt;br /&gt;
 #endif&lt;br /&gt;
Try it and you will keep it. ;)&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
If you have more than one load balancer at different locations and you can convince your local network engineer to let you speak BGP4+ to his routers, you can use quagga with something like the following configuration to fail over the service IP to the second LB if the first one goes down:&lt;br /&gt;
 !&lt;br /&gt;
 router bgp 5430&lt;br /&gt;
  no synchronization&lt;br /&gt;
  bgp router-id a.b.c.d&lt;br /&gt;
  redistribute connected route-map benice&lt;br /&gt;
  neighbor c.d.e.f remote-as 5430&lt;br /&gt;
  neighbor c.d.e.f description ffm4-j2&lt;br /&gt;
  neighbor c.d.e.f send-community both&lt;br /&gt;
  neighbor c.d.e.f soft-reconfiguration inbound&lt;br /&gt;
  neighbor c.d.e.f route-map nixda in&lt;br /&gt;
  neighbor c.d.e.f route-map benice out&lt;br /&gt;
  neighbor d.c.f.e remote-as 5430&lt;br /&gt;
  neighbor d.c.f.e description ffm4-j&lt;br /&gt;
  neighbor d.c.f.e send-community both&lt;br /&gt;
  neighbor d.c.f.e soft-reconfiguration inbound&lt;br /&gt;
  neighbor d.c.f.e route-map nixda in&lt;br /&gt;
  neighbor d.c.f.e route-map benice out&lt;br /&gt;
  no auto-summary&lt;br /&gt;
 !&lt;br /&gt;
 access-list line permit 127.0.0.1/32 exact-match&lt;br /&gt;
 access-list line deny any&lt;br /&gt;
 !&lt;br /&gt;
 ip prefix-list cns-dus2 description dus2 high-metric eq low-preference&lt;br /&gt;
 ip prefix-list cns-dus2 seq 5 permit 194.97.173.125/32&lt;br /&gt;
 ip prefix-list cns-dus2 seq 10 deny any&lt;br /&gt;
 ip prefix-list cns-ffm4 description ffm4 low-metric eq high-preference&lt;br /&gt;
 ip prefix-list cns-ffm4 seq 5 permit 194.97.173.124/32&lt;br /&gt;
 ip prefix-list cns-ffm4 seq 10 deny any&lt;br /&gt;
 !&lt;br /&gt;
 route-map benice permit 10&lt;br /&gt;
  match ip address prefix-list cns-ffm4&lt;br /&gt;
  set local-preference 100&lt;br /&gt;
  set metric 0&lt;br /&gt;
 !&lt;br /&gt;
 route-map benice permit 20&lt;br /&gt;
  match ip address prefix-list cns-dus2&lt;br /&gt;
  set local-preference 100&lt;br /&gt;
  set metric 1&lt;br /&gt;
 !&lt;br /&gt;
 route-map nixda deny 10&lt;br /&gt;
 !&lt;br /&gt;
This is the LB at FFM4. Note that the metric at the DUS2 LB is just the other way around.&lt;br /&gt;
Here we choose to talk to two core routers from each LB for extra redundancy.&lt;br /&gt;
You can also have an internal anycast ServiceIP if you use the same metric at both LBs and make sure they are attached to the same level of the router hierarchy, network-topology-wise. This way traffic gets shared between the two load balancers according to your network topology, which is of course most interesting for large dial-in ISPs.&lt;br /&gt;
&lt;br /&gt;
=== Problem ===&lt;br /&gt;
&lt;br /&gt;
dig does not return a non-zero exit code when it receives a SERVFAIL, but there are situations in which some BIND9 versions return SERVFAIL for every query, for example when they are out of memory. In a recursive DNS cluster we would want to take such BIND processes out of service.&lt;br /&gt;
&lt;br /&gt;
==== Workaround ====&lt;br /&gt;
&lt;br /&gt;
Use the following Perl script as a wrapper for dig. It is quite ugly: Perl is an interpreted language and forking it is not much fun, so the check consumes a lot of user CPU when executed every 6 seconds.&lt;br /&gt;
 #!/usr/bin/perl&lt;br /&gt;
 use strict;&lt;br /&gt;
 use warnings;&lt;br /&gt;
 # cmdline arguments: &amp;lt;FromIP&amp;gt; &amp;lt;Class&amp;gt; &amp;lt;QTYPE&amp;gt; &amp;lt;QNAME&amp;gt; &amp;lt;ToIP&amp;gt; &amp;lt;Times&amp;gt; &amp;lt;Tries&amp;gt; &amp;lt;ErrorMatch&amp;gt; &amp;lt;Transport&amp;gt;&lt;br /&gt;
 # /usr/bin/dig -b 10.5.53.1 IN A 2.0.0.127.my.test @10.5.53.2 +time=1 +tries=5 +fail&lt;br /&gt;
 if(&lt;br /&gt;
        ((defined $ARGV[0])&amp;amp;&amp;amp;($ARGV[0]=~/^\d+\.\d+\.\d+\.\d+$/))&lt;br /&gt;
        &amp;amp;&amp;amp;((defined $ARGV[1])&amp;amp;&amp;amp;($ARGV[1]=~/^(IN|CHAOS)$/))&lt;br /&gt;
        &amp;amp;&amp;amp;((defined $ARGV[2])&amp;amp;&amp;amp;($ARGV[2]=~/^(A|ANY|MX|PTR|SRV|TXT|AAAA|NS|CNAME|SOA)$/))&lt;br /&gt;
        &amp;amp;&amp;amp;((defined $ARGV[3])&amp;amp;&amp;amp;($ARGV[3]=~/^[A-Za-z0-9\-\.]+$/))&lt;br /&gt;
        &amp;amp;&amp;amp;((defined $ARGV[4])&amp;amp;&amp;amp;($ARGV[4]=~/^\d+\.\d+\.\d+\.\d+$/))&lt;br /&gt;
        &amp;amp;&amp;amp;((defined $ARGV[5])&amp;amp;&amp;amp;($ARGV[5]=~/^\d+$/))&lt;br /&gt;
        &amp;amp;&amp;amp;((defined $ARGV[6])&amp;amp;&amp;amp;($ARGV[6]=~/^\d+$/))&lt;br /&gt;
        &amp;amp;&amp;amp;((defined $ARGV[7])&amp;amp;&amp;amp;($ARGV[7]=~/^\S+$/))&lt;br /&gt;
        ) {&lt;br /&gt;
        my $transport=&amp;quot;notcp&amp;quot;;&lt;br /&gt;
        if((defined $ARGV[8])&amp;amp;&amp;amp;($ARGV[8]=~/^tcp$/i)) {&lt;br /&gt;
                $transport=&amp;quot;tcp&amp;quot;;&lt;br /&gt;
        } elsif ((defined $ARGV[8])&amp;amp;&amp;amp;($ARGV[8]=~/^udp$/i)) {&lt;br /&gt;
                $transport=&amp;quot;notcp&amp;quot;;&lt;br /&gt;
        }&lt;br /&gt;
        my (@res)=`/usr/bin/dig -b $ARGV[0] $ARGV[1] $ARGV[2] $ARGV[3] \@$ARGV[4] +time=$ARGV[5] +tries=$ARGV[6] +fail +$transport 2&amp;gt;&amp;amp;1`;&lt;br /&gt;
        my $return=$?;&lt;br /&gt;
        if(my $error=(map {/status:\s*($ARGV[7])/ ? $1 : ()} @res)[0]) {&lt;br /&gt;
                die(&amp;quot;$error&amp;quot;);&lt;br /&gt;
        } elsif ($return!=0) {&lt;br /&gt;
                die(&amp;quot;dig returned: \&amp;quot;$return\&amp;quot;&amp;quot;);&lt;br /&gt;
        } elsif ($return==0) {&lt;br /&gt;
                exit 0;&lt;br /&gt;
        } else {&lt;br /&gt;
                die(&amp;quot;error: \&amp;quot;$return\&amp;quot; HAS BAD VALUE!&amp;quot;);&lt;br /&gt;
        }&lt;br /&gt;
 } else {&lt;br /&gt;
        die(&amp;quot;dig-wrapper.pl &amp;lt;FromIP&amp;gt; &amp;lt;Class&amp;gt; &amp;lt;QTYPE&amp;gt; &amp;lt;QNAME&amp;gt; &amp;lt;ToIP&amp;gt; &amp;lt;Times&amp;gt; &amp;lt;Tries&amp;gt; &amp;lt;ErrorMatch&amp;gt; &amp;lt;Transport&amp;gt;&amp;quot;);&lt;br /&gt;
 }&lt;br /&gt;
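&lt;br /&gt;
For illustration only, a MISC_CHECK stanza wired to this wrapper might look roughly like the sketch below; the install path /usr/local/bin/dig-wrapper.pl and the SERVFAIL error match are assumptions, so adjust them to your setup. keepalived then takes the realserver out of service whenever the wrapper exits non-zero.&lt;br /&gt;
 MISC_CHECK {&lt;br /&gt;
     ! hypothetical path - args: FromIP Class QTYPE QNAME ToIP Times Tries ErrorMatch Transport&lt;br /&gt;
     misc_path &amp;quot;/usr/local/bin/dig-wrapper.pl 10.1.53.1 IN A resolve.test.roka.net 10.1.53.2 1 5 SERVFAIL udp&amp;quot;&lt;br /&gt;
     misc_timeout 6&lt;br /&gt;
 }&lt;br /&gt;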
Ah yes, forgot to say: the Dual PIII 800 is not idling around anymore - it's busy running this script 44 times every 6 seconds, which accounts for roughly 12% user CPU and 5% system CPU at a query rate of ~3600 q/s.&lt;br /&gt;
&lt;br /&gt;
==== Solution ====&lt;br /&gt;
&lt;br /&gt;
Use a patched version of dig that returns a non-zero exit code on SERVFAIL?&lt;br /&gt;
&lt;br /&gt;
== Conclusion ==&lt;br /&gt;
&lt;br /&gt;
It still just works.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{{lvs-example-stub}}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[Category:LVS Examples|DNS]]&lt;/div&gt;</summary>
		<author><name>ZaphodB</name></author>	</entry>

	<entry>
		<id>http://kb.linux-vs.org/wiki?title=Building_Scalable_DNS_Cluster_using_LVS&amp;diff=569</id>
		<title>Building Scalable DNS Cluster using LVS</title>
		<link rel="alternate" type="text/html" href="http://kb.linux-vs.org/wiki?title=Building_Scalable_DNS_Cluster_using_LVS&amp;diff=569"/>
				<updated>2006-03-01T16:12:26Z</updated>
		
		<summary type="html">&lt;p&gt;ZaphodB: the SERVFAIL problem&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Introduction ==&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Architecture ==&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Configuration Example ==&lt;br /&gt;
&lt;br /&gt;
keepalived.conf:&lt;br /&gt;
 ! Balancer-Set for udp/53&lt;br /&gt;
 virtual_server 194.97.173.124 53 {&lt;br /&gt;
    delay_loop 10&lt;br /&gt;
    lb_algo wrr&lt;br /&gt;
    lb_kind DR&lt;br /&gt;
    protocol UDP&lt;br /&gt;
    ! persistence_timeout 1&lt;br /&gt;
    ! persistence_granularity 255.255.255.255&lt;br /&gt;
    ! eth1.105 -&amp;gt; kai eth1.105&lt;br /&gt;
    real_server 10.1.53.2 53 {&lt;br /&gt;
        weight 1&lt;br /&gt;
        MISC_CHECK {&lt;br /&gt;
            misc_path &amp;quot;/usr/bin/dig -b 10.1.53.1 a resolve.test.roka.net @10.1.53.2 +time=1 +tries=5 +fail &amp;gt; /dev/null&amp;quot;&lt;br /&gt;
            misc_timeout 6&lt;br /&gt;
        }&lt;br /&gt;
    }&lt;br /&gt;
    ! eth1.109 -&amp;gt; kai eth1.109&lt;br /&gt;
    real_server 10.3.53.2 53 {&lt;br /&gt;
        weight 1&lt;br /&gt;
        MISC_CHECK {&lt;br /&gt;
            misc_path &amp;quot;/usr/bin/dig -b 10.3.53.1 a resolve.test.roka.net @10.3.53.2 +time=1 +tries=5 +fail &amp;gt; /dev/null&amp;quot;&lt;br /&gt;
            misc_timeout 6&lt;br /&gt;
        }&lt;br /&gt;
    }&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
As you can dig (;-) we are using an A record with a low TTL to test the service, since this setup is a recursive DNS cluster. So far dig works fine with 44 real_servers configured on an idle Dual PIII 800.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
On real_server kai we use the following netfilter setup to direct the traffic to different BIND processes on the same machine/MAC:&lt;br /&gt;
 #DNAT 194.97.173.124-&amp;gt;10.1.53.2 eth1.105&lt;br /&gt;
 iptables -t nat -A PREROUTING -i eth1.105 -s $net -d 194.97.173.124 -p tcp --dport 53 -j DNAT --to-destination 10.1.53.2:53&lt;br /&gt;
 iptables -t nat -A PREROUTING -i eth1.105 -s $net -d 194.97.173.124 -p udp --dport 53 -j DNAT --to-destination 10.1.53.2:53&lt;br /&gt;
 #DNAT 194.97.173.124-&amp;gt;10.3.53.2 eth1.109&lt;br /&gt;
 iptables -t nat -A PREROUTING -i eth1.109 -s $net -d 194.97.173.124 -p tcp --dport 53 -j DNAT --to-destination 10.3.53.2:53&lt;br /&gt;
 iptables -t nat -A PREROUTING -i eth1.109 -s $net -d 194.97.173.124 -p udp --dport 53 -j DNAT --to-destination 10.3.53.2:53&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
We are using two BIND processes on the same machine because BIND9 currently just runs faster when it is not threading. Here is something Jinmei Tatuya told me on the bind9-workers mailing list which turned out to be very true:&lt;br /&gt;
 If you go with disabling threads, you may also want to enable&lt;br /&gt;
 &amp;quot;internal memory allocation&amp;quot;.  (I hear that) it should use memory more&lt;br /&gt;
 efficiently (and can make the server faster) but is disabled by&lt;br /&gt;
 default due to response-performance reasons in the threaded case.  You&lt;br /&gt;
 can enable this feature by adding the following line&lt;br /&gt;
&lt;br /&gt;
 #define ISC_MEM_USE_INTERNAL_MALLOC 1&lt;br /&gt;
&lt;br /&gt;
 just before the following part of bind9/lib/isc/mem.c:&lt;br /&gt;
&lt;br /&gt;
 #ifndef ISC_MEM_USE_INTERNAL_MALLOC&lt;br /&gt;
 #define ISC_MEM_USE_INTERNAL_MALLOC 0&lt;br /&gt;
 #endif&lt;br /&gt;
Try it and you will keep it. ;)&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
If you have more than one load balancer at different locations and you can convince your local network engineer to let you speak BGP4+ to his routers, you can use quagga with something like the following configuration to fail over the service IP to the second LB if the first one goes down:&lt;br /&gt;
 !&lt;br /&gt;
 router bgp 5430&lt;br /&gt;
  no synchronization&lt;br /&gt;
  bgp router-id a.b.c.d&lt;br /&gt;
  redistribute connected route-map benice&lt;br /&gt;
  neighbor c.d.e.f remote-as 5430&lt;br /&gt;
  neighbor c.d.e.f description ffm4-j2&lt;br /&gt;
  neighbor c.d.e.f send-community both&lt;br /&gt;
  neighbor c.d.e.f soft-reconfiguration inbound&lt;br /&gt;
  neighbor c.d.e.f route-map nixda in&lt;br /&gt;
  neighbor c.d.e.f route-map benice out&lt;br /&gt;
  neighbor d.c.f.e remote-as 5430&lt;br /&gt;
  neighbor d.c.f.e description ffm4-j&lt;br /&gt;
  neighbor d.c.f.e send-community both&lt;br /&gt;
  neighbor d.c.f.e soft-reconfiguration inbound&lt;br /&gt;
  neighbor d.c.f.e route-map nixda in&lt;br /&gt;
  neighbor d.c.f.e route-map benice out&lt;br /&gt;
  no auto-summary&lt;br /&gt;
 !&lt;br /&gt;
 access-list line permit 127.0.0.1/32 exact-match&lt;br /&gt;
 access-list line deny any&lt;br /&gt;
 !&lt;br /&gt;
 ip prefix-list cns-dus2 description dus2 high-metric eq low-preference&lt;br /&gt;
 ip prefix-list cns-dus2 seq 5 permit 194.97.173.125/32&lt;br /&gt;
 ip prefix-list cns-dus2 seq 10 deny any&lt;br /&gt;
 ip prefix-list cns-ffm4 description ffm4 low-metric eq high-preference&lt;br /&gt;
 ip prefix-list cns-ffm4 seq 5 permit 194.97.173.124/32&lt;br /&gt;
 ip prefix-list cns-ffm4 seq 10 deny any&lt;br /&gt;
 !&lt;br /&gt;
 route-map benice permit 10&lt;br /&gt;
  match ip address prefix-list cns-ffm4&lt;br /&gt;
  set local-preference 100&lt;br /&gt;
  set metric 0&lt;br /&gt;
 !&lt;br /&gt;
 route-map benice permit 20&lt;br /&gt;
  match ip address prefix-list cns-dus2&lt;br /&gt;
  set local-preference 100&lt;br /&gt;
  set metric 1&lt;br /&gt;
 !&lt;br /&gt;
 route-map nixda deny 10&lt;br /&gt;
 !&lt;br /&gt;
This is the LB at FFM4. Note that the metric at the DUS2 LB is just the other way around.&lt;br /&gt;
Here we choose to talk to two core routers from each LB for extra redundancy.&lt;br /&gt;
You can also have an internal anycast ServiceIP if you use the same metric at both LBs and make sure they are attached to the same level of the router hierarchy, network-topology-wise. This way traffic gets shared between the two load balancers according to your network topology, which is of course most interesting for large dial-in ISPs.&lt;br /&gt;
&lt;br /&gt;
=== Problem ===&lt;br /&gt;
&lt;br /&gt;
dig does not return a non-zero exit code when it receives a SERVFAIL, but there are situations in which some BIND9 versions return SERVFAIL for every query, for example when they are out of memory. In a recursive DNS cluster we would want to take such BIND processes out of service.&lt;br /&gt;
&lt;br /&gt;
==== Workaround ====&lt;br /&gt;
&lt;br /&gt;
Use the following Perl script as a wrapper for dig. It is quite ugly: Perl is an interpreted language and forking it is not much fun, so the check consumes a lot of user CPU when executed every 6 seconds.&lt;br /&gt;
 #!/usr/bin/perl&lt;br /&gt;
 use strict;&lt;br /&gt;
 use warnings;&lt;br /&gt;
 # cmdline arguments: &amp;lt;FromIP&amp;gt; &amp;lt;Class&amp;gt; &amp;lt;QTYPE&amp;gt; &amp;lt;QNAME&amp;gt; &amp;lt;ToIP&amp;gt; &amp;lt;Times&amp;gt; &amp;lt;Tries&amp;gt; &amp;lt;ErrorMatch&amp;gt; &amp;lt;Transport&amp;gt;&lt;br /&gt;
 # /usr/bin/dig -b 10.5.53.1 IN A 2.0.0.127.my.test @10.5.53.2 +time=1 +tries=5 +fail&lt;br /&gt;
 if(&lt;br /&gt;
        ((defined $ARGV[0])&amp;amp;&amp;amp;($ARGV[0]=~/^\d+\.\d+\.\d+\.\d+$/))&lt;br /&gt;
        &amp;amp;&amp;amp;((defined $ARGV[1])&amp;amp;&amp;amp;($ARGV[1]=~/^(IN|CHAOS)$/))&lt;br /&gt;
        &amp;amp;&amp;amp;((defined $ARGV[2])&amp;amp;&amp;amp;($ARGV[2]=~/^(A|ANY|MX|PTR|SRV|TXT|AAAA|NS|CNAME|SOA)$/))&lt;br /&gt;
        &amp;amp;&amp;amp;((defined $ARGV[3])&amp;amp;&amp;amp;($ARGV[3]=~/^[A-Za-z0-9\-\.]+$/))&lt;br /&gt;
        &amp;amp;&amp;amp;((defined $ARGV[4])&amp;amp;&amp;amp;($ARGV[4]=~/^\d+\.\d+\.\d+\.\d+$/))&lt;br /&gt;
        &amp;amp;&amp;amp;((defined $ARGV[5])&amp;amp;&amp;amp;($ARGV[5]=~/^\d+$/))&lt;br /&gt;
        &amp;amp;&amp;amp;((defined $ARGV[6])&amp;amp;&amp;amp;($ARGV[6]=~/^\d+$/))&lt;br /&gt;
        &amp;amp;&amp;amp;((defined $ARGV[7])&amp;amp;&amp;amp;($ARGV[7]=~/^\S+$/))&lt;br /&gt;
        ) {&lt;br /&gt;
        my $transport=&amp;quot;notcp&amp;quot;;&lt;br /&gt;
        if((defined $ARGV[8])&amp;amp;&amp;amp;($ARGV[8]=~/^tcp$/i)) {&lt;br /&gt;
                $transport=&amp;quot;tcp&amp;quot;;&lt;br /&gt;
        } elsif ((defined $ARGV[8])&amp;amp;&amp;amp;($ARGV[8]=~/^udp$/i)) {&lt;br /&gt;
                $transport=&amp;quot;notcp&amp;quot;;&lt;br /&gt;
        }&lt;br /&gt;
        my (@res)=`/usr/bin/dig -b $ARGV[0] $ARGV[1] $ARGV[2] $ARGV[3] \@$ARGV[4] +time=$ARGV[5] +tries=$ARGV[6] +fail +$transport 2&amp;gt;&amp;amp;1`;&lt;br /&gt;
        my $return=$?;&lt;br /&gt;
        if(my $error=(map {/status:\s*($ARGV[7])/ ? $1 : ()} @res)[0]) {&lt;br /&gt;
                die(&amp;quot;$error&amp;quot;);&lt;br /&gt;
        } elsif ($return!=0) {&lt;br /&gt;
                die(&amp;quot;dig returned: \&amp;quot;$return\&amp;quot;&amp;quot;);&lt;br /&gt;
        } elsif ($return==0) {&lt;br /&gt;
                exit 0;&lt;br /&gt;
        } else {&lt;br /&gt;
                die(&amp;quot;error: \&amp;quot;$return\&amp;quot; HAS BAD VALUE!&amp;quot;);&lt;br /&gt;
        }&lt;br /&gt;
 } else {&lt;br /&gt;
        die(&amp;quot;dig-wrapper.pl &amp;lt;FromIP&amp;gt; &amp;lt;Class&amp;gt; &amp;lt;QTYPE&amp;gt; &amp;lt;QNAME&amp;gt; &amp;lt;ToIP&amp;gt; &amp;lt;Times&amp;gt; &amp;lt;Tries&amp;gt; &amp;lt;ErrorMatch&amp;gt; &amp;lt;Transport&amp;gt;&amp;quot;);&lt;br /&gt;
 }&lt;br /&gt;
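&lt;br /&gt;
For illustration only, a MISC_CHECK stanza wired to this wrapper might look roughly like the sketch below; the install path /usr/local/bin/dig-wrapper.pl and the SERVFAIL error match are assumptions, so adjust them to your setup. keepalived then takes the realserver out of service whenever the wrapper exits non-zero.&lt;br /&gt;
 MISC_CHECK {&lt;br /&gt;
     ! hypothetical path - args: FromIP Class QTYPE QNAME ToIP Times Tries ErrorMatch Transport&lt;br /&gt;
     misc_path &amp;quot;/usr/local/bin/dig-wrapper.pl 10.1.53.1 IN A resolve.test.roka.net 10.1.53.2 1 5 SERVFAIL udp&amp;quot;&lt;br /&gt;
     misc_timeout 6&lt;br /&gt;
 }&lt;br /&gt;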
&lt;br /&gt;
==== Solution ====&lt;br /&gt;
&lt;br /&gt;
Use a patched version of dig that returns a non-zero exit code on SERVFAIL?&lt;br /&gt;
&lt;br /&gt;
== Conclusion ==&lt;br /&gt;
&lt;br /&gt;
It still just works.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{{lvs-example-stub}}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[Category:LVS Examples|DNS]]&lt;/div&gt;</summary>
		<author><name>ZaphodB</name></author>	</entry>

	<entry>
		<id>http://kb.linux-vs.org/wiki?title=Building_Scalable_DNS_Cluster_using_LVS&amp;diff=517</id>
		<title>Building Scalable DNS Cluster using LVS</title>
		<link rel="alternate" type="text/html" href="http://kb.linux-vs.org/wiki?title=Building_Scalable_DNS_Cluster_using_LVS&amp;diff=517"/>
				<updated>2006-01-21T12:31:58Z</updated>
		
		<summary type="html">&lt;p&gt;ZaphodB: /* Configuration Example */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Introduction ==&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Architecture ==&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Configuration Example ==&lt;br /&gt;
&lt;br /&gt;
keepalived.conf:&lt;br /&gt;
 ! Balancer-Set for udp/53&lt;br /&gt;
 virtual_server 194.97.173.124 53 {&lt;br /&gt;
    delay_loop 10&lt;br /&gt;
    lb_algo wrr&lt;br /&gt;
    lb_kind DR&lt;br /&gt;
    protocol UDP&lt;br /&gt;
    ! persistence_timeout 1&lt;br /&gt;
    ! persistence_granularity 255.255.255.255&lt;br /&gt;
    ! eth1.105 -&amp;gt; kai eth1.105&lt;br /&gt;
    real_server 10.1.53.2 53 {&lt;br /&gt;
        weight 1&lt;br /&gt;
        MISC_CHECK {&lt;br /&gt;
            misc_path &amp;quot;/usr/bin/dig -b 10.1.53.1 a resolve.test.roka.net @10.1.53.2 +time=1 +tries=5 +fail &amp;gt; /dev/null&amp;quot;&lt;br /&gt;
            misc_timeout 6&lt;br /&gt;
        }&lt;br /&gt;
    }&lt;br /&gt;
    ! eth1.109 -&amp;gt; kai eth1.109&lt;br /&gt;
    real_server 10.3.53.2 53 {&lt;br /&gt;
        weight 1&lt;br /&gt;
        MISC_CHECK {&lt;br /&gt;
            misc_path &amp;quot;/usr/bin/dig -b 10.3.53.1 a resolve.test.roka.net @10.3.53.2 +time=1 +tries=5 +fail &amp;gt; /dev/null&amp;quot;&lt;br /&gt;
            misc_timeout 6&lt;br /&gt;
        }&lt;br /&gt;
    }&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
As you can dig (;-) we are using an A record with a low TTL to test the service, since this setup is a recursive DNS cluster. So far dig works fine with 44 real_servers configured on an idle Dual PIII 800.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
On real_server kai we use the following netfilter setup to direct the traffic to different BIND processes on the same machine/MAC:&lt;br /&gt;
 #DNAT 194.97.173.124-&amp;gt;10.1.53.2 eth1.105&lt;br /&gt;
 $ipt -t nat -A PREROUTING -i eth1.105 -s $net -d 194.97.173.124 -p tcp --dport 53 -j DNAT --to-destination 10.1.53.2:53&lt;br /&gt;
 $ipt -t nat -A PREROUTING -i eth1.105 -s $net -d 194.97.173.124 -p udp --dport 53 -j DNAT --to-destination 10.1.53.2:53&lt;br /&gt;
 #DNAT 194.97.173.124-&amp;gt;10.3.53.2 eth1.109&lt;br /&gt;
 $ipt -t nat -A PREROUTING -i eth1.109 -s $net -d 194.97.173.124 -p tcp --dport 53 -j DNAT --to-destination 10.3.53.2:53&lt;br /&gt;
 $ipt -t nat -A PREROUTING -i eth1.109 -s $net -d 194.97.173.124 -p udp --dport 53 -j DNAT --to-destination 10.3.53.2:53&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
We are using two BIND processes on the same machine because BIND9 currently just runs faster when it is not threading. Here is something Jinmei Tatuya told me on the bind9-workers mailing list which turned out to be very true:&lt;br /&gt;
 If you go with disabling threads, you may also want to enable&lt;br /&gt;
 &amp;quot;internal memory allocation&amp;quot;.  (I hear that) it should use memory more&lt;br /&gt;
 efficiently (and can make the server faster) but is disabled by&lt;br /&gt;
 default due to response-performance reasons in the threaded case.  You&lt;br /&gt;
 can enable this feature by adding the following line&lt;br /&gt;
&lt;br /&gt;
 #define ISC_MEM_USE_INTERNAL_MALLOC 1&lt;br /&gt;
&lt;br /&gt;
 just before the following part of bind9/lib/isc/mem.c:&lt;br /&gt;
&lt;br /&gt;
 #ifndef ISC_MEM_USE_INTERNAL_MALLOC&lt;br /&gt;
 #define ISC_MEM_USE_INTERNAL_MALLOC 0&lt;br /&gt;
 #endif&lt;br /&gt;
Try it and you will keep it. ;)&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
If you have more than one load balancer at different locations and you can convince your local network engineer to let you speak BGP4+ to his routers, you can use quagga with something like the following configuration to fail over the service IP to the second LB if the first one goes down:&lt;br /&gt;
 !&lt;br /&gt;
 router bgp 5430&lt;br /&gt;
  no synchronization&lt;br /&gt;
  bgp router-id a.b.c.d&lt;br /&gt;
  redistribute connected route-map benice&lt;br /&gt;
  neighbor c.d.e.f remote-as 5430&lt;br /&gt;
  neighbor c.d.e.f description ffm4-j2&lt;br /&gt;
  neighbor c.d.e.f send-community both&lt;br /&gt;
  neighbor c.d.e.f soft-reconfiguration inbound&lt;br /&gt;
  neighbor c.d.e.f route-map nixda in&lt;br /&gt;
  neighbor c.d.e.f route-map benice out&lt;br /&gt;
  neighbor d.c.f.e remote-as 5430&lt;br /&gt;
  neighbor d.c.f.e description ffm4-j&lt;br /&gt;
  neighbor d.c.f.e send-community both&lt;br /&gt;
  neighbor d.c.f.e soft-reconfiguration inbound&lt;br /&gt;
  neighbor d.c.f.e route-map nixda in&lt;br /&gt;
  neighbor d.c.f.e route-map benice out&lt;br /&gt;
  no auto-summary&lt;br /&gt;
 !&lt;br /&gt;
 access-list line permit 127.0.0.1/32 exact-match&lt;br /&gt;
 access-list line deny any&lt;br /&gt;
 !&lt;br /&gt;
 ip prefix-list cns-dus2 description dus2 high-metric eq low-preference&lt;br /&gt;
 ip prefix-list cns-dus2 seq 5 permit 194.97.173.125/32&lt;br /&gt;
 ip prefix-list cns-dus2 seq 10 deny any&lt;br /&gt;
 ip prefix-list cns-ffm4 description ffm4 low-metric eq high-preference&lt;br /&gt;
 ip prefix-list cns-ffm4 seq 5 permit 194.97.173.124/32&lt;br /&gt;
 ip prefix-list cns-ffm4 seq 10 deny any&lt;br /&gt;
 !&lt;br /&gt;
 route-map benice permit 10&lt;br /&gt;
  match ip address prefix-list cns-ffm4&lt;br /&gt;
  set local-preference 100&lt;br /&gt;
  set metric 0&lt;br /&gt;
 !&lt;br /&gt;
 route-map benice permit 20&lt;br /&gt;
  match ip address prefix-list cns-dus2&lt;br /&gt;
  set local-preference 100&lt;br /&gt;
  set metric 1&lt;br /&gt;
 !&lt;br /&gt;
 route-map nixda deny 10&lt;br /&gt;
 !&lt;br /&gt;
This is the LB at FFM4. Note that the metric at the DUS2 LB is just the other way around.&lt;br /&gt;
Here we choose to talk to two core routers from each LB for extra redundancy.&lt;br /&gt;
You can also have an internal anycast ServiceIP if you use the same metric at both LBs and make sure they are attached to the same level of the router hierarchy, network-topology-wise. This way traffic gets shared between the two load balancers according to your network topology, which is of course most interesting for large dial-in ISPs.&lt;br /&gt;
&lt;br /&gt;
== Conclusion ==&lt;br /&gt;
&lt;br /&gt;
It just works.&lt;br /&gt;
&lt;br /&gt;
[[Category:LVS Examples|DNS]]&lt;/div&gt;</summary>
		<author><name>ZaphodB</name></author>	</entry>

	<entry>
		<id>http://kb.linux-vs.org/wiki?title=Building_Scalable_DNS_Cluster_using_LVS&amp;diff=507</id>
		<title>Building Scalable DNS Cluster using LVS</title>
		<link rel="alternate" type="text/html" href="http://kb.linux-vs.org/wiki?title=Building_Scalable_DNS_Cluster_using_LVS&amp;diff=507"/>
				<updated>2006-01-20T09:36:48Z</updated>
		
		<summary type="html">&lt;p&gt;ZaphodB: /* Conclusion */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Introduction ==&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Architecture ==&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Configuration Example ==&lt;br /&gt;
&lt;br /&gt;
keepalived.conf:&lt;br /&gt;
 ! Balancer-Set for udp/53&lt;br /&gt;
 virtual_server 194.97.173.124 53 {&lt;br /&gt;
    delay_loop 10&lt;br /&gt;
    lb_algo wrr&lt;br /&gt;
    lb_kind DR&lt;br /&gt;
    protocol UDP&lt;br /&gt;
    ! persistence_timeout 1&lt;br /&gt;
    ! persistence_granularity 255.255.255.255&lt;br /&gt;
    ! eth1.105 -&amp;gt; kai eth1.105&lt;br /&gt;
    real_server 10.1.53.2 53 {&lt;br /&gt;
        weight 1&lt;br /&gt;
        MISC_CHECK {&lt;br /&gt;
            misc_path &amp;quot;/usr/bin/dig -b 10.1.53.1 a resolve.test.roka.net @10.1.53.2 +time=1 +tries=5 +fail &amp;gt; /dev/null&amp;quot;&lt;br /&gt;
            misc_timeout 6&lt;br /&gt;
        }&lt;br /&gt;
    }&lt;br /&gt;
    ! eth1.109 -&amp;gt; kai eth1.109&lt;br /&gt;
    real_server 10.3.53.2 53 {&lt;br /&gt;
        weight 1&lt;br /&gt;
        MISC_CHECK {&lt;br /&gt;
            misc_path &amp;quot;/usr/bin/dig -b 10.3.53.1 a resolve.test.roka.net @10.3.53.2 +time=1 +tries=5 +fail &amp;gt; /dev/null&amp;quot;&lt;br /&gt;
            misc_timeout 6&lt;br /&gt;
        }&lt;br /&gt;
    }&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
As you can dig (;-) we are using an A record with a low TTL to test the service, since this setup is a recursive DNS cluster. So far dig works fine with 44 real_servers configured on an idle Dual PIII 800.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
On real_server kai we use the following netfilter setup to direct the traffic to different BIND processes on the same machine/MAC:&lt;br /&gt;
 #DNAT 194.97.173.124-&amp;gt;10.1.53.2 eth1.105&lt;br /&gt;
 $ipt -t nat -A PREROUTING -i eth1.105 -s $net -d 194.97.173.124 -p tcp --dport 53 -j DNAT --to-destination 10.1.53.2:53&lt;br /&gt;
 $ipt -t nat -A PREROUTING -i eth1.105 -s $net -d 194.97.173.124 -p udp --dport 53 -j DNAT --to-destination 10.1.53.2:53&lt;br /&gt;
 #DNAT 194.97.173.124-&amp;gt;10.3.53.2 eth1.109&lt;br /&gt;
 $ipt -t nat -A PREROUTING -i eth1.109 -s $net -d 194.97.173.124 -p tcp --dport 53 -j DNAT --to-destination 10.3.53.2:53&lt;br /&gt;
 $ipt -t nat -A PREROUTING -i eth1.109 -s $net -d 194.97.173.124 -p udp --dport 53 -j DNAT --to-destination 10.3.53.2:53&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
If you have more than one load balancer at different locations and you can convince your local network engineer to let you speak BGP4+ to his routers, you can use quagga with something like the following configuration to fail over the service IP to the second LB if the first one goes down:&lt;br /&gt;
 !&lt;br /&gt;
 router bgp 5430&lt;br /&gt;
  no synchronization&lt;br /&gt;
  bgp router-id a.b.c.d&lt;br /&gt;
  redistribute connected route-map benice&lt;br /&gt;
  neighbor c.d.e.f remote-as 5430&lt;br /&gt;
  neighbor c.d.e.f description ffm4-j2&lt;br /&gt;
  neighbor c.d.e.f send-community both&lt;br /&gt;
  neighbor c.d.e.f soft-reconfiguration inbound&lt;br /&gt;
  neighbor c.d.e.f route-map nixda in&lt;br /&gt;
  neighbor c.d.e.f route-map benice out&lt;br /&gt;
  neighbor d.c.f.e remote-as 5430&lt;br /&gt;
  neighbor d.c.f.e description ffm4-j&lt;br /&gt;
  neighbor d.c.f.e send-community both&lt;br /&gt;
  neighbor d.c.f.e soft-reconfiguration inbound&lt;br /&gt;
  neighbor d.c.f.e route-map nixda in&lt;br /&gt;
  neighbor d.c.f.e route-map benice out&lt;br /&gt;
  no auto-summary&lt;br /&gt;
 !&lt;br /&gt;
 access-list line permit 127.0.0.1/32 exact-match&lt;br /&gt;
 access-list line deny any&lt;br /&gt;
 !&lt;br /&gt;
 ip prefix-list cns-dus2 description dus2 high-metric eq low-preference&lt;br /&gt;
 ip prefix-list cns-dus2 seq 5 permit 194.97.173.125/32&lt;br /&gt;
 ip prefix-list cns-dus2 seq 10 deny any&lt;br /&gt;
 ip prefix-list cns-ffm4 description ffm4 low-metric eq high-preference&lt;br /&gt;
 ip prefix-list cns-ffm4 seq 5 permit 194.97.173.124/32&lt;br /&gt;
 ip prefix-list cns-ffm4 seq 10 deny any&lt;br /&gt;
 !&lt;br /&gt;
 route-map benice permit 10&lt;br /&gt;
  match ip address prefix-list cns-ffm4&lt;br /&gt;
  set local-preference 100&lt;br /&gt;
  set metric 0&lt;br /&gt;
 !&lt;br /&gt;
 route-map benice permit 20&lt;br /&gt;
  match ip address prefix-list cns-dus2&lt;br /&gt;
  set local-preference 100&lt;br /&gt;
  set metric 1&lt;br /&gt;
 !&lt;br /&gt;
 route-map nixda deny 10&lt;br /&gt;
 !&lt;br /&gt;
This is the LB at FFM4. Note that the metric at the DUS2 LB is just the other way around.&lt;br /&gt;
Here we choose to talk to two core routers from each LB for extra redundancy.&lt;br /&gt;
You can also have an internal anycast ServiceIP if you use the same metric at both LBs and make sure they are attached to the same level of the router hierarchy, network-topology-wise. This way traffic gets shared between the two load balancers according to your network topology, which is of course most interesting for large dial-in ISPs.&lt;br /&gt;
&lt;br /&gt;
== Conclusion ==&lt;br /&gt;
&lt;br /&gt;
It just works.&lt;br /&gt;
&lt;br /&gt;
[[Category:LVS Examples|DNS]]&lt;/div&gt;</summary>
		<author><name>ZaphodB</name></author>	</entry>

	<entry>
		<id>http://kb.linux-vs.org/wiki?title=Building_Scalable_DNS_Cluster_using_LVS&amp;diff=504</id>
		<title>Building Scalable DNS Cluster using LVS</title>
		<link rel="alternate" type="text/html" href="http://kb.linux-vs.org/wiki?title=Building_Scalable_DNS_Cluster_using_LVS&amp;diff=504"/>
				<updated>2006-01-20T09:34:46Z</updated>
		
		<summary type="html">&lt;p&gt;ZaphodB: /* Configuration Example */ now here is actually something written ;)&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Introduction ==&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Architecture ==&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Configuration Example ==&lt;br /&gt;
&lt;br /&gt;
keepalived.conf:&lt;br /&gt;
 ! Balancer-Set for udp/53&lt;br /&gt;
 virtual_server 194.97.173.124 53 {&lt;br /&gt;
    delay_loop 10&lt;br /&gt;
    lb_algo wrr&lt;br /&gt;
    lb_kind DR&lt;br /&gt;
    protocol UDP&lt;br /&gt;
    ! persistence_timeout 1&lt;br /&gt;
    ! persistence_granularity 255.255.255.255&lt;br /&gt;
    ! eth1.105 -&amp;gt; kai eth1.105&lt;br /&gt;
    real_server 10.1.53.2 53 {&lt;br /&gt;
        weight 1&lt;br /&gt;
        MISC_CHECK {&lt;br /&gt;
            misc_path &amp;quot;/usr/bin/dig -b 10.1.53.1 a resolve.test.roka.net @10.1.53.2 +time=1 +tries=5 +fail &amp;gt; /dev/null&amp;quot;&lt;br /&gt;
            misc_timeout 6&lt;br /&gt;
        }&lt;br /&gt;
    }&lt;br /&gt;
    ! eth1.109 -&amp;gt; kai eth1.109&lt;br /&gt;
    real_server 10.3.53.2 53 {&lt;br /&gt;
        weight 1&lt;br /&gt;
        MISC_CHECK {&lt;br /&gt;
            misc_path &amp;quot;/usr/bin/dig -b 10.3.53.1 a resolve.test.roka.net @10.3.53.2 +time=1 +tries=5 +fail &amp;gt; /dev/null&amp;quot;&lt;br /&gt;
            misc_timeout 6&lt;br /&gt;
        }&lt;br /&gt;
    }&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
As you can dig (;-) we are using an A record with a low TTL to test the service, since this setup is a recursive DNS cluster. So far dig works fine with 44 real_servers configured on an idle Dual PIII 800.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
On real_server kai we use the following netfilter setup to direct the traffic to different BIND processes on the same machine/MAC:&lt;br /&gt;
 #DNAT 194.97.173.124-&amp;gt;10.1.53.2 eth1.105&lt;br /&gt;
 $ipt -t nat -A PREROUTING -i eth1.105 -s $net -d 194.97.173.124 -p tcp --dport 53 -j DNAT --to-destination 10.1.53.2:53&lt;br /&gt;
 $ipt -t nat -A PREROUTING -i eth1.105 -s $net -d 194.97.173.124 -p udp --dport 53 -j DNAT --to-destination 10.1.53.2:53&lt;br /&gt;
 #DNAT 194.97.173.124-&amp;gt;10.3.53.2 eth1.109&lt;br /&gt;
 $ipt -t nat -A PREROUTING -i eth1.109 -s $net -d 194.97.173.124 -p tcp --dport 53 -j DNAT --to-destination 10.3.53.2:53&lt;br /&gt;
 $ipt -t nat -A PREROUTING -i eth1.109 -s $net -d 194.97.173.124 -p udp --dport 53 -j DNAT --to-destination 10.3.53.2:53&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
If you have more than one load balancer at different locations and you can convince your local network engineer to let you speak BGP4+ to his routers, you can use quagga with something like the following configuration to fail over the service IP to the second LB if the first one goes down:&lt;br /&gt;
 !&lt;br /&gt;
 router bgp 5430&lt;br /&gt;
  no synchronization&lt;br /&gt;
  bgp router-id a.b.c.d&lt;br /&gt;
  redistribute connected route-map benice&lt;br /&gt;
  neighbor c.d.e.f remote-as 5430&lt;br /&gt;
  neighbor c.d.e.f description ffm4-j2&lt;br /&gt;
  neighbor c.d.e.f send-community both&lt;br /&gt;
  neighbor c.d.e.f soft-reconfiguration inbound&lt;br /&gt;
  neighbor c.d.e.f route-map nixda in&lt;br /&gt;
  neighbor c.d.e.f route-map benice out&lt;br /&gt;
  neighbor d.c.f.e remote-as 5430&lt;br /&gt;
  neighbor d.c.f.e description ffm4-j&lt;br /&gt;
  neighbor d.c.f.e send-community both&lt;br /&gt;
  neighbor d.c.f.e soft-reconfiguration inbound&lt;br /&gt;
  neighbor d.c.f.e route-map nixda in&lt;br /&gt;
  neighbor d.c.f.e route-map benice out&lt;br /&gt;
  no auto-summary&lt;br /&gt;
 !&lt;br /&gt;
 access-list line permit 127.0.0.1/32 exact-match&lt;br /&gt;
 access-list line deny any&lt;br /&gt;
 !&lt;br /&gt;
 ip prefix-list cns-dus2 description dus2 high-metric eq low-preference&lt;br /&gt;
 ip prefix-list cns-dus2 seq 5 permit 194.97.173.125/32&lt;br /&gt;
 ip prefix-list cns-dus2 seq 10 deny any&lt;br /&gt;
 ip prefix-list cns-ffm4 description ffm4 low-metric eq high-preference&lt;br /&gt;
 ip prefix-list cns-ffm4 seq 5 permit 194.97.173.124/32&lt;br /&gt;
 ip prefix-list cns-ffm4 seq 10 deny any&lt;br /&gt;
 !&lt;br /&gt;
 route-map benice permit 10&lt;br /&gt;
  match ip address prefix-list cns-ffm4&lt;br /&gt;
  set local-preference 100&lt;br /&gt;
  set metric 0&lt;br /&gt;
 !&lt;br /&gt;
 route-map benice permit 20&lt;br /&gt;
  match ip address prefix-list cns-dus2&lt;br /&gt;
  set local-preference 100&lt;br /&gt;
  set metric 1&lt;br /&gt;
 !&lt;br /&gt;
 route-map nixda deny 10&lt;br /&gt;
 !&lt;br /&gt;
This is the LB at FFM4. Note that the metric at the DUS2 LB is just the other way around.&lt;br /&gt;
Here we choose to talk to two core routers from each LB for extra redundancy.&lt;br /&gt;
You can also have an internal anycast ServiceIP if you use the same metric at both LBs and make sure they are attached to the same level of the router hierarchy, network-topology-wise. This way traffic gets shared between the two load balancers according to your network topology, which is of course most interesting for large dial-in ISPs.&lt;br /&gt;
&lt;br /&gt;
== Conclusion ==&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[Category:LVS Examples|DNS]]&lt;/div&gt;</summary>
		<author><name>ZaphodB</name></author>	</entry>

	</feed>