Two-node setup with overlapping client subnets

This guide describes how to set up a two-node cluster that handles multiple overlapping client subnets and keeps clients uniquely identifiable, all the way through to the main application.

Preamble

This setup requires kernel version 2.6.37 or newer. In particular, it depends on the following recent features:

Accept incoming packets with local source (v2.6.33)
Connection tracking zones (v2.6.34)
Netfilter nat INPUT chain, NETMAP changes (v2.6.36)
Netfilter connection tracking for lvs/ipvs (v2.6.37)
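
A quick way to check the running kernel against that minimum is sketched below. The helper name version_ge is made up for this example. It only compares the first three numeric fields of the version string, so distribution kernels that backport features are not accounted for.

```shell
#!/bin/sh
# Hypothetical helper: succeeds if version $1 is at least version $2.
# Only the first three dot-separated numeric fields are compared; any
# suffix after the first '-' (e.g. "-5-amd64") is ignored.
version_ge() {
  awk -v have="${1%%-*}" -v need="$2" 'BEGIN {
    split(have, h, "."); split(need, n, ".")
    for (i = 1; i <= 3; i++) {
      if (h[i] + 0 > n[i] + 0) exit 0
      if (h[i] + 0 < n[i] + 0) exit 1
    }
    exit 0
  }'
}

if version_ge "$(uname -r)" 2.6.37; then
  echo "kernel $(uname -r): new enough"
else
  echo "kernel $(uname -r): older than 2.6.37"
fi
```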

Introduction

This setup solves the challenge of serving remote users from multiple sites that all use the same, overlapping subnet, using a single pair of nodes. It also maps each remote site to a unique IP range so that users can be identified in application logs and the like. In short, it combines a two-node, multi-interface LVS setup with network remapping. It's ultimately a fairly complicated puzzle.

Network infrastructure

Network diagram:

Main virtual IP: 1.2.3.4 port 80
                                                 /-----------\
          +------------------------------+   /==( 10.0.0.0/24 )
          | Router                       |  //   \-----------/
          | (VPN)   ' ' ' ' ' ' ' ' ' ' '===/      Remote A
          |         '  (crypto map)      |
          |         '           '  '  ' '===\
          |         '           '        |  \\   /-----------\
          |       if0.10     if0.20      |   \==( 10.0.0.0/24 )
          |  172.16.10.1     172.16.20.1 |       \-----------/
          +------------#-----$-----------+         Remote B
                       #     $
               VLAN 10 #     $ VLAN 20
                       #     $
    VIP 172.16.10.2/24 #     $ VIP 172.16.20.2/24
        VIP 1.2.3.4/32 #     $ VIP 1.2.3.4/32
+------------------+   #     $   +------------------+
|  LVS A           |   #     $   |           LVS B  |
|          eth0.10 ######### $ ### eth0.10          |
|  RIP 172.16.10.3 |         $   | RIP 172.16.10.4  |
|                  |         $   |                  |
|          eth0.20 $$$$$$$$$$$$$$$ eth0.20          |
|  RIP 172.16.20.3 |             | RIP 172.16.20.4  |
|                  |             |                  |
+------------------+             +------------------+

Notes

While you can use individual network interfaces, using VLANs saves valuable resources. Combine VLANs with interface bonding to achieve an even higher degree of resilience against failures.

The router maps each remote site to its own VLAN. How this is done isn't really important: the remote sites can be directly connected on separate egress interfaces, or connected through IPsec VPNs as in the diagram. The VPN approach is common on Cisco routers, using VRF-aware IPsec. In short, a crypto map is defined so that tunnel A is mapped to VRF (virtual routing and forwarding) instance 10, and tunnel B to VRF 20. Each VRF instance has its own routing table, which routes the virtual IP towards the LVS VIP on the corresponding VLAN.

An obvious and easy solution to the overlapping subnets would be to have the router do SNAT/masquerading of the incoming packets. In my case, I spent a lot of time trying to get that to work on the Cisco router, without luck.

Theory of operation

Assuming that traffic reaches the LVS pair on both VLAN 10 and 20, the idea is that packets are handled in the following way on each VLAN:

Incoming packets coming from the client subnet 10.0.0.0/24, destined for the virtual IP 1.2.3.4 on port 80, are marked with an fwmark using iptables with the MARK target.
Keepalived/LVS/ipvs is configured to schedule packets based on fwmarks. Mark 110 is load balanced to the VLAN 10 backends 172.16.10.3 and .4. Mark 120 is load balanced to the VLAN 20 backends 172.16.20.3 and .4.
Either on the way in on the same node, or on the way out to the other node, the packet's source IP is mapped to a unique subnet using the iptables NETMAP target.
The packet is handled by the main application, and a response packet is sent back to the source.
RPDB entries and custom routing tables are set up using iproute2, to ensure that the response packet makes it back the same way it came, through the NETMAP translation and then out the same interface it came in.

In the end, the main application (running on both nodes) would see clients from site A coming from source 10.0.10.0/24, and clients from site B coming from 10.0.20.0/24. Any application/user logic, log analysis or accounting, would have to take this into account and do a reverse mapping.
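
The reverse mapping can be as simple as the following sketch. It relies on NETMAP keeping the host part of the address intact, so only the network part needs to be swapped back. The function name and the site labels A and B are illustrative; the subnets are the ones used in this guide.

```shell
#!/bin/sh
# Hypothetical helper for log processing: turn a NETMAP-translated
# address back into "<site> <original address>".
unmap_client() {
  case "$1" in
    10.0.10.*) echo "A 10.0.0.${1##*.}" ;;
    10.0.20.*) echo "B 10.0.0.${1##*.}" ;;
    *)         echo "unknown $1" ;;
  esac
}

unmap_client 10.0.10.57   # a client at site A
unmap_client 10.0.20.57   # a client at site B with the same address
```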

Proof of concept

The following script uses iptables and iproute2 to bring the network into the required state. It assumes that the network interfaces have already been set up with their basic IP addresses:

LVS A, eth0.10: 172.16.10.3/24, default gateway 172.16.10.1
LVS A, eth0.20: 172.16.20.3/24, no default gateway
LVS B, eth0.10: 172.16.10.4/24, default gateway 172.16.10.1
LVS B, eth0.20: 172.16.20.4/24, no default gateway

Network configuration script

#!/bin/sh

# This proof of concept script is intended to be straightforward to
# read and understand, rather than being cleverly written with
# variables, loops, etc. It is intended to work on both nodes, so some
# node-specific variables must be set initially.

# Unique variables per node, derived from hostname.
case `hostname` in
  lvsa)
    other_mac=00:10:10:10:10:20 # lvsb eth0 mac address
    v10my_ip=172.16.10.3        # lvsa eth0.10 ip addr
    v20my_ip=172.16.20.3        # lvsa eth0.20 ip addr
    v10other_ip=172.16.10.4     # lvsb eth0.10 ip addr
    v20other_ip=172.16.20.4     # lvsb eth0.20 ip addr
    ;;
  lvsb)
    other_mac=00:10:10:10:10:10 # lvsa eth0 mac address
    v10my_ip=172.16.10.4        # lvsb eth0.10 ip addr
    v20my_ip=172.16.20.4        # lvsb eth0.20 ip addr
    v10other_ip=172.16.10.3     # lvsa eth0.10 ip addr
    v20other_ip=172.16.20.3     # lvsa eth0.20 ip addr
    ;;
  *)
    echo >&2 "Unknown host: `hostname`"
    exit 1
    ;;
esac

start() {
  ### RPDB (Routing Policy Database)
  # Remote response packets (those that will have to be sent to the
  # other node) will have source IP equal to the outgoing interface's
  # primary address, due to the iptables REDIRECT that rewrites the
  # destination address to the incoming interface's primary address on
  # incoming packets. The routing tables pointed to have the other
  # node as gateway.
  ip rule add pref 110 from $v10my_ip to 10.0.10.0/24 lookup 210
  ip rule add pref 120 from $v20my_ip to 10.0.20.0/24 lookup 220

  # Local response packets will have been NETMAP detranslated already,
  # so the destination will be the untranslated source net. The
  # routing tables pointed to have the upstream router as gateway,
  # since the packets should be sent straight back to the source.
  ip rule add pref 210 to 10.0.10.0/24 lookup 110
  ip rule add pref 220 to 10.0.20.0/24 lookup 120

  # Remote response packets are marked so that they are routed out the
  # correct interface when sent out.
  ip rule add pref 310 to 10.0.0.0/24 fwmark 210 lookup 110
  ip rule add pref 320 to 10.0.0.0/24 fwmark 220 lookup 120


  ### Routing
  # Routing tables pointing the default gateway to the upstream
  # router.
  ip route add default via 172.16.10.1 dev eth0.10 table 110
  ip route add default via 172.16.20.1 dev eth0.20 table 120

  # Routing tables pointing the default gateway to the other node.
  ip route add default via $v10other_ip dev eth0.10 table 210
  ip route add default via $v20other_ip dev eth0.20 table 220


  ### Netfilter rules
  # Put request packets on each interface into separate connection
  # tracking zones. Connections in one zone are tracked separately
  # from those in other zones, so NAT mappings never clash even though
  # the client addresses overlap. See
  # http://lwn.net/Articles/371028/
  # https://github.com/torvalds/linux/commit/5d0aa2ccd4699a01cfdf14886191c249d7b45a01
  iptables -t raw -A PREROUTING -i eth0.10 -s 10.0.0.0/24 -d 1.2.3.4 -p tcp --dport 80 -j CT --zone 1
  iptables -t raw -A PREROUTING -i eth0.20 -s 10.0.0.0/24 -d 1.2.3.4 -p tcp --dport 80 -j CT --zone 2

  # Put remote response packets (from the other node) into the
  # corresponding connection tracking zones.
  iptables -t raw -A PREROUTING -d 10.0.10.0/24 -s 1.2.3.4 -p tcp --sport 80 -j CT --zone 1
  iptables -t raw -A PREROUTING -d 10.0.20.0/24 -s 1.2.3.4 -p tcp --sport 80 -j CT --zone 2

  # Put local response packets (from this node) into the corresponding
  # connection tracking zones.
  iptables -t raw -A OUTPUT -d 10.0.10.0/24 -s 1.2.3.4 -p tcp --sport 80 -j CT --zone 1
  iptables -t raw -A OUTPUT -d 10.0.20.0/24 -s 1.2.3.4 -p tcp --sport 80 -j CT --zone 2

  # Mark incoming packets for lvs/ipvs scheduling. These marks match
  # the ones in keepalived.conf.
  iptables -t raw -A PREROUTING -i eth0.10 -s 10.0.0.0/24 -d 1.2.3.4 -p tcp --dport 80 -j MARK --set-mark 110
  iptables -t raw -A PREROUTING -i eth0.20 -s 10.0.0.0/24 -d 1.2.3.4 -p tcp --dport 80 -j MARK --set-mark 120

  # Mark remote response packets (from the other node) so that they
  # are routed out the correct interface after NETMAP detranslation.
  iptables -t raw -A PREROUTING -i eth0.10 -s 1.2.3.4/32 -d 10.0.10.0/24 -p tcp -m tcp --sport 80 -j MARK --set-mark 210
  iptables -t raw -A PREROUTING -i eth0.20 -s 1.2.3.4/32 -d 10.0.20.0/24 -p tcp -m tcp --sport 80 -j MARK --set-mark 220

  # Mark local response packets (from this node) so that they are
  # routed out the correct interface after NETMAP detranslation.
  iptables -t raw -A OUTPUT -o eth0.10 -s 1.2.3.4/32 -d 10.0.10.0/24 -p tcp -m tcp --sport 80 -j MARK --set-mark 210
  iptables -t raw -A OUTPUT -o eth0.20 -s 1.2.3.4/32 -d 10.0.20.0/24 -p tcp -m tcp --sport 80 -j MARK --set-mark 220

  # Remap remote request packets to unique subnets.
  # https://github.com/torvalds/linux/commit/c68cd6cc21eb329c47ff020ff7412bf58176984e
  iptables -t nat -A POSTROUTING -m mark --mark 110 -j NETMAP --to 10.0.10.0/24
  iptables -t nat -A POSTROUTING -m mark --mark 120 -j NETMAP --to 10.0.20.0/24

  # Remap local request packets to unique subnets.
  iptables -t nat -A INPUT -m mark --mark 110 -j NETMAP --to 10.0.10.0/24
  iptables -t nat -A INPUT -m mark --mark 120 -j NETMAP --to 10.0.20.0/24

  # Redirect incoming remote request packets so that the destination
  # IP is rewritten to the primary address of the incoming interface.
  # The response packet then carries that address as its source, which
  # is essential since it ensures that the response is routed out the
  # same interface. Without it, the routing would select the default
  # route. A potential solution would involve connmark, but the mark
  # is applied after the ip rule is evaluated.
  # FIXME: The mac-source matching should be unnecessary, as the
  # source subnet has been translated already? Needs verification.
  iptables -t nat -A PREROUTING -s 10.0.10.0/24 -m mac --mac-source $other_mac -d 1.2.3.4 -j REDIRECT
  iptables -t nat -A PREROUTING -s 10.0.20.0/24 -m mac --mac-source $other_mac -d 1.2.3.4 -j REDIRECT


  ### Sysctl settings
  # Use ARP settings that work with our setup.
  # http://lxr.linux.no/#linux+v3.0/Documentation/networking/ip-sysctl.txt#L895
  # http://lxr.linux.no/#linux+v3.0/Documentation/networking/ip-sysctl.txt#L926
  # http://kb.linuxvirtualserver.org/wiki/ARP_Issues_in_LVS/DR_and_LVS/TUN_Clusters
  sysctl net.ipv4.conf.eth0/10.arp_announce=2
  sysctl net.ipv4.conf.eth0/10.arp_ignore=1

  # Forwarding must be enabled. It doesn't apply to packets scheduled
  # by lvs/ipvs, but it is needed for remote response packets.
  # (Forwarding applies to the inbound interface, not the outbound.)
  sysctl net.ipv4.conf.eth0/10.forwarding=1

  # Accept incoming packets with a local source address. This is
  # required, as remote response packets will have the virtual IP as
  # their source: That virtual IP is also present as a secondary
  # address on the local incoming interface. Normally it would be
  # dropped, but this sysctl allows it to be accepted.
  # https://github.com/torvalds/linux/commit/8153a10c08f1312af563bb92532002e46d3f504a
  # http://lxr.linux.no/#linux+v3.0/Documentation/networking/ip-sysctl.txt#L849
  sysctl net.ipv4.conf.eth0/10.accept_local=1

  # Enable connection tracking for lvs/ipvs connections. This lets us
  # apply the NETMAP rule in POSTROUTING for remote request packets.
  # Without this setting, the netfilter nat table would not be
  # traversed by the ipvs'ed packets.
  # https://github.com/torvalds/linux/commit/f4bc17cdd205ebaa3807c2aa973719bb5ce6a5b2
  # http://lxr.linux.no/#linux+v3.0/net/netfilter/ipvs/Kconfig#L252
  sysctl net.ipv4.vs.conntrack=1

  # Disable accepting and sending ICMP redirects. This is essential to
  # avoid redirecting remote response packets directly to the router:
  # These packets must go through the lvs/ipvs master for correct
  # NETMAP detranslation. Setting this for 'all' is enough as long as
  # other interfaces have forwarding enabled (which they do).
  # http://lxr.linux.no/#linux+v3.0/Documentation/networking/ip-sysctl.txt#L753
  sysctl net.ipv4.conf.all.accept_redirects=0
  sysctl net.ipv4.conf.all.send_redirects=0
}

stop() {
  # Revert most of the settings from start().
  sysctl net.ipv4.conf.eth0/10.accept_local=0
  sysctl net.ipv4.conf.eth0/10.forwarding=0

  iptables -t nat -D PREROUTING -s 10.0.20.0/24 -m mac --mac-source $other_mac -d 1.2.3.4 -j REDIRECT
  iptables -t nat -D PREROUTING -s 10.0.10.0/24 -m mac --mac-source $other_mac -d 1.2.3.4 -j REDIRECT
  iptables -t nat -D INPUT -m mark --mark 120 -j NETMAP --to 10.0.20.0/24
  iptables -t nat -D INPUT -m mark --mark 110 -j NETMAP --to 10.0.10.0/24
  iptables -t nat -D POSTROUTING -m mark --mark 120 -j NETMAP --to 10.0.20.0/24
  iptables -t nat -D POSTROUTING -m mark --mark 110 -j NETMAP --to 10.0.10.0/24
  iptables -t raw -D OUTPUT -o eth0.20 -s 1.2.3.4/32 -d 10.0.20.0/24 -p tcp -m tcp --sport 80 -j MARK --set-mark 220
  iptables -t raw -D OUTPUT -o eth0.10 -s 1.2.3.4/32 -d 10.0.10.0/24 -p tcp -m tcp --sport 80 -j MARK --set-mark 210
  iptables -t raw -D PREROUTING -i eth0.20 -s 1.2.3.4/32 -d 10.0.20.0/24 -p tcp -m tcp --sport 80 -j MARK --set-mark 220
  iptables -t raw -D PREROUTING -i eth0.10 -s 1.2.3.4/32 -d 10.0.10.0/24 -p tcp -m tcp --sport 80 -j MARK --set-mark 210
  iptables -t raw -D PREROUTING -i eth0.20 -s 10.0.0.0/24 -d 1.2.3.4 -p tcp --dport 80 -j MARK --set-mark 120
  iptables -t raw -D PREROUTING -i eth0.10 -s 10.0.0.0/24 -d 1.2.3.4 -p tcp --dport 80 -j MARK --set-mark 110
  iptables -t raw -D OUTPUT -d 10.0.20.0/24 -s 1.2.3.4 -p tcp --sport 80 -j CT --zone 2
  iptables -t raw -D OUTPUT -d 10.0.10.0/24 -s 1.2.3.4 -p tcp --sport 80 -j CT --zone 1
  iptables -t raw -D PREROUTING -d 10.0.20.0/24 -s 1.2.3.4 -p tcp --sport 80 -j CT --zone 2
  iptables -t raw -D PREROUTING -d 10.0.10.0/24 -s 1.2.3.4 -p tcp --sport 80 -j CT --zone 1
  iptables -t raw -D PREROUTING -i eth0.20 -s 10.0.0.0/24 -d 1.2.3.4 -p tcp --dport 80 -j CT --zone 2
  iptables -t raw -D PREROUTING -i eth0.10 -s 10.0.0.0/24 -d 1.2.3.4 -p tcp --dport 80 -j CT --zone 1

  ip route flush table 220
  ip route flush table 210
  ip route flush table 120
  ip route flush table 110
  ip rule del pref 320
  ip rule del pref 310
  ip rule del pref 220
  ip rule del pref 210
  ip rule del pref 120
  ip rule del pref 110
}

case "$1" in
  start|stop)
    $1
    ;;
  restart)
    stop
    start
    ;;
  *)
    echo >&2 "Usage: $0 start|stop|restart"
    exit 1
    ;;
esac
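
Note that the sysctl lines in the script write eth0/10 rather than eth0.10. The sysctl tool uses '.' as the key separator, so the dot in a VLAN interface name is written as '/' instead. A small sketch of that conversion, with a made-up helper name:

```shell
#!/bin/sh
# Hypothetical helper: build the per-interface sysctl key for
# interface $1 and setting $2, swapping '.' for '/' in the name.
conf_key() {
  printf 'net.ipv4.conf.%s.%s\n' "$(printf '%s' "$1" | tr . /)" "$2"
}

conf_key eth0.10 accept_local
conf_key eth0.20 arp_ignore
```

The result can be passed straight on, as in: sysctl "$(conf_key eth0.10 accept_local)=1"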

keepalived.conf

FIXME: This is fairly stripped down, so might not work out of the box.

vrrp_sync_group mylvs {
  group {
    VI_1
    VI_2
  }
}
vrrp_instance VI_1 {
  state BACKUP
  interface eth0.10
  virtual_router_id 10
  priority 100
  virtual_ipaddress {
    172.16.10.2 # VLAN 10 VIP
    1.2.3.4     # Main virtual IP
  }
}
vrrp_instance VI_2 {
  state BACKUP
  interface eth0.20
  virtual_router_id 20
  priority 150
  virtual_ipaddress {
    172.16.20.2 # VLAN 20 VIP
    1.2.3.4     # Main virtual IP
  }
}
virtual_server fwmark 110 {
  lb_algo lc
  lb_kind DR
  persistence_timeout 0
  delay_loop 20
  protocol TCP
  real_server 172.16.10.3 80 {
    weight 1
  }
  real_server 172.16.10.4 80 {
    weight 1
  }
}
virtual_server fwmark 120 {
  lb_algo lc
  lb_kind DR
  persistence_timeout 0
  delay_loop 20
  protocol TCP
  real_server 172.16.20.3 80 {
    weight 1
  }
  real_server 172.16.20.4 80 {
    weight 1
  }
}

Tips

You can use symbolic routing table names instead of numbers (both for ip rule and ip route) by adding the number-name mapping to /etc/iproute2/rt_tables. You can then use syntax like "ip rule add ... lookup v0vlan", and "ip route add ... table v0vlan". For example:

110     v0vlan
120     v1vlan
210     v0to_peer
220     v1to_peer

Keep track of fwmark numbers and ip rule preference numbers. The two number spaces are independent, so the same value can be used in both, but it is easy to mix them up. Use a logical scheme.
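
One such scheme, following the numbers used in the proof of concept script, derives everything from the VLAN id with a 1, 2 or 3 prefix. The sketch below just prints the derived numbers; the function and the label names are made up for illustration.

```shell
#!/bin/sh
# Print the routing table numbers, ip rule preferences and fwmarks
# derived from VLAN id $1. Nothing is configured here.
scheme() {
  v=$1
  echo "rt_vlan=1$v rt_topeer=2$v"
  echo "rp_topeer=1$v rp_return=2$v rp_fwmark=3$v"
  echo "fwm_ipvs=1$v fwm_return=2$v"
}

scheme 10
scheme 20
```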

Debian implementation

The following script is the one I use on my own setup. It is designed to work on systems that use ifupdown (Debian and Ubuntu, for example). Ifupdown can call hook scripts on events (like before or after the interface is brought up or down). In this case it is called after bringing the interface up, and before bringing it down.

/etc/network/interfaces

LVS A

auto lo
iface lo inet loopback

auto eth0
iface eth0 inet manual
  # Load ip_vs here to avoid segfaults in keepalived (it tries to do
  # 'modprobe -k'). See https://bugzilla.redhat.com/show_bug.cgi?id=528465
  pre-up /sbin/modprobe ip_vs
  # Add main virtual IP to lo interface
  up /sbin/ip addr add 1.2.3.4/32 dev lo

# VLAN 10
auto eth0.10
iface eth0.10 inet static
  address 172.16.10.3
  netmask 255.255.255.0
  broadcast 172.16.10.255
  gateway 172.16.10.1

# VLAN 20
auto eth0.20
iface eth0.20 inet static
  address 172.16.20.3
  netmask 255.255.255.0
  broadcast 172.16.20.255

LVS B

auto lo
iface lo inet loopback

auto eth0
iface eth0 inet manual
  # Load ip_vs here to avoid segfaults in keepalived (it tries to do
  # 'modprobe -k'). See https://bugzilla.redhat.com/show_bug.cgi?id=528465
  pre-up /sbin/modprobe ip_vs
  # Add main virtual IP to lo interface
  up /sbin/ip addr add 1.2.3.4/32 dev lo

# VLAN 10
auto eth0.10
iface eth0.10 inet static
  address 172.16.10.4
  netmask 255.255.255.0
  broadcast 172.16.10.255
  gateway 172.16.10.1

# VLAN 20
auto eth0.20
iface eth0.20 inet static
  address 172.16.20.4
  netmask 255.255.255.0
  broadcast 172.16.20.255

ifupdown script

Put this script in /etc/network/if-up.d/, and put a symlink to it in /etc/network/if-down.d/.

#!/bin/sh

# This script is most likely not plug-and-play. Take note of the
# FIXMEs and how the script is intended to work.

# This script depends on MAC and IP addresses, so we only support the
# following two hostnames.
case `hostname` in
  lvsa)
    other_mac=00:10:10:10:10:20 # lvsb eth0 mac address
    ;;
  lvsb)
    other_mac=00:10:10:10:10:10 # lvsa eth0 mac address
    ;;
  *)
    echo >&2 "Skipping unknown hostname: `hostname`"
    exit 0
    ;;
esac

client_net=10.0.0.0/24
virtual_ips="1.2.3.4" # Multiple allowed, space separated.
virtual_ports="80"    # Multiple allowed, space separated.
iface=$IFACE

case "$iface" in
  eth0.10)
    index=0
    ct_zone=1
    my_gw=172.16.10.1
    my_ip=172.16.10.3
    other_ip=172.16.10.4
    test `hostname` = lvsb && {
      my_ip=172.16.10.4
      other_ip=172.16.10.3
    }
    ;;
  eth0.20)
    index=1
    ct_zone=2
    my_gw=172.16.20.1
    my_ip=172.16.20.3
    other_ip=172.16.20.4
    test `hostname` = lvsb && {
      my_ip=172.16.20.4
      other_ip=172.16.20.3
    }
    ;;
  "")
    echo >&2 "Skipping empty interface"
    exit 0
    ;;
  *)
    echo >&2 "Skipping unknown interface: $iface"
    exit 0
    ;;
esac

# These variables use lots of shortcuts based on the $ct_zone number.
netmap=10.0.${ct_zone}0.0/24   # 10.0.10.0/24, 10.0.20.0/24
rt_vlan_num=1${ct_zone}0       # 110, 120
rt_vlan=v${ct_zone}0vlan       # v10vlan, v20vlan
rt_topeer_num=2${ct_zone}0     # 210, 220
rt_topeer=v${ct_zone}0to_peer  # v10to_peer, v20to_peer
rp_topeer=1${ct_zone}0         # 110, 120
rp_return=2${ct_zone}0         # 210, 220
rp_fwmark=3${ct_zone}0         # 310, 320
fwm_ipvs=1${ct_zone}0          # 110, 120
fwm_return=2${ct_zone}0        # 210, 220

start() {
  # Add symbolic routing table names
  # (printf is used because not every echo expands \t.)
  grep -q "^$rt_vlan_num\>" /etc/iproute2/rt_tables ||
    printf '%s\t%s\n' "$rt_vlan_num" "$rt_vlan" >> /etc/iproute2/rt_tables
  grep -q "^$rt_topeer_num\>" /etc/iproute2/rt_tables ||
    printf '%s\t%s\n' "$rt_topeer_num" "$rt_topeer" >> /etc/iproute2/rt_tables

  ### Routing policy database (RPDB) entries
  # Delete stale entries
  ip rule del pref $rp_topeer 2>/dev/null
  ip rule del pref $rp_return 2>/dev/null
  ip rule del pref $rp_fwmark 2>/dev/null

  # From own IP (due to iptables REDIRECT) to mapped subnet,
  # use routing table pointing to peer node.
  ip rule add pref $rp_topeer from $my_ip to $netmap lookup $rt_topeer

  # To mapped subnet (if this node has handled the packet),
  # use routing table pointing to vlan's gateway.
  ip rule add pref $rp_return to $netmap lookup $rt_vlan

  # To original subnet with fwmark (packet has been through ipvs
  # and translation, and is on its way back), use routing table
  # pointing to vlan's gateway.
  ip rule add pref $rp_fwmark to $client_net fwmark $fwm_return lookup $rt_vlan

  ### Routing entries
  # Flush stale tables
  ip route flush table $rt_vlan
  ip route flush table $rt_topeer

  # VLAN gateway
  ip route add default via $my_gw dev $iface table $rt_vlan

  # Other peer is gateway
  ip route add default via $other_ip dev $iface table $rt_topeer

  # Accept local source on interface. This is for packets returning
  # from the second node after being ipvs'ed by this node.
  sysctl net.ipv4.conf.`echo $iface|tr . /`.accept_local=1
  sysctl net.ipv4.conf.`echo $iface|tr . /`.forwarding=1
  sysctl net.ipv4.conf.`echo $iface|tr . /`.arp_announce=2
  sysctl net.ipv4.conf.`echo $iface|tr . /`.arp_ignore=1

  # Ensure conntrack of ipvs'ed packets. Sourcing /etc/sysctl.conf
  # from /etc/init.d/procps happens too early.
  sysctl net.ipv4.vs.conntrack=1

  # https://github.com/torvalds/linux/commit/5d0aa2ccd4699a01cfdf14886191c249d7b45a01
  # Use separate conntrack zone for each interface. Request packets.
  for vip in $virtual_ips; do
    for port in $virtual_ports; do
      iptables -t raw -A PREROUTING -i $iface -s $client_net -d $vip -p tcp --dport $port -j CT --zone $ct_zone
    done
  done

  # Use separate conntrack zone for each interface. Return packets.
  for vip in $virtual_ips; do
    for port in $virtual_ports; do
      for chain in PREROUTING OUTPUT; do
        iptables -t raw -A $chain -d $netmap -s $vip -p tcp --sport $port -j CT --zone $ct_zone
      done
    done
  done

  # Mark packets for ipvs scheduling.
  for vip in $virtual_ips; do
    # FIXME: Use $virtual_ports somehow, and map them to fwmarks?
    iptables -t raw -A PREROUTING -i $iface -s $client_net -d $vip -p tcp --dport 80 -j MARK --set-mark $fwm_ipvs
  done

  # Mark packets for rpdb return routes. Forwarded from peer.
  for vip in $virtual_ips; do
    for port in $virtual_ports; do
      iptables -t raw -A PREROUTING -i $iface -s $vip -d $netmap -p tcp -m tcp --sport $port -j MARK --set-mark $fwm_return
    done
  done

  # Mark packets for rpdb return routes. From self.
  for vip in $virtual_ips; do
    for port in $virtual_ports; do
      iptables -t raw -A OUTPUT -o $iface -s $vip -d $netmap -p tcp -m tcp --sport $port -j MARK --set-mark $fwm_return
    done
  done

  # Map source network to unique subnet
  iptables -t nat -A INPUT       -m mark --mark $fwm_ipvs -j NETMAP --to $netmap
  iptables -t nat -A POSTROUTING -m mark --mark $fwm_ipvs -j NETMAP --to $netmap

  # DNAT to local iface address if packet is coming from other node
  # (after ipvs scheduling). This lets us do correct rpdb+routing for
  # return packets.
  # FIXME: The mac-source matching should be unnecessary, as the
  # source subnet has been translated already? Needs verification.
  for vip in $virtual_ips; do
    iptables -t nat -A PREROUTING -s $netmap -m mac --mac-source $other_mac -d $vip -j REDIRECT
  done
}

stop() {
  # Return packets
  ip rule del pref $rp_topeer 2>/dev/null
  ip rule del pref $rp_return 2>/dev/null
  ip rule del pref $rp_fwmark 2>/dev/null

  # Return routes
  ip route flush table $rt_vlan 2>/dev/null
  ip route flush table $rt_topeer 2>/dev/null

  # Stop accepting local source and forwarding on the interface,
  # reverting the settings from start().
  sysctl net.ipv4.conf.`echo $iface|tr . /`.accept_local=0 2>/dev/null
  sysctl net.ipv4.conf.`echo $iface|tr . /`.forwarding=0 2>/dev/null

  # https://github.com/torvalds/linux/commit/5d0aa2ccd4699a01cfdf14886191c249d7b45a01
  # Use separate conntrack zone for each interface.
  for vip in $virtual_ips; do
    for port in $virtual_ports; do
      iptables -t raw -D PREROUTING -i $iface -s $client_net -d $vip -p tcp --dport $port -j CT --zone $ct_zone 2>/dev/null
    done
  done

  for vip in $virtual_ips; do
    for port in $virtual_ports; do
      for chain in PREROUTING OUTPUT; do
        iptables -t raw -D $chain -d $netmap -s $vip -p tcp --sport $port -j CT --zone $ct_zone 2>/dev/null
      done
    done
  done

  # Marks for ipvs
  for vip in $virtual_ips; do
    # FIXME: Use $virtual_ports somehow, and map them to fwmarks?
    iptables -t raw -D PREROUTING -i $iface -s $client_net -d $vip -p tcp --dport 80 -j MARK --set-mark $fwm_ipvs 2>/dev/null
  done

  # Marks for rpdb return routes
  for vip in $virtual_ips; do
    for port in $virtual_ports; do
      iptables -t raw -D PREROUTING -i $iface -s $vip -d $netmap -p tcp -m tcp --sport $port -j MARK --set-mark $fwm_return 2>/dev/null
    done
  done

  for vip in $virtual_ips; do
    for port in $virtual_ports; do
      iptables -t raw -D OUTPUT -o $iface -s $vip -d $netmap -p tcp -m tcp --sport $port -j MARK --set-mark $fwm_return 2>/dev/null
    done
  done

  # Map source network to unique subnet
  iptables -t nat -D INPUT       -m mark --mark $fwm_ipvs -j NETMAP --to $netmap 2>/dev/null
  iptables -t nat -D POSTROUTING -m mark --mark $fwm_ipvs -j NETMAP --to $netmap 2>/dev/null

  # DNAT to local iface address if packet is coming from other node
  # (after ipvs scheduling).
  for vip in $virtual_ips; do
    iptables -t nat -D PREROUTING -s $netmap -m mac --mac-source $other_mac -d $vip -j REDIRECT 2>/dev/null
  done

  return 0
}

case "$MODE" in
  start)
    start
    ;;
  stop)
    stop
    ;;
esac
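
The script relies on ifupdown passing the interface name and the action to hook scripts through the IFACE and MODE environment variables. The stand-in function below mimics that dispatch so that it can be tried without touching the network. The function name and its output format are invented for this sketch.

```shell
#!/bin/sh
# Stand-in for the hook's dispatch logic. It only reports what the
# real script would do for the given IFACE and MODE values.
hook_dispatch() {
  case "$IFACE" in
    eth0.10|eth0.20) ;;
    *) echo "skip"; return 0 ;;
  esac
  case "$MODE" in
    start|stop) echo "$MODE $IFACE" ;;
    *)          echo "skip" ;;
  esac
}

IFACE=eth0.10 MODE=start hook_dispatch
IFACE=lo MODE=start hook_dispatch
```

The real hook can be exercised by hand in the same way, by setting IFACE and MODE on the command line before running the script as root.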

Troubleshooting

While developing this setup, I ran into tons of problems. The following debugging tricks are invaluable when working with complex network setups.

Tcpdump

The mother of all network debugging. It is very useful here, especially with some good filters. Always use the -e option so you can inspect the MAC addresses. They are very important in this sort of setup.

Here's a useful example that dumps packets on eth0.10, filtering on packets to/from port 80, and involving the MAC addresses for either of the LVS nodes. It also shows ARP and ICMP packets, which is very useful.

tcpdump -envi eth0.10 port 80 and '( ether host 00:10:10:10:10:10 or ether host 00:10:10:10:10:20 )' or arp or icmp

Iptables logging

Logging in all netfilter tables and chains is a great way to inspect how a packet traverses the stack. This script will set up four rules per chain:

Request packets towards port 80 in the start of the chain
Response packets from port 80 in the start of the chain
Request packets towards port 80 in the end of the chain
Response packets from port 80 in the end of the chain

for t in raw mangle nat filter; do
  for c in PREROUTING INPUT FORWARD OUTPUT POSTROUTING; do
    iptables -t $t -I $c -p tcp --dport 80 -j LOG --log-prefix "REQ-A-`echo $t|cut -b-3`-`echo $c|cut -b-3` " 2>/dev/null
    iptables -t $t -I $c -p tcp --sport 80 -j LOG --log-prefix "RES-A-`echo $t|cut -b-3`-`echo $c|cut -b-3` " 2>/dev/null
    iptables -t $t -A $c -p tcp --dport 80 -j LOG --log-prefix "REQ-Z-`echo $t|cut -b-3`-`echo $c|cut -b-3` " 2>/dev/null
    iptables -t $t -A $c -p tcp --sport 80 -j LOG --log-prefix "RES-Z-`echo $t|cut -b-3`-`echo $c|cut -b-3` " 2>/dev/null
  done
done

Then use something like "tail -f /var/log/kern.log" to track what's going on.
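
The logging rules should be removed again once debugging is done. The sketch below prints one "iptables -D" command for every rule added by the loop above, using the same tables, chains and prefixes. It executes nothing by itself, and the function name is made up.

```shell
#!/bin/sh
# Print the delete commands matching the LOG rules from the loop
# above. Pipe the output to sh (as root) to actually remove them.
log_cleanup_cmds() {
  for t in raw mangle nat filter; do
    for c in PREROUTING INPUT FORWARD OUTPUT POSTROUTING; do
      tag="$(echo $t | cut -b-3)-$(echo $c | cut -b-3)"
      for pos in A Z; do
        echo "iptables -t $t -D $c -p tcp --dport 80 -j LOG --log-prefix \"REQ-$pos-$tag \" 2>/dev/null"
        echo "iptables -t $t -D $c -p tcp --sport 80 -j LOG --log-prefix \"RES-$pos-$tag \" 2>/dev/null"
      done
    done
  done
}

# Show the first two commands as a preview.
log_cleanup_cmds | head -n 2
```

To apply it, run "log_cleanup_cmds | sh". Table and chain combinations that don't exist fail silently, just as they do when the rules are added.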

LVS/ipvs debugging

For detailed lvs/ipvs debugging, check whether your kernel is compiled with CONFIG_IP_VS_DEBUG enabled. If not, you can recompile the kernel after enabling it. Then set the debug level (the net.ipv4.vs.debug_level sysctl) to a suitable number, and tail the kernel log to see what's going on.
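
A minimal sketch of that check follows. It assumes the kernel configuration is available as /boot/config-<release>, which is the case on Debian-style systems but not everywhere. The function name is hypothetical.

```shell
#!/bin/sh
# Succeeds if the kernel config file $1 has CONFIG_IP_VS_DEBUG=y.
# Without an argument, the Debian-style default path is assumed.
ipvs_debug_enabled() {
  grep -q '^CONFIG_IP_VS_DEBUG=y' "${1:-/boot/config-$(uname -r)}" 2>/dev/null
}

if ipvs_debug_enabled; then
  echo "ipvs debugging is compiled in"
else
  echo "ipvs debugging is not compiled in (or no config file was found)"
fi
```

With debugging compiled in, the level can then be raised with something like "sysctl net.ipv4.vs.debug_level=3", where 3 is an arbitrary example value.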

ICMP redirects

If you leave ICMP redirects enabled, the LVS nodes will teach each other to send remote response packets directly back to the router, bypassing the required NETMAP detranslation. To avoid this, set the net.ipv4.conf.all.accept_redirects and net.ipv4.conf.all.send_redirects sysctls to 0, as the network configuration script does.

In a particularly long debug session, I couldn't figure out why packets were being sent directly back to the router even though redirects were disabled. Listing the route cache with "ip route show cache" showed that the route was flagged as 'redirected'. This turned out to be caused by a kernel change where the inet peer cache kept information about a previously learned redirect (from before I disabled them) and propagated it to the route cache. "ip route flush cache" was not enough; I had to reboot the node to clear the inet peer cache (or wait for it to time out, which could take a while).
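
To rule out the simple case first, the sketch below checks that both redirect sysctls are really 0 by reading them from /proc/sys. The function name is made up, and the optional argument that overrides the /proc/sys root exists only to make the function easy to try out.

```shell
#!/bin/sh
# Succeeds if accept_redirects and send_redirects are both 0 for
# "all". $1 optionally replaces the /proc/sys root directory.
redirects_disabled() {
  root=${1:-/proc/sys}
  for key in accept_redirects send_redirects; do
    [ "$(cat "$root/net/ipv4/conf/all/$key" 2>/dev/null)" = 0 ] || return 1
  done
}

if redirects_disabled; then
  echo "ICMP redirects are disabled"
else
  echo "ICMP redirects are still enabled"
fi
```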

Other

Before finding the 2.6.36 NAT and NETMAP modifications, I played around with Virtual Distributed Ethernet and the 2.6.33 feature that allows deleting or moving the local routing table preference, in order to loop packets out through a virtual switch and back in again. That got even messier.

Thanks

This approach would not be possible without all the recent patches by Patrick McHardy, and of course the years of ground work in netfilter and ipvs that it builds upon.