neveragain.de teletype

AWS: Networking: IPv6, the Mapping Service and the Lack of Multicast

2020-08-12

This post explores some lesser-known nuances of networking on AWS EC2: IPv6 behavior, the Mapping Service, and the complete lack of multicast.

Should We Disable IPv6 DAD?

I started looking into AWS networking details because of an interesting question on Twitter:

Can anyone imagine a situation where they would want IPv6 Duplicate Address Detection to be operational in EC2?

Several questions arose in that thread:

  1. Is disabling IPv6 Duplicate Address Detection (DAD) on EC2 by default a good idea?
  2. Is it even possible to “have” a duplicate address on EC2?
  3. How do SLAAC and DHCPv6 interact, anyway?
  4. How does enabling DAD seem to help with slow failovers?

We’ll happily ignore the core question of why ISC dhclient fails on FreeBSD.

Instead we’ll play around and explore the lack of multicast on AWS, with no specific goal.

Recap: “Ethernet”

Let’s do a quick recap of networking basics. I’m guessing that lots of EC2 users are aware of IP basics, but less so of the magic glue below that. Feel free to skip if you know the “Ethernet things”.

For all our practical purposes here, our network layer – the IP protocols – runs on top of the IEEE 802 data link layer, used for pretty much every relevant physical layer from classic cable-bound Ethernet to Wi-Fi.

IEEE 802 originated from Ethernet, which is designed for a shared medium, meaning any machine can talk to any machine. Originally, all machines were hooked up to the same cable.

IEEE 802 Media Access Control uses 48-bit addresses (MAC addresses for short), usually displayed in hex format, e.g. 66:72:6e:45:12:e1. MAC addresses are literally everywhere – on your fancy Watch, on your TV, on your WiFi router, on your EC2 instances, and probably a bunch of them in your car.

Broadcasting means sending to all machines in the network, accomplished by sending to MAC address ff:ff:ff:ff:ff:ff. Every machine on the local network listens to that address.

Multicasting means sending to a subset of machines: those that have subscribed to a specific group address. Group management is an IP-level concept; on plain Ethernet, multicast frames are flooded much like broadcasts, but they do use dedicated MAC address ranges (01:00:5e:... for IPv4 and 33:33:... for IPv6). For example, the IPv6 all-hosts address ff02::1 maps to the MAC address 33:33:00:00:00:01.

Recap: Address Resolution and IPv6 DAD

Alright, so IPv4 and IPv6 sit on top of that data link layer.

To send IP packets to another machine on the local network, MAC address resolution has to happen first.

To figure out which MAC address to send to, the sender broadcasts a question to all machines – Who has IP address XXX? – and the machine owning that address replies with its MAC address. Only then can the sender start sending IP packets to the receiver’s MAC address.

This process is basically the same for IPv4 (ARP, Address Resolution Protocol) and IPv6 (Neighbor Discovery); IPv6 just uses a special multicast address instead of broadcast.
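
For illustration, the IPv4 exchange looks roughly like this in tcpdump (a sketch with made-up addresses, output abridged):

[...] aa:bb:cc:00:00:01 > ff:ff:ff:ff:ff:ff, ARP, Request who-has 192.0.2.2 tell 192.0.2.1
[...] aa:bb:cc:00:00:02 > aa:bb:cc:00:00:01, ARP, Reply 192.0.2.2 is-at aa:bb:cc:00:00:02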

One of the features of IPv6 is mandatory Duplicate Address Detection (DAD). Whenever a new address is configured, an IPv6 machine (a node in IPv6 lingo) has to check if that address is already in use by another node.

It does so by using that same address resolution process: It asks this Who has? question for the IP address being configured – if some node responds to that, the address is obviously already in use.
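
Incidentally, if you do want to disable DAD (the question from the tweet), it’s just a tunable – a sketch, with sysctl names as on FreeBSD and Linux (eth0 being a placeholder interface):

# FreeBSD: number of DAD probes per new address; 0 disables DAD
sysctl net.inet6.ip6.dad_count=0
# Linux: per-interface equivalents
sysctl net.ipv6.conf.eth0.accept_dad=0
sysctl net.ipv6.conf.eth0.dad_transmits=0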

No broadcast, no multicast!

As we’ve seen, multicast / broadcast are essential for any IP communication to take place.

Here’s the kicker about AWS networking: There is no broadcasting. There is no multicasting.

I only recently discovered this myself. Offhand, I cannot even find a clear piece of official documentation on it; the VPC FAQ doesn’t mention it either.

What you can find is an excellent piece on the AWS APN blog: Amazon VPC for On-Premises Network Engineers:

Routing and switching: This is the fun part. The re:Invent session Another Day, Another Billion Packets, which I mentioned at the beginning of this blog post, covers this topic in depth.

There are a few deviations network engineers will want to know about routing and switching in a VPC:

All traffic is unicast (the Address Resolution Protocol, or ARP, is gone too). The VPC forwarding design knows where all IP addresses are, and proxies ARP responses for all ARP requests locally at the hypervisor.

All subnets in a VPC have direct access to one another, unless they are filtered in the host or in security groups or network ACLs.

The re:Invent session video is really interesting. It also explains how ARP (and Neighbor Discovery) requests are handled by a Mapping Service.

Let’s play

IPv6 relies on several multicast mechanisms, so let’s see what happens.

Multicast addresses

IPv6 defines several multicast addresses, e.g. ff02::1 for all hosts and ff02::2 for all routers. Let’s try those.

In a conventional network, this works as expected:

Pinging all hosts yields many replies:

root@pooch:~ # ping6 ff02::1%wlan0
PING6(56=40+8+8 bytes) fe80::76da:38ff:fe6a:d72d%wlan0 --> ff02::1%wlan0
16 bytes from fe80::7[...]%wlan0, icmp_seq=0 hlim=64 time=0.665 ms
16 bytes from fe80::6[...]%wlan0, icmp_seq=0 hlim=64 time=13.595 ms(DUP!)
16 bytes from fe80::2[...]%wlan0, icmp_seq=0 hlim=64 time=17.098 ms(DUP!)
16 bytes from fe80::1[...]%wlan0, icmp_seq=0 hlim=64 time=108.220 ms(DUP!)
16 bytes from fe80::1[...]%wlan0, icmp_seq=0 hlim=64 time=240.956 ms(DUP!)
16 bytes from fe80::8[...]%wlan0, icmp_seq=0 hlim=64 time=262.192 ms(DUP!)

When pinging all routers, the local router dutifully responds:

root@pooch:~ # ping6 ff02::2%wlan0
PING6(56=40+8+8 bytes) fe80::76da:38ff:fe6a:d72d%wlan0 --> ff02::2%wlan0
16 bytes from fe80::[...]%wlan0, icmp_seq=0 hlim=64 time=11.548 ms

On AWS:

Pinging all hosts on AWS yields nothing, except for the pinging node that responds to itself:

root@n0-1c:~ # ping6 ff02::1%xn0
PING6(56=40+8+8 bytes) fe80::8ca:5ff:fe9c:fc14%xn0 --> ff02::1%xn0
16 bytes from fe80::8ca:5ff:fe9c:fc14%xn0, icmp_seq=0 hlim=64 time=0.066 ms

And also nothing for all routers, where you might expect the AWS VPC router to respond:

root@n0-1c:~ # ping6 ff02::2%xn0
PING6(56=40+8+8 bytes) fe80::8ca:5ff:fe9c:fc14%xn0 --> ff02::2%xn0
^C
--- ff02::2%xn0 ping6 statistics ---
2 packets transmitted, 0 packets received, 100.0% packet loss

So indeed, there is no multicast traffic between nodes.

Router Solicitation

Interestingly, when we send a Router Solicitation to the all-routers address, the VPC router does reply, as seen in tcpdump:

root@n0-1c:~ # rtsol xn0
[...] fe80::8ca:5ff:fe9c:fc14 > ff02::2: ICMP6, router solicitation, length 16
[...] fe80::8a1:79ff:fedb:2106 > 2a05:d014:631:380c:7022:b490:22d9:95b8: ICMP6, router advertisement, length 56

… which leads me to believe that the VPC router does, in fact, receive packets on the all-routers address (and maybe others), but simply drops most request types. I haven’t dug into that, though.

Also noteworthy: In response to a Router Solicitation, the reply is sent to the assigned address (here: 2a05:d014:631:380c:7022:b490:22d9:95b8). That’s not even configured on this node (yet)!

The periodic unsolicited Router Advertisements go to the all-hosts multicast address instead: fe80::8a1:79ff:fedb:2106 > ff02::1: ICMP6, router advertisement, length 56.

Duplicate Address Detection

If there is no multicast, then there’s no way that DAD can work, right? Other nodes will never see the Neighbor Solicitation for the new IP address. Let’s confirm this.

First, we’ll configure some address on one node:

root@n1-1c:~ # ifconfig xn0 inet6 2001:db8::2/64

Then we’ll configure the same address on another node (in the same Subnet), with tcpdump running:

root@n0-1c:~ # tcpdump -enixn0 'ip6 and not icmp6[icmp6type]=icmp6-routeradvert' &
[1] 1126
root@n0-1c:~ # ifconfig xn0 inet6 2001:db8::2/64
[...] :: > ff02::1:ff00:2: ICMP6, neighbor solicitation, who has 2001:db8::2, length 32

… and no reply. So indeed, DAD is absolutely pointless on EC2.

Manually configuring IP addresses

But is it even possible to manually configure IP addresses? Well, of course you can configure them – but do they work? Only if they do could a duplicate address ever become a problem at all.

Spoiler: It does work! There are two hoops we need to jump through:

  1. Since the Mapping Service does not know of our secret little IP addresses, it will not resolve them. We need to tell our machines about the others’ MAC addresses manually. That’s annoying, but it’s possible. For ad-hoc configurations, we can just add them on the fly using ndp (on Linux: ip neighbor). For permanent configuration, there’s the ancient /etc/ethers.

  2. By default, the EC2 hypervisor only allows IP packets to be sent/received by instances when they match the addresses that the Mapping Service knows for them. In other words, it will block packets when it thinks those IP addresses are wrong. Fortunately, there are many scenarios where you need to do this, e.g. running your own proxy / router / NAT instance, so EC2 provides the option to Disable Src/Dest Check (for both the sending and receiving instances in our case); see the sketch after this list.

Also keep in mind that the Security Groups need to allow traffic, as always.
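
A sketch of both hoops (placeholder instance ID; the ip command is the Linux counterpart of ndp):

# 1) Static neighbor entry, Linux flavor (FreeBSD: ndp -s <ip> <mac>)
ip -6 neighbor add 2001:db8::1 lladdr 0a:7b:67:a0:20:6a dev eth0 nud permanent
# 2) Disable the Src/Dest Check via the AWS CLI
aws ec2 modify-instance-attribute --instance-id i-0123456789abcdef0 --no-source-dest-check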

So let’s configure an address on our first node, add some other nodes’ ether addresses and start pinging:

root@n0-1c:~ # ifconfig xn0 inet6 2001:db8::0/64
root@n0-1c:~ # ndp -s 2001:db8::1 0a:7b:67:a0:20:6a
root@n0-1c:~ # ndp -s 2001:db8::2 06:48:58:f5:f1:8a
root@n0-1c:~ # ping6 2001:db8::1
PING6(56=40+8+8 bytes) 2001:db8:: --> 2001:db8::1
[...] 2001:db8:: > 2001:db8::1: ICMP6, echo request, seq 0, length 16
[...] 2001:db8:: > 2001:db8::1: ICMP6, echo request, seq 1, length 16

… we’re sending pings to the correct MAC address, but no replies yet. We can even see those packets arriving in tcpdump on the target node, once we disable the Src/Dest Check.

We need to add the IP address as well as the MAC addresses (for the return packets!) on the second node:

root@n1-1c:~ # ifconfig xn0 inet6 2001:db8::1/64
root@n1-1c:~ # ndp -s 2001:db8::0 0a:ca:05:9c:fc:14
root@n1-1c:~ # ndp -s 2001:db8::2 06:48:58:f5:f1:8a

Et voilà! The first node successfully receives ping replies:

[...] 2001:db8:: > 2001:db8::1: ICMP6, echo request, seq 47, length 16
[...] 2001:db8::1 > 2001:db8::: ICMP6, echo reply, seq 47, length 16
16 bytes from 2001:db8::1, icmp_seq=47 hlim=64 time=0.352 ms

It’s always such a nice feeling when ping finally works…

… until you notice that something isn’t quite right yet. Let’s configure the third node, in a different Subnet:

root@n2-1b:~ # ifconfig xn0 inet6 2001:db8::2/64
root@n2-1b:~ # ndp -s 2001:db8::0 0a:ca:05:9c:fc:14
root@n2-1b:~ # ndp -s 2001:db8::1 0a:7b:67:a0:20:6a

Unfortunately, nothing:

root@n0-1c:~ # ping6 2001:db8::2
PING6(56=40+8+8 bytes) 2001:db8:: --> 2001:db8::2
[...]2001:db8:: > 2001:db8::2: ICMP6, echo request, seq 0, length 16
[...]2001:db8:: > 2001:db8::2: ICMP6, echo request, seq 1, length 16

So it seems that another reason different Subnets get different IP ranges is that you cannot have “direct” connections between machines in different Subnets (for non-local destination IPs, a node would send to a router instead – in this case our VPC router). Interesting! By the way, it makes no difference whether those Subnets are in the same Availability Zone or not.

Either way: Do not do this. Do not manually configure MAC addresses. It might be years from now, but someone will stab you if you do this. It’s like using /etc/hosts, but a hundred times nastier.

The Mapping Service, which translates your IP address to the correct MAC address, works for global IPv6 addresses (e.g. 2a05:...) just as it does for IPv4 – it responds only to addresses it knows about, i.e. addresses that have been configured in EC2.

But here’s a cool thing: The Mapping Service will respond to any valid link-local address (as long as it’s EUI-64 based).

The link-local address (e.g. fe80::...) is usually constructed with an Interface Identifier based on the MAC address. That means the MAC address for a given IPv6 link-local address can be easily calculated.
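
In fact, it’s little more than bit-twiddling – a sketch in plain sh (flip the universal/local bit 0x02 in the first octet, insert ff:fe in the middle):

mac="0a:ca:05:9c:fc:14"
set -- $(echo "$mac" | tr ':' ' ')
printf 'fe80::%x:%x:%x:%x\n' \
    $(( (0x$1 ^ 0x02) << 8 | 0x$2 )) \
    $(( 0x$3 << 8 | 0xff )) \
    $(( 0xfe00 | 0x$4 )) \
    $(( 0x$5 << 8 | 0x$6 ))
# -> fe80::8ca:5ff:fe9c:fc14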

Let’s use ndisc6 to generate some Neighbor Solicitations for made-up IP addresses:

root@n0-1c:~ # ndisc6 -q fe80::8ca:5ff:fe9c:fc02 xn0
0A:CA:05:9C:FC:02
root@n0-1c:~ # ndisc6 -q fe80::8ca:5ff:fe9c:fc03 xn0
0A:CA:05:9C:FC:03
root@n0-1c:~ # ndisc6 -q fe80::8ca:5ff:fe9c:fc04 xn0
0A:CA:05:9C:FC:04

As we can see, the Mapping Service generates the right MAC address for each address.

You may have noticed that this could cause DAD to malfunction: if the Mapping Service responded to a node’s own address, that address would appear to be already in use. Luckily, the Mapping Service does not respond for the sender’s own address:

root@n0-1c:~ # ndisc6 fe80::8ca:5ff:fe9c:fc14 xn0
Soliciting fe80::8ca:5ff:fe9c:fc14 (fe80::8ca:5ff:fe9c:fc14) on xn0...
Timed out.
Timed out.
Timed out.
No response.

Implications

If there’s only unicast traffic, some old friends stop working – it’s mostly clustering and failover solutions that use multicast and broadcast to discover their peers. The Virtual Router Redundancy Protocol (VRRP) is a popular one – not only for routers, but for sets (usually pairs) of redundant systems as well, e.g. load balancers or reverse proxies.

None of that works here.

Alternatives for Failover

Some software might support unicast addressing instead. For example, you can run keepalived with unicast_peer entries (see this example).
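
A minimal sketch of such a configuration (hypothetical addresses and interface; see keepalived.conf(5) for the details):

vrrp_instance VI_1 {
    state BACKUP
    interface eth0
    virtual_router_id 51
    priority 100
    unicast_src_ip 10.0.1.10       # this node
    unicast_peer {
        10.0.1.11                  # the other node(s)
    }
    virtual_ipaddress {
        10.0.1.100/32
    }
}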

You could also hack something up to re-assign an Elastic IP to the active instance.
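
That could be as small as the standby node calling the AWS CLI when it takes over (placeholder IDs):

aws ec2 associate-address --allocation-id eipalloc-0123456789abcdef0 \
    --instance-id i-0123456789abcdef0 --allow-reassociation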

Another interesting idea is to configure a dedicated Failover Virtual IP, using the AWS API to control a Route Table entry.
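
A sketch of that variant, pointing a hypothetical VIP (10.0.0.100/32) at the new active instance (placeholder IDs again):

aws ec2 replace-route --route-table-id rtb-0123456789abcdef0 \
    --destination-cidr-block 10.0.0.100/32 \
    --instance-id i-0123456789abcdef0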

Loose Ends

Open questions:

Interaction of SLAAC and DHCPv6

DHCPv6 might work if used alone – but usually SLAAC (Stateless Address Autoconfiguration) signals the presence of further configuration services. SLAAC is required per RFC.

On AWS, SLAAC is used with the Router Advertisement flag “Managed”, which tells hosts to retrieve their address and other configuration via DHCPv6. The Router Advertisement is also how hosts learn which default router to use.
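
You can watch this yourself: capture the Router Advertisements (ICMPv6 type 134) and look for the “managed” flag in tcpdump’s verbose decode – a sketch:

root@n0-1c:~ # tcpdump -vni xn0 'icmp6 and ip6[40] == 134'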

How Does DAD Help with Faster Failover?

I have no idea. The tweet mentions “certain switches/routers”, so I’ll assume the experience was not on AWS. Outside AWS I can at least remotely imagine that DAD causes recipients (or switches along the way) to refresh their MAC address tables – but that’s a stretch.

As we’ve seen, this cannot make a difference on AWS.

But What’s the Issue with dhclient?

I can reproduce the described effect. Sometimes.

It’s quite weird.

So I removed the isc-dhcp44-client package and installed the dhcpcd package instead. Then I changed /etc/rc.conf to 1) remove the ifconfig_DEFAULT line (we don’t want the system to start dhclient), 2) remove the dual-dhclient hack, and 3) add dhcpcd_enable=YES. That’s the dhcpcd by Roy Marples, as suggested in one of the replies. It takes good care of both IPv4 and IPv6.

And with dhcpcd, getting the proper address via DHCPv6 just works. Always.

It’s not an answer, but it is a solution.

You’d think the story ends here. But oh no! After setting up dhcpcd, suddenly the nodes can’t ping each others’ link-local addresses anymore. What happened?

After some thorough head-scratching, I noticed that the link-local addresses are different now:

	inet6 fe80::5562:63fd:691b:3c92%xn0 prefixlen 64 scopeid 0x2

This is not an EUI-64 address (those are easily identified by the ff:fe in the middle).

Usually this wouldn’t be a problem, but on AWS, as we’ve learned, it doesn’t work: the Mapping Service does not know how to handle those. It can only answer for EUI-64 based addresses, so our Neighbor Solicitations for any other address go unanswered.

As it turns out, dhcpcd is configured by default to generate a private address, i.e. one that does not “leak” the system’s MAC address. I’d argue this makes sense for globally routed addresses, but not for link-local addresses. The culprit is slaac private in /usr/local/etc/dhcpcd.conf. After removing this option, everything works as expected.
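
Concretely, the change in /usr/local/etc/dhcpcd.conf boils down to this (slaac hwaddr is dhcpcd’s explicit spelling for EUI-64 based addresses, should you prefer stating it over relying on the default):

#slaac private
slaac hwaddr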

You’d think the story ends here. But oh no!

DHCPv6 was now correctly assigning the global address (e.g. 2a05:...). What I hadn’t tested was whether the nodes could ping each other using those addresses (I had only tested the link-local addresses).

Turns out, they can’t:

root@n0-1c:~ # ping6 2a05:d014:631:380c:8114:7e93:4c0f:4488
PING6(56=40+8+8 bytes) 2a05:d014:631:380c:7022:b490:22d9:95b8 --> 2a05:d014:631:380c:8114:7e93:4c0f:4488
ping6: sendmsg: No buffer space available
ping6: wrote 2a05:d014:631:380c:8114:7e93:4c0f:4488 16 chars, ret=-1

Excuse me…?!

As it turns out, DHCPv6, by design, really just assigns single addresses, without a prefix length. So the address gets configured as a /128, and the system doesn’t know anything about the local /64 that’s supposed to be there.

This is another function of SLAAC / Router Advertisements: It tells nodes which networks (prefixes) are on-link. The nodes then install a route entry, so they know this prefix is directly reachable.
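
You can check for that route entry on FreeBSD (a sketch; the prefix is the one from the pings above):

root@n0-1c:~ # netstat -rn -f inet6 | grep '2a05:d014:631:380c::/64'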

What I had missed when removing the ifconfig_DEFAULT line from rc.conf is that it also contained accept_rtadv. Without this interface flag, the kernel will not process Router Advertisements and will not learn about the on-link prefixes.

I’d thought that re-adding ifconfig_DEFAULT="inet6 accept_rtadv" would be enough, but for whatever reason, this doesn’t take effect. That flag is not activated.
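
For completeness, this is how the flag can be checked (and set by hand) on FreeBSD – a sketch:

root@n0-1c:~ # ifconfig xn0 | grep nd6      # ACCEPT_RTADV should appear in the nd6 options
root@n0-1c:~ # ifconfig xn0 inet6 accept_rtadv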

So instead I returned to using dhclient_program:

#dhcpcd_enable=NO
dhclient_program="/usr/local/sbin/dhcpcd"
ifconfig_DEFAULT="SYNCDHCP inet6 accept_rtadv"

NOW everything works!

… except for detaching an additional NIC, but I’m done for today.

Conclusion

We’ve learned a few things:

  1. There is no broadcast and no multicast in a VPC; the Mapping Service answers ARP and Neighbor Discovery requests centrally.
  2. IPv6 DAD cannot work on EC2, so disabling it costs nothing.
  3. Manually configured addresses can be made to work (static neighbor entries, Src/Dest Check disabled), but you really shouldn’t.
  4. The Mapping Service answers for any EUI-64 based link-local address, but only for configured global addresses.
  5. For failover, use unicast peers or the AWS API instead of multicast-based protocols like VRRP.

You’re still here? Wow. Next time I’ll make it two parts. Pinky swear!

