A tale of two route controllers

Ever since I built a Linux 3.2.0 kernel for my Debian Stable laptop to take advantage of some of the newer kernel features, I have been experiencing occasional short periods of disconnect/reconnect on the Wi-Fi network.

This wasn’t happening heaps (maybe a couple times a day), but it was starting to get annoying, so I decided to sort it out properly and do a kernel driver and microcode update for my Intel Centrino Wireless-N 1000 card.

The firmware/microcode update was easy enough, simply a case of downloading the latest code from Intel and installing into /lib/firmware/ – the kernel driver does the rest, finding it and loading it into the Wi-Fi card at boot time.

Next step was building a new kernel for my machine, I went through and tuned the module selection very carefully tossing out all the hardware my laptop will never use, as I was getting sick of wasting lots of disk space on the billion+ device modules in Linux these days.

After finding that my initial kernel lacked support for my video card (turns out the Lenovo X201i laptops still use AGP-based i915 cards, I was assuming PCIe) I got a working kernel up and running.

Except that my Wi-Fi stability problem was worse than ever, instead of losing connectivity every few hours, it was now doing so ever few minutes. :-(

The logs weren’t particularly helpful – NetworkManager likes to give reason numbers but I couldn’t easily find a documented explanation of these (but maybe I’m looking in the wrong place).

19:44:36 NetworkManager[1650]: <info> (wlan0): device state change: 8 -> 9 (reason 5)
19:44:36 NetworkManager[1650]: <warn> Activation (wlan0) failed for access point (b201)
19:44:36 NetworkManager[1650]: <warn> Activation (wlan0) failed.
19:44:36 NetworkManager[1650]: <info> (wlan0): device state change: 9 -> 3 (reason 0)
19:44:36 NetworkManager[1650]: <info> (wlan0): deactivating device (reason: 0).
19:44:36 NetworkManager[1650]: <info> (wlan0): canceled DHCP transaction, DHCP client pid 3354
19:44:36 kernel: [  391.070772] wlan0: deauthenticating from 00:0c:42:67:8b:bc by local choice (reason=3)
19:44:36 kernel: [  391.185461] wlan0: moving STA 00:0c:42:67:8b:bc to state 2
19:44:36 kernel: [  391.185466] wlan0: moving STA 00:0c:42:67:8b:bc to state 1
19:44:36 kernel: [  391.185470] wlan0: moving STA 00:0c:42:67:8b:bc to state 0
19:44:36 wpa_supplicant[1682]: CTRL-EVENT-DISCONNECTED - Disconnect event - remove keys
19:44:36 NetworkManager[1650]: <error> [1337240676.376011] [nm-system.c:1229] check_one_route(): (wlan0): \
         error -34 returned from rtnl_route_del(): Netlink Error (errno = Numerical result out of range)
19:44:36 kernel: [  391.233344] cfg80211: Calling CRDA to update world regulatory domain
19:44:36 avahi-daemon[1633]: Withdrawing address record for 192.168.1.11 on wlan0.
19:44:36 avahi-daemon[1633]: Leaving mDNS multicast group on interface wlan0.IPv4 with address 192.168.1.11.
19:44:36 avahi-daemon[1633]: Interface wlan0.IPv4 no longer relevant for mDNS.
19:44:36 avahi-daemon[1633]: Withdrawing address record for 2407:1000:1003:99:226:c7ff:fe66:b822 on wlan0.
19:44:36 avahi-daemon[1633]: Leaving mDNS multicast group on interface wlan0.IPv6 with address 2407:1000:1003:99:226:c7ff:fe66:b822.
19:44:36 NetworkManager[1650]: <info> (wlan0): writing resolv.conf to /sbin/resolvconf
19:44:36 avahi-daemon[1633]: Joining mDNS multicast group on interface wlan0.IPv6 with address fe80::226:c7ff:fe66:b822.
19:44:36 avahi-daemon[1633]: Registering new address record for fe80::226:c7ff:fe66:b822 on wlan0.*.

So I proceeded to debug:

Cursed and wished my 300m spool of Cat6 ethernet wasn’t in Wellington.
Rolled back the microcode update – my initial thought was that the new code was making the card unstable and the result was the card dropping the connection and NetworkManager doing the clean up.
Did a full power down to make sure that the microcode wasn’t remaining active on the card across reboots (had this problem once with a dodgy GPU once).
Verdict: Microcode upgrade was OK, must be something else.
Upgraded NetworkManager from 0.8.1 to 0.8.4 from Debian Backports – 0.8.1 isn’t too recent, was tempted to try 0.9 series but would have required a lot more backporting work.
Verdict: Appears not to be a NetworkManager issue in the 0.8 series – maybe something fixed in 0.9 or later?
Upgraded wpasupplicant from 0.6.10 to 1.0 by manual backport from unstable – the activation error made me consider it might have been a bug with newer kernels & wpasupplicant’s AP negotiation.
Verdict: No change to the issue.
Built a Linux 3.3 kernel with the older less-crashy 3.2 iwlwifi driver to see if it was driver specific, or otherwise-kernel related.
Verdict: Same issue continued to occur, rolling back driver version infact made no change – something about the 3.3 kernel itself was the problem.
Got suspicious about NetworkManager – either it or the kernel had to be at fault, one possibility was some weird API breakage with the age gap between the software versions being used. The kernel is *usually* pretty solid and something like wifi drivers dropping every couple of minutes would be a pretty serious bug to get through, so I looked through the logs to see if I could get anything more useful with NetworkManager’s logs.
Spotted a kernel error “ICMPv6 RA: ndisc_router_discovery() failed to add default route.“. This error tended to occur shortly before any WiFi disconnection occurred, but not immediately so.
Found an entry in Red Hat’s bugzilla.
And then the upstream bug fix from 19th April.

Turns out that the Linux 3.3 kernel and NetworkManager fight over which one is going to control the default route for each router advertised link – the kernel adds one, Network Manager removes and then the kernel gets upset and drops all router advertisements.

In hindsight, I should have spotted it sooner, but I had discarded the RA statement from being related initially as the disconnection often didn’t happen till a minute of two after the log entry occurred – eg:

19:51:40 kernel: [  814.274903] ICMPv6 RA: ndisc_router_discovery() failed to add default route.
19:52:47 NetworkManager[1650]: <info> (wlan0): device state change: 8 -> 9 (reason 5)

What’s interesting about this bug, is that at first reading it explains a loss of IPv6 connectivity perfectly – however it doesn’t explain why IPv4 or the Wi-Fi connection itself was impacted.

The reason this happened, is that NetworkManager was set to have IPv6 as a requirement for that connection to be established – in the event of IPv6 not working, NetworkManager would consider the interface to be down, even if IPv4 was up.

There is a good reason for this, that the developers detailed on their (excellently written) blog, explaining that by having NetworkManager check for IPv6, it allows applications to be written smarter to better understand their level of connectivity.

For users of the NetworkManager 0.9 series, there’s a patch already committed which you can grab here and I would expect the next NetworkManager update will have this fix.

If you’re on the NetworkManager 0.8 series, this patch won’t apply cleanly – I might make some time to go and backport it, but you can workaround it for now by using the Ignore method so that NetworkManager does nothing and leaves it up to the Linux kernel in the background to negotiate IPv6 addressing.

Breaking vs Working Network Manager Settings

Of course if you’re not connecting to any IPv6 capable networks, you don’t have anything to worry about (other than the fact you’re still stuck in the 20th century).

Initially I was a bit annoyed at NetworkManager for being so silly as to drop the whole interface when just one of the two networking stacks was broken, however after thinking about it for a bit, it does make some sense as to why it chose that behavior – often most interface issues can be fixed by reconnecting – maybe the AP got rebooted, maybe the laptop just moved between two of them, etc – a reconnect can solve many of these.

But a smarter approach, would be to determine whether network issues are layer 2 or layer 3 – if it’s just a layer 3 issue, then there’s little need to drop the Wi-Fi connection itself, instead attempt to re-establish IPv4 or IPv6 connectivity where appropriate, and if unable to do so, use the notifications to tell the user that “IPv6 connectivity is experiencing a problem, some hosts and services may be unreachable”.

It’s actually something that Windows does semi-OK – it figures out roughly how borked a user’s connection is and then does a balloon popup stating that there’s limited connectivity or IP conflict, or some other sometimes helpful message.

This may be better in newer versions of NetworkManager, I’ll have to have a play with a more recent release and see.