
Automatically restarting GNU/Linux hosts upon hung storage

We recently experienced a weird and frustrating problem with storage reliability on our RabbitMQ cluster, which runs on AWS c5-series EC2 instances with EBS storage and Ubuntu 16.04 LTS. The sequence of events goes like this:

  1. One of the three instances in our RabbitMQ cluster experiences an issue that leaves it unable to persist anything to disk, on any mounted volume on the instance.
  2. When this happens, the instance is *supposed* to remove itself from the cluster as an unhealthy member and have the remaining two instances take over all responsibilities with zero downtime to operational systems.
  3. Sadly, for some unknown reason, the way this issue impacts RabbitMQ does not result in the instance being evicted from the cluster. Instead, it remains in the cluster exchanging healthy status messages with other members, but (and this is the critical bit) it manages to then jam up queues across the entire cluster, bringing down the two healthy instances along with the unhealthy one.
  4. Operations (me) gets paged to solve a critical outage on the platform that’s going to impact customers.

The problem is super weird in that it occurs somewhat randomly – no obvious correlation to load or time of day – but it does tend to happen after the instance has been running for at least a few weeks. It also occurs on any of the three RabbitMQ instances, so it’s not something specifically weird about any one instance in the fleet.

The one thing we do know is that the issue is storage related. Firstly, nothing is persisted in the logs (RabbitMQ or system/kernel) from the time the issue occurs, and secondly, we can see a large spike in disk I/O wait time in our Datadog monitoring for the instance, showing that processes on the instance are stuck waiting for the disk to respond.

Why RabbitMQ is impacted in this manner is unclear. It makes sense that the cluster quorum and status negotiation wouldn’t require a working disk to keep running, but in every test we ran where we deliberately broke the storage, the RabbitMQ process correctly detected something was wrong on the host and went into an unhealthy state, removing itself from the cluster. We tried ripping out EBS volumes whilst still mounted, corrupting with dd, force unmounting, etc… nothing could trigger the exact same behaviour.
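One thing worth trying, if anyone wants to experiment further, is simulating I/O that blocks indefinitely rather than erroring out – for example by suspending the device with device-mapper. This is only a sketch using a hypothetical LVM volume name, not one of the tests above:

# Suspend the device: all I/O to it blocks (rather than erroring), which should
# eventually trip the kernel's 120 second hung task detection for writing processes
dmsetup suspend /dev/mapper/vg0-rabbitmq

# Resume once done testing
dmsetup resume /dev/mapper/vg0-rabbitmq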

Reviewing what differs in production was difficult since the failure doesn’t persist any of the kernel or RabbitMQ logs, however we did manage to extract some information from the AWS instance console for one of the impacted systems before we restarted it:

Ubuntu 16.04.4 LTS localhost ttyS0
[349442.682614] Not tainted 4.4.0-1062-aws #71-Ubuntu
[349442.684363] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[349442.686890] INFO: myprocess:1956 blocked for more than 120 seconds.

Essentially the Linux kernel is logging a number of different processes (basically everything on the box that does anything) as being blocked for over 120 seconds, because the storage has failed and the kernel can’t do anything to unblock those processes.

Given we have been unable to identify the exact fault or reproduce the behaviour (could be something in the Linux Kernel, could be something in AWS c5 or EBS…), we needed a solution that would at least help us by terminating any instance that experienced this storage issue.

The solution is helpfully identified by the kernel log lines above. We can use the hung task panic feature in the Linux kernel to force a host to immediately reboot itself if processes are hung for too long.

We do this using two different sysctl configuration changes (note – you need to set these up in /etc/sysctl.conf to survive reboots):

# Panic if a hung task was found
sysctl kernel.hung_task_panic=1

# Reboot 5 seconds after panic
sysctl kernel.panic=5

The first instructs the kernel to panic if a hung task (any task blocked for more than 120 seconds) occurs. The second instructs it to reboot shortly after it panics. We set it to 5 seconds to give any logging about the hung task time to persist or be delivered before the reboot, although in this particular situation, with all storage being busted, it’s of very limited benefit.
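To survive a reboot, the same settings go into /etc/sysctl.conf (or a drop-in under /etc/sysctl.d/) and get reloaded with sysctl -p. A minimal sketch – the file name here is just my own choice, and the 120-second threshold itself can also be tuned via kernel.hung_task_timeout_secs if the default doesn’t suit:

# /etc/sysctl.d/60-hung-task.conf (any file name under sysctl.d works)
kernel.hung_task_panic = 1
kernel.panic = 5

# Uncomment to change the hung task threshold from the default 120 seconds
#kernel.hung_task_timeout_secs = 120

# Apply immediately without a reboot
sysctl -p /etc/sysctl.d/60-hung-task.conf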

This has been in place for several months now and is working beautifully. Every so often an instance experiences this fault and instead of causing any disruption, it is quickly self-terminated and replaced. Because it terminates completely, the RabbitMQ cluster negotiation is successfully able to re-assign responsibilities to the other instances in the cluster.

In theory there is a 2-minute window where the unhealthy instance is still running; however, reviewing the production metrics, it appears that when the fault occurs RabbitMQ doesn’t immediately break – sometimes it continues to run for 15 minutes or more before jamming up the cluster. So having it run for 2 minutes has turned out to be just fine.

Ideally, going forwards, we need to set up a network logging endpoint for these hosts to see if we can capture anything like a stack trace from a kernel driver. It seems likely that it’s a Linux kernel bug rather than an AWS EBS bug, since the issue is resolved with a soft reboot rather than a forced stop-start of the instance via the AWS API, meaning the instance comes back on the same hypervisor host. But until then, these kernel configuration parameters mean we are not going to disrupt operational services when the fault does occur.
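One low-effort option for that network logging would be netconsole, which sends kernel messages over UDP and so doesn’t depend on the local disk at all. A rough sketch – the addresses, interface and receiving host below are purely examples, not our actual setup:

# Send kernel messages over UDP to a remote syslog/netcat listener
# Format: netconsole=<src-port>@<src-ip>/<interface>,<dst-port>@<dst-ip>/<dst-mac>
modprobe netconsole netconsole=6666@10.0.0.10/eth0,514@10.0.0.1/aa:bb:cc:dd:ee:ff

# Load it at boot as well
echo "netconsole" >> /etc/modules
echo "options netconsole netconsole=6666@10.0.0.10/eth0,514@10.0.0.1/aa:bb:cc:dd:ee:ff" > /etc/modprobe.d/netconsole.conf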

Resize ext4 on RHEL/CentOS/EL

I use an ext4 filesystem for a couple large volumes on my virtual machines. I know that btrfs and ZFS are the new cool kids in town for filesystems, but when I’m already running RAID and LVM on the VM server underneath the guest, these newer filesystems with their fancy features don’t add a lot of value for me, so a good, simple, traditional, reliable filesystem is just what I want.

There’s actually a bit of confusion and misinformation online about using ext4 filesystems with RHEL/CentOS (EL) 5 & 6 systems, so I just wanted to clarify some things for people.

ext4 uses the same tools as ext3 and ext2 before it – this means still using the e2fsprogs package which provides utilities such as e2fsck and resize2fs, which handle ext2, ext3 and ext4 filesystems with the same command. But there are a few catches with EL…

On an EL 5 system, the version of e2fsprogs is too old and doesn’t support ext4. Therefore there is an additional package called “e4fsprogs” which is actually just a newer version of e2fsprogs that provides renamed tools such as “resize4fs”. This tends to get confusing for users who then try to find these tools on EL 6 or other distributions, so it’s important to be aware that it’s a non-standard thing. Why they didn’t just upgrade the tools or backport the ext4 support is a bit of a mystery to me, especially considering these tools are extremely backwards compatible and should meet even Red Hat’s API compatibility requirements, but it is what it is, sadly.
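In practice, resizing an ext4 filesystem on EL 5 therefore looks something like this (hedged – I’m going from memory of the renamed tools, so check the e4fsprogs package contents on your system):

# EL 5 only - install the separately packaged ext4-capable tools
yum install e4fsprogs

# Then use the renamed "4" variants of the usual commands
e4fsck -f /dev/volume
resize4fs /dev/volume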

On an EL 6 system, the version of e2fsprogs is modern enough to support ext4, which means you need to use the normal “resize2fs” command to do an ext4 filesystem resize (as per the docs). Generally this works fine, but I think I may have found a bug with the stock EL 6 kernel and e2fsprogs tools.

One of my filesystems was almost 100% full, so I expanded the underlying block volume and then ran an offline e2fsck -f /dev/volume followed by resize2fs /dev/volume, which completed as normal without error, however it left the filesystem size unchanged. Repeated checks and resize attempts made no difference. The block volume correctly reported its new size, so it wasn’t a case of the kernel not having seen the change.

However, by mounting the ext4 filesystem and doing an online resize, the resize worked correctly, although quite slowly, as the online resize seems to trickle-resize the filesystem, gradually increasing its size until finally complete.
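For reference, the two approaches look roughly like this (using a hypothetical /dev/vg0/data volume mounted at /mnt/data):

# Offline resize (filesystem unmounted) - the approach that left the size unchanged for me
umount /mnt/data
e2fsck -f /dev/vg0/data
resize2fs /dev/vg0/data

# Online resize (filesystem mounted) - slower, but worked correctly
mount /dev/vg0/data /mnt/data
resize2fs /dev/vg0/data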

This inability to resize offline is not something I’ve come across before, so it may be a rare bug triggered by the filesystem being nearly full, and could well be fixed in newer kernels/e2fsprogs packages, but I figured I’d mention it for any other poor sysadmins scratching their heads over a similar issue.

Finally, be aware that you need to use either no partition table or GPT if you want filesystems over 2TB, and also that the e2fsprogs package shipped with EL 6 has a 16TB limit, so you’ll have to upgrade the package manually if you need more than that.
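For example, preparing a new volume larger than 2TB can be done either way (hypothetical /dev/vdb device):

# Option 1: no partition table at all - format the whole block device
mkfs.ext4 /dev/vdb

# Option 2: GPT partition table (the old msdos/MBR format tops out at 2TB)
parted -s /dev/vdb mklabel gpt
parted -s /dev/vdb mkpart primary ext4 0% 100%
mkfs.ext4 /dev/vdb1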

CIFS, IPv6 and RHEL 5

Unfortunately, with my recent project enabling IPv6 across my entire personal server environment, I’ve bumped into a number of annoying issues – nothing that isn’t fixable, but things that are generally frustrating and just shouldn’t be issues.

Particular thanks goes to my many RHEL/CentOS 5 virtual machines, which lack some pretty key stuff such as:

  • IPv6 connection tracking, which prevents the ESTABLISHED,RELATED ip6tables rules from working.
  • Unexpected behavior of certain bootscript configuration options.
  • Lack of IPv6 support with CIFS (Samba/SMB) share mounting.
  • Some weirdness with Dovecot I still need to resolve.

(Personally, based on the number of headaches I’ve found with RHEL 5, my recommendation is to accelerate any plans to upgrade to RHEL 6 – or some other distribution – before deploying IPv6 in production.)

At the moment, CIFS IPv6 support on RHEL 5 & 6 has been causing me the most pain. My internal file server is dual stacked and has both A and AAAA DNS records – it’s a stock-standard CentOS 6 box running distribution-shipped Samba packages. It works perfectly from the server side, and modern IPv6 hosts have no issue mounting the shares via IPv6.

Very typical dual stack configuration:

# host fileserver.example.com 
fileserver.example.com has address 192.168.0.10
fileserver.example.com has IPv6 address 2001:0DB8::10

However, when I run the following legitimate and syntactically correct command on a RHEL 5 host to mount the CIFS share provided by the Samba server, it breaks with an error message typical of incorrect mount options:

# mount -t cifs //fileserver.example.com/tmp /mnt/tmpshare -o user=nobody
mount: wrong fs type, bad option, bad superblock on //fileserver.example.com/tmp,
       missing codepage or other error
       In some cases useful info is found in syslog - try
       dmesg | tail  or so

Taking a look at the kernel log, it shows a non-descriptive error:

kernel:  CIFS VFS: cifs_mount failed w/return code = -22

This isn’t particularly helpful, made more infuriating by the fact that I know the command syntax is correct and should be working perfectly fine.

Seeing as a number of things broke after switching on IPv6 across the entire network, I’ve become even more of a cynical bastard and ran some tests using explicitly specified IPv6 and IPv4 addresses in the mount command.

I found that by passing the IPv6 address instead of the DNS name, you can produce a further error message which offers some insight:

kernel: CIFS: ip address too long

Huh. Looks like a textbook IPv6 support bug to me. (Even I have made this mistake in some older generation web apps that didn’t foresee long 128-bit addresses.)

In testing, I found that the following commands are all acceptable on a dual-stack network with a RHEL 5 host:

# mount -t cifs //192.168.0.10/tmp /mnt/tmpshare -o user=nobody
# mount -t cifs //fileserver.example.com/tmp /mnt/tmpshare -o user=nobody,ip=192.168.0.10

However, all ways of specifying IPv6 fail, as does plain DNS resolution:

# mount -t cifs //2001:0DB8::10/tmp /mnt/tmpshare -o user=nobody
# mount -t cifs //fileserver.example.com/tmp /mnt/tmpshare -o user=nobody,ip=2001:0DB8::10
# mount -t cifs //fileserver.example.com/tmp /mnt/tmpshare -o user=nobody

No method of connecting via IPv6 would work, leaving stock RHEL 5 hosts able to use CIFS shares only via IPv4. :-(

Unfortunately this error is due to a known kernel bug in 2.6.18, which was fixed in 2.6.31, but sadly not backported to RHEL 5’s kernel (as of version 2.6.18-308.8.1.el5 anyway), leaving RHEL 5 users in a position where the stock OS is unable to mount CIFS shares on an IPv6 or dual-stacked network. :-(

The ideal solution would be to patch the kernel to resolve the issue – and in fact, if you are running a native IPv6-only (not dual-stacked) network, it would be the only way to get a working setup.

However, if you’re using RHEL, custom kernels typically aren’t popular due to the impact they have on vendor supportability of the platform and the added headaches of tracking and applying security updates, so another approach is needed.

The following methods will all work for stock RHEL/CentOS 5:

  • Use the ip=X mount option to override DNS.
  • Add an entry to /etc/hosts.
  • Have a separate DNS entry that only has an A record for your file servers (e.g. //fileserverv4only.example.com/).
  • Disable IPv6 entirely (and suffer the scorn of your cooler IPv6 enabled friends).

These solutions all suck – manually fixed IPs aren’t great for long-term supportability, extra DNS records are an additional management pain, and let’s not even begin to cover why disabling IPv6 entirely is wrong.
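For completeness, the /etc/hosts stop-gap (using the example addresses above) is just:

# Pin the name to its IPv4 address so name resolution never returns the AAAA record
echo "192.168.0.10 fileserver.example.com" >> /etc/hosts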

Of course RHEL 5 is a little outdated now, so I took a look at how RHEL 6 fared. On the plus side, it *can* mount IPv6 shares – all of the following mount commands are accepted without fault:

# mount -t cifs //192.168.0.10/tmp /mnt/tmpshare -o user=nobody
# mount -t cifs //2001:0DB8::10/tmp /mnt/tmpshare -o user=nobody
# mount -t cifs //fileserver.example.com/tmp /mnt/tmpshare -o user=nobody,ip=192.168.0.10
# mount -t cifs //fileserver.example.com/tmp /mnt/tmpshare -o user=nobody,ip=2001:0DB8::10

However, mounting an IPv6 server using its DNS name will still fail, just as it did on RHEL 5:

# mount -t cifs //fileserver.example.com/tmp /mnt/tmpshare -o user=nobody

The solution is to install the “cifs-utils” package, which provides the /sbin/mount.cifs binary offering smarter handling of shares – once installed, all the mount commands above work fine on RHEL 6, including the standard DNS-based one we all know and love. :-D
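On a stock EL 6 box that’s just:

# Install the mount.cifs helper, after which DNS-based IPv6 mounts work as expected
yum install cifs-utils
mount -t cifs //fileserver.example.com/tmp /mnt/tmpshare -o user=nobody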

I had always assumed that all Linux systems that could mount CIFS shares had the /sbin/mount.cifs binary installed, but it seems that’s not the case – the standard /bin/mount command can handle mounting CIFS using just the kernel mount() function.

However, when /bin/mount detects a /sbin/mount.FILESYSTEM binary, it will call that helper instead of calling the kernel mount() directly. These helpers can apply additional logic and handling to the mount command before passing it through to the Linux kernel.

For example, the following strace from a RHEL 5 host shows that /bin/mount checks for the existence of /sbin/mount.cifs and, not finding it, goes on to call the Linux kernel mount() directly with the provided arguments:

stat64("/sbin/mount.cifs", 0xbfc9dd20)  = -1 ENOENT (No such file or directory)
...
mount("//fileserver.example.com/tmp", "/mnt", "cifs", MS_MGC_VAL, "user=nobody,password=nobody") = -1 EINVAL (Invalid argument)

But a RHEL 6 host with cifs-utils installed provides /sbin/mount.cifs, which appears to do its own name resolution, then establishes a connection to both the IPv4 and IPv6 sockets before deciding which to use and instructing the kernel using the ip=X parameter.

stat64("/sbin/mount.cifs", {st_mode=S_IFREG|0755, st_size=29376, ...}) = 0
clone(Process 1666 attached
...
[pid  1666] mount("//fileserver.example.com/tmp/", ".", "cifs", 0, "ip=2001:0DB8::10,user=nobody,password=nobody") = 0

So I had an idea… what if I could make a version of cifs-utils for RHEL 5 dual-stack servers that only ever resolves DNS queries to IPv4 addresses, to work around the kernel issue? :-D

Turns out you can – effectively I made the nastiest hack ever by simply tearing out the IPv6 name resolver. :-/

I’m going to hell for this, but damn, feels good man. ;-)

I wasn’t totally evil – I added an info-level syslog notice about the IPv4 enforcement in case any poor admin ever gets puzzled by someone’s customized RHEL 5 box refusing to connect to CIFS shares via IPv6 – that would be a bit too cruel. ;-)

The hack is pretty crude – it actually just breaks the IPv6 socket connection attempt so that it falls back to IPv4, which throws up a couple of errors in the logs, but doesn’t actually impact the mounting at all.

mount.cifs: Warning: Using specially patched cifs-utils to ignore IPv6 address resolution - enforcing IPv4 only!
kernel:  CIFS VFS: Error connecting to socket. Aborting operation
kernel:  CIFS VFS: cifs_mount failed w/return code = -111

But wait, there’s more! I have shiny cifs-utils i386/x86_64/SRPM packages with this evil hack available for download from the amberdms-os repository (or directly from the server here).

Naturally this is a bit of a kludge – don’t trust it for mission-critical stuff, you ONLY need it for RHEL 5 (not RHEL 6), and I can’t guarantee it won’t eat all your data and bring about the end times, etc, etc.

I’ve tested it on my devel systems and it seems like the nicest fix – sure, it won’t work for any hosts needing to run on native IPv6, but by the time I come to drop IPv4 addressing entirely, I will certainly have moved my last hosts off RHEL 5 to something a bit newer. :-)

A tale of two route controllers

Ever since I built a Linux 3.2.0 kernel for my Debian Stable laptop to take advantage of some of the newer kernel features, I have been experiencing occasional short periods of disconnect/reconnect on the Wi-Fi network.

This wasn’t happening heaps (maybe a couple of times a day), but it was starting to get annoying, so I decided to sort it out properly and do a kernel driver and microcode update for my Intel Centrino Wireless-N 1000 card.

The firmware/microcode update was easy enough, simply a case of downloading the latest code from Intel and installing into /lib/firmware/ – the kernel driver does the rest, finding it and loading it into the Wi-Fi card at boot time.

The next step was building a new kernel for my machine. I went through and tuned the module selection very carefully, tossing out all the hardware my laptop will never use, as I was getting sick of wasting disk space on the billion+ device modules in Linux these days.

After finding that my initial kernel lacked support for my video card (turns out the Lenovo X201i laptops still use AGP-based i915 graphics – I was assuming PCIe), I got a working kernel up and running.

Except that my Wi-Fi stability problem was now worse than ever – instead of losing connectivity every few hours, it was now doing so every few minutes. :-(

The logs weren’t particularly helpful – NetworkManager likes to give reason numbers but I couldn’t easily find a documented explanation of these (but maybe I’m looking in the wrong place).

19:44:36 NetworkManager[1650]: <info> (wlan0): device state change: 8 -> 9 (reason 5)
19:44:36 NetworkManager[1650]: <warn> Activation (wlan0) failed for access point (b201)
19:44:36 NetworkManager[1650]: <warn> Activation (wlan0) failed.
19:44:36 NetworkManager[1650]: <info> (wlan0): device state change: 9 -> 3 (reason 0)
19:44:36 NetworkManager[1650]: <info> (wlan0): deactivating device (reason: 0).
19:44:36 NetworkManager[1650]: <info> (wlan0): canceled DHCP transaction, DHCP client pid 3354
19:44:36 kernel: [  391.070772] wlan0: deauthenticating from 00:0c:42:67:8b:bc by local choice (reason=3)
19:44:36 kernel: [  391.185461] wlan0: moving STA 00:0c:42:67:8b:bc to state 2
19:44:36 kernel: [  391.185466] wlan0: moving STA 00:0c:42:67:8b:bc to state 1
19:44:36 kernel: [  391.185470] wlan0: moving STA 00:0c:42:67:8b:bc to state 0
19:44:36 wpa_supplicant[1682]: CTRL-EVENT-DISCONNECTED - Disconnect event - remove keys
19:44:36 NetworkManager[1650]: <error> [1337240676.376011] [nm-system.c:1229] check_one_route(): (wlan0): \
         error -34 returned from rtnl_route_del(): Netlink Error (errno = Numerical result out of range)
19:44:36 kernel: [  391.233344] cfg80211: Calling CRDA to update world regulatory domain
19:44:36 avahi-daemon[1633]: Withdrawing address record for 192.168.1.11 on wlan0.
19:44:36 avahi-daemon[1633]: Leaving mDNS multicast group on interface wlan0.IPv4 with address 192.168.1.11.
19:44:36 avahi-daemon[1633]: Interface wlan0.IPv4 no longer relevant for mDNS.
19:44:36 avahi-daemon[1633]: Withdrawing address record for 2407:1000:1003:99:226:c7ff:fe66:b822 on wlan0.
19:44:36 avahi-daemon[1633]: Leaving mDNS multicast group on interface wlan0.IPv6 with address 2407:1000:1003:99:226:c7ff:fe66:b822.
19:44:36 NetworkManager[1650]: <info> (wlan0): writing resolv.conf to /sbin/resolvconf
19:44:36 avahi-daemon[1633]: Joining mDNS multicast group on interface wlan0.IPv6 with address fe80::226:c7ff:fe66:b822.
19:44:36 avahi-daemon[1633]: Registering new address record for fe80::226:c7ff:fe66:b822 on wlan0.*.

So I proceeded to debug:

  1. Cursed and wished my 300m spool of Cat6 ethernet wasn’t in Wellington.
  2. Rolled back the microcode update – my initial thought was that the new code was making the card unstable and the result was the card dropping the connection and NetworkManager doing the clean up.
  3. Did a full power down to make sure that the microcode wasn’t remaining active on the card across reboots (had this problem with a dodgy GPU once).
  4. Verdict: Microcode upgrade was OK, must be something else.
  5. Upgraded NetworkManager from 0.8.1 to 0.8.4 from Debian Backports – 0.8.1 isn’t exactly recent; I was tempted to try the 0.9 series, but that would have required a lot more backporting work.
  6. Verdict: Appears not to be a NetworkManager issue in the 0.8 series – maybe something fixed in 0.9 or later?
  7. Upgraded wpasupplicant from 0.6.10 to 1.0 by manual backport from unstable – the activation error made me consider it might have been a bug with newer kernels & wpasupplicant’s AP negotiation.
  8. Verdict: No change to the issue.
  9. Built a Linux 3.3 kernel with the older less-crashy 3.2 iwlwifi driver to see if it was driver specific, or otherwise-kernel related.
  10. Verdict: Same issue continued to occur, and rolling back the driver version in fact made no change – something about the 3.3 kernel itself was the problem.
  11. Got suspicious about NetworkManager – either it or the kernel had to be at fault, and one possibility was some weird API breakage given the age gap between the software versions in use. The kernel is *usually* pretty solid, and something like Wi-Fi drivers dropping out every couple of minutes would be a pretty serious bug to slip through, so I went back through NetworkManager’s logs to see if I could find anything more useful.
  12. Spotted a kernel error “ICMPv6 RA: ndisc_router_discovery() failed to add default route.“. This error tended to occur shortly before any WiFi disconnection occurred, but not immediately so.
  13. Found an entry in Red Hat’s bugzilla.
  14. And then the upstream bug fix from 19th April.

Turns out that the Linux 3.3 kernel and NetworkManager fight over which one is going to control the default route for each router-advertised link – the kernel adds one, NetworkManager removes it, and then the kernel gets upset and drops all router advertisements.
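If you want to watch the two of them fighting, keeping an eye on the IPv6 routing table while RAs arrive makes it fairly obvious – something like:

# Stream route table changes as router advertisements arrive and NetworkManager reacts
ip monitor route

# Or check whether the RA-learned default route is currently present
ip -6 route show default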

In hindsight I should have spotted it sooner, but I had initially discounted the RA message as being related, since the disconnection often didn’t happen until a minute or two after the log entry – e.g.:

19:51:40 kernel: [  814.274903] ICMPv6 RA: ndisc_router_discovery() failed to add default route.
19:52:47 NetworkManager[1650]: <info> (wlan0): device state change: 8 -> 9 (reason 5)

What’s interesting about this bug is that at first reading it explains a loss of IPv6 connectivity perfectly – however, it doesn’t explain why IPv4 or the Wi-Fi connection itself was impacted.

The reason this happened is that NetworkManager was set to require IPv6 for that connection to be considered established – in the event of IPv6 not working, NetworkManager would consider the interface to be down, even if IPv4 was up.

There is a good reason for this, which the developers detailed on their (excellently written) blog: by having NetworkManager check for IPv6, applications can be written to better understand their level of connectivity.

For users of the NetworkManager 0.9 series, there’s a patch already committed which you can grab here and I would expect the next NetworkManager update will have this fix.

If you’re on the NetworkManager 0.8 series, this patch won’t apply cleanly – I might make some time to go and backport it, but you can work around it for now by setting the IPv6 method to Ignore, so that NetworkManager does nothing and leaves it up to the Linux kernel in the background to negotiate IPv6 addressing.

Breaking vs Working Network Manager Settings
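If you manage the connection with a system-connection keyfile rather than through the GUI, the equivalent setting is just this (a sketch – the connection file name here is made up, adjust to match yours):

# /etc/NetworkManager/system-connections/home-wifi (excerpt)
[ipv6]
method=ignore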

Of course, if you’re not connecting to any IPv6-capable networks, you don’t have anything to worry about (other than the fact you’re still stuck in the 20th century).


Initially I was a bit annoyed at NetworkManager for being so silly as to drop the whole interface when just one of the two networking stacks was broken; however, after thinking about it for a bit, the behaviour does make some sense – many interface issues can be fixed by reconnecting: maybe the AP got rebooted, maybe the laptop just moved between two of them, and so on.

But a smarter approach would be to determine whether network issues are layer 2 or layer 3 – if it’s just a layer 3 issue, there’s little need to drop the Wi-Fi connection itself; instead, attempt to re-establish IPv4 or IPv6 connectivity where appropriate, and if unable to do so, use notifications to tell the user that “IPv6 connectivity is experiencing a problem, some hosts and services may be unreachable”.

It’s actually something that Windows does semi-OK – it figures out roughly how borked a user’s connection is and then does a balloon popup stating that there’s limited connectivity or an IP conflict, or some other sometimes-helpful message.

This may be better in newer versions of NetworkManager – I’ll have to have a play with a more recent release and see.