bug | Jethro Carr

We recently experienced a weird and frustrating problem with storage reliability on our RabbitMQ cluster running on AWS c5-series EC2 instances with EBS storage and Ubuntu LTS 16.04.

One of our three-node RabbitMQ cluster instances will experience an issue that results in it being unable to persist anything to disk, on any mounted volume on the instance.
When this happens, the instance is *supposed* to remove itself from the cluster as an unhealthy member and have the remaining two instances take over all responsibilities with zero downtime to operational systems.
Sadly for some unknown reason, the way this issue impacts RabbitMQ does not result in the instance being evicted from the cluster. Instead, it remains in the cluster exchanging healthy status messages with other members, but (and this is the critical bit) it manages to then jam up queues across the entire cluster, bringing down the two healthy instances along with the one unhealthy.
Operations (me) gets paged to solve a critical outage on the platform that’s going to impact customers.

The problem is super weird in that it occurs somewhat randomly – no obvious correlation to load, time of day – but it does tend to happen after the instance has been running for at least a few weeks. It also occurs on any of the three RabbitMQ instances, so it’s not something specifically weird about any one instance in the fleet.

The one thing we do know, is that the issue is storage related. Firstly nothing is persisted in the logs (RabbitMQ or system/kernel) from the time the issue occurs and secondly we can see a large spike in disk I/O wait time in our Datadog monitoring for the instance, showing that the instance is stuck with processes waiting for the disk to respond.

Why RabbitMQ is impacted in this manner is unclear. It makes sense that the cluster quorum and status negotiation wouldn’t require working disk to keep running, but in every test we made where we deliberately broke the storage, the RabbitMQ process would correctly detect something was wrong on the host and go into an unhealthy state, removing it from the cluster. Tested ripping out EBS storage whilst still mounted, corrupting with dd, force unmounting, etc… nothing could trigger the exact same behaviour.

Reviewing what differs about production was difficult since it didn’t persist any of the kernel or RabbitMQ logs, however we did manage to extract some information from the AWS instance console for one of the impacted systems before we restarted it:

Ubuntu 16.04.4 LTS localhost ttyS0
 [349442.682614]       Not tainted 4.4.0-1062-aws #71-Ubuntu
 [349442.684363] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
 [349442.686890] INFO: myprocess:1956 blocked for more than 120 seconds.

Essentially the Linux kernel is proceeding to log a number of different processes (basically everything on the box that does anything) as being blocked for over 120 seconds, thanks to the storage failing and being unable to do anything about it to unblock the processes

Given we have been unable to identify the exact fault or reproduce the behaviour (could be something in the Linux Kernel, could be something in AWS c5 or EBS…), we needed a solution that would at least help us by terminating any instance that experienced this storage issue.

The solution is helpfully identified by the kernel log lines above. We can use the hung task panic feature in the Linux kernel to force a host to immediately reboot itself if processes are hung for too long.

We do this using two different sysctl configuration changes (note – you need to set these up in /etc/sysctl.conf to survive reboots):

# Panic if a hung task was found
sysctl kernel.hung_task_panic=1

# Reboot 5 seconds after panic
sysctl kernel.panic=5

The first instructs the kernel to panic if a hung task (any task blocked for more than 120 seconds) occurs. The second, instructs it to reboot shortly after this occurs. We set it to 5 seconds, to give time for any logging to persist or be delivered about the hung task before it’s reboot, although in this particular situation, with all storage being busted its of very limited benefit.

This has been in place for several months now and is working beautifully. Every so often an instance experiences this fault and instead of causing any disruption, it is quickly self-terminated and replaced. Because it terminates completely, the RabbitMQ cluster negotiation is successfully able to re-assign responsibilities to the other instances in the cluster.

In theory there is a 2-minute period where the unhealthy instance is still running, however reviewing the production metrics, it appears that when the fault occurs, RabbitMQ doesn’t immediately break – sometimes it continues to run for 15mins or more before jamming up the cluster. So having it run for 2 minutes has turned out to be just fine.

Ideally going forwards we need to setup a network logging endpoint for these hosts to see if we can capture anything like a stack trace from a kernel driver. It seems likely that it’s a Linux kernel bug rather than an AWS EBS bug, since the issue is resolved with a soft reboot, rather than a force stop-start of the instance via the AWS API, meaning it’s still running on the same hypervisor host, etc. But until then, this kernel configuration parameter means we are not going to disrupt operational services when the fault does occur.

I use an ext4 filesystem for a couple large volumes on my virtual machines. I know that btrfs and ZFS are the new cool kids in town for filesystems, but when I’m already running RAID and LVM on the VM server underneath the guest, these newer filesystems with their fancy features don’t add a lot of value for me, so a good, simple, traditional, reliable filesystem is just what I want.

There’s actually a bit of confusion and misinformation online about using ext4 filesystems with RHEL/CentOS (EL) 5 & 6 systems, so just wanted to clarify some things for people.

ext4 uses the same tools as ext3 and ext2 before it – this means still using the e2fsprogs package which provides utilities such as e2fsck and resize2fs, which handle ext2, ext3 and ext4 filesystems with the same command. But there are a few catches with EL…

On an EL 5 system, the version of e2fsprogs is too old and doesn’t support ext4. Therefore there is an additional package called “e4fsprogs” which is actually just a newer version of e2fsprogs that provides renamed tools such as “resize4fs”. This tends to get confusing for users who then try to find these tools on EL 6 or other distributions, so it’s important to be aware that it’s a non-standard thing. Why they didn’t just upgrade the tools/backport the ext4 support is a bit weird to me, especially considering the fact these tools are extremely backwards compatible and should meet even Red Hat’s API compatibility requirements, but it-is-what-it-is sadly.

On an EL 6 system, the version of e2fsprogs is modern enough to support ext4, which means you need to use the normal “resize2fs” command to do an ext4 filesystem resize (as per the docs). Generally this works fine, but I think I may have found a bug with the stock EL 6 kernel and e2fsprog tools.

One of my filesystems almost reached 100% full, so I expanded the underlying block volume and then issued an offline e2fsck -f /dev/volume & resize2fs /dev/volume which completed as normal without error, however it left the filesystem size unchanged. Repeated checks and resize attempts made no difference. The block volume correctly reported it’s new size, so it wasn’t a case of the kernel not having seen the changed size.

However by mounting the ext4 filesystem and doing an online resize, the resize works correctly, although quite slowly, as the online resize seems to trickle-resize the disk, gradually increasing it’s size until finally complete.

This inability to resize offline is not something I’ve come across before, so may be a rare bug trigged by the full size of the filesystem that could well be fixed in newer kernels/e2fsprog packages, but figured I’d mention it for any other poor sysadmins scratching their heads over a similar issue.

Finally, be aware that you need to use either no partition table, or GPT if you want file systems over 2TB and also that the e2fsprogs package shipped with EL 6 has a 16TB limit, so you’ll have to upgrade the package manually if you need more than that.