We recently experienced a weird and frustrating problem with storage reliability on our RabbitMQ cluster running on AWS c5-series EC2 instances with EBS storage and Ubuntu LTS 16.04.
- One of our three-node RabbitMQ cluster instances will experience an issue that results in it being unable to persist anything to disk, on any mounted volume on the instance.
- When this happens, the instance is *supposed* to remove itself from the cluster as an unhealthy member and have the remaining two instances take over all responsibilities with zero downtime to operational systems.
- Sadly for some unknown reason, the way this issue impacts RabbitMQ does not result in the instance being evicted from the cluster. Instead, it remains in the cluster exchanging healthy status messages with other members, but (and this is the critical bit) it manages to then jam up queues across the entire cluster, bringing down the two healthy instances along with the one unhealthy.
- Operations (me) gets paged to solve a critical outage on the platform that’s going to impact customers.
The problem is super weird in that it occurs somewhat randomly – no obvious correlation to load, time of day – but it does tend to happen after the instance has been running for at least a few weeks. It also occurs on any of the three RabbitMQ instances, so it’s not something specifically weird about any one instance in the fleet.
The one thing we do know, is that the issue is storage related. Firstly nothing is persisted in the logs (RabbitMQ or system/kernel) from the time the issue occurs and secondly we can see a large spike in disk I/O wait time in our Datadog monitoring for the instance, showing that the instance is stuck with processes waiting for the disk to respond.
Why RabbitMQ is impacted in this manner is unclear. It makes sense that the cluster quorum and status negotiation wouldn’t require working disk to keep running, but in every test we made where we deliberately broke the storage, the RabbitMQ process would correctly detect something was wrong on the host and go into an unhealthy state, removing it from the cluster. Tested ripping out EBS storage whilst still mounted, corrupting with dd, force unmounting, etc… nothing could trigger the exact same behaviour.
Reviewing what differs about production was difficult since it didn’t persist any of the kernel or RabbitMQ logs, however we did manage to extract some information from the AWS instance console for one of the impacted systems before we restarted it:
Ubuntu 16.04.4 LTS localhost ttyS0
[349442.682614] Not tainted 4.4.0-1062-aws #71-Ubuntu
[349442.684363] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[349442.686890] INFO: myprocess:1956 blocked for more than 120 seconds.
Essentially the Linux kernel is proceeding to log a number of different processes (basically everything on the box that does anything) as being blocked for over 120 seconds, thanks to the storage failing and being unable to do anything about it to unblock the processes
Given we have been unable to identify the exact fault or reproduce the behaviour (could be something in the Linux Kernel, could be something in AWS c5 or EBS…), we needed a solution that would at least help us by terminating any instance that experienced this storage issue.
The solution is helpfully identified by the kernel log lines above. We can use the hung task panic feature in the Linux kernel to force a host to immediately reboot itself if processes are hung for too long.
We do this using two different sysctl configuration changes (note – you need to set these up in /etc/sysctl.conf to survive reboots):
# Panic if a hung task was found
# Reboot 5 seconds after panic
The first instructs the kernel to panic if a hung task (any task blocked for more than 120 seconds) occurs. The second, instructs it to reboot shortly after this occurs. We set it to 5 seconds, to give time for any logging to persist or be delivered about the hung task before it’s reboot, although in this particular situation, with all storage being busted its of very limited benefit.
This has been in place for several months now and is working beautifully. Every so often an instance experiences this fault and instead of causing any disruption, it is quickly self-terminated and replaced. Because it terminates completely, the RabbitMQ cluster negotiation is successfully able to re-assign responsibilities to the other instances in the cluster.
In theory there is a 2-minute period where the unhealthy instance is still running, however reviewing the production metrics, it appears that when the fault occurs, RabbitMQ doesn’t immediately break – sometimes it continues to run for 15mins or more before jamming up the cluster. So having it run for 2 minutes has turned out to be just fine.
Ideally going forwards we need to setup a network logging endpoint for these hosts to see if we can capture anything like a stack trace from a kernel driver. It seems likely that it’s a Linux kernel bug rather than an AWS EBS bug, since the issue is resolved with a soft reboot, rather than a force stop-start of the instance via the AWS API, meaning it’s still running on the same hypervisor host, etc. But until then, this kernel configuration parameter means we are not going to disrupt operational services when the fault does occur.