As part of my two weeks of annual leave, I’ve been making good use of the spare time to upgrade a lot of my servers, adjust configurations and perform a large shuffle of virtual machines between some of the hosts I have in different data centres.
As part of this work, I’ve been upgrading what was previously a DR-only host to run as full production after some nice memory and disk upgrades.
Unfortunately I ran into the beloved “Memory squeeze in netback driver” bug as per Xensource bug 762.
This delightful bug leads to a situation where although the server has about 8GB of available memory, Xen runs out of memory for networking to the VMs after a certain number of guests are started.
It’s a known fault related to Xen dom0 memory ballooning – one workaround is to force dom0 to a fixed memory size, which is easy enough to do: one change in the bootloader and another in the Xen configuration files.
However, I had to be clever. I thought to myself “Why not just tell the Xen dom0 to set the memory now using the xm mem-set command and save a reboot?”. Sadly my brilliant idea didn’t extend to checking how much memory the host was actually using…
Since it had been running for a while, a few processes had decided to take advantage of the additional memory and didn’t take kindly to having to fit into the new size, promptly consuming the allocated 256MB plus the swap space on the host.
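A quick look at how much memory the host is really using before shrinking it would catch this. Below is a minimal Python sketch of such a check, assuming the usual MemTotal, MemFree, Buffers and Cached fields in /proc/meminfo. The function names and the 64MB headroom are purely illustrative, and "in use" is only a rough estimate (total minus free, buffers and page cache), not an exact figure.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of integer values (mostly kB)."""
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue  # not a "Name: value" line
        parts = rest.split()
        if parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


def memory_in_use_kb(fields):
    """Rough estimate of memory the kernel cannot simply drop, in kB."""
    reclaimable = fields["MemFree"] + fields.get("Buffers", 0) + fields.get("Cached", 0)
    return fields["MemTotal"] - reclaimable


def safe_to_shrink(fields, target_mb, headroom_mb=64):
    """True if current usage plus some headroom fits into target_mb."""
    needed_kb = memory_in_use_kb(fields) + headroom_mb * 1024
    return needed_kb <= target_mb * 1024


def check_host(target_mb, path="/proc/meminfo"):
    """Read the live meminfo file and report whether target_mb looks safe."""
    with open(path) as handle:
        return safe_to_shrink(parse_meminfo(handle.read()), target_mb)
```

Calling check_host(256) on the host before running xm mem-set returns False when the running processes would not fit into 256MB, which is the cue to free some memory first or just take the reboot.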
If you’ve never exhausted a Linux box of memory, what happens next is never fun – essentially the kernel invokes the Out Of Memory (OOM) killer, which kills off the processes it thinks are most deserving of termination in order to free resources.
Whilst this sounds like a smart feature, the OOM killer isn’t actually that smart and can make some undesirable choices – in this case, it terminated almost all the processes on the server, including both cron and SSH, in an attempt to free memory.
Whilst working on the changes, I had set up a script to automatically restart the server should another remote server be unable to establish an SSH connection after 10 minutes, just in case I did something silly and killed networking. However, with cron terminated, this script is no longer being executed.
So I now have a box that can do nothing other than ping, located in a data centre requiring a technician to power cycle it – the nightmare of any sysadmin. :-(
These situations are pretty rare these days thanks to most workloads being inside virtual machines or on servers with lights out management, but they still happen from time to time sadly. :-(
This bug is also one of the reasons why I’m really enjoying KVM on RHEL 6 over Xen on RHEL 5: so far it appears far more stable, less buggy and generally less “hacky” in nature.
Interestingly, I’ve only seen this bug on x86_64 Xen hosts… many of the bugs I find with Xen seem to be architecture-specific and often don’t happen on i386, or vice versa.
Sadly most of my production boxes still have another 12-24 months of life before I can justify upgrading them all to shiny new KVM hosts with LOM capabilities. I look forward to when I can.
Meanwhile, I think some research into the OOM killer is needed, to find out how I can best configure it not to kill key processes.
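As a starting point, the kernel exposes a per-process knob under /proc that tells the OOM killer to leave a process alone. The Python sketch below is only an illustration under some assumptions: newer kernels provide oom_score_adj (where -1000 exempts the process), older ones such as the RHEL 5 kernel only have oom_adj (where -17 does the same), and the caller is root. The function name and the proc_root parameter are made up for this example.

```python
import os


def protect_from_oom(pid, proc_root="/proc"):
    """Ask the kernel not to pick `pid` as an OOM victim.

    Tries the newer oom_score_adj file first, then falls back to the
    older oom_adj file. Returns the path that was written.
    """
    candidates = (("oom_score_adj", "-1000"), ("oom_adj", "-17"))
    for filename, value in candidates:
        path = os.path.join(proc_root, str(pid), filename)
        if os.path.exists(path):
            with open(path, "w") as handle:
                handle.write(value)
            return path
    raise OSError("no OOM adjustment file found for pid %s" % pid)
```

The idea would be to call this with the PIDs of the processes that must survive, such as sshd and crond, for example from an init script. The setting is per process and is lost on restart, so it has to be reapplied whenever those daemons are restarted.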
The OOM killer isn’t entirely stupid: it uses a number of metrics to try and make the best of a bad situation, as per the documentation, but at the end of the day it’s just a really nasty tool for a problem you never ever want.