Tag Archives: virtualisation

Installing EL7 onto EL5 Xen hosts

With RedHat recently releasing RHEL 7 (and CentOS promptly getting their rebuild out the door shortly after), I decided to take the opportunity to start upgrading some of my ageing RHEL/CentOS (EL) systems.

My personal co-location server is a trusty P4 3.0Ghz box running EL 5 for both host and Xen guests. Xen has lost some popularity in favour of HVM solutions like KVM, however it’s still a great hypervisor and can run Linux guests really nicely on even hardware as old as mine that lacks HVM CPU extensions.

Considering that EL 5, 6 and 7 are all still supported by RedHat, I would expect that installing EL 7 as a guest on EL 5 should be easy – and to be fair to RedHat it mostly is, the installation was pretty standard.

Like EL 5 guests, EL 7 guests can be installed entirely from the command line using the standard virt-install command – for example:

$ virt-install --paravirt \
 --name MyCentOS7Guest \
 --ram 1024 \
 --vcpus 1 \
 --location http://mirror.centos.org/centos/7/os/x86_64/ \
 --file /dev/lv_group/MyCentOS7Guest \
 --network bridge=xenbr0

One issue I had is that the installer no longer prompts for network information to use to download the rest of the installer and instead assumes you have a DHCP server, an assumption that isn’t always correct. If you want to force it to use a static address, append the following parameters to the virt-install command.

 -x 'ip=192.168.1.20 netmask=255.255.255.0 dns=8.8.8.8 gateway=192.168.1.1'

The installer will proceed and give you an option to either use VNC to get a graphical installer, or to accept the more basic/limited text mode installer. In my case I went with the text mode installer, generally this is fine for average installations, except that it doesn’t give you a lot of control over partitioning.

Installation completed successfully, but I was not able to subsequently boot the new guest, with an error being thrown about pygrub being unable to find the boot partition.

# xm create -c vmguest
Using config file "./vmguest".
Traceback (most recent call last):
  File "/usr/bin/pygrub", line 774, in ?
    raise RuntimeError, "Unable to find partition containing kernel"
RuntimeError: Unable to find partition containing kernel
No handlers could be found for logger "xend"
Error: Boot loader didn't return any data!
Usage: xm create <ConfigFile> [options] [vars]

 

Xen works a little differently than VMWare/KVM/VirtualBox in that it doesn’t try to emulate hardware unnecessarily in paravirtualised mode, so there’s no BIOS. Instead Xen ships with a tool called pygrub, that is essentially an application that implements grub and goes through the process of reading the guest’s /boot filesystem, displaying a grub interface using the config in /boot, then when a kernel is selected grabs the kernel and associated information and launches the guest with it.

Generally this works well, certainly you can boot any of your EL 5 guests with it as well as other Linux distributions with Xen paravirtulised compatible kernels (it’s merged into upstream these days).

However RHEL has moved on a bit since 2007 adding a few new tricks, such as replacing Grub with Grub2 and moving from the typical ext3 boot partition to an xfs boot partition. These changes confuse the much older utilities written for Xen, leaving it unable to read the boot loader data and launch the guest.

The two main problems come down to:

  1. EL 5 can’t read the xfs boot partition created by default by EL 7 hosts. Even if you install optional xfs packages provided by centosplus/centosextras, you still can’t read the filesystem due to the version of xfs being too new for it to comprehend.
  2. The version of pygrub shipped with EL 5 doesn’t have support for Grub2. Well, technically it’s supposed to according to RedHat, but I suspect they forgot to merge in fixes needed to make EL 7 boot.

I hope that RedHat fix this deficiency soon, presumably there will be RedHat customers wanting to do exactly what I’m doing who will apply some pressure for a fix, however until then if you want to get your shiny new EL 7 guests installed, I have a bunch of workarounds for those whom are not faint of heart.

 

For these instructions, I’m assuming that your guest is installed to /dev/lv_group/vmguest, however these instructions should work equally for image files or block devices.

Firstly, we need to check what the state of the /boot partition is – we need to make sure it is an ext3 volume, or convert it if not. If you installed via the limited text mode installer, it will be an xfs partition, however if you installed via VNC, you might be able to change the type to ext3 and avoid the next few steps entirely.

We use kpartx -a and -d respectively to expose the partitions inside the block device so we can manipulate the contents. We then use the good ol’ file command to check what type of filesystem is on the first partition (which is presumably boot).

# kpartx -a /dev/lv_group/vmguest
# file -sL /dev/mapper/vmguestp1
/dev/mapper/vmguestp1: SGI XFS filesystem data (blksz 4096, inosz 256, v2 dirs)
# kpartx -d /dev/lv_group/vmguest

Being xfs, we’re probably unable to do much – if we install xfsprogs (from centos extras), we can verify it’s unreadable by the host OS:

# yum install xfsprogs
# xfs_check /dev/mapper/vmguestp1
bad sb version # 0xb4b4 in ag 0
bad sb version # 0xb4a4 in ag 1
bad sb version # 0xb4a4 in ag 2
bad sb version # 0xb4a4 in ag 3
WARNING: this may be a newer XFS filesystem.
#

Technically you could fix this by upgrading the kernel, but EL 5’s kernel is a weird monster that includes all manor of patches for Xen that were never included into upstream, so it’s not a simple (or even feasible) operation.

We can convert the filesystem from xfs to ext3 by using another newer Linux system. First we need to export the boot volume into an image file:

# dd if=/dev/mapper/vmguestp1  | bzip2 > /tmp/boot.img.bz2

Then copy the file to another host, where we will unpack it and recreate the image file with ext3 and the same contents.

$ bunzip2 boot.img.bz2
$ mkdir tmp1 tmp2
$ sudo mount -t xfs -o loop boot.img tmp1/
$ sudo cp -avr tmp1/* tmp2/
$ sudo umount tmp1/
$ mkfs.ext3 boot.img
$ sudo mount -t ext3 -o loop boot.img tmp1/
$ sudo cp -avr tmp2/* tmp1/
$ sudo umount tmp1
$ rm -rf tmp1 tmp2
$ mv boot.img boot-new.img
$ bzip2 boot-new.img

Copy the new file (boot-new.img) back to the Xen host server and replace the guest’s/boot volume with it.

# kpartx -a /dev/lv_group/vmguest
# bzcat boot-new.img.bz2 > /dev/mapper/vmguestp1
# kpartx -d /dev/lv_group/vmguest

 

Having fixed the filesystem, Xen’s pygrub will be able to read it, however your guest still won’t boot. :-( On the plus side, it throws a more useful error showing that it could access the filesystem, but couldn’t parse some data inside it.

# xm create -c vmguest
Using config file "./vmguest".
Using <class 'grub.GrubConf.Grub2ConfigFile'> to parse /grub2/grub.cfg
Traceback (most recent call last):
  File "/usr/bin/pygrub", line 758, in ?
    chosencfg = run_grub(file, entry, fs)
  File "/usr/bin/pygrub", line 581, in run_grub
    g = Grub(file, fs)
  File "/usr/bin/pygrub", line 223, in __init__
    self.read_config(file, fs)
  File "/usr/bin/pygrub", line 443, in read_config
    self.cf.parse(buf)
  File "/usr/lib64/python2.4/site-packages/grub/GrubConf.py", line 430, in parse
    setattr(self, self.commands[com], arg.strip())
  File "/usr/lib64/python2.4/site-packages/grub/GrubConf.py", line 233, in _set_default
    self._default = int(val)
ValueError: invalid literal for int(): ${next_entry}
No handlers could be found for logger "xend"
Error: Boot loader didn't return any data!

At a glance, it looks like pygrub can’t handle the special variables/functions used in the EL 7 grub configuration file, however even if you remove them and simplify the configuration down to the core basics, it will still blow up.

# xm create -c vmguest
Using config file "./vmguest".
Using <class 'grub.GrubConf.Grub2ConfigFile'> to parse /grub2/grub.cfg
WARNING:root:Unknown image directive load_video
WARNING:root:Unknown image directive if
WARNING:root:Unknown image directive else
WARNING:root:Unknown image directive fi
WARNING:root:Unknown image directive linux16
WARNING:root:Unknown image directive initrd16
WARNING:root:Unknown image directive load_video
WARNING:root:Unknown image directive if
WARNING:root:Unknown image directive else
WARNING:root:Unknown image directive fi
WARNING:root:Unknown image directive linux16
WARNING:root:Unknown image directive initrd16
WARNING:root:Unknown directive source
WARNING:root:Unknown directive elif
WARNING:root:Unknown directive source
Traceback (most recent call last):
  File "/usr/bin/pygrub", line 758, in ?
    chosencfg = run_grub(file, entry, fs)
  File "/usr/bin/pygrub", line 604, in run_grub
    grubcfg["kernel"] = img.kernel[1]
TypeError: unsubscriptable object
No handlers could be found for logger "xend"
Error: Boot loader didn't return any data!
Usage: xm create <ConfigFile> [options] [vars]

Create a domain based on <ConfigFile>

At this point it’s pretty clear that pygrub won’t be able to parse the configuration file, so you’re left with two options:

  1. Copy the kernel and initrd file from the guest to somewhere on the host and set Xen to boot directly using those host-located files. However then kernel updating the guest is a pain.
  2. Backport a working pygrub to the old Xen host and use that to boot the guest. This requires no changes to the Grub2 configuration and means your guest will seamlessly handle kernel updates.

Because option 2 is harder and more painful, I naturally chose to go down that path, backporting the latest upstream Xen pygrub source code to EL 5. It’s not quite vanilla, I had to make some tweaks to rip out a couple newer features that were breaking it on EL 5, so I’ve packaged up my version of pygrub and made it available in both source and binary formats.

Download Jethro’s pygrub backport here

Installing this *will* replace the version installed by the Xen package – this means an update to the package on the host will undo these changes – I thought about installing it to another path or making an RPM, but my hope is that Red Hat get their Xen package fixed and make this whole blog post redundant in the first place so I haven’t invested that level of effort.

Copy to your server and unpack with:

# tar -xkzvf xen-pygrub-6f96a67-JCbackport.tar.gz
# cd xen-pygrub-6f96a67-JCbackport

Then you can build the source into a python module and install with:

# yum install xen-devel gcc python-devel
# python setup.py build
running build
running build_py
creating build
creating build/lib.linux-x86_64-2.4
creating build/lib.linux-x86_64-2.4/grub
copying src/GrubConf.py -> build/lib.linux-x86_64-2.4/grub
copying src/LiloConf.py -> build/lib.linux-x86_64-2.4/grub
copying src/ExtLinuxConf.py -> build/lib.linux-x86_64-2.4/grub
copying src/__init__.py -> build/lib.linux-x86_64-2.4/grub
running build_ext
building 'fsimage' extension
creating build/temp.linux-x86_64-2.4
creating build/temp.linux-x86_64-2.4/src
creating build/temp.linux-x86_64-2.4/src/fsimage
gcc -pthread -fno-strict-aliasing -DNDEBUG -O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector --param=ssp-buffer-size=4 -m64 -mtune=generic -D_GNU_SOURCE -fPIC -fPIC -I../../tools/libfsimage/common/ -I/usr/include/python2.4 -c src/fsimage/fsimage.c -o build/temp.linux-x86_64-2.4/src/fsimage/fsimage.o -fno-strict-aliasing -Werror
gcc -pthread -shared build/temp.linux-x86_64-2.4/src/fsimage/fsimage.o -L../../tools/libfsimage/common/ -lfsimage -o build/lib.linux-x86_64-2.4/fsimage.so
running build_scripts
creating build/scripts-2.4
copying and adjusting src/pygrub -> build/scripts-2.4
changing mode of build/scripts-2.4/pygrub from 644 to 755

# python setup.py install

Naturally I recommend reviewing the source code and making sure it’s legit (you do trust random blogs right?) but if you can’t get it to build/lack build tools/like gambling, I’ve included pre-built binaries in the archive and you can just do

# python setup.py install

Then do a quick check to make sure pygrub throws it’s help message, rather than any nasty errors indicating something went wrong.

# /usr/bin/pygrub

 

We’re almost ready to try booting again! First create a directory that the new pygrub expects:

# mkdir /var/run/xend/boot/

Then launch the machine creation – this time, it should actually boot and run through the usual systemd startup process. If you installed with /boot set to ext3 via the installer, everything should just work and you’ll be up and running!

If you had to do the xfs to ext3 conversion trick, the bootup process will explode with scary errors like the following:

.......
[ TIME ] Timed out waiting for device dev-disk-by\x2duuid-245...95b2c23.device.
[DEPEND] Dependency failed for /boot.
[DEPEND] Dependency failed for Local File Systems.
[DEPEND] Dependency failed for Relabel all filesystems, if necessary.
[DEPEND] Dependency failed for Mark the need to relabel after reboot.
[  101.134423] systemd-journald[414]: Received request to flush runtime journal from PID 1
[  101.658465] type=1305 audit(1405735466.679:4): audit_pid=476 old=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:auditd_t:s0 res=1
Welcome to emergency mode! After logging in, type "journalctl -xb" to view
system logs, "systemctl reboot" to reboot, "systemctl default" to try again
to boot into default mode.
Give root password for maintenance
(or type Control-D to continue):

The issue is that the conversion of the filesystem changed it’s UUID, plus the filesystem type in /etc/fstab no longer matches.

We can fix this easily by dropping to the recovery shell by entering the root password above and executing the following commands:

guest# sed -i -e '/boot/ s/UUID=[0-9\-]*/\/dev\/xvda1/' /etc/fstab
guest# sed -i -e '/boot/ s/xfs/ext3/' /etc/fstab
guest# cat /etc/fstab | grep '/boot'

Make sure the cat returns a valid /boot line, it should be using /dev/xvda1 as the device and ext3 as the filesystem now.

Finally, stop and start the instance (reboots seem to hang for me):

guest# shutdown -h now
xm create -c vmguest1

It should now boot correctly! Go forth and enjoy your new VM!

CentOS Linux 7 (Core)
Kernel 3.10.0-123.el7.x86_64 on an x86_64

This is certainly a hack – doing this backport of pygrub solved my personal issue, but it’s entirely possible it may break other things, so do your own testing and determine whether it’s suitable for you and your environment or not.

Fedora x86_64 installer hanging on KVM hosts

Had an annoying problem today where a Fedora x86_64 guest wouldn’t install on my CentOS KVM server. Weirdly the i386 version had installed perfectly, but the x86_64 version would repeatedly crash and chew up heaps of CPU at the software package selection screen.

I hate being stuck here...

Stuck here, unresponsive console, no mouse, etc?

Turns out that 512MB of RAM isn’t enough to install Fedora x86_64, but is enough to get away with installing Fedora i386 on. Simply boost the RAM allocation of the VM up to 1GB, and the installation will proceed OK. You can drop the RAM allocation down again afterwards.

Unsure why the installer dies in such a strange fashion, I would have expected Linux’s OOM to terminate the installer and leave me with a clear message, but maybe Anaconda is doing something weird like OOM protection and just ends up with the system running out of memory and hanging.

Delicious Entropy

I run a large GNU/Linux server with KVM for running numerous virtual machine guests, including build hosts used to package and compile software for different GNU/Linux distributions and other operating systems.

I recently ran into an issue during a kernel compile where the kernel compile hung indefinitely whilst GPG (tried) to sign kernel modules as part of the build process, due to the virtual machine guest running out of available entropy and being unable to proceed until more random data was available.

Bro, I'm stalled as bro!

Bro, I’m stalled as bro!

On Linux there are two sources of random data  – /dev/random, which provides high quality random data and /dev/urandom which provides an unlimited amount of pseudo-random data based on a seed value taken from the random pool initially.

Linux generates this random data by collecting entropy from somewhat-random events, such as disk activity, network activity, keyboard, mouse and other sources. When the pool of entropy is exhausted, /dev/random will block (ie force processes to freeze) until more is available, whereas /dev/urandom will continue to serve continuous pseudo-random data, although the quality of the random data is not considered as secure as /dev/random.

On a workstation or single server this tends to be enough to generate sufficient random data for most applications (although if you’re doing certain tasks you may still have an issue). Virtual machines on the other hand, lack hardware sources of entropy such as disks or keyboards and it’s very easy to quickly exhaust the available entropy pool and have some applications block until more is available.

Applications like Apache (with mod_ssl) and OpenSSL use /dev/urandom so aren’t impacted by shortages of entropy, but some signing processes, such as GPG require /dev/random and can be impacted if the source of entropy is exhausted  – which is exactly what happened to my kernel signing process.

 

It’s pretty easy to use to test and see how quickly a Linux system re-fills the entropy pool by running a test to read data from /dev/random, forcing the pool to empty and be repopulated.

# dd if=/dev/random of=/dev/null count=1000
0+1000 records in
16+1 records out
8496 bytes (8.5 kB) copied, 149.849 s, 0.1 kB/s

The host doing this test has around 12 physical hard disks, 10 active KVM virtual machines spewing out packets, an unfiltered WAN link feeding random junk – all which is good for generating a decent amount of entropy. The numbers may look pretty bad, but when compared with the amount of entropy generated by my laptop…

# dd if=/dev/random of=/dev/null count=1000
0+1000 records in
16+1 records out
8409 bytes (8.4 kB) copied, 1389.95 s, 0.0 kB/s

The rate of entropy generation on my laptop is quite depressing – but at least my laptop has a keyboard, mouse and hardware environmental values to help add sometime to the entropy sources.

When I run the same test on a virtual machine guest, which lacks all these physical sources, it comes to  a grinding halt:

# dd if=/dev/random of=/dev/null count=10000
0+24 records in
0+0 records out
0 bytes (0 B) copied, 1865.68 s, 0.0 kB/s

I was forced to kill the above test due to it timing out indefinitely thanks to the host running out of any available entropy and being unable to generate any more to complete the test. :-(

Even when performing an intensive activity such as compiling a large software library, it still takes considerable time to complete this test on a VM:

# dd if=/dev/random of=/dev/null count=1000
0+1000 records in
15+1 records out
8018 bytes (8.0 kB) copied, 2560.36 s, 0.0 kB/s

It seems that the lack of the random data generated by active physical hardware is too much for the VM guest to be able to complete the test. And whilst some applications like an HTTPS website would continue to operate fine, others like a build host GPG-signing packages may fail and hang indefinitely, unable to obtain the required volume of random data to complete it’s key generation process.

 

For times when this lack of entropy becomes an issue for your applications, it is possible to obtain additional entropy from a hardware random number generator – this can be as simple as using a feed such as analog noise from the sound card or as sophisticated as a hardware random number generator or functionality built into certain CPUs which is designed to be extremely random and unpredictable.

A while ago I picked up a pair of Simtec Electronic’s Entropy Keys, a small USB device which generates truly random sources of data by a clever method of abusing semiconductors and connected one to my primary KVM servers.

The device ships with an open source daemon that takes random data from the key and injects it into the Linux entropy pool for use by all /dev/random using applications. It instantly makes a huge difference to the available volume by generating almost 3.9KB/s of random data.

Gain entropy with just 1 easy repayment!

Gain entropy with just 1 easy repayment! Call now!

After starting the daemon and re-running the test, the performance looks much better:

# dd if=/dev/random of=/dev/null count=1000
0+1000 records in
145+1 records out
74504 bytes (75 kB) copied, 21.8926 s, 3.4 kB/s

The numbers are still low, but the reality is you generally you only need a few bytes at a time, rather than massive volumes like this test demands – for general signing usage, 3.4kB/s is a huge volume to have.

So whilst this test doesn’t reflect the real way /dev/random is used, it does illustrates the difference in data volume a proper random number generator can make. And whilst this might not be a common problem thanks to the low volume of random data required for most applications to function, the increasing use of virtualisation makes this issue possibly one that people may bump into more in future.

Now that I have my host server getting a reliable and steady flow of random data, my next step is to share that data to the virtual machines running on the host – as I’m doing all my signing in guests, it’s vital that I get that random data through to them,

I’m in the process of investigating a few different options and will cover these in a follow up blog post, as it’s a somewhat sizeable topic in it’s own right.

virt-viewer remote access tricks

Sometimes I need to connect directly to the console of my virtual machines, typically this is usually when working with development or experimental VMs where SSH/RDP/VNC isn’t working for whatever reason, or when I’m installing a new OS entirely.

To view virtual machines using libvirt (by both KVM or Xen), you use the virt-viewer command, this launches a window and establishes a VNC or SPICE connection into the virtual machine.

Historically I’ve just run this by SSHing into the virtual machine host and then using X11 forwarding to display the virtual machine window on my laptop. However this performs really badly on slow connections, particularly 3G where it’s almost unusable due to the design of X11 forwarding not being particularly efficient.

However virt-viewer has the capability to run locally and connect to a remote server, either directly to the libvirt daemon, or via an SSH tunnel. To do the latter, the following command will work for KVM (qemu) based hypervisors:

virt-viewer --connect qemu+ssh://user@host.example.com/system vmnamehere

With the above, you’ll have to enter your SSH password twice – first to establish the connection to the hypervisor and secondly to establish a tunnel to the VM’s VNC/SPICE session – you’ll probably quickly decide to get some SSH keys/certs setup to prevent annoyance. ;-)

This performs way faster than X11 forwarding, plus the UI of virt-manager stays much more responsive, including grabbing/ungrabbing of the local keyboard/mouse, even if the connection or server is lagging badly.

If you’re using Xen with libvirt, the following should work (I haven’t tested this, but based on the man page and some common sense):

virt-viewer --connect xen+ssh://user@host.example.com/ vmnamehere

If you wanted to open up the right ports on your server’s firewall and are sending all traffic via a secure connection (eg VPN), you can drop the +ssh and use –direct to connect directly to the hypervisor and VM without port forwarding via SSH.

Munin Performance

Munin is a popular open source network resource monitoring tool which polls the hosts on your network for statistics for various services, resources and other attributes.

A typical deployment will see Munin being used to monitor CPU usage, memory usage, amount of traffic across network interface, I/O statistics and more – it’s very handy for seeing long term performance trends and for checking the impact that upgrades or adjustments to the environment have made.

Whilst having some overlap with Nagios, Munin isn’t really a replacement, more an addition – I use Nagios to do critical service and resource monitoring and use Munin to graph things in more detail – something that Nagios doesn’t natively do.

A typical Munin graph - Munin provides daily, weekly, monthly and yearly graphs (RRD powered)

Rather than running as a daemon, the Munin master runs a cronjob every 5minutes that calls a sequence of scripts to poll the configured servers and generate new graphs.

  1. munin-update to poll configured hosts for new statistics and store the information in RRD databases.
  2. munin-limits to highlight perceived issues in the web interface and optionally to a file for Nagios integration.
  3. munin-graph to generate all the graphs for all the services and hosts.
  4. munin-html to generate the html files for the web interface (which is purely static).

The problem with this model, is that it doesn’t scale particularly well – once you start getting a substantial number of servers, the step-by-step approach can start to run out of resources and time to complete within the 5minute cron period.

For example, the following are the results for the 3 key scripts that run on my (virtualised) Munin VM monitoring 18 hosts:

sh-3.2$ time /usr/share/munin/munin-update
real    3m22.187s
user    0m5.098s
sys     0m0.712s

sh-3.2$ time /usr/share/munin/munin-graph
real    2m5.349s
user    1m27.713s
sys     0m9.388s

sh-3.2$ time /usr/share/munin/munin-html
real    0m36.931s
user    0m11.541s
sys     0m0.679s

It’s a total of around 6 minutes time to run – long enough that the finishing job is going to start clashing with the currently running job.

So why so long?

Firstly, munin-update – munin-update’s time is mostly spent polling the munin-node daemon running on all the monitored systems and then a small amount of I/O time writing the new information to the on-disk RRD files.

The developers have appeared to realise the issue of scale with munin-update and have the ability to run it in a forked mode – however this broke horribly for me with a highly virtualised environment, since sending a poll to 12+ servers all running on the one physical host would cause a sudden load spike and lead to a service poll timeout, with no values being returned at all. :-(

This occurs because by default Munin allows a maximum of 5 seconds for each service query to complete across all hosts and queries all the hosts and services rapidly, ignoring any that fail to respond fast enough. And when querying a large number of servers on one physical host, the server would be too loaded to respond quickly enough.

I ended up boosting the timeouts on some servers to 60 seconds (particular the KVM hosts themselves, as there would sometimes be 60+ LVM volumes that Munin wanted statistics for), but it still wasn’t a good solution and the load spikes would continue.

There are some tweaks that can be used, such as adjusting the max number of forked processes, but it ended up being more reliable and easier to support to just run a single thread and make sure it completed as fast as possible – and taking 3 mins to poll all 18 servers and save to the RRD database is pretty reasonable, particular for a staggered polling session.

 

After getting munin-update to complete in a reasonable timeframe, I took a look into munin-html and munin-graph – both these processes involve reading the RRD databases off the disk and then writing HTML and RRDTool Graphs (PNG files) to disk for the web interface.

Both processes have the same issue – they chew a solid amount of CPU whilst processing data and then they would get stuck waiting for the disk I/O to catch up when writing the graphs.

The I/O on this server isn’t the fastest at the best of times, considering it’s an AES-256 encrypted RAID 6 volume and the time taken to write around 200MB of changed data each time was a bit too much to do efficiently.

Munin offers some options, including on-demand graph generation using CGIs, however I found this just made the web interface unbearably slow to use – although from chats with the developer, it sounds like version 2.0 will resolve many of these issues.

I needed to fix the performance with the current batch generation model. Just watching the processes in top quickly shows the issue with the scripts, particular with munin-graph which runs 4 concurrent processes, all of them waiting for I/O. (Linux process crash course: S is sleeping (idle), R is running, D is performing I/O operations – or waiting for them).

Clearly this isn’t ideal – I can’t do much about the underlying performance, other than considering putting the monitoring VM onto a different I/O device without encryption, however I then lose all the advantages of having everything on one big LVM pool.

I do however, have plenty of CPU and RAM (Quad Phenom, 16GB RAM) so I decided to boost the VM from 256MB to 1024MB RAM and setup a tmpfs filesystem, which is a in-memory filesystem.

Munin has two main data sources – the RRD databases and the HTML & graph outputs:

# du -hs /var/www/html/munin/
227M    /var/www/html/munin/

# du -hs /var/lib/munin/
427M    /var/lib/munin/

I decided that putting the RRD databases in /var/lib/munin/ into tmpfs would be a waste of RAM – remember that munin-update is running single-threaded and waiting for results from network polls, meaning that I/O writes are going to be spread out and not particularly intensive.

The other problem with putting the RRD databases into tmpfs, is that a server crash/power down would lose all the data and that then requires some regular processes to copy it to a safe place, etc, etc – not ideal.

However the HTML & graphs are generated fresh each time, so a loss of their data isn’t an issue. I setup a tmpfs filesystem for it in /etc/fstab with plenty of space:

tmpfs  /var/www/html/munin   tmpfs   rw,mode=755,uid=munin,gid=munin,size=300M   0 0

And ran some performance tests:

sh-3.2$ time /usr/share/munin/munin-graph 
real    1m37.054s
user    2m49.268s
sys     0m11.307s

sh-3.2$ time /usr/share/munin/munin-html 
real    0m11.843s
user    0m10.902s
sys     0m0.288s

That’s a decrease from 161 seconds (2.68mins) to 108 seconds (1.8 mins). It’s a reasonable increase, but the real difference is the massive reduction in load for the server.

For a start, we can see from watching the processes with top that the processor gets worked a bit more to complete the process, since there’s not as much waiting for I/O:

With the change, munin-graph spends almost all it’s time doing CPU processing, rather than creating I/O load – although there’s the occasional period of I/O as above, I suspect from the time spent reading the RRD databases off the slower disk.

Increased bursts of CPU activity is fine – it actually works out to less CPU load, since there’s no need for the CPU to be doing disk encryption and hammering 1 core for a short period of time is fine, there’s plenty of other cores and Linux handles scheduling for resources pretty well.

We can really see the difference with Munin’s own graphs for the monitoring VM after making the change:

In addition, the host server’s load average has dropped significantly, as well as the load time for the web interface on the server being insanely fast, no more waiting for my browser to finish pulling all the graphs down for a page, instead it loads in a flash. Munin itself gives you an idea of the difference:

If performance continues to be a problem, there are some other options such as moving RRD databases into memory, patching Munin to do virtualisation-friendly threading for munin-update or looking at better ways to fix CGI on-demand graphing – the tmpfs changes would help a bit to start with.

Virtualbox Awesomeness

Work recently upgraded us to the latest MS Office edition for our platform. Most of our staff run MacOS, but we have a handful of Windows users and one dedicated Linux user (guess who?) who received MS Office 2010 for Windows.

I’ve been using MS Office 2007 under Wine for several years, it was never perfect, but about 90% of the functionality worked with some exceptions such as PDF export and certain UI and performance artifacts.

With the 2010 upgrade I decided to instead switch to using Windows under a VM on my laptop to avoid any headaches and to fix the missing features and performance issues experienced running Office under Wine.

Whilst I’m a fan of Xen and KVM, they aren’t so well suited for desktop virtualisation as they’re designed more for server environments and don’t offer some of the more desktop focused features such as seamless integration, video acceleration and easy point & click management interfaces.

Instead I went with VirtualBox thanks to it being mostly open source (open source with exception for a few extensions for USB 2.0 forwarding and network boot) and with a pretty good reputation as a decent VM application.

It also has some of the user-friendly desktop features you’d expect such as being able to forward USB hardware through to guest, mounting any folder on the host as a network share (without needing to setup samba) and 2D/3D video acceleration.

But the real killer feature for me was the seamless windows feature, which allows me to boot the virtual windows desktop and Windows applications alongside my Linux applications smoothly and without the nastiness of an RDP window.

Windows & Linux application windows running together concurrently.

Sadly it’s not quite good enough for you to be able to run the latest Windows games in as the 3D acceleration is quite basic, but it’s magnificent for just about any other non-multimedia application.

The only glitch I found, is that if you have dual screens, you can only run the windows session on one screen at a time, although virtualbox does allow moving the session between monitors whilst running so it’s not too big a deal.

The other annoying issue I had with virtualbox is that it uses image files for storing the guest VMs and it doesn’t appear possible to get it to use an LVM volume instead – so in my case, I waste a bit of space and performance for unnecessary filesystem formatting to store the Windows VM. I guess this is a feature that only a small subset of users would want so it’s not particularly high priority for them to add it.

I’m running Win7 with 2 virtual cores and 1GB of RAM on top of a host with an Intel Core i5 CPU (with hardware virtualisation enabled), 8GB RAM and a Intel 320 series SSD and it’s pretty damn snappy.

As a side note, the seemless window integration also works for Linux-based guests, so you could also do the same ontop of a Windows host, or even Linux-on-Linux if desired.

KVM/libvirt change CDROM

I was setting up some Windows virtual machines this evening on my Linux KVM/libvirt server, in order to experiment with how Windows handles IPv6 networks.

Installing windows was easy enough – standard virt-install commands, however post-reboot, Windows XP wants to access the CDROM again.

However the reboot causes the CDROM ISO to be unattached from the virtual CDROM drive – so it’s necessary to re-add it to continue installation

However the logical syntax based on virsh help, doesn’t work:

virsh # attach-disk devel-winxp1 /tmp/winxp.iso hdc
error: Failed to attach disk
error: this function is not supported by the connection driver: disk bus 'ide' cannot be hotplugged.

The correct syntax is:

virsh # attach-disk devel-winxp1 /tmp/winxp.iso hdc --type cdrom --mode readonly 
Disk attached successfully

Basically you need to tell libvirt that you’re attaching a *cdrom* and not an actual disk – I’m not sure why it doesn’t just figure that out, based on the fact the user is trying to obviously attach an ISO to a virtual optical drive device – maybe nobody has gotten around to implementing a nice autodetect method yet…

DHCP, I/O and other virtualisation fun with KVM

I recently shifted from having two huge server racks down to having a single speedy home server running KVM virtual machines, with the intent of packaging all my servers – experimental, development, staging, etc, into a single reliable system which will reduce power and maintenance costs.

As part of this change, I went from having dedicated DHCP & DNS servers to having everything located onto the KVM host.

The design I’ve used, has the host OS running with minimal services – the host just runs KVM, OpenVPN, DHCP and a DNS caching nameserver – all other services run as guest VMs, with a virtual network for the guests and host to communicate over.

Guests run as DHCP clients – this makes it easy to assign or adjust addressing if needed and get their information from the host OS.

However this does mean you can’t get away with hammering the host too badly – for example, running an I/O and network intensive backup can cause some interesting problems when you also need the host for services, such as DHCP.

Take a look at the following log messages from a mostly idle VM – these were taken whilst another VM on the server was running a bonnie++ process to test performance:

Mar  6 10:18:06 virtguest dhclient: 5 bad udp checksums in 5 packets
Mar  6 10:18:27 virtguest dhclient: DHCPREQUEST on eth0 to 10.8.12.1 port 67
Mar  6 10:18:45 virtguest dhclient: DHCPREQUEST on eth0 to 255.255.255.255 port 67
Mar  6 10:19:00 virtguest dhclient: DHCPREQUEST on eth0 to 255.255.255.255 port 67
Mar  6 10:19:07 virtguest dhclient: DHCPREQUEST on eth0 to 255.255.255.255 port 67
Mar  6 10:19:15 virtguest dhclient: DHCPREQUEST on eth0 to 255.255.255.255 port 67
Mar  6 10:19:15 virtguest dhclient: 5 bad udp checksums in 5 packets

That’s some messed up stuff – what you’re seeing is that the guest VM is trying to renew the DHCP address with the host server – but the host is so sluggish with having to run the I/O intensive virtual machine that is actually corrupting or dropping the UDP packets, preventing the guest VM from renewing it’s address.

This of course raises the most important question: What happens if the guest can’t renew it’s IP address?


In this case, the Linux/CentOS 5 guest VM actually completely lost it’s IP address after a long period of DHCPREQUEST attempts, fell off the network entirely and caused my phone to go nuts with Nagios alerts.

Now of course in any sane production environment, nobody would be running a bonnie++ processes on a VM on an active server – however there’s some pretty key points still made here:

  • The isolation is a lie: Guests are only *somewhat* isolated from one another – one guest can still mess with another and effectively denial-of-service attack the other VMs by utilising all the available resources.
  • Guests can be jerks: Organisations running KVM (or some other systems) with untrusted guest VMs should carefully consider how they are going to monitor and protect the service from users running crazily resource intensive processes. (after all, there will be someone who wants to bonnie++ test their new VM simply for the lols).
  • cgroups to the rescue? Linux cgroups does have an I/O controller (blkio-cgroup) although whilst this controls read/write flow, it won’t restrict seeks which can also badly impact spinning rust based servers.
  • WTF DHCP? The approach of the guests simply dropping their DHCP address after losing contact with the DHCP server is a pretty bad design limitation – if the DHCP server is unreachable, it should keep the original address (of course if the “physical” ethernet connection dropped, that would be a different situation, and it should drop it’s address to match).
  • Also: I wonder what OSes/distributions have the above behavior?

I’m currenting running a number of bonnie++ tests on my KVM server and will have a blog post in the near future detailing these findings in more detail, I’m also planning to look into cgroups and other resource control and limiting functions and will report back on how these fare when you have guest VMs running heavy processes.

Overall it made my weekend of geekery that bit more exciting. :-D