Keeping Android Wifi Awake

I run a number of backgrounded applications on my Android phone, such as Nagios (server monitoring), CSipSimple (VoIP/SIP), OpenVPN (SSL-based VPN) and IMAP idle (push email).

Whilst this does impact battery life somewhat, I’ve got things reasonably well tuned so that the polling and keepalive intervals are short enough to prevent firewall timeouts, but long enough to avoid excessive waking of the 3G & wifi hardware.

(For example, the default OpenVPN keepalive of 10 seconds is far more aggressive than what is actually needed. In reality, I was able to drop my phone back to one keepalive every 5 minutes – short enough to keep sessions active, but long enough that the transmitting hardware can sleep regularly whilst it isn’t needed.)
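
In OpenVPN terms that tuning is just the keepalive directive, which expands to ping and ping-restart – a minimal sketch, where the 900-second restart window is my own illustrative choice rather than anything from a stock config:

# ping the peer every 300 seconds; assume the link is dead after 900 seconds of silence
keepalive 300 900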

However, one problem I wasn’t able to fix was how often the wifi disconnected – this would really screw things up, since services such as IMAP idle and SIP were trying to run over the VPN, and the VPN would be broken whenever the wifi turned itself off.

I found the fix thanks to a friend who came around and told me about the hidden “Advanced” menu in the wifi network selection page:

When on the wifi network selection screen, you need to press the menu key (I’m not sure what the equivalent is for newer Android phones that no longer have a menu key) and a single “Advanced” menu item will appear.

Selecting this item will give you a couple of extra options, including the important “Keep Wi-Fi on during sleep” option, which stops the phone from dropping the wifi connection whenever you turn off the screen.

This resolved my issues with backgrounded services, and I found it also made the phone generally perform better for any data-related services, since the wifi didn’t have to renegotiate with the AP as frequently.

It’s not totally perfect – Android seems to sometimes have an argument with the AP, drop the connection and waste a minute trying to reconnect – but it’s a lot better than it was. :-)


cifs, ipv6 and rhel 5

Unfortunately, with my recent project enabling IPv6 across my entire personal server environment, I’ve bumped into a number of annoying issues – nothing that isn’t fixable, but things that are generally frustrating and which just shouldn’t be an issue.

Particular thanks goes to my many RHEL/CentOS 5 virtual machines, which lack some pretty key stuff such as:

  • IPv6 connection tracking – without it, the ESTABLISHED,RELATED ip6tables rules can’t work (an example rule is shown after this list).
  • Unexpected behavior of certain bootscript configuration options.
  • Lack of IPv6 support with CIFS (Samba/SMB) share mounting.
  • Some weirdness with Dovecot I still need to resolve.
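
(For reference, the stateful rule that breaks without IPv6 connection tracking is the standard accept rule, shown here with the state-match syntax of that era:

# ip6tables -A INPUT -m state --state ESTABLISHED,RELATED -j ACCEPT

On a stock RHEL 5 kernel this rule can’t do its job, as there’s no IPv6 connection state for it to match against.)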

(Personally, based on the number of headaches I’ve found with RHEL 5, my recommendation is to accelerate any plans to upgrade to RHEL 6 – or some other distribution – before deploying IPv6 in production.)

At the moment, CIFS IPv6 support on RHEL 5 & 6 has been causing me the most pain. My internal file server is dual stacked and has both A and AAAA DNS records. It’s a stock-standard CentOS 6 box running distribution-shipped Samba packages; everything works perfectly from the server side, and modern IPv6 hosts have no issue mounting the shares via IPv6.

A very typical dual stack configuration:

# host fileserver.example.com 
fileserver.example.com has address 192.168.0.10
fileserver.example.com has IPv6 address 2001:0DB8::10

However, when I run the following legitimate and syntactically correct command on my RHEL 5 hosts to mount the CIFS share provided by the Samba server, it breaks with an error message typical of incorrect mount option syntax:

# mount -t cifs //fileserver.example.com/tmp /mnt/tmpshare -o user=nobody
mount: wrong fs type, bad option, bad superblock on //fileserver.example.com/tmp,
       missing codepage or other error
       In some cases useful info is found in syslog - try
       dmesg | tail  or so

Taking a look at the kernel log, it shows a non-descriptive error:

kernel:  CIFS VFS: cifs_mount failed w/return code = -22

This isn’t particularly helpful, made more infuriating by the fact that I know the command syntax is correct and should be working perfectly fine.

Seeing as a number of things broke after switching on IPv6 across the entire network, I’ve become even more of a cynical bastard, and ran some tests using explicitly specified IPv6 and IPv4 addresses in the mount command.

I found that by passing the IPv6 address instead of the DNS name, I could produce an additional error message which offers some insight:

kernel: CIFS: ip address too long

Huh. Looks like a textbook IPv6 support bug to me. (Even I have made this mistake in some older-generation web apps that didn’t foresee long 128-bit addresses.)

In testing, I found that the following commands are all acceptable on a dual-stack network with a RHEL 5 host:

# mount -t cifs //192.168.0.10/tmp /mnt/tmpshare -o user=nobody
# mount -t cifs //fileserver.example.com/tmp /mnt/tmpshare -o user=nobody,ip=192.168.0.10

However, all ways of specifying IPv6 will fail, as will plain DNS resolution:

# mount -t cifs //2001:0DB8::10/tmp /mnt/tmpshare -o user=nobody
# mount -t cifs //fileserver.example.com/tmp /mnt/tmpshare -o user=nobody,ip=2001:0DB8::10
# mount -t cifs //fileserver.example.com/tmp /mnt/tmpshare -o user=nobody

No method of connecting via IPv6 would work, leaving stock RHEL 5 hosts only able to mount CIFS shares via IPv4. :-(

Unfortunately this error is due to a known kernel bug in 2.6.18, which was fixed in 2.6.31 but sadly never backported to RHEL 5’s kernel (as of version 2.6.18-308.8.1.el5, anyway), leaving RHEL 5 users in a position where the stock OS is unable to mount CIFS shares on an IPv6 or dual-stacked network. :-(

The ideal solution would be to patch the kernel to resolve the issue – and in fact, if you are running a native IPv6-only (not dual-stacked) network, it is the only option for a working solution.

However, if you’re using RHEL, custom kernels typically aren’t that popular due to the impact they have on the supportability/guarantee of the platform by the vendor, and the added headaches of tracking and applying security updates, so another approach is needed.

The following methods will all work on stock RHEL/CentOS 5:

  • Use the ip=X mount option to override DNS (as per the working example above).
  • Add an entry to /etc/hosts (see the sketch after this list).
  • Have a separate DNS entry that only has an A record for your file servers (e.g. //fileserverv4only.example.com/).
  • Disable IPv6 entirely (and suffer the scorn of your cooler IPv6-enabled friends).
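
For example, the /etc/hosts approach just pins the fileserver’s name to its IPv4 address on each affected RHEL 5 host, using the addresses from the example network above:

# /etc/hosts - force IPv4-only resolution for the fileserver
192.168.0.10    fileserver.example.com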

These solutions all suck – manually fixed IPs aren’t great for long-term supportability, additional DNS records are an extra pain to manage, and let’s not even begin to cover why disabling IPv6 entirely is wrong.

Of course RHEL 5 is a little outdated now, so I took a look at how RHEL 6 fared. On the plus side, it *can* mount IPv6 shares – all of the following mount commands are accepted without fault:

# mount -t cifs //192.168.0.10/tmp /mnt/tmpshare -o user=nobody
# mount -t cifs //2001:0DB8::10/tmp /mnt/tmpshare -o user=nobody
# mount -t cifs //fileserver.example.com/tmp /mnt/tmpshare -o user=nobody,ip=192.168.0.10
# mount -t cifs //fileserver.example.com/tmp /mnt/tmpshare -o user=nobody,ip=2001:0DB8::10

However, any mount of an IPv6 server using the DNS name will still fail, just like it did on RHEL 5:

# mount -t cifs //fileserver.example.com/tmp /mnt/tmpshare -o user=nobody

The solution is to install the “cifs-utils” package, which provides the /sbin/mount.cifs binary offering smarter handling of shares – once installed, all the mount commands above will work on RHEL 6, including the standard DNS-based command we all know and love. :-D
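
On RHEL 6 it’s a one-line fix, assuming the standard distribution repositories are enabled:

# yum install cifs-utils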

I had always assumed that all Linux systems that could mount CIFS shares had the /sbin/mount.cifs binary installed, but it seems that’s not the case – the standard /bin/mount command can handle mounting CIFS using just the standard kernel mount() function.

However, when /bin/mount detects a /sbin/mount.FILESYSTEM binary, it will call that process instead of calling the kernel mount() directly – these binaries can apply additional logic and handling to the mount command before passing it through to the Linux kernel.
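
You can see which of these helper binaries a host has installed with a quick glob:

# ls /sbin/mount.*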

For example, the following strace from a RHEL 5 host shows that /bin/mount checks for the existence of /sbin/mount.cifs, before going on to call the Linux kernel mount() directly with the provided arguments:

stat64("/sbin/mount.cifs", 0xbfc9dd20)  = -1 ENOENT (No such file or directory)
...
mount("//fileserver.example.com/tmp", "/mnt", "cifs", MS_MGC_VAL, "user=nobody,password=nobody") = -1 EINVAL (Invalid argument)

But a RHEL 6 host with cifs-utils installed provides /sbin/mount.cifs, which appears to do its own name resolution, establishes a connection to both the IPv4 and IPv6 sockets, decides which to use, and then instructs the kernel using the ip=X parameter:

stat64("/sbin/mount.cifs", {st_mode=S_IFREG|0755, st_size=29376, ...}) = 0
clone(Process 1666 attached
...
[pid  1666] mount("//fileserver.example.com/tmp/", ".", "cifs", 0, "ip=2001:0DB8::10,user=nobody,password=nobody") = 0

So I had an idea… what if I could easily modify a version of cifs-utils to run on RHEL 5 dual-stack servers, yet only ever resolve DNS queries to IPv4 addresses, to work around the kernel issue? :-D

Turns out you can – effectively I just made the nastiest hack ever by tearing out the IPv6 name resolver. :-/

I’m going to hell for this, but damn, feels good man. ;-)

I wasn’t totally evil though – I added an info-level syslog notice about the IPv4 enforcement in case any poor admin ever gets puzzled by someone’s customized RHEL 5 box refusing to connect to CIFS shares via IPv6 – that would be a bit too cruel. ;-)

The hack is pretty crude – it actually just breaks the IPv6 socket connection attempt so that it falls back to IPv4 – so it throws up a couple of errors in the logs, but doesn’t actually impact the mounting at all:

mount.cifs: Warning: Using specially patched cifs-utils to ignore IPv6 address resolution - enforcing IPv4 only!
kernel:  CIFS VFS: Error connecting to socket. Aborting operation
kernel:  CIFS VFS: cifs_mount failed w/return code = -111

But wait, there’s more! I have shiny cifs-utils i386/x86_64/SRPM packages with this evil hack available for download from the amberdms-os repository (or directly from the server here).

Naturally this is a bit of a kludge – don’t trust it for mission-critical stuff; you ONLY need it for RHEL 5, not RHEL 6; and I can’t guarantee it won’t eat all your data and bring upon the end times, etc, etc.

I’ve tested it on my devel systems and it seems like the nicest fix – sure, it won’t work for any hosts needing to run on native IPv6, but by the time I come to drop IPv4 addressing entirely, I will certainly have moved my last hosts from RHEL 5 to something a bit newer. :-)


Largefiles strike again!

With modern Linux systems – hell, even systems from 5+ years ago – there’s usually very little issue with handling large files (> 2GB); in fact, files considered large a decade ago are now tiny in comparison.

However, sometimes poor sysadmins like myself have to support much older machines – in my case, a legacy accounting platform tied to the RHEL 2.1 host it was installed on – and you suddenly get to re-discover the headaches that plagued the sysadmins before us.

In my case, the backup scripts for this application suddenly stopped working recently with the following error:

cpio: standard input is closed: Value too large for defined data type

Turns out that their data had finally crept over the 2GB limit, which left cpio able to write the backup, but unable to read it back for verification or restore purposes.

Thankfully cpio does support largefiles, but it’s a case of adding -D_FILE_OFFSET_BITS=64 to the gcc options at build time, so I built a new package with that flag, which fixes the problem (or at least until we hit the 16GB filesystem limits). ;-)
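
For anyone wanting to do the same, a rough sketch of the rebuild, assuming a cpio 2.4.2-era source tree compiled by hand (the version and flags here are illustrative):

$ tar xzf cpio-2.4.2.tar.gz && cd cpio-2.4.2
$ CFLAGS="-O2 -D_FILE_OFFSET_BITS=64" ./configure
$ make && make install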

The version of cpio on the server is ancient, dating back to 2001 (with RHEL 2.1 first released in 2002), so it’s over a decade old now. I found it quite difficult to obtain the source for the specific installed version – Red Hat seemed to be missing the exact release (they have -23 and -28, but not -25) – so I pulled the Red Hat 8 source, which comes from around the same time period. One of the advantages of having RHN is being able to quickly pull old packages, both binary and source. :-)

If you have this exact issue with a legacy system using cpio, feel free to grab my binary or source package from my repos and save yourself some build time. :-)


Bit Flipping Cycle Lanes

On my recent walk to Devonport I was amazed at the design of the cycle lanes made by the North Shore City Council (now part of the amalgamated Auckland City Council).

Aside from the initial amazement that such a car-focused city knew what cycle lanes were, I was also extremely amused to see how exactly they chose to implement them…


Exhibit A: The flipped cycleway.

Having a standard such as “people to the left, bikes to the right” clearly wasn’t exciting enough, so let’s have the bike and pedestrian lanes randomly swap sides at each junction.

Quick everyone, change places!


Exhibit B: Multipathing!

Having just one bike lane isn’t enough, so let’s add a second – one on the footpath and one on the road. And whilst we’re at it, let’s make it go bike, pedestrian and then bike again. :-/

Pedestrians: Cyclist sandwich filling.

Not pictured are the other great cycle designs I came across on my wander including:

  • The suddenly-ending-and-then-restarting bike lane.
  • The going-on-and-off-the-footpath bike lane.
  • The bizarre invisible bike lane – found in one suburb, where a single bike symbol was painted on the side of the road in a side street, with no other markings around, not even a cycle lane line marking.

Whilst it’s great to see a council working to lay some cycle lanes, the lack of thought around planning and standardization of the lanes is a source of great amusement, but also a potential risk to both cyclists and pedestrians if these lanes start getting used more heavily.


Look to the past to see the future

I came across a great tweet the other day that pretty much sums up the whole marriage equality debate being had across the world:

All this has happened before. All this will happen again. ~ Scrolls of Pythia, Battlestar Galactica

Pretty happy that I come from a country that recognizes the rights and privileges of my LGBTWTF friends – it’s not 100% perfect yet, but it’s getting there.

Under NZ law, gay couples can get a civil union, but not a marriage – the only technical difference is terminology; however, due to a poorly structured bit of legislation, a gay couple can’t adopt, as the law explicitly requires a “married” couple.

I’m hopeful that it won’t be too much longer before we can fix that final bit of legislation to make a civil union or marriage available to any couple, with exactly equal standing. :-)


Matangi Trains

I was in Wellington the other week to catch up with friends and family and had the opportunity to catch the new Matangi trains out to Johnsonville – you might remember my previous trip out there featured the pre-WW2 relics, so it was exciting to check out some 21st century transportation. :-)

In some ways it’s sad to lose the old relics, since they were great fun as a visitor, but I can imagine the locals are grateful for some of the more modern comforts and quietness.

Speedy train is speedy! (or crappy phone camera is crappy)

I do think showing the train's model name rather than the actual destination is going to be pretty unhelpful for tourists – I'd be pretty worried if I was trying to catch the "Johnsonville" train and it had a sign saying "Matangi". :-/

Nice and new :-)

Of particular interest is that the Johnsonville units are specially marked, as they feature “wheel flange lube” – apparently this helps reduce wear on the tight Johnsonville line rails by keeping the wheels lubricated.

Wheel flange lube? Sounds kinky!


mailx contains invalid character

Whilst my network is predominantly CentOS 5 hosts, I’ve started moving many of them to CentOS 6, mostly on the basis of upgrading whenever a host needs a newer version of something, since I don’t really want to spend an entire week rebuilding all 30-odd VMs.

One problem I encountered was a number of scripts failing when sending emails, throwing out messages to STDERR:

[example] contains invalid character '['
send-mail: invalid option -- 's'
send-mail: invalid option -- 's'
send-mail: fatal: usage: send-mail [options]

What I found is that on CentOS/RHEL 5, the following would work fine:

# mail root -s "[example] message"
test message content
Cc: 
#

But on CentOS/RHEL 6, it would ignore the subject field (as can be seen by it prompting for the subject again) and then fail with an annoying “invalid character” error:

# mail root -s "[example] message"
[example] contains invalid character '['
Subject: 
test message content
EOT
#
# send-mail: invalid option -- 's'
send-mail: invalid option -- 's'
send-mail: fatal: usage: send-mail [options]
#

Turns out that between mailx version 8.1.1 and mailx version 12.4, the mailx binary got a lot fussier about the formatting of the command line options.

Viewing the help on both versions shows that options need to come before the destination user; however, it seems that older versions of mailx were a bit more relaxed and accepted some flexibility in the ordering of command line options.

Usage: mail -eiIUdEFntBDNHRV~ -T FILE -u USER -h hops -r address \
 -s SUBJECT -a FILE -q FILE -f FILE -A ACCOUNT -b USERS -c USERS \
 -S OPTION users

The correct solution is to always have the target user as the final field, after the command line options:

# mail -s "[example] message" root
test message content
Cc: 
#

This will work happily on all versions, since it’s the correct command line syntax.
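
The same ordering applies when scripts pipe a message body in, rather than typing it interactively – a minimal example based on the commands above:

# echo "test message content" | mail -s "[example] message" root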

Hopefully everyone else is smart enough to do this the right way the first time, but I figured I’d post this in case some other poor sysadmin is having the same confusion over the invalid character message. :-)


Welly June Meets

For those of you in Wellington, I’m aiming to do a couple of meetups with friends on my trip this weekend – it’s going to be a pretty busy trip, but I’m keen to catch up as much as possible.

I land in Wellington on Saturday evening, so first priority will be getting some tasty food into me. :-)

  • Saturday – 19:30 dinner in Cuba St (probably Little India), followed by drinks.
  • Sunday – 15:00 (may vary) coffee in CBD.
  • Monday – Available for morning brunch.

If you want to meet up, let me know. :-)


Inside Rain

I volunteered to help a friend move this weekend, expecting a relatively straightforward process of putting a few computers, bags and boxes into my car and shifting them to another place.

However there was still a bit of cleanup necessary – no thanks to previous flatmates who had left, leaving lots of stuff behind for others to deal with – so I got stuck into helping tidy up the flat.

The apartment has an upstairs space, so I went up there to tidy up some stuff, minding my head under the low roof.

Notice that lovely sprinkler pipe aimed down at head height?

Unfortunately when I stood up a bit too far, I knocked the sprinkler with my back – not particularly hard, I don’t have any bruises or anything – but enough to break the glass and cause a torrent of water to pour over me and into the apartment. :-(

Somewhat bad luck on my part, but the real issue was the combination of the poor placement of the sprinkler and a fatigued/damaged sprinkler head.

Most of the sprinklers in the apartment have a metal shield protecting the glass bulb that, when shattered, sets off the sprinkler in that area (the design being that a certain amount of heat shatters the glass and triggers the sprinkler).

A sane sprinkler - sprinkler head aimed upwards, so someone bumping the roof isn't going to hit the fragile end.

However the sprinkler I triggered looks more like this:

The dodgy sprinkler - note the missing metal shield around it, and the fact that it's facing down where it can be easily hit, rather than upright.

Without a before picture we can’t see what it originally looked like, but there’s no metal shield on the sprinkler, nor were any parts of it found in the post-flood clean up, so it definitely appears to be an existing fault – which was the verdict of the fire department, and to which the property manager agreed.

Thankfully that simplifies liability – I’m not entirely sure whose insurance would be liable otherwise, whether it would fall under the property owner’s insurance, the tenant’s contents insurance or the indemnity insurance of my own contents policy – regardless, it certainly shows the importance of actually having insurance, not just for your stuff, but for the liability protection.

Also, “sprinkler” really doesn’t aptly name these devices considering how much water pours out of them….

One sprinkler sure makes a bit of a flood....

Ground Zero - thankfully most of the water ended up in the bathroom and slowly draining away through the floor drains.

Priority number 1 - save the computers!

Thankfully it doesn’t appear that any assets got too damaged and the apartment should dry out with minimal impact, so it was a close call. However along with the other problems of apartment living, this incident gives me yet more reasons to avoid living in apartments.

Also, any friends asking me to help move apartments again are going to get a polite no – even the time I had to carry a piano down a hill to a Wellington house was less annoying than dealing with apartment parking, body corporates and fire suppression system flooding. :-)


Munin Performance

Munin is a popular open source network resource monitoring tool which polls the hosts on your network for statistics for various services, resources and other attributes.

A typical deployment will see Munin being used to monitor CPU usage, memory usage, the amount of traffic across network interfaces, I/O statistics and more – it’s very handy for seeing long term performance trends and for checking the impact that upgrades or adjustments to the environment have made.

Whilst it has some overlap with Nagios, Munin isn’t really a replacement, more an addition – I use Nagios to do critical service and resource monitoring, and Munin to graph things in more detail – something that Nagios doesn’t natively do.

A typical Munin graph - Munin provides daily, weekly, monthly and yearly graphs (RRD powered)

Rather than running as a daemon, the Munin master runs a cronjob every 5 minutes that calls a sequence of scripts to poll the configured servers and generate new graphs (the stock cron entry is shown after this list):

  1. munin-update to poll configured hosts for new statistics and store the information in RRD databases.
  2. munin-limits to highlight perceived issues in the web interface and optionally to a file for Nagios integration.
  3. munin-graph to generate all the graphs for all the services and hosts.
  4. munin-html to generate the html files for the web interface (which is purely static).
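
For reference, the packaged cron entry driving this sequence is generally a wrapper script along these lines – exact paths vary by distribution, so treat this as the general shape rather than a verbatim copy of any one package:

*/5 * * * *  munin  test -x /usr/bin/munin-cron && /usr/bin/munin-cron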

The problem with this model is that it doesn’t scale particularly well – once you start monitoring a substantial number of servers, the step-by-step approach can run out of resources and time to complete within the 5-minute cron period.

For example, the following are the results for the 3 key scripts that run on my (virtualised) Munin VM monitoring 18 hosts:

sh-3.2$ time /usr/share/munin/munin-update
real    3m22.187s
user    0m5.098s
sys     0m0.712s

sh-3.2$ time /usr/share/munin/munin-graph
real    2m5.349s
user    1m27.713s
sys     0m9.388s

sh-3.2$ time /usr/share/munin/munin-html
real    0m36.931s
user    0m11.541s
sys     0m0.679s

That’s a total of around 6 minutes to run – long enough that one run will start clashing with the next.

So why so long?

Firstly, munin-update – its time is mostly spent polling the munin-node daemon running on all the monitored systems, plus a small amount of I/O time writing the new information to the on-disk RRD files.

The developers appear to have realised the issue of scale with munin-update and have added the ability to run it in a forked mode – however this broke horribly for me in a highly virtualised environment, since sending a poll to 12+ servers all running on the one physical host would cause a sudden load spike and lead to service poll timeouts, with no values being returned at all. :-(

This occurs because, by default, Munin allows a maximum of 5 seconds for each service query to complete, and queries all the hosts and services rapidly, ignoring any that fail to respond fast enough. When querying a large number of servers on one physical host, that host would simply be too loaded to respond in time.

I ended up boosting the timeouts on some servers to 60 seconds (particularly the KVM hosts themselves, as there would sometimes be 60+ LVM volumes that Munin wanted statistics for), but it still wasn’t a good solution and the load spikes continued.

There are some tweaks that can be used, such as adjusting the maximum number of forked processes, but it ended up being more reliable and easier to support to just run a single thread and make sure it completed as fast as possible – and taking 3 minutes to poll all 18 servers and save to the RRD databases is pretty reasonable, particularly for a staggered polling session. The sketch below shows where these two knobs live.
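
A sketch of the two settings – the values are the ones described above, but treat the exact file locations and option names as assumptions to verify against your Munin version’s documentation:

# /etc/munin/munin-node.conf on the heavily loaded monitored hosts:
timeout 60

# /etc/munin/munin.conf on the master, keeping munin-update single-threaded:
max_processes 1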


After getting munin-update to complete in a reasonable timeframe, I took a look at munin-html and munin-graph – both of these processes involve reading the RRD databases off the disk and then writing HTML files and RRDtool graphs (PNG files) to disk for the web interface.

Both processes have the same issue – they chew a solid amount of CPU whilst processing data, and then get stuck waiting for the disk I/O to catch up when writing the graphs.

The I/O on this server isn’t the fastest at the best of times, considering it’s an AES-256 encrypted RAID 6 volume, and the time taken to write around 200MB of changed data each run was a bit too much to do efficiently.

Munin offers some options, including on-demand graph generation using CGIs, however I found this just made the web interface unbearably slow to use – although from chats with the developer, it sounds like version 2.0 will resolve many of these issues.

I needed to fix the performance of the current batch generation model. Just watching the processes in top quickly shows the issue with the scripts, particularly with munin-graph, which runs 4 concurrent processes, all of them waiting for I/O. (Linux process state crash course: S is sleeping (idle), R is running, D is performing I/O operations – or waiting for them.)

Clearly this isn’t ideal – I can’t do much about the underlying performance, other than considering moving the monitoring VM onto a different I/O device without encryption – although I’d then lose all the advantages of having everything on one big LVM pool.

I do, however, have plenty of CPU and RAM (quad-core Phenom, 16GB RAM), so I decided to boost the VM from 256MB to 1024MB RAM and set up a tmpfs filesystem, which is an in-memory filesystem.

Munin has two main data sources – the RRD databases and the HTML & graph outputs:

# du -hs /var/www/html/munin/
227M    /var/www/html/munin/

# du -hs /var/lib/munin/
427M    /var/lib/munin/

I decided that putting the RRD databases in /var/lib/munin/ into tmpfs would be a waste of RAM – remember that munin-update runs single-threaded and spends most of its time waiting on network polls, meaning that the I/O writes are spread out and not particularly intensive.

The other problem with putting the RRD databases into tmpfs is that a server crash/power down would lose all the data, which would then require regular processes to copy it to a safe place, etc, etc – not ideal.
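
(If you did want to go down that path, the “regular process” would be something like a hypothetical cron job syncing the in-memory RRDs back to persistent disk – purely a sketch, with illustrative paths:

0 * * * *  root  rsync -a --delete /var/lib/munin/ /var/lib/munin-persist/

…but you’d still lose up to an hour of data on a crash, which is why I didn’t bother.)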

However, the HTML & graphs are generated fresh each time, so losing their data isn’t an issue. I set up a tmpfs filesystem for them in /etc/fstab with plenty of space:

tmpfs  /var/www/html/munin   tmpfs   rw,mode=755,uid=munin,gid=munin,size=300M   0 0
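
Once the fstab entry is in place, it can be activated without a reboot:

# mount /var/www/html/munin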

I then ran some performance tests:

sh-3.2$ time /usr/share/munin/munin-graph 
real    1m37.054s
user    2m49.268s
sys     0m11.307s

sh-3.2$ time /usr/share/munin/munin-html 
real    0m11.843s
user    0m10.902s
sys     0m0.288s

That’s a decrease from 161 seconds (2.68 mins) to 108 seconds (1.8 mins). It’s a reasonable improvement, but the real difference is the massive reduction in load on the server.

For a start, we can see from watching the processes with top that the processor gets worked a bit harder to complete the process, since there’s not as much waiting for I/O:

With the change, munin-graph spends almost all its time doing CPU processing rather than creating I/O load – although there’s the occasional period of I/O as above, I suspect from the time spent reading the RRD databases off the slower disk.

Increased bursts of CPU activity are fine – it actually works out to less CPU load overall, since there’s no need for the CPU to be doing disk encryption, and hammering one core for a short period is fine; there are plenty of other cores, and Linux handles scheduling of resources pretty well.

We can really see the difference with Munin’s own graphs for the monitoring VM after making the change:

In addition, the host server’s load average has dropped significantly, and the load time for the web interface is insanely fast – no more waiting for my browser to finish pulling all the graphs down for a page; instead it loads in a flash. Munin itself gives you an idea of the difference:

If performance continues to be a problem, there are some other options, such as moving the RRD databases into memory, patching Munin to do virtualisation-friendly threading for munin-update, or looking at better ways to fix CGI on-demand graphing – but the tmpfs changes help a good deal to start with.
