
FreeBSD in the cloud

This weekend I was playing around with FreeBSD in order to add support for it to Pupistry. Although I generally use Linux exclusively, it’s fun to play around with other platforms now and then, a bit like going on vacation. Plus building support for other platforms ensures that I’m writing code that’s more portable.

FreeBSD is probably the most popular BSD in use and it’s the only one available for download from the Amazon Web Services (AWS) Marketplace and as a supported platform from Digital Ocean alongside their Linux offerings.

However as popular as FreeBSD is, it pales in comparison to Linux, which means that it doesn’t get as much love and things don’t work quite as seamlessly with these cloud providers. In my process of testing FreeBSD with both providers I ran into some interesting feature differences and annoyances.

 

FreeBSD on Digital Ocean

I started with Digital Ocean first – I love them, since they’re a nice, simple, cheap cloud provider for personal stuff. There’s not much need for the AWS enterprise feature set when I’m building personal machines, and paying the price of a coffee for a month of compute sure is nice.

They provide a FreeBSD 10.1 image via the usual droplet creation screen. I have to give Digital Ocean credit for such a nice, clean, simple interface – limiting user selection does make it much more approachable for people, something Apple always understood with their products.


As always Digital Ocean is pretty speedy, bringing up a machine within a minute or so.

Digital Ocean provides a pretty recent image with pkg already installed and ready to go, although you’ll want to run the update process to get the latest patches. You need to login initially as the freebsd user and can then sudo to acquire root powers.
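For illustration, that first-login update routine looks something like this – the droplet address is a placeholder, and the commands are just the standard FreeBSD ones rather than anything Digital Ocean specific:

# Sketch only: patch a fresh FreeBSD droplet.
ssh freebsd@203.0.113.10
sudo freebsd-update fetch install    # base system security/errata patches
sudo pkg update && sudo pkg upgrade  # binary package updates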

Overall it’s great – so naturally there is a catch. Digital Ocean doesn’t yet support user data with their droplets. So whilst you can fill in the user data field, it won’t actually get executed.

This is pretty annoying for anyone wanting to automate large numbers of machines, since it now means you have to SSH to each of them to get them provisioned. I’ve raised a question on their community forum around this issue, but I wouldn’t expect a quick fix since the upstream bsd-cloudinit project they use hasn’t implemented support yet either.

It’s not going to be the end of the world for most people, but it could be a barrier if you’re wanting to roll out a fleet of BSD boxen.

The best feature from Digital Ocean is actually their documentation – with the launch of FreeBSD on their platform, they’ve produced some excellent tutorials and guides to their platform which can be found here and are useful to both Linux gurus and noobs alike.

Finally their native IPv6 support extends to FreeBSD, so your machines can join the internet of the 21st century from day one.

 

FreeBSD on Amazon Web Services (AWS)

Next I spun up an instance in Amazon Web Services (AWS) which is the granddaddy of cloud providers and provides an impressive array of functionality, although this comes at a cost premium over Digital Ocean’s tight pricing.

It was actually the first time in a long time that I’d built a machine via the AWS web console – normally for work we just build all of our systems via CloudFormation – and it was an interesting experience to see the usability difference between AWS’s setup pages and Digital Ocean’s.

The fact that the launch wizard has 7 different screens says a lot, and I suspect AWS is at risk of having its consumer user base eaten by the likes of Linode and Digital Ocean – but when a consumer user is paying $5.00 a month and an enterprise customer pays $300,000 a month, I suspect AWS isn’t going to be too worried.

Launching a FreeBSD instance is not really any different to launching a Linux one – you just need to search for “freebsd” in the AWS Marketplace to find the AMI and launch as normal.


 

Once launched, things get more interesting. Digital Ocean’s FreeBSD instance came up in around 1 minute which is standard for their systems – but AWS took a whopping 8-10mins to launch the AMI to the level where I could login via SSH!

Digging into the startup log reveals why – it seems the FreeBSD AMI (Amazon Machine Image) launches the instance, then runs a prolonged update task (freebsd-update fetch install), before doing a subsequent reboot and finally starting SSH.

Whilst I appreciate the good default security posture this provides, there’s a few issues with it:

  1. It differs from most other AWS images which deal with patching by having new images built semi-frequently and leaving the patching in-between up to the admin’s choice.
  2. During the whole process, the admin can’t login which causes some confusion. I initially assumed the AMI images were broken after reviewing my security groups and seeing no reason why I shouldn’t be able to login immediately.
  3. You can’t trust the AMI images to be a solid unchanging base, which means you need to be very wary if doing autoscaling. Not only is 10mins a bit too slow for autoscaling, having the potential risk of an instance not coming up due to changes in the latest update is always something to watch out for. If doing autoscaling with these images, you’ll need to consider baking your own pre-patched AMI instead.
  4. It caused me no end of frustration when trying to test user data since I had to wait 10mins each time to get a confirmation of my test!

The last point brings me to user data – the good news is that Amazon correctly supports user data for FreeBSD machines, so you can paste in your tcsh script (not bash remember!) and it will get invoked at launch time.

The downside is that the user data handling of FreeBSD is a lot more fragile than on the Linux images. Generally with Linux, the OS boots (including SSH) and then runs the user data – if the user data breaks, hangs or does anything other than expected, you can still login and debug. Since FreeBSD runs the user data before starting up SSH, if something goes wrong you have no easy way to login and debug. And given the differences between tcsh and bash, plus annoying commands that default to expecting user input on non-interactive ptys, chances are you’ll have more than one attempt that results in a machine getting stuck at launch.
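To give an idea of what a workable script looks like, here’s a minimal sketch of a FreeBSD user data script – the package and log file are purely illustrative, and the key point is that every command must run non-interactively:

#!/bin/tcsh
# Sketch only - this runs before SSH comes up on the FreeBSD AMI, so
# anything that prompts for input will hang the entire launch.
setenv ASSUME_ALWAYS_YES yes
pkg bootstrap
pkg install -y bash
echo "user data finished" >> /var/log/userdata.log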

The ultimate fix is that if you’re using FreeBSD in any serious way on AWS, you’ll probably have to use Packer to bake your own pre-patched AMI in order to get the startup performance to an acceptable level.

Finally remember that on AWS, you need to login as the ec2-user and then su - to become root.

 

Which one?

If you’re interested in FreeBSD and want to pick a provider to play around with, the choice seems pretty simple to me – Digital Ocean. They’ve got the better pricing (~$5/month vs $15/month), and their ridiculously simple dashboard coupled with the excellent documentation they’ve assembled makes it really attractive for anyone new to the *nix or cloud space. Plus they’ve bothered to invest in IPv6, which I appreciate.

However if you’re doing business/enterprise systems and want user data, autoscaling or the benefit of automating entire stacks with CloudFormation, then you will probably find AWS the more attractive offering purely due to the additional functionality offered by that platform. Just be prepared to spend a bit of time baking your own AMI to allow you to skip the overhead of waiting for updates to apply on each instance you bring up.

Neither provider has got their FreeBSD experience quite as slick as that of their Linux offerings, however hopefully they improve on these deficiencies over time – there’s not much needed to get the experience up to the same level as the Linux distributions, and it’s nice having a different type of Unix to play with for a change.

Recovering SW RAID with Ubuntu on Amazon AWS

Amazon’s AWS cloud service is a very popular and generally mature offering, but it does have its issues at times – in particular its storage options and limited debug facilities.

When using AWS, you have three main storage options for your instances (virtual machine servers):

  1. Ephemeral disk – storage attached locally to your instance, which is lost at shutdown or if the instance terminates unexpectedly. A fixed amount is included with your instance, with the size depending on your instance type.
  2. Elastic Block Storage (EBS), which is network-attached block storage exposed to your Linux instance as if it was a traditional local disk.
  3. EBS with provisioned IOPs – the same as the above, but with guarantees around performance – for a price of course. ;-)

With EBS, there’s no need to use RAID from a disk reliability perspective – the EBS volume itself has its own underlying redundancy (although one should still perform snapshots and backups to handle end-user error or systemic EBS failure), which is the common reason for using RAID with conventional physical hosts.

So with RAID being pointless for redundancy in an Amazon world, why write about recovering hosts in AWS using software RAID? Because there are still situations where you may end up using it for purposes other than redundancy:

  1. Poor man’s performance gains – EBS provisioned IOPs are the proper way of getting performance from EBS to meet your particular requirements. But it comes with a cost attached – you pay increasingly more for faster disk, but also need proportionally larger minimum disk sizes to go with the higher speeds (10:1 ratio IOPs:size), which can quickly make a small fast volume prohibitively expensive. A software RAID array can allow you to get more performance by combining numerous small volumes together at low cost (a sketch follows after this list).
  2. Merging multiple EBS volumes – EBS volumes have an Amazon-imposed limit of 1TB per volume. If a single filesystem of more than 1TB is required, either LVM or software RAID is needed to merge them.
  3. Merging multiple ephemeral volumes – software RAID can also be used to merge the multiple ephemeral volumes that Amazon provides on some larger instances. However being ephemeral, if your RAID gets degraded, there’s no need to repair it – just destroy the instance and build a nice new one.
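As an illustration of the first two points, a striped array across several small EBS volumes looks something like the following sketch – the device names, volume count and filesystem are placeholders:

# Sketch only: stripe four small EBS volumes into one md device for more
# IOPs (or more than 1TB of space), then put a filesystem on top.
mdadm --create /dev/md0 --level=0 --raid-devices=4 \
    /dev/xvdf /dev/xvdg /dev/xvdh /dev/xvdi
mkfs.xfs /dev/md0
mkdir -p /mnt/myraidarray
mount /dev/md0 /mnt/myraidarray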

So whilst using software RAID with your AWS instances can be a legitimate exercise, it can also introduce its own share of issues.

Firstly you can no longer use EBS snapshotting to do backups of the EBS volumes, unless you first stop the entire RAID array or freeze filesystem writes for the duration of creating all the snapshots – which depending on your application may or may not be feasible.
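If you do still want snapshot backups of the member volumes, the general approach is something like the following sketch – the volume IDs are placeholders, and the ec2-* commands assume the same legacy EC2 CLI tools used later in this post:

# Sketch only: pause filesystem writes, snapshot every member volume, then
# resume. The snapshots are only consistent as a set if they are all
# created inside the freeze window.
fsfreeze --freeze /mnt/myraidarray
for vol in vol-aaaa1111 vol-bbbb2222 vol-cccc3333 vol-dddd4444; do
    ec2-create-snapshot $vol -d "md0 member snapshot"
done
fsfreeze --unfreeze /mnt/myraidarray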

Secondly you now have the issue of increased complexity of your I/O configuration. If using automation to build your instances, you need to do additional work to handle the setup of the array which is a one-time investment, but the use of RAID also adds complexity to the maintenance (such as resizes) and increases the risk of a fault occurring.

I recently had the excitement/misfortune of such an experience. We had a pair of Ubuntu 12.04 LTS instances using GlusterFS to provide a redundant NFS mount to some of our legacy applications running in AWS (AWS unfortunately lacks a hosted NFS filer service). To provide sufficient speed to an otherwise small volume, RAID 0 had been used with a number of small EBS volumes.

The RAID array was nearly full, so a resize/grow operation was required. This is not an uncommon requirement and just involves adding an EBS volume to the instance, growing the RAID array size and expanding the filesystem on top. Unfortunately something nasty happened between Gluster and the Linux kernel, where the RAID resize operation on one of the two hosts suddenly triggered a kernel panic and failed, killing the host. I wasn’t able to get the logs for it, but at this stage it looks like gluster tried to do some operation right when the resize was active and instead of being blocked, triggered a panic.

Upon a subsequent restart, the host didn’t come back online. Connecting to the AWS instance’s console output (ec2-get-console-output <instanceid>) showed that the RAID array failure was preventing the instance from booting back up, even though it was an auxiliary mount, not the root filesystem or anything required to boot.

The system may have suffered a hardware fault, such as a disk drive
failure.  The root device may depend on the RAID devices being online. One
or more of the following RAID devices are degraded:
Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10] 
md0 : inactive xvdn[9](S) xvdm[7](S) xvdj[4](S) xvdi[3](S) xvdf[0](S) xvdh[2](S) xvdk[5](S) xvdl[6](S) xvdg[1](S)
      13630912 blocks super 1.2

unused devices: <none>
Attempting to start the RAID in degraded mode...
mdadm: CREATE user root not found
mdadm: CREATE group disk not found
[31761224.516958] bio: create slab <bio-1> at 1
[31761224.516976] md/raid:md0: not clean -- starting background reconstruction
[31761224.516981] md/raid:md0: reshape will continue
[31761224.516996] md/raid:md0: device xvdm operational as raid disk 7
[31761224.517002] md/raid:md0: device xvdj operational as raid disk 4
[31761224.517007] md/raid:md0: device xvdi operational as raid disk 3
[31761224.517013] md/raid:md0: device xvdf operational as raid disk 0
[31761224.517018] md/raid:md0: device xvdh operational as raid disk 2
[31761224.517023] md/raid:md0: device xvdk operational as raid disk 5
[31761224.517029] md/raid:md0: device xvdl operational as raid disk 6
[31761224.517034] md/raid:md0: device xvdg operational as raid disk 1
[31761224.517683] md/raid:md0: allocated 10592kB
[31761224.517771] md/raid:md0: cannot start dirty degraded array.
[31761224.518405] md/raid:md0: failed to run raid set.
[31761224.518412] md: pers->run() failed ...
mdadm: failed to start array /dev/md0: Input/output error
mdadm: CREATE user root not found
mdadm: CREATE group disk not found
Could not start the RAID in degraded mode.
Dropping to a shell.

BusyBox v1.18.5 (Ubuntu 1:1.18.5-1ubuntu4.1) built-in shell (ash)
Enter 'help' for a list of built-in commands.

Dropping to a shell when there are bootup problems is an approach on which perspectives differ – personally I want my hosts to boot regardless of how messed up things are so I can get SSH, but others prefer the safety of halting and dropping to a recovery shell for the sysadmin to resolve. Ubuntu is configured to do the latter by default.

But regardless of your views on this subject, dropping to a shell leaves you stuck when running AWS instances, since there is no way to interact with this console – Amazon doesn’t have a proper console for interacting with instances like a traditional VPS provider; you’re limited to only seeing the console log.

Ubuntu’s documentation actually advises that in the event of a failed RAID array, you can still force a boot by setting a kernel option bootdegraded=true. This helps if the array was degraded, but in this case the array had entirely failed, rather than being degraded, and Ubuntu treats that differently.
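For reference, in the plain degraded case that option just needs to end up on the kernel command line – a sketch of doing it persistently on a working Ubuntu host (GRUB 2 assumed):

# Sketch only: allow booting with a degraded (but not fully failed) array.
# Add the option to the kernel command line in /etc/default/grub, e.g.
#   GRUB_CMDLINE_LINUX_DEFAULT="quiet splash bootdegraded=true"
# then regenerate the GRUB configuration:
update-grub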

Thankfully it is possible to recover the failed instance by attaching its root volume to another instance, adjusting the initramfs to allow booting even whilst the RAID is failed, and then once booted, doing the repair on the host itself.

To do this repair you require an additional Linux instance to use as a recovery host and the Amazon CLI tools to be installed on your workstation.

# Set some variables with your instance IDs (eg i-abcd3)
export FAILED=setme
export RECOVERY=setme

# Fetch the root filesystem EBS volume ID and set a var with it:
export VOLUME=vol-setme

# Now stop the failed instance, so we can detach its root volume.
# (Note: wait till status goes from "stopping" to "stopped")
ec2-stop-instances --force $FAILED
ec2-describe-instances $FAILED | grep INSTANCE | awk '{ print $5 }'

# Attach the root volume to the recovery host as /dev/sdo
ec2-detach-volume $VOLUME -i $FAILED
ec2-attach-volume $VOLUME -i $RECOVERY -d /dev/sdo

# Mount the root volume on the recovery host
ssh recoveryhost.example.com
mkdir /mnt/recovery
mount /dev/sdo /mnt/recovery

# Disable raid startup scripts for initramfs/initrd. We need to
# unpack the old file and modify the startup scripts inside it.
cp /mnt/recovery/boot/initrd.img-LATESTHERE-virtual /tmp/initrd-old.img
cd /tmp/
mkdir initrd-test
cd initrd-test
zcat ../initrd-old.img | cpio --extract --make-directories
vim scripts/local-premount/mdadm
- degraded_arrays || exit 0
- mountroot_fail || panic "Dropping to a shell."
+ #degraded_arrays || exit 0
+ #mountroot_fail || panic "Dropping to a shell."
find . | cpio -o -H newc > ../initrd-new.img
cd ..
gzip initrd-new.img
cp initrd-new.img.gz /mnt/recovery/boot/initrd.img-LATESTHERE-virtual

# Disable mounting of filesystem at boot (otherwise startup process
# will fail despite the array being skipped).
vim /mnt/recovery/etc/fstab
- /dev/md0    /mnt/myraidarray    xfs    defaults    1    2
+ #/dev/md0    /mnt/myraidarray    xfs    defaults    1    2

# Work done, umount volume.
umount /mnt/recovery

# Re-attach the root volume back to the failed instance
ec2-detach-volume $VOLUME -i $RECOVERY
ec2-attach-volume $VOLUME -i $FAILED -d /dev/sda1

# Startup the failed instance.
# (Note: Wait for status to go from pending to running)
ec2-start-instances $FAILED
ec2-describe-instances $FAILED | grep INSTANCE | awk '{ print $5 }'

# Watch the startup console. Note: java.lang.NullPointerException
# means that there is no output from the console yet.
ec2-get-console-output $FAILED

# Host should startup, you can get access via SSH and repair RAID
# array via usual means.

The above is very Ubuntu-specific, but the techniques shown are transferable to other platforms as well – just note that the scripts inside the initramfs/initrd will vary per distribution, it’s one of the components of a GNU/Linux system that is completely specific to the distribution vendor.

Adventures in I/O Hell

Earlier this year I had a disk fail in my NZ-based file & VM server. This isn’t something unexpected – the server has 12x hard disks which are over 2 years old, so some failures are to be expected over time. I anticipated just needing to arrange the replacement of one disk and being able to move on with my life.

Of course computers are never as simple as this and it triggered a delightful adventure into how my RAID configuration is built, in which I encountered many lovely headaches and problems, both hardware and software.

My beautiful baby

Servers are beautiful, but also such hard work sometimes.

 

The RAID

Of the 12x disks in my server, 2x are for the OS boot drives and the remaining 10x are pooled together into one whopping RAID 6 which is then split using LVM into volumes for all my virtual machines and also file server storage.

I could have taken the approach of splitting the data and virtual machine volumes apart, maybe giving the virtual machines a RAID 1 and the data a RAID 5, but I settled on this single RAID 6 approach for two reasons:

  1. The bulk of my data is infrequently accessed files but most of the activity is from the virtual machines – by mixing them both together on the same volume and spreading them across the entire RAID 6, I get to take advantage of the additional number of spindles. If I had separate disk pools, the VM drives would be very busy, the file server drives very idle, which is a bit of a waste of potential performance.
  2. The RAID 6 redundancy has a bit of overhead when writing, but it provides the handy ability to tolerate the failure and loss of any 2 disks in the array, offering a bit more protection than RAID 5 and much more usable space than RAID 10.

In my case I used Linux’s Software RAID – whilst many dislike software RAID, the fact is that unless I was to spend serious amounts of money on a good hardware RAID card with a large cache and go to the effort of getting all the vendor management software installed and monitored, I’m going to get little advantage over Linux Software RAID, which consumes relatively little CPU. Plus my use of disk encryption destroys any minor performance advantages obtained by different RAID approaches anyway, making the debate somewhat moot.
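For context, the layout described above is roughly what you’d get from the following – a sketch only, with illustrative device names and volume sizes:

# Sketch only: one 10-disk RAID 6 array, with LVM carving it up into
# volumes for the file server data and the individual virtual machines.
mdadm --create /dev/md2 --level=6 --raid-devices=10 /dev/sd[c-l]1
pvcreate /dev/md2
vgcreate vg_raid /dev/md2
lvcreate --name lv_fileserver --size 2T vg_raid
lvcreate --name lv_vm_guest01 --size 50G vg_raid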

The Hardware

To connect all these drives, I have 3x SATA controllers installed in the server:

# lspci | grep -i sata
00:11.0 SATA controller: Advanced Micro Devices [AMD] nee ATI SB7x0/SB8x0/SB9x0 SATA Controller [AHCI mode]
02:00.0 SCSI storage controller: Marvell Technology Group Ltd. 88SX7042 PCI-e 4-port SATA-II (rev 02)
03:00.0 SCSI storage controller: Marvell Technology Group Ltd. 88SX7042 PCI-e 4-port SATA-II (rev 02)

The onboard AMD/ATI controller uses the standard “ahci” Linux kernel driver and the Marvell controllers use the standard “sata_mv” Linux kernel driver. In both cases, the controllers are configured purely in JBOD mode, which exposes each drive as-is to the OS, with no hardware RAID or abstraction taking place.

You can see how the disks are allocated to the different PCI controllers using the sysfs filesystem on the server:

# ls -l /sys/block/sd*
lrwxrwxrwx. 1 root root 0 Mar 14 04:50 sda -> ../devices/pci0000:00/0000:00:02.0/0000:02:00.0/host0/target0:0:0/0:0:0:0/block/sda
lrwxrwxrwx. 1 root root 0 Mar 14 04:50 sdb -> ../devices/pci0000:00/0000:00:02.0/0000:02:00.0/host1/target1:0:0/1:0:0:0/block/sdb
lrwxrwxrwx. 1 root root 0 Mar 14 04:55 sdc -> ../devices/pci0000:00/0000:00:02.0/0000:02:00.0/host2/target2:0:0/2:0:0:0/block/sdc
lrwxrwxrwx. 1 root root 0 Mar 14 04:50 sdd -> ../devices/pci0000:00/0000:00:02.0/0000:02:00.0/host3/target3:0:0/3:0:0:0/block/sdd
lrwxrwxrwx. 1 root root 0 Mar 14 04:55 sde -> ../devices/pci0000:00/0000:00:03.0/0000:03:00.0/host4/target4:0:0/4:0:0:0/block/sde
lrwxrwxrwx. 1 root root 0 Mar 14 04:55 sdf -> ../devices/pci0000:00/0000:00:03.0/0000:03:00.0/host5/target5:0:0/5:0:0:0/block/sdf
lrwxrwxrwx. 1 root root 0 Mar 14 04:55 sdg -> ../devices/pci0000:00/0000:00:03.0/0000:03:00.0/host6/target6:0:0/6:0:0:0/block/sdg
lrwxrwxrwx. 1 root root 0 Mar 14 04:55 sdh -> ../devices/pci0000:00/0000:00:03.0/0000:03:00.0/host7/target7:0:0/7:0:0:0/block/sdh
lrwxrwxrwx. 1 root root 0 Mar 14 04:55 sdi -> ../devices/pci0000:00/0000:00:11.0/host8/target8:0:0/8:0:0:0/block/sdi
lrwxrwxrwx. 1 root root 0 Mar 14 04:55 sdj -> ../devices/pci0000:00/0000:00:11.0/host9/target9:0:0/9:0:0:0/block/sdj
lrwxrwxrwx. 1 root root 0 Mar 14 04:55 sdk -> ../devices/pci0000:00/0000:00:11.0/host10/target10:0:0/10:0:0:0/block/sdk
lrwxrwxrwx. 1 root root 0 Mar 14 04:55 sdl -> ../devices/pci0000:00/0000:00:11.0/host11/target11:0:0/11:0:0:0/block/sdl

There is a single high-wattage PSU driving the host and all the disks, in a well designed Lian-Li case with proper space and cooling for all these disks.

The Failure

In March my first disk died. In this case, /dev/sde, attached to the host AMD controller, failed, and I decided to hold off replacing it for a couple of months, as it’s internal to the case and I would be visiting home in May, just a month or so away. Meanwhile the fact that RAID 6 has 2x parity disks would ensure that I had at least a RAID 5 level of protection until then.

However a few weeks later, a second disk in the same array also failed. This became much more worrying, since two bad disks in the array removed any redundancy that the RAID array offered me and meant that any further disk failure would cause fatal data corruption of the array.

Naturally due to Murphy’s Law, a third disk then promptly failed a few hours later triggering a fatal collapse of my array, leaving it in a state like the following:

md2 : active raid6 sdl1[9] sdj1[8] sda1[4] sdb1[5] sdd1[0] sdc1[2] sdh1[3](F) sdg1[1] sdf1[7](F) sde1[6](F)
      7814070272 blocks super 1.2 level 6, 512k chunk, algorithm 2 [10/7] [UUU_U__UUU]

Whilst 90% of my array is regularly backed up, there was some data that I hadn’t been backing up due to its excessive size – not fatal to lose, but nice to have. I tried to recover the array by clearing the faulty flag on the last-failed disk (which should still have been consistent/in-line with the other disks) in order to bring it up so I could pull the latest data from it.

mdadm --assemble /dev/md2 --force --run /dev/sdl1 /dev/sdj1 /dev/sda1 /dev/sdb1 /dev/sdd1 /dev/sdc1 /dev/sdh1 /dev/sdg1
mdadm: forcing event count in /dev/sdh1(3) from 6613460 upto 6613616
mdadm: clearing FAULTY flag for device 6 in /dev/md2 for /dev/sdh1
mdadm: Marking array /dev/md2 as 'clean'

mdadm: failed to add /dev/sdh1 to /dev/md2: Invalid argument
mdadm: failed to RUN_ARRAY /dev/md2: Input/output error

Sadly this failed in a weird fashion: the array state was cleared to OK, but /dev/sdh was still failed and missing from the array, leaving the data corrupted and all attempts to read the array successfully hopeless. At this stage the RAID array was toast and I had no other option than to leave it broken for a few weeks till I could fix it on a scheduled trip home. :-(
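When picking apart this kind of assembly failure, comparing the event counters and update times across the members is a useful sanity check for deciding which disks are worth forcing back in – something along these lines:

# Sketch only: show which array members have fallen behind the others.
mdadm --examine /dev/sd[a-l]1 | grep -E '^/dev|Update Time|Events'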

Debugging in the murk of the I/O layer

Having lost 3x disks, it was unclear as to what the root cause of the failure in the array was at this time. Considering I was living in another country I decided to take the precaution of ordering a few different spare parts for my trip, on the basis that spending a couple hundred dollars on parts I didn’t need would be more useful than not having the part I did need when I got back home to fix it.

At this time, I had lost sde, sdf and sdh – all three disks belonging to a single RAID controller, as shown by /sys/block/sd, and all of them Seagate Barracuda disks.

lrwxrwxrwx. 1 root root 0 Mar 14 04:55 sde -> ../devices/pci0000:00/0000:00:03.0/0000:03:00.0/host4/target4:0:0/4:0:0:0/block/sde
lrwxrwxrwx. 1 root root 0 Mar 14 04:55 sdf -> ../devices/pci0000:00/0000:00:03.0/0000:03:00.0/host5/target5:0:0/5:0:0:0/block/sdf
lrwxrwxrwx. 1 root root 0 Mar 14 04:55 sdg -> ../devices/pci0000:00/0000:00:03.0/0000:03:00.0/host6/target6:0:0/6:0:0:0/block/sdg
lrwxrwxrwx. 1 root root 0 Mar 14 04:55 sdh -> ../devices/pci0000:00/0000:00:03.0/0000:03:00.0/host7/target7:0:0/7:0:0:0/block/sdh

Hence I decided to order 3x 1TB replacement Western Digital Black edition disks to replace the failed drives, deliberately choosing another vendor/generation of disk to ensure that if it was some weird/common fault with the Seagate drives it wouldn’t occur with the new disks.

I also placed an order for 2x new SATA controllers, since I had suspicions about the fact that numerous disks attached to one particular controller had failed. The two add-in PCI-e controllers were somewhat simplistic ST-Labs cards, providing 4x SATA II ports in a PCI-e 4x format, using a Marvell 88SX7042 chipset. I had used their SiI-based PCI-e 1x controllers successfully before and found them to be good budget-friendly JBOD controllers, but being budget no-name vendor cards, I didn’t hold high hopes for them.

To replace them, I ordered 2x Highpoint RocketRaid 640L cards, one of the few 4x SATA port options I could get in NZ at the time. The only problem with these cards was that they were also based on the Marvell 88SX7042 SATA chipset, but I figured it was unlikely for a chipset itself to be the cause of the issue.

I considered getting a new power supply as well, but given that I had not experienced issues with any disks attached to the server’s onboard AMD SATA controller and that all my graphing of the power supply showed the voltages solid, I didn’t have reason to doubt the PSU.

When back in Wellington in May, I removed all three disks that had previously exhibited problems and installed the three new replacements. I also replaced the 2x PCI-e ST-Labs controllers with the new Highpoint RocketRaid 640L controllers. Depressingly almost as soon as the hardware was changed and I started my RAID rebuild, I started experiencing problems with the system.

May  4 16:39:30 phobos kernel: md/raid:md2: raid level 5 active with 7 out of 8 devices, algorithm 2
May  4 16:39:30 phobos kernel: created bitmap (8 pages) for device md2
May  4 16:39:30 phobos kernel: md2: bitmap initialized from disk: read 1 pages, set 14903 of 14903 bits
May  4 16:39:30 phobos kernel: md2: detected capacity change from 0 to 7000493129728
May  4 16:39:30 phobos kernel: md2:
May  4 16:39:30 phobos kernel: md: recovery of RAID array md2
May  4 16:39:30 phobos kernel: md: minimum _guaranteed_  speed: 1000 KB/sec/disk.
May  4 16:39:30 phobos kernel: md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for recovery.
May  4 16:39:30 phobos kernel: md: using 128k window, over a total of 976631296k.
May  4 16:39:30 phobos kernel: unknown partition table
...
May  4 16:50:52 phobos kernel: ata7.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
May  4 16:50:52 phobos kernel: ata7.00: failed command: IDENTIFY DEVICE
May  4 16:50:52 phobos kernel: ata7.00: cmd ec/00:01:00:00:00/00:00:00:00:00/00 tag 0 pio 512 in
May  4 16:50:52 phobos kernel:         res 40/00:24:d8:1b:c6/00:00:02:00:00/40 Emask 0x4 (timeout)
May  4 16:50:52 phobos kernel: ata7.00: status: { DRDY }
May  4 16:50:52 phobos kernel: ata7: hard resetting link
May  4 16:50:52 phobos kernel: ata7: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
May  4 16:50:52 phobos kernel: ata7.00: configured for UDMA/133
May  4 16:50:52 phobos kernel: ata7: EH complete
May  4 16:51:13 phobos kernel: ata16.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
May  4 16:51:13 phobos kernel: ata16.00: failed command: IDENTIFY DEVICE
May  4 16:51:13 phobos kernel: ata16.00: cmd ec/00:01:00:00:00/00:00:00:00:00/00 tag 0 pio 512 in
May  4 16:51:13 phobos kernel:         res 40/00:ff:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
May  4 16:51:13 phobos kernel: ata16.00: status: { DRDY }
May  4 16:51:13 phobos kernel: ata16: hard resetting link
May  4 16:51:18 phobos kernel: ata16: link is slow to respond, please be patient (ready=0)
May  4 16:51:23 phobos kernel: ata16: COMRESET failed (errno=-16)
May  4 16:51:23 phobos kernel: ata16: hard resetting link
May  4 16:51:28 phobos kernel: ata16: link is slow to respond, please be patient (ready=0)
May  4 16:51:33 phobos kernel: ata16: COMRESET failed (errno=-16)
May  4 16:51:33 phobos kernel: ata16: hard resetting link
May  4 16:51:39 phobos kernel: ata16: link is slow to respond, please be patient (ready=0)
May  4 16:52:08 phobos kernel: ata16: COMRESET failed (errno=-16)
May  4 16:52:08 phobos kernel: ata16: limiting SATA link speed to 3.0 Gbps
May  4 16:52:08 phobos kernel: ata16: hard resetting link
May  4 16:52:13 phobos kernel: ata16: COMRESET failed (errno=-16)
May  4 16:52:13 phobos kernel: ata16: reset failed, giving up
May  4 16:52:13 phobos kernel: ata16.00: disabled
May  4 16:52:13 phobos kernel: ata16: EH complete
May  4 16:52:13 phobos kernel: sd 15:0:0:0: [sdk] Unhandled error code
May  4 16:52:13 phobos kernel: sd 15:0:0:0: [sdk] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
May  4 16:52:13 phobos kernel: sd 15:0:0:0: [sdk] CDB: Read(10): 28 00 05 07 c8 00 00 01 00 00
May  4 16:52:13 phobos kernel: md/raid:md2: Disk failure on sdk, disabling device.
May  4 16:52:13 phobos kernel: md/raid:md2: Operation continuing on 6 devices.
May  4 16:52:13 phobos kernel: md/raid:md2: read error not correctable (sector 84396112 on sdk).
...
May  4 16:52:13 phobos kernel: md/raid:md2: read error not correctable (sector 84396280 on sdk).
May  4 16:52:13 phobos kernel: sd 15:0:0:0: [sdk] Unhandled error code
May  4 16:52:13 phobos kernel: sd 15:0:0:0: [sdk] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
May  4 16:52:13 phobos kernel: sd 15:0:0:0: [sdk] CDB: Read(10): 28 00 05 07 c9 00 00 04 00 00
May  4 16:52:13 phobos kernel: md/raid:md2: read error not correctable (sector 84396288 on sdk)
...
May  4 16:52:13 phobos kernel: md/raid:md2: read error not correctable (sector 84397304 on sdk).
May  4 16:52:13 phobos kernel: sd 15:0:0:0: [sdk] Unhandled error code
May  4 16:52:13 phobos kernel: sd 15:0:0:0: [sdk] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
May  4 16:52:13 phobos kernel: sd 15:0:0:0: [sdk] CDB: Read(10): 28 00 05 07 cd 00 00 03 00 00
May  4 16:52:13 phobos kernel: md/raid:md2: read error not correctable (sector 84397312 on sdk).
...
May  4 16:52:13 phobos kernel: md/raid:md2: read error not correctable (sector 84398072 on sdk).
May  4 16:53:14 phobos kernel: ata18.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
May  4 16:53:14 phobos kernel: ata15.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
May  4 16:53:14 phobos kernel: ata18.00: failed command: FLUSH CACHE EXT
May  4 16:53:14 phobos kernel: ata18.00: cmd ea/00:00:00:00:00/00:00:00:00:00/a0 tag 0
May  4 16:53:14 phobos kernel:         res 40/00:00:00:00:00/00:00:00:00:00/a0 Emask 0x4 (timeout)
May  4 16:53:14 phobos kernel: ata18.00: status: { DRDY }
May  4 16:53:14 phobos kernel: ata18: hard resetting link
May  4 16:53:14 phobos kernel: ata17.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
May  4 16:53:14 phobos kernel: ata17.00: failed command: FLUSH CACHE EXT
May  4 16:53:14 phobos kernel: ata17.00: cmd ea/00:00:00:00:00/00:00:00:00:00/a0 tag 0
May  4 16:53:14 phobos kernel:         res 40/00:ff:00:00:00/00:00:00:00:00/40 Emask 0x4 (timeout)
May  4 16:53:14 phobos kernel: ata17.00: status: { DRDY }
May  4 16:53:14 phobos kernel: ata17: hard resetting link
May  4 16:53:14 phobos kernel: ata15.00: failed command: FLUSH CACHE EXT
May  4 16:53:14 phobos kernel: ata15.00: cmd ea/00:00:00:00:00/00:00:00:00:00/a0 tag 0
May  4 16:53:14 phobos kernel:         res 40/00:ff:00:00:00/00:00:00:00:00/40 Emask 0x4 (timeout)
May  4 16:53:14 phobos kernel: ata15.00: status: { DRDY }
May  4 16:53:14 phobos kernel: ata15: hard resetting link
May  4 16:53:15 phobos kernel: ata15: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
May  4 16:53:15 phobos kernel: ata17: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
May  4 16:53:15 phobos kernel: ata18: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
May  4 16:53:20 phobos kernel: ata15.00: qc timeout (cmd 0xec)
May  4 16:53:20 phobos kernel: ata17.00: qc timeout (cmd 0xec)
May  4 16:53:20 phobos kernel: ata18.00: qc timeout (cmd 0xec)
May  4 16:53:20 phobos kernel: ata17.00: failed to IDENTIFY (I/O error, err_mask=0x4)
May  4 16:53:20 phobos kernel: ata17.00: revalidation failed (errno=-5)
May  4 16:53:20 phobos kernel: ata17: hard resetting link
May  4 16:53:20 phobos kernel: ata15.00: failed to IDENTIFY (I/O error, err_mask=0x4)
May  4 16:53:20 phobos kernel: ata15.00: revalidation failed (errno=-5)
May  4 16:53:20 phobos kernel: ata15: hard resetting link
May  4 16:53:20 phobos kernel: ata18.00: failed to IDENTIFY (I/O error, err_mask=0x4)
May  4 16:53:20 phobos kernel: ata18.00: revalidation failed (errno=-5)
May  4 16:53:20 phobos kernel: ata18: hard resetting link
May  4 16:53:21 phobos kernel: ata15: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
May  4 16:53:21 phobos kernel: ata17: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
May  4 16:53:21 phobos kernel: ata18: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
May  4 16:53:31 phobos kernel: ata15.00: qc timeout (cmd 0xec)
May  4 16:53:31 phobos kernel: ata17.00: qc timeout (cmd 0xec)
May  4 16:53:31 phobos kernel: ata18.00: qc timeout (cmd 0xec)
May  4 16:53:32 phobos kernel: ata17.00: failed to IDENTIFY (I/O error, err_mask=0x4)
May  4 16:53:32 phobos kernel: ata15.00: failed to IDENTIFY (I/O error, err_mask=0x4)
May  4 16:53:32 phobos kernel: ata15.00: revalidation failed (errno=-5)
May  4 16:53:32 phobos kernel: ata15: limiting SATA link speed to 1.5 Gbps
May  4 16:53:32 phobos kernel: ata15: hard resetting link
May  4 16:53:32 phobos kernel: ata17.00: revalidation failed (errno=-5)
May  4 16:53:32 phobos kernel: ata17: limiting SATA link speed to 3.0 Gbps
May  4 16:53:32 phobos kernel: ata17: hard resetting link
May  4 16:53:32 phobos kernel: ata18.00: failed to IDENTIFY (I/O error, err_mask=0x4)
May  4 16:53:32 phobos kernel: ata18.00: revalidation failed (errno=-5)
May  4 16:53:32 phobos kernel: ata18: limiting SATA link speed to 1.5 Gbps
May  4 16:53:32 phobos kernel: ata18: hard resetting link
May  4 16:53:32 phobos kernel: ata17: SATA link up 6.0 Gbps (SStatus 133 SControl 320)
May  4 16:53:33 phobos kernel: ata15: SATA link up 3.0 Gbps (SStatus 123 SControl 310)
May  4 16:53:33 phobos kernel: ata18: SATA link up 3.0 Gbps (SStatus 123 SControl 310)
May  4 16:54:05 phobos kernel: ata15: EH complete
May  4 16:54:05 phobos kernel: ata17: EH complete
May  4 16:54:05 phobos kernel: sd 14:0:0:0: [sdj] Unhandled error code
May  4 16:54:05 phobos kernel: sd 14:0:0:0: [sdj] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
May  4 16:54:05 phobos kernel: sd 14:0:0:0: [sdj] CDB: Write(10): 2a 00 00 00 00 10 00 00 05 00
May  4 16:54:05 phobos kernel: md: super_written gets error=-5, uptodate=0
May  4 16:54:05 phobos kernel: md/raid:md2: Disk failure on sdj, disabling device.
May  4 16:54:05 phobos kernel: md/raid:md2: Operation continuing on 5 devices.
May  4 16:54:05 phobos kernel: sd 16:0:0:0: [sdl] Unhandled error code
May  4 16:54:05 phobos kernel: sd 16:0:0:0: [sdl] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
May  4 16:54:05 phobos kernel: sd 16:0:0:0: [sdl] CDB: Write(10): 2a 00 00 00 00 10 00 00 05 00
May  4 16:54:05 phobos kernel: md: super_written gets error=-5, uptodate=0
May  4 16:54:05 phobos kernel: md/raid:md2: Disk failure on sdl, disabling device.
May  4 16:54:05 phobos kernel: md/raid:md2: Operation continuing on 4 devices.
May  4 16:54:05 phobos kernel: ata18: EH complete
May  4 16:54:05 phobos kernel: sd 17:0:0:0: [sdm] Unhandled error code
May  4 16:54:05 phobos kernel: sd 17:0:0:0: [sdm] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
May  4 16:54:05 phobos kernel: sd 17:0:0:0: [sdm] CDB: Write(10): 2a 00 00 00 00 10 00 00 05 00
May  4 16:54:05 phobos kernel: md: super_written gets error=-5, uptodate=0
May  4 16:54:05 phobos kernel: md/raid:md2: Disk failure on sdm, disabling device.
May  4 16:54:05 phobos kernel: md/raid:md2: Operation continuing on 4 devices.
May  4 16:54:05 phobos kernel: md: md2: recovery done.

The above logs demonstrate my problem perfectly. After rebuilding the array, a disk would fail and lead to a cascading failure where other disks would throw up “failed to IDENTIFY” errors and then fail one-by-one, until all the disks attached to that controller had vanished from the server.

At this stage I knew the issue wasn’t caused by all the hard disks themselves, but there was the annoying fact that at the same time, I couldn’t trust all the disks – there were certainly at least a couple of bad hard drives in my server triggering a fault, but many of the disks marked as bad by the server were perfectly fine when tested in a different system.

Untrustworthy Disks


By watching the kernel logs and experimenting with disk combinations and rebuilds, I was able to determine that the first disk that always failed was usually faulty and did need replacing. However the subsequent disk failures were false alarms and seemed to be part of a cascading triggered failure on the controller that the bad disk was attached to.
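Verifying a suspect disk out-of-band helps confirm which failures are genuine – a sketch of the sort of checks I mean, run from another machine or controller (smartmontools assumed):

# Sketch only: check SMART health/attributes and run a long self-test to
# see whether a kicked-out disk is actually bad, independent of the
# suspect controller.
smartctl --health --attributes /dev/sdX
smartctl --test=long /dev/sdX     # later: smartctl --log=selftest /dev/sdX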

Using this information, it would be possible for me to eliminate all the bad disks by removing the first one to fail each time until I had no more failing disks. But this would be a false fix – if I were to rebuild the RAID array in this fashion, everything would be great until the very next disk failure occurred and led to the whole array dying again.

Therefore I had to find the root cause and put a permanent fix in place. Looking at what I knew about the issue, I could identify the following possible causes:

  1. A bug in the Linux kernel was triggering cascading failures during RAID rebuilds (a bug in either the software RAID layer or in the sata_mv driver for my Marvell controllers).
  2. The power supply on the server was glitching under the load/strain of a RAID rebuild and causing brownouts to the drives.
  3. The RAID controller had some low level hardware issue and/or an issue exhibited when one failed disk sent a bad signal to the controller.

Frustratingly I didn’t have any further time to work on it whilst I was in NZ in May, so I rebuilt the minimum array possible using the onboard AMD controller to restore my critical VMs and had to revisit the issue again in October when back in the country for another visit.

Debugging hardware issues and cursing the lower levels of the OSI layers.

Debugging disks and cursing hardware.

When I got back in October, I decided to eliminate all the possible issues, no matter how unlikely. I was getting pretty tired of having to squeeze all my stuff into only a few TB, so I wanted a fix no matter what.

Kernel – My server was running the stock CentOS 6 2.6.32 kernel, which was unlikely to have any major storage subsystem bugs; however as a precaution I upgraded to the latest upstream kernel at the time, 3.11.4.

This failed to make any impact, so I decided it was unlikely that I was the first person to discover a bug in the Linux kernel AHCI/software RAID layer and marked the kernel as an unlikely source of the problem. The fact that I didn’t see the same cascading failure when I added a bad disk to the onboard AMD controller also helped eliminate a kernel RAID bug as the cause, leaving only the possibility of a bug in the “sata_mv” kernel driver that was specific to the Marvell controllers.

Power Supply – I replaced the few-year-old power supply in the server with a newer, higher-spec, good quality model. Sadly this failed to resolve anything, but it did prove that the issue wasn’t power related.

SATA Enclosure – 4x of my disks were in a 5.25″ to 3.5″ hotswap enclosure. It seemed unlikely that a low-level physical device could be the cause, but I wanted to eliminate every piece of common infrastructure, and the bad disks all tended to be in this enclosure. By juggling the bad disks around, I was able to confirm that the issue wasn’t specific to this one enclosure.

At this time the only hardware in common left over was the two PCI-e RAID controllers – and of course the motherboard they were connected via. I really didn’t want to replace the motherboard and half the system with it, so I decided to consider the possibility that the fact the old ST-Lab and new RocketRaid controllers both used the same Marvell 88SX7042 chipset meant that I hadn’t properly eliminated the controllers as being the cause of the issue.

After doing a fair bit of research around the Marvell 88SX7042 online, I did find a number of references to similar issues with other Marvell chipsets, such as issues with the 9123 chipset series, issues with the 6145 chipset and even reports of issues with Marvell chipsets on FreeBSD.

The Solution

Having traced something dodgy back to the Marvell controllers, I decided the only course of action left was to source some new controller cards. Annoyingly I couldn’t get anything with 4x SATA ports in NZ that wasn’t either Marvell chipset based or an extremely expensive hardware RAID controller offering many more features than I required.

I ended up ordering 2x Syba PCI Express SATA II 4-Port RAID Controller Card SY-PEX40008 which are SiI 3124 chipset based controllers that provide 4x SATA II ports on a single PCIe-1x card. It’s worth noting that these controllers are actually PCI-X cards with an onboard PCI-X to PCIe-1x converter chip which is more than fast enough for hard disks, but could be a limitation for SSDs.


Marvell chipset-based controller (left) and replacement SiI-based controller (right). The SiI controller is much larger thanks to the PCI-X to PCI-e converter chip on its PCB.

I selected these cards in particular as they’re used by the team at Backblaze (a massive online backup service provider) for their storage pods, which are built entirely around giant JBOD disk arrays. I figured that with the experience and testing this team does, any kit they recommend is likely to be pretty solid (check out their design/decision blog post).

# lspci | grep ATA
00:11.0 SATA controller: Advanced Micro Devices [AMD] nee ATI SB7x0/SB8x0/SB9x0 SATA Controller [AHCI mode]
03:04.0 RAID bus controller: Silicon Image, Inc. SiI 3124 PCI-X Serial ATA Controller (rev 02)
05:04.0 RAID bus controller: Silicon Image, Inc. SiI 3124 PCI-X Serial ATA Controller (rev 02)

These controllers use the “sata_sil24” kernel driver and are configured as simple JBOD controllers in the exact same fashion as the Marvell cards.

To properly test them, I rebuilt an array with a known bad disk in it. As expected, at some point during the rebuild, I got a failure of this disk:

Nov 13 10:15:19 phobos kernel: ata14.00: exception Emask 0x0 SAct 0x3fe SErr 0x0 action 0x6
Nov 13 10:15:19 phobos kernel: ata14.00: irq_stat 0x00020002, device error via SDB FIS
Nov 13 10:15:19 phobos kernel: ata14.00: failed command: READ FPDMA QUEUED
Nov 13 10:15:19 phobos kernel: ata14.00: cmd 60/00:08:00:c5:c7/01:00:02:00:00/40 tag 1 ncq 131072 in
Nov 13 10:15:19 phobos kernel:         res 41/40:00:2d:c5:c7/00:01:02:00:00/00 Emask 0x409 (media error) <F>
Nov 13 10:15:19 phobos kernel: ata14.00: status: { DRDY ERR }
Nov 13 10:15:19 phobos kernel: ata14.00: exception Emask 0x0 SAct 0x3fe SErr 0x0 action 0x6
Nov 13 10:15:19 phobos kernel: ata14.00: irq_stat 0x00020002, device error via SDB FIS
Nov 13 10:15:19 phobos kernel: ata14.00: failed command: READ FPDMA QUEUED
Nov 13 10:15:19 phobos kernel: ata14.00: cmd 60/00:08:00:c5:c7/01:00:02:00:00/40 tag 1 ncq 131072 in
Nov 13 10:15:19 phobos kernel:         res 41/40:00:2d:c5:c7/00:01:02:00:00/00 Emask 0x409 (media error) <F>
Nov 13 10:15:19 phobos kernel: ata14.00: status: { DRDY ERR }
Nov 13 10:15:19 phobos kernel: ata14.00: error: { UNC }
Nov 13 10:15:19 phobos kernel: ata14.00: failed command: READ FPDMA QUEUED
Nov 13 10:15:19 phobos kernel: ata14.00: cmd 60/00:10:00:c6:c7/01:00:02:00:00/40 tag 2 ncq 131072 in
Nov 13 10:15:19 phobos kernel:         res 9c/13:04:04:00:00/00:00:00:20:9c/00 Emask 0x2 (HSM violation)
Nov 13 10:15:19 phobos kernel: ata14.00: status: { Busy }
Nov 13 10:15:19 phobos kernel: ata14.00: error: { IDNF }
Nov 13 10:15:19 phobos kernel: ata14.00: failed command: READ FPDMA QUEUED
Nov 13 10:15:19 phobos kernel: ata14.00: cmd 60/c0:18:00:c7:c7/00:00:02:00:00/40 tag 3 ncq 98304 in
Nov 13 10:15:19 phobos kernel:         res 9c/13:04:04:00:00/00:00:00:30:9c/00 Emask 0x2 (HSM violation)
...
Nov 13 10:15:41 phobos kernel: ata14.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
Nov 13 10:15:41 phobos kernel: ata14.00: irq_stat 0x00060002, device error via D2H FIS
Nov 13 10:15:41 phobos kernel: ata14.00: failed command: READ DMA
Nov 13 10:15:41 phobos kernel: ata14.00: cmd c8/00:00:00:c5:c7/00:00:00:00:00/e2 tag 0 dma 131072 in
Nov 13 10:15:41 phobos kernel:         res 51/40:00:2d:c5:c7/00:00:02:00:00/02 Emask 0x9 (media error)
Nov 13 10:15:41 phobos kernel: ata14.00: status: { DRDY ERR }
Nov 13 10:15:41 phobos kernel: ata14.00: error: { UNC }
Nov 13 10:15:41 phobos kernel: ata14.00: configured for UDMA/100
Nov 13 10:15:41 phobos kernel: sd 13:0:0:0: [sdk] Unhandled sense code
Nov 13 10:15:41 phobos kernel: sd 13:0:0:0: [sdk] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Nov 13 10:15:41 phobos kernel: sd 13:0:0:0: [sdk] Sense Key : Medium Error [current] [descriptor]
Nov 13 10:15:41 phobos kernel: Descriptor sense data with sense descriptors (in hex):
Nov 13 10:15:41 phobos kernel:        72 03 11 04 00 00 00 0c 00 0a 80 00 00 00 00 00 
Nov 13 10:15:41 phobos kernel:        02 c7 c5 2d 
Nov 13 10:15:41 phobos kernel: sd 13:0:0:0: [sdk] Add. Sense: Unrecovered read error - auto reallocate failed
Nov 13 10:15:41 phobos kernel: sd 13:0:0:0: [sdk] CDB: Read(10): 28 00 02 c7 c5 00 00 01 00 00
Nov 13 10:15:41 phobos kernel: ata14: EH complete
Nov 13 10:15:41 phobos kernel: md/raid:md5: read error corrected (8 sectors at 46646648 on sdk)

The known faulty disk failed as expected, but rather than suddenly vanishing from the SATA controller, the disk remained attached and proceeded to spew out seek errors for the next few hours, slowing down the rebuild process.

Most importantly, no other disks failed at any stage following the move from the Marvell to the SiI controllers. Having assured myself of the stability and reliability of the controllers, I was able to add in the other disks I wasn’t sure had or hadn’t failed and quickly eliminate the truly faulty ones.

My RAID array rebuild completed successfully and I was then able to restore the LVM volume and all my data, bringing full services back up on my system.


With the data restored successfully at last, time for a delicious local IPA to celebrate victory at long last.

Having gone through the pain and effort to recover and rebuild this system, I had to ask myself a few questions about my design decisions and what I could do to avoid a fault like this again in future.

Are the Marvell controllers shit?

Yes. Next question?

In seriousness, the fact that two completely different PCI-e card vendors exhibited the exact same problem with this controller shows that it’s not going to be the fault of the manufacturer of the controller card and must be something with the controller itself or the kernel driver operating it.

It is *possible* that the Linux kernel has a bug in the sata_mv driver, but I don’t believe that’s the likely cause due to the way the fault manifests. Whenever a disk failed, the other disks would slowly disappear from the kernel entirely, with the kernel showing surprise at suddenly having lost a disk connection. Without digging into the Marvell driver code, it looks very much like the hardware is failing in some fashion and dropping the disks. The kernel just gets what the hardware reports, so when the disks vanish, it just tries to cope the best that it can.

I also consider a kernel bug to be a vendor fault anyway. If you’re a major storage hardware vendor, you have an obligation to thoroughly test the behaviour of your hardware with the world’s leading server operating system and find/fix any bugs, even if the driver has been written by the open source community, rather than your own in house developers.

Finally posts on FreeBSD mailing lists also show users reporting strange issues with failing disks on Marvell controllers, so that adds a cross-platform dimension to the problem. Further testing with the RAID vendor drivers rather than sata_mv would have been useful, but they only had binaries for old kernel versions that didn’t match the CentOS 6 kernel I was using.

Are the Seagate disks shit?

Maybe – it’s hard to be sure, since my server has been driven up in my car from Wellington to Auckland (~ 900km) and back which would not have been great for the disks (especially with my driving) and has had a couple of times that it’s gotten warmer than I would have liked thanks to the wonders of running a server in a suburban home.

I am distrustful of my Seagate drives, but I’ve had bad experiences with Western Digital disks in the past as well, which brings me to the simple conclusion that all spinning rust platters are crap, and the price drop of SSDs to a level that kills hard disks can’t come soon enough.

And even if I plug in the worst possible, spec-violating SATA drive to a controller, that controller should be able to handle it properly and keep the disk isolated, even if it means disconnecting one bad disk. The Marvell 88SX7042 controller is advertised as being a 4-channel controller, I therefore do not expect issues on one channel to impact the activities on the other channels. In the same way that software developers code assuming malicious input, hardware vendors need to design for the worst possible signals from other devices.

What about Enterprise vs Consumer disks?

I’ve looked into the differences between Enterprise and Consumer grade disks before.

Enterprise disks would certainly help the failures occur in a cleaner fashion. Whether it would result in correct behaviour of the Marvell controllers, I am unsure… the cleaner death *may* result in the controller glitching less, but I still wouldn’t trust it after what I’ve seen.

I also found rebuilding my RAID array with one of my bad sector disks would take around 12 hours when the disk was failing. After kicking out the disk from the array, that rebuild time dropped to 5 hours, which is a pretty compelling argument for using enterprise disks to have them die quicker and cleaner.

Lessons & Recommendations

Whilst slow, painful and frustrating, this particular experience has not been a total waste of time – there are certainly some useful lessons I’ve learnt from the exercise and things I’d do differently in future. In particular:

  1. Anyone involved in IT ends up with a few bad disks after a while. Keep some of these bad disks around rather than destroying them. When you build a new storage array, install a known bad disk and run disk benchmarks to trigger a fault to ensure that all your new infrastructure handles the failure of one disk correctly. Doing this would have allowed me to identify the Marvell controllers as crap before the server went into production, as I would have seen that a bad disks triggers a cascading fault across all the good disks.
  2. Record the serial numbers of all your disks once you build the server (a quick way to capture them is sketched after this list). I had annoyances where bad disks would disappear from the controller entirely and I wouldn’t know what the serial number of the disk was, so I had to get the serial numbers from all the good disks and remove the one disk that wasn’t on the list in a slow/annoying process of elimination.
  3. RAID is not a replacement for backup. This is a common recommendation, but I still know numerous people who treat RAID as an infallible storage solution and don’t take adequate precautions for failure of the storage array. The same applies to RAID-like filesystem technologies such as ZFS. I had backups for almost all my data, but there were a few things I hadn’t properly covered, so it was a good reminder not to succumb to hubris in one’s data redundancy setup.
  4. Don’t touch anything with a Marvell chipset in it. I’m pretty bitter about their chipsets following this whole exercise and will be staying well away from their products for the foreseeable future.
  5. Hardware can be time consuming and expensive. As a company, I’d stick to using hosted/cloud providers for almost anything these days. Sure the hourly cost can seem expensive, but not needing to deal with hardware failures is of immense value in itself. Imagine if this RAID issue had been with a colocated server, rather than a personal machine that wasn’t mission critical, the cost/time to the company would have been a large unwanted investment.
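On the second point, capturing the serial numbers only takes a moment while everything is still healthy – something like this sketch (smartmontools assumed, output file name illustrative):

# Sketch only: record the serial number of every disk so a vanished drive
# can later be identified by process of elimination.
for disk in /dev/sd?; do
    printf '%s: ' "$disk"
    smartctl --info "$disk" | grep 'Serial Number'
done > /root/disk-serials.txt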

 

ELBs & Corporate Proxies

Following on from yesterday’s ELB post, it’s worth noting that there’s another common scenario where you can trigger issues when accessing ELBs – many corporates enforce the use of an HTTP proxy for all outgoing traffic, sometimes transparently, other times less so.

Having a multi-AZ ELB and accessing it from your own data center isn’t too much of an issue – if each host does its own DNS lookup, your hosts should roughly end up with a 50/50 split across AZs, as each one resolves its own DNS record.

But when a proxy is added to the mix, it breaks this, since proxies tend to do their own DNS lookups and cache the results for use by other clients. Testing with Squid showed that the DNS caching for Squid would favour a particular AZ and send all traffic to that one AZ, before then flipping to the other AZ when the DNS cache expired and was refreshed – in my case, every 5mins when the TTL expired.

I'm so sick of these motherfucking ELBs in this motherfucking cloud!

Go home Amazon, you’re drunk

If you’re using Squid, there’s little you can do to work around this – whilst you can adjust the Squid DNS caching times and approaches, short of disabling DNS caching and taking the performance hit of a DNS lookup for every new outbound request, you will always end up with load jumping between both AZs and causing havoc.
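For reference, the knobs involved are Squid’s DNS TTL settings – something like the following in squid.conf (values illustrative) shrinks the window, at the cost of more DNS lookups, but doesn’t change the underlying flip-flop behaviour:

# Illustrative squid.conf settings - shorter caching just means more
# frequent flips between AZs, not an even split.
positive_dns_ttl 60 seconds
negative_dns_ttl 10 seconds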

There’s a few workaround options:

  • Have multiple Squid proxies for your outbound traffic and load balance between them on your network – if you load balance outgoing traffic across 4+ different outbound Squid servers, your load should end up going to the different AZs a *bit* more evenly, but it’s still not guaranteed.
  • Create an internal ELB and access via your VPC link, allowing you to bypass your company’s outbound network proxies (as traffic routes via VPN or Direct Connect) – but then you’re paying for 2x ELBs – one external for end users and one internal for your own systems.
  • Replace the ELB with something actually useful (eg a Varnish or HA-Proxy instance in Amazon).
  • Get rid of the outbound proxies, please! I could write a business case for it based on the amount of money I've seen proxies waste at so many different companies (hint: engineers' time spent debugging issues is much more expensive than a couple of GB of extra data usage).
  • Gin.

Russian roulette with ELBs and CDNs

In my day job, I look after a number of websites, all of which generally make heavy use of CDNs (Content Distribution Networks) to offload traffic to edge nodes near an end user's device. In our case we use Akamai, one of the largest and most experienced providers in the world.

A large number of our clusters and applications now run on Amazon’s public cloud service here in Sydney, making use of EC2 instances and ELBs. Due to the important nature of our systems, we have almost all applications in active-active multi-AZ (Availability Zone) configurations. The intention of this design is that the ELB (Elastic Load Balancer) serves all incoming traffic by dividing it across each availability zone in equal proportions. If either Amazon AZ fails, the other will continue to serve requests like nothing is wrong.

It's a nicer solution than the traditional data center approach of an active-passive multi-site design: with both AZs constantly active and serving requests, we know that production and "DR" are always in a functional working state, ready to handle traffic, plus your investment in DR isn't going to waste the way traditional idle standby servers do.

Unfortunately Amazon ELBs offer only the barest of no-frills features, which makes them a bit stupid at times. In particular, Amazon's multi-AZ ELBs actually consist of two separate ELBs, one in each AZ. Incoming traffic selects an ELB by means of DNS round robin and is then directed to a server in that particular AZ.

Thus each availability zone has its own ELB, which adds its own IP address to the DNS round robin, and it looks something like this:

www.example.com is an alias for www-example-com-elb.jws.elb.amazonaws.com.
www-example-com-elb.jws.elb.amazonaws.com. has address 172.16.32.1
www-example-com-elb.jws.elb.amazonaws.com. has address 192.168.0.1

The problem is that DNS round robin has no guarantee of balancing the load evenly across the two data centers. If a particular company’s proxy server caches one address, it may direct traffic for the whole company to AZ-A and deliver no traffic to AZ-B.

In reality, due to the large number of users getting assigned different IP addresses with round robin, users tend to be spread somewhat evenly across the different AZs, making the problem a somewhat moot point when you have sizeable visitor numbers.

But if you add Akamai to the mix, you can end up with interesting results – it turns out that Akamai Edge nodes in AU use a central source of DNS information, which can lead to them favouring a particular ELB IP address. And since *all* your traffic goes via the CDN, this in turn results in all your traffic going directly to a single AZ and ignoring the other one entirely.

In a real-world scenario with a 4-webserver cluster, we saw traffic jump between the AZs whenever Akamai's edge servers updated DNS to a different IP address, as per the graph below:

Time to really test that your application is active-active!

Akamai decides to switch which ELB it’s using from A to B :-/

This swapping brings about some really nasty issues. In theory your active-active setup should be large enough to handle all your usual traffic load on just one AZ, but if that's not the case, bad things will happen to your site's performance and/or reliability.

The other nasty issue is with auto-scaling: this swapping messes with the CloudWatch metrics behind your autoscale policies/triggers – one AZ is completely idle, the other is maxed out, the averaged stats show a half-busy cluster, and so no scale-up is ever triggered to handle the load.

And even if you’re clever and set your autoscaling to also trigger based on ELB latency/errors/throughput, you may still end up with issues, since the new host created during the autoscale may end up in the idle AZ, instead of the active AZ where you need it.
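To put some numbers on the CloudWatch problem: if AZ-A is running at 90% CPU and AZ-B is idling at 5%, the averaged metric reads roughly 47% – comfortably under a typical 70% scale-up threshold, so nothing ever fires. A hedged sketch of the sort of alarm I mean, with the names and the policy ARN as placeholders rather than anything from our real config:

# Scale-up alarm on the *average* CPU of the whole auto scaling group.
# Because the Average statistic spans instances in both AZs, a single
# saturated AZ never pushes it over the threshold.
aws cloudwatch put-metric-alarm \
    --alarm-name webcluster-scale-up \
    --namespace AWS/EC2 \
    --metric-name CPUUtilization \
    --statistic Average \
    --dimensions Name=AutoScalingGroupName,Value=webcluster-asg \
    --period 300 \
    --evaluation-periods 2 \
    --threshold 70 \
    --comparison-operator GreaterThanOrEqualToThreshold \
    --alarm-actions <scaling-policy-arn>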

Using a smarter system for load balancing can negate the issue – for example, a pair of Varnish or HAProxy servers configured to do cross-AZ load balancing would work around it by spreading all the traffic arriving in one AZ across the servers in both AZs. This does come with increased costs (extra EC2 instances, inter-AZ traffic) and may have performance issues depending on the amount of traffic pouring into your instances.
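As a rough sketch of what that looks like (IPs and hostnames made up), the HAProxy instance in each AZ simply lists the web servers from both AZs in its backend, so whichever ELB/AZ Akamai happens to favour, requests still fan out across the whole fleet:

# haproxy.cfg fragment – each proxy knows about the web servers in BOTH AZs
backend web_cluster
    balance roundrobin
    option httpchk GET /healthcheck
    server web-a1 10.0.1.10:80 check   # AZ-A
    server web-a2 10.0.1.11:80 check   # AZ-A
    server web-b1 10.0.2.10:80 check   # AZ-B
    server web-b2 10.0.2.11:80 check   # AZ-B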

Additionally, if you have a global audience, rather than a mostly single-country audience like us, you may not see the issue, since the different Akamai regions around the world will balance load somewhat equally across the two AZs.

To properly fix this behaviour with Akamai, you need to open a professional services request and have the SureRoute configuration adjusted so that Akamai forces the edge nodes to look up the origin IPs at the edge:

<!-- SR fix to handle multiple origin IP's -->
<forward:cache-parent.sureroute2.force-origin-ip-from-edge>on
</forward:cache-parent.sureroute2.force-origin-ip-from-edge>
<forward:cache-parent.sureroute2.round-robin.status>on
</forward:cache-parent.sureroute2.round-robin.status>

<!-- no host in sureroute stat-key -->
<forward:cache-parent.sureroute2.stat-key.host>off
</forward:cache-parent.sureroute2.stat-key.host>

With this fixed configuration, Akamai will correctly spread load evenly across our two AZs and our load graphs settled comfortably back into normality. I’m not entirely sure why this configuration isn’t default SureRoute behaviour, but like many things with Akamai, there are often mysterious adjustments that only professional services know about or can make.

Finally it’s worth noting that this issue isn’t unique to Amazon – you could get the same issue if you run active-active conventional data centers and use Akamai for offload. It may also be an issue with other CDNs by default, so double-check the behaviour of your particular vendor – it would be interesting to see if CloudFront (Amazon’s CDN) exhibits similar issues or not.

Credit to my colleague Andrew for spotting this issue originally and having to deal with two different vendors' support cases at once to get to the bottom of the root cause.

DAViCal, awkward name, great features

A recurring theme of this blog is that I love to be able to use open standards and open source for storing and accessing my information – the biggest example is of course IMAP for email, but I also use tools such as Mozilla Sync Server for self-reliant synchronization and backup of client device information, without using external cloud providers.

I've been a user of Evolution for almost a decade now – sometimes criticized as the "Outlook of Linux", Evolution provides mail, calendaring, contacts and todo lists to the GNOME desktop, with a pretty large but sometimes slightly buggy feature set. For me personally it's always done a great job, and it's my key business productivity tool.

I moved all my mail onto an IMAP server years ago, which makes it easy to shift clients if I ever need to – in my case, pretty much just needing to access mail from both my laptop and smart phone.

However this hasn't been the case for other key data such as calendaring and contacts. A few years ago the open source calendaring solutions available weren't that well developed, and many clients suffered limitations such as read-only functionality.

Thankfully this has been changing – most clients (*glares angrily at Microsoft Outlook*) now support CalDAV and CardDAV quite reliably, which gives us an open standard that works across different programs, platforms and device types.

  • CalDAV is an open standard for the exchange of calendaring and task/todo/memo information between a client and a server.
  • CardDAV is an open standard for the exchange of contact/address book information between a client and a server.
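Under the hood both are thin extensions of WebDAV, so you can poke at a server with nothing more than curl – a rough example (server, account and password all made up) of the principal-discovery request most CalDAV/CardDAV clients perform as their first step:

curl -s -X PROPFIND https://dav.example.com/ \
     -u alice:hunter2 \
     -H 'Depth: 0' \
     -H 'Content-Type: application/xml' \
     --data '<?xml version="1.0" encoding="utf-8"?>
<propfind xmlns="DAV:">
  <prop><current-user-principal/></prop>
</propfind>'

The server answers with an XML body pointing at the user's principal URL, from which the client goes on to discover the calendar and address book collections.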

These two standards have a number of implementations, both open source and proprietary. Of note are Apple Calendar Server, which is Apple's open source implementation, and DAViCal, an open source LAMP-based server solution that is becoming quite popular.

I've used both solutions – my employer runs an Apple Calendar Server after getting fed up with not having free/busy between engineers. Whilst we ended up running a MacOS server, the Linux ports have improved and there are resources for setting it up on a Linux or even BSD host.

Apple Calendar Server works reasonably well with Evolution – I never have any issue booking events – however Evolution appears unable to accept or decline meeting requests, forcing me to go to the calendar server's web interface, which is actually pretty horrific.

I decided I wouldn't use it for my own personal calendaring; even if I went to the effort of porting it onto my Linux servers, it wouldn't really be the solution I ideally want, as it lacks a lot of features and isn't as easy to configure as other Linux services.

Instead I had a look at DAViCal. It's a feature-packed calendaring and contacts application developed primarily by Morphoss in Wellington NZ, started by Andrew McMillan of ex-Catalyst IT fame.

Despite having an annoyingly tricky name to type (you try typing it for the 100th time at 3am without typoing on the capitalization!!), the software itself appears reliable and worked across a number of devices when I ran tests.

It's not perfect – I have some issues with the user interface design. Whilst it's very functional and effective, it's not that intuitive for a new user, exposing far too many options right from the start. Ideally there would be a simple/advanced mode, so a user who just wants to add calendars and do basic stuff can do so, then dig into the more detailed ACLs, tokens, shared calendars, etc as needed.

Naturally it’s open source, so I should stop complaining and hack up some code to demonstrate what I think might be better. Maybe if people would stop stealing my car I’d have time to get something done. :-/

Main Screen Turn On! (Maybe some more 1-2-3 clear setup flows here would be nice – the wall of text is kind of off-putting for visual people like myself).

Options! All the options!

The web-based interface is only for administration – there isn't a web-based calendar app provided with DAViCal. Instead you choose whichever CalDAV client you wish to use with it, whether web-based or client-side.

I haven't given DAViCal's feature set a full workout yet; at this stage I've just set up my personal calendar, contacts and todo list on both Evolution and my Android ICS phone, but haven't touched meeting requests, shared calendars or free/busy information.

My testing is also a bit limited since I'm only running Evolution 2.30.3 (Debian Stable), which is a little outdated, and it looks like there's some functionality missing/broken there that might not be an issue on newer releases.

On the mobile side I'm using "aCal", an open source Android application written by the DAViCal developer, providing a CalDAV calendar, todo list and read-only contact/address book synchronization.

This now means I can add, edit and delete calendar and task entries on either my Android phone or my Linux laptop via Evolution and have them propagate to the other device – although unfortunately this is based on polling rather than push (it looks like push is theoretically possible via an extension to the standard, and works with iCal).

Tasks & calendar entries in a bright sunny UI

I can also get read-only copies of all my contact information from Evolution synced through to my phone, but sadly there isn’t support for editing contacts on the Android phone just yet.

I did also consider using LDAP for my address book entries, but CardDAV looks like a better designed solution – it's very rare to see "LDAP" mentioned without "headache" in the same sentence, and this comes from someone who maintains and supports LDAP enterprise environments…

Essentially the main problem with LDAP is that there isn't an exact standard for address entries, so what works for one client might not work 100% for another – along with the limited selection of decent applications for actually managing LDAP address databases.

Also, some clients assume an LDAP directory is going to be a million+ record store and expose a different UI compared to that for smaller address books, which harms the user experience (*glares at Evolution*).

aCal & Android ICS address book integration - note the uneditable edit screen on the right, read-only for now :-(

The other main issue with aCal is that it doesn't sync with the native OS calendar program, but instead provides its own. Digging through the documentation and mailing list, this appears to be due to the native application lacking support for some of the functionality needed for a proper CalDAV implementation, so a sync solution would leave certain features missing – although I'd still like the option.

Of course these are limitations of aCal, not of DAViCal or the standards themselves – there are some other CalDAV & CardDAV sync programs available in the Android market under non-open licenses, which you have the option of trying.

The nice thing about using standards is that you can have multiple vendors competing to make the best product/tool for their customer’s needs, not simply using lock-in to maintain/force a customer base. :-)

Overall DAViCal seems really nice and in my testing has been quite reliable – I'm now moving on to more rigorous testing and am in the process of migrating my calendar and contacts information into it; once I start using it daily in the real world the true testing begins.

I'm keen to take a look at what options I have around exposing some information publicly, eg sharing free/busy schedules with friends on different servers.

Google Search & Control

I’ve been using Google for search for years, however it’s the first time I’ve ever come across a DMCA takedown notice included in the results.

Possibly not helped by the fact that Google is so good at finding what I want that I don't tend to scroll past the first few entries 99% of the time, so it's easy to miss things at the bottom of the page.

Lawyers, fuck yeah!

Turns out that Google has been doing this since around 2002 and there’s a form process you can follow with Google to file a notice to request a search result removal.

Sadly I suspect that we are going to see more and more situations like this as governments introduce tighter internet censorship laws and key internet gatekeepers like Google are going to follow along with whatever they get told to do.

Whilst people working at Google may truly subscribe to the "Don't be evil" slogan, the fundamental fact is that Google is a US-based company that is legally required to do what's best for its shareholders – and the best thing for the shareholders is not to fight the government over legislation, but to implement whatever is required and keep selling advertising.

In response to concerns about Google and privacy, I've seen a number of people shift to new options, such as the increasingly popular and open-source friendly Duck Duck Go search engine, or even Microsoft's Bing, which isn't too bad at getting decent results and has a UI looking much more like early Google.

However these alternatives all suffer from the same fundamental problem – they’re centralized gatekeepers who can be censored or controlled – and then there’s the fact that a centralised entity can track so much about your online browsing. Replacing Google with another company will just leave us in the same position in 10 years time.

Lately I've been seeking to remove all the centralized providers from my online life, moving to self-run and federated services – basic stuff like running my own email and instant messaging (XMPP), but also more complex "cloud" services being delivered by federated or self-run servers for tasks such as browser syncing, avatar handling, contacts sync, avoiding URL shorteners and quitting or replacing social networks.

The next big one on the list is finding an open source and federated search solution – I'm currently running tests with a search engine called YaCy, a peer-to-peer decentralised search engine made up of thousands of independent servers sharing information between themselves.

To use YaCy, you download and run your own server, set its search indexing behaviour and let it run and share results with other servers (it's also possible to run it in a disconnected mode for indexing your internal private networks).
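Getting a node going is fairly painless – a rough sketch of what I did in testing, assuming a Java runtime is already installed and glossing over the exact tarball name, which changes with each release:

# unpack the release tarball grabbed from yacy.net
tar xzf yacy_*.tar.gz
cd yacy

# start the peer – the admin/search interface listens on port 8090 by default
./startYACY.sh

# and stop it again when done
./stopYACY.sh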

The YaCy homepage has an excellent write-up of the philosophy and design fundamentals behind the application.

It's still a bit rough and I think the search results could be better – but that's something more nodes will certainly help with, and the idea is promising. I'm planning to set up a public instance on my server in the near future, adding all my sites to the index and providing a good test of its feasibility.