Tag Archives: gnu

Rescuing a corrupt tarfile

Having upgraded my OS recently, I was using a poor quality sneakernet of free USB sticks to transfer some data from my previous installation. This dodgy process strangely enough managed to corrupt my .tar.bz2 file, leaving me in the position of having to go to other backups to recover my data. :-(

$ tar -xkjvf corrupt_archive.tar.bz2
....
jcarr/Pictures/fluffy_cats.jpg
jcarr/Documents/favourite_java_exceptions.txt

bzip2: Data integrity error when decompressing.
    Input file = (stdin), output file = (stdout)

It is possible that the compressed file(s) have become corrupted.
You can use the -tvv option to test integrity of such files.

You can use the `bzip2recover' program to attempt to recover
data from undamaged sections of corrupted files.

tar: Unexpected EOF in archive
tar: Error is not recoverable: exiting now

This is the first time I’ve ever experienced a corruption like this with .tar.bz2. The file was the expected size, so it wasn’t a case of a truncated file; the data was all there, but something part way through the file was corrupted, causing bzip2 to fail during decompression.

Bzip2 comes with a recovery utility, which works by rescuing each block into an individual file. We then run -t over them to identify any blocks which are clearly corrupt, and delete them accordingly.

$ bzip2recover corrupt_archive.tar.bz2
$ bzip2 -t rec*.tar.bz2
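
If there are a lot of blocks, a simple shell loop can automate the test-and-delete step (a rough sketch; the glob matches the rec-prefixed files that bzip2recover writes out):

$ for f in rec*.tar.bz2; do bzip2 -t "$f" 2>/dev/null || rm -v "$f"; done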

Then we can put the surviving blocks back together into an uncompressed form of the original file (in this case, a tar archive):

$ bzip2 -dc rec*.tar.bz2 > recovered_data.tar

Finally we want to extract the tar file itself to get our data back. However, tar might not be too happy about having lost some blocks, or about other forms of corruption.

# tar -xvf recovered_data.tar
...
jcarr/Pictures/fluffy_cats.jpg
jcarr/Documents/favourite_java_exceptions.txt
tar: Skipping to next header
tar: Archive contains ‘\223%\322TGG!XہI.’ where numeric off_t value expected
tar: Exiting with failure status due to previous errors

I couldn’t figure out a way to get tar to skip over the corruption or repair the file; however, I did find a few posts online suggesting the use of the much older cpio utility, which still exists on most Unixes today.

$ cpio -ivd -H tar < recovered_data.tar

This worked perfectly! cpio complained about some files it couldn’t recover, but it restored the vast majority of the damaged archive’s contents. Of course I can’t completely trust any of the restored files, as it’s always possible there’s some small corruption remaining after a recovery like this. However, if you lack backups, or your backups themselves are corrupted, this could be the way to get back some of your precious data.

In this case I was lucky that the header of the file was still intact. If bzip2 or tar can’t read the file header to identify it as a tar.bz2 to begin with, other measures may need to be taken. There are heaps of suggestions online; just make a copy of the corrupted file first, then try the different suggested methods until you find an approach that (hopefully) works for you.

Thunderbolt and other Macbook hardware issues with Linux

Having semi-recently switched to a Macbook Pro Retina 15″ at work, I decided to give MacOS a go. It’s been interesting: it’s not too bad an operating system, and whilst it’s something I could use on an ongoing basis, I quickly longed for the happy embrace of GNU/Linux, where I have a bit more power and control over the system.

Generally the Linux kernel supports most of the Macbook hardware out-of-the-box (as of 3.15, anyway), but with a couple of exceptions:

  • I believe support for the dual GPU mode switching is now fixed, however the model I’m using now is Intel only, so I can’t test this unfortunately.
  • The Apple Webcam does not yet have a driver. The older iSight driver doesn’t work, since the new gen of hardware is a PCIe connected device, not USB.
  • The WiFi requires a third party driver to be built for your kernel. You’ll want the latest Broadcom 802.11 STA driver in order for it to build with new kernel versions. Ubuntu users, get this version, or more recent.
  • If you’re having weird hangs where the Macbook just halts frequently waiting on I/O, add the “libata.force=noncq” kernel parameter (see below for one way to set it). It seems that there is some bug with this SSD and some kernel versions that leads to weird I/O halts, which is fixed by this option.
  • Thunderbolt support is limited to devices connected at boot up; there’s no hotplug. Additionally, when using Thunderbolt, suspend/resume is disabled (although it works fine if there’s no Thunderbolt device involved).
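
If you’ve never set a kernel parameter before, here’s a rough sketch of how to add the libata option on an Ubuntu/Debian style GRUB 2 setup (adjust to suit your distro and any existing options). Edit /etc/default/grub so the default command line includes the option, eg:

GRUB_CMDLINE_LINUX_DEFAULT="quiet splash libata.force=noncq"

Then regenerate the GRUB configuration and reboot:

$ sudo update-grub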

Of all these issues, the lack of Thunderbolt support was the one that really frustrated me, since I currently need to use a Thunderbolt-based Ethernet adaptor on a daily basis, and I rely heavily on suspend and resume.

Thankfully two kernel developers, Andreas Noever and Matthew J Garrett, have been working on a series of kernel patches that introduce support for Thunderbolt hotplug and thus allow it to keep working across suspend and resume.

Sadly whilst this patch is awesome, it doesn't yet do wireless Thunderbolt for when the ethernet cable you want is too bloody short.

You too can now enjoy the shackles of a wired LAN connection like it’s 1990 all over again!

It doesn’t sound like it has been easy, based on the posts on MJG’s blog (which are well worth a read): essentially the Apple firmware does weird things with the Thunderbolt hardware when the OS doesn’t identify itself as Darwin (MacOS’s kernel), and likes to power stuff down after suspend/resume, so it’s taken some effort to debug and put in hardware-specific workarounds.
It will surely only be a matter of time before these awesome patches are merged, but if you need them right now and are happy to run rather beta kernel patches (who isn’t??) then the easiest way is to check out their Git repo of 3.15 with all the patches applied. This repository should build cleanly via the usual means, and provide you with a new kernel module called “thunderbolt”.

I’ve been testing it for a few days and it looks really good. I’ve had no kernel panics, freezes, devices failing to work or any issues with suspend/resume with these patches: the features that they claim to work, just work. The only catches are:

  • If you boot the Macbook with the Thunderbolt device attached, it will be treated like a PCIe hotplug device… except that when you remove it, that Thunderbolt port won’t work again until the next restart. I recommend booting the Macbook with no devices attached, then hotplug once started to avoid this issue. I always remove before suspend and re-connect after resume as well (mostly because it’s a laptop and it’s easy to do so and avoid any issues).
  • The developers advise that Thunderbolt Displays don’t work at this time (however Mini DisplayPort connected screens work fine, even though they share the same socket).
  • The developers advise that chaining Thunderbolt devices is not yet supported. So stick to one device per port for now.

If you’re using Linux on a Macbook, I recommend grabbing the patched source and doing a build. Hopefully all these patches make their way into 3.16 or 3.17 and make this post irrelevant soon.
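
For reference, on a Debian/Ubuntu system the build usually looks something like the following; the repository URL below is just a placeholder for the developers’ patched 3.15 tree, so substitute the real one:

$ git clone <patched-3.15-thunderbolt-repo> linux-thunderbolt
$ cd linux-thunderbolt
$ cp /boot/config-$(uname -r) .config    # start from your current kernel config
$ make olddefconfig                      # accept defaults for any new options
$ make -j$(nproc) deb-pkg                # build installable kernel .deb packages
$ sudo dpkg -i ../linux-image-*.deb ../linux-headers-*.deb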

If you’re extra lazy and trust a random blogger’s binary packages, I’ve built deb packages for Ubuntu 13.10 (they should work just fine on 14.04 as well) for both the Thunderbolt enabled kernel and the Broadcom WiFi driver. You can download these packages here.

Incur the Wrath of Linux

Linux is a pretty hardy operating system that will take a lot of abuse, but there are ways to make even a Linux system unhappy and vengeful by messing with available resources.

I’ve managed to trigger all of these at least once (sometimes a few times before I finally learn), so I’ve decided to sit down and make a list for anyone interested.

 

Disk Space

Issue:

Running out of disk. This is a wonderful way to cause weird faults with services like databases, since processes will block (pause) until there is sufficient disk space available again to allow writes to complete.

This leads to some delightful errors, such as websites failing to load since the dynamic pages are waiting on the database, which in turn is waiting on disk. Or maybe Apache can’t write any more PHP session files to disk, so no PHP based pages load.

And mail servers love not having disk; thankfully in all the cases I’ve seen, Sendmail and Dovecot just halt and retain messages in memory without causing a loss of data (although a reboot whilst this is occurring could be interesting).

Resolution:

For production systems I always carefully consider the partition layout, creating separate partitions for key services such as databases, so that an issue like an out-of-control logging process or an overflowing tmp directory can’t impact their data.
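
As a rough illustration of the sort of layout I mean (purely an example; the volume names, sizes and mount points will vary with your workload):

/dev/vg00/root      /               # base OS
/dev/vg00/var_log   /var/log        # runaway logging can only fill this volume
/dev/vg00/tmp       /tmp            # temporary files kept away from everything else
/dev/vg00/mysql     /var/lib/mysql  # database data on its own volume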

This issue is pretty easy to avoid with good monitoring; packages such as Nagios include disk usage checks in the stock plugins that can alert at configurable thresholds (eg 80% of disk used).
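
For example, the stock check_disk plugin can be set to warn at 80% used and go critical at 90% by specifying the minimum free space remaining (plugin path and thresholds here are just typical values, adjust to taste):

$ /usr/lib/nagios/plugins/check_disk -w 20% -c 10% -p /var/lib/mysql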

 

Disk Access

Issue:

Don’t unplug a disk whilst Linux is trying to use it. Just don’t. Really. Things get really unhappy and you get to look at nice output from ps aux showing processes blocked for disk.
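
If you’re curious which processes are stuck, they’ll be the ones sitting in uninterruptible sleep (state “D” in the STAT column), which you can pick out with something like:

$ ps aux | awk '$8 ~ /^D/'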

The typical mistake here is unplugging a device like a USB hard drive in the middle of a backup process, causing the backup to halt, whilst the kernel spews warnings into the system logs about how naughty you’ve been.

Fortunately this is almost always recoverable: the process will eventually time out or terminate, and the storage device will work fine on the next connection, although possibly with some filesystem errors or a corrupt file if it was halfway through a write.

Resolution:

Don’t be a muppet. Or at least educate users that they probably shouldn’t unplug the backup drive if it’s flashing away busy still.

 

Networked Storage

Issue:

When using networked storage the kernel still considers the block storage to be just as critical as local storage, so if there’s a disruption accessing data on a network file system, processes will again halt until the storage returns.

This is a mixed blessing: in a server environment where the storage should always be accessible, halting can be the best behaviour, since your programs will wait for the storage to return and hopefully there will be no data loss.

However, in a mobile environment this can cause processes to hang indefinitely, waiting for storage that might never be reconnected.

Resolution:

In this case, the soft option can be used when mounting network shares, which will cause the kernel to return an error to the process using the storage if it becomes unavailable so that the application (hopefully) warns the user and terminates gracefully.
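
For example, a hand-mounted NFS share using the soft option with fairly aggressive timeouts might look something like this (server name and paths are just placeholders):

$ sudo mount -t nfs -o soft,timeo=30,retrans=3 fileserver:/export/home /mnt/home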

Using a daemon such as autofs to automatically mount and unmount network shares on demand can help reduce this sort of headache.

 

Low Memory

Issue:

Running out of memory. I don’t just mean RAM, but swap space too (pagefile for you Windows users). When you run out of RAM on almost any OS, it won’t be that happy; Linux handles this situation by killing off processes using the OOM killer in order to free up memory again.

This makes sense in theory (out of memory, so let’s kill things that are using it), but the problem is that it doesn’t always kill the ones you want, leading to anything from amusement to unmanageable boxes.

I’ve had some run-ins with the OOM killer before; it’s killed my ssh daemon on overloaded boxes, preventing me from logging into them. :-/
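
One partial mitigation is telling the OOM killer that a critical process is off-limits by dropping its score to the minimum; for example, to exempt the ssh daemon (assuming a single sshd master process):

$ echo -1000 | sudo tee /proc/$(pidof -s sshd)/oom_score_adj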

On the other hand, just giving your system many GB of swap space so that it doesn’t run out of memory isn’t a good fix either: swap is terribly slow, and your machine will quickly grind to a near-halt.

The performance of using swap is so bad it’s sometimes difficult to even log in to a heavily swapping system.

 

Resolution:

Buy more RAM. Ideally you shouldn’t be trying to run more than a box can actually handle; it’s possible to get by with some swap usage, but only to a small degree due to the performance pains.

In a virtual environment, I’m leaning towards running without swap and letting the OOM killer just kill processes on guests if they run out of memory; usually it’s better to take the hit of a process being killed than the more painful slowdown from swapping.
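
Going swapless on a guest is just a case of turning it off and making sure it stays off after a reboot, something like:

$ sudo swapoff -a
$ sudo sed -i '/\sswap\s/ s/^/#/' /etc/fstab    # comment out the swap entries in fstab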

And with VMs, if the worst case happens, you can easily reboot and console into the guests, compared to physical hosts where losing manageability is far more costly.

Of course this really depends on your workload and what you’re doing; the best solution is monitoring, so that you don’t end up in this situation in the first place.

Sometimes it just happens due to a once-off process, and it’s difficult to always foresee memory issues.

 

Incorrect Time

Issue:

Having the incorrect time on your server may appear to be only a nuisance, but it can lead to many other more devious faults.

Any applications which are time-sensitive can experience weird issues. I’ve seen problems such as Samba clients being unable to see files newer than the system time, and BIND breaking for all lookups. Clock issues are WEIRD.

Resolution:

We have NTP and it works well. Turn it on, and make sure the NTP process is included in your process monitoring list.
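
On most distributions that’s just a package install away, plus a quick check that it’s actually syncing; for example on Debian/Ubuntu:

$ sudo apt-get install ntp
$ ntpq -p    # lists the peers it’s currently syncing against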

 

Authentication Source Outages

Issue:

In larger deployments it’s often common to have a central source of authentication such as LDAP, Kerberos, Radius or even Active Directory.

Linux actually does a remarkable number of lookups against the configured authentication sources in regular operation. Aside from the lookup needed whenever a user wishes to login, Linux will query the user database every time the attributes of a file are viewed (for user/group information), which is pretty often.

There’s some level of inbuilt caching, but unless you’re running a proper authentication caching daemon that allows an off-line mode, a prolonged outage of the authentication server will not only make it impossible for users to login, but also break simple commands such as ls, as the process will be trying to look up user/group information.

Resolution:

There’s a reason why we always have two or more sources for key network services such as DNS and LDAP, take advantage of the redundancy built into the design.

However this doesn’t help if the network is down entirely, in which case the best solution is having the system configured to quickly failover to local authentication or to use the local cache.

Even if failover to a secondary system is working, a lot of the timeout defaults are too high (eg 300 seconds before trying the secondary). Whilst the lookups will still complete eventually, these delays will noticeably impact services, so it’s recommended to review the authentication methods being used and adjust the timeouts down to a couple of seconds at most.
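
As an illustration, with the older nss_ldap/pam_ldap stack those knobs live in /etc/ldap.conf and can be tightened to something like the following (option names differ if you’re using nslcd or sssd, and these values are just a starting point):

bind_timelimit 2
timelimit 2
bind_policy soft
nss_reconnect_tries 1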

 

These are just a few simple yet nasty ways to break Linux systems and cause weird application behaviour, not necessarily in a form that’s easy to debug.

In most cases, decent monitoring will help you avoid and handle many of these issues better by alerting to low resource situations – if you have nothing currently, Nagios is a good start.