Monthly Archives: August 2013

WordPress & SSL Fixes

I’ve been using WordPress for this blog for a number of years now – at some point I realised that whilst writing my own code is fun, there’s no need to reinvent yet-another-fucking-blog-platform, and I ended up selecting WordPress for my content on the basis of its strong and active development and community.

Generally it’s pretty good, but there are times it disappoints, such as WordPress expecting servers to have FTP for unpacking updates and plugins (it’s 2013 guys, SFTP at least!), setting cookies excessively in ways that make caching layers more complex, and doing stupid stuff like storing full URLs inside the database for page links and image resources.

The latter has been impacting me in particular. Visitors to my site have had the option of using HTTP or HTTPS (SSL secured) access methods for some time, but annoyingly whenever I posted an article with images, WordPress included all the images using http://. This mixed content prevents browsers from showing the lock icon (best case) or throws up a nasty error (worst case), depending on the browser and its level of concern for user safety around mismatched content.

Dubious Firefox is dubious about this site, no lock icon of security here!

Despite having accessed the site on https://, WordPress still uses http:// for my images.

I could work around this by setting the WordPress base URL for my site to be https://www.jethrocarr.com, but then images served at the unsecured http:// site would also be served via SSL, which is just adding pointless load to the server (not that SSL termination really adds much load these days, but damnit, I’m being a purist here!).

I was hoping that it was a misconfiguration of my WordPress setup, but reading online it seems that this is a known issue with WordPress and a whole bunch of modules, hacks and themes have sprung up to fix/workaround the issue…

Of course there’s an easier way – fix it at the webserver layer! Both Nginx and Apache have modules to do substitutions in page content on load, for Nginx there’s HttpSubModule and for Apache there is mod_substitute. In my case with stock Apache 2.2 on CentOS 5, I was able to fix the whole issue by adding the following to my SSL vhost configuration:

# Fix SSL URLs thanks to WordPress hardcoding http:// links to images :'(
<Location />
    AddOutputFilterByType SUBSTITUTE text/html
    Substitute "s|http://www.example.com|https://www.example.com|"
</Location>
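
For Nginx, a roughly equivalent approach looks something like the sketch below – this assumes Nginx was built with the sub module (--with-http_sub_module) and uses the same example.com placeholders as the Apache snippet above:

# Same substitution trick via ngx_http_sub_module
location / {
    # Rewrite hardcoded http:// links in the generated HTML on the fly
    sub_filter      'http://www.example.com' 'https://www.example.com';
    sub_filter_once off;
}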

Following this, things look much better:

The lock icon of browser approval!

All media files are now https://, not http://

Technically this substitution will have some level of performance impact, as it has to process the generated HTML content and check for strings to replace, but the impact is so low that I wasn’t able to measure it amongst the usual variation of page response times – and it’s not going to be anywhere near as slow as mod_php and WordPress itself anyway. ;-)

Finally, if you haven’t already, you probably want to change the following in wp-config.php:

define('FORCE_SSL_ADMIN', true);

This forces all WordPress logins and wp-admin activities to take place under HTTPS which is a pretty good idea if you ever post to your blog from an unsecured network.

The Apache that wanted to be root

I’ve run into an issue a couple of times where some web applications on my server have broken following a restart of Apache, when the application in question calls external programs.

What seems to happen is that when an administrator restarts Apache during general maintenance of that server, Apache picks up some of the unwanted environmental settings from the root user account, in particular the variable HOME ends up getting set to the home directory of the root user account (/root).

Generally it won’t be an issue for web applications, but if they call an external application (in my case, Git), that external application may use the HOME environment variable to try to read or write configuration files.

# tail -n1 error.log
fatal: unable to access '/root/.config/git/config': Permission denied

In my case, Git kept dying with a fatal error, which led to a very confused sysadmin wondering why a process running as Apache was trying to read from the root user’s account…

By looking at the environmental settings for the Apache worker processes, we can see what’s happening. After a normal boot, the environmental variables look something like the below:

# ps aux | grep httpd
root     10173  0.0  1.6  27532  8496 ?        Ss   22:42   0:00 /usr/sbin/httpd
apache   10175  0.1  2.8  37560 14692 ?        S    22:42   0:01 /usr/sbin/httpd
apache   10176  0.1  2.8  37836 14952 ?        S    22:42   0:01 /usr/sbin/httpd
apache   10177  0.1  2.8  37332 14876 ?        S    22:42   0:01 /usr/sbin/httpd
apache   10178  0.1  2.8  37560 14692 ?        S    22:42   0:01 /usr/sbin/httpd

# cat /proc/10175/environ
TERM=dumbPATH=/sbin:/usr/sbin:/bin:/usr/binPWD=/LANG=CSHLVL=2_=/usr/sbin/httpd

Because Apache has been started by init, it has a nice clean environment. But after a restart by the root user, it’s clear that some cruft from the root user account has been pulled into the application environment variables:

# cat /proc/10175/environ

HOSTNAME=localhostSHELL=/bin/bashTERM=xtermHISTSIZE=1000USER=root:
MAIL=/var/spool/mail/rootPATH=/sbin:/usr/sbin:/bin:/usr/bin
INPUTRC=/etc/inputrcPWD=/rootLANG=CSHLVL=3HOME=/rootLOGNAME=root
LESSOPEN=|/usr/bin/lesspipe.sh %sG_BROKEN_FILENAMES=1_=/usr/sbin/httpd
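
(As an aside, the environ output runs together like that because the entries are NUL-separated – piping it through tr makes it rather more readable:)

# tr '\0' '\n' < /proc/10175/environ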

Because of these settings, external programs relying on the value of HOME will try to read/write to a directory that they aren’t permitted to use.

Debian-based systems fix this issue by unsetting certain environmentals (including HOME) in the bootscript for Apache, based on the rules in /etc/apache2/envvars.

To fix the issue on a RHEL/CentOS host, you can instead just append a replacement HOME setting into /etc/sysconfig/httpd. This particular configuration file is read at server startup and isn’t overwritten when Apache gets upgraded.

cat >> /etc/sysconfig/httpd << "EOF"
# Correct Apache's home directory
HOME=/var/www
EOF

Following a restart, Apache should now show the correct HOME environmental variable and your application should function as expected.
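
A quick way to double-check after restarting (assuming pgrep is available, as it is on a stock RHEL/CentOS install) – you should get back the corrected HOME=/var/www value:

# service httpd restart
# tr '\0' '\n' < /proc/$(pgrep -u apache httpd | head -n1)/environ | grep ^HOME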

Awstats 7.2 + extras RPMs

I’ve been a long term user of Awstats for reporting on visitor traffic to my websites. Whilst it’s a little dated, its simplicity and reliance only on the web server logs make it ideal for any application – not just general websites such as blogs, but also more specialised sites such as my package repositories, which can’t make use of more sophisticated client-side Javascript tracking methods as files are being downloaded by non-browser clients.

Simple web 1.0 goodness. No fancy AJAX graphs here son!

That repository server in particular (repos.jethrocarr.com) is now pushing 20-40GB of traffic per month to around 2500-3000 servers. Unfortunately Awstats doesn’t differentiate between general purpose file grabbers and the Yum downloader used by RPM-based distributions, which makes it difficult to see whether downloads are coming from real machines or from mirror scripts scanning and re-downloading files.

I also run dual-stack IPv4 and IPv6 – Awstats includes some useful GeoIP modules to look up where user traffic comes from, but it doesn’t support mixed IPv4 and IPv6 by default, and as my IPv6 traffic usage increases this could become a problem, with more and more visitors ending up in the “Unknown” country counter.

To fix this, I’ve written a patch for adding Yum user agent support and also merged in a patch by Sven Strickroth which adds a geoip6 module that does both IPv4 and IPv6 country lookups using the popular MaxMind GeoLite databases.
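
If you’re applying the patches yourself rather than using my packages, the geoip6 plugin replaces the stock geoip LoadPlugin line in your awstats site config – something along these lines (the plugin options and database path here are assumptions based on the stock geoip plugin convention, so check the plugin header for the exact syntax on your system):

# Replaces the usual geoip plugin line; does both IPv4 and IPv6 country lookups
LoadPlugin="geoip6 GEOIP_STANDARD /usr/share/GeoIP/GeoIPv6.dat"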

I’ve built packages for CentOS/RHEL/etc 5 & 6, which are available at my repositories at repos.jethrocarr.com. The awstats package I’ve built includes these two patches and also pulls in a current copy of MaxMind’s GeoIP database and required dependencies, so you’re all good to go immediately.

If you’re after the patches themselves, you can download them directly:

ELBs & Corporate Proxies

Following on from yesterday’s ELB post, it’s worth noting that there’s another common scenario where you can trigger issues when accessing ELBs – many corporates enforce the use of an HTTP proxy for all outgoing traffic, sometimes transparently, other times less so.

Having a multi-AZ ELB and accessing it from your own data center isn’t too much of an issue – if each host does its own DNS lookup, your hosts should roughly end up with a 50/50 split across AZs, as each one resolves its own DNS record.

But when a proxy is added to the mix, it breaks this, since proxies tend to do their own DNS lookups and cache the results for use by other clients. Testing with Squid showed that its DNS caching would favour a particular AZ and send all traffic to that one AZ, before flipping to the other AZ when the DNS cache expired and was refreshed – in my case, every 5 minutes when the TTL expired.

I'm so sick of these motherfucking ELBs in this motherfucking cloud!

Go home Amazon, you’re drunk

If you’re using Squid, there’s little you can do to work around this – whilst you can adjust the Squid DNS caching times and approaches, short of disabling DNS caching and taking the performance hit of a DNS lookup for every new outbound request, you will always end up with load jumping between both AZs and causing havoc.
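
For reference, the knobs in question are Squid’s DNS TTL caps – a minimal squid.conf sketch (the values are illustrative only); shortening these keeps the cached ELB records fresher at the cost of extra lookups, but it only softens the flip-flopping rather than fixing it:

# squid.conf – cap how long Squid trusts cached DNS answers
positive_dns_ttl 30 seconds
negative_dns_ttl 1 minutes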

There are a few workaround options:

  • Have multiple Squid proxies for your outbound traffic and load balance between them on your network – if you spread outgoing traffic across 4+ different outbound Squid servers, your load should end up going to different AZs a *bit* more evenly – but it’s still not guaranteed.
  • Create an internal ELB and access it via your VPC link, allowing you to bypass your company’s outbound network proxies (as traffic routes via VPN or Direct Connect) – but then you’re paying for 2x ELBs – one external for end users and one internal for your own systems.
  • Replace the ELB with something actually useful (eg a Varnish or HA-Proxy instance in Amazon).
  • Get rid of the outbound proxies please! I could write a business case for it based on the amount of money I’ve seen proxies waste at so many different companies (hint: engineers’ time debugging issues is much more expensive than a couple of GB extra data usage).
  • Gin.

Russian roulette with ELBs and CDNs

In my day job, I look after a number of websites, all of which generally make heavy use of CDNs (Content Distribution Networks) to offload traffic to edge nodes near an end user’s device. In our case we use Akamai, one of the largest and most experienced providers in the world.

A large number of our clusters and applications now run on Amazon’s public cloud service here in Sydney, making use of EC2 instances and ELBs. Due to the important nature of our systems, we have almost all applications in active-active multi-AZ (Availability Zone) configurations. The intention of this design is that the ELB (Elastic Load Balancer) serves all incoming traffic by dividing it across each availability zone in equal proportions. If either Amazon AZ fails, the other will continue to serve requests like nothing is wrong.

It’s a nicer solution than the traditional data center approach of having an active-passive multi-site design – with both AZs constantly active and serving requests, we know that production and “DR” are always in a functional working state, ready to handle traffic; plus your investment in DR isn’t going to waste the way traditional idle standby servers do.

Unfortunately Amazon ELBs offer only the barest of no-frills features, which makes them a bit stupid at times. In particular, Amazon’s multi-AZ ELBs actually consist of two separate ELBs, one in each AZ. Incoming traffic selects an ELB by means of DNS round robin and is then directed to a server in that particular AZ.

Thus, each availability zone has its own ELB, which adds its own IP address to the DNS round robin, and it looks something like this:

www.example.com is an alias for www-example-com-elb.jws.elb.amazonaws.com.
www-example-com-elb.jws.elb.amazonaws.com. has address 172.16.32.1
www-example-com-elb.jws.elb.amazonaws.com. has address 192.168.0.1

The problem is that DNS round robin has no guarantee of balancing the load evenly across the two data centers. If a particular company’s proxy server caches one address, it may direct traffic for the whole company to AZ-A and deliver no traffic to AZ-B.

In reality, due to the large number of users getting assigned different IP addresses via round robin, users tend to be spread fairly evenly across the different AZs, making the problem largely moot once you have sizeable visitor numbers.

But if you add Akamai to the mix, you can end up with interesting results – it turns out that Akamai Edge nodes in AU use a central source of DNS information, which can lead to them favouring a particular ELB IP address. And since *all* your traffic goes via the CDN, this in turn results in all your traffic going directly to a single AZ and ignoring the other one entirely.

In a real-world scenario of a 4 webserver cluster, we saw traffic jump between each AZ whenever Akamai’s edge servers updated DNS to a different IP address, as per the below graph:

Time to really test that your application is active-active!

Akamai decides to switch which ELB it’s using from A to B :-/

This swapping brings around some really nasty issues. In theory your active-active setup should be large enough to handle all your usual traffic load on just one AZ, but if that’s not the case, bad things will happen to your site performance and/or reliability.

The other nasty issue is that when doing auto-scaling with Amazon, this swapping messes with your CloudWatch metrics for autoscale policies/triggers – one AZ is completely idle, one AZ is maxed out, the averaged stats show a half-busy cluster, so nothing thinks it needs to autoscale upwards to handle the load.

And even if you’re clever and set your autoscaling to also trigger based on ELB latency/errors/throughput, you may still end up with issues, since the new host created during the autoscale may end up in the idle AZ, instead of the active AZ where you need it.

Using a smarter system for load balancing can negate the issue – for example, a pair of Varnish or HA-Proxy servers configured to do cross-AZ load balancing would work around it by spreading all the traffic coming into one AZ across all the servers in both AZs, but this does come with increased costs (running EC2 instances, inter-AZ traffic). It may also have performance issues depending on the amount of traffic pouring into your instances.
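
As a rough idea of what the HA-Proxy flavour of this looks like, here’s a minimal backend sketch – the addresses are hypothetical (10.0.1.x in one AZ, 10.0.2.x in the other) and it’s not a drop-in config:

# haproxy.cfg – spread requests arriving at either proxy across webservers in both AZs
backend web_cross_az
    balance roundrobin
    option httpchk GET /
    server web-a1 10.0.1.10:80 check
    server web-a2 10.0.1.11:80 check
    server web-b1 10.0.2.10:80 check
    server web-b2 10.0.2.11:80 check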

Additionally, if you have a global audience, rather than a mostly single-country audience like us, you may not see the issue, since the different Akamai regions around the world will balance load somewhat equally across the two AZs.

To properly fix this behaviour with Akamai, you need to open a professional services request and have the SureRoute configuration adjusted so that Akamai forces the edge nodes to look up the origin IPs at the edge:

<!-- SR fix to handle multiple origin IP's -->
<forward:cache-parent.sureroute2.force-origin-ip-from-edge>on
</forward:cache-parent.sureroute2.force-origin-ip-from-edge>
<forward:cache-parent.sureroute2.round-robin.status>on
</forward:cache-parent.sureroute2.round-robin.status>

<!-- no host in sureroute stat-key -->
<forward:cache-parent.sureroute2.stat-key.host>off
</forward:cache-parent.sureroute2.stat-key.host>

With this fixed configuration, Akamai will correctly spread load evenly across our two AZs and our load graphs settled comfortably back into normality. I’m not entirely sure why this configuration isn’t default SureRoute behaviour, but like many things with Akamai, there are often mysterious adjustments that only professional services know about or can make.

Finally it’s worth noting that this issue isn’t unique to Amazon – you could get the same issue if you run active-active conventional data centers and use Akamai for offload. It may also be an issue with other CDNs by default, so double-check the behaviour of your particular vendor – it would be interesting to see if CloudFront (Amazon’s CDN) exhibits similar issues or not.

Credit to my colleague Andrew for spotting this issue originally and having to deal with two different vendors’ support cases at once to get to the bottom of the root cause.

SSL Intermediate CA Bundles with Amazon

When configuring SSL services, generally you need to set a certificate, a private key and the CA bundle containing the intermediate certificate(s), which is often a bundle of several different certificates.

For example, https://www.jethrocarr.com’s configuration looks like:

SSLEngine on
SSLCertificateFile jethrocarr.com.crt
SSLCertificateKeyFile jethrocarr.com.key
SSLCertificateChainFile startssl.intermediate.ca.crt

When your browser connects, it doesn’t trust jethrocarr.com.crt directly, but it checks it against the intermediate certificates in startssl.intermediate.ca.crt that signed it – and those intermediates are in turn signed by a root CA that your browser does trust.

This means that the CAs can keep their root certificates – the ones trusted by the browser – much more securely protected, and instead sign customer certificates with intermediates that can be revoked and easily(ish) replaced should the need arise.

Generally this works fine from the end user perspective, although there are sometimes issues when a sysadmin forgets to add the intermediate CA bundle and doesn’t immediately notice, as some browsers work fine whilst others fail, depending on whether or not they already trust the intermediates.

Today I ran into a different issue, where Amazon Web Services is fussy about the order of the certificates in that bundle when adding a certificate to an Elastic Load Balancer for SSL termination.

Any attempt to upload my certificate was met with “Invalid Public Key Certificate”, which didn’t make a lot of sense as I was certain that my certificates were OK. It was easy to verify and prove this, using OpenSSL:

$ openssl rsa -noout -modulus -in example.com.key | openssl md5
(stdin)= 30e1b6cb4168117b7923392ca536c701

$ openssl x509 -noout -modulus -in example.com.crt | openssl md5
(stdin)= 30e1b6cb4168117b7923392ca536c701

$ openssl verify -verbose -CAfile cabundle.crt example.com.crt 
example.com.crt: OK

This proved that my certificates were all correct, so the fault was Amazon-side. A post on their forums helped me “fix” the issue by adjusting the order of the certificates in my CA bundle, which subsequently resolved the error.
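
If you want to see what order a bundle is actually in, OpenSSL can print the subject and issuer of each certificate in file order, which makes it easy to spot a chain that’s back to front:

$ openssl crl2pkcs7 -nocrl -certfile cabundle.crt | openssl pkcs7 -print_certs -noout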

So is this a bug with Amazon? It’s tricky to say – there are several posts online which state that the order is important for some systems, but not for all. Clearly anything based around OpenSSL doesn’t care, as it was able to verify my out-of-order CA bundle happily enough.

As one does with issues like this, I dug into RFC 3280, which details how certificate path validation should occur. Section 6.1 (Basic Path Validation) states that the procedure for obtaining the sequence of certificates is outside the scope of the specification, but then goes on to define how the validation could occur, with the order of the certificates being implied, but never stated outright.

The primary goal of path validation is to verify the binding between
a subject distinguished name or a subject alternative name and
subject public key, as represented in the end entity certificate,
based on the public key of the trust anchor.  This requires obtaining
a sequence of certificates that support that binding.  The procedure
performed to obtain this sequence of certificates is outside the
scope of this specification.

To meet this goal, the path validation process verifies, among other
things, that a prospective certification path (a sequence of n
certificates) satisfies the following conditions:

   (a)  for all x in {1, ..., n-1}, the subject of certificate x is
   the issuer of certificate x+1;

   (b)  certificate 1 is issued by the trust anchor;

   (c)  certificate n is the certificate to be validated; and

   (d)  for all x in {1, ..., n}, the certificate was valid at the
   time in question.

Following the above, the specification goes into detail on the different ways the path can be validated, which also implies that the certificates should be read in and then sorted by the software, but again it doesn’t actually state this outright.

Sadly, the way this specification is written isn’t clear, which means the only 100% certain way to keep everything happy is to put the certificates in the CA bundle file in the correct order – something I would expect the SSL provider to do when they provide you with the files.
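
For what it’s worth, “correct order” here means each certificate is immediately followed by the one that issued it – start with the intermediate that signed your certificate and work upwards towards the root (which itself usually gets left out of the bundle). A quick sketch with hypothetical file names:

# Hypothetical file names – direct issuer first, then its issuer, and so on
cat intermediate-server-ca.crt intermediate-primary-ca.crt > cabundle.crt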