Monthly Archives: August 2013

WordPress & SSL Fixes

I’ve been using WordPress for this blog for a number of years now – at some point I realised that whilst writing my own code is fun, there’s no need to reinvent yet-another-fucking-blog-platform, and I ended up selecting WordPress for my content on the basis of its strong and active development and community.

Generally it’s pretty good, but there are times it disappoints, such as WordPress expecting servers to have FTP for unpacking updates and plugins (it’s 2013 guys, SFTP at least!), setting cookies excessively in ways that make caching layers more complex, and doing stupid stuff like storing full URLs inside the database for page links and image resources.

The latter has been impacting me in particular. Visitors to my site have had the option of using HTTP or HTTPS (SSL secured) access methods for some time, but annoyingly whenever I posted an article with images, WordPress included all the images using http://. This mixed content prevents browsers from showing the lock icon (best case) or throws up a nasty error (worst case), depending on the browser and its level of concern for user safety around mismatched content.

Dubious Firefox is dubious about this site, no lock icon of security here!

Despite having accessed the site on https://, WordPress still uses http:// for my images.

I could work around this by setting the WordPress base URL for my site to be https://www.jethrocarr.com, but then images served at the unsecured http:// site would also be served via SSL, which is just adding pointless load to the server (not that SSL termination really adds much load these days, but damnit, I’m being a purist here!).

I was hoping that it was a misconfiguration of my WordPress setup, but reading online it seems that this is a known issue with WordPress and a whole bunch of modules, hacks and themes have sprung up to fix/workaround the issue…

Of course there’s an easier way – fix it at the webserver layer! Both Nginx and Apache have modules to do substitutions in page content on load, for Nginx there’s HttpSubModule and for Apache there is mod_substitute. In my case with stock Apache 2.2 on CentOS 5, I was able to fix the whole issue by adding the following to my SSL vhost configuration:

# Fix SSL URLs thanks to WordPress hardcoding http:// links to images :'(
<Location />
    AddOutputFilterByType SUBSTITUTE text/html
    Substitute "s|http://www.example.com|https://www.example.com|"
</Location>
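
For Nginx, a roughly equivalent approach looks something like the sketch below – this assumes Nginx was built with the sub module (--with-http_sub_module) and uses the same example.com placeholders as the Apache snippet above:

# Same substitution trick via ngx_http_sub_module
location / {
    # Rewrite hardcoded http:// links in the generated HTML on the fly
    sub_filter      'http://www.example.com' 'https://www.example.com';
    sub_filter_once off;
}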

Following this, things look much better:

The lock icon of browser approval!

All media files are now https://, not http://

Technically this substitution will have some level of performance impact, as it has to process the generated HTML content and check for strings to replace, but the impact is so low that I wasn’t able to measure it amongst the usual variation of page response times – and it’s not going to be anywhere near as slow as mod_php and WordPress itself anyway. ;-)

Finally, if you haven’t already, you probably want to change the following in wp-config.php:

define('FORCE_SSL_ADMIN', true);

This forces all WordPress logins and wp-admin activities to take place under HTTPS which is a pretty good idea if you ever post to your blog from an unsecured network.

The Apache that wanted to be root

I’ve run into an issue a couple of times where some web applications on my server have broken following a restart of Apache, when the application in question calls external programs.

What seems to happen is that when an administrator restarts Apache during general maintenance of that server, Apache picks up some of the unwanted environmental settings from the root user account, in particular the variable HOME ends up getting set to the home directory of the root user account (/root).

Generally it won’t be an issue for web applications, but if they call an external application (in my case, Git), that external application may use the HOME environment variable to try to read or write configuration files.

# tail -n1 error.log
fatal: unable to access '/root/.config/git/config': Permission denied

In my case, Git kept dying with a fatal error, which led to a very confused sysadmin wondering why a process running as Apache was trying to read from the root user’s account…

By looking at the environmental settings for the Apache worker processes, we can see what’s happening. After a normal boot, the environmental variables look something like the below:

# ps aux | grep httpd
root     10173  0.0  1.6  27532  8496 ?        Ss   22:42   0:00 /usr/sbin/httpd
apache   10175  0.1  2.8  37560 14692 ?        S    22:42   0:01 /usr/sbin/httpd
apache   10176  0.1  2.8  37836 14952 ?        S    22:42   0:01 /usr/sbin/httpd
apache   10177  0.1  2.8  37332 14876 ?        S    22:42   0:01 /usr/sbin/httpd
apache   10178  0.1  2.8  37560 14692 ?        S    22:42   0:01 /usr/sbin/httpd

# cat /proc/10175/environ
TERM=dumbPATH=/sbin:/usr/sbin:/bin:/usr/binPWD=/LANG=CSHLVL=2_=/usr/sbin/httpd

Because Apache has been started by init, it has a nice clean environment. But after a restart by the root user, it’s clear that some cruft from the root user account has been pulled into the application environment variables:

# cat /proc/10175/environ

HOSTNAME=localhostSHELL=/bin/bashTERM=xtermHISTSIZE=1000USER=root:
MAIL=/var/spool/mail/rootPATH=/sbin:/usr/sbin:/bin:/usr/bin
INPUTRC=/etc/inputrcPWD=/rootLANG=CSHLVL=3HOME=/rootLOGNAME=root
LESSOPEN=|/usr/bin/lesspipe.sh %sG_BROKEN_FILENAMES=1_=/usr/sbin/httpd
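
(As an aside, the environ output runs together like that because the entries are NUL-separated – piping it through tr makes it rather more readable:)

# tr '\0' '\n' < /proc/10175/environ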

Because of these settings, external programs relying on the value of HOME will try to read/write to a directory that they aren’t permitted to use.

Debian-based systems fix this issue by unsetting certain environmentals (including HOME) in the bootscript for Apache, based on the rules in /etc/apache2/envvars.

To fix the issue on a RHEL/CentOS host, you can instead just append a replacement HOME setting into /etc/sysconfig/httpd. This particular configuration file is read at server startup and isn’t overwritten when Apache gets upgraded.

cat >> /etc/sysconfig/httpd << "EOF"
# Correct Apache's home directory
HOME=/var/www
EOF

Following a restart, Apache should now show the correct HOME environmental variable and your application should function as expected.
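
A quick way to double-check after restarting (assuming pgrep is available, as it is on a stock RHEL/CentOS install) – you should get back the corrected HOME=/var/www value:

# service httpd restart
# tr '\0' '\n' < /proc/$(pgrep -u apache httpd | head -n1)/environ | grep ^HOME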

Awstats 7.2 + extras RPMs

I’ve been a long term user of Awstats for reporting on visitor traffic to my websites. Whilst it’s a little dated, its simplicity and reliance only on the web server logs make it ideal for any application – not just general websites such as blogs, but also more specialised sites such as my package repositories, which can’t make use of more sophisticated client-side Javascript tracking methods as files are being downloaded by non-browser clients.

Simple web 1.0 goodness. No fancy AJAX graphs here son!

That repository server in particular (repos.jethrocarr.com) is now pushing 20-40GB of traffic per month to around 2500-3000 servers. Unfortunately Awstats doesn’t differentiate between general purpose file grabbers and the Yum downloader used by RPM-based distributions, which makes it difficult to see whether downloads are coming from real machines or from mirror scripts scanning and re-downloading files.

I also run dual-stack IPv4 and IPv6 – Awstats includes some useful GeoIP modules to look up where user traffic comes from, but it doesn’t support mixed IPv4 and IPv6 by default, and as my IPv6 traffic usage increases this could become a problem, with more and more visitors ending up in the “Unknown” country counter.

To fix this, I’ve written a patch for adding Yum user agent support and also merged in a patch by Sven Strickroth which adds a geoip6 module that does both IPv4 and IPv6 country lookups using the popular MaxMind GeoLite databases.
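
If you’re applying the patches yourself rather than using my packages, the geoip6 plugin replaces the stock geoip LoadPlugin line in your awstats site config – something along these lines (the plugin options and database path here are assumptions based on the stock geoip plugin convention, so check the plugin header for the exact syntax on your system):

# Replaces the usual geoip plugin line; does both IPv4 and IPv6 country lookups
LoadPlugin="geoip6 GEOIP_STANDARD /usr/share/GeoIP/GeoIPv6.dat"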

I’ve built packages for CentOS/RHEL/etc 5 & 6, which are available at my repositories at repos.jethrocarr.com. The awstats package I’ve built includes these two patches and also pulls in a current copy of MaxMind’s GeoIP database and required dependencies, so you’re all good to go immediately.

If you’re after the patches themselves, you can download them directly:

ELBs & Corporate Proxies

Following on from yesterday’s ELB post, it’s worth noting that there’s another common scenario where you can trigger issues when accessing ELBs – many corporates enforce the use of an HTTP proxy for all outgoing traffic, sometimes transparently, other times less so.

Having a multi-AZ ELB and accessing it from your own data center isn’t too much of an issue – if each host does its own DNS lookup, your hosts should roughly end up with a 50/50 split across AZs, as each one resolves its own DNS record.

But when a proxy is added to the mix, it breaks this, since proxies tend to do their own DNS lookups and cache the results for use by other clients. Testing with Squid showed that its DNS caching would favour a particular AZ and send all traffic to that one AZ, before flipping to the other AZ when the DNS cache expired and was refreshed – in my case, every 5 minutes when the TTL expired.

I'm so sick of these motherfucking ELBs in this motherfucking cloud!

Go home Amazon, you’re drunk

If you’re using Squid, there’s little you can do to work around this – whilst you can adjust the Squid DNS caching times and approaches, short of disabling DNS caching and taking the performance hit of a DNS lookup for every new outbound request, you will always end up with load jumping between both AZs and causing havoc.
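
For reference, the knobs in question are Squid’s DNS TTL caps – a minimal squid.conf sketch (the values are illustrative only); shortening these keeps the cached ELB records fresher at the cost of extra lookups, but it only softens the flip-flopping rather than fixing it:

# squid.conf – cap how long Squid trusts cached DNS answers
positive_dns_ttl 30 seconds
negative_dns_ttl 1 minutes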

There are a few workaround options:

  • Have multiple Squid proxies for your outbound traffic and load balance between them on your network – if you spread outgoing traffic across 4+ different outbound Squid servers, your load should end up going to different AZs a *bit* more evenly – but it’s still not guaranteed.
  • Create an internal ELB and access it via your VPC link, allowing you to bypass your company’s outbound network proxies (as traffic routes via VPN or Direct Connect) – but then you’re paying for 2x ELBs – one external for end users and one internal for your own systems.
  • Replace the ELB with something actually useful (eg a Varnish or HA-Proxy instance in Amazon).
  • Get rid of the outbound proxies please! I could write a business case for it based on the amount of money I’ve seen proxies waste at so many different companies (hint: engineers’ time debugging issues is much more expensive than a couple of GB extra data usage).
  • Gin.

Russian roulette with ELBs and CDNs

In my day job, I look after a number of websites, all of which generally make heavy use of CDNs (Content Distribution Networks) to offload traffic to edge nodes near an end user’s device. In our case we use Akamai, one of the largest and most experienced providers in the world.

A large number of our clusters and applications now run on Amazon’s public cloud service here in Sydney, making use of EC2 instances and ELBs. Due to the important nature of our systems, we have almost all applications in active-active multi-AZ (Availability Zone) configurations. The intention of this design is that the ELB (Elastic Load Balancer) serves all incoming traffic by dividing it across each availability zone in equal proportions. If either Amazon AZ fails, the other will continue to serve requests like nothing is wrong.

It’s a nicer solution than the traditional data center approach of having an active-passive multi-site design – with both AZs constantly active and serving requests, we know that production and “DR” are always in a functional working state, ready to handle traffic; plus your investment in DR isn’t going to waste the way traditional idle standby servers do.

Unfortunately Amazon ELBs offer only the barest of no-frills features, which makes them a bit stupid at times. In particular, Amazon’s multi-AZ ELBs actually consist of two separate ELBs, one in each AZ. Incoming traffic selects an ELB by means of DNS round robin and is then directed to a server in that particular AZ.

Thus, each availability zone has its own ELB, which adds its own IP address to the DNS round robin, and it looks something like this:

www.example.com is an alias for www-example-com-elb.jws.elb.amazonaws.com.
www-example-com-elb.jws.elb.amazonaws.com. has address 172.16.32.1
www-example-com-elb.jws.elb.amazonaws.com. has address 192.168.0.1

The problem is that DNS round robin has no guarantee of balancing the load evenly across the two data centers. If a particular company’s proxy server caches one address, it may direct traffic for the whole company to AZ-A and deliver no traffic to AZ-B.

In reality, due to the large number of users getting assigned different IP addresses via round robin, users tend to be spread fairly evenly across the different AZs, making the problem largely moot once you have sizeable visitor numbers.

But if you add Akamai to the mix, you can end up with interesting results – it turns out that Akamai Edge nodes in AU use a central source of DNS information, which can lead to them favouring a particular ELB IP address. And since *all* your traffic goes via the CDN, this in turn results in all your traffic going directly to a single AZ and ignoring the other one entirely.

In a real-world scenario of a 4 webserver cluster, we saw traffic jump between each AZ whenever Akamai’s edge servers updated DNS to a different IP address, as per the below graph:

Time to really test that your application is active-active!

Akamai decides to switch which ELB it’s using from A to B :-/

This swapping brings around some really nasty issues. In theory your active-active setup should be large enough to handle all your usual traffic load on just one AZ, but if that’s not the case, bad things will happen to your site performance and/or reliability.

The other nasty issue is that when doing auto-scaling with Amazon, this swapping messes with your CloudWatch metrics for autoscale policies/triggers – one AZ is completely idle, one AZ is maxed out, the averaged stats show a half-busy cluster, so nothing thinks it needs to autoscale upwards to handle the load.

And even if you’re clever and set your autoscaling to also trigger based on ELB latency/errors/throughput, you may still end up with issues, since the new host created during the autoscale may end up in the idle AZ, instead of the active AZ where you need it.

Using a smarter system for load balancing can negate the issue – for example, a pair of Varnish or HA-Proxy servers configured to do cross-AZ load balancing would work around it by spreading all the traffic coming into one AZ across all the servers in both AZs, but this does come with increased costs (running EC2 instances, inter-AZ traffic). It may also have performance issues depending on the amount of traffic pouring into your instances.
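
As a rough idea of what the HA-Proxy flavour of this looks like, here’s a minimal backend sketch – the addresses are hypothetical (10.0.1.x in one AZ, 10.0.2.x in the other) and it’s not a drop-in config:

# haproxy.cfg – spread requests arriving at either proxy across webservers in both AZs
backend web_cross_az
    balance roundrobin
    option httpchk GET /
    server web-a1 10.0.1.10:80 check
    server web-a2 10.0.1.11:80 check
    server web-b1 10.0.2.10:80 check
    server web-b2 10.0.2.11:80 check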

Additionally, if you have a global audience, rather than a mostly single-country audience like us, you may not see the issue, since the different Akamai regions around the world will balance load somewhat equally across the two AZs.

To properly fix this behaviour with Akamai, you need to open a professional services request and have the SureRoute configuration adjusted so that Akamai forces the edge nodes to look up the origin IPs at the edge:

<!-- SR fix to handle multiple origin IP's -->
<forward:cache-parent.sureroute2.force-origin-ip-from-edge>on
</forward:cache-parent.sureroute2.force-origin-ip-from-edge>
<forward:cache-parent.sureroute2.round-robin.status>on
</forward:cache-parent.sureroute2.round-robin.status>

<!-- no host in sureroute stat-key -->
<forward:cache-parent.sureroute2.stat-key.host>off
</forward:cache-parent.sureroute2.stat-key.host>

With this fixed configuration, Akamai will correctly spread load evenly across our two AZs and our load graphs settled comfortably back into normality. I’m not entirely sure why this configuration isn’t default SureRoute behaviour, but like many things with Akamai, there are often mysterious adjustments that only professional services know about or can make.

Finally it’s worth noting that this issue isn’t unique to Amazon – you could get the same issue if you run active-active conventional data centers and use Akamai for offload. It may also be an issue with other CDNs by default, so double-check the behaviour of your particular vendor – it would be interesting to see if CloudFront (Amazon’s CDN) exhibits similar issues or not.

Credit to my colleague Andrew for spotting this issue originally and having to deal with two different vendors’ support cases at once to get to the bottom of the root cause.

SSL Intermediate CA Bundles with Amazon

When configuring SSL services, generally you need to set a certificate, a private key and the CA bundle containing the intermediate certificate(s), which is often a bundle of several different certificates.

For example, https://www.jethrocarr.com’s configuration looks like:

SSLEngine on
SSLCertificateFile jethrocarr.com.crt
SSLCertificateKeyFile jethrocarr.com.key
SSLCertificateChainFile startssl.intermediate.ca.crt

When your browser connects, it doesn’t trust jethrocarr.com.crt directly, but it checks it against the intermediate certificates in startssl.intermediate.ca.crt that signed it – and those intermediates are in turn signed by a root CA that your browser does trust.

This means that the CAs can keep their root certificates – the ones trusted by the browser – much more securely protected, and instead sign customer certificates with intermediates that can be revoked and easily(ish) replaced should the need arise.

Generally this works fine from the end user perspective, although there are sometimes issues when a sysadmin forgets to add the intermediate CA bundle and doesn’t immediately notice, as some browsers work fine whilst others fail, depending on whether or not they already trust the intermediates.

Today I ran into a different issue, where Amazon Web Services is fussy about the order of the certificates in that bundle when adding a certificate to an Elastic Load Balancer for SSL termination.

Any attempt to upload my certificate was met with “Invalid Public Key Certificate”, which didn’t make a lot of sense as I was certain that my certificates were OK. It was easy to verify and prove this, using OpenSSL:

$ openssl rsa -noout -modulus -in example.com.key | openssl md5
(stdin)= 30e1b6cb4168117b7923392ca536c701

$ openssl x509 -noout -modulus -in example.com.crt | openssl md5
(stdin)= 30e1b6cb4168117b7923392ca536c701

$ openssl verify -verbose -CAfile cabundle.crt example.com.crt 
example.com.crt: OK

This proved that my certificates were all correct, so the fault was Amazon-side. A post on their forums helped me “fix” the issue by adjusting the order of the certificates in my CA bundle, which subsequently resolved the error.
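
If you want to see what order a bundle is actually in, OpenSSL can print the subject and issuer of each certificate in file order, which makes it easy to spot a chain that’s back to front:

$ openssl crl2pkcs7 -nocrl -certfile cabundle.crt | openssl pkcs7 -print_certs -noout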

So is this a bug with Amazon? It’s tricky to say – there are several posts online which state that the order is important for some systems, but not for all. Clearly anything based around OpenSSL doesn’t care, as it was able to verify my out-of-order CA bundle happily enough.

As one does with issues like this, I dug into RFC 3280, which details how certificate path validation should occur. Section 6.1 (Basic Path Validation) states that the procedure for obtaining the sequence of certificates is outside the scope of the specification, but then goes on to define how the validation could occur, with the order of the certificates being implied, but never stated outright.

The primary goal of path validation is to verify the binding between
a subject distinguished name or a subject alternative name and
subject public key, as represented in the end entity certificate,
based on the public key of the trust anchor.  This requires obtaining
a sequence of certificates that support that binding.  The procedure
performed to obtain this sequence of certificates is outside the
scope of this specification.

To meet this goal, the path validation process verifies, among other
things, that a prospective certification path (a sequence of n
certificates) satisfies the following conditions:

   (a)  for all x in {1, ..., n-1}, the subject of certificate x is
   the issuer of certificate x+1;

   (b)  certificate 1 is issued by the trust anchor;

   (c)  certificate n is the certificate to be validated; and

   (d)  for all x in {1, ..., n}, the certificate was valid at the
   time in question.

Following the above, the specification goes into detail on the different ways the path can be validated, which also implies that the certificates should be read in and then sorted by the software, but again it doesn’t actually state this outright.

Sadly, the way this specification is written isn’t clear, which means the only 100% certain way to keep everything happy is to put the certificates in the CA bundle file in the correct order – something I would expect the SSL provider to do when they provide you with the files.
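
For what it’s worth, “correct order” here means each certificate is immediately followed by the one that issued it – start with the intermediate that signed your certificate and work upwards towards the root (which itself usually gets left out of the bundle). A quick sketch with hypothetical file names:

# Hypothetical file names – direct issuer first, then its issuer, and so on
cat intermediate-server-ca.crt intermediate-primary-ca.crt > cabundle.crt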