Tag Archives: devops

Puppet Autosigning & Cloud Recommendations

I was over in Sydney this week attending linux.conf.au 2018 and made a short presentation at the Sysadmin miniconf regarding deploying Puppet in cloud environments.

The majority of this talk covers the Puppet autosigning process, which is potentially a big security headache if misconfigured. If you’re deploying Puppet (or even some other config management system) into the cloud, I recommend checking this one out (~15 mins) and making sure your own setup doesn’t have any issues.

 

DevOpsDays NZ 2016

I recently spoke at the inaugural DevOpsDays NZ in Wellington. The team who put together the conference did an amazing job and it’s one of the few conferences that I’ve really enjoyed recently. If they put together a subsequent conference next year, I recommend attending if possible.

I presented about our DevOps practices and tooling at Fairfax Media / stuff.co.nz, which you can find in the recording below:

 

Whilst the vast majority of the content of the conference was really good, the following were clear standouts to me that I recommend watching:

You can find these (and other) presentations from the conference on this YouTube page.

Fairfax’s Cloud Journey at Auckland AWS Summit 2016

I recently presented at the 2016 AWS Summit Auckland about Fairfax’s cloud journey as part of the business stream “Key Steps for Setting up your AWS Journey for Success” alongside two excellent Amazon engineers. It’s a bit different from my usual talks, in that this one was specifically focused on a business audience, rather than a technical one.

My segment was just part of a talk full of excellent content from Amazon themselves, so you can check out the full presentation here and all the other recorded presentations at the AWS Summit Auckland on-demand site.

Puppet modules

I’m in the middle of doing a migration of my personal server infrastructure from a 2006-era colocation server onto modern cloud hosting providers.

As part of this migration, I’m rebuilding everything properly using Puppet (I use it heavily at work, so it’s a good fit here) with the intention of being able to complete server builds without requiring any manual effort.

Along the way I’m finding gaps where the available modules don’t quite cut it or nobody seems to have done it before, so I’ve been writing a few modules and putting them up on GitHub for others to benefit/suffer from.

 

puppet-hostname

https://github.com/jethrocarr/puppet-hostname

Trying to do anything consistently with host naming is always fun, since every organisation or individual has their own special naming scheme and approach to dealing with the issue of naming things.

I decided to take a different approach. Essentially every cloud provider will give you a source of information that could be used to name your instance, whether it’s the AWS Instance ID or a VPS provider passing through the name you gave the machine at creation. Given I want to treat my instances like cattle, an automatically generated, soulless name is perfect!

Where they fall down is that they don’t tend to set up the FQDN properly. I’ve seen a number of solutions to this, including user data setup scripts, but I’m trying to avoid putting anything in user data that isn’t 100% critical and to stick to my Pupistry bootstrap, so I wanted to set my FQDN via Puppet itself.

(It’s even possible to set the hostname itself if desired: you can use logic such as tags or other values passed in as facts to define what role a machine has and then generate/set a hostname entirely within Puppet.)

Hence puppet-hostname provides a handy way to easily set FQDN (optionally including the hostname itself) and then trigger reloads on name-dependent services such as syslog.

None of this is revolutionary, but it’s nice getting it into a proper structure instead of relying on yet-another-bunch-of-userdata that’s specific to my systems. The next step is to look into having it execute functions to do DNS changes on providers like Route53 so there’s no longer any need for user data scripts being run to set DNS records at startup.

 

puppet-rirs

https://github.com/jethrocarr/puppet-rirs

There are various parts of my website that I want to be publicly reachable, such as the WordPress login/admin sections, but at the same time I also don’t want them accessible by any muppet with a bot to try and break their way in.

I could put up a portal of some kind, but this then breaks stuff like apps that want to talk with those endpoints, since they can’t handle the authentication steps. What I can do is set up a GeoIP rule that restricts access to those sections to the countries I’m actually in, which is generally just NZ or AU. That dramatically reduces the amount of noise and attempts people send my way, especially given most of the attacks come from more questionable countries or service providers.

I started doing this with mod_geoip2, but it’s honestly a buggy POS and it really doesn’t work properly if you have both IPv4 and IPv6 connections (one or the other is OK). Plus it doesn’t help me for applications that support IP ACLs, but don’t offer a specific GeoIP plugin.

So instead of using GeoIP, I’ve written a custom Puppet function that pulls down the IP assignment lists from the various Regional Internet Registries and generates IP/CIDR lists for both IPv4 and IPv6 on a per-country basis.

I then use those lists to populate configurations like Apache, but it’s also quite possible to use them for other purposes such as iptables firewalling, since the generated lists can be turned into Puppet resources. To keep performance sane, I cache the processed output for 24 hours and merge any contiguous assignment blocks.
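
To make the merging step concrete, here is a minimal sketch in Python (not the module's actual Ruby code) of turning registry data into per-country CIDR lists. It assumes the pipe-delimited "delegated" statistics format the RIRs publish, where the fifth field is an address count for IPv4 and a prefix length for IPv6. The function name `country_networks` is made up for illustration, and the 24-hour cache is left out.

```python
import ipaddress
from collections import defaultdict


def country_networks(lines, country):
    """Return merged IPv4 and IPv6 CIDR strings assigned to one country code."""
    found = defaultdict(list)  # "ipv4" / "ipv6" -> list of network objects
    for line in lines:
        fields = line.strip().split("|")
        # Skip comments and the short summary lines.
        if line.startswith("#") or len(fields) < 7:
            continue
        _registry, cc, kind, start, value = fields[:5]
        if cc != country or kind not in ("ipv4", "ipv6"):
            continue
        if kind == "ipv4":
            # For IPv4 the value is an address count, not always a power of
            # two, so one record can expand into several CIDR blocks.
            first = ipaddress.IPv4Address(start)
            last = first + (int(value) - 1)
            found[kind].extend(ipaddress.summarize_address_range(first, last))
        else:
            # For IPv6 the value is the prefix length.
            found[kind].append(ipaddress.IPv6Network(f"{start}/{value}"))
    # collapse_addresses merges adjacent and overlapping networks, but it
    # cannot mix address families, so each family is collapsed on its own.
    return {
        kind: [str(net) for net in ipaddress.collapse_addresses(found[kind])]
        for kind in ("ipv4", "ipv6")
    }
```

Two adjacent /25 assignments for the same country come back as a single /24, which is what keeps the generated Apache or iptables rules short.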

Basically, it’s GeoIP for Puppet with support for anything Puppet can configure. :-)

 

puppet-digitalocean

https://github.com/jethrocarr/puppet-digitalocean

Provides a fact which exposes details from the Digital Ocean instance API about the instance – similar to how you get values automatically about Amazon EC2 systems.

 

puppet-initfact

https://github.com/jethrocarr/puppet-initfact

The great thing about the open source world is how we can never agree, so we end up with a proliferation of tools doing the same job. Even init systems are not immune, with anything that intends to run on the major Linux distributions needing to support systemd, Upstart and SysVinit at least for the next few years.

Unfortunately the way that I see most Puppet module authors “deal” with this is that they simply write an init config/file that suits their distribution of choice and conveniently forget the other distributions. The number of times I’ve come across Puppet modules that claim support for Red Hat and Amazon Linux but only ship an Upstart file…. >:-(

Part of the issue is that it’s a pain to even figure out what distribution should be using what type of init configuration. So to solve this, I’ve written a custom Fact called “initsystem” which exposes the primary/best init system on the specific system it’s running on.

It operates in two modes – there is a curated list for specific known systems and then fallback to automatic detection where we don’t have a specific curated result handy.
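
The two modes can be sketched roughly like this. This is a Python illustration rather than the fact's real Ruby code, and both the curated table and the probe paths are my assumptions, not what the module actually ships: the table just lists a few well-known distribution defaults, `/run/systemd/system` is the directory systemd creates when it is the running init, and checking for `/sbin/initctl` is only a heuristic for Upstart.

```python
import os

# Hypothetical curated table: (OS name, release) -> init system.
# These are well-known defaults, not the module's real list.
CURATED = {
    ("RedHat", "7"): "systemd",
    ("Debian", "7"): "sysvinit",
    ("Ubuntu", "14.04"): "upstart",
    ("Ubuntu", "16.04"): "systemd",
}


def detect_init_system(os_name, os_release, path_exists=os.path.exists):
    """Return the init system for a host, preferring curated answers."""
    # Mode 1: a curated answer for a specific known system.
    curated = CURATED.get((os_name, os_release))
    if curated is not None:
        return curated
    # Mode 2: automatic detection by probing the filesystem.
    if path_exists("/run/systemd/system"):
        return "systemd"
    if path_exists("/sbin/initctl"):  # heuristic only
        return "upstart"
    return "sysvinit"
```

The curated lookup comes first because probing can be misleading on systems that ship more than one init system's tooling; passing `path_exists` in as a parameter is just there to make the fallback easy to exercise without touching a real filesystem.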

It supports (or should) all major Linux distributions & derivatives plus FreeBSD and macOS. Pull requests for others are welcome; it could do with more BSD support plus maybe even support for Windows if you’re feeling brave.

 

puppet-yas3fs

https://github.com/pcfens/puppet-yas3fs/commit/27af462f1ce2fe0610012a508236062e65017b5f

Not my module, but I recently submitted a PR to it (subsequently merged) which introduces support for a number of different distributions via use of my initfact module so it should now run on most distributions rather than just Ubuntu.

If you’re not familiar with yas3fs, it’s a FUSE driver that turns S3+SNS+SQS into a shared filesystem between multiple servers. It’s ideal for dealing with legacy applications that demand state on disk, but don’t require high I/O performance. I’m in the process of doing a proof-of-concept with it and it looks like it should work OK for low activity sites such as WordPress, although with no locking I’d advise against putting MySQL on it anytime soon :-)

 

These modules can all be found on GitHub, as well as the Puppet Forge. Hopefully someone other than myself finds them useful. :-)

Russian roulette with ELBs and CDNs

In my day job, I look after a number of websites, all of which generally make heavy use of CDNs (Content Distribution Networks) to offload traffic to edge nodes near to an end user’s device. In our case we use Akamai, one of the largest and most experienced providers in the world.

A large number of our clusters and applications now run on Amazon’s public cloud service here in Sydney, making use of EC2 instances and ELBs. Due to the important nature of our systems, we have almost all applications in active-active multi-AZ (Availability Zone) configurations. The intention of this design is that the ELB (Elastic Load Balancer) serves all incoming traffic by dividing it across each availability zone in equal proportions. If either Amazon AZ fails, the other will continue to serve requests like nothing is wrong.

It’s a nicer solution than the traditional data center approach of having an active-passive multi-site design, as with both AZs being constantly active serving requests, we know that production and “DR” are always in a functional working state, ready to handle traffic; plus your investment into DR isn’t going to waste like traditional servers sitting idle.

Unfortunately Amazon ELBs offer only the barest of no-frills features, which makes them a bit stupid at times. In particular, Amazon’s multi-AZ ELBs actually consist of two separate ELBs, one in each AZ. Incoming traffic selects an ELB by means of a DNS round robin and then is directed to a server in that particular AZ.

Thus, each availability zone has its own ELB, which adds its own IP address to the DNS round robin, and looks something like this:

www.example.com is an alias for www-example-com-elb.jws.elb.amazonaws.com.
www-example-com-elb.jws.elb.amazonaws.com. has address 172.16.32.1
www-example-com-elb.jws.elb.amazonaws.com. has address 192.168.0.1

The problem is that DNS round robin has no guarantee of balancing the load evenly across the two data centers. If a particular company’s proxy server caches one address, it may direct traffic for the whole company to AZ-A and deliver no traffic to AZ-B.

In reality, due to the large number of users getting assigned different IP addresses with round robin, users tend to be spread somewhat evenly across the different AZs, making the problem a somewhat moot point when you have sizeable visitor numbers.

But if you add Akamai to the mix, you can end up with interesting results – it turns out that Akamai Edge nodes in AU use a central source of DNS information, which can lead to them favouring a particular ELB IP address. And since *all* your traffic goes via the CDN, this in turn results in all your traffic going directly to a single AZ and ignoring the other one entirely.
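
A toy simulation shows why the number of independent DNS caches matters so much here. This is only a sketch under simplifying assumptions of mine: every resolver caches exactly one of the two ELB addresses for the whole run, and every request arrives via a randomly chosen resolver. The names `simulate` and `ELB_ADDRESSES` are made up for illustration.

```python
import random

ELB_ADDRESSES = ["AZ-A", "AZ-B"]


def simulate(requests, resolvers, seed=1):
    """Count requests per AZ when `resolvers` independent caches each pin one address."""
    rng = random.Random(seed)
    # Each resolver caches a single address out of the round robin answer.
    cached = [rng.choice(ELB_ADDRESSES) for _ in range(resolvers)]
    hits = {az: 0 for az in ELB_ADDRESSES}
    for _ in range(requests):
        # Every request goes wherever its resolver's cached address points.
        hits[rng.choice(cached)] += 1
    return hits


# Thousands of independent resolvers: roughly an even split.
print(simulate(100_000, resolvers=5_000))
# One central source of DNS information: everything lands in one AZ.
print(simulate(100_000, resolvers=1))
```

With thousands of resolvers the split comes out close to even, while a single shared cache sends every request to whichever AZ it happened to pick, which is the CDN situation described above.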

In a real-world scenario of a 4 webserver cluster, we saw traffic jump between each AZ whenever Akamai’s edge servers updated DNS to a different IP address, as per the below graph:

Time to really test that your application is active-active!

Akamai decides to switch which ELB it’s using from A to B :-/

This swapping brings around some really nasty issues. In theory your active-active setup should be large enough to handle all your usual traffic load on just one AZ, but if that’s not the case, bad things will happen to your site performance and/or reliability.

The other nasty issue is when doing auto-scaling with Amazon: this swapping messes with your CloudWatch metrics for autoscale policies/triggers – one AZ is completely idle, one AZ is maxed out, average stats show a half busy cluster, no need to autoscale upwards to handle the load.
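
The arithmetic behind that is worth spelling out. Below is a small sketch with made-up numbers: the per-host CPU figures and the 70% scale-up threshold are hypothetical, and the functions only illustrate the averaging, not how CloudWatch itself evaluates alarms.

```python
from statistics import mean

# Hypothetical CPU utilisation (%) for a 4 webserver cluster, two per AZ.
cpu_by_az = {"AZ-A": [98.0, 97.0], "AZ-B": [2.0, 3.0]}


def cluster_average(cpu_by_az):
    """Average CPU across every host, regardless of AZ."""
    return mean(cpu for hosts in cpu_by_az.values() for cpu in hosts)


def busiest_az_average(cpu_by_az):
    """Average CPU of the most loaded AZ."""
    return max(mean(hosts) for hosts in cpu_by_az.values())


def should_scale_up(cpu_by_az, threshold):
    """A cluster-wide average trigger, the kind that gets fooled here."""
    return cluster_average(cpu_by_az) > threshold


print(cluster_average(cpu_by_az))      # half busy on paper
print(busiest_az_average(cpu_by_az))   # one AZ is actually saturated
print(should_scale_up(cpu_by_az, 70))  # the average never crosses the threshold
```

The cluster-wide average sits at 50% while the AZ taking all the traffic is close to 100%, so a trigger based on the average stays quiet exactly when you need it.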

And even if you’re clever and set your autoscaling to also trigger based on ELB latency/errors/throughput, you may still end up with issues, since the new host created during the autoscale may end up in the idle AZ, instead of the active AZ where you need it.

Using a smarter system for load balancing can negate the issue – for example, using a pair of Varnish servers or HAProxy servers configured to do cross-AZ load balancing would work around the issue by spreading all the traffic coming into one AZ across all the servers in both AZs, but this does have increased costs (running EC2 instances, inter-AZ traffic). It also may have performance issues depending on the amount of traffic pouring into your instance.

Additionally, if you have a global audience, rather than a mostly single-country audience like us, you may not see the issue, since the different Akamai regions around the world will balance load somewhat equally across the two AZs.

To properly fix this behaviour with Akamai, you need to open a professional services request and have the SureRoute configuration adjusted so that Akamai forces the edge nodes to look up the origin IPs at the edge:

<!-- SR fix to handle multiple origin IP's -->
<forward:cache-parent.sureroute2.force-origin-ip-from-edge>on
</forward:cache-parent.sureroute2.force-origin-ip-from-edge>
<forward:cache-parent.sureroute2.round-robin.status>on
</forward:cache-parent.sureroute2.round-robin.status>

<!-- no host in sureroute stat-key -->
<forward:cache-parent.sureroute2.stat-key.host>off
</forward:cache-parent.sureroute2.stat-key.host>

With this fixed configuration, Akamai now correctly spreads load evenly across our two AZs and our load graphs have settled comfortably back into normality. I’m not entirely sure why this configuration isn’t default SureRoute behaviour, but like many things with Akamai, there are often mysterious adjustments that only professional services know about or can make.

Finally it’s worth noting that this issue isn’t unique to Amazon – you could get the same issue if you run active-active conventional data centers and use Akamai for offload. It may also be an issue with other CDNs by default, so double-check the behaviour of your particular vendor – it would be interesting to see if CloudFront (Amazon’s CDN) exhibits similar issues or not.

Credit to my colleague Andrew for spotting this issue originally and having to deal with two different vendors’ support cases at once to get to the bottom of the root cause.