
Automatically restarting GNU/Linux hosts upon hung storage

We recently experienced a weird and frustrating problem with storage reliability on our RabbitMQ cluster, running on AWS c5-series EC2 instances with EBS storage and Ubuntu 16.04 LTS.

  1. One of the instances in our three-node RabbitMQ cluster experiences an issue that leaves it unable to persist anything to disk, on any volume mounted on the instance.
  2. When this happens, the instance is *supposed* to remove itself from the cluster as an unhealthy member and have the remaining two instances take over all responsibilities with zero downtime to operational systems.
  3. Sadly, for some unknown reason, the way this issue impacts RabbitMQ does not result in the instance being evicted from the cluster. Instead, it remains in the cluster exchanging healthy status messages with other members, but (and this is the critical bit) it manages to then jam up queues across the entire cluster, bringing down the two healthy instances along with the unhealthy one.
  4. Operations (me) gets paged to solve a critical outage on the platform that’s going to impact customers.

The problem is super weird in that it occurs somewhat randomly – no obvious correlation to load or time of day – but it does tend to happen after the instance has been running for at least a few weeks. It also occurs on any of the three RabbitMQ instances, so it’s not something specifically weird about any one instance in the fleet.

The one thing we do know is that the issue is storage related. Firstly, nothing is persisted in the logs (RabbitMQ or system/kernel) from the time the issue occurs, and secondly, we can see a large spike in disk I/O wait time in our Datadog monitoring for the instance, showing that it is stuck with processes waiting for the disk to respond.

Why RabbitMQ is impacted in this manner is unclear. It makes sense that the cluster quorum and status negotiation wouldn’t require working disk to keep running, but in every test where we deliberately broke the storage, the RabbitMQ process correctly detected that something was wrong on the host and went into an unhealthy state, removing the instance from the cluster. We tried ripping out EBS volumes whilst still mounted, corrupting them with dd, force unmounting, etc… but nothing triggered exactly the same behaviour.
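
For the record, the kind of fault injection we tried looks something like the following – treat these as rough sketches, with placeholder volume IDs, devices and paths rather than the exact commands from our test runs:

# Force-detach an EBS volume from a running instance (placeholder volume ID)
aws ec2 detach-volume --volume-id vol-0123456789abcdef0 --force

# Corrupt the block device underneath a mounted filesystem (placeholder device)
dd if=/dev/urandom of=/dev/xvdf bs=1M count=100

# Lazy-unmount a filesystem out from underneath running processes
umount -l /var/lib/rabbitmq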

Reviewing what differs about production was difficult since the impacted instances didn’t persist any of the kernel or RabbitMQ logs; however, we did manage to extract some information from the AWS instance console for one of them before we restarted it:

Ubuntu 16.04.4 LTS localhost ttyS0
[349442.682614] INFO: myprocess:1956 blocked for more than 120 seconds.
[349442.684363]       Not tainted 4.4.0-1062-aws #71-Ubuntu
[349442.686890] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.

Essentially the Linux kernel is logging a number of different processes (basically everything on the box that does anything) as blocked for more than 120 seconds, because the storage has failed and the kernel is unable to do anything to unblock them.

Given we have been unable to identify the exact fault or reproduce the behaviour (could be something in the Linux kernel, could be something in AWS c5 or EBS…), we needed a solution that would at least help us by terminating any instance that experienced this storage issue.

The solution is helpfully identified by the kernel log lines above. We can use the hung task panic feature in the Linux kernel to force a host to immediately reboot itself if processes are hung for too long.

We do this using two different sysctl configuration changes (note – you need to set these up in /etc/sysctl.conf to survive reboots):

# Panic if a hung task was found
sysctl kernel.hung_task_panic=1

# Reboot 5 seconds after panic
sysctl kernel.panic=5

The first instructs the kernel to panic if a hung task (any task blocked for more than 120 seconds, the default threshold) occurs. The second instructs it to reboot shortly after the panic. We set it to 5 seconds to give any logging about the hung task time to persist or be delivered before the reboot, although in this particular situation, with all storage being busted, it’s of very limited benefit.
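
For example, making these settings persistent in /etc/sysctl.conf, or as a drop-in file like the one sketched below (the filename is just an example), looks like:

# /etc/sysctl.d/60-hung-task-panic.conf
# Panic when any task is blocked for longer than the hung task timeout
kernel.hung_task_panic = 1

# Reboot 5 seconds after a kernel panic
kernel.panic = 5

# The blocked-task threshold itself is also tunable (120 seconds is the default)
kernel.hung_task_timeout_secs = 120

These can be applied without a reboot using sysctl -p /etc/sysctl.d/60-hung-task-panic.conf.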

This has been in place for several months now and is working beautifully. Every so often an instance experiences this fault and instead of causing any disruption, it is quickly self-terminated and replaced. Because it terminates completely, the RabbitMQ cluster negotiation is successfully able to re-assign responsibilities to the other instances in the cluster.

In theory there is a 2-minute period where the unhealthy instance is still running; however, reviewing the production metrics, it appears that when the fault occurs, RabbitMQ doesn’t immediately break – sometimes it continues to run for 15 minutes or more before jamming up the cluster. So having it run for 2 minutes has turned out to be just fine.

Ideally, going forwards, we need to set up a network logging endpoint for these hosts to see if we can capture anything like a stack trace from a kernel driver. It seems likely that it’s a Linux kernel bug rather than an AWS EBS bug, since the issue is resolved with a soft reboot rather than a forced stop-start of the instance via the AWS API, meaning it’s still running on the same hypervisor host, etc. But until then, this kernel configuration parameter means we are not going to disrupt operational services when the fault does occur.
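
A rough sketch of what that network logging could look like, using the kernel’s netconsole module (the ports, addresses, interface and MAC below are all placeholders):

# Stream kernel messages over UDP to a remote log collector,
# bypassing the broken local storage entirely
modprobe netconsole netconsole=6665@10.0.0.10/eth0,6666@10.0.0.20/00:11:22:33:44:55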

Scaling backend infrastructure to handle millions of phones (Mobile Refresh 2018)

I recently had the pleasure to speak at the 2018 Mobile Refresh conference held here in Wellington and did a talk introducing how we run some parts of the Sailthru Mobile platform, along with recommendations and advice to anyone else also building backends to support their mobile applications.

It’s more entry-level than some of my other infrastructure talks as it’s focused on people that are primarily mobile developers with maybe a limited set of infrastructure awareness.

Deep Dive into ECS

I spent a fair bit of time in 2017 re-architecting the carnival.io platform onto Amazon ECS, including working to handle some tricky autoscaling challenges brought on by the nature of the sudden high-load spikes experienced when we deliver push messages to customers.

I’ve now summed up these learnings into a deep dive talk on the Amazon ECS architecture that I presented at the Wellington AWS Users Group on February 12th 2018.

This talk explains what container orchestration is, covers some key fundamentals about ECS and how we’ve tackled CI/CD with ECS, and goes into detail on some of the unique autoscaling challenges caused by millions of cellphones sending home telemetry all at once.

This talk is technical, but includes content appropriate for both beginners wanting to know how ECS functions and experts wanting to see just what can be accomplished with the platform.

 

Puppet Autosigning & Cloud Recommendations

I was over in Sydney this week attending linux.conf.au 2018 and made a short presentation at the Sysadmin miniconf regarding deploying Puppet in cloud environments.

The majority of this talk covers the Puppet autosigning process which is a big potential security headache if misconfigured. If you’re deploying Puppet (or even some other config management system) into the cloud, I recommend checking this one out (~15mins) and making sure your own setup doesn’t have any issues.

 

Detectatron

I recently installed security cameras around my house, and they’re doing an awesome job of recording all the events that take place around the house and grounds (generally of the feline variety).

Unfortunately the motion capture tends to be overly trigger happy and I end up with heaps of recordings of trees waving, clouds moving or insects flying past. It’s not a problem from a security perspective as I’m not missing any events, but it makes it harder to check the feed for noteworthy events during the day.

I decided I’d like to write some logic for processing the videos being generated, so I built a proof of concept that sucks video out of the Ubiquiti Unifi Video server and then analyses it with Amazon Web Services’ new AI product “Rekognition” to identify interesting videos worthy of note.

What this means, is that I can now filter out all the noise from my motion recordings by doing image recognition and flagging the specific videos that feature events I consider interesting, such as footage featuring people or cats doing crazy things.

I’ve got a 20 minute talk about this system which you can watch below, introducing its capabilities and how I’m using the AWS Rekognition service to solve this problem. The talk was for the Wellington AWS Users Group, so it focuses a bit more on the AWS aspects of Rekognition and AWS architecture rather than the Unifi Video integration side of things.

The software I wrote has two parts – “Detectatron”, the backend Java service that processes each video and stores it in S3 afterwards, and the connector I wrote for integration with the Unifi Video service. These can be found at:

https://github.com/jethrocarr/detectatron
https://github.com/jethrocarr/detectatron-connector-unifi

The code quality is rather poor right now – insufficient unit tests, bad structure and in need of a good refactor, but I wanted to get it up sooner rather than later… since perfection is always the enemy of just shipping something.

Note that whilst I’ve only added support for the product I use (Ubiquiti’s Unifi Video), I’ve designed it so that it’s pretty trivial to build other connectors for other platforms. I’d love to see contributions like connectors for Zone Minder and other popular open source or commercial platforms.

If you’re using Unifi Video, my connector will automatically mark any videos it deems as interesting as locked videos, for easy filtering using the native Unifi Video apps and web interface.

It also includes an S3 upload feature – given that I had integrated with the Unifi Video software, it was a trivial step to extend it to also upload every video the system records into S3 within a few seconds, for off-site retention. This performs really well; my on-prem NVR really struggled to keep up with uploads when using inotify + awscli to upload footage, but using my connector and Detectatron it has no issues keeping up with even high video rates.
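
The naive inotify + awscli approach was essentially a loop like the following sketch (the paths and bucket name are placeholders) – the serial uploads are a big part of why it couldn’t keep up:

#!/bin/bash
# Watch the NVR recordings directory and upload each completed file to S3.
WATCH_DIR="/srv/unifi-video/videos"   # placeholder path
BUCKET="s3://my-cctv-archive"         # placeholder bucket

inotifywait -m -r -e close_write --format '%w%f' "$WATCH_DIR" |
while read -r file; do
  aws s3 cp "$file" "$BUCKET/"
done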

Fairfax’s Cloud Journey at Auckland AWS Summit 2016

I recently presented at the 2016 AWS Summit Auckland about Fairfax’s cloud journey as part of the business stream “Key Steps for Setting up your AWS Journey for Success” alongside two excellent Amazon engineers. It’s a bit different from my usual talks, in that this one was specifically focused on a business audience, rather than a technical one.

My segment was just part of a talk full of excellent content from Amazon themselves, so you can check out the full presentation here and all the other recorded presentations at the AWS Summit Auckland on-demand site.

AWS Cost Control at Fairfax

Earlier this month I was invited to speak at the AWS Wellington User Group around how we’ve been handling cost control at Fairfax including our use of spot pricing. I’ve now processed the video and got a recording up online for anyone interested in watching.

The video isn’t great since we took it in dim light using a cellphone and a webcam in a red-lit bar, but the audio came through pretty well.

 

How much swap should I use on my VM?

Lately a couple people have asked me about how much swap space is “right” for their servers – especially in the context of running low spec machines like AWS t2.nano/t2.micro or Digital Ocean boxes with low allocations like 1GB or 512MB RAM.

The old-fashioned advice was always “your swap space should be double your RAM”, but this doesn’t actually make a lot of sense any more. Really, swap should be considered a tool of last resort – a hack, even – to squeeze a bit more performance out of systems, and it should be used sparingly where it makes sense.

I tend to look after two different types of systems:

  1. Small systems running a specific dedicated service (eg microservices). These systems might do nothing more than run Nginx/Apache or something like PHP-FPM or Unicorn with a few workers. They typically have 512MB-1GB of RAM.
  2. Big heavy servers running heavy weight applications, typically Java. These systems will be configured with large memory allocations (eg 16GB) and be configured to allocate a specific amount of memory to the application (eg 10GB Java Heap) and to keep the rest free for disk cache and background apps.

The latter doesn’t need swap. There’s no time I would ever want my massive apps getting pushed into swap for a couple reasons:

  • Performance of these systems is critical. We’ve paid good money to allocate them specific amounts of memory which is essentially guaranteed – we know how much the heap needs, how much disk cache we need and how much to allocate to the background apps.
  • If something does go wrong and starts consuming too much RAM, rather than having performance degrade as the server tries to swap, I want it to die – and die fast. If Puppet has decided it wants 7 GB of RAM, I want the OOM killer to step in and slaughter it. If I have swap, I risk everything on the server being slowed down as it moves tasks into the horribly slow (even on SSD) swap space.
  • If you’re paying for 16GB of RAM, why do you want to try and get an extra 512MB out of some swap space? It’s false economy.

For this reason, our big boxes are all swapless. But what about the former example, the small microservice type boxes, or your small personal VPS type systems?

Like many things in IT, “it depends”.

If you’re running stateless clusters, provided that the peak usage fits within the memory allocation, you don’t need swap. In this scenario, your workload is sized appropriately and if anything goes wrong due to an unexpected issue, the machine will either kill the errant process or die and get removed from the pool entirely.

I run a lot of web app workers this way – for example a 1GB t2.micro can happily run 4 Ruby Unicorn workers averaging around 128MB each, plus have space for Puppet, monitoring and delayed jobs. If something goes astray, the process gets killed and the usual automated recovery processes handle things.

However you may need some swap if you’re running stateful systems (pets) where it’s better for them to go slow than to die entirely, or if you’re running a system where the peak usage won’t fit within the memory allocation due to tight budget constraints.

For an example of tight budget constraints – I run this blog on a small machine with only 512MB RAM. With an allocation this small, there’s just not enough memory to run applications like Apache and also be able to handle the needs of background daemons and Puppet runs which can use several hundred MB just by themselves.

The approach I took was to create a small swap volume and size the worker counts in Apache so that the max workers at average size would just fit within the real memory allocation. However, any background or system tasks would have to fight over the swap space.

[Screenshot: memory and disk graphs for the 512MB server, showing significant swap use but near-zero disk I/O]

What you can see from the above is that I’m consuming quite a bit of swap – but my disk I/O is basically nothing. That’s because most of what’s in swap on this machine isn’t needed regularly and the active workload, i.e. the apps actually using/freeing RAM constantly, fit within the available amount of real memory.

In this case, using swap allows me to get better value for money than the next size up of machine – I’m paying just enough to run Apache and squeezing the management tools and background jobs onto the otherwise underutilised SSD storage. This means I can spend $5 to run this blog, vs $10. Excellent!

In respect to sizing, I’m running with a 1GB swap on a 512MB RAM server, which is compliant with the traditional “twice your RAM” approach to sizing. That being said, I wouldn’t extend past this; even if the system had more RAM (eg 2GB), you should only ever use swap as a hack to squeeze a bit more out of a system. Basically, don’t assume swap will scale linearly as memory scales.

Given I’m running on various cloud/VPS environments, I don’t have a traditional swap partition – instead I create an image file on the root filesystem and format it as swap space – I use a third party Puppet module (https://forge.puppetlabs.com/petems/swap_file) to do this:

swap_file::files { 'default':
  ensure       => present,
  swapfile     => '/tmp/swapfile',
  swapfilesize => '1000 MB',
}

The performance impact of using a swap file on top of a filesystem is almost nothing, and this dramatically simplifies management and allocation of swap space. Just make sure you’re not using tmpfs for that /tmp path, or you’ll find the memory benefit isn’t as good as it seems.
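
For anyone not using Puppet, what the module automates boils down to roughly the following:

# Create a 1GB swap file on the root filesystem and enable it
dd if=/dev/zero of=/tmp/swapfile bs=1M count=1000
chmod 600 /tmp/swapfile
mkswap /tmp/swapfile
swapon /tmp/swapfile

# Persist across reboots
echo '/tmp/swapfile none swap sw 0 0' >> /etc/fstab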

Your cloud pricing isn’t webscale

Thankfully in 2015 most (but not all) proprietary software providers have moved away from the archaic ideology of software being licensed by the CPU core – a concept that reflected the value and importance of systems back when you were buying physical hardware, but one rendered completely meaningless by cloud and virtualisation providers.

Taking its place came the subscription model, popularised by Software-as-a-Service (or “cloud”) products. The benefits are attractive – regular income via customer renewal payments, flexibility for customers wanting to change the level of product or number of systems covered, and no CAPEX headaches in acquiring new products to use.

Clients win, vendors win, everyone is happy!

Or maybe not.

 

Whilst the horrible price-by-CPU model has died off, a new model has emerged – price by server. This model assumes that the more servers a customer has, the bigger they are and the more we should charge them.

The model makes some sense in a traditional virtualised environment (think VMWare) where boxes are sliced up and a client runs only as many as they need. You might only have a total of two servers for your enterprise application – primary and DR – each spec’ed appropriately to handle the max volume of user requests.

But the model fails horribly when clients start proper cloud adoption. Suddenly that one big server gets sliced up into 10 small servers which come and go by the hour as they’re needed to supply demand.

DevOps techniques such as configuration management suddenly turn the effort of running dozens of servers into the same as running a single machine; there’s no longer any reason to want to constrain yourself to one box.

It gets worse if the client decides to adopt microservices, where each application gets split off onto its own server (or container, aka Docker/CoreOS). And it’s going to get very weird when we start using compute-less computing more with services like Lambda and Hoist, because who knows how many server licenses you need to run an application that doesn’t even run on a server that you control?

 

Really the per-server model for pricing is as bad as the per-core model, because it no longer bears any reflection of the size of an organisation, the amount they’re using a product and, most importantly, the value they’re obtaining from the product.

So what’s the alternative? SaaS products tend to charge per-user, but the model doesn’t always work well for infrastructure tools. You could be running monitoring for a large company with 1,000 servers but only have 3 user accounts for a small sysadmin team, which doesn’t really work for the vendor.

Some products can charge based on volume or API calls, but even this is risky. A heavy micro-service architecture would result in a large number of HTTP calls between applications, so you can hardly say an app with 10,000 req/min is getting 4x the value compared to a client with a 2,500 req/min application – it could all be internal API calls.

 

To give an example of how painful the current world of subscription licensing is with modern computing, let’s conduct a thought exercise and have a look at the current pricing model of some popular platforms.

Let’s go with creating a startup. I’m going to run a small SaaS app in my spare time, so I need a bit of compute, but also need business-level tools for monitoring and debugging so I can ensure quality as my startup grows and get notified if something breaks.

First up I need compute. Modern cloud compute providers *understand* subscription pricing. Their models are brilliantly engineered to offer a price point for everyone. Whether you want a random jump box for $2/month or a $2000/month massive high compute monster to crunch your big-data-peak-hipster-NoSQL dataset, they can deliver the product at the price point you want.

Let’s grab a basic Digital Ocean box. Well actually, let’s grab 2, since we’re trying to make a redundant SaaS product. But we’re a cheap-as-chips startup, so let’s grab 2x $5/mo boxes.

[Screenshot: Digital Ocean droplet pricing tiers]

Ok, so far we’ve spent $10/month for our two servers. And whilst Digital Ocean is pretty awesome, our code is going to be pretty crap, since we used a bunch of high/drunk (maybe both?) interns to write our PHP code. So we should get a real-time application monitoring product, like Newrelic APM.

[Screenshot: Newrelic APM pricing]

Woot! Newrelic have a free tier – great news for our SaaS application. But actually it’s not really that useful: it can’t do much tracing and only keeps 24 hours of history. Certainly not enough to debug anything more serious than my WordPress blog.

I’ll need the pro account to get anything useful, so let’s add a whopping $149/mo – but actually make that $298/mo since we have two servers. Great value really. :-/

 

Next we probably need some kind of paging for oncall when our app blows up horribly at 4am, like it will undoubtedly do. PagerDuty is one of the popular market leaders currently, with a good reputation, so let’s roll with them.

[Screenshot: PagerDuty pricing plans]

Hmm, I guess that $9/mo isn’t too bad, although it’s essentially what I’m paying ($10/mo) for the compute itself. Except that it’s kinda useless, since it covers the USA and their friendly neighbour only and excludes us down under. So let’s go with the $29/mo plan to get something that actually works. $29/mo is a bit much for a $10/mo compute box really, but hey, it looks great next to Newrelic’s pricing…

 

Remembering that my SaaS app is going to be buggier than Windows Vista, I should probably get some error handling setup. That $298/mo Newrelic APM doesn’t include any kind of good error handler, so we should also go get another market leader, Raygun, for our error reporting and tracking.

[Screenshot: Raygun pricing plans]

For a small company this isn’t bad value really, given you get 5 different apps and any number of muppets working with you can get on board. But it’s still looking ridiculous compared to my $10/mo compute cost.

So what’s the total damage?

Compute: $10/month
Monitoring: $371/month

Ouch! Now maybe as a startup, I’ll churn up that extra money as an investment into getting a good quality product, but it’s a far cry from the day when someone could launch a new product on a shoestring budget in their spare time from their uni laptop.

 

Let’s look at the same thing from the perspective of a large enterprise. I’ve got a mission critical business application and it requires a 20 core machine with 64GB of RAM. And of course I need two of them for when Java inevitably runs out of heap because the business let muppets feed garbage from their IDE directly into the JVM and expected some kind of software to actually appear as a result.

That compute is going to cost me $640/mo per machine – so $1280/mo total. And all the other bits, Newrelic, Raygun, PagerDuty? Still that same $371/mo!

Compute: $1280/month
Monitoring: $371/month

It’s not hard to imagine that the large enterprise is getting much more value out of those services than the small startup and can clearly afford to pay for that in relation to the compute they’re consuming. But the pricing model doesn’t make that distinction.

 

So given that we now know that per-core pricing is terrible, per-server pricing is terrible and (at least for infrastructure tools) per-user pricing is terrible, what’s the solution?

“Cloud Spend Licensing” [1]

[1] A term I’ve just made up, but sounds like something Gartner spits out.

With Cloud Spend Licensing, the amount charged reflects the amount you spend on compute – this is a much more accurate indicator of the size of an organisation and value being derived from a product than cores or servers or users.

But how does a vendor know what this spend is? This problem is solving itself, thanks to compute consumers starting to cluster around a few major public cloud players, the top three being Amazon (AWS), Microsoft (Azure) and Google (Compute Engine).

It would not be technically complicated to implement support for these major providers (and maybe a smattering of smaller ones like Heroku, Digital Ocean and Linode) to use their APIs to suck down service consumption/usage data and figure out a client’s compute spend in the past month.
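
As a rough sketch, AWS these days exposes exactly this kind of data via its Cost Explorer API, so a client’s monthly spend is a single query (the dates are placeholders, and the customer would need to grant suitably scoped read-only credentials):

# Total AWS spend for a month via the Cost Explorer API
aws ce get-cost-and-usage \
  --time-period Start=2015-10-01,End=2015-11-01 \
  --granularity MONTHLY \
  --metrics UnblendedCost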

For customers who can’t (still on VMWare?) or don’t want to provide this data, there can always be the fallback to a more traditional pricing model, whether it be cores, servers or some other negotiation (“enterprise deal”).

 

 

How would this look?

In our above example, for our enterprise compute bill ($1280/mo) the equivalent amount spent on the monitoring products was 23% for Newrelic, 3% for Raygun and 2.2% for PagerDuty (total of 28.2%). Let’s make the assumption this pricing is reasonable for the value of the products gained for the sake of demonstration (glares at Newrelic).

When applied to our $10/month SaaS startup, the bill for these products would be an additional $2.82/month. This may seem so cheap that there will be an incentive to set a minimum price, but it’s vital to avoid doing so:

  • $2.82/mo means anyone starting up a new service uses your product. Because why not, it’s pocket change. That uni student working on the next big thing will use you. The receptionist writing her next mobile app success in her spare time will use you. An engineer in a massive enterprise will use you to quickly POC a future product on their personal credit card.
  • $2.82/mo might only just cover the cost of the service, but you’re not making any profit if they couldn’t afford to use it in the first place. The next best thing to profit is market share – provided that market share has a conversion path to profit in future (something some startups seem to forget, eh Twitter?).
  • $2.82/mo means IT pros use your product on their home servers for fun and then take their learning back to the enterprise. Every one of the providers above should have a ~ $10/year offering for IT pros to use and get hooked on their product, but they don’t. Newrelic is the closest with their free tier. No prizes if you guess which product I use on my personal servers. Definitely no prizes if you guess which product I can sell the benefits of the most to management at work.

 

But what about real earnings?

As our startup grows and gets bigger, it doesn’t matter if we add more servers or upsize the existing ones to bigger specs – the amount we pay for the related support applications is always proportionate.

It also caters for the emerging trend of running systems for limited hours or using spot prices – clients and vendors don’t have to worry about figuring out how it fits into the pricing model; instead, the scale of your compute consumption sets the price of the supporting services.

Suddenly that $2.82/mo becomes $56.40/mo when the startup starts getting successful and starts running a few computers with actual specs. One day it becomes $371 when they’re running $1280/mo of compute tier like the big enterprise. And it goes up from there.

 

I’m not a business analyst and “Cloud Spend Licensing” may not be the best solution, but goddamn there has to be a more sensible approach than believing someone will spend $371/mo for their $10/mo compute setup. And I’d like to get to that future sooner rather than later please, because there’s a lot of cool stuff out there that I’d like to experiment with more in my own time – and that’s good for both myself and vendors.

 

Other thoughts:

  • “I don’t want vendors to see all my compute spend details” – This would be easily solved by cloud providers exposing the right kind of APIs for this purpose, eg “grant vendor XYZ ability to see sum compute cost per month, but no details on what it is“.
  • “I’ll split my compute into lots of accounts and only pay for services where I need it to keep my costs low” – Nothing different to the current situation where users selectively install agents on specific systems.
  • “This one client with an ultra efficient, high profit, low compute app will take advantage of us.” – Nothing different to the per-server/per-core model then, other than the min spend. Your client probably deserves the cheaper price as a reward for not writing the usual terrible inefficient code people churn out.
  • “This doesn’t work for my app” – This model is very specific to applications that support infrastructure, I don’t expect to see it suddenly being used for end user products/services.

Puppet modules

I’m in the middle of doing a migration of my personal server infrastructure from a 2006-era colocation server onto modern cloud hosting providers.

As part of this migration, I’m rebuilding everything properly using Puppet (use it heavily at work so it’s a good fit here) with the intention of being able to complete server builds without requiring any manual effort.

Along the way I’m finding gaps where the available modules don’t quite cut it or nobody seems to have done it before, so I’ve been writing a few modules and putting them up on GitHub for others to benefit/suffer from.

 

puppet-hostname

https://github.com/jethrocarr/puppet-hostname

Trying to do anything consistently with host naming is always fun, since every organisation or individual has their own special naming scheme and approach to dealing with the issue of naming things.

I decided to take a different approach. Essentially every cloud provider will give you a source of information that could be used to name your instance, whether it’s the AWS Instance ID or a VPS provider passing through the name you gave the machine at creation. Given I want to treat my instances like cattle, an automatic soulless generated name is perfect!

Where they fall down is that they don’t tend to set up the FQDN properly. I’ve seen a number of solutions to this, including user data setup scripts, but I’m trying to avoid putting anything in user data that isn’t 100% critical and stick to my Pupistry bootstrap, so I wanted to set my FQDN via Puppet itself.

(It’s even possible to set the hostname itself if desired, you can use logic such as tags or other values passed in as facts to define what role a machine has and then generate/set a hostname entirely within Puppet).

Hence puppet-hostname provides a handy way to easily set FQDN (optionally including the hostname itself) and then trigger reloads on name-dependent services such as syslog.

None of this is revolutionary, but it’s nice getting it into a proper structure instead of relying on yet-another-bunch-of-userdata that’s specific to my systems. The next step is to look into having it execute functions to do DNS changes on providers like Route53 so there’s no longer any need for user data scripts being run to set DNS records at startup.

 

puppet-rirs

https://github.com/jethrocarr/puppet-rirs

There are various parts of my website that I want to be publicly reachable, such as the WordPress login/admin sections, but at the same time I also don’t want them accessible by any muppet with a bot to try and break their way in.

I could put up a portal of some kind, but that then breaks things like apps that want to talk with those endpoints, since they can’t handle the authentication steps. What I can do is set up a GeoIP rule that restricts access to those sections to the countries I’m actually in – generally just NZ or AU – to dramatically reduce the amount of noise and attempts people send my way, especially given most of the attacks come from more questionable countries or service providers.

I started doing this with mod_geoip2, but it’s honestly a buggy POS and it really doesn’t work properly if you have both IPv4 and IPv6 connections (one or the other is OK). Plus it doesn’t help me with applications that support IP ACLs but don’t offer a specific GeoIP plugin.

So instead of using GeoIP, I’ve written a custom Puppet function that pulls down the IP assignment lists from the various Regional Internet Registries and generates IP/CIDR lists for both IPv4 and IPv6 on a per-country basis.

I then use those lists to populate configurations like Apache, but it’s also quite possible to use it for other purposes such as iptables firewalling since the generated lists can be turned into Puppet resources. To keep performance sane, I cache the processed output for 24 hours and merge any continuous assignment blocks.
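
To give an idea of the source data, the RIR delegation files are simple pipe-delimited text, so extracting (say) NZ’s IPv4 allocations by hand is trivial – this sketch skips the caching, block merging and conversion of address counts into CIDR prefixes that the function handles:

# Format: registry|cc|type|start|value|date|status
curl -s https://ftp.apnic.net/stats/apnic/delegated-apnic-latest |
  awk -F'|' '$2 == "NZ" && $3 == "ipv4" { print $4 " (" $5 " addresses)" }'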

Basically, it’s GeoIP for Puppet with support for anything Puppet can configure. :-)

 

puppet-digitalocean

https://github.com/jethrocarr/puppet-digitalocean

Provides a fact which exposes details from the Digital Ocean instance API about the instance – similar to how you get values automatically about Amazon EC2 systems.

 

puppet-initfact

https://github.com/jethrocarr/puppet-initfact

The great thing about the open source world is how we can never agree, so we end up with a proliferation of tools doing the same job. Even init systems are not immune, with anything that intends to run on the major Linux distributions needing to support systemd, Upstart and SysVinit, at least for the next few years.

Unfortunately the way that I see most Puppet module authors “deal” with this is that they simply write an init config/file that suits their distribution of choice and conveniently forget the other distributions. The number of times I’ve come across Puppet modules that claim support for Red Hat and Amazon Linux but only ship an Upstart file…. >:-(

Part of the issue is that it’s a pain to even figure out what distribution should be using what type of init configuration. So to solve this, I’ve written a custom Fact called “initsystem” which exposes the primary/best init system on the specific system it’s running on.

It operates in two modes – there is a curated list for specific known systems, with a fallback to automatic detection where we don’t have a specific curated result handy.
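
The automatic detection is conceptually along these lines (a simplified shell sketch, not the fact’s actual code):

# Simplified init system detection
if [ -d /run/systemd/system ]; then
  echo "systemd"
elif /sbin/init --version 2>/dev/null | grep -q upstart; then
  echo "upstart"
else
  echo "sysvinit"
fi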

It supports (or should) all major Linux distributions & derivatives plus FreeBSD and MacOS. Pull requests for others welcome, could do with more BSD support plus maybe even support for Windows if you’re feeling brave.

 

puppet-yas3fs

https://github.com/pcfens/puppet-yas3fs/commit/27af462f1ce2fe0610012a508236062e65017b5f

Not my module, but I recently submitted a PR to it (subsequently merged) which introduces support for a number of different distributions via use of my initfact module so it should now run on most distributions rather than just Ubuntu.

If you’re not familiar with yas3fs, it’s a FUSE driver that turns S3+SNS+SQS into a shared filesystem between multiple servers. It’s ideal for dealing with legacy applications that demand state on disk but don’t require high I/O performance. I’m in the process of doing a proof-of-concept with it, and it looks like it should work OK for low-activity sites such as WordPress, although with no locking I’d advise against putting MySQL on it anytime soon :-)

 

These modules can all be found on GitHub, as well as the Puppet Forge. Hopefully someone other than myself finds them useful. :-)