Deep Dive into ECS

I spent a fair bit of time in 2017 re-architecting the carnival.io platform onto Amazon ECS, including handling some tricky autoscaling challenges brought on by the sudden high-load spikes we experience when delivering push messages to customers.

I’ve now summed up these learnings into a deep dive talk on the Amazon ECS architecture that I presented at the Wellington AWS Users Group on February 12th 2018.

This talk explains what container orchestration is, covers some key fundamentals about ECS, shows how we’ve tackled CI/CD with ECS, and goes into detail on some of the unique autoscaling challenges caused by millions of cellphones sending home telemetry all at once.

This talk is technical, but includes content appropriate for both beginners wanting to know how ECS functions and experts wanting to see just what can be accomplished with the platform.

 

Puppet Autosigning & Cloud Recommendations

I was over in Sydney this week attending linux.conf.au 2018 and made a short presentation at the Sysadmin miniconf regarding deploying Puppet in cloud environments.

The majority of this talk covers the Puppet autosigning process, which is a big potential security headache if misconfigured. If you’re deploying Puppet (or even some other config management system) into the cloud, I recommend checking this one out (~15mins) and making sure your own setup doesn’t have any issues.
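For context, Puppet supports policy-based autosigning, where the CA runs an executable of your choosing for every incoming CSR and only signs if it exits zero. The talk covers the details and pitfalls; the snippet below is just a minimal sketch of the shape of it – the paths, the pre-shared-secret mechanism and the validation logic are all illustrative assumptions, not the talk’s exact recommendation:

# puppet.conf on the master – point autosign at a policy script instead of
# "true" (which blindly signs everything).
#   [master]
#   autosign = /etc/puppetlabs/puppet/autosign-policy.sh

# /etc/puppetlabs/puppet/autosign-policy.sh
#!/bin/bash
# Puppet invokes this with the certname as $1 and the PEM-encoded CSR on stdin.
# Exit 0 to sign, non-zero to refuse.
CERTNAME="$1"
CSR="$(cat)"
SECRET="$(cat /etc/puppetlabs/puppet/autosign-secret)"

# Only sign if the CSR contains a pre-shared secret embedded at provision time
# (eg via cloud-init userdata) – never trust the certname alone.
if echo "$CSR" | openssl req -noout -text | grep -q "$SECRET"; then
  exit 0
fi

echo "Refusing to autosign ${CERTNAME}" >&2
exit 1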

 

Firebase FCM upstream with Swift on iOS

I’ve been learning a bit of Swift lately in order to write an iOS app for my alarm system. I’m not very good at it yet, but figured I’d write some notes to help anyone else playing with the murky world of Firebase Cloud Messaging/FCM and iOS.

One of the key parts of the design is that I wanted the alarm app and the alarm server to communicate directly with each other without needing public-facing endpoints, rather than the conventional design where the app interacts with the server via an HTTP API.

The intention of this design is that I can dump all the alarm software onto a small embedded computer, and as long as that computer has outbound internet access, it just works™️. There are no headaches about discovering the endpoint of the service, and security is much simpler as there’s no public-facing web server.

Given I need to deliver push notifications to the app, I implemented Google Firebase Cloud Messaging (FCM) – formerly GCM – for push delivery to both iOS and Android apps.

Whilst FCM is commonly used for pushing to devices, it also supports pushing messages back upstream from the device to the server. In order to do this, the server must speak XMPP to FCM and the FCM SDK must be embedded into the app.

The server side was reasonably straightforward: I’ve written a small Java daemon that uses a reference XMPP client implementation and wraps some additional logic to work with HowAlarming.

The client side was a bit more tricky. Google has some docs covering how to implement upstream messaging in the iOS app, but I had a few issues to solve that weren’t clearly detailed there.

 

Handling failure of FCM upstream message delivery

Firstly, it’s important to have some logic in place to handle/report back if a message cannot be sent upstream – otherwise you have no way to tell if it’s worked. To do this in Swift, I added a notification observer for .MessagingSendError, which is posted by the FCM SDK if it’s unable to send upstream.

import UIKit
import Firebase

class AppDelegate: UIResponder, UIApplicationDelegate, MessagingDelegate {

  func application(_ application: UIApplication, didFinishLaunchingWithOptions launchOptions: [UIApplicationLaunchOptionsKey: Any]?) -> Bool {
    ...
    // Trigger if we fail to send a message upstream for any reason.
    NotificationCenter.default.addObserver(self, selector: #selector(onMessagingUpstreamFailure(_:)), name: .MessagingSendError, object: nil)
    ...
  }

  @objc
  func onMessagingUpstreamFailure(_ notification: Notification) {
    // FCM tends not to give us any kind of useful message here, but
    // at least we now know it failed for when we start debugging it.
    print("A failure occurred when attempting to send a message upstream via FCM")
  }
}

Unfortunately I’ve yet to see a useful error code back from FCM in response to any failure to send a message upstream – I seem to just get back a 501 error for anything that has gone wrong, which isn’t overly helpful… especially since in web programming land, any 5xx-series error implies it’s the remote server’s fault rather than the client’s.

 

Getting the GCM Sender ID

In order to send messages upstream, you need the GCM Sender ID. This is available in the GoogleService-Info.plist file that is included in the app build, but I couldn’t figure out a way to extract it easily from the FCM SDK. There probably is a better/nicer way of doing this, but the following hack works:

// Here we are extracting out the GCM SENDER ID from the Google
// plist file. There used to be an easy way to get this with GCM, but
// it's non-obvious with FCM so here's a hacky approach instead.
if let path = Bundle.main.path(forResource: "GoogleService-Info", ofType: "plist") {
  let dictRoot = NSDictionary(contentsOfFile: path)
  if let dict = dictRoot {
    if let gcmSenderID = dict["GCM_SENDER_ID"] as? String {
      self.gcmSenderID = gcmSenderID // make available on AppDelegate to whole app
    }
  }
}

And yes, although we’re all about FCM now, this part hasn’t been rebranded from the old GCM product, so enjoy having yet another acronym in your app.

 

Ensuring the FCM direct channel is established

Finally, the biggest cause of upstream message delivery failing for me was that I was often trying to send an upstream message before FCM had finished establishing the direct channel.

The SDK does this for you automatically whenever the app comes into the foreground, provided that you have shouldEstablishDirectChannel set to true. It can take up to several seconds after application launch to actually complete – which means if you try to send upstream too early, the connection isn’t ready and your send fails with an obscure 501 error.

The best solution I found was to use an observer to listen to .MessagingConnectionStateChanged which is triggered whenever the FCM direct channel connects or disconnects. By listening to this notification, you know when FCM is ready and capable of delivering upstream messages.

An additional bonus of this observer is that by the time it indicates the FCM direct channel is established, the FCM token for the device is also available to your app to use if needed.

So my approach is to:

  1. Set up FCM with shouldEstablishDirectChannel set to true (otherwise you won’t be going upstream at all!).
  2. Set up an observer on .MessagingConnectionStateChanged.
  3. When triggered, use Messaging.messaging().isDirectChannelEstablished to see if we have a connection ready for us to use.
  4. If so, pull the FCM token (device token) and the GCM Sender ID and retain them in AppDelegate for other parts of the app to use at any point.
  5. Dispatch the message upstream with whatever you want in messageData.

My implementation looks a bit like this:

import UIKit
import Firebase

// Assumes AppDelegate also declares `var registrationToken: String?` and
// `var gcmSenderID: String?` properties (omitted here along with other code).
class AppDelegate: UIResponder, UIApplicationDelegate, MessagingDelegate {

  func application(_ application: UIApplication, didFinishLaunchingWithOptions launchOptions: [UIApplicationLaunchOptionsKey: Any]?) -> Bool {
    ...
    // Configure FCM and other Firebase APIs with a single call.
    FirebaseApp.configure()

    // Setup FCM messaging
    Messaging.messaging().delegate = self
    Messaging.messaging().shouldEstablishDirectChannel = true

    // Trigger when FCM establishes its direct connection. We want to know this to avoid race conditions where
    // we try to post upstream messages before the direct connection is ready... which kind of sucks.
    NotificationCenter.default.addObserver(self, selector: #selector(onMessagingDirectChannelStateChanged(_:)), name: .MessagingConnectionStateChanged, object: nil)
    ...
  }

  @objc
  func onMessagingDirectChannelStateChanged(_ notification: Notification) {
    // This is our own function listening for the direct connection to be established.
    print("Is FCM Direct Channel Established: \(Messaging.messaging().isDirectChannelEstablished)")

    if Messaging.messaging().isDirectChannelEstablished {
      // Set the FCM token. Given that a direct channel has been established, it kind of implies that this
      // must be available to us..
      if self.registrationToken == nil {
        if let fcmToken = Messaging.messaging().fcmToken {
          self.registrationToken = fcmToken
          print("Firebase registration token: \(fcmToken)")
        }
      }

      // Here we are extracting out the GCM SENDER ID from the Google plist file. There used to be an easy way
      // to get this with GCM, but it's non-obvious with FCM so we're just going to read the plist file.
      if let path = Bundle.main.path(forResource: "GoogleService-Info", ofType: "plist") {
        let dictRoot = NSDictionary(contentsOfFile: path)
        if let dict = dictRoot {
          if let gcmSenderID = dict["GCM_SENDER_ID"] as? String {
            self.gcmSenderID = gcmSenderID
          }
        }
      }

      // Send an upstream message
      let messageId = ProcessInfo().globallyUniqueString
      let messageData: [String: String] = [
        "registration_token": self.registrationToken!, // In my use case, I want to know which device sent us the message
        "marco": "polo"
      ]
      let messageTo: String = self.gcmSenderID! + "@gcm.googleapis.com"
      let ttl: Int64 = 0 // Seconds. 0 means "do immediately or throw away"

      print("Sending message to FCM server: \(messageTo)")

      Messaging.messaging().sendMessage(messageData, to: messageTo, withMessageID: messageId, timeToLive: ttl)
    }
  }

  ...
}

For a full FCM downstream and upstream implementation example, you can take a look at the HowAlarming iOS app source code on GitHub, and if you need a server reference, take a look at the HowAlarming GCM server in Java.

 

Learnings

It’s been an interesting exercise – I wouldn’t particularly recommend this architecture for anyone building real-world apps. The main headaches I ran into were:

  1. The FCM SDK just seems a bit buggy. I had a lot of trouble with the GCM SDK and the move to FCM did improve things a bit, but there are still a number of issues that occur from time to time. For example: occasionally an FCM Direct Channel isn’t established, for no clear reason, until the app is terminated and restarted.
  2. Needing to do things like making sure the FCM Direct Channel is ready before sending upstream messages should probably be handled transparently by the SDK rather than by the app developer.
  3. I have yet to get background code execution on notifications working properly. I get the push notification without a problem, but seem to be unable to trigger my app to execute code even with content-available == 1. Maybe a bug in my code, or FCM might be complicating the mix in some way versus using pure APNS. Probably my code.
  4. It’s tricky using FCM messages alone to populate the app data – I occasionally have issues such as messages arriving out of order, not arriving at all, or ending up with duplicates. This requires the app code to process, sort and re-populate the table view controller, which isn’t a lot of fun. I suspect it would be a lot easier to simply re-populate the view controller on load from an HTTP endpoint and use FCM messages just to trigger refreshes of the data if the user taps on a notification.

So my view for other projects in future would be to use FCM purely for server->app message delivery (ie: “tell the user there’s a reason to open the app”) and then rely entirely on a classic app client and HTTP API model for all further interactions back to the server.

MongoDB document depth headache

We ran into a weird problem recently where we were unable to sync a replica set running MongoDB 3.4 when adding new members to the replica set.

The sync would begin, but at some point during the sync it would always fail with:

[replication-0] collection clone for 'database.collection' failed due to Overflow:
While cloning collection 'database.collection' there was an error
'While querying collection 'database.collection' there was an error 
'BSONObj exceeded maximum nested object depth: 200''

(For extra annoyance, the sync would continue syncing all the other databases and collections on the replica set, only realising at the very end that it had actually failed earlier, and then restart the whole sync from the beginning again.)

 

The error means that one or more documents have a nesting depth over 200. This could be a chain of objects or a chain of arrays in a document – a mistake that isn’t too tricky to cause with a buggy loop or ORM.
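To make that failure mode concrete, here’s a hypothetical illustration (database, collection and field names are made up) of the sort of buggy loop that produces such a document – each iteration re-wraps the previous value rather than appending to a flat array:

mongo mydatabase --eval '
  var doc = { history: "initial" };
  for (var i = 0; i < 250; i++) {
    doc = { history: [ doc ] };  // each pass adds two more levels of nesting
  }
  db.mycollection.insert(doc);   // recent versions should refuse this at write time
'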

But how is it possible that this document could be in the database in the first place? Surely it should have been refused at insert time? Well, the nested document depth limit and its enforcement have changed at various times in past versions, and a long-lived database such as ours from early MongoDB 2.x days may have had these bad documents inserted before the max depth limit was enforced – only now, when we try to use the document, do the limits become a problem.

In our case the document was old and had no issues syncing on Mongo 3.0, but it now failed with Mongo 3.4.

Finding the document is tricky – the replication process helpfully does not log the document ID, so you can’t go and purge it from the collection to resolve the issue.

With input from colleagues with far better Mongo skills than I, we figured out three queries that allowed us to identify the bad documents.

1. This query finds any documents that have a long chain of nested objects inside them.

db.collection.find({ $where: function() { return tojsononeline(this).indexOf("} } } } } } } } }") != -1 } })

2. This query finds any documents that have a long chain of nested arrays. This was the specific issue in our case and this query successfully identified all the bad documents.

db.collection.find({ $where: function() { return tojsononeline(this).indexOf("] ] ] ] ] ] ]") != -1 } })

3. And if you get really stuck, you can find any bad documents (whatever the reason) by reading each document and then re-writing it back out to another collection. This ensures the documents get all the limits applied at write time and identifies the IDs of any that are refused, regardless of the specific reason for the refusal.

db.collection.find({}).forEach(function(d) { print(d["_id"]); db.new_collection.insert(d) });

Note that all of these queries tend to be performance impacting since you’re asking your database to read every single document. And the last one, copying collections, could take considerable time and space to complete.

If you have any data of notable size, I recommend restoring the replica set to a test system and performing the operation there, where you know it’s not going to impact production.
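If you go down that route, a minimal sketch looks something like the following – hostnames, database and collection names are placeholders for your own environment:

# Dump just the suspect collection from a secondary...
mongodump --host secondary.example.com --db database --collection collection --out /tmp/dump

# ...and restore it into a throwaway test instance to run the queries against.
mongorestore --host mongo-test.example.com /tmp/dump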

Once you find your bad document, you can display it with:

db.collection.find({ _id: ObjectId("54492129902178d6f600004f") });

And delete it entirely (assuming nothing important in it!) with:

db.collection.deleteOne({ _id: ObjectId("54492129902178d6f600004f") });

MacOS High Sierra unable to free disk space

I recently ran out of disk space on my iMac. After migrating a considerable amount of undesirable data to either the file server or /dev/null, I found that despite my efforts, the amount of free disk space had not increased.

I was worried it was an issue with the new APFS file system introduced to all SSD-using Macs as of High Sierra, but in this case it turns out the issue is that Time Machine retains local snapshots on disk, in addition to the full backup history that is retained on the network time machine device.

Apple state that they automatically remove local snapshots when disk space is low, but their definition of low is apparently only 5GB of free space remaining – not really much free working space in 2017 when you might want scratch space of 22GB for 1 hour of 4k 30FPS footage.

On older MacOS releases it was possible to disable the local snapshot feature entirely; this doesn’t seem to be the case with High Sierra – but it does appear to be possible to force an immediate purge of local snapshots with the following command:

sudo tmutil thinLocalSnapshots / 10000000000 4

Back into the time vortex with you, filthy snapshots!

Note that this snapshot usage is not visible as a distinct item in the Disk Utility or Storage Management application.
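They are visible from the command line though – on High Sierra at least, tmutil appears to be able to list the local snapshots and (if you want to be more surgical than the thin command above) delete a specific one by its date stamp:

tmutil listlocalsnapshots /
sudo tmutil deletelocalsnapshots 2018-01-01-103000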

In my case, all the snapshots appeared to be within the last 24 hours, so if I hadn’t urgently needed the disk space, I suspect the local snapshots would have flushed themselves after a 24 hour period restoring considerable disk space.

The fact this isn’t an opt-in user-accessible feature is a shame. It adds convenience for a user of not having to get physical access to the backup drive or time capsule-like-thingy in order to restore data, but any users of systems with SSD-only storage are likely to be a bit precious about how every GB is used and there’s almost no transparency about how much space is being consumed. Especially annoying when you urgently need more space and are stuck wondering why nothing is freeing up room…

Access Route53 private zones cross account

Using Route53 private zones can be a great way to maintain a private internal zone for your server infrastructure. However sometimes you may need to share this zone with another VPC in the same or in another AWS account.

The first situation is easy – a Route53 zone can be associated with any number of VPCs within a single AWS account using the AWS console.

The second is more tricky but is doable by creating a VPC association authorization request in the account with the zone, then accepting it from the other account.

# Run against the account with the zone to be shared.
aws route53 \
create-vpc-association-authorization \
--hosted-zone-id abc123 \
--vpc VPCRegion=us-east-1,VPCId=vpc-xyz123 

# Run against the account that needs access to the private zone.
aws route53 \
associate-vpc-with-hosted-zone \
--hosted-zone-id abc123 \
--vpc VPCRegion=us-east-1,VPCId=vpc-xyz123 \
--comment "Example Internal DNS Zone"

# List authori(z|s)ations once done
aws route53 \
list-vpc-association-authorizations \
--hosted-zone-id abc123

This doesn’t even require VPC peering since it works behind the scenes, with the associated zone now being resolvable using the default VPC DNS server in each VPC that has been associated.

Note that the one catch is that this does not help you if you’re linking to a non-AWS environment, such as an on-prem data centre connected via IPSec VPN or Direct Connect. Even though you can route to the VPC and the systems inside it, the AWS DNS resolver for the VPC will refuse requests from IP space outside of the VPC itself.

So the only option is to have an EC2 instance acting as a DNS forwarder inside the VPC – it is reachable from the linked data centre, and yet, since it’s in the VPC, it can use the resolver.
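As a rough sketch of such a forwarder (the package choice, domain, resolver IP and addresses are illustrative assumptions – the VPC resolver lives at the VPC network base address +2):

# On an EC2 instance inside the VPC that the on-prem network can reach:
apt-get install -y dnsmasq

cat > /etc/dnsmasq.d/route53-private.conf << EOF
# Forward queries for the private zone to the VPC DNS resolver (base +2).
server=/internal.example.com/10.0.0.2
EOF

systemctl restart dnsmasq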

FailberryPi – Diverse carrier links for your home data center

Given the amount of internet connected things I now rely on at home, I’ve been considering redundant internet links for a while. And thanks to the affordability of 3G/4G connectivity, it’s easier than ever to have a completely diverse carrier at extremely low cost.

I’m using 2degrees which has a data SIM sharing service that allows me to have up to 5 other devices sharing the one data plan, so it literally costs me nothing to have the additional connection available 24×7.

My requirements were to:

  1. Handle the loss of the wired internet connection.
  2. Ensure that I can always VPN into the house network.
  3. Ensure that the security cameras can always upload footage to AWS S3.
  4. Ensure that the IoT house alarm can always dispatch events and alerts.

I ended up building three distinct components to build a failover solution that supports flipping between my wired (VDSL) and wireless (3G) connection:

  1. A small embedded GNU/Linux system that can bridge a USB 3G modem and an ethernet connection, with smarts to recover from various faults (like a crashed 3G stick).
  2. A dynamic DNS solution, since my mobile telco certainly isn’t going to give me a static IP address, but I need inbound traffic.
  3. A DNS failover solution so I can redirect inbound requests (eg home VPN) to the currently active endpoint automatically when a failure has occurred.

 

The Hardware

I considered using a Mikrotik with USB for the 3G link – it is a supported feature, but I decided to avoid this route since I would need to replace my perfectly fine router with one that has a USB port, plus I know from experience that USB 3G modems are fickle beasts that would likely need some scripting to work around various issues.

For the same reason I excluded some 3G/4G router products available that take a USB modem and then provide ethernet or WiFi. I’m very dubious about how fault tolerant these products are (or how secure if consumer routers are anything to go by).

I started off the project using a very old embedded GNU/Linux board and 3G USB modem I had in the spare parts box, but unfortunately whilst I did eventually recycle this hardware into a working setup, the old embedded hardware had a very poor USB controller and was throttling my 3G connection to around 512kbps. :-(

Initial approach – Not a bomb, actually an ancient Gumstix Verdex with 3G modem.

So I started again, this time using the very popular Raspberry Pi 2B hardware as the base for my setup. This is actually the first time I’ve played with a Raspberry Pi and I really enjoyed the experience.

The requirements for the router are extremely low – move packets between two interfaces, dial a modem and run some scripts. It feels wasteful using a whole Raspberry Pi with its 1GB of RAM and quad-core ARM CPU, but they’re so accessible and affordable that it’s not worth the time messing around with more obscure embedded boards.

Pie ingredients

It took me all of 5 mins to assemble and boot an OS on this thing and have a full Debian install ready for work. For this speed and convenience I’ll happily pay a small price premium for the Raspberry Pi over some other random embedded vendor with a much more painful install and upgrade process.

Baked!

It’s important to get a good power supply – 3G/4G modems tend to consume the full 500mA available to them over USB. I kept getting under-voltage warnings (the red light on the Pi turns off) with the 2.1 Amp phone charger I was using. I ended up buying the official 2.5 Amp Raspberry Pi charger, which powers the Raspberry Pi 2 + the 3G modem perfectly.

I bought the smallest (& cheapest) class 10 Micro SDHC card possible – 16GB. Of course this is way more than you actually need for a router; 4GB would have been plenty.

The ZTE MF180 USB 3G modem I used is a tricky beast on Linux, thanks to the kernel seeing it as a SCSI CDROM drive initially which masks the USB modem features. Whilst Linux has usb_modeswitch shipping as standard these days, I decided to completely disable the SCSI CDROM feature as per this blog post to avoid the issue entirely.

 

The Software

The Raspberry Pi I was given (thanks Calcinite!) had a faulty GPU so the HDMI didn’t work. Fortunately the Raspberry Pi doesn’t let a small issue like having no display hold it back – it’s trivial to flash an image to the SD card from another machine and boot a headless installation.

  1. Download Raspbian minimal/lite (Debian + Raspberry Pi goodness).
  2. Install the image to the SD card using the very awesome Etcher.io (think “safe dd” for noobs) as per the install instructions, using my iMac.
  3. Enable SSH as per the instructions: “SSH can be enabled by placing a file named ssh, without any extension, onto the boot partition of the SD card. When the Pi boots, it looks for the ssh file. If it is found, SSH is enabled, and the file is deleted. The content of the file does not matter: it could contain text, or nothing at all.” (See the one-liner after this list.)
  4. Login with username “pi” and password “raspberry”.
  5. Change the password immediately before you put it online!
  6. Upgrade the Pi and enable automated updates in future with:
    apt-get update && apt-get -y upgrade
    apt-get install -y unattended-upgrades
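For step 3, assuming the SD card’s FAT boot partition auto-mounts at /Volumes/boot on the Mac (adjust the path if yours mounts elsewhere), creating the ssh file is a one-liner:

touch /Volumes/boot/ssh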

The rest is somewhat specific to your setup, but my process was roughly:

  1. Install apps needed – wvdial for establishing the 3G connection via AT commands + PPP, iptables-persistent for firewalling, libusb-dev for building hub-ctrl and jq for parsing JSON responses.
    apt-get install -y wvdial iptables-persistent libusb-dev jq
  2. Configure a firewall. This is very specific to your network, but you’ll want both ipv4 and ipv6 rules in /etc/iptables/rules.*. Generally you’d want something like the following (there’s a worked iptables sketch after this list):
    1. Masquerade (NAT) traffic going out of the ppp+ and eth0 interfaces.
    2. Permit forwarding traffic between the interfaces.
    3. Permit traffic in on port 9000 for the health check server.
  3. Enable IP forwarding (net.ipv4.ip_forward=1) in /etc/sysctl.conf.
  4. Build hub-ctrl. This utility allows the power cycling of the USB controller + attached devices in the Raspberry Pi, which is extremely useful if your 3G modem has terrible firmware (like mine) and sometimes crashes hard.
    wget https://raw.githubusercontent.com/codazoda/hub-ctrl.c/master/hub-ctrl.c
    gcc -o hub-ctrl hub-ctrl.c -lusb
  5. Build pinghttpserver. This is a tiny C-based webserver which we can use to check if the Raspberry Pi is up (Can’t use ICMP as detailed further on).
    wget -O pinghttpserver.c https://gist.githubusercontent.com/jethrocarr/c56cecbf111af8c29791f89a2c30b978/raw/9c53f66fbed609d09652b8c4ceff0194876c05a3/gistfile1.txt
    make pinghttpserver
  6. Configure /etc/wvdial.conf. This will vary by the type of 3G/4G modem and also the ISP in use. One key value is the APN that you use. In my case, I had to set it to “direct” to ensure I got a real public IP address with no firewalling, instead of getting a CGNAT IP, or a public IP with inbound firewalling enabled. This will vary by carrier!
    [Dialer Defaults]
    Init1 = ATZ
    Init2 = ATQ0 V1 E1 S0=0 &C1 &D2 +FCLASS=0
    Init3 = AT+CGDCONT=1,"IP","direct"
    Stupid Mode = 1
    Modem Type = Analog Modem
    Phone = *99#
    Modem = /dev/ttyUSB2
    Username = { }
    Password = { }
    New PPPD = yes
  7. Edit /etc/ppp/peers/wvdial to enable “defaultroute” and “replacedefaultroute” – we want the wireless connection to always be the default gateway when connected!
  8. Create a launcher script and (once tested) call it from /etc/rc.local at boot. This will start up the 3G connection at boot and launch the various processes we need. (This could be nicer as a collection of systemd services, but dammit, I was lazy, OK?) It also handles reboots and power-cycling the USB bus if problems are encountered, for (an attempt at) automated recovery.
    wget -O 3g_failover_launcher.sh https://gist.githubusercontent.com/jethrocarr/a5dae9fe8523cf74d30a065d77d74876/raw/57b5860a9b3f6a048b02b245f3628ee60ea766dc/3g_failover_launcher.sh
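Going back to step 2, here’s a hedged sketch of the firewall rules described above – the interface names match my setup (eth0 for the LAN, ppp+ for the 3G link) but the exact rules are illustrative, so tailor them to your own network and mirror anything relevant into the ip6tables rules.v6 file:

# NAT outbound traffic leaving via the 3G (ppp+) and wired (eth0) interfaces.
iptables -t nat -A POSTROUTING -o ppp+ -j MASQUERADE
iptables -t nat -A POSTROUTING -o eth0 -j MASQUERADE

# Permit forwarding between the LAN and the 3G uplink.
iptables -A FORWARD -i eth0 -o ppp+ -j ACCEPT
iptables -A FORWARD -i ppp+ -o eth0 -m state --state ESTABLISHED,RELATED -j ACCEPT

# Permit traffic to the health check server (pinghttpserver) on port 9000.
iptables -A INPUT -p tcp --dport 9000 -j ACCEPT

# Persist so iptables-persistent loads the rules at boot.
iptables-save > /etc/iptables/rules.v4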

At this point, you should be left with a Raspberry Pi that gets a DHCP lease on its eth0, dials up a connection with your wireless telco and routes all traffic it receives on eth0 to the ppp interface.

In my case, I setup my Mikrotik router to have a default GW route to the Raspberry Pi and the ability to failover based on distance weightings. If the wired connection drops, the Mikrotik will shovel packets at the Raspberry Pi, which will happily NAT them to the internet.

 

The DNS Failover

The work above got me an outbound failover solution, but it’s no good for inbound traffic without a failover DNS record that flips between the wired and wireless connections for the VPN to target.

Because the wireless link would be getting a dynamic IP address, the first requirement was a dynamic DNS service. There are various companies around offering free or commercial products for this, but I chose to use a solution built around AWS Lambda that can be granted access directly to my DNS hosted inside Route53.

AWS have a nice reference dynamic DNS solution available here that I ended up using (sadly it’s not built with the Serverless framework, so there’s a bit more point+click setup than I’d like, but hey).

Once configured and a small client script installed on the Raspberry Pi, I had reliable dynamic DNS running.

The last bit we need is DNS failover. The solution I used was the native AWS Route53 Health Check feature, where AWS adjust a DNS record based on the health of monitored endpoints.

I setup a CNAME with the wired connection as the “primary” and the wireless connection as the “secondary”. The DNS CNAME will always point to the primary/wired connection, unless its health check fails, in which case the CNAME will point to the secondary/wireless connection. If both fail, it fails-safe to the primary.
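If you prefer the CLI to the console point+click, the setup looks roughly like the following – this is a hedged sketch only, with the zone ID, hostnames and health check IDs all being placeholders rather than my real configuration:

# 1. Create a health check per connection, hitting pinghttpserver on the
#    connection's dynamic DNS name (repeat for the wireless endpoint).
aws route53 \
create-health-check \
--caller-reference wired-check-1 \
--health-check-config Type=HTTP,FullyQualifiedDomainName=wired.example.com,Port=9000,ResourcePath=/,RequestInterval=30,FailureThreshold=3

# 2. Create the PRIMARY/SECONDARY failover CNAMEs, attaching the health
#    check IDs returned by the previous step.
aws route53 \
change-resource-record-sets \
--hosted-zone-id ZEXAMPLE123 \
--change-batch '{
  "Changes": [
    { "Action": "UPSERT", "ResourceRecordSet": {
        "Name": "vpn.example.com", "Type": "CNAME", "TTL": 60,
        "SetIdentifier": "wired", "Failover": "PRIMARY",
        "HealthCheckId": "wired-health-check-id",
        "ResourceRecords": [ { "Value": "wired.example.com" } ] } },
    { "Action": "UPSERT", "ResourceRecordSet": {
        "Name": "vpn.example.com", "Type": "CNAME", "TTL": 60,
        "SetIdentifier": "wireless", "Failover": "SECONDARY",
        "HealthCheckId": "wireless-health-check-id",
        "ResourceRecords": [ { "Value": "wireless.example.com" } ] } }
  ] }'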

A small webserver (pinghttpserver) that we built earlier is used to measure connectivity – the Route53 Health Check feature unfortunately lacks support for ICMP connectivity tests hence the need to write a tiny server for checking accessibility.

This webserver runs on the Raspberry Pi, but I do a dst port NAT to it on both the wired and wireless connections. If the Pi should crash, the connection will always fail safe to the primary/wired connection since both health checks will fail at once.

There is a degree of flexibility to the Route53 health checks. You can use a CloudWatch alarm instead of the HTTP check if desired. In my case, I’m using a Lambda I wrote called “lambda-ping” (creative, I know) which does HTTP “pings” to remote endpoints and records the response code plus latency. (Annoyingly it’s not possible to do ICMP pings with Lambda either, since the containers that Lambdas execute inside of lack the CAP_NET_RAW kernel capability, hence the “ping-like” behaviour.)

lambda-ping in action

I use this since it gives me information on more than just my failover internet links (eg my blog, other sites, etc) and acts as my Pingdom / New Relic Synthetics alternative.

 

Final Result

After setting it all up and testing, I’ve installed the Raspberry Pi into the comms cabinet. I was a bit worried that all the metal casing would create a faraday cage, but it seems to be working OK (I also placed it so that the 3G modem sticks out of the cabinet surrounds).

So far so good, but if I get spotty performance or other issues I might need to consider locating the FailberryPi elsewhere where it can get clear access to the cell towers without disruption (maybe sealed ABS box on the roof?). For my use case, it doesn’t need to be ultra fast (otherwise I’d spend some $ and upgrade to 4G), but it does need to be somewhat consistent and reliable.

Installed on a shelf in the comms cabinet, along side the main Mikrotik router and the VDSL modem

So far it’s working well – the outbound failover could do with some tweaking to better handle partial failures (eg VDSL link up, but no international transit), but the failover for the inbound works extremely well.

A few remaining considerations/recommendations for anyone considering a setup like this:

  1. If using the one telco for both the wireless and the wired connection, you’re still at risk of a common fault taking out both services since most ISPs will share infrastructure at some level – eg the international gateway. Use a completely different provider for each service.
  2. Using two wired ISPs (eg Fibre with VDSL failover) is probably a bit pointless – they’re probably both going back to the same exchange or along the same conduit, waiting for a single backhoe to take them both out at once.
  3. It’s kind of pointless if you don’t put this behind a UPS, otherwise you’ll still be offline when the power goes out. Strongly recommend having your entire comms cabinet on UPS so your wifi, routing and failover all continue to work during outages.
  4. If you failover, be careful about data usage. Your computers won’t know they’re on an expensive mobile connection with limited data and they’ll happily download updates, steam games, backups, etc…. One approach is using a firewall to whitelist select systems only for failover (eg IoT devices, alarm, cameras) and leaving other devices like laptops blocked to prevent too much billshock.
  5. Partial ISP outages are still a PITA. Eg, if routing is broken to some NZ ISPs, but international is fine, the failover checks from ap-southeast-2 won’t trigger. Additional ping scripts could help here (eg check various ISP gateways from the Pi), but that’s getting rather complex and tries to solve a problem that’s never completely fixable.
  6. Just buy a Raspberry Pi. Don’t waste time/effort trying to hack some ancient crap together – it wastes far too much time and often falls flat. And don’t use an old laptop/desktop; there’s too much to fail on them like fans, HDDs, etc. The Pi is solid embedded electronics.
  7. Remember that your Pi is essentially a server attached to the public internet. Make sure you configure firewalls and automatic patching and any other hardening you deem appropriate for such a system. Lock down SSH to keys only, IP restrict, etc.

Easy APT repo in S3

When running a number of Ubuntu or Debian servers, it can be extremely useful to have a custom APT repo for uploading your own packages, or third party packages that lack their own good repositories to subscribe to.

I recently found a nice Ruby utility called deb-s3 which allows easy uploading of dpkg files into an S3-hosted APT repository. It’s much easier than messing around with tools like reprepro and having to s3 cp or sync files up from a local disk into S3.

One main warning: This will create a *public* repo by default since it works out-of-the-box with the stock OS and (in my case) all the packages I’m serving are public open source programs that don’t need to be secured. If you want a *private* repo, you will need to use apt-transport-s3 to support authenticating with S3 to download files and configure deb-s3 for private upload.

Install like any other Ruby Gem:

gem install deb-s3
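If you haven’t created the bucket yet, that’s a one-liner with the aws-cli (bucket name is a placeholder):

aws s3 mb s3://example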

Adding packages is easy. First make sure your aws-cli is working OK and an S3 bucket has been created, then upload with:

deb-s3 upload \
--bucket example \
--codename codename \
--preserve-versions \
mypackage.deb

You can then add the repo to a Ubuntu or Debian server with:

# We trust HTTPS rather than GPG for this repo - but you can config
# GPG signing if you prefer.
cat > /etc/apt/sources.list.d/myrepo.list << EOF
deb [trusted=yes] https://example.s3.amazonaws.com codename main
EOF

# and ensure you update the package info on the server
apt-get update

Alternatively, here’s an example of how to add the repo with Puppet:

apt::source { 'myrepo':
 comment        => 'This is our own APT repo',
 location       => 'https://example.s3.amazonaws.com',
 release        => $::os["distro"]["codename"],
 repos          => 'main',
 allow_unsigned => true, # We don't GPG sign, HTTPS only
 notify_update  => true, # triggers apt-get update
}

Detectatron

I recently installed security cameras around my house and they are doing an awesome job of recording all the events that take place around the house and grounds (generally of the feline variety).

Unfortunately the motion capture tends to be overly trigger happy and I end up with heaps of recordings of trees waving, clouds moving or insects flying past. It’s not a problem from a security perspective as I’m not missing any events, but it makes it harder to check the feed for noteworthy events during the day.

I wanted some logic for processing the videos being generated, so I wrote a proof of concept that sucks video out of the Ubiquiti Unifi Video server and then analyses it with Amazon Web Services’ new AI product “Rekognition” to identify interesting videos worthy of note.

What this means, is that I can now filter out all the noise from my motion recordings by doing image recognition and flagging the specific videos that feature events I consider interesting, such as footage featuring people or cats doing crazy things.

I’ve got a 20 minute talk about this system which you can watch below, introducing its capabilities and how I’m using the AWS Rekognition service to solve this problem. The talk was for the Wellington AWS Users Group, so it focuses a bit more on the AWS aspects of Rekognition and AWS architecture rather than the Unifi Video integration side of things.

The software I wrote has two parts – “Detectatron”, which is the backend Java service that processes each video and stores it in S3 after processing, and the connector I wrote for integration with the Unifi Video service. These can be found at:

https://github.com/jethrocarr/detectatron
https://github.com/jethrocarr/detectatron-connector-unifi

The code quality is rather poor right now – insufficient unit tests, bad structure and in need of a good refactor, but I wanted to get it up sooner rather than later… since perfection is always the enemy of just shipping something.

Note that whilst I’ve only added support for the product I use (Ubiquiti’s Unifi Video), I’ve designed it so that it’s pretty trivial to build connectors for other platforms. I’d love to see contributions like connectors for ZoneMinder and other popular open source or commercial platforms.

If you’re using Unifi Video, my connector will automatically mark any videos it deems as interesting as locked videos, for easy filtering using the native Unifi Video apps and web interface.

It also includes an S3 upload feature – given that I integrated with the Unifi Video software, it was a trivial step to extend it to also upload every video the system records into S3 within a few seconds for off-site retention. This performs really well, my on-prem NVR really struggled to keep up with uploads when using inotify + awscli to upload footage, but using my connector and Detectatron it has no issues keeping up with even high video rates.

Surveillance State “at home” Edition

A number of months ago I purchased a series of Ubiquiti UniFi video surveillance cameras. These are standard IP ethernet cameras and use a free (as-in-beer) server agent that runs happily on GNU/Linux to manage the recording and motion detection, which makes them a much more attractive offering than other proprietary systems that use their own specific NVRs.

When I first got them I hooked them up inside the house to test, with the intention of installing them properly on the outside of the house later. This plan got delayed somewhat when we adopted two lovely kittens, which immediately removed any incentive I had to actually install the cameras properly – it was just too much fun watching the cats rather than keeping an eye out for axe murderers roaming the property.

I had originally ordered the 720p model, but during this time of kitten watching, Ubiquiti brought out a new 1080p “g3” model which provides better resolution as well as offering a much nicer-looking and easier-to-install form factor – so I now have a mix of both generations.

The following video shows some footage taken from the older 720p model:

During this test phase we also captured the November 2016 Wellington earthquake on the cameras, using a mix of both generations of camera:

Finally with the New Year break, I got the time and motivation to get back up into the attic and install the cameras properly. This wasn’t a technically challenging task – mostly just a case of running cabling, but it’s a right PITA due to the difficulty of moving around in my attic thanks to heaps of water pipes, electrical wires, data wires and joists all hidden under a good foot or two of insulation.

 

 

On the plus side, the technical requirements for the cameras are pretty simple. Each camera is a Power-over-Ethernet (PoE) device, which means it gets both data and power via a single cable, which makes installation simple – no mains electrical wiring, just need to get a single cat6 cable to wherever you want the camera to sit. The camera then connects to the switch and of course the server running the included software.

I am aware of some vendors selling wireless cameras that use WiFi with a battery that needs to be recharged every so often. I can see the use and appeal for renters, but as a home owner, a hard wired system is going to be much easier and more reliable in the long term.

Ubiquiti sell the camera either with or without a PoE adaptor. Using the included PoE adaptor means you can connect them to essentially any existing switch, but if installing a number of cameras this can create a cable management nightmare. I’d strongly recommend a PoE switch if installing more than 5 cameras, even taking into account their higher cost.

A PoE switch suddenly didn’t seem like such an expensive investment…

The easiest installation was the remote shed camera. Conveniently the shed has mains electrical wiring, but I needed to install a wireless AP to connect back to the house as running ethernet out there is just a bit too difficult.

I used Ubiquiti’s airGW-LR product which is a low cost access point that is designed to clip to their standard PoE supply. End result is a really tidy setup with a single power supply for both devices and with both devices mounted on a robust bracket for easy installation.

720p camera + airGW + PoE supply

The house cameras were a bit more work. It took me roughly a day to run cabling through the attic – my house isn’t easy to move around in the roof or floor space, so it takes longer than some others. Also a tip – it’s much easier running cabling *before* the insulation is installed, so if you’re thinking of doing both, install the ethernet in advance.

High ceilings and a small attic entrance is just the start of the hassles of running cabling.

The annoying moment when you drill into a stud and end up with a hole that needs filling again. (with solid hardwood walls and ceilings, stud finders don’t work well at my place)

Once the cable run had been completed, I crimped the outside ends with RJ45 connectors for the cameras and then proceeded to take apart the existing patch panel, which also required removing most of the gear in the comms cabinet to free up room to work.

Couple tips for anyone else doing this:

  • I left plenty of excess cable on my ethernet runs. This allowed me to crimp the camera end whilst standing comfortably on the ground, then when I installed the camera I just pushed up all the excess into attic. Ethernet cable is cheap compared to one’s time messing around up at the tops of ladders.
  • The same applies at the patch panel – make sure to leave enough slack to allow you to easily take the patch panel off and work on it in the future – you can see from the picture below I have a good length spare that comes out of the wall.
  • Remember to wire the RJ45 connectors and the patch panel to the same standard – I managed to do T568B at the camera end and T568A at the patch panel on my first attempt.
  • Test each cable as you complete the wiring. Because of this I caught the above issue on the first camera and it saved me a lot of pain in future. A cheap ethernet tester can be found online for ~$10 and is worth having in your tool kit.

Down to only 4/24 ports free on the patch panel! I expect the last 4 will be consumed by WiGig/802.11ad in future, since it will require an AP per room in order to get high performance – I might even need a second patch panel… good thing I bought the large wall-mounted cabinet.

 

With the cabling done, I connected all the PoE adaptors. These are a bit of a PITA if you’re using a rack – you could get a small rackmount shelf with holes and cable tie down, but I went for cable tying them to the outside of the cabinet.

I also colour coded the output from the PoE adaptors. You need to be careful with passive PoE adaptors – you can potentially damage computers and network equipment if you connect them to the adaptor by mistake – so I used the colour coding to make it very clear which cables are which.

Finished cabling installation. About as tidy as I can get it in here without moving to using custom length patch cables…. but crimping 30+ patch cables by hand isn’t my idea of a good time.

 

Having completed the cabling and putting together the networking gear and PoE adaptors, I could finally install the cameras themselves. This isn’t particularly hard – you basically just need to be able to screw something to the side of the house and then aim the camera in the right direction.

The older 720p model is the most annoying to install as it requires adjusting everything using an allen key, plus the cable must be exposed with a drip loop. It’s also more of an eyesore, which is a mixed bag – you get a better deterrence aspect, but it can look a bit ugly on the house.

The newer model is more aesthetically pleasing, but it’s possible some people might not realise it’s a camera which could be a downside for deterrence.

That being said, they look OK when installed on the house – certainly no worse than the ugly alarm and sensor lights you get on many houses. I even ended up putting one inside to give me complete visibility of the hallway linking every room in the house and it’s not much more visible than a large alarm PIR sensor.

Some additional features worth noting:

  • All the cameras have built in IR, which means they provide decent footage, even at night time. The cameras switch an IR filter on/off automatically as required.
  • All the cameras have built in microphones. Whilst they capture a lot of background wind noise, they’re also quite good at picking up conversations even when outside – it’s a handy tool for gathering intel on any unwanted guests.

 

With all the hardware completed, onto the software. Ubiquiti supply their server software free-of-charge. It’s easy enough to download and install, but if you have Puppetised your home server (of course you have right?) I have a Puppet module here for you.

 

Generally I’ve found the software solution (including the iOS mobile app) to be pretty good, but there are two main issues to be aware of with it:

  1. First is that the motion detection is pretty dumb and works on percentage of image changed. This means windy areas with lots of greenery get lots of unwanted recordings made. It doesn’t cause technical issues, but it does make for a noisy set of recordings – don’t expect it to *only* record events of note; you’ll get all the burglars and axe murderers, but also every neighbourhood cat and the nearby trees on windy days. Oh, and at night you get lots of footage of moths when they fly close to the camera with the IR night vision on.

  2. Second is that I found a software bug in the mobile apps where they did not validate SSL certs properly and got a very poor response from Ubiquiti. That being said one of their reps recently claimed they’ve hired more security staff to deal with their poor responsiveness, so let’s see what happens on this front.

 

 

One feature which is strangely absent is support for automatically uploading recordings to a cloud storage service. It’s not possible for everyone, but if you’re on a fast connection (eg VDSL, UFB) it’s worth uploading all recordings to something like Amazon S3 so that an attacker can’t subsequently break in and remove the recording hardware.

My approach was to set up lsyncd to listen to inotify events from Linux every time a video file is written to disk and then quickly copy that file up into Amazon S3, where it remains for a prolonged period.
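I won’t reproduce my lsyncd configuration here, but a rough shell equivalent using inotifywait illustrates the idea – this is a sketch rather than my actual setup, and it assumes inotify-tools and the aws-cli are installed, with the recording path and bucket name being placeholders:

inotifywait -m -r -e close_write --format '%w%f' /srv/unifi-video/videos | while read -r video; do
  aws s3 cp "$video" "s3://example-cctv-archive/$(basename "$video")"
done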

If you can’t achieve this due to poor internet performance, your best bet is to put the video recording server in a difficult to find and/or access location, sufficient to prevent the casual intruder from finding it. If you have a proper monitored alarm system they shouldn’t be lingering long enough to find it.

 

Stability seems good. I’ve been running these cameras since April and have never had the server agent or the cameras crash or fail to record. I’m using a Mac Mini for the camera server but you can always buy an embedded black-box NVR solution from Ubiquiti themselves. If you’re on a budget, a second hand Mac Mini or Intel NUC might be better value for money – just make sure it’s 64bit, not an older gen 32bit device.