Deep Dive into ECS

I spent a fair bit of time in 2017 re-architecting the carnival.io platform onto Amazon ECS, including tackling some tricky autoscaling challenges brought on by the sudden high-load spikes we experience when delivering push messages to customers.

I’ve now summed up these learnings into a deep dive talk on the Amazon ECS architecture that I presented at the Wellington AWS Users Group on February 12th 2018.

This talk explains what container orchestration is, covers some key fundamentals about ECS and how we’ve tackled CI/CD with ECS, and goes into detail on some of the unique autoscaling challenges caused by millions of cellphones sending home telemetry all at once.

This talk is technical, but includes content appropriate for both beginners wanting to know how ECS functions and experts wanting to see just what can be accomplished with the platform.

 


Puppet Autosigning & Cloud Recommendations

I was over in Sydney this week attending linux.conf.au 2018 and made a short presentation at the Sysadmin miniconf regarding deploying Puppet in cloud environments.

The majority of this talk covers the Puppet autosigning process, which is a big potential security headache if misconfigured. If you’re deploying Puppet (or even some other config management system) into the cloud, I recommend checking this one out (~15 mins) and making sure your own setup doesn’t have any issues.

 


Firebase FCM upstream with Swift on iOS

I’ve been learning a bit of Swift lately in order to write an iOS app for my alarm system. I’m not very good at it yet, but figured I’d write some notes to help anyone else playing with the murky world of Firebase Cloud Messaging/FCM and iOS.

One of the key parts of the design is that I wanted the alarm app and the alarm server to communicate directly with each other without needing public-facing endpoints, rather than the conventional design where the app interacts via an HTTP API.

The intention of this design is that I can dump all the alarm software onto a small embedded computer and, as long as that computer has outbound internet access, it just works™️. There are no headaches about discovering the endpoint of the service, and security is much simpler since there’s no public-facing web server.

Given I need to deliver push notifications to the app, I implemented Google Firebase Cloud Messaging (FCM) – formerly GCM – for push delivery to both iOS and Android apps.

Whilst FCM is commonly used for pushing to devices, it also supports pushing messages back upstream to the server from the device. In order to do this, the server must be implemented as an XMPP server and the FCM SDK must be embedded into the app.

The server side was reasonably straightforward – I’ve written a small Java daemon that uses a reference XMPP client implementation and wraps some additional logic to work with HowAlarming.

The client side was a bit more tricky. Google has some docs covering how to implement upstream messaging in the iOS app, but I had a few issues to solve that weren’t clearly detailed there.

 

Handling failure of FCM upstream message delivery

Firstly, it’s important to have some logic in place to handle/report back if a message cannot be sent upstream – otherwise you have no way to tell whether it worked. To do this in Swift, I added a notification observer for .MessagingSendError, which is thrown by the FCM SDK if it’s unable to send upstream.

class AppDelegate: UIResponder, UIApplicationDelegate, MessagingDelegate {

  func application(_ application: UIApplication, didFinishLaunchingWithOptions launchOptions: [UIApplicationLaunchOptionsKey: Any]?) -> Bool {
    ...
    // Trigger if we fail to send a message upstream for any reason.
    NotificationCenter.default.addObserver(self, selector: #selector(onMessagingUpstreamFailure(_:)), name: .MessagingSendError, object: nil)
    ...
  }

  @objc
  func onMessagingUpstreamFailure(_ notification: Notification) {
    // FCM tends not to give us any kind of useful message here, but
    // at least we now know it failed for when we start debugging it.
    print("A failure occurred when attempting to send a message upstream via FCM")
  }
}

Unfortunately I have yet to see a useful error code back from FCM in response to any failure to send a message upstream – it seems to just return a 501 error for anything that has gone wrong, which isn’t overly helpful… especially since in web programming land, a 5xx series error implies it’s the remote server’s fault rather than the client’s.

 

Getting the GCM Sender ID

In order to send messages upstream, you need the GCM Sender ID. This is available in the GoogleService-Info.plist file that is included in the app build, but I couldn’t figure out a way to extract it easily from the FCM SDK. There is probably a better/nicer way of doing this, but the following hack works:

// Here we are extracting out the GCM SENDER ID from the Google
// plist file. There used to be an easy way to get this with GCM, but
// it's non-obvious with FCM so here's a hacky approach instead.
if let path = Bundle.main.path(forResource: "GoogleService-Info", ofType: "plist") {
  let dictRoot = NSDictionary(contentsOfFile: path)
  if let dict = dictRoot {
    if let gcmSenderId = dict["GCM_SENDER_ID"] as? String {
      self.gcmSenderID = gcmSenderId // make available on AppDelegate to whole app
    }
  }
}

And yes, although we’re all about FCM now, this part hasn’t been rebranded from the old GCM product, so enjoy having yet another acronym in your app.

 

Ensuring the FCM direct channel is established

Finally, the biggest cause of upstream message delivery failing was that I was often trying to send an upstream message before FCM had finished establishing the direct channel.

The SDK establishes this channel automatically whenever the app is brought into the foreground, provided that you have shouldEstablishDirectChannel set to true. This can take up to several seconds after application launch to actually complete – which means that if you try to send upstream too early, the connection isn’t ready and your send fails with an obscure 501 error.

The best solution I found was to use an observer listening for .MessagingConnectionStateChanged, which is triggered whenever the FCM direct channel connects or disconnects. By listening for this notification, you know when FCM is ready and capable of delivering upstream messages.

An additional bonus of this observer is that by the time it indicates the FCM direct channel is established, the FCM token for the device is also available for your app to use if needed.

So my approach is to:

  1. Set up FCM with shouldEstablishDirectChannel set to true (otherwise you won’t be going upstream at all!).
  2. Set up an observer on .MessagingConnectionStateChanged.
  3. When triggered, use Messaging.messaging().isDirectChannelEstablished to see if we have a connection ready for us to use.
  4. If so, pull the FCM token (device token) and the GCM Sender ID and retain them in AppDelegate for other parts of the app to use at any point.
  5. Dispatch the message upstream with whatever you want in messageData.

My implementation looks a bit like this:

class AppDelegate: UIResponder, UIApplicationDelegate, MessagingDelegate {

  func application(_ application: UIApplication, didFinishLaunchingWithOptions launchOptions: [UIApplicationLaunchOptionsKey: Any]?) -> Bool {
    ...
    // Configure FCM and other Firebase APIs with a single call.
    FirebaseApp.configure()

    // Setup FCM messaging
    Messaging.messaging().delegate = self
    Messaging.messaging().shouldEstablishDirectChannel = true

    // Trigger when FCM establishes its direct connection. We want to know this to avoid race conditions
    // where we try to post upstream messages before the direct connection is ready.
    NotificationCenter.default.addObserver(self, selector: #selector(onMessagingDirectChannelStateChanged(_:)), name: .MessagingConnectionStateChanged, object: nil)
    ...
  }

  @objc
  func onMessagingDirectChannelStateChanged(_ notification: Notification) {
    // This is our own function listening for the direct connection to be established.
    print("Is FCM Direct Channel Established: \(Messaging.messaging().isDirectChannelEstablished)")

    if (Messaging.messaging().isDirectChannelEstablished) {
      // Set the FCM token. Given that a direct channel has been established, it implies that
      // the token must be available to us.
      if self.registrationToken == nil {
        if let fcmToken = Messaging.messaging().fcmToken {
          self.registrationToken = fcmToken
          print("Firebase registration token: \(fcmToken)")
        }
      }

      // Here we are extracting the GCM SENDER ID from the Google plist file. There used to be an easy way
      // to get this with GCM, but it's non-obvious with FCM so we're just going to read the plist file.
      if let path = Bundle.main.path(forResource: "GoogleService-Info", ofType: "plist") {
        let dictRoot = NSDictionary(contentsOfFile: path)
        if let dict = dictRoot {
          if let gcmSenderId = dict["GCM_SENDER_ID"] as? String {
            self.gcmSenderID = gcmSenderId
          }
        }
      }

      // Send an upstream message
      let messageId = ProcessInfo().globallyUniqueString
      let messageData: [String: String] = [
        "registration_token": self.registrationToken!, // In my use case, I want to know which device sent us the message
        "marco": "polo"
      ]
      let messageTo: String = self.gcmSenderID! + "@gcm.googleapis.com"
      let ttl: Int64 = 0 // Seconds. 0 means "do immediately or throw away"

      print("Sending message to FCM server: \(messageTo)")

      Messaging.messaging().sendMessage(messageData, to: messageTo, withMessageID: messageId, timeToLive: ttl)
    }
  }

  ...
}

For a full FCM downstream and upstream implementation example, you can take a look at the HowAlarming iOS app source code on GitHub, and if you need a server reference, take a look at the HowAlarming GCM server in Java.

 

Learnings

It’s been an interesting exercise – I wouldn’t particularly recommend this architecture for anyone building real-world apps. The main headaches I ran into were:

  1. The FCM SDK just seems a bit buggy. I had a lot of trouble with the GCM SDK and the move to FCM did improve things a bit, but there are still a number of issues that occur from time to time. For example, occasionally an FCM direct channel isn’t established, for no clear reason, until the app is terminated and restarted.
  2. Needing to do things like making sure FCM Direct Channel is ready before sending upstream messages should probably be handled transparently by the SDK rather than by the app developer.
  3. I have still yet to get background code execution on notifications working properly. I get the push notification without a problem, but I seem to be unable to trigger my app to execute code even with content-available == 1. Maybe it’s a bug in my code, or FCM might be complicating the mix in some way versus using pure APNS. Probably my code.
  4. It’s tricky using FCM messages alone to populate the app data – I occasionally have issues such as messages arriving out of order, not arriving at all, or ending up duplicated. This requires the app code to process, sort and re-populate the table view controller, which isn’t a lot of fun. I suspect it would be a lot easier to simply re-populate the view controller on load from an HTTP endpoint and use FCM messages just to trigger refreshes of the data when the user taps on a notification.

So my view for other projects in future would be to use FCM purely for server->app message delivery (ie: “tell the user there’s a reason to open the app”) and then rely entirely on a classic app client and HTTP API model for all further interactions back to the server.


MongoDB document depth headache

We ran into a weird problem recently where we were unable to sync new members into a replica set running MongoDB 3.4.

The sync would begin, but at some point during the sync it would always fail with:

[replication-0] collection clone for 'database.collection' failed due to Overflow:
While cloning collection 'database.collection' there was an error
'While querying collection 'database.collection' there was an error 
'BSONObj exceeded maximum nested object depth: 200''

(For extra annoyance, the sync would continue syncing all the other databases and collections on the replica set, only realising at the very end that it had actually failed earlier, and then restarting the sync from the beginning again.)

 

The error means that one or more documents have a nesting depth of over 200. This could be a chain of objects, or a chain of arrays in a document – a mistake that isn’t too tricky to cause with a buggy loop or ORM.

But how could this document have got into the database in the first place? Surely it should have been refused at insert time? Well, the nested document depth limit and its enforcement have changed at various times in past versions, and a long-lived database such as ours (dating from the early MongoDB 2.x days) may have had these bad documents inserted before the max depth limit was enforced. Only now, when we try to use the document, do the limits become a problem.

In our case the document was old and had synced without issue on Mongo 3.0, but it now failed with Mongo 3.4.

Finding the document is tricky – the replication process helpfully does not log the document ID, so you can’t go and purge it from the collection to resolve the issue.

With input from colleagues with far better Mongo skills than I, we figured out three queries that allowed us to identify the bad documents.

1. This query finds any documents that have a long chain of nested objects inside them.

db.collection.find({ $where: function() { return tojsononeline(this).indexOf("} } } } } } } } }") != -1 } })

2. This query finds any documents that have a long chain of nested arrays. This was the specific issue in our case and this query successfully identified all the bad documents.

db.collection.find({ $where: function() { return tojsononeline(this).indexOf("] ] ] ] ] ] ]") != -1 } })

3. And if you get really stuck, you can find any bad document (whatever the reason) by reading each document and then re-writing it back out to another collection. This ensures the documents get all the limits applied at write time and will identify the IDs of any that get refused, regardless of the specific reason for the refusal.

db.collection.find({}).forEach(function(d) { print(d["_id"]); db.new_collection.insert(d) });

Note that all of these queries tend to be performance impacting since you’re asking your database to read every single document. And the last one, copying collections, could take considerable time and space to complete.

If you have any data of notable size, I recommend restoring the replica set to a test system and performing the operation there, where you know it’s not going to impact production.

Once you find your bad document, you can display it with:

db.collection.find({ _id: ObjectId("54492129902178d6f600004f") });

And delete it entirely (assuming nothing important in it!) with:

db.collection.deleteOne({ _id: ObjectId("54492129902178d6f600004f") });

MacOS High Sierra unable to free disk space

I recently ran out of disk space on my iMac. After migrating a considerable amount of undesirable data to either the file server or /dev/null, I found that despite my efforts, the amount of free disk space had not increased.

I was worried it was an issue with the new APFS file system introduced to all SSD-using Macs as of High Sierra, but in this case it turns out the issue is that Time Machine retains local snapshots on disk, in addition to the full backup history that is retained on the network time machine device.

Apple state that they automatically remove local snapshots when disk space is low, but their definition of low is apparently only 5GB of free space remaining – not really much free working space in 2017 when you might want scratch space of 22GB for 1 hour of 4k 30FPS footage.

On older MacOS releases it was possible to disable the local snapshot feature entirely; this doesn’t seem to be the case with High Sierra – but it does appear to be possible to force an immediate purge of local snapshots with the following command:

sudo tmutil thinLocalSnapshots / 10000000000 4

Back into the time vortex with you, filthy snapshots!

Note that this snapshot usage is not visible as a distinct item in the Disk Utility or Storage Management application.
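
You can, however, list the local snapshots directly from the terminal – a quick check before and after thinning (assuming High Sierra’s tmutil here):

# List any local Time Machine snapshots on the root volume.
tmutil listlocalsnapshots /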

In my case, all the snapshots appeared to be within the last 24 hours, so if I hadn’t urgently needed the disk space, I suspect the local snapshots would have flushed themselves after a 24 hour period restoring considerable disk space.

The fact this isn’t an opt-in, user-accessible feature is a shame. It adds the convenience of not having to get physical access to the backup drive or Time Capsule-like-thingy in order to restore data, but users of systems with SSD-only storage are likely to be a bit precious about how every GB is used, and there’s almost no transparency about how much space is being consumed. It’s especially annoying when you urgently need more space and are stuck wondering why nothing is freeing up room…


Access Route53 private zones cross account

Using Route53 private zones can be a great way to maintain a private internal zone for your server infrastructure. However, sometimes you may need to share this zone with another VPC in the same or another AWS account.

The first situation is easy – a Route53 zone can be associated with any number of VPCs within a single AWS account using the AWS console.

The second is trickier, but can be done by creating a VPC association authorization request in the account with the zone, then accepting it from the other account.

# Run against the account with the zone to be shared.
aws route53 \
create-vpc-association-authorization \
--hosted-zone-id abc123 \
--vpc VPCRegion=us-east-1,VPCId=vpc-xyz123 

# Run against the account that needs access to the private zone.
aws route53 \
associate-vpc-with-hosted-zone \
--hosted-zone-id abc123 \
--vpc VPCRegion=us-east-1,VPCId=vpc-xyz123 \
--comment "Example Internal DNS Zone"

# List authori(z|s)ations once done
aws route53 \
list-vpc-association-authorizations \
--hosted-zone-id abc123

This doesn’t even require VPC peering since it all works behind the scenes, with the associated zone now being resolvable using the default VPC DNS server in each VPC that has been associated.
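
A quick way to confirm the association is working is to query the VPC resolver from an instance inside the newly associated VPC – a rough example, with internal.example.com standing in for your private zone:

# Query the AmazonProvidedDNS resolver (reachable at 169.254.169.253 from
# inside any VPC) for a record in the shared private zone.
dig +short myhost.internal.example.com @169.254.169.253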

Note that the one catch is that this does not help you if you’re linking to a non-AWS environment, such as an on-prem data centre via IPSec VPN or Direct Connect. Even though you can route to the VPC and the systems inside it, the AWS DNS resolver for the VPC will refuse requests from IP space outside of the VPC itself.

So the only option is to have an EC2 instance acting as a DNS forwarder inside the VPC – it is reachable from the linked data centre, and since it’s inside the VPC, it can use the resolver.
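
If you do have to go down that path, a minimal sketch of such a forwarder could be dnsmasq on a small Debian/Ubuntu instance – the resolver and listen addresses below are placeholders for your VPC’s “.2” resolver address and the instance’s private IP:

# Install dnsmasq and forward all queries to the VPC resolver.
apt-get install -y dnsmasq

cat > /etc/dnsmasq.d/vpc-forwarder.conf << EOF
# Forward everything to the AmazonProvidedDNS resolver for this VPC
# (the base of the VPC CIDR plus two, eg 10.0.0.2 for a 10.0.0.0/16 VPC).
server=10.0.0.2
# Listen on the instance's private IP so the on-prem network can query it.
listen-address=10.0.0.53
EOF

systemctl restart dnsmasq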


FailberryPi – Diverse carrier links for your home data center

Given the number of internet-connected things I now rely on at home, I’ve been considering redundant internet links for a while. And thanks to the affordability of 3G/4G connectivity, it’s easier than ever to have a completely diverse carrier at extremely low cost.

I’m using 2degrees which has a data SIM sharing service that allows me to have up to 5 other devices sharing the one data plan, so it literally costs me nothing to have the additional connection available 24×7.

My requirements were to:

  1. Handle the loss of the wired internet connection.
  2. Ensure that I can always VPN into the house network.
  3. Ensure that the security cameras can always upload footage to AWS S3.
  4. Ensure that the IoT house alarm can always dispatch events and alerts.

I ended up building three distinct components to create a failover solution that supports flipping between my wired (VDSL) and wireless (3G) connections:

  1. A small embedded GNU/Linux system that can bridge a USB 3G modem and an ethernet connection, with smarts to recover from various faults (like a crashed 3G stick).
  2. A dynamic DNS solution, since my mobile telco certainly isn’t going to give me a static IP address, but I need inbound traffic.
  3. A DNS failover solution so I can redirect inbound requests (eg home VPN) to the currently active endpoint automatically when a failure has occurred.

 

The Hardware

I considered using a Mikrotik with USB for the 3G link – it is a supported feature, but I decided to avoid this route since I would need to replace my perfectly fine router with one that has a USB port, plus I know from experience that USB 3G modems are fickle beasts that would likely need some scripting to work around various issues.

For the same reason I excluded the various 3G/4G router products available that take a USB modem and then provide ethernet or WiFi. I’m very dubious about how fault-tolerant these products are (or how secure, if consumer routers are anything to go by).

I started off the project using a very old embedded GNU/Linux board and 3G USB modem I had in the spare parts box, but unfortunately whilst I did eventually recycle this hardware into a working setup, the old embedded hardware had a very poor USB controller and was throttling my 3G connection to around 512kbps. :-(

Initial approach – Not a bomb, actually an ancient Gumstix Verdex with 3G modem.

So I started again, this time using the very popular Raspberry Pi 2B hardware as the base for my setup. This is the first time I’ve played with a Raspberry Pi and I really enjoyed the experience.

The requirements for the router are extremely low – move packets between two interfaces, dial a modem and run some scripts. It feels wasteful using a whole Raspberry Pi with its 1GB of RAM and quad-core ARM CPU, but they’re so accessible and affordable that it’s not worth the time messing around with more obscure embedded boards.

Pie ingredients

It took me all of 5 minutes to assemble and boot an OS on this thing and have a full Debian install ready for work. For this speed and convenience I’ll happily pay a small price premium for the Raspberry Pi over some other random embedded vendor with a much more painful install and upgrade process.

Baked!

It’s important to get a good power supply – 3G/4G modems tend to consume the full 500mA available to them. I kept getting under-voltage warnings (the red light on the Pi turns off) with the 2.1 Amp phone charger I was using. I ended up buying the official 2.5 Amp Raspberry Pi charger, which powers the Raspberry Pi 2 + the 3G modem perfectly.

I bought the smallest (& cheapest) class 10 Micro SDHC card possible – 16GB. Of course this is way more than you actually need for a router; 4GB would have been plenty.

The ZTE MF180 USB 3G modem I used is a tricky beast on Linux, thanks to the kernel initially seeing it as a SCSI CDROM drive, which masks the USB modem features. Whilst Linux has usb_modeswitch shipping as standard these days, I decided to completely disable the SCSI CDROM feature as per this blog post to avoid the issue entirely.

 

The Software

The Raspberry Pi I was given (thanks Calcinite!) had a faulty GPU so the HDMI didn’t work. Fortunately the Raspberry Pi doesn’t let a small issue like having no display hold it back – it’s trivial to flash an image to the SD card from another machine and boot a headless installation.

  1. Download Raspbian minimal/lite (Debian + Raspberry Pi goodness).
  2. Install the image to the SD card using the very awesome Etcher.io (think “safe dd” for noobs) as per the install instructions, using my iMac.
  3. Enable SSH as per instructions: “SSH can be enabled by placing a file named ssh, without any extension, onto the boot partition of the SD card. When the Pi boots, it looks for the ssh file. If it is found, SSH is enabled, and the file is deleted. The content of the file does not matter: it could contain text, or nothing at all.”
  4. Login with username “pi” and password “raspberry”.
  5. Change the password immediately before you put it online!
  6. Upgrade the Pi and enable automated updates in future with:
    apt-get update && apt-get -y upgrade
    apt-get install -y unattended-upgrades

The rest is somewhat specific to your setup, but my process was roughly:

  1. Install apps needed – wvdial for establishing the 3G connection via AT commands + PPP, iptables-persistent for firewalling, libusb-dev for building hub-ctrl and jq for parsing JSON responses.
    apt-get install -y wvdial iptables-persistent libusb-dev jq
  2. Configure a firewall. This is very specific to your network, but you’ll want both IPv4 and IPv6 rules in /etc/iptables/rules.*. Generally you’d want something like the following (there’s a rough iptables sketch after this list):
    1. Masquerade (NAT) traffic going out of the ppp+ and eth0 interfaces.
    2. Permit forwarding traffic between the interfaces.
    3. Permit traffic in on port 9000 for the health check server.
  3. Enable IP forwarding (net.ipv4.ip_forward=1) in /etc/sysctl.conf.
  4. Build hub-ctrl. This utility allows the power cycling of the USB controller + attached devices in the Raspberry Pi, which is extremely useful if your 3G modem has terrible firmware (like mine) and sometimes crashes hard.
    wget https://raw.githubusercontent.com/codazoda/hub-ctrl.c/master/hub-ctrl.c
    gcc -o hub-ctrl hub-ctrl.c -lusb
  5. Build pinghttpserver. This is a tiny C-based webserver which we can use to check if the Raspberry Pi is up (we can’t use ICMP, as detailed further on).
    wget -O pinghttpserver.c https://gist.githubusercontent.com/jethrocarr/c56cecbf111af8c29791f89a2c30b978/raw/9c53f66fbed609d09652b8c4ceff0194876c05a3/gistfile1.txt
    make pinghttpserver
  6. Configure /etc/wvdial.conf. This will vary by the type of 3G/4G modem and also the ISP in use. One key value is the APN that you use. In my case, I had to set it to “direct” to ensure I got a real public IP address with no firewalling, instead of getting a CGNAT IP, or a public IP with inbound firewalling enabled. This will vary by carrier!
    [Dialer Defaults]
    Init1 = ATZ
    Init2 = ATQ0 V1 E1 S0=0 &C1 &D2 +FCLASS=0
    Init3 = AT+CGDCONT=1,"IP","direct"
    Stupid Mode = 1
    Modem Type = Analog Modem
    Phone = *99#
    Modem = /dev/ttyUSB2
    Username = { }
    Password = { }
    New PPPD = yes
  7. Edit /etc/ppp/peers/wvdial to enable “defaultroute” and “replacedefaultroute” – we want the wireless connection to always be the default gateway when connected!
  8. Create a launcher script and (once tested) call it from /etc/rc.local at boot. This will start up the 3G connection at boot and launch the various processes we need. (This could be nicer as a collection of systemd services, but damnit, I was lazy, OK?) It also handles reboots and power-cycling the USB bus if problems are encountered, as an attempt at automated recovery.
    wget -O 3g_failover_launcher.sh https://gist.githubusercontent.com/jethrocarr/a5dae9fe8523cf74d30a065d77d74876/raw/57b5860a9b3f6a048b02b245f3628ee60ea766dc/3g_failover_launcher.sh
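
To make step 2 a bit more concrete, here’s a rough sketch of the IPv4 rules described above – the interface names, the ppp+ wildcard and the port 9000 health check server are assumptions from my setup, so adjust to suit your own network:

# Masquerade (NAT) traffic heading out of either the 3G (ppp+) or wired (eth0) interface.
iptables -t nat -A POSTROUTING -o ppp+ -j MASQUERADE
iptables -t nat -A POSTROUTING -o eth0 -j MASQUERADE

# Permit forwarding of traffic between the LAN and the 3G connection.
iptables -A FORWARD -i eth0 -o ppp+ -j ACCEPT
iptables -A FORWARD -i ppp+ -o eth0 -m state --state ESTABLISHED,RELATED -j ACCEPT

# Permit inbound connections to the pinghttpserver health check on port 9000.
iptables -A INPUT -p tcp --dport 9000 -j ACCEPT

# Save the rules so iptables-persistent restores them at boot.
iptables-save > /etc/iptables/rules.v4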

At this point, you should be left with a Raspberry Pi that gets a DHCP lease on its eth0, dials up a connection with your wireless telco and routes all traffic it receives on eth0 to the ppp interface.

In my case, I setup my Mikrotik router to have a default GW route to the Raspberry Pi and the ability to failover based on distance weightings. If the wired connection drops, the Mikrotik will shovel packets at the Raspberry Pi, which will happily NAT them to the internet.

 

The DNS Failover

The work above got me an outbound failover solution, but it’s no good for inbound traffic without a failover DNS record that flips between the wired and wireless connections for the VPN to target.

Because the wireless link would be getting a dynamic IP address, the first requirement was a dynamic DNS service. There are various companies around offering free or commercial products for this, but I chose to use a solution built around AWS Lambda that can be granted access directly to my DNS hosted inside Route53.

AWS have a nice reference dynamic DNS solution available here that I ended up using (sadly it doesn’t use the Serverless framework, so there’s a bit more point+click setup than I’d like, but hey).

Once configured and a small client script installed on the Raspberry Pi, I had reliable dynamic DNS running.

The last bit we need is DNS failover. The solution I used was the native AWS Route53 Health Check feature, where AWS adjust a DNS record based on the health of monitored endpoints.

I set up a CNAME with the wired connection as the “primary” and the wireless connection as the “secondary”. The DNS CNAME will always point to the primary/wired connection, unless its health check fails, in which case the CNAME will point to the secondary/wireless connection. If both fail, it fails safe to the primary.

A small webserver (pinghttpserver) that we built earlier is used to measure connectivity – the Route53 Health Check feature unfortunately lacks support for ICMP connectivity tests hence the need to write a tiny server for checking accessibility.
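
Creating the health checks themselves can be done in the console or via the CLI. A rough sketch of the CLI form for one of them – the IP address here is just a placeholder, and for the wireless side you’d likely use FullyQualifiedDomainName pointed at the dynamic DNS record instead, given its changing IP:

# Create an HTTP health check against the pinghttpserver exposed via the wired connection.
aws route53 create-health-check \
--caller-reference failberry-wired-check \
--health-check-config IPAddress=203.0.113.10,Port=9000,Type=HTTP,ResourcePath=/,RequestInterval=30,FailureThreshold=3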

This webserver runs on the Raspberry Pi, but I do a dst port NAT to it on both the wired and wireless connections. If the Pi should crash, the connection will always fail safe to the primary/wired connection since both health checks will fail at once.
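
For completeness, this is roughly what the failover CNAME pair looks like if created with the CLI rather than the console – the zone ID, record names and health check IDs are all placeholders for your own values:

aws route53 change-resource-record-sets \
--hosted-zone-id abc123 \
--change-batch '{
  "Changes": [
    { "Action": "UPSERT", "ResourceRecordSet": {
        "Name": "vpn.example.com", "Type": "CNAME", "TTL": 60,
        "SetIdentifier": "wired", "Failover": "PRIMARY",
        "HealthCheckId": "wired-health-check-id",
        "ResourceRecords": [ { "Value": "wired.example.com" } ] } },
    { "Action": "UPSERT", "ResourceRecordSet": {
        "Name": "vpn.example.com", "Type": "CNAME", "TTL": 60,
        "SetIdentifier": "wireless", "Failover": "SECONDARY",
        "HealthCheckId": "wireless-health-check-id",
        "ResourceRecords": [ { "Value": "wireless.example.com" } ] } }
  ]
}'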

There is a degree of flexibility to the Route53 health checks. You can use a CloudWatch alarm instead of the HTTP check if desired. In my case, I’m using a Lambda I wrote called “lambda-ping” (creative, I know), which does HTTP “pings” to remote endpoints and records the response code plus latency. (Annoyingly it’s not possible to do ICMP pings with Lambda either, since the containers that Lambda executes inside of lack the CAP_NET_RAW kernel capability, hence the “ping-like” behaviour.)

lambda-ping in action

I use this since it gives me information for more than just my failover internet links (eg my blog, sites, etc) and acts as my Pingdom / New Relic Synthetics alternative.

 

Final Result

After setting it all up and testing, I’ve installed the Raspberry Pi into the comms cabinet. I was a bit worried that all the metal casing would create a Faraday cage, but it seems to be working OK (I also placed it so that the 3G modem sticks out of the cabinet surrounds).

So far so good, but if I get spotty performance or other issues I might need to consider locating the FailberryPi elsewhere where it can get clear access to the cell towers without disruption (maybe sealed ABS box on the roof?). For my use case, it doesn’t need to be ultra fast (otherwise I’d spend some $ and upgrade to 4G), but it does need to be somewhat consistent and reliable.

Installed on a shelf in the comms cabinet, along side the main Mikrotik router and the VDSL modem

So far it’s working well – the outbound failover could do with some tweaking to better handle partial failures (eg VDSL link up, but no international transit), but the failover for the inbound works extremely well.

Few remaining considerations/recommendations for anyone considering a setup like this:

  1. If using the one telco for both the wireless and the wired connection, you’re still at risk of a common fault taking out both services since most ISPs will share infrastructure at some level – eg the international gateway. Use a completely different provider for each service.
  2. Using two wired ISPs (eg fibre with VDSL failover) is probably a bit pointless – they’re probably both going back to the same exchange, or running along the same conduit waiting for a single backhoe to take them both out at once.
  3. It’s kind of pointless if you don’t put this behind a UPS, otherwise you’ll still be offline when the power goes out. Strongly recommend having your entire comms cabinet on UPS so your wifi, routing and failover all continue to work during outages.
  4. If you fail over, be careful about data usage. Your computers won’t know they’re on an expensive mobile connection with limited data and they’ll happily download updates, Steam games, backups, etc… One approach is using a firewall to whitelist select systems only for failover (eg IoT devices, alarm, cameras) and leaving other devices like laptops blocked, to prevent too much bill shock.
  5. Partial ISP outages are still a PITA. Eg, if routing is broken to some NZ ISPs, but international is fine, the failover checks from ap-southeast-2 won’t trigger. Additional ping scripts could help here (eg check various ISP gateways from the Pi), but that’s getting rather complex and tries to solve a problem that’s never completely fixable.
  6. Just buy a Raspberry Pi. Don’t waste time/effort trying to hack some ancient crap together – it wastes far too much time and often falls flat. And don’t use an old laptop/desktop; there’s too much to fail on them, like fans, HDDs, etc. The Pi is solid embedded electronics.
  7. Remember that your Pi is essentially a server attached to the public internet. Make sure you configure firewalls and automatic patching and any other hardening you deem appropriate for such a system. Lock down SSH to keys only, IP restrict, etc.

Easy APT repo in S3

When running a number of Ubuntu or Debian servers, it can be extremely useful to have a custom APT repo for uploading your own packages, or third party packages that lack their own good repositories to subscribe to.

I recently found a nice Ruby utility called deb-s3 which allows easy uploading of dpkg files into an S3-hosted APT repository. It’s much easier than messing around with tools like reprepro and having to s3 cp or sync files up from a local disk into S3.

One main warning: This will create a *public* repo by default since it works out-of-the-box with the stock OS and (in my case) all the packages I’m serving are public open source programs that don’t need to be secured. If you want a *private* repo, you will need to use apt-transport-s3 to support authenticating with S3 to download files and configure deb-s3 for private upload.

Install like any other Ruby Gem:

gem install deb-s3

Adding packages is easy. First make sure your aws-cli is working OK and an S3 bucket has been created, then upload with:

deb-s3 upload \
--bucket example \
--codename codename \
--preserve-versions \
mypackage.deb

You can then add the repo to a Ubuntu or Debian server with:

# We trust HTTPS rather than GPG for this repo - but you can config
# GPG signing if you prefer.
cat > /etc/apt/sources.list.d/myrepo.list << EOF
deb [trusted=yes] https://example.s3.amazonaws.com codename main
EOF

# and ensure you update the package info on the server
apt-get update

Alternatively, here’s an example of how to add the repo with Puppet:

apt::source { 'myrepo':
 comment        => 'This is our own APT repo',
 location       => 'https://example.s3.amazonaws.com',
 release        => $::os["distro"]["codename"],
 repos          => 'main',
 allow_unsigned => true, # We don't GPG sign, HTTPS only
 notify_update  => true, # triggers apt-get update
}

StuffMe? Or Just Stuffed?

The big news today is that the NZ Commerce Commission has declined the NZME/Fairfax merger. There’s plenty of coverage from the various news companies around NZ, but does it really matter? Either way NZME and Fairfax are stuffed – merger or no merger – unless they can actually create a viable business. The merger would only have changed how long the decline would have lasted.

The fundamental issue is that both Fairfax and NZME were built on the cash cow that (was) classified advertising and print advertising. Classifieds were lost long ago to the likes of TradeMe (although you can argue that Fairfax let that one slip away) and advertising (in both print and digital form) has been on a steady decline as readers move to a range of other media (social networking, online advertising, TV advertising, etc).

If Fairfax and NZME want to survive they can’t fix their business model by simply cutting headcount to reduce costs, or trying to diversify into unrelated ventures like fibre internet or a daily deals website. They need a fundamental redesign of their business and I doubt that either company is going to be prepared or brave enough to build a senior leadership team that can make this happen.

So what should a media company in 2017 struggling to survive do? Adopt a playbook from the technology and startup world. Cut all the noise out and focus on what the core product should be. Simplify. Dump legacy. And stop operating a structure that suits massive enterprises.

So what does this mean? What should these companies do?

  1. Firstly, recognise you’ll never be the massive financial behemoths that media companies were back in the golden era. The money just isn’t there, with readers’ and advertisers’ attention now split across so many media and consumables. Instead of trying to regain the glory days, focus on being a leaner company with smaller revenues, but making good profits and keeping the important role of a free press going.
  2. Print is dead. We all know it. It’s just a question of time until the revenue still made from print advertising and subscriptions can no longer cover its production costs. So treat it as a legacy product. Stop investing in it. Do the absolute bare minimum to keep it ticking over until the end. And when that end comes, be ruthless. Kill it.
  3. Move your best people away from any legacy projects – you’re incurring massive lost opportunity costs by preventing them from working on more long term investments.
  4. Avoid the side ventures. You’re not an investment company. The only skill/resource a media company can bring to other industries is free advertising. And then you’re devaluing your advertising product offering by flooding it with your own ads.
  5. Strip the company overhead. You can no longer be a big corporate with layers of management. You need to be a small lean business. Remove layers of management that don’t directly create more value.
  6. Drop Outbrain. It isn’t worth the money, it’s a drain on your reading experience and website quality and ruins any quality aspirations that you have. Promote your own stories instead, expose hidden evergreen content and give it longer life thus get more value out of producing that content in the first place.
  7. Does the massive cost of serving your video content pay for itself? Enterprise transcoding tech and data transfer is ridiculously expensive, especially for small NZ players who are buying data by the hundreds of terabytes rather than hundreds of petabytes. It may be that you’re paying more to have control of your own videos than you actually make from all the revenue around them. Think like a startup – upload all your video into Youtube and take advantage of their ad revenue sharing system. And yes it may only be 50% or so revenue share, but it’s 50% + getting all your data transit for free + getting high def 4k capable video serving infrastructure. And your journalists know how to use Youtube and the built-in editing tools. Hell, they can upload to it directly from a phone in the field.
  8. Stop writing your own in-house CMS solutions or buying awful not-fit-for-purpose CMS solutions sold by companies with no understanding of the media and news website business and its technical requirements. Build something light on top of open source bones (never underestimate WordPress with a theme and some plugins) or buy a solution that’s specifically designed to meet media requirements, like Arc (which is what NZME is doing).
  9. As an extension of the above – you can’t afford to build everything you need. You’re a tiny NZ local news provider, you need to focus on your core business and find ready-to-use off-the-shelf (or off-the-github) solutions.
  10. If you keep finding that you “just have to build our own tech since nothing else does what we want”, ask yourself the question of whether it’s your business workflow at fault – it’s generally cheaper to change processes than to write all your own technology.

And the big one – kill your mobile website. 99% of the smart phones being used are running either Android or iOS. Sorry to the people out there running Windows Mobile or FirefoxOS, you simply don’t have the statistics to justify any kind of investment into your needs. Building and maintaining a mobile website is a hugely wasteful investment to cater to 1% of users.

Instead pour all the mobile budget into developing beautiful apps for Android and iOS that customers actually want and enjoy using – they shouldn’t feel sad that the mobile site has gone, they should be elated that the app experience is so good.

By pushing mobile traffic exclusively to apps, suddenly some interesting capabilities reveal themselves:

  1. You can deliver tailored push messages with breaking news and updates that are actually relevant to the user’s interest- and measure this using an off-the-shelf push message analytics platform (like my current employer offers).
  2. Paywall introduction suddenly becomes trivial. Android and iOS both support in-app purchases. What could be easier than paying $4.99 for one month of full access to all the premium stories and an ad-free experience? You don’t even need to invest in payment and paywall infrastructure, it’s built right into the goddamn operating system. Unhappy about Apple and Google taking 15-25%? Doing nothing means taking a 100% cut. And it costs a bloody fortune building reliable and secure payment and subscription infrastructure yourself – don’t think you can do it cheaper unless you really know what you’re doing. And the biggest issue with your own platform is getting users motivated to actually get that credit card out of their wallet. With in-app purchases it’s trivial, since Apple/Google already have their card details – they just need a thumb print to authorise it.
  3. Some people will always be unwilling to part with any amount of cash for subscriptions. That’s fine. Offer up the main headlines and the low cost wire and/or soft-content stories (some might call this “click-bait”) for free and use advertising to drive revenue – just don’t expect it to ever equal print.
  4. And that advertising – suddenly it’s controlled in-app and no longer subject to a browser plugin blocking it. And since you dumped unrelated side ventures, you’re now reducing the volume of advertising so the ads that you DO run are more pronounced, with better engagement. And if you can drive higher engagement, you can get a higher price. Offer an advertising product focused on premium quality, not quantity. And using app location targeting you can do very, very precise local advertising campaigns.
  5. Advertising is an interesting one actually, since so often sales think “banner ads” – but it doesn’t have (and maybe shouldn’t) be just traditional impression or click-through advertising. It’s now trivial to setup an online store with a service like Shopify and just drop their SDK into your own app to sell real world items directly from the phone. Don’t just advertise tickets to that show, SELL the tickets to that show, directly from your app. And offer quality sponsored content – advertorials are awful and should die, but good quality sponsored content relevant to the readers interests has quite successful engagement rates – TheSpinoff is an NZ example doing this quite well. Stuff does it pretty well too when it’s not just promoting their fibre product (stuff bran?) or Neighbourly endlessly.

And don’t think that this mobile-app native strategy will necessarily alienate older print-loving subscribers. Travel through any international airport and every other elderly traveller has an iPad in their hand. They’re the ultimate old-person computer: simple, easy to use and featuring built-in text zooming for readers who struggle to keep up with the font size of the print edition. A quality app experience is actually better than a newspaper ever was. And iPads are everywhere in the older generations now – they *understand* them in a way that they never did with traditional computers or mobile phones.

I do think a good outcome is achievable in the media space, but only if media companies like Fairfax and NZME can apply their funding and learn to actually innovate. There is a space for quality media and content, especially with decent nation-wide coverage – you don’t get that with the international news sources or the smaller players.

During my time working at Fairfax I met many hard-working and passionate people in the company who strongly care about the role of media in society. The editorial team I saw works hard, cares about what they do and can produce some great content. If they can couple that with a proper business plan and the right technology choices, and be prepared to make some hard decisions, there is some hope for them. I just fear that it won’t happen.

 

I worked for Fairfax in the technology team for 4 years in both AU and NZ. I now work for a push messaging and analytics company. This post is my personal opinion and does not necessarily represent the views of my former and/or current employer. It could represent the views of a future employer, but only if you’re a media company that actually wants me to come and apply technology-driven innovation to your business.


Detectatron

I recently installed security cameras around my house which are doing an awesome job of recording all the events that take place around my house and grounds (generally of the feline variety).

Unfortunately the motion capture tends to be overly trigger happy and I end up with heaps of recordings of trees waving, clouds moving or insects flying past. It’s not a problem from a security perspective as I’m not missing any events, but it makes it harder to check the feed for noteworthy events during the day.

I decided I’d like to write some logic for processing the videos being generated, so I wrote a proof of concept that sucks video out of the Ubiquiti Unifi Video server and then analyses it with Amazon Web Services’ new AI product “Rekognition” to identify interesting videos worthy of note.

What this means, is that I can now filter out all the noise from my motion recordings by doing image recognition and flagging the specific videos that feature events I consider interesting, such as footage featuring people or cats doing crazy things.

I’ve got a 20 minute talk about this system which you can watch below, introducing its capabilities and how I’m using the AWS Rekognition service to solve this problem. The talk was for the Wellington AWS Users Group, so it focuses a bit more on the AWS aspects of Rekognition and AWS architecture than on the Unifi Video integration side of things.

The software I wrote has two parts – “Detectatron” which is the backend Java service for processing each video and storing it in S3 after processing and the connector I wrote for integration with the Unifi Video service. These can be found at:

https://github.com/jethrocarr/detectatron
https://github.com/jethrocarr/detectatron-connector-unifi

The code quality is rather poor right now – insufficient unit tests, bad structure and in need of a good refactor, but I wanted to get it up sooner rather than later… since perfection is always the enemy of just shipping something.

Note that whilst I’ve only added support for the product I use (Ubiquiti’s Unifi Video), I’ve designed it so that it’s pretty trivial to build other connectors for other platforms. I’d love to see contributions like connectors for Zone Minder and other popular open source or commercial platforms.

If you’re using Unifi Video, my connector will automatically mark any videos it deems as interesting as locked videos, for easy filtering using the native Unifi Video apps and web interface.

It also includes an S3 upload feature – given that I integrated with the Unifi Video software, it was a trivial step to extend it to also upload every video the system records into S3 within a few seconds for off-site retention. This performs really well, my on-prem NVR really struggled to keep up with uploads when using inotify + awscli to upload footage, but using my connector and Detectatron it has no issues keeping up with even high video rates.
