Behind the Curtain: Clustering Across The World

In my last post, I discussed the obstacles introduced by restructuring our services into a more modular and scalable architecture.

One such obstacle is that even a small amount of latency between the backend and interfaces can dramatically increase the time it takes to fulfill a complex request.
This is a very real possibility given that we plan to distribute our services across multiple datacenters around the world in order to provide constant uptime and availability, even in the event of a local natural disaster.

To mitigate this, we provision our shards to group geographically similar objects together. This means that, for the most part, a single agency, as well as agencies that are close together, will likely live on the same shard.

By incorporating these shard identifiers into the new 64-bit identifiers themselves, we can direct requests to a particular shard cluster simply by using a DNS prefix derived from the shard identifier.
This allows us to integrate our services vertically within each cluster, from the frontend APIs all the way down to the database backing each shard.
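As a sketch of how that routing might look (the 16-bit layout and the hostname here are assumptions for illustration, not our final design):

```java
// Illustrative only: assume the top 16 bits of the new 64-bit object
// identifier hold the shard ID; the real bit layout may differ.
public final class ShardRouter {

    private static final int SHARD_BITS = 16;

    public static int shardOf(long objectId) {
        return (int) (objectId >>> (Long.SIZE - SHARD_BITS));
    }

    // Each shard gets its own DNS prefix, so a request can be pointed
    // at the right cluster directly (hostname is a made-up example).
    public static String hostFor(long objectId) {
        return "shard" + shardOf(objectId) + ".api.example.com";
    }
}
```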

Some objects may end up sharded to a location geographically distant from their original agency, for example when someone moves across the country and switches to an agency in their new city. Thus, there will be some connections between distant clusters, but we expect these to be few and far between.

In essence, each cluster will be able to serve a distinct geographic area with minimal connections to other shards, allowing consistently low latency between the layers of our backend infrastructure.

We will be migrating data to the new system slowly over the course of the next year, which should have minimal impact on our current system. This concludes our preliminary look at our new sharded data infrastructure. Please join me next time as I present an overview of where Active911 is today and what we hope to achieve through the end of 2015 and into 2016.

Behind the Curtain: Sharding Over Time

Today I’ll explore my first sharding topic in a bit more depth.

The first obstacle to tackle is the increase in the number of requests needed to fulfill some of our API calls.
For example, whenever a device calls home for an update, it asks for all of the agencies and devices which are related to it, as well as all of the alerts that the device hasn’t read yet.
In our current single-database structure, this is basically 3 database calls, albeit fairly heavy ones.
In a sharded system, however, a call must be made for each individual related object, since each could live on an entirely different server located hundreds of miles from the original device.

Now, each of these heavy, complex queries may take 140ms to establish a connection, 200-400ms to run, and 100ms to transmit the response.
Making 3 such calls equates to about 2 seconds.
Comparatively, in a sharded system, a database call would need to be made, in the worst case, once for each related device and agency.
The average device belongs to about 3 agencies, and each agency contains on average 25 devices.
That means a single call home would require at least 80 database calls: 3 agencies plus roughly 75 devices, plus the unread alerts.
In a naive implementation, that would be 140ms for each database connection, 5ms to run each query, and 5ms to transmit each response.
At 150ms per call, those 80 calls add up to about 12 seconds for a single call home, a drastic increase.

Of course, there are ways we can reduce this. For example, since most people belong to geographically similar agencies, the likelihood of all their records being on the same server cluster is high. This lets us use techniques like connection pooling and multi-row SELECT queries whenever we can anticipate records being on the same shard server.
Connection pooling alone would likely decrease the above example from around 12 seconds to about 2.4 seconds, already bringing it more in line with our current performance.
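To make that arithmetic concrete, here is a toy model of the estimates above, a sketch rather than a benchmark. Its pooled figure comes out lower than my rough 2.4-second estimate because it ignores pool contention and the extra round trips a real client would pay:

```java
// Back-of-the-envelope estimates for a single "call home".
public class ShardLatencyEstimate {

    static final int CONNECT_MS = 140;  // establish a DB connection
    static final int QUERY_MS = 5;      // run one small sharded query
    static final int RESPONSE_MS = 5;   // transmit one small response

    public static void main(String[] args) {
        int calls = 80; // ~3 agencies + ~75 devices + unread alerts

        // Naive: every call pays for its own connection.
        int naiveMs = calls * (CONNECT_MS + QUERY_MS + RESPONSE_MS);

        // Pooled: connections are reused, so the connect cost is paid
        // once per shard cluster touched (assume 2 for a typical user).
        int clustersTouched = 2;
        int pooledMs = clustersTouched * CONNECT_MS
                + calls * (QUERY_MS + RESPONSE_MS);

        System.out.printf("naive: %dms, pooled: %dms%n", naiveMs, pooledMs);
        // Prints: naive: 12000ms, pooled: 1080ms
    }
}
```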

The one thing that makes all of this even more complicated is that the more we shard, the more sensitive to latency the whole system becomes. A 25ms delay between the interface and the database in our current system equates to about 75ms-longer API calls (25ms on each of the 3 queries), a measurable but mostly unnoticeable difference.
In the world of shards, where every one of those 80 calls pays the delay on each of its round trips, the same 25ms would translate to a whopping 6-second increase, almost quadrupling the average use case.
That leads us to the next obstacle: latency caused by widespread geographic distribution of our services, which we will tackle in my next blog post about clustering of services.

Behind the Curtain: Scalability

As Paul noted in his post, we’ve been working on building up our backend to be more scalable, especially our databases. As it currently stands, we have a single database which is loaded into memory on startup. This allows all of our queries to be super fast, even when traversing hundreds of thousands of rows.

There are a few downsides to our current deployment. First, since we’re preloading all the data into memory, the spool-up time for this database is huge. Since the cost is so high, taking it down for maintenance is something we do very infrequently. While this is not a bad thing in and of itself, it does mean that we cannot make large changes to our database structure more than a couple of times a year.

Second, since all the data is on a single server, the only way to scale up is to throw more resources at that server. This is only feasible as long as our growth stays at or below the pace at which hardware performance increases. As we grow faster and faster, this is becoming less and less realistic, pushing us toward a more distributed datastore.

Third, serving all of our data from a single datacenter means that connectivity issues in that area would adversely affect service nationwide.

Sharding solves all 3 of these issues.

First, since each shard contains just a fraction of the data, it takes much less time to warm up each server. It also allows us to affect only a small set of our users if we need to take a server down for maintenance.

Second, sharding allows us to gradually bring up more servers as needed to serve the growing load. This is especially useful in allowing high volume areas to have their own dedicated cluster of shards.

Third, by splitting the data up across multiple shards, we can begin widely distributing our servers. We have long wanted to host servers away from the areas they service; for example, New York would be serviced mainly by servers hosted outside of the Northeast. Not only would this resolve the single point of failure that Paul outlined, it would also decrease the likelihood that our service would go down during a natural disaster, since the servers would be outside of the affected area.

Of course, such a huge change will take time, and is not without its risks and potential downsides. We are making this change gradually, starting with the new PARS board feature that is coming soon. Check back often as we explore the wonder and mysteries of sharding.

Behind the Curtain: A Re-Introduction

It’s been a year since I last posted here, and quite a lot has changed.

As we continue to grow, we’ve been reviewing our external communication channels, and started exploring new ways to engage with our customers. There aren’t many direct means of contacting our developers, but we do read the forums weekly.

One thing we introduced late last year is developer Q&A sessions. We have had 2 so far, in November and February, and our next one is coming up in May. We are planning for the Q&A sessions to coincide with our app updates, which we are aiming to release at the beginning of each quarter.

Aside from the Q&A, we’re going to begin posting regularly on this blog each week, rotating between developers. We’ve added a few new faces to the development team, who you will all meet soon through this blog as well. I will let them introduce themselves in the coming weeks.

Of course, this really wouldn’t be a look behind the curtain if I didn’t give you guys a sneak peek of what’s coming up. The main focus for the past couple of weeks has been a new feature which will allow creation of “Assignments”.

These could be anything from ordinary station assignments to specific crew assignments or even roster status assignments. The main idea is to give your agency the ability to know who is supposed to be assigned to what at any given time. A simple example would be setting up assignments to represent your multiple fire houses, and scheduling which house everyone should be at each shift.

We’re really excited to roll out this feature in the upcoming month, and we will be putting out a testable sandbox later this week. Look for a link to it in the upcoming newsletter!

Behind the Curtain: Inter-App Communication

I’ve been working on another Android app update the past couple weeks. We’ve added a couple of features, including enhanced logging and a couple of night-mode themes.

One frequently requested feature is a way to launch Scanner Radio from our app.
I got in contact with the developer of Scanner Radio and he gave me a snippet of code that allows Active911 to launch the Scanner Radio selector.
The main problem I’ve been working on overcoming today is how to store the Intent that is given back in the ActivityResult of the selector.
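I won’t reproduce his snippet here, but the general shape is the standard startActivityForResult pattern. The action string below is a placeholder, not the real one from Scanner Radio:

```java
import android.app.Activity;
import android.content.Intent;

public class ScannerLaunchActivity extends Activity {

    private static final int PICK_SCANNER_REQUEST = 1;
    // Placeholder action; the real one comes from the snippet the
    // Scanner Radio developer provided.
    private static final String ACTION_PICK_SCANNER =
            "com.scannerradio.intent.action.PICK_SCANNER";

    private void launchScannerSelector() {
        startActivityForResult(new Intent(ACTION_PICK_SCANNER),
                PICK_SCANNER_REQUEST);
    }

    @Override
    protected void onActivityResult(int requestCode, int resultCode, Intent data) {
        super.onActivityResult(requestCode, resultCode, data);
        if (requestCode == PICK_SCANNER_REQUEST && resultCode == RESULT_OK) {
            // "data" is the Intent we need to persist so we can fire it
            // later to launch the selected scanner directly.
            persistScannerIntent(data);
        }
    }

    private void persistScannerIntent(Intent intent) {
        // Persistence is the hard part; the rest of this post is about it.
    }
}
```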

There’s not really a good way to serialize an Intent in Android; it seems they are intended to be passed and immediately used.
The design pattern for most Inter-Process Communication is defining an Intent schema that can be used by external applications.
Another pattern used is exposing an external Service, which I haven’t looked into much.

Regardless, right now I have an Intent that, when used immediately, works as intended. However, I can only use that Intent as long as the app is open, so I need to find a way to persist it while maintaining all of its required information.

At first I tried just using the Intent.toUri() method. That gets me halfway there, but the URI still doesn’t contain any of the extras in the Intent.
The next thing I tried was manually serializing the data into tab-delimited name-value pairs. However, this doesn’t work because I don’t necessarily know whether the extras are all Strings, nor how many layers of nesting there are.

My final strategy is to take the Bundle, serialize it into a byte array, store that byte array as a string, and then reverse the process to get it back out.
This StackOverflow answer was the most helpful in figuring out how to do so.
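Here’s a minimal sketch of that round trip. I’m using Base64 for the string step rather than raw UTF-8, since arbitrary parcel bytes aren’t valid UTF-8:

```java
import android.os.Bundle;
import android.os.Parcel;
import android.util.Base64;

public final class BundleCodec {

    // Bundle -> byte[] -> printable string we can persist.
    public static String encode(Bundle bundle) {
        Parcel parcel = Parcel.obtain();
        try {
            parcel.writeBundle(bundle);
            byte[] bytes = parcel.marshall();
            return Base64.encodeToString(bytes, Base64.DEFAULT);
        } finally {
            parcel.recycle();
        }
    }

    // Reverse the process: string -> byte[] -> Bundle.
    public static Bundle decode(String stored) {
        byte[] bytes = Base64.decode(stored, Base64.DEFAULT);
        Parcel parcel = Parcel.obtain();
        try {
            parcel.unmarshall(bytes, 0, bytes.length);
            parcel.setDataPosition(0);
            return parcel.readBundle();
        } finally {
            parcel.recycle();
        }
    }
}
```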

The downside to this is that, since the class implementing Parcelable may change between platform versions, this is not guaranteed to work across system updates.
Also, after digging through the Parcel and Bundle source code, it seems that the Intent may be too large to serialize and deserialize in this way.

It looks as though the extras bundle only has a single item, so instead I will just serialize and deserialize that one item.
It contains a set of 3 name-value pairs, so I will serialize that into JSON, though serializing it into byte data did work initially.

In the end, I ditched the parceling and byte-storing of the data, since that is highly discouraged, and just stored the name-value pairs.
As long as the bundle doesn’t get deeper, the name of the extra doesn’t change, and the sub-bundle contains nothing but Strings, it will work.
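A sketch of that final approach; the all-Strings assumption is the same one called out above, and the key names are whatever the sub-bundle actually contains:

```java
import android.os.Bundle;
import org.json.JSONException;
import org.json.JSONObject;

public final class ScannerIntentStore {

    // Flatten the single sub-bundle of String pairs into JSON.
    public static String toJson(Bundle subBundle) throws JSONException {
        JSONObject json = new JSONObject();
        for (String key : subBundle.keySet()) {
            json.put(key, subBundle.getString(key)); // assumes String values
        }
        return json.toString();
    }

    // Rebuild the sub-bundle from the stored JSON.
    public static Bundle fromJson(String stored) throws JSONException {
        JSONObject json = new JSONObject(stored);
        Bundle bundle = new Bundle();
        java.util.Iterator<String> keys = json.keys();
        while (keys.hasNext()) {
            String key = keys.next();
            bundle.putString(key, json.getString(key));
        }
        return bundle;
    }
}
```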

The Android update with a Scanner Radio setting is hitting our new Beta group today, and will likely be updated in the Google Play store next week.

Behind the Curtain: Goodbye TestFlight

With Apple’s acquisition of TestFlight’s parent company, we received notice that TestFlight will no longer serve up Android builds as of March 21st.

While unfortunate, we must persevere. I’ve begun looking at options for replacing TestFlight.

The first option I considered was switching over to just using Google Play for both Beta and Release versions.
This would allow all of our Android deployments to come from one place, though it would mean that users would have to choose between Beta and Release for all of their devices, whereas currently they can have different devices on different versions.
Also, Google’s analytics are geared more towards user engagement and marketing rather than Beta testing.

The second option we are currently considering is a site called TestFairy. They immediately jumped on TestFlight’s recent announcement about discontinuing Android support by welcoming former TestFlight users, even providing hooks that let apps which include the TestFlight SDK migrate without any code changes at all. A very smart move.
So far, in my 1 day of testing, I am fairly impressed with the polish and extensive metrics provided by TestFairy. Not only does it gather checkpoints that I’ve set, it is able to monitor memory, CPU, Network, Phone Signal Strength, Battery, and even allows for screencast recordings.

The one snafu I encountered was Google Maps initially not working with our app. Sadly, the documentation is a bit lacking, though the interface is clean and simple enough that it doesn’t matter much for the most part.
In the end, I was able to figure out that, since TestFairy actually modifies our APK to add in the metrics gathering, it has to re-sign the app with TestFairy’s signing key. There is a field under App > Settings > Signature that shows Facebook and Google API keys, which, upon mousing over, reminded me that I needed to add the TestFairy-signed package to our Google Maps API list of allowed apps.

Overall, I am thoroughly impressed with TestFairy’s capabilities, if still a little wary of the fluffy name. Hopefully, unlike TestFlight, they won’t poof and disappear on us. I’ll be testing TestFairy for a couple more days, but expect us to switch over to it for Beta next week.

Behind the Curtain: Developer API Alpha

I am putting the finishing touches on the API Alpha Release this week, as well as an example web application.

The preliminary documentation for getting started is on our new Wiki at http://wiki.active911.com/wiki/index.php/Active911_Developer_API.
The JSON is still fluid, so I may change a few of the device/agency/alert/response attribute names.
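If you’d rather poke at the API from code than a browser, a minimal authenticated request looks something like this. The endpoint path and token below are placeholders; the Wiki has the real values:

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

public class ApiSmokeTest {
    public static void main(String[] args) throws Exception {
        // Placeholder path and token; see the Wiki for the real ones.
        URL url = new URL("https://access.active911.com/PLACEHOLDER/devices");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        // The API uses OAuth2 Bearer tokens for authentication.
        conn.setRequestProperty("Authorization", "Bearer YOUR_TOKEN_HERE");

        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), "UTF-8"))) {
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line); // raw JSON response
            }
        }
    }
}
```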

I did have a bit of trouble yesterday in making a test application. While getting cross-origin requests to work, I stumbled on 2 errors.
First, I spent almost an hour debugging invalid requests, trying different Apache configs and different data formats, before realizing I had inadvertently typed access.or.active911.com instead of access.active911.com (facepalm).

After I figured that out, I had to learn a bit about CORS and HTTP headers. It seems that, if you want to send headers like Authorization, the server first needs to declare which headers it will accept from another domain. I kept running into 401 errors and, upon inspecting the request, found that my Authorization header was not being sent. (I’m using jQuery’s AJAX method with a beforeSend callback to insert the OAuth2 Bearer Token.)

After some more fiddling with the AJAX call and digging around the internet, I found that I needed to accept the OPTIONS request method, which I had not heard of before.
Once I figured that out, it was a simple matter to implement the preflight workflow and enable CORS for my web application.
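Our API backend isn’t a Java servlet, but the preflight handshake is easy to sketch as one, and the idea carries over to any stack. The allowed origin below is a placeholder:

```java
import java.io.IOException;
import javax.servlet.Filter;
import javax.servlet.FilterChain;
import javax.servlet.FilterConfig;
import javax.servlet.ServletException;
import javax.servlet.ServletRequest;
import javax.servlet.ServletResponse;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

// Answers the browser's OPTIONS "preflight" so that a cross-origin page
// may send an Authorization header with its real request.
public class CorsFilter implements Filter {

    @Override
    public void doFilter(ServletRequest req, ServletResponse res,
                         FilterChain chain) throws IOException, ServletException {
        HttpServletRequest request = (HttpServletRequest) req;
        HttpServletResponse response = (HttpServletResponse) res;

        // Placeholder origin; a real deployment would check a whitelist.
        response.setHeader("Access-Control-Allow-Origin", "https://example.com");
        response.setHeader("Access-Control-Allow-Methods", "GET, POST, OPTIONS");
        // Without this line, the browser strips our Authorization header.
        response.setHeader("Access-Control-Allow-Headers",
                "Authorization, Content-Type");

        if ("OPTIONS".equalsIgnoreCase(request.getMethod())) {
            response.setStatus(HttpServletResponse.SC_OK); // preflight ends here
            return;
        }
        chain.doFilter(req, res); // real request continues to the API
    }

    @Override public void init(FilterConfig config) {}
    @Override public void destroy() {}
}
```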

Please try out the API and let me know how it works for you. We will be continuously improving it as time goes on.

Behind the Curtain: User Access

We are looking at ways to give every agency more ways to personalize and add content for their users.

One upcoming change we are considering is giving all users the ability to create their own account, regardless of the settings of any particular agency.

Currently, the only way to get a user account is to be an administrator of an agency or to have one created for you by an administrator.

However, we’ve come across some cases where the administrator does not want to create accounts for users because they are afraid their users may inadvertently mess up some settings for the agency.

Unfortunately for those users, being denied a login account also prevents them from participating in our forums and user polls.

As such, we will most likely be removing the agency setting that enables/disables account creation. However, this does not necessarily mean that everyone will be able to see their devices. Administrators will still be the ones who create devices, and to create an account, users will still need to provide an email address that is associated with a device.
This means that if an administrator sets the email address of every device to the administrator’s own email, those users will still be unable to create accounts. There does not seem to be a good way around this besides suggesting that administrators abandon the practice.

If you have any thoughts on other ways to improve the user experience on our website, please comment below!

Behind the Curtain: Catching Up

I’ve fallen a bit behind on my blogging since the new year.
I have 2 posts which are half finished, both regarding new features that are coming up this year.
I hope to complete those and get them out to you all this week and next.
As with any growing company, we’re reaching a critical point of juggling multiple pressing features.

Currently, I have an Android update waiting in the wings that I hope to put out to the TestFlight group today and release to the public next Monday.
I have the Open API, which requires both an OAuth implementation and a RESTful architecture; I hope to have an initial implementation out by early February.
I have a change I need to make that will default to our new SMS shortcode for all agencies in the US and an international number for all others.
And then there are the little “fires” to put out each day.

There’s certainly enough to keep even twice our number busy.
So if I do forget to reply to an email or request in a timely fashion, please don’t hesitate to send me reminders.
I may have that information/change sitting on my desk/desktop, just waiting for me to remember it’s there.

Behind the Curtain: Database Woes

Aside from the newly released Custom Response Buttons, we did a bit of work behind the scenes relocating our main database server to a beefier piece of hardware in our Texas datacenter.
This resulted in around 0.1s of increased lag when loading up a page through our website or mobile interfaces, barely noticeable in most cases.

Our support team, however, along with a couple of administrators for large counties with multiple agencies, started noticing huge slowdowns when browsing our websites.
It turns out one of our authentication methods issues multiple queries for each agency a user is a part of, which doesn’t pose much of a problem when turnaround to the database is 10ms, but quickly adds up when queries start taking 100ms or so.

I’ll be working on optimizing this, either by removing the extra queries and putting them into the workflow somewhere else, or by combining them into an already existing query.

The advantage of putting the queries elsewhere in the workflow is that they will no longer affect every page request and will only be called when needed.
The disadvantage, of course, is the proliferation of operations in our interface, which is rapidly growing as we add more nuances and functionality. We have talked about a web interface revamp in the near future, so this may be the correct solution until we can get to that project.

The advantage of combining the queries into one big one is that it keeps the logic in the same place, thus keeping our workflows streamlined and simple.
The disadvantage is that the query may become so big and cumbersome that it is too complex to run efficiently, or to maintain in the future.
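As a rough sketch of that second approach (all table and column names here are hypothetical, not our actual schema), the per-agency loop collapses into a single query:

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.HashMap;
import java.util.Map;

public final class AgencyRoles {

    // One round trip instead of one query per agency the user belongs to.
    // "agency_permissions", "agency_id", "role", and "user_id" are made-up
    // names standing in for whatever the real schema uses.
    public static Map<Long, String> rolesFor(Connection db, long userId)
            throws SQLException {
        String sql = "SELECT p.agency_id, p.role "
                   + "FROM agency_permissions p "
                   + "WHERE p.user_id = ?";
        Map<Long, String> roles = new HashMap<>();
        try (PreparedStatement stmt = db.prepareStatement(sql)) {
            stmt.setLong(1, userId);
            try (ResultSet rs = stmt.executeQuery()) {
                while (rs.next()) {
                    roles.put(rs.getLong("agency_id"), rs.getString("role"));
                }
            }
        }
        return roles;
    }
}
```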

Either way, we are due for a reworking of the website in the near future, as neither option is a great solution, and it will require some restructuring in order to remain scalable for the future.