Behind the Curtain: Sharding Over Time

Today I’ll explore my first sharding topic in a bit more depth.

The first obstacle to tackle is the increase in the number of requests needed to fulfill some of our API calls.
For example, whenever a device calls home for an update, it asks for all of the agencies and devices which are related to it, as well as all of the alerts that the device hasn’t read yet.
In our current single-database structure, this is basically 3 database calls, albeit fairly heavy ones.
Within a sharded system however, a call must be made for each individual related object since each could live on an entirely different server located hundreds of miles from the original device.

Each of these heavy, complex queries may take 140ms to establish a connection, 200-400ms to run, and 100ms to transmit the response.
Making 3 such calls equates to about 2 seconds.
Comparatively, in a sharded system, a database call would need to be made, in the worst case, once for each related device and agency.
The average device belongs to about 3 agencies, and each agency contains on average 25 devices.
That means a single call home would require at least 80 database calls.
In a naive implementation, that would be 140ms for each database connection, 5ms to run each query, and 5ms to transmit each response.
As you can see, this adds up to about 12 seconds for a single call home, a drastic increase.
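
To make that arithmetic concrete, here is a minimal sketch of the naive, serial approach. The connectTo() and shardFor() helpers are hypothetical, purely for illustration:

    // Naive sharded fetch, done serially: every record pays the full
    // connection + query + transfer cost (~150ms), so 80 records is ~12s.
    function callHomeNaive(relatedIds, callback) {
        var results = [];
        function next(i) {
            if (i === relatedIds.length) return callback(results);
            var conn = connectTo(shardFor(relatedIds[i]));      // ~140ms connection
            conn.query("SELECT * FROM records WHERE id = ?", [relatedIds[i]],
                function (err, row) {                           // ~5ms query + ~5ms transfer
                    results.push(row);
                    conn.close();
                    next(i + 1);
                });
        }
        next(0);
    }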

Of course, there are ways to reduce this. For example, since most people belong to geographically similar agencies, the likelihood that their records all live on the same server cluster is high. That allows us to implement things like connection pooling and multi-row select queries when we can anticipate records being on the same shard server.
Connection pooling alone would likely decrease the above example from around 12 seconds to about 2.4 seconds, already bringing it closer in line with our current performance.
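
A rough sketch of that pooled, batched variant, using the same hypothetical shardFor() helper plus a made-up poolFor() that hands back an already-open connection:

    // Pooled + batched variant: one already-open connection per shard,
    // one multi-row SELECT per shard instead of one query per record.
    function callHomePooled(relatedIds, callback) {
        // Group record IDs by the shard they live on.
        var byShard = {};
        relatedIds.forEach(function (id) {
            var shard = shardFor(id);
            (byShard[shard] = byShard[shard] || []).push(id);
        });

        var shardNames = Object.keys(byShard);
        var results = [];
        var pending = shardNames.length;

        shardNames.forEach(function (shard) {
            // The pool skips the ~140ms connection setup on every call.
            poolFor(shard).query(
                "SELECT * FROM records WHERE id IN (?)", [byShard[shard]],
                function (err, rows) {
                    results = results.concat(rows);
                    if (--pending === 0) callback(results);
                });
        });
    }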

The one thing that makes all of this even more complicated is that the more we shard, the more prone to latency issues the system becomes. A 25ms delay between the interface and the database in our current system equates to API calls that are about 75ms longer, a measurable but mostly unnoticeable difference.
In the world of shards, the same delay would translate to a whopping 6-second increase, almost quadrupling the average use case.
That leads us to the next obstacle, latency caused by the widespread geographic distribution of our services, which I will tackle in my next blog post about clustering of services.

My Fascination With Node.js and Server Side JavaScript

In my previous blog posts the state of the Windows Phone app has been the primary focus. The Windows app is very close to complete and, other than adding some features and fixing some bugs, I don’t feel like there is much to update on. Instead I thought it might be interesting to share one of my interests with you.

In my time writing code I have used many different languages. Some have been by choice and others out of necessity, and I have to say that I have enjoyed working in most of them, but one technology has become a particular favorite of mine over the last couple of years. That technology is Node.js, and I want to touch on some of its history and features.

I have always felt that a good indicator of a technology’s success and future relevance is the activity of the online community surrounding it. Node.js has a HUGE online following and will likely remain a relevant solution for years to come. With such a large community and such rapid development, there are many blog-worthy topics I could cover, but I will focus primarily on the history of Node.js and the area where it excels: I/O efficiency through an asynchronous event loop.

Node.js is JavaScript. Many technology-savvy people are aware of JavaScript and may have experience writing some of their own. Those who are less familiar with programming languages often confuse JavaScript with Java because the names sound similar, but they are distinctly different languages. It isn’t surprising that JavaScript (aka ECMAScript) is so well known, because it has been around since 1995. Since its release, it has become the language for client-side code in internet browsers. Today, you would have a difficult time finding a web page that doesn’t use JavaScript; it is on almost every website we visit daily.

Where Node.js comes into the picture is that it takes JavaScript from a client’s web browser to the server, or almost any other platform that can execute code. It is interesting to note that Node.js is not the first implementation of server-side JavaScript; Netscape’s Enterprise Server shipped a server-side JavaScript implementation back in the mid-1990s. However, Node.js didn’t come about simply from a desire to bring JavaScript to other environments. Ryan Dahl, the creator of Node.js, had a problem to solve: we had been using I/O inefficiently. When a connection is made to a database and we query for data, a typical program will wait for a response and then, when it receives a result, do something with it. The problem is that the time a program spends waiting for a response from the database can cost millions of clock cycles. An efficient program will make a call to a database and execute other code while waiting for a response, utilizing resources as optimally as possible.

A programmer’s first instinct may be spooling up multiple threads, allowing the program to do other things while waiting for a response. The problem is that spooling up new threads and context switching between them is not free. An example of this behavior can be found by examining how two web servers, Apache and NGINX specifically, handle connections. NGINX tends to handle a higher load of requests per second with less wait time than Apache. Part of this is because NGINX came onto the scene after Apache, with more awareness of the concurrency problems that arise with Apache at high scale. The difference is that Apache uses a thread for each individual connection, while NGINX implements an event loop. Both servers are capable of handling multiple connections concurrently, but Apache has to context switch between threads while NGINX queues tasks for execution in an event loop. Switching context between threads adds extra overhead, especially at higher loads where more threads are involved.

This is why Node.js set out to solve the I/O problem by using an event loop instead of threading. An important consideration is that, being human, we need to write code in a way that makes sense to us. Ideally we write code requesting a piece of information, like we would from a database, and immediately afterward we do something with that data. This is where JavaScript comes in. I believe Ryan Dahl would have chosen another programming language to build Node on if it made more sense. But JavaScript, being a language used primarily in browsers, was built for event loops, with its ability to pass functions around as callbacks. Many different events happen in a browser, such as the page loading, buttons being clicked, images loading in, or many other forms of I/O. Each of these triggers what is known as a callback function in JavaScript. This makes JavaScript very well suited for building into an event loop environment.
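
As a tiny illustration of the callback style that browsers made familiar (the element ID here is made up):

    // The function passed to addEventListener is a callback: it does not
    // run now, but is queued by the event loop when the click event fires.
    document.getElementById("alertButton").addEventListener("click", function () {
        console.log("clicked - the event loop just invoked our callback");
    });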

It is important to point out that other languages had asynchronous libraries handling an event loop at the time of Node.js’ conception (e.g. Python’s Twisted and Perl’s AnyEvent). The problem is that a developer using one of these libraries will inevitably want to pull in other libraries as well, and if those other libraries were not written to run asynchronously, they won’t play well with the asynchronous one. Node.js, by contrast, was intended to be asynchronous at its core.

Now that we have covered all of this, let’s look back at the example I have so frequently referenced, where we are waiting on data from a database. We can have a function in Node.js that handles opening a connection to the database and querying for a result. The key difference is that this function takes in a callback function. Node.js then handles queueing tasks up in the event loop, allowing other code to execute while we wait on the I/O of the database connection. When the database returns a result, our callback function is called with it, keeping the local machine’s resources working as effectively as possible without any downtime.

    function getData(callback) {
        // Open a connection and run the query (synchronously here,
        // just to keep the illustration simple).
        var db = getConnection();
        var result = db.query("SELECT something FROM something;");
        // Defer the callback to the next pass of the event loop, so the
        // caller can continue running other code in the meantime.
        process.nextTick(function () {
            callback(result);
        });
    }

This example isn’t using the database connection asynchronously, but it gives you a basic idea of how this kind of code runs. It won’t hold up any other code from running until it calls the callback function on the next tick.

Overall, I find Node.js fascinating as well as fun to work in, because you get to write asynchronous code in a completely new way. Node.js projects are quick and easy to write, and they can run on many different platforms. The evolution of Node.js continues to be very interesting, as its large community is constantly building new projects and libraries as well as improving Node.js as a whole.

Node.js is not without its own problems. If I were to show an example of using a database function asynchronously, we would see an instance of “Callback Hell”, a particular kind of code readability issue. This is a real issue in Node.js, but it has an interesting solution called “promises” that I may discuss in a future blog post.
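
Just to give a taste of what that nesting looks like, here is a sketch, assuming a hypothetical asynchronous db.query() that takes a callback (error handling omitted) and a made-up render() function:

    // "Callback Hell": each query depends on the previous result, so the
    // callbacks nest one level deeper at every step.
    db.query("SELECT id FROM agencies WHERE device = ?", [deviceId], function (err, agencies) {
        db.query("SELECT * FROM devices WHERE agency IN (?)", [agencies], function (err, devices) {
            db.query("SELECT * FROM alerts WHERE device IN (?)", [devices], function (err, alerts) {
                render(agencies, devices, alerts);  // three levels deep already
            });
        });
    });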

Behind the Curtain: Scalability

As Paul noted in his post, we’ve been working on making our backend more scalable, especially our databases. As it currently stands, we have a single database which is loaded into memory on startup. This allows all of our queries to be super fast, even when traversing hundreds of thousands of rows.

There are a few downsides to our current deployment. First, since we’re preloading all the data into memory, the spool up time for this database is huge. Since the cost is so high, taking it down for maintenance is something we do very infrequently. While this is not a bad thing in and of itself, it does mean that we cannot make large changes to our database structure more than a couple of times a year.

Second, since all the data lives on a single server, the only way to scale up is to throw more resources at that server. This is only feasible as long as our growth stays at or below the pace at which hardware power increases. As we grow faster and faster, this is becoming less and less tenable, pushing us toward a more distributed datastore.

Third, serving all of our data from a single datacenter means that connectivity issues in that area would adversely affect service nationwide.

Sharding solves all 3 of these issues.

First, since each shard contains just a fraction of the data, it takes much less time to warm up each server. It also allows us to affect only a small set of our users if we need to take a server down for maintenance.

Second, sharding allows us to gradually bring up more servers as needed to serve the growing load. This is especially useful in allowing high volume areas to have their own dedicated cluster of shards.

Third, by splitting the data up across multiple shards, we can begin widely distributing our servers. We have long discussed, and wanted to implement, a way to host servers away from the areas they service. For example, New York would be serviced mainly by servers hosted outside of the Northeast. Not only would this resolve the single point of failure issue that Paul outlined, it would also decrease the likelihood that our service would go down during a natural disaster, since the servers are outside the affected area.
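
For a rough sense of how a request finds its data in such a layout, here is a minimal routing sketch; the shard list and hash function are entirely made up for illustration, not our actual scheme:

    // Minimal shard-routing sketch: hash a record's key to pick a shard.
    var shards = ["shard-west.example.com", "shard-east.example.com", "shard-south.example.com"];

    function shardFor(key) {
        var hash = 0;
        for (var i = 0; i < key.length; i++) {
            hash = (hash * 31 + key.charCodeAt(i)) | 0;  // simple string hash
        }
        return shards[Math.abs(hash) % shards.length];
    }

    console.log(shardFor("device-12345"));  // the same device always routes to the same shard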

Of course, such a huge change will take time, and is not without its risks and potential downsides. We are making this change gradually, starting with the new PARS board feature that is coming soon. Check back often as we explore the wonder and mysteries of sharding.

Meet Active911 for Windows Phone 8.1

Greetings! In my last blog post I shared some of the development plans for Windows versions of the Active911 app. Since then there has been some considerable progress with the Windows Phone version and we have kicked off production on the tablet version. So, I thought I would attempt to wow and amaze you with sneak peeks of the latest beta release for the phone. Currently, we are on the 6th update for the phone and the beta testers have contributed wonderfully to the final production version. There is still some work to do with Windows Push Notifications, but the phone version is almost ready for production!

Since we worked hard to make the app look and feel at home in the Microsoft environment, we had different tools available to us than when our other apps were developed. This has resulted in a uniquely Microsoft version of Active911. To start off with, instead of listing all of a user’s agencies in one list view, we used what is called a pivot view to allow convenient swiping between different departments.

[Screenshot: pivot views]

The personnel view also takes advantage of a pivot view and works much the same way, letting you slide between agencies to see who is responding and how far they are from your current location.

Another neat, eye-catching design element is the map in the background. Along with the agency name in the heading, this helps give you a better feel for which agency you are currently viewing. It also turns out to be a pleasing aesthetic to keep a map somewhat central to the overall theme.

Viewing an alert’s details is very similar to our other apps, with the exception of a mini map to give you an idea of the location without going directly into the map view. Of course, a quick tap on the map or address will take you to an interactive map where you can quickly get a view of the surrounding area.

[Screenshot: map navigation]

We have also taken advantage of the Command Bar at the bottom of the screen to give you a different set of view-specific options throughout the app. For instance, the three buttons at the bottom of the map view allow you to switch to a different route, switch between satellite and road views, and view step-by-step directions to the alert. On other views you will have options to view the status of personnel or access the app’s settings.

All in all, I think the Windows Phone version of Active911 gives the same great experience, just wrapped in a Microsoft-style interface. We have already discussed plans for bringing a similar, but distinctly tablet, design to Windows tablets, which will be a fun thing to show off in the future.

As I said, things are coming along very well with the Windows Phone app, and development on the tablet app has picked up. I would like your takeaway to be that the phone app is close, and that everyone eager to get their hands on it won’t have to wait much longer. In the meantime, I hope I have shared at least a little bit of my excitement and satiated any curiosity as far as Windows Phone is concerned.