Behind the Curtain: Sharding Over Time

Today I’ll explore the first of my sharding topics in a bit more depth.

The first obstacle to tackle is the increase in the number of requests needed to fulfill some of our API calls.
For example, whenever a device calls home for an update, it asks for all of the agencies and devices which are related to it, as well as all of the alerts that the device hasn’t read yet.
In our current single-database structure, this is basically 3 database calls, albeit fairly heavy ones.
Within a sharded system, however, a call must be made for each individual related object, since each could live on an entirely different server located hundreds of miles from the original device.

Now, each of those heavy, complex queries might take 140ms to establish a connection, 200-400ms to run, and 100ms to transmit the response.
Making 3 such calls equates to about 2 seconds.
Comparatively, in a sharded system, a database call would need to be made, in the worst case, once for each related device and agency.
The average device belongs to about 3 agencies, and each agency contains on average 25 devices.
That means a single call home would require at least 80 database calls (3 agency lookups plus roughly 75 device lookups).
In a naive implementation, that would be 140ms for each database connection, 5ms to run each query, and 5ms to transmit each response.
As you can see, this adds up to about 12 seconds for a single call home, a drastic increase.
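As a back-of-envelope check, the two scenarios can be compared with a few lines of arithmetic. The timing constants below are the illustrative figures from the discussion above, not measurements:

```python
# Illustrative latency figures from the text (milliseconds).
CONNECT_MS = 140            # establishing a database connection
HEAVY_QUERY_MS = 300        # midpoint of the 200-400ms heavy-query range
HEAVY_RESPONSE_MS = 100     # transmitting a heavy response
SHARD_QUERY_MS = 5          # running one small per-record query
SHARD_RESPONSE_MS = 5       # transmitting one small response

# Current system: 3 heavy calls against a single database.
single_db_total = 3 * (CONNECT_MS + HEAVY_QUERY_MS + HEAVY_RESPONSE_MS)

# Naive sharded system: ~80 small calls, each paying a fresh connection.
calls = 80
sharded_total = calls * (CONNECT_MS + SHARD_QUERY_MS + SHARD_RESPONSE_MS)

print(f"single DB:     {single_db_total / 1000:.1f}s")  # ~1.6s ("about 2 seconds")
print(f"naive sharded: {sharded_total / 1000:.1f}s")    # 12.0s
```

Note that the connection handshake dominates the sharded total: 11.2 of the 12 seconds is spent opening connections, which is why pooling is the first lever to pull.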

Of course, there are methods by which we can reduce this. For example, since most people belong to geographically similar agencies, the likelihood of all their related records living on the same server cluster is high. This lets us use techniques like connection pooling and multi-row SELECT queries whenever we can anticipate records being on the same shard server.
Connection pooling alone would likely decrease the above example from around 12 seconds to about 2.4 seconds, already bringing it closer in line with our current performance.
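A minimal sketch of the batching side of this idea is below. The `shard_for` routing function and the shard count are hypothetical stand-ins for whatever routing layer a real sharded system exposes; the point is just that grouping lookups by shard collapses ~80 connections into one per shard server:

```python
from collections import defaultdict

def shard_for(record_id, num_shards=4):
    """Hypothetical routing function: map a record id to a shard server."""
    return record_id % num_shards

def group_by_shard(record_ids, num_shards=4):
    """Group lookups so each shard server is queried once with a single
    multi-row SELECT over a pooled connection, instead of paying a fresh
    140ms connection handshake per record."""
    by_shard = defaultdict(list)
    for rid in record_ids:
        by_shard[shard_for(rid, num_shards)].append(rid)
    return dict(by_shard)

batches = group_by_shard(range(80))
print(len(batches))                            # 4 shard servers touched
print(sum(len(v) for v in batches.values()))   # all 80 records covered

# Rough payoff under the same illustrative figures: 4 handshakes
# instead of 80. The exact number depends on how many shard servers
# the records actually span.
est_ms = len(batches) * 140 + 80 * (5 + 5)
print(est_ms)  # 1360ms
```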

The one thing that makes all of this even more complicated is that the more we shard things, the more sensitive to latency the system becomes. A 25ms delay between the interface and the database in our current system equates to about 75ms longer API calls, a measurable but mostly unnoticeable difference.
In the world of shards, this would translate to a whopping 6-second increase, almost quadrupling the time of our average use case.
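One plausible way to arrive at that 6-second figure, assuming each of the ~80 shard calls crosses the network three times (connection setup, query, response) and each crossing carries the extra 25ms:

```python
DELAY_MS = 25        # added network delay per network crossing
CALLS = 80           # shard calls per device call home (from above)
TRIPS_PER_CALL = 3   # connect, query, and response each cross the network

added_ms = CALLS * TRIPS_PER_CALL * DELAY_MS
print(added_ms / 1000)  # 6.0 seconds of added latency
```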
That leads us to the next obstacle, latency caused by the widespread geographic distribution of our services, which I’ll tackle in my next blog post on clustering of services.
