We are reaching the limits of our current architecture, and there have been quite a few examples of this recently: the performance of the private view and feed features, CPU bottlenecks at increased RPM, etc. This post will focus on the former, as the latter comes down to implementation details, while the former points to more generic issues.
I would also like to talk about scalability, because plans for new features of the platform go way further than anything we currently do. If we want to grow (number of users, number of features, amount of data stored and processed, etc.), we not only need to solve our current performance problems, but we need to find a way to guarantee good performance as new features are introduced, and more and more users interact with the platform in a richer and richer way.
Our current practice for addressing performance issues is mainly to go for the low-hanging fruit, and one low-hanging fruit leads to another. While this obviously improves the platform, it is still meandering, and never takes us to a stable state with reliable performance. What I'm suggesting here is to cut some corners and introduce slightly deeper changes to our architecture.
But before I describe what I have in mind, let me quickly go through how Kinja currently works, and for that I will use a few diagrams. These diagrams are simplified and abstract, and real life of course is a bit more complex, but still they capture the essence of these operations, and help to make some points. I'm using two scenarios as examples: creating a new post and then viewing the permalink page of that post.
The first diagram shows how a post is saved in Kinja. Blue arrows correspond to database queries, and red arrows indicate blocking. The HTTP request for saving a new post comes from the right, and the response to that request is actually not that important in this case, as it is merely just an acknowledgement to the client.
As I said, this example is simplified a lot; in reality, for example, authorization and permission checks require more than one round-trip to the backend to resolve the current user and the current blog, but I left that out to keep things simple. Also, the step of adding the post involves validating the author id and blog ids by resolving the author and the blogs. But I'm trying to focus here on the essence of this write operation, which is basically the construction of a post document, and of an insert query to save that document in a table.
After that query is built and executed, a response is sent to the client, and at the same time, asynchronously, the building of lookup tables is kicked off. The post document is normalized data, so references from the post to other objects are saved as object ids in the document. The lookup tables, built asynchronously, are a means of representing relations between objects in our system, so that, for example, posts can be queried by author or by blog. They link object ids of different object types to each other. In this example three such lookup tables are updated when a new post is saved: the post's relations to blog ids, to author ids, and to a parent post (if the post is a reply).
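The write path above can be sketched as follows, with a deferred task list standing in for the asynchronous step. All names here are illustrative, not Kinja's actual code:

```python
posts = {}               # master table: post id -> normalized post document
posts_by_blog = {}       # lookup table: blog id -> post ids
posts_by_author = {}     # lookup table: author id -> post ids
replies_by_parent = {}   # lookup table: parent post id -> post ids
deferred = []            # tasks kicked off after the client is acknowledged

def update_lookup_tables(post):
    posts_by_blog.setdefault(post["blog_id"], []).append(post["id"])
    posts_by_author.setdefault(post["author_id"], []).append(post["id"])
    if post["parent_id"] is not None:
        replies_by_parent.setdefault(post["parent_id"], []).append(post["id"])

def save_post(post_id, blog_id, author_id, body, parent_id=None):
    post = {"id": post_id, "blog_id": blog_id, "author_id": author_id,
            "body": body, "parent_id": parent_id}
    posts[post_id] = post                                  # the insert query
    deferred.append(lambda: update_lookup_tables(post))    # asynchronous step
    return {"ok": True, "id": post_id}                     # acknowledgement

def run_deferred():     # a background worker would drain this queue
    while deferred:
        deferred.pop(0)()
```

The point is only the ordering: the client gets its acknowledgement right after the one blocking insert, and the lookup tables catch up afterwards.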
Here's the write process itself, without the databases.
The second example I want to show is the scenario of a user visiting the permalink page of a post. The same applies to this case too: I left out the details of how authorization and permission checks work, and the only thing I want to focus on is that the outcome of request pre-processing is a post id, which is then used to retrieve that post from the database.
The read operation is basically a sequence of blocking steps, which are database queries. As details of the blog and the author of the post are displayed on the page, those objects need to be resolved. And all replies to that post are retrieved too, first by acquiring all post ids where the parent is the current post, and second by resolving these post ids. This is another, quite significant simplification, as building up a conversation tree is way more complicated in real life. But still this example is good for showing the purpose of lookup tables, and the way they are used (i.e. yet another blocking database query).
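The sequence of blocking steps can be sketched like this, with dicts standing in for database tables (again, illustrative names only):

```python
# Each lookup below corresponds to one blocking database query, de-normalizing
# the post on the fly into the structure the template needs.
def render_permalink(post_id, posts, blogs, authors, replies_by_parent):
    post = posts[post_id]                           # query 1: the post itself
    blog = blogs[post["blog_id"]]                   # query 2: resolve the blog
    author = authors[post["author_id"]]             # query 3: resolve the author
    reply_ids = replies_by_parent.get(post_id, [])  # query 4: the lookup table
    replies = [posts[rid] for rid in reply_ids]     # query 5: resolve replies
    return {"post": post, "blog": blog, "author": author, "replies": replies}
```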
Let's see the big picture - both operations, read and write, at the same time. The second diagram below shows the actual applications too, the runtime environments where different steps of these processes are actually executed.
If we think about these two operations for a second, we quickly realize that both are basically functions: write is a function which takes a request and returns normalized data, and read is a function which takes normalized data (and of course a request too) and returns a page. These are transformations between data structures: write is data normalization, and read is transformation of normalized data to the structure dictated by page templates.
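A toy version of the two transformations, just to make the shapes explicit (this is a sketch of the idea, not real code):

```python
# write : request -> normalized data (normalization)
def write(request):
    return {"id": request["id"], "body": request["body"]}

# read : (normalized data, request) -> page (shaping data for the template)
def read(data, request):
    return "<h1>Post %d</h1><p>%s</p>" % (data["id"], data["body"])
```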
There are two very important differences between the nature of these two operations:
- reads are much more frequent than writes,
- the result of a read is always awaited by someone (the user opens the page and waits for it to be displayed), so it can never be asynchronous (while, in contrast, nobody really cares when a write finishes).
So our system does reads most of the time, and these reads have to be fast.
Storing normalized data means that the structure we store the data in is very far from the structure dictated by page templates. The purpose of data normalization, apart from compactness, is abstraction, so that changes to layers like the user interface or the business logic do not affect the data. This is especially important if data is stored in rigid stores with schemas, like a relational database, and the business logic or the user interface changes frequently, which actually is the case. On the other hand, if the structure of the stored data is very different from the structure the templates need, then the majority of the logic and processing, the real heavy lifting, ends up in the read function, the read operation.
So we are in a situation where we do significantly more reads than writes, which reads also need to be very fast, and at the same time these reads are way more complex and expensive than writes. The biggest part of our business logic is de-normalization of data on the fly, and is encoded in sequences of complex queries, executed on every read operation, while a user is waiting for the result.
Our current preferred way of attacking performance issues is to ease this tension by
- optimizing these complex queries, not only with smarter JOINs and slicker INs, but also by breaking up bigger queries into smaller pieces, running those in parallel, and moving some of the logic to the Scala side;
- finding bottlenecks and adding arbitrary caching here and there.
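The first optimization above, breaking a big query into independent smaller ones and running them in parallel, can be sketched like this. The `fetch_*` functions stand in for database queries; all names are illustrative:

```python
from concurrent.futures import ThreadPoolExecutor

def fetch_post(post_id):
    return {"id": post_id}       # stands in for a database query

def fetch_blog(blog_id):
    return {"id": blog_id}

def fetch_author(author_id):
    return {"id": author_id}

def load_page_data(post_id, blog_id, author_id):
    with ThreadPoolExecutor(max_workers=3) as pool:
        post = pool.submit(fetch_post, post_id)
        blog = pool.submit(fetch_blog, blog_id)
        author = pool.submit(fetch_author, author_id)
        # total latency is roughly the slowest query, not the sum of all three
        return post.result(), blog.result(), author.result()
```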
Slow queries are the low-hanging fruit most of the time, and they can always be remedied by caching and/or refactoring. The problem is that these solutions are somewhat arbitrary, and the one-off patches build up over time and clutter the code, which makes development harder, especially as the teams grow. Some of these solutions also don't scale very well.
What I'm suggesting instead is of course not special or groundbreaking in any way; quite the opposite, actually: it is nothing but glorified caching, or more precisely a slightly different approach to caching, one that is more systematic and more coherent. The essence of this suggestion is to bring the structure of the stored data and the structure of the data required by templates closer to each other, by moving all the logic from read operations to write operations, and persisting the result of all the processing. Or, in other words, to store (or to cache, if you prefer) redundant, de-normalized data, in a structure which lets it be handed over to template rendering without additional transformations.
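Concretely, for the permalink page this means the write pre-computes one redundant, template-shaped document, so a read becomes a single key lookup. The structure below is purely illustrative:

```python
permalink_views = {}     # post id -> de-normalized, template-ready document

def on_post_written(post, blog, author):
    permalink_views[post["id"]] = {
        "post_body": post["body"],
        "blog_name": blog["name"],       # redundant copy of the blog's name
        "author_name": author["name"],   # redundant copy of the author's name
        "replies": [],
    }

def read_permalink(post_id):
    return permalink_views[post_id]      # no joins, no transformations
```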
Is this any different from pre-rendering all mathematically possible HTML pages in all possible combinations for every possible user in every possible situation? Not really. I'm not suggesting that, of course, but it's a good metaphor, and it also makes the downsides of this approach more obvious.
For example, frequent UI changes cascade all the way down to the level of stored data. But that only means that relational databases, or any kind of data storage using schemas, are not a good match for such a system. While lookup tables encoding the relationships of objects fit nicely into relational tables, de-normalized views require something rather similar to what we use for storing master data (i.e. protocol buffer documents).
Another issue is that the data normalization process deals with consistency, and cases like reflecting an author's name change on every page where that name is displayed are handled naturally, as these pages are assembled on the fly, from always up-to-date atomic pieces. The update (or invalidation, if you prefer) of pre-computed views, on the other hand, needs to be triggered. The most natural way of doing this is publishing these changes as events to the subscriber components that pre-compute views. An author changing his name is represented by an author rename event, published to components in the system, and all components that include the author's name in their views subscribe to this event and react to it by updating their pre-computed views.
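The rename scenario above can be sketched with a minimal in-process pub/sub (the event shape and all names are made up for illustration):

```python
subscribers = {}     # event type -> handlers

def subscribe(event_type, handler):
    subscribers.setdefault(event_type, []).append(handler)

def publish(event):
    for handler in subscribers.get(event["type"], []):
        handler(event)

# a pre-computed view that embeds a redundant copy of the author's name
permalink_views = {1: {"author_id": 7, "author_name": "Old Name"}}

def on_author_renamed(event):
    for view in permalink_views.values():
        if view["author_id"] == event["author_id"]:
            view["author_name"] = event["new_name"]

subscribe("author_renamed", on_author_renamed)
publish({"type": "author_renamed", "author_id": 7, "new_name": "New Name"})
```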
Such an event-driven pub/sub system must be centered around a message queue, an event log. The event log must be persisted and must be immutable (append-only), as it is the master data in the system. Events are published through this event log, and the components pre-computing views subscribe to it. In such a setup, in a way, entities like posts or authors are just properties, or metadata, of events. And the views, built from these properties, can be dropped and rebuilt at any time from the master data, the timeline of events.
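Rebuilding a view from the log is just a replay of the whole timeline. A sketch, with an in-memory list standing in for the persisted log:

```python
event_log = []     # immutable, append-only timeline of events

def append(event):
    event_log.append(event)

def rebuild_posts_by_author():
    # drop the view and recompute it from the master data: the event timeline
    view = {}
    for event in event_log:
        if event["type"] == "post_created":
            view.setdefault(event["author_id"], []).append(event["post_id"])
    return view

append({"type": "post_created", "post_id": 1, "author_id": 7})
append({"type": "post_created", "post_id": 2, "author_id": 7})
```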
Let me now go through how a pub/sub system would work. Again, these are abstract examples of the same two scenarios (new post and permalink page view), and they are simplified a lot. Actually even more so than in the previous case, as for example authorization, even in its simplest form of resolving a token to a user, is completely left out. There's the option of doing authorization during the pre-computation of views, but that would of course result in a lot of dummy data in the event log (as every unauthorized attempt to do anything in the system would be logged), which would need to be dealt with.
The first diagram shows how a new post is saved: a new post event is published to the event log, and services like the permalink service, subscribed to this event, react by computing and updating the views that use any data from this event. (The labels in the diagram only mention insert queries, but those could easily be updates or deletes, even for a new post.)
These datastores at the bottom, where the pre-computed views are stored, can be storages of any kind, tailored to the needs of the particular service, page, or functionality. For example, private view pages, which display personalized views listing the posts of blogs followed by a user, could benefit from a graph database, which stores posts along with the relationships ("follow", "recommend", "reply", "block", "dismiss", etc.) between users. Of course, in such a setup the same piece of data, even the actual body of a post, is stored multiple times, highly redundantly.
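The private view example can be sketched as a tiny in-memory graph: edges are keyed by relationship type, so "posts of blogs followed by a user" is one traversal instead of a chain of joins. All data and names below are made up:

```python
follows = {"user:1": ["blog:10", "blog:11"]}     # "follow" edges
posts_of = {"blog:10": ["post:100"], "blog:11": ["post:101"]}

def private_view(user):
    feed = []
    for blog in follows.get(user, []):           # walk the follow edges
        feed.extend(posts_of.get(blog, []))      # then the blog -> post edges
    return feed
```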
The next diagram shows how read operations work, how a post's permalink page is rendered. In the ideal situation a single record is retrieved from the datastore by its key, and handed over to the template rendering component, without any additional transformation and processing.
This diagram below shows how data is propagated and processed in the system.
As you can see, there's no heavy lifting during reads; they only involve simple queries to datastores and template rendering, so they are as fast as they can be. But this setup not only performs better, it scales better too. It is a distributed, event-driven architecture, where components are much more decoupled. Simple queries also do not rely on data being physically in one place, which further supports scalability. And in such a system scaling is mainly about scaling writes, so an increase in traffic from read-only operations like simple page views does not affect it that much.
On the downside, the catch is the greatly increased number of database writes, especially to the event log, so this would require a datastore which is optimized for inserts. Luckily for us, there are very good solutions to this problem. Another problem can be that potentially unbounded storage space is needed for all the redundant data.
Let me summarize everything in four flowcharts, which reduce all the above diagrams to the very essence of the four examples shown. Expensive operations are highlighted, so that the differences between the two architectures are more obvious. The first flowchart shows a write operation in the current setup:
While a write operation in the pub/sub setup would look like this:
A read operation in the current system involves the following steps:
And finally a read operation in the proposed setup consists of these steps: