Facebook Twitter Linkedin
article | 17 Oct 2021
Introduction to Apache Cassandra
Przemysław Pala
ss

Apache Cassandra - A Perspective for Relational Developers

Let’s talk about an introduction to Cassandra! During this article, we're going to cover a few things. We'll show a little bit about relational databases, maybe some of the problems that you run into when you try to scale relational databases for high availability then we'll also cover some core concepts of how:

  • Cassandra works internally, 
  • Some of the dials that it gives you as a developer to control its fault tolerance and, this notion of high availability
  • The different distributions that you can choose from Cassandra's whether his open-source Apache Cassandra are right for you or should you maybe think about using DataStax Enterprise Distribution DSC when you employ Cassandra to production. 

Small database - Apache Cassandra DataStax

Enterprise distribution, is when you're deploying Cassandra to production? The first thing we're gonna point out is small data. Maybe you whip up a python script or something in ruby but at the end of the day, it generally means that it's a one-off and you don't need any concurrency, you're not sharing this with anybody - this is running on your laptop and it's okay like maybe the batch takes like 30 seconds if it's a big file but that's kind of what you're looking at with small data. 

 

Medium database - Apache Cassandra DataStax

We've talked about small data so let's talk about medium data now. Probably if you're a web application developer of some other kind, this is probably the typical data set that you're working with. This is data that fits on a single machine. You're probably using a relational database (RDBMS - Postgres, MySQL). You can support hundreds of concurrent users so you've got some concurrency going on now and the other kind of nice thing that we get when we're working with the relational database is these ACID guarantees. ACIDs stand for atomicity, consistency, isolation, and durability.

As a developer, we've been taught for many years how to develop on top of machines like this. When I go to put data into a relational database with these ACID guarantees, I can kind of feel warm and cuddly, and I kind of know exactly what's going to happen with my data when I put it in. The other thing to know about it is the way we try to scale these types first. It happens by scaling vertically so we buy more expensive hardware, like more memory or maybe a  bigger processor and this can get expensive quickly.  But it is not so true all the time, lots of new RDBMS can work in distributed database mode.


Can the relational database (RDBMS) work for big data?

The first thing that we find when we start to use a relational database to try and apply it to big data is that ACID is a lie, we're no longer enveloped in that cocoon of safety, which is animosity, consistency, isolation, and durability. Let's take our scenario: over here we have a single master and we have a client that's talking to that master and we have a read-heavy workload and what we do is we decide to add on replication. One of the things that are important to know about replication is that the data is replicated asynchronously and this is known as replication lag and so what happens is when the client decides to write to the master, it takes a little while to propagate over to the slave and if the client decides to do a read to the slave before the data has been replicated, it's going to get old data back and what we find is that we have lost our consistency completely in the scope of our Database.

That whole thing that we built, our entire application around that certainty that we had, that we were always going to get up-to-date data is completely gone. All the operations that we do are no longer in isolation. They're not atomic so we have to recode our entire app to take advantage or at least to accommodate the fact that we have replication and there is replication lag. Another thing that we run into when we start to deal with performance problems is relational databases. It is probably something most of us that have worked with relational databases see. These crazy queries with lots of joins, maybe it's being generated by an ORM behind the scenes. In fact, at a company I used to work for we had this problem where every day at one o'clock we have a lot of users try to log onto the system and nobody would be able to log in and when we went and looked at what was going on behind the scenes, it was some crazy queries like these with lots of joints essentially locking up the database behind the scenes. 


 

So queries like this can cause lots of problems. It's kind of one of the side effects of using the third normal form to do all of our data modelings and so what we try to do when we're dealing with queries like this that have unpredictable performance or poor performance. A lot of times we normalize so we'll create a table and that table is built specifically to answer that query so at the right time what we'll do is de-normalize. At the reading time, we can do a sort of a select star very simple query that doesn't have a lot of expensive joins in it and that means that now we've probably got duplicate copies of our data, and we kind of violated this sort of third normal form that we're used to using and that has been drilled into our heads as developers for a long time.


What is Sharding?

As we continue to scale our application the next thing that we're gonna have to do is implement Sharding.

Sharding is when you take your data instead of having an all-in-one database and one master. You split it up into multiple databases and this works well for a little while. The big problem with this is that now your data is all over the place and even if you were relying on it, let's say a single master to do your old app queries, you can't do it anymore. All of your joins, all of your aggregations, all that stuff is history and you absolutely can not do it and you have to keep building different denormalized views of the data that you have to answer queries efficiently. We also find that as we start to query secondary indexes don’t scale well either.

 

If we take our servers and we say I'm going to split my users into four different shards and then I haven't shared users on something like state and I want to do a query, I want to say I want all the users in the state of Massachusetts. That means I have to hit all of the shards. This is extremely not optimal, meaning if there were 100 shards, I have to do 100 queries, this does not scale well at all. As a result we end up de-normalizing the process again so now we start to copy our users, one by user ID and another by state. Whenever we decide to add shards to our cluster, if we want to double the number from 4 to 8, we now have to write a tool that will manually move everything over. This requires a lot of coordination between developers and operations and is an absolute nightmare because there's dozens of edge cases that can come up when you're moving your data around. You have to think about all of them in your application and  so does your ops team.

The last thing that we find is that managing your schema is a huge pain. If you have 20 shards all with the same schema on it, you have to now come up with tools to apply schema changes to all of the different shards in your system. It's not just the master that has to take it, but all of your slaves, this is a huge burden and an absolute mess.

We mentioned using master-slave replication to kind of scale-out when you have a read-heavy workload. Another reason why people will introduce this sort of master-slave architecture with replication is for high availability or maybe at least higher availability. The thing is when you introduce this replication a lot of times you have to decide how you're going to do the failover so maybe you build some sort of automatic process to do the failover. Maybe it's a manual process where somebody has to notice that the database has gone down and push a button to failover to the slave server. If you build an automatic process of some kind, then who is going to watch the automatic process to make sure it doesn't crash and ultimately not end up being able to failover your database? In any scenario, the problem is that you still end up with downtime because whether it's a manual failover process or an automatic one, that implies it's something that is going to have to detect that the database is down and that you're having downtime before the failover can kick in. 

The other thing is trying to do this with a relational database and do multiple data. The enter is a disaster. It's hard to do. We're not just talking about downtime as far as unplanned downtime. We know the hardware fails, amazon reboots your servers as a service sort of thing, that kind of stuff happens but then there's also planned downtime as well. There are things like OS upgrades or upgrades to your database server software so you got a plan for those as well. Furthermore, it would be nice to have some way to have higher availability than what the master-slave kind of architecture gives to us. 

Let’s summarize the ways that the relational database fails us handling big data. 

 

We know that scaling is an absolute pain. We want to put bigger hardware that costs a lot of money. We want to shard that absolute mess but we've got replication, it's falling behind. We have to keep changing our applications to account for the things that we give up in the relational database. ACID, you know that cocoon of safety, we're not in that thing anymore we are treating our relational database pretty much like a glorified key-value store. We know that when we want to double the size of our cluster and we have to reshard that is an absolute nightmare to deal with. Nobody wants to do this because it requires way too much coordination. We know that we're gonna have to denormalize all of our cool third normal forms queries that we love to do our aggregations. Those things that we were so proud to write in the first place they're gone.

Now we're just writing our data and a bunch of different tables and some of the time we're just gonna achieve it. High availability it's not happening right if you want to do multi D.C. with MySQL or Postgres. It is not happening unless you wouldn't have an entire dedicated engineering team to try and solve all the problems that they're going to come up with along the way. 

If we were to take some of the lessons that we've learned, some of the points of failure that we just summarized and applied to a new database may be from scratch that would kind of be good for handling big data. What are some of the lessons that we've learned from those that fail? 

So the first thing is that consistency is not practical. This whole idea of ACID consistency is probably not practical in a big distributed system so we're gonna give it up. We also noticed that manual sharding and rebalancing are really hard. We had to write a lot of code just to move data from place to place and handle all these error conditions. So instead, what we're gonna do is push that responsibility to our cluster. Our dream database can go from 3 to 20 machines and we, as developers, don't have to worry about it. We don't have to write any special code to accommodate that. 


The next thing we know is that every moving part that we add to the system makes things more complex and all this failover and processes to watch the failover and everything. We want our system to be as simple as possible with as few moving parts as possible. None of this master-slave architecture sort of thing. We also find that scaling up is expensive if you want to vertically scale your database, you're gonna have to put things like sand in place. You're gonna have to get bigger and bigger servers every time you do it. It's a lot of money and it's not worth it in the end. 

So what we would do in our dream database is to only use commodity hardware, we want to spend $5-$10,000 per machine instead of $100,000. Furthermore, what we want to do is buy more machines that way, when we want to double the capacity of our cluster, we're not going from a $100,000 machine to a $200,000 machine, we're just doubling the number of cheap machines that we use. 

 

 


The last lesson learned here is that scatter/gather queries are not going to be any good so we want to have something that kind of tries to push us. Maybe in its data modeling or something like that towards data locality where queries will only hit a single machine so that we're efficient. What’s more, we don't introduce a whole bunch of extra latency where we're instead of doing a full table scan now. We're doing a full cluster Scan.