Cassandra itself takes care of the problems of a single point of failure, server failures, and data distribution between cluster nodes. This holds both when the servers are placed in one data center and in configurations with many data centers separated by distance and, accordingly, network latency. Reliability here means eventual consistency of the data, with the ability to set a tunable consistency level for each query.
NoSQL databases generally require more insight into their inner workings than SQL databases. This article will describe the basic structure, and the following articles will discuss: CQL and the programming interface; design and optimization techniques; features of clusters deployed in multiple data centers.
In contrast to a relational database, there is no restriction preventing records (in database terms, rows) from containing columns with the same names as in other records. Column families can be of several kinds, but we will omit that detail in this article. Also, the latest versions of Cassandra add the ability to define and change data (DDL, DML) using the CQL language and to create secondary indices.
A specific value stored in Cassandra is identified by:
Linked to each value is a timestamp: a user-defined number used to resolve conflicts during writes. The higher the number, the newer the column is considered, and during comparison it overwrites older columns.
In terms of data types: keyspace and column family names are strings; a timestamp is a 64-bit number; and keys, column names, and column values are byte arrays. Cassandra also has a concept of data types. These types can optionally be specified when creating a column family. For column names, this is called a comparator; for values and keys, a validator. The first determines which byte values are valid for column names and how to order them. The second determines which byte values are valid for column and key values. If these data types are not specified, Cassandra stores the values and compares them as byte strings (BytesType), which is how they are, in effect, stored internally.
The data types are as follows:
In Cassandra, all write operations are overwrite operations. That is, if a column with the same key and name already exists in the column family and the incoming timestamp is larger than the stored one, the value is overwritten. Written values themselves never change; newer columns simply arrive with new values.
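This last-write-wins rule can be sketched in a few lines. The store layout and function names below are illustrative, not Cassandra's actual internals:

```python
# Hypothetical sketch of the last-write-wins rule: a column value is
# replaced only when the incoming timestamp is greater than the stored one.
def apply_write(store, key, column, value, timestamp):
    """Write (value, timestamp) for (key, column) using last-write-wins."""
    existing = store.get((key, column))
    if existing is None or timestamp > existing[1]:
        store[(key, column)] = (value, timestamp)

store = {}
apply_write(store, "user:1", "name", "alice", 100)
apply_write(store, "user:1", "name", "bob", 50)     # older timestamp: ignored
apply_write(store, "user:1", "name", "carol", 200)  # newer timestamp: wins
print(store[("user:1", "name")])  # → ('carol', 200)
```

Note that the write with the smaller timestamp is silently dropped, even though it arrived later in wall-clock time; this is why timestamp discipline matters in Cassandra applications.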
If you look at Cassandra from a data model design perspective, it's easier to think of a column family not as a table but as a materialized view: a structure that represents the data of some complex query but stores it on disk. Instead of trying to aggregate data with queries, it is better to store everything you might need for a given query in one column family. That is, the design should start not from relationships between entities or objects, but from the queries themselves: which fields should be selected, in what order the records should go, and which related data should be fetched together; all of this should already be stored in the column family. The number of columns in a record is theoretically limited to 2 billion. This is a short digression; you can find more details in the design and optimization techniques article. Now let's dig deeper into the process of saving data to Cassandra and reading it back.
The first strategy distributes data depending on the md5 value of the key (RandomPartitioner). The second considers the bitwise representation of the key itself (ByteOrderedPartitioner). The first strategy is more advantageous in most cases because you do not have to worry about uneven data distribution between servers and similar problems. The second strategy is used in the rare cases when interval requests (range scans) are needed. It is important to note that this strategy is chosen before the cluster is created and cannot be changed without a complete reload of the data.
This approach distributes the data between nodes and ensures that, when a node is added or removed, the amount of data transferred is small. To do this, each node is assigned a token that partitions the set of all md5 key values. Since RandomPartitioner is used in most cases, let's take a closer look at it. As mentioned before, RandomPartitioner calculates a 128-bit md5 hash for each key. To determine which nodes will store the data, Cassandra goes through the node tokens from the smallest to the largest and, when a token value becomes greater than the key's md5 value, selects that node, together with some number of subsequent nodes (in token order), for storage. The total number of selected nodes must equal the replication factor. The replication factor is set for each keyspace and lets you regulate data redundancy. Before adding a node to the cluster, you must assign it a token. How much data will be stored on the node depends on the percentage of keys that fall in the gap between that token and the next one. The whole set of tokens for a cluster is called a ring.
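The token-ring walk can be sketched as a small consistent-hashing model. This is a toy version: real Cassandra assigns tokens explicitly (here they are derived from node names purely for brevity), and replica placement can additionally be rack- and data-center-aware:

```python
import hashlib
from bisect import bisect_right

# Toy model of RandomPartitioner's token ring: each node owns a token on the
# md5 ring; a key's replicas are the first node whose token exceeds md5(key),
# plus the next rf-1 nodes in token order (wrapping around the ring).
def md5_token(key: str) -> int:
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

def replicas(ring, key, rf):
    """ring: sorted list of (token, node_name). Returns rf replica nodes."""
    tokens = [t for t, _ in ring]
    i = bisect_right(tokens, md5_token(key)) % len(ring)  # wrap around
    return [ring[(i + k) % len(ring)][1] for k in range(rf)]

# Illustrative cluster: tokens derived from node names for the sketch only.
ring = sorted((md5_token(n), n) for n in ["node-a", "node-b", "node-c", "node-d"])
print(replicas(ring, "some-key", 3))  # three distinct replica nodes
```

Because the replicas are consecutive on the ring, adding a node between two tokens only moves the keys in that one gap, which is the property the paragraph above describes.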
We will call the node that performs the coordination the coordinator, and the nodes selected to store the record with the given key the replica nodes. Physically, the coordinator can be one of the replica nodes; this depends only on the key, the partitioner, and the tokens.
For each request, both read and write, it is possible to set a data consistency level. For writes, this level affects the number of replica nodes from which confirmation of successful completion (data written) is awaited before control is returned to the user.
For writes, there are these consistency levels:
For reads, the consistency level affects the number of replica nodes that will be read from. For reads, there are these consistency levels:
Thus, it is possible to adjust the latency of read and write operations and to tune the consistency and availability of each type of operation. In fact, availability is directly related to the consistency level of read and write operations, as it determines how many replica nodes can fail while those operations are still confirmed. If the number of nodes from which write acknowledgments are required, plus the number of nodes read from, is greater than the replication factor (W + R > N), we have a guarantee that a newly written value will always be read; this is called strong consistency. Without strong consistency, a read operation may return stale data.
So, with QUORUM consistency on both reads and writes, there will always be strong consistency, with a reasonable balance between read and write latency. With ALL writes and ONE reads, consistency is also strong; reads will be faster and more available, meaning the number of failed nodes at which reads are still performed may be greater than with QUORUM, but writes will require all replica nodes to be up. With ONE writes and ALL reads, consistency is again strong; writes will be faster and write availability will be high, because it is enough to confirm that the write took place on at least one server, while reads will be slower and will require all replica nodes. If the application has no strong consistency requirement, it is possible to speed up both reads and writes and improve availability by setting lower consistency levels.
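The combinations above all reduce to checking W + R > N. A minimal sketch, using the level names from this article (the helper function itself is illustrative):

```python
# Strong consistency holds when writes and reads together touch more than
# N replicas (W + R > N), so every read set overlaps every write set.
def required_nodes(level: str, n: int) -> int:
    """Replica acknowledgments needed for a level, given replication factor n."""
    return {"ONE": 1, "QUORUM": n // 2 + 1, "ALL": n}[level]

def strongly_consistent(write_level: str, read_level: str, n: int) -> bool:
    return required_nodes(write_level, n) + required_nodes(read_level, n) > n

n = 3  # replication factor
print(strongly_consistent("QUORUM", "QUORUM", n))  # True:  2 + 2 > 3
print(strongly_consistent("ONE", "ALL", n))        # True:  1 + 3 > 3
print(strongly_consistent("ONE", "ONE", n))        # False: 1 + 1 <= 3
```

The overlap argument is why the guarantee holds: with W + R > N, at least one node in any read quorum has acknowledged the latest write, and timestamps then pick that newest value.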
Cassandra supports three data recovery mechanisms:
The commit log, in the same way, is divided into parts when it reaches a certain size. This organization limits write speed only by the speed of sequential writes to the hard disk while still guaranteeing data durability. In the event of a node crash, the commit log is read when the Cassandra service starts, restoring all in-memory tables. The speed is thus limited by sequential disk writes, about 100 MB/sec for modern hard drives. For this reason, it is advised to put the commit log on a separate disk drive.
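The append-then-replay cycle can be sketched as follows. The record format here is hypothetical (Cassandra's actual commit log is a binary segment format), but the replay logic mirrors the description above:

```python
import io
import json

# Sketch: every mutation is appended sequentially to the log before it is
# considered durable; after a crash, replaying the log rebuilds the memtable.
def log_mutation(log, key, column, value, ts):
    log.write(json.dumps([key, column, value, ts]) + "\n")  # sequential append

def replay(log_lines):
    """Rebuild the in-memory table by reapplying mutations, newest ts wins."""
    memtable = {}
    for line in log_lines:
        key, column, value, ts = json.loads(line)
        cur = memtable.get((key, column))
        if cur is None or ts > cur[1]:
            memtable[(key, column)] = (value, ts)
    return memtable

log = io.StringIO()  # stand-in for the on-disk commit log file
log_mutation(log, "k1", "c1", "v1", 1)
log_mutation(log, "k1", "c1", "v2", 2)
memtable = replay(log.getvalue().splitlines())
print(memtable[("k1", "c1")])  # → ('v2', 2)
```

Because the log is only ever appended to, the disk head never seeks during writes, which is the source of the sequential-write throughput mentioned above.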
It is clear that sooner or later the memory will fill up, so the in-memory table must also be saved to disk. To determine when to flush, there is a limit on the total size of the in-memory tables (memtable_total_space_in_mb); by default, it is ⅓ of the maximum Java heap size. When the in-memory tables fill beyond this limit, Cassandra creates a new table and writes the old in-memory table to disk as a saved table (SSTable). Once created, an SSTable is immutable and is never modified again. When it is saved to disk, the corresponding parts of the commit log are marked as free, releasing the disk space occupied by the log. Note that the log interleaves data from different column families in the keyspace, so some parts may not be freed, because some areas will correspond to other data still held in the in-memory tables.
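The flush cycle can be illustrated with a toy threshold (a fixed entry count standing in for memtable_total_space_in_mb; names and structure are illustrative only):

```python
# Sketch of the memtable flush cycle: when the in-memory table exceeds a
# size limit, it is frozen, sorted, and persisted as an immutable SSTable.
MEMTABLE_LIMIT = 3  # toy threshold instead of memtable_total_space_in_mb

memtable = {}
sstables = []  # each element is an immutable, key-sorted snapshot

def write(key, value):
    global memtable
    memtable[key] = value
    if len(memtable) >= MEMTABLE_LIMIT:
        # Flush: sort by key and persist as an immutable snapshot.
        sstables.append(tuple(sorted(memtable.items())))
        memtable = {}  # start a fresh memtable; the flushed one never changes

for i in range(7):
    write(f"key{i}", i)
print(len(sstables), len(memtable))  # → 2 1 (two flushes, one key in memory)
```

After each flush, the frozen snapshot is only ever read, never mutated, which is what lets Cassandra write SSTables sequentially and skip locking on them during reads.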
There are three mechanisms to speed up this process: the bloom filter, the key cache, and the row cache:
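Of the three, the bloom filter is the least familiar, so here is a toy version of the kind kept per SSTable. The class and sizes are illustrative, not Cassandra's implementation (which uses murmur-style hashing and tuned bit counts):

```python
import hashlib

# Toy Bloom filter: answers "definitely absent" (skip the disk read) or
# "possibly present" (go read the SSTable). False positives are possible,
# false negatives are not.
class BloomFilter:
    def __init__(self, size=1024, hashes=3):
        self.size, self.hashes, self.bits = size, hashes, 0

    def _positions(self, key):
        # Derive several bit positions per key by salting the hash input.
        for i in range(self.hashes):
            h = hashlib.md5(f"{i}:{key}".encode()).hexdigest()
            yield int(h, 16) % self.size

    def add(self, key):
        for p in self._positions(key):
            self.bits |= 1 << p

    def might_contain(self, key):
        return all(self.bits >> p & 1 for p in self._positions(key))

bf = BloomFilter()
bf.add("present-key")
print(bf.might_contain("present-key"))  # True (no false negatives)
print(bf.might_contain("absent-key"))   # almost certainly False
```

Since most read requests in a multi-SSTable setup touch tables that do not hold the key, this cheap in-memory check avoids the majority of wasted disk seeks.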
It turns out that the number of hard disk seek operations during a read is proportional to the number of saved tables. Therefore, there is a compaction process to purge overwritten data and reduce the number of saved tables. It reads several saved tables sequentially and writes a new saved table that combines the data by timestamp. When the new table is completely written and put into use, the compaction process can release the source tables (the tables that formed it). Thus, if the tables contained overwritten data, this redundancy is eliminated. Clearly, during such an operation the amount of redundancy temporarily increases: the new saved table exists on disk alongside its source tables, so there must always be enough disk space for compaction to run.
Cassandra allows you to choose one of two strategies for performing compaction:
Internally, a column deletion is a write of a special value: a tombstone. When a tombstone turns up during a read, it is skipped, as if the value never existed. Through compaction, tombstones gradually displace the obsolete actual values and may eventually disappear altogether. If, however, columns of real data with even newer timestamps appear, they will in turn supersede the tombstones.
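Compaction and tombstones can be combined into one sketch. This is a simplification: real Cassandra keeps tombstones for a grace period (gc_grace_seconds) before dropping them, which the toy cleanup below ignores:

```python
# Sketch of compaction merging SSTables: for each column the newest
# timestamp wins, and a deletion is just a newer tombstone value that
# shadows older data.
TOMBSTONE = object()  # sentinel marking a deleted column

def compact(*sstables):
    merged = {}
    for table in sstables:
        for col, (value, ts) in table.items():
            if col not in merged or ts > merged[col][1]:
                merged[col] = (value, ts)
    # Simplified cleanup: drop tombstones once the data they shadow is gone
    # (real compaction waits out gc_grace_seconds first).
    return {c: v for c, v in merged.items() if v[0] is not TOMBSTONE}

old = {"name": ("alice", 1), "email": ("a@x", 2)}
new = {"name": (TOMBSTONE, 5)}  # 'name' was deleted after the old write
print(compact(old, new))  # → {'email': ('a@x', 2)}
```

The merge is a sequential pass over already-sorted tables, which is why compaction, like the commit log, stays within sequential-I/O speeds.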
Cassandra supports transactionality at the single record level, that is, for a set of columns with a single key. Here's how the four ACID requirements are met:
If you want to explore more about Apache Cassandra, check our articles: