Escaping From Disco-Era Data Modeling
On StackOverflow, I have seen Cassandra used in a lot of strange ways – particularly when it comes to secondary indexes. I believe much of the confusion that exists is due to the majority of new Cassandra users having their understanding of database indexing grounded in experience with relational databases. They tend to approach a Cassandra data model with the goal of trying find the most efficient way to store the data, and then add secondary indexes in a vain attempt to satisfy their potential query patterns.
This approach usually manifests itself in questions like this:
“Here is my schema:“
CREATE INDEX user_fromid ON chats (fromid);
CREATE INDEX user_toid ON chats (toid);
CREATE INDEX user_adid ON chats (adid);
“Why doesn't my SELECT query work?“
WHERE fromid='test' AND toid='test1'
ORDER BY messagedatetime DESC;
Or (for the sake of argument) replace the “ORDER BY” clause with “ALLOW FILTERING” and ask:
“Why is my SELECT query so slow?“
This model is representative of what you would commonly see in a relational database. It does a great job of storing the data in a logical way, keyed with a unique identifier (messageid), and clustered by a date/time column. Great that is, until you want to query it by anything other than each row’s unique identifier. So the next logical leap is made, and the next thing you know, three or four (or more) secondary indexes are created on this column family (table).
Secondary indexes in Cassandra are not meant to be used as a “magic bullet,” allowing a column family to be queried by multiple high-cardinality keys. Indexes on high-cardinality columns (columns with a large number of potential values) perform poorly, as many seeks will be required to filter a large number of rows down to a small number of results. Likewise, indexes on extremely low-cardinality columns (booleans or columns with a small number of values, such as gender) perform badly, due to the large number of rows that will be returned. Also, secondary indexes do not perform well in large clusters, as result sets are assembled from multiple nodes (the more nodes you add, the more network time you introduce into the equation).
There are other important points concerning secondary index use, such as the conditions under which they are intended to be used. These (and those mentioned above) are detailed in a DataStax document titled When To Use An Index. When helping others with Cassandra data modeling, this is the document that I will most often refer people to. Anyone considering using secondary indexes should read that document thoroughly.
Cassandra queries work best when given a specific, precise row key. The use of additional, differently-keyed column families populated with the same (redundant) data will perform faster than a query which uses a secondary index. Therefore, the proper way to solve this problem is to create a column family to match each query. This is known as query-based modeling. Here is one way to design a column family to suit the afore-mentioned query:
PRIMARY KEY((fromid, toid), messagedatetime, messageid));
With this solution, the data will be partitioned on fromjid and tojid, and clustered by messagedatetime. Messageid is added to the PRIMARY KEY definition to ensure uniqueness. This will allow the user’s query to be keyed on fromjid and tojid, while still allowing an ORDER BY on the clustering column messagedatetime.
Of course, it’s hard to change the way people think within the bounds of a single conversation. Here is a typical response to the above solution:
“I know my original design somewhat violates the whole NoSQL concept, but it saves a lot of memory.“
That is a very 1970′s way of thinking. Relational database theory originated at a time when disk space was expensive. In 1975, some vendors were selling disk space at a staggering eleven thousand dollars per megabyte (depending on the vendor and model). Even in 1980, if you wanted to buy a gigabyte’s worth of storage space, you could still expect to spend around a million dollars. Today (2014), you can buy a terabyte drive for sixty bucks. Disk space is cheap; operation time is the expensive part. And overuse of secondary indexes will increase your operation time.
Therefore, in Cassandra, you should take a query-based modeling approach. Essentially, (Patel, 2014) model your column families according to how it makes sense to query your data. This is a departure from relational data modeling, where tables are built according to how it makes sense to store the data. Often, query-based modeling results in storage of redundant data (and sometimes data that is not dependent on its primary row key)…and that’s ok.
In summary, we have managed to free ourselves from bell-bottoms, disco, and shag carpeting. Now it is also time that we free ourselves from the notion that our data models need be normalized to the max; as well as the assumption that all data storage technologies are engineered to work with normalized structures. With Cassandra, it is essential to model your data to suit your query patterns. If you do that properly, then you shouldn’t even need a secondary index…let alone three or four of them.
DataStax (2014). When To Use An Index. Retrieved from: http://www.datastax.com/documentation/cql/3.1/cql/ddl/ddl_when_use_index_c.html on 09/29/2014.
McCallum, J. (2014). Disk Drive Prices (1955-2014). Retrieved from: http://www.jcmit.com/diskprice.htm on 09/29/2014.
Patel, J. (2012). Cassandra Data Modeling Best Practices, Part 1. Retrieved from: http://www.ebaytechblog.com/2012/07/16/cassandra-data-modeling-best-practices-part-1 on 10/03/2014.