Querying by nulls in Cassandra
Recently, I was answered this question on StackOverflow from a user wanting to query Cassandra for column values that were not null. While I pretty much knew the answer to this as I wrote it, I did experiment a little with my local cluster just to see what would happen in certain scenarios. First of all, I needed to jump over to my trusty “stackoverflow” keyspace:
Yes, everyone needs a “stackoverflow” keyspace as a sandbox to emulate the weird ways in which Cassandra is sometimes used. But I digress. From there, I created a new table:
And gave it some data:
cqlsh:stackoverflow> INSERT INTO nullTest (id,name) VALUES ('2',null);
cqlsh:stackoverflow> INSERT INTO nullTest (id,name) VALUES ('3','Wash');
I could then indiscriminately query the data like this:
id | name
3 | Wash
2 | null
1 | Jayne
In Cassandra, your WHERE clause can only contain primary key columns (unless you have created a secondary index). However, primary key columns will not allow null values to be inserted. But I thought I'd give that one a try anyway:
Bad Request: Invalid null value for partition key part id
Thought so. Ok, but what if we had a composite primary key, and the null value was in the clustering column(s) and not in the partition key? For that, I'll need to create a new table:
cqlsh:stackoverflow> INSERT INTO nullTest2 (id,id2,name) VALUES ('1','1','Simon');
cqlsh:stackoverflow> INSERT INTO nullTest2 (id,id2,name) VALUES ('1',null,'River');
Bad Request: Invalid null value for clustering key part id2
And there you have it. You cannot query by nulls in Cassandra (like you can in a relational database), because:
- Cassandra requires all fields in the WHERE clause to be part of the primary key.
- Cassandra will not allow a part of a primary key to hold a null value.
- While Cassandra will allow you to create a secondary index on a column containing null values, it still won't allow you to query for those null values.
- Cassandra does not support the use of NOT or not equal to (!=) operators in the WHERE clause. This is worth mentioning, as the StackOverflow question mentioned above was actually about how to query rows where certain column values were not null. Since they can't be null in the first place, querying not null would simply give you everything. Assuming of course that you could query by not null, which you cannot.
An alternative solution would be to intercept null values at the application level, and insert empty strings instead. You could certainly query by those:
cqlsh:stackoverflow> SELECT * FROM nullTest WHERE name='';
id | name
Without having tried it at scale, it is possible that using a secondary index (of an appropriate level of cardinality) may not perform as badly as you might think. It certainly would be the next logical step to try. Of course, you still can't query by "not empty" so that may not be effective.
At the end of this post, I mentioned the use of secondary indexes in Cassandra. Having done so, I feel obligated to reference this part of the official DataStax Cassandra documentation: When to use an index. Secondary indexes are not something to be used lightly, and are not some sort of "magic bullet" to bridge the short-comings of your data model. You can almost always avoid the need for a secondary index by proper query table modeling.
Inserting null values was done here for demonstration purposes only. It is something you should do your best to avoid, as inserting a null column value WILL create a tombstone. Likewise, inserting lots of null values will create lots of tombstones.