HBase vs Cassandra in the Big Data Game
In this article we'll compare Cassandra vs HBase and see which use cases apply for either approach and where each falls short against the other.
Utilizing big data is all about taking vast information dumps on their own terms. A typical information reservoir is likely to consist of multiple machines that may contain a huge variety of data types and formats — from video and image files with minable metadata through to more traditional or legacy storage types, such as Excel or CSV files.
Additionally, a big data file system might host an almost endless variety of possible other types of database dumps: server log files, JSON-based usage stats, system reports, and IoT device reports in multiple formats, among many other possible information models and formats.
Such a variegated collection of information is easier to 'map' than to rationalize in the same way that we might re-order a smaller collection of files, or migrate less voluminous data to a new and unified format.
Since we can't realistically or economically impose order on information at this scale, we need to find ways to parse and interpret it in order to gain insights and extract essential statistics and other types of intelligence.
Random Access for Big Data
In our latest Hadoop overview, we looked at how Apache Hadoop collects a potentially vast number of small machines into a single virtual file system called HDFS, and then uses a Java-based approach (MapReduce) to slowly extract core facts from the cluster.
We have also examined how Apache Spark uses RAM to traverse the same huge mass of data more quickly, and how streaming data models can inject high-volume data into analytics dashboards and other real-time systems.
However, these are ponderous and rigid approaches for querying big data, even when using Spark and other streaming architectures. Considering the amount of work and the typical execution times necessary to program and run a batch processing workload, we need to be fairly sure of what it is we are trying to find out in order to justify the effort involved.
Our use case may not fit into a typical batch job. We might want to extract different information than the patterns programmed into a dashboard or analytics system; or to make a one-off result comparison that the dashboard algorithms can't accommodate; or to query the data using SQL. Or we might need an answer quickly for a question that will never be repeated and is not worth great programming expenditure.
In short, we need low-latency random access to the data repository, as if we were hitting Ctrl-F on a spreadsheet. We need a result, not a report, and—of course—we need it now.
HBase and Cassandra are two popular database management systems used in data science consulting, capable of providing this more agile and disposable interrogation layer for HDFS storage. Though they are stablemates from the Apache repository, they are effectively rivals for your time and resources. Let's compare their capabilities, limitations and areas of similarity.
How HBase Treats Data
HBase is an open-source non-relational distributed database, developed as an adjunct to the Hadoop development project. It runs over a Hadoop HDFS file system or Apache Alluxio, and was derived originally from the architecture of Google Bigtable. It can run both ad-hoc queries and batch processing via MapReduce.
HBase Data Model
HBase uses a 'column family' system that is less rigid than traditional relational database schemas, and scales more easily.
A relational database structure comes with certain expectations regarding what kind of data will populate the fields, and incoming data usually needs to conform to its schema in order to avoid import errors or time-consuming manual intervention after the fact.
Here's a very simplified example: we're trying to import an old generic CSV file into a relational database:
The relational database (RD) doesn't understand that Brian Johnson (row 3) did not provide a first name when he signed up, since the form did not make that field mandatory. The RD now has to insert 'null' into that empty field, bloating the database size (presuming that the RD even has a programmed routine that will perform this task instead of throwing an error).
Neither does the RD like that Kate Brown's surname (second row) is not enclosed in escape quotes, or that Lynn (fourth row) confused the 'surname' field with the 'phone number' field, and the validation algorithm did not catch this error when she submitted the form. The RD is expecting escaped text for that column, not integers or raw text.
Worse, some of the rows have entries for a third data column ('photo') that the RD has not been able to identify or parse from the key in the CSV. Since the RD therefore has no column to accommodate these entries, it can't incorporate them at all.
The non-relational database (NRD) is only expecting bytes of some kind—a number, text, escaped text, a BLOB type that could represent a picture or even a video file… anything will do. If a field is empty, that's also okay.
HBase uses column families that can be assembled into complete tables, as with the above example. There, HBase was able to create a third, unspecified column on the fly at import time, so that no data was lost or misinterpreted. Once the unnamed column was identified by an operator as 'photos', a column family called 'photos' was manually imposed on the data in a second pass and for subsequent imports (column families must be defined in advance in both HBase and Cassandra, and cannot be 'retro-fitted' onto existing workload results).
With a traditional relational database model, all columns must be defined in advance, and all fields pre-typed (e.g. integer, text, TinyInt). At petabyte scale, all those 'nulls' add up, too, whereas an empty field in HBase is just…nothing.
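To make the 'an empty field is just nothing' point concrete, here is a minimal sketch in Python. It is not HBase code: the `SparseTable` class and the column names are hypothetical stand-ins, assuming cells are addressed by a row key plus a 'family:qualifier' column name.

```python
from typing import Dict, Optional, Tuple

# Hypothetical in-memory stand-in for a column-family table: only cells that
# actually hold a value are stored, keyed by (row key, "family:qualifier").
Cell = Tuple[str, str]


class SparseTable:
    def __init__(self) -> None:
        self.cells: Dict[Cell, bytes] = {}

    def put(self, row: str, column: str, value: Optional[str]) -> None:
        # An empty field is simply skipped; no placeholder is written.
        if value is None or value == "":
            return
        # Every value is stored as raw bytes, whatever it represents.
        self.cells[(row, column)] = value.encode("utf-8")

    def get(self, row: str, column: str) -> Optional[bytes]:
        return self.cells.get((row, column))


table = SparseTable()
table.put("00001", "name:first", "Kate")
table.put("00001", "name:last", "Brown")
table.put("00002", "name:first", "")         # no first name supplied
table.put("00002", "name:last", "Johnson")
table.put("00002", "photos:profile", "<binary image data>")  # column added on the fly

print(len(table.cells))                   # 4 cells stored, not 6
print(table.get("00002", "name:first"))   # None: the empty field occupies no space
```

A relational table with the same three columns would hold six fields for these two rows, two of them null.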
Field types can also be customized with the Apache data serialization system Avro.
None of this means that an HBase-style NoSQL schema understands the incoming data better than a relational model. If anything, it understands significantly less about it, at least by default. But it leaves validation and interpretation for higher level mechanisms, instead prioritizing the accurate capture of the data.
Examining Previous Data Versions in HBase
With an HBase-style approach, it is also possible to see how fields, rows, and columns have changed over time, based on timestamp data associated with imported material.
This kind of chronological traversal is native to HBase, and rather more difficult to set up in a traditional relational database.
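As a rough illustration of timestamped versions, the sketch below keeps every write to a single cell and answers 'what was the value as of time T?'. The `VersionedCell` class is hypothetical and only mimics the idea; it is not the HBase client API.

```python
import bisect
from typing import List, Optional


class VersionedCell:
    """Keeps every write to one cell, ordered by timestamp."""

    def __init__(self) -> None:
        self.timestamps: List[int] = []
        self.values: List[bytes] = []

    def put(self, timestamp: int, value: bytes) -> None:
        # Insert in timestamp order so that lookups can use binary search.
        index = bisect.bisect_right(self.timestamps, timestamp)
        self.timestamps.insert(index, timestamp)
        self.values.insert(index, value)

    def get(self, as_of: Optional[int] = None) -> Optional[bytes]:
        if not self.timestamps:
            return None
        if as_of is None:
            return self.values[-1]  # newest version
        # Newest version written at or before `as_of`, if any.
        index = bisect.bisect_right(self.timestamps, as_of)
        return self.values[index - 1] if index else None


cell = VersionedCell()
cell.put(1000, b"Brown")
cell.put(2000, b"Brown-Smith")

print(cell.get())       # b'Brown-Smith'
print(cell.get(1500))   # b'Brown'
print(cell.get(500))    # None
```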
RowKeys in HBase
There is an additional first column in an HBase table that we have not included in the above examples, since the values can be rather long, and rarely conform to a standard template.
The HBase RowKey provides a unique identifier for a row in a table, and will default to numbers (00001, 00002, etc.).
However, a good RowKey design will usually employ additional facets, such as ASCII characters (automatically translated to Unicode for the benefit of Java) concatenated from subsequent field contents in the table, or from other possible facets, such as a Unix timestamp or a salting algorithm.
The latter two, for instance, will ensure unique RowKey values, essential for the performance and integrity of an HBase operation.
RowKey naming and reference operations should be distributed across multiple nodes in order to avoid 'hotspotting', where one node receives so many requests for a local data item that it becomes overwhelmed, similar to the way a website might crumble under unexpected volumes of traffic.
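One common way to combine these facets is to prefix the key with a salt bucket and append a timestamp. The sketch below is an assumption-laden example rather than a recommended design: the bucket count, the `|` separator, and the `salted_row_key` helper are all hypothetical.

```python
import hashlib

NUM_BUCKETS = 8  # assumed number of salt buckets, e.g. one per region server


def salted_row_key(user_id: str, unix_ts: int) -> str:
    # Derive the salt from the natural key so that readers can recompute it.
    digest = hashlib.md5(user_id.encode("utf-8")).digest()
    bucket = digest[0] % NUM_BUCKETS
    # Zero-pad the timestamp so that keys sort chronologically as strings.
    return f"{bucket:02d}|{user_id}|{unix_ts:010d}"


print(salted_row_key("kate", 1700000000))
print(salted_row_key("brian", 1700000000))
```

Because the bucket prefix depends on the user rather than on arrival time, writes that arrive together are spread across buckets instead of landing on one node, while all rows for one user stay adjacent and in time order.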
Memory Handling in HBase
Like Spark, HBase writes back to disk when in-memory capacity is reached, but also has several memory features to optimize read and write operations. These include MemStore (which buffers writes in memory until a flush to disk is necessary), BlockCache (which caches frequently read blocks of data), and WAL ('Write Ahead Log', which records operations before they are written to disk, allowing data recovery if the write process fails).
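The interplay between the log, the in-memory buffer, and the files on disk can be sketched as follows. This is a toy model under stated assumptions: `RegionStore`, its tiny flush threshold, and the list-based 'files' are hypothetical, and BlockCache is left out for brevity.

```python
from typing import Dict, List, Optional, Tuple


class RegionStore:
    """Toy write path: log first, buffer in memory, flush when full."""

    def __init__(self, flush_threshold: int = 3) -> None:
        self.flush_threshold = flush_threshold        # assumed, tiny for illustration
        self.wal: List[Tuple[str, str]] = []          # stand-in for the Write Ahead Log
        self.memstore: Dict[str, str] = {}            # stand-in for MemStore
        self.store_files: List[Dict[str, str]] = []   # stand-in for files on disk

    def put(self, row_key: str, value: str) -> None:
        self.wal.append((row_key, value))   # 1. record the edit for recovery
        self.memstore[row_key] = value      # 2. apply it in memory
        if len(self.memstore) >= self.flush_threshold:
            self.flush()

    def flush(self) -> None:
        # 3. write the sorted buffer out as an immutable file, then clear it
        self.store_files.append(dict(sorted(self.memstore.items())))
        self.memstore.clear()
        self.wal.clear()  # flushed edits no longer need replaying

    def get(self, row_key: str) -> Optional[str]:
        if row_key in self.memstore:
            return self.memstore[row_key]
        for store_file in reversed(self.store_files):  # newest file first
            if row_key in store_file:
                return store_file[row_key]
        return None


store = RegionStore(flush_threshold=2)
store.put("row-a", "1")
store.put("row-b", "2")   # reaches the threshold and triggers a flush
store.put("row-c", "3")

print(len(store.store_files))  # 1
print(store.get("row-a"))      # 1, read back from the flushed file
print(store.wal)               # [('row-c', '3')], the only edit not yet flushed
```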
Regions and Region Servers in HBase
HBase is dealing with non-centralized nodes—individual machines in a large HDFS cluster. Performance and throughput will vary on each machine according to its specs and the task assigned to it in a workload. Therefore, HBase distributes its own database storage into 'regions', where that part of the database relating to the node's information is stored on the node itself.
HBase assigns a 'region server' to each node. This negotiates with the device's node manager to either read from or write information to that node as necessary. Inevitably some tables will be split among different nodes, and so HBase assigns RowKeys to the tables and uses these as split-points.
It's important for the integrity and performance of a distributed file system that these negotiations take place locally, rather than as external directives from a single operational mechanism. Therefore, the region servers in HBase have autonomy to direct the local processing operations until such time as they report back progress and/or results to the master process.
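The use of RowKeys as split points can be pictured as a sorted list of boundaries, with each gap between boundaries served by one region server. The split points and server names below are hypothetical, chosen only to show the lookup.

```python
import bisect
from typing import List

# Hypothetical split points and server names, for illustration only.
SPLIT_POINTS: List[str] = ["g", "n", "t"]
REGION_SERVERS: List[str] = ["rs-1", "rs-2", "rs-3", "rs-4"]


def region_for(row_key: str) -> str:
    # Region i holds keys from split point i-1 (inclusive) up to split
    # point i (exclusive); bisect_right finds i for a given key.
    return REGION_SERVERS[bisect.bisect_right(SPLIT_POINTS, row_key)]


print(region_for("brown"))    # rs-1
print(region_for("johnson"))  # rs-2
print(region_for("zed"))      # rs-4
```

Three split points yield four regions, so a client holding this small map can route a request straight to the server that owns the key.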
The HBase (HMaster) Master Server
The HBase Master Server runs on a NameNode, a high-level process that oversees the maintenance and updating of file system metadata throughout the write operations undertaken in a workload.
The Master Server (HMaster) is deceptively named, since it does not deal directly with the progress of the region servers, but acts instead as an operational center. Coordination is handled by the ZooKeeper cluster manager, which HBase relies on to maintain communications between the region servers and other clients, and to synchronize the distribution and allocation of workloads. ZooKeeper also provides API access to various common services, including group services and configuration management.
The Master Server assigns and monitors regions, provides load-balancing services across region servers, oversees the catalog tables that provide client access to keypairs, orchestrates failover operations in the event of node failure, writes updates to tables and metadata throughout the operation, and performs administrative operations on the WAL (see 'Memory Handling in HBase' above), among other tasks.
HBase is written in Java, which is not the most welcoming, user-friendly, or popular among current programming languages. However, HBase provides a REST interface and a Thrift gateway that allow external programs and architectures to control its operations from a higher, abstracted level.
HBase can also be accessed from a number of external non-native SQL interfaces, including Impala, Phoenix, Presto, and Hive.
The HBase shell can additionally be scripted from Unix; and since the shell is written in JRuby, that language can also be used as a bridge to HBase.
HBase vs Cassandra: Some Disadvantages of HBase
Single Point of Failure
At odds with the distributed design philosophy of Hadoop, the Master Server workflow model is a point of vulnerability for HBase. If HMaster goes down on a standard HBase configuration, that's the end of the show.
Clumsy Authentication/Security Provisioning
All HBase users can write to all tables in a default deployment. Where HBase is operating over HDFS, it is seen as a single user in HDFS's far more granular authentication and access capabilities. However, HBase can be configured for user authentication at the RPC level and an Access Control List (ACL) enabled for that HBase user.
Where HDFS is the target system, this may entail some duplication of authentication functionality, but at least the two systems can have different access lists on a per-user basis. HBase can also employ Kerberos as an authentication mechanism, the same as HDFS on Hadoop. However, running a framework like Kerberos on both HDFS and HBase will entail a performance penalty for obvious reasons.
No Joins or Normalization in HBase
Joins and normalization, very useful SQL querying functions, are not supported in HBase. You'll need to access HBase via Apache Phoenix or another third-party program capable of replicating this functionality.
No Transaction Support
Distributed database systems have multiple operations running at any point in a query process, making transactions (which require definitive information) difficult to accomplish. The latest version of data may be pending a write, or held back in cache for performance reasons, or subject to alteration throughout the process. However, transactions can be made possible by querying HBase through secondary solutions such as Apache Phoenix.
Unfriendly Programming Environment and Diminishing Industry Take-Up
HBase has been trending downwards in popularity in recent years, increasing the scarcity and cost of dedicated developers. Some sections of the industry ascribe this to the rise of less operationally complex and more user-friendly alternatives.
Though HBase claims Facebook and Twitter as end-users, the former migrated its Messages data from HBase in 2018, while Twitter has moved much of its core data to Google Cloud, not the native environment for HBase.
Cassandra was developed at Facebook to improve search capabilities across the company's inbox system. The project was released as open source in 2008, and by 2010 had been adopted as a top-level Apache project.
Cassandra shares many features and scope of functionality with HBase, including high scalability, performant high-volume turnover, tolerance for a wide variety of data import states, node replication, similar write path models, and ability to negotiate a huge range of possible data types in a very high-volume cluster.
So, let's take a look at where the differences occur between the two architectures.
Cassandra Data Model
Like HBase, Apache Cassandra is a column-oriented distributed database system. However, the two systems define 'columns' rather differently. A column in Cassandra is similar, logically, to a 'cell' in HBase. We can take a look at that aspect now.
In Cassandra, a 'column family' is an aggregation of rows that would normally be concatenated into a table in HBase (represented by the enclosure around the rows in the illustration below).
A Cassandra row key (also known as a primary key—the first column in the illustration below) encapsulates column headings as well as column content. Additionally, rows can have different numbers of columns:
In HBase, headings that cover multiple columns are called 'column families' (see earlier tables above), whereas 'column families' in Cassandra might more logically be called 'column assemblies' or perhaps 'column collections'.
In Cassandra, HBase-style 'column-grouping' functionality is instead called a 'super column' (colored in illustration below):
These are not mere semantic differences; Cassandra's logical data model defines some of the key differences between the two systems in terms of performance, flexibility and data organization and management.
Unlike HBase's master-slave model, Cassandra delegates and distributes data traversal and processing across its functional architecture (see below), and therefore has no single point of failure that can take down an operation.
Delegated Availability (Heartbeat) Monitoring via Gossip
Intra-system communications between nodes in Cassandra are handled without hierarchical dependence via the peer-to-peer protocol Gossip, a highly distributed and decentralized methodology that is similar in structure to torrent file-sharing in many ways; for instance, certain nodes are designated as seed nodes, which newly joining nodes contact first in order to learn about the rest of the cluster. Seed nodes are listed by the operator in the cassandra.yaml configuration file rather than chosen automatically.
Under Gossip's Failure Detection feature, nodes in the file system are constantly sending TCP-style packets to each other to confirm uptime for adjacent nodes. Each node passes on not only its own condition but also that of other nodes that it has communicated with.
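The way second-hand state spreads can be sketched with a simple merge rule: when two nodes talk, the receiver keeps whichever heartbeat counter is higher for each node it hears about. This is a simplified model, not Cassandra's actual message format; the `gossip` function and the counters are hypothetical.

```python
from typing import Dict

# Each node's view: node name -> latest heartbeat counter it has heard of.
View = Dict[str, int]


def gossip(sender: View, receiver: View) -> None:
    """Merge the sender's view into the receiver's, keeping the freshest entry."""
    for node, heartbeat in sender.items():
        if heartbeat > receiver.get(node, -1):
            receiver[node] = heartbeat


view_a: View = {"a": 5, "b": 2}
view_b: View = {"b": 7, "c": 1}
view_c: View = {"c": 4}

gossip(view_a, view_b)  # b learns about a
gossip(view_b, view_c)  # c learns about a and b without ever contacting a

print(view_c)  # {'c': 4, 'b': 7, 'a': 5}
```

A node whose counter stops increasing in everyone's view for long enough can then be suspected of being down, without any central monitor.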
Ordered Partitioning and Replication
Cassandra assigns a number to each cluster node, known as a 'token' and collectively as a token ring. Data is partitioned evenly by hashed multi-column primary keys—an abstraction of order that sits above the way data may or may not be physically disposed within the cluster.
To improve read performance, associated data partitions are written in close physical proximity to each other among the various machines hosting the data on the cluster.
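The token ring can be sketched as a sorted list of tokens: a key is hashed to a position on the ring, the node owning the next token clockwise stores it, and the following nodes hold the replicas. The ring size, node names, and the use of MD5 as the hash are assumptions made for this sketch, not Cassandra's actual partitioner settings.

```python
import bisect
import hashlib
from typing import List

RING_SIZE = 2**32  # assumed token space for the sketch

# Hypothetical nodes with evenly spaced tokens.
TOKENS: List[int] = [0, RING_SIZE // 4, RING_SIZE // 2, 3 * RING_SIZE // 4]
NODES: List[str] = ["node-a", "node-b", "node-c", "node-d"]


def token_for(partition_key: str) -> int:
    # Hash the key to a position on the ring (MD5 is a stand-in hash here).
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")  # always below RING_SIZE


def replicas_for(partition_key: str, replication_factor: int = 3) -> List[str]:
    # The first replica is the node owning the first token at or after the
    # key's token, wrapping around; further replicas follow clockwise.
    start = bisect.bisect_left(TOKENS, token_for(partition_key)) % len(TOKENS)
    return [NODES[(start + i) % len(NODES)] for i in range(replication_factor)]


print(replicas_for("kate|brown"))
print(replicas_for("brian|johnson"))
```

Because placement depends only on the hash, any node can work out where a key lives without consulting a master.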
The Speed of Write Operations
Cassandra is not hindered by a secondary tier of organizational management (ZooKeeper in HBase), and does not share HBase's complex and sometimes circuitous write path model; and unlike HBase, Cassandra can write simultaneously to cache and logs, which usually brings an improvement in write performance by comparison.
Cassandra Query Language (CQL)
In contrast to HBase, which relies on a third-party ecosystem for complex querying, Cassandra offers a very friendly variant of declarative SQL called Cassandra Query Language (CQL).
By default, Cassandra's security provisioning is turned off, which even Apache acknowledges as a high-risk factor. Like HBase, Cassandra can use Kerberos as an authentication framework, and suffer the same performance penalties for running Kerberos twice (it will already be running in HDFS) in order to specify differing levels of access between the client and the cluster.
Cassandra also has internal authorization protocols that can be enabled by uncommenting two parameters in the cassandra.yaml file, in a default installation. It is then possible to assign pre-defined access levels to users. Without this configuration, a user/pass combination will be needed each time CQL attempts to access the database.
HBase vs Cassandra: Some Disadvantages of Cassandra
Increased, Time-Consuming Replication
Cassandra's data redundancy requirements, though configurable, are generally higher than those of HBase, in part because Cassandra has no master management architecture to ensure data availability across a volatile cluster and relies on a greater level of replication as a failover measure.
Although partitioning data via hashed multi-column primary keys gives Cassandra a performance fillip over HBase on HDFS, it also increases the need for replication (since an entire rack may fail and take down multiple nodes comprising a data partition).
Data Consistency and Higher Performance: Pick One
In Cassandra, because a higher level of replication increases latency, data consistency is a negotiable commodity; you can lower the replication levels and trade consistency for performance.
Where you're running sequentially through the data in search of aggregate results, this can be an acceptable compromise (even if such workloads are more suited to slower but surer solutions such as Hadoop/MapReduce or Spark on HDFS).
But where data-critical transactions need to be implemented, perhaps through CQL or third-party architectures, accuracy must become more important than speed. The speed difference between Cassandra and HBase will drop notably in that scenario.
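The trade-off can be expressed with the usual quorum arithmetic for replicated stores: if a write waits for W replicas and a read consults R replicas out of N copies, a read is guaranteed to see the latest write only when R + W > N. The helper names below are hypothetical; the sketch shows the arithmetic, not Cassandra's configuration syntax.

```python
def quorum(replication_factor: int) -> int:
    """Smallest majority of the replicas."""
    return replication_factor // 2 + 1


def overlaps(replication_factor: int, write_acks: int, read_acks: int) -> bool:
    """True when every read set shares at least one replica with every write set."""
    return read_acks + write_acks > replication_factor


N = 3
print(quorum(N))                            # 2
print(overlaps(N, quorum(N), quorum(N)))    # True: consistent, but each operation waits on 2 nodes
print(overlaps(N, 1, 1))                    # False: fastest, but a read may miss the latest write
print(overlaps(N, N, 1))                    # True: fast reads, slow writes
```

Waiting on more replicas per operation is what erodes the speed advantage described above.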
Faster Performance (over HBase) Not Guaranteed
Cassandra's often-lauded speed benefit over HBase is also greatly affected by the way that either system is configured. Lowering the block size in HBase can equalize performance between the two systems where random access is important, whereas increasing the block size for sequential (non-random) read operations also puts HBase and Cassandra very near to each other in terms of performance.
Cassandra was designed for write performance, which gives it an advantage in initial data ingestion (see below). However, potential bottlenecks in read-heavy operations can re-balance that initial advantage.
CQL Is a Limited Querying Solution
The fact that the primary database key has already been co-opted for data partitioning makes CQL's functionality a subset of what is usually available in SQL.
A query can't contain an OR or NOT logical operator, and WHERE can only be run on non-primary key columns that have secondary indexes. Additionally, queries using joins, intersection, or union are unsupported, and the date/time operator has notable limitations compared to SQL.
Therefore, more complex queries will require a higher-level architecture running over Cassandra. Two popular choices in this regard are Spark and the DataStax ODBC driver. Performance penalties are likely to apply, however.
Conclusion: The Difference Between HBase and Cassandra
Both HBase and Cassandra are well-suited to batch-based processing of time-series data, which we covered in our overview of machine learning in text analysis, from sources such as weblogs, IoT devices, and many kinds of statistical data from sectors such as epidemiology, meteorology, and social sciences (population statistics, etc.). Both are also well-suited for non-time-critical traversal and analysis of medical and civic datasets.
Both systems are primarily written in Java. HBase's shell is built on JRuby, whereas Cassandra's cqlsh shell is written in Python. In terms of security, both architectures provide granular access as necessary. However, each defaults to minimal security standards, relying on Kerberos or other security architectures at the file-system/cluster level. Both HBase and Cassandra initially presume a single-user scenario where these extra security mechanisms would be redundant.
In terms of scalability and DevOps, the two architectures are generally reported to be on a par, though anecdotal evidence suggests that Cassandra's garbage collection routines and slow repair processes can create some DevOps headaches in the long term.
Your choice will obviously be affected by the demands and structure of a particular project. With that in mind, let's look at some specific areas where one system wins out over the other in this Cassandra vs HBase matchup:
Data Consistency: HBase
Cassandra's improved speed over HBase is not inherent or reliable, but is instead a function of its ability to prioritize performance over data integrity. Whether this 'sixth gear' is worth the risk is left to the end user, for whom a higher margin of error may be an insignificant issue (e.g. in calculating broad aggregate values instead of highly granular data points).
OLTP (On-line Transaction Processing): Cassandra
Since Cassandra was designed to prioritize write performance, it is more suited to real-time or near-real-time analytics systems, as well as to an On-line Transaction Processing (OLTP) pipeline. If data consistency is an issue, however, there are some associated penalties (see 'Data Consistency' above).
OLAP (On-line Analytical Processing): HBase
HBase was conceived, along with Hadoop, to enable batch-based processing of historical data—an on-line analytical processing (OLAP) pipeline. HBase is accurate and usually quite fast, considering the vast sums of data it was designed to address. By the time Cassandra has been tuned to be as accurate as HBase for such operations, it no longer has any notable speed advantage.
Scalability and DevOps: HBase
In theory, scaling a cluster up in either HBase or Cassandra is no more complicated than adding more nodes to a cluster. However, HBase does not need to re-rationalize data partitions in order to grow the cluster. It also has a more consistent version history than Cassandra, which has re-tooled some of the most fundamental aspects of its data management systems several times.
User Code Insertion: HBase
HBase offers 'co-processors' that allow user code to run within HBase routines, effectively providing stored procedures and triggers. These features would normally only be available in a relational database model, and Cassandra has no native provision for this.
Data Ingestion: Cassandra
Cassandra's consistent and deliberate focus on faster write speeds means that it can create an initial data store more quickly than HBase.
Supported Programming Languages: Cassandra
Documentation and Crowd-sourced Support: Cassandra
Though considered by Apache to be 'a work in progress', Cassandra's documentation is more comprehensive and searchable than the book-like reference guide supplied for HBase.
At the time of writing, Cassandra has more than double the number of questions on Stack Overflow compared to HBase. Whether or not that's a good sign is, of course, open to interpretation!
Internal Security Architecture: HBase
Cassandra provides templated roles with pre-defined levels of user access, similar to popular operating systems. HBase instead allows an administrator to set object-level access rights to users.
Acknowledgement to Jesse Anderson for examples of columnar tables in relational and non-relational databases.