Spark vs. Hadoop MapReduce: Data Processing Matchup
Both Spark and Hadoop deal with big data, but in different ways. Learn about the differences and similarities between these two powerful frameworks.
An incredible number of big data technologies have sprung up within the last decade. Driven by ever-increasing appetites for data, these products aim to offer long-term solutions that should satisfy even the most data-intensive businesses and industries.
This variety puts extra pressure on enterprises that have to decide which big data solution to use (if any). The price of a mistake is high, and mistakes come in different shapes: the wrong solution can lead to lost productivity, a limited capacity to scale, and even digital security issues. So it’s essential to have at least a basic understanding of the pros and cons of different big data products.
We decided to help you out by first tackling Spark vs. Hadoop, since these technologies are often discussed together. This article provides a high-level overview of the differences and similarities between these frameworks, along with some notes on how their practical uses diverge.
Understanding the Difference
First, we need to establish that comparing Hadoop as a whole to Spark is meaningless. Hadoop is an ecosystem that includes its own storage layer, the Hadoop Distributed File System (HDFS). Spark has no storage layer of its own and is often deployed on top of Hadoop to use HDFS as data storage. The two platforms work well in conjunction, and there are many ways to benefit from different variations of this pairing. Although built with different computational strategies in mind, Hadoop and Spark are driven by the same goal of enabling faster, more scalable, and more reliable enterprise data processing.
So in this review, we’ll focus specifically on comparing the data processing capabilities of these two products. To make the comparison fair, we’ll pit Spark against Hadoop MapReduce, the part of Hadoop responsible for data processing. So when we say ‘Hadoop’ further down, we are in fact talking about Hadoop MapReduce.
Potential to Scale
It’s important to note that both Hadoop and Spark were built with scalability in mind. You can’t have a modern big data framework without this philosophy as its backbone. But given the nature of these two products, Hadoop might be easier to scale, since it was made to run on commodity hardware, where disk space is abundant.
Image Source: Apache Hadoop and Hive presentation by Darius Hibbard
You can’t say the same about Spark, as its optimal performance depends on having plenty of RAM. Its strength is stream processing, where data is analyzed on the go to deliver real-time results. RAM is more expensive and less abundant than disk space, so scaling a memory-hungry Spark cluster is harder. That boundary may well break down, though: RAM keeps becoming more available, and the economic benefits of real-time data processing often outweigh everything else. This takes us to the next point.
Until RAM is as easy to come by as disk space, Hadoop will be the more cost-effective solution, provided, of course, that you don’t require real-time data processing. Both frameworks are open source, so the most significant investment on your part will probably be dedicating your own staff or outsourcing the development.
Cost
Security
Being able to protect your enterprise data is not a perk but a necessity, even in the industries that aren’t heavily regulated. Bad data security practices can damage your business reputation and lead to legal issues.
Hadoop has a clear advantage here, being built on top of a full-fledged file system (HDFS). It offers granular access control through file-level permissions and Access Control Lists (ACLs). It also supports Kerberos, a trusted authentication system, and can integrate with LDAP directories for managing users and groups. Hadoop’s ecosystem also includes a variety of security solutions, such as Sentry.
Out of the box, Spark’s built-in authentication relies only on a shared secret (a password). But since Spark can run on top of HDFS, it can benefit from the same robust security solutions listed above.
Tolerance to Failure
Both frameworks tolerate failures or faults within the cluster that stores or processes data. In fact, Hadoop was designed on the assumption that some elements of the cluster will be offline at any given time. It’s designed to run on thousands of machines for distributed processing, so at that scale it is statistically near certain that some of them will be down. But Hadoop and Spark achieve fault tolerance in different ways.
Hadoop’s way is replication. It stores files in blocks, which are fixed-size chunks of a file (only the last block may be smaller). These blocks are replicated throughout the cluster in a specific manner. Replication is an integral part of setting up HDFS, and many of its parameters can be fine-tuned, but it mostly boils down to a simple rule: replicas have to be distributed in a way that balances performance and reliability, so placing all replicas of a block within the same rack (a collection of nodes, or computers, within the cluster) is not optimal. Sending a replica to a different rack ensures that the block stays available even if the original rack fails completely.
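The rack rule above can be illustrated with a small, self-contained Python sketch. It is not HDFS code: the function name `place_replicas`, the topology dictionary, and the node names are all hypothetical, and the policy is only loosely modeled on the commonly described default (first replica on the writing node, the rest on a different rack). It also assumes the cluster has at least two racks.

```python
import random

def place_replicas(topology, writer_node, replication=3, rng=random):
    """Choose nodes for one block's replicas, spread over two racks.

    topology: dict mapping rack name -> list of node names (hypothetical layout).
    Assumes at least two racks exist.
    """
    rack_of = {node: rack for rack, nodes in topology.items() for node in nodes}
    local_rack = rack_of[writer_node]

    # Replica 1: the node that is writing the block (cheap local write).
    chosen = [writer_node]

    # Replica 2: a node on a different rack, so losing a whole rack is survivable.
    remote_rack = rng.choice([rack for rack in topology if rack != local_rack])
    chosen.append(rng.choice(topology[remote_rack]))

    # Remaining replicas: prefer other nodes on the remote rack, then any node left.
    candidates = [n for n in topology[remote_rack] if n not in chosen]
    candidates += [n for n in rack_of if n not in chosen and n not in candidates]
    chosen.extend(candidates[: max(0, replication - len(chosen))])
    return chosen[:replication]

# Hypothetical two-rack cluster
topology = {
    "rack-a": ["a1", "a2", "a3"],
    "rack-b": ["b1", "b2", "b3"],
}
print(place_replicas(topology, writer_node="a1"))
```

With this layout, the block always ends up on `a1` plus two nodes of `rack-b`, so a complete failure of either rack still leaves at least one copy.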
Image Source: Introduction to Hadoop and MapReduce, presentation by Charles Lynde
Spark uses RDDs (Resilient Distributed Datasets) to ensure fault tolerance. This approach views data as a collection of objects distributed across the nodes of a cluster. Resilience comes from lineage: each RDD keeps a history of the transformations that produced it from its parent RDDs. If a partition is lost or damaged, Spark can trace it back to the source and recompute just that partition. So lineage plays the role in Spark that replication plays in HDFS.
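To make the idea of lineage more concrete, here is a toy model in plain Python. It does not use Spark, and `ToyRDD` with all of its methods is a made-up illustration rather than Spark’s API. Each dataset remembers its parent and the function that derives it, so a lost partition can be rebuilt on demand instead of being restored from a copy.

```python
class ToyRDD:
    """A toy dataset that remembers how it was derived (its lineage)."""

    def __init__(self, num_partitions, compute, parent=None, label="source"):
        self.num_partitions = num_partitions
        self._compute = compute   # function: partition index -> list of records
        self.parent = parent
        self.label = label
        self._cache = {}          # partition index -> materialized records

    @classmethod
    def from_partitions(cls, partitions):
        # Stand-in for durable source data that can always be re-read.
        stored = [list(p) for p in partitions]
        return cls(len(stored), lambda i: list(stored[i]))

    def map(self, fn, label="map"):
        # The child only records *how* to derive its partitions from the parent.
        return ToyRDD(self.num_partitions,
                      lambda i: [fn(x) for x in self.partition(i)],
                      parent=self, label=label)

    def partition(self, i):
        # Recompute the partition from lineage if it is not in memory.
        if i not in self._cache:
            self._cache[i] = self._compute(i)
        return self._cache[i]

    def lose_partition(self, i):
        # Simulate a node failure that wipes one in-memory partition.
        self._cache.pop(i, None)

    def lineage(self):
        node, steps = self, []
        while node is not None:
            steps.append(node.label)
            node = node.parent
        return list(reversed(steps))

    def collect(self):
        return [x for i in range(self.num_partitions) for x in self.partition(i)]


source = ToyRDD.from_partitions([[1, 2], [3, 4]])
doubled = source.map(lambda x: x * 2, label="double")
print(doubled.collect())    # [2, 4, 6, 8]
doubled.lose_partition(1)   # pretend the node holding partition 1 died
print(doubled.collect())    # [2, 4, 6, 8], partition 1 was recomputed
print(doubled.lineage())    # ['source', 'double']
```

Only the missing partition is recomputed; partition 0 is served from memory, which is the main appeal of lineage over keeping full copies of intermediate data.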
Image Source: billyengineering.com
In Hadoop, data is constantly transferred between disk and RAM. Spark, by contrast, keeps data in memory (RAM) as much as possible and for as long as possible, because RAM is much faster than disk. Data is cacheable in Spark, and the cache is the preferred ‘storage’ option, with preference given to RAM over disk space.
It’s hard to say which fault tolerance approach is better, given that there are many setups, controls, and settings that vary from enterprise to enterprise, from data warehouse to data warehouse. All in all, they both have proven to be effective.
Data Processing
Hadoop was built around the concept of batch processing. It accumulates large amounts of data and then performs a single action on all of that data at once. A simple example would be loan applications arriving throughout the day. Hadoop would wait until the end of the day and then score credit risk for all applications that came in over the past X hours.
But this approach to data processing also makes the system poorly suited to real-time or streaming processing. MapReduce processes data in sequential stages (map, shuffle, and reduce), which run on various nodes (machines) and even on various racks (collections of nodes). The separate results are then combined to deliver the final output. This works great for larger datasets, where the task is distributed among many nodes.
Spark, in turn, is geared towards real-time processing, as we described earlier. So if we take the same example of loan applications, Spark would take each application as it arrives and score it almost in real time. It also works for batch processing, where it plays to its hardware preferences and speeds up batch tasks through in-memory processing. This kind of approach is optimal for real-time predictive analytics and other types of machine-learning tasks that require immediate output.
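The map, shuffle, and reduce flow described above can be sketched in a few lines of plain Python. This is a single-process illustration, not Hadoop code: the function names, the `region` field, and the sample records are assumptions made up for the example, and in a real cluster each input split would be mapped on a different node.

```python
from collections import defaultdict

def map_phase(records, mapper):
    # Apply the mapper to every record of one split; it emits (key, value) pairs.
    pairs = []
    for record in records:
        pairs.extend(mapper(record))
    return pairs

def shuffle_phase(pairs):
    # Group all values by key, as the framework does between map and reduce.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped, reducer):
    # Collapse each key's values into a single result.
    return {key: reducer(key, values) for key, values in grouped.items()}

def run_job(splits, mapper, reducer):
    pairs = []
    for split in splits:  # each split stands in for the work of one node
        pairs.extend(map_phase(split, mapper))
    return reduce_phase(shuffle_phase(pairs), reducer)

# Hypothetical end-of-day batch: count loan applications per region
splits = [
    [{"id": 1, "region": "north"}, {"id": 2, "region": "south"}],
    [{"id": 3, "region": "north"}],
]
counts = run_job(
    splits,
    mapper=lambda app: [(app["region"], 1)],
    reducer=lambda region, ones: sum(ones),
)
print(counts)  # {'north': 2, 'south': 1}
```

Nothing is returned until every split has been mapped and shuffled, which is why this model fits end-of-day jobs far better than scoring a single application the moment it arrives.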
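The difference between the two styles in the loan example can be sketched in plain Python. The scoring rule, field names, and function names below are hypothetical and chosen only for illustration; no Spark or Hadoop API is involved. The last helper groups events into small batches, which is roughly how Spark Streaming achieves near-real-time results.

```python
def score(application):
    # Hypothetical rule: debt as a whole-number percentage of income.
    return application["debt"] * 100 // application["income"]

def score_batch(applications):
    """Batch style: wait for the complete list, then score everything at once."""
    return {app["id"]: score(app) for app in applications}

def score_stream(applications):
    """Streaming style: emit a score as soon as each application arrives."""
    for app in applications:
        yield app["id"], score(app)

def micro_batches(events, batch_size):
    """Group an incoming stream into small batches of at most batch_size events."""
    batch = []
    for event in events:
        batch.append(event)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch

applications = [
    {"id": "A-1", "income": 5000, "debt": 1000},
    {"id": "A-2", "income": 4000, "debt": 3000},
    {"id": "A-3", "income": 8000, "debt": 800},
]

print(score_batch(applications))            # all results at the end of the day
for app_id, risk in score_stream(iter(applications)):
    print(app_id, risk)                     # one result per arriving application
for batch in micro_batches(iter(applications), batch_size=2):
    print(score_batch(batch))               # results every few events
```

Both styles produce the same scores; what changes is how long each applicant waits for a result.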
Image Source: Spark Streaming Large-scale near-real-time stream processing, presentation by Mallory Hampton
Most importantly, Spark and Hadoop can work together to combine batch and real-time processing in one more robust and capable package.
Conclusion
Comparing Spark and Hadoop is difficult because these systems were built to enable different processing capabilities.
Hadoop as a whole is a much more sophisticated system, with its own file system and a variety of modules. Spark started out closely tied to the Hadoop ecosystem, but it has grown into a separate product that still uses other elements of Hadoop’s infrastructure, such as HDFS (the file system).
If you’re looking for fast data processing and costs aren’t that important, Spark is the way to go. If you’re looking for cost-effective data processing along with data management, and time constraints aren’t that important, Hadoop might be the better option for you. The two frameworks can also work together effectively to offer a mix of fast processing and more affordable scalability.
If you’re looking for a team with a proven track record of delivering highly productive Hadoop and Spark projects – feel free to ping Iflexion.
What is your experience with either of these two systems? In the Spark vs. Hadoop matchup, which one is your favorite? Share your thoughts and ideas in the comments below.