Hadoop vs Spark: A 2020 Matchup
In this article we examine the validity of the Spark vs Hadoop argument and take a look at those areas of big data analysis in which the two systems oppose and sometimes complement each other.
Big data analytics is an industrial-scale computing challenge whose demands far exceed the performance of standard, mass-produced computer hardware. In contrast to the economies of scale that make high-volume hardware affordable to many, the potential customer base for such systems, and for the associated big data consultants, is limited. Specifically, it consists of that upper stratum of companies, corporations, government and scientific research entities that have access to very high volumes of historic and/or current data.
Data at this scale is measured at least in hundreds of gigabytes, and frequently in petabytes containing billions—or hundreds of billions—of data points.
For such clients, the prospect of affordable and dedicated hardware is unrealistic. Such a rarefied market could not sustain a healthy roster of competing vendors. Conversely, market dominance by any one vendor would threaten a monopoly, with associated problems around governance, security, vendor lock-in and pricing. Clearly, a more imaginative approach would be needed.
The Hadoop Approach
Around 2005, a new project emerged from the hugely influential Apache stable that seemed capable, at last, of extracting meaningful insights from very high volumes of data in a cost-effective manner.
Derived from research papers published by Google in 2003 and 2004, Apache Hadoop leveraged the power of distributed and parallel computing to split up the gargantuan processing and storage tasks involved in big data analysis across many machines, within one coherent framework.
Since Spark eventually came into being to address some of Hadoop's shortcomings, and since it is still most frequently used in combination with a typical Hadoop deployment, we need to understand Hadoop's architecture a little before we can compare the two respective approaches.
Hadoop Storage: HDFS
The first step toward efficiently addressing such large amounts of data was to create an affordable and secure file-system that could not be compromised by any eccentricities of a host operating system, and where potential data corruption or hardware outages could be anticipated and mitigated.
The Hadoop Distributed File System (HDFS) treats an arbitrary network of cheap storage devices as if they were one single file system. These cheap commodity hardware 'nodes' can be added and removed at will.
HDFS presumes that sections in the cluster will fail or otherwise become unavailable at some point, and ensures data integrity by replicating the information in any node over to another, alternate node.
Where possible, HDFS will back up that information to a node that exists on a completely separate rack (or a discrete end-point), so that the failure of an entire array of units doesn't compromise the integrity of the dataset.
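The replication preference described above can be sketched in a few lines of Python. This is a toy illustration only, not HDFS's actual placement policy: the function name `choose_replica_node` and the node and rack labels are hypothetical, and the logic simply prefers a node on a different rack and falls back to any other node.

```python
def choose_replica_node(primary, rack_of):
    """Pick a node for the backup copy, preferring a different rack.

    `rack_of` maps node name -> rack name (both hypothetical labels).
    """
    others = [node for node in sorted(rack_of) if node != primary]
    if not others:
        raise ValueError("no other node available for replication")
    off_rack = [node for node in others if rack_of[node] != rack_of[primary]]
    # Prefer a node on another rack; otherwise settle for any other node.
    return (off_rack or others)[0]


rack_of = {"node-a": "rack-1", "node-b": "rack-1", "node-c": "rack-2"}
print(choose_replica_node("node-a", rack_of))  # node-c
```

With this layout, losing all of rack-1 still leaves a copy of node-a's data on node-c.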
HDFS offers the same kind of file-level permissions, access rights management and Access Control Lists (ACLs) as any popular modern operating system, and it also uses Kerberos and Lightweight Directory Access Protocol (LDAP) as trusted authentication management solutions.
As we'll see, HDFS's strong security architecture would eventually make it yet more difficult for Spark (which has no such dedicated file system of its own) to completely break away from the Hadoop ecostructure.
Hadoop Processing: MapReduce
Implemented in Java in Hadoop, MapReduce is a programming paradigm where each node in the HDFS cluster is assigned analytical tasks only for the information that is stored on it. When each node has finished its calculations on its own local data, the results that it returns are then combined with the results of all the other nodes.
This model can be difficult to take in at first, since it has no obvious analogy to the way we normally process information, either as people or within organizations. Therefore, to understand MapReduce on its own terms, let's consider one popular example: counting all the instances of a particular word.
Let's assume that we have two nodes which each have three different documents stored on them. All six documents are different to each other. We want to find out how many instances (if any) of the words 'cat' and 'mouse' appear across all six documents.
At the Map stage, each node is asked to find any instances of 'cat' or 'mouse' that may appear in any of the documents that are stored on it.
In our example, Node 1 found seven instances of 'cat' and 12 instances of 'mouse' in the documents saved on its storage, while Node 2 found only two instances of 'cat', but eight instances of 'mouse' in its own small collection of documents.
Each node returns a key/value pair to the workflow. In our example the map results are:
Node 1: cat:7 mouse:12
Node 2: cat:2 mouse:8
Next comes the Shuffle/Reduce stage, where each key/value result is sent to a 'reducer node' process that has been assigned to a specific term. The process adds these results to other results that it has received from other nodes, and produces an aggregate key/value pair representing the separate calculations of all the nodes that participated.
After this, the aggregate results are merged and sorted. Finally, the reducer output is written back to HDFS.
For our example, the final results come in as cat:9 mouse:20.
Now imagine that there are two million documents containing about 3 billion instances of each word, and that the process involves 40,000 nodes in a cluster instead of two. The parameters of the search may change, but the principle is the same.
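The walk-through above can be simulated on a single machine. The sketch below is plain Python rather than Hadoop code; the function names and the contents of the six documents are invented so that the per-node counts match the example (Node 1: cat 7, mouse 12; Node 2: cat 2, mouse 8).

```python
from collections import Counter, defaultdict

TERMS = {"cat", "mouse"}


def map_phase(documents):
    """Run on one node: count the target terms in that node's local documents."""
    counts = Counter()
    for text in documents:
        for word in text.lower().split():
            if word in TERMS:
                counts[word] += 1
    return sorted(counts.items())  # key/value pairs for this node


def shuffle(mapped_outputs):
    """Route every value for the same key to the same reducer."""
    grouped = defaultdict(list)
    for pairs in mapped_outputs:
        for key, value in pairs:
            grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Aggregate the per-node counts for each term, sorted by key."""
    return {key: sum(values) for key, values in sorted(grouped.items())}


# Three made-up documents per node.
node1_docs = ["cat " * 3 + "mouse " * 5, "cat " * 4 + "mouse " * 4, "mouse " * 3]
node2_docs = ["cat mouse mouse", "cat " + "mouse " * 3, "mouse " * 3]

mapped = [map_phase(node1_docs), map_phase(node2_docs)]
print(mapped)                         # [[('cat', 7), ('mouse', 12)], [('cat', 2), ('mouse', 8)]]
print(reduce_phase(shuffle(mapped)))  # {'cat': 9, 'mouse': 20}
```

Note that only the small key/value pairs travel between the stages; the documents themselves never leave the node that stores them.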
The Limitations of MapReduce
Unsuitable for Live Data
The MapReduce architecture is designed by default to run on pre-loaded data reservoirs saved in HDFS, rather than ad-hoc 'streaming' data, such as live usage statistics from a major social media network.
Stateless, Non-Iterative Execution
You cannot refine a query across a series of consecutive MapReduce jobs, since MapReduce performs stateless rather than shared-state execution. All you can do is either write a better MapReduce job or save the results back into HDFS and include them in a subsequent run. In that case, however, the unstructured nature of HDFS data reservoirs will necessitate more sophisticated kinds of querying that are not really native to the MapReduce paradigm (see 'Difficult to Implement Popular Query Types' below).
Poor Network Performance
Hadoop was originally conceived to address a large number of local machines, rather than a network cluster. Its latency is increased by adding non-local networked nodes to the cluster.
Reliant on Persistent Storage Rather Than RAM
As the workload progresses through a MapReduce operation, the intermediate results are stored to disk rather than in faster RAM, which can make the task time-consuming. This is another factor that makes MapReduce unsuitable for real-time processing.
Complex, Low-Level Programming Environment
In a standard Hadoop installation, MapReduce operations are programmed in Java, the native language of Hadoop. Running MapReduce via Java is a specialized and time-consuming task, with a reduced number of available Java development experts on the market, compared to more current languages and architectures. Writing mapper and reducer code in Java represents intensive and complicated low-level coding work.
Limited Abstraction and Unfriendly APIs
In addition to having to negotiate challenging APIs, very few of the mundane programming tasks are abstracted away in a Java MapReduce programming environment. Consequently, the speed with which such low-level code can traverse HDFS systems is offset by that architectural burden.
Difficult to Implement Popular Query Types
Furthermore, the rigid key/value requirements of the MapReduce model make it difficult to run common SQL-like operations such as join, or to filter results. Although implementing such features can be accomplished in Java, it's a non-trivial task.
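To see why a join is awkward in this model, consider one common workaround, often called a reduce-side join: every record is tagged with the table it came from, both tables are shuffled on the join key, and the reducer pairs the two sides up. The sketch below simulates this in plain Python; the `users` and `orders` tables and all function names are hypothetical.

```python
from collections import defaultdict


def map_tagged(users, orders):
    """Emit (join_key, (source_tag, payload)) for records from both tables."""
    for user_id, name in users:
        yield user_id, ("user", name)
    for user_id, amount in orders:
        yield user_id, ("order", amount)


def shuffle(pairs):
    """Group all tagged values that share a join key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_join(grouped):
    """For each key, pair every user record with every order record (inner join)."""
    joined = []
    for user_id in sorted(grouped):
        names = [value for tag, value in grouped[user_id] if tag == "user"]
        amounts = [value for tag, value in grouped[user_id] if tag == "order"]
        for name in names:
            for amount in amounts:
                joined.append((user_id, name, amount))
    return joined


users = [(1, "Ada"), (2, "Grace")]
orders = [(1, 30), (1, 12), (3, 99)]
print(reduce_join(shuffle(map_tagged(users, orders))))
# [(1, 'Ada', 30), (1, 'Ada', 12)]
```

What SQL expresses as a single JOIN clause takes three hand-written functions here, which gives a sense of the effort involved before any Java boilerplate is added.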
Programming Graphs Is Difficult
Graphs can be generated in a MapReduce workload by utilizing the Apache Giraph library, though this adds considerable complexity to the process. Nonetheless, Giraph can often outperform Spark's graphing capabilities, as we shall see.
Slow Solutions for Machine Learning
Machine learning is typically implemented in Hadoop/MapReduce via Apache Mahout, a full-featured but sometimes slow framework for data mining and ML. Mahout is a separate high-level Apache project capable of running on other storage systems and frameworks. It ceased to accept new MapReduce code around 2014, with the intention of becoming a more generally applicable framework in its own right. Since machine learning routines are dependent on iteration, a particular structural weakness of MapReduce running on Hadoop, Spark can usually outperform Mahout in this respect.
Aware that many potential end-users would find the low-level Java programming environment a practical hindrance, Apache Hadoop created an initial framework (Hadoop Streaming) to allow non-Java MapReduce tasks to run on HDFS systems via external programs. In the following years, a considerable ecostructure of more user-friendly programmatic approaches to Hadoop was to emerge; so many, in fact, that it could be argued that Hadoop's internal opacity has become a relatively moot point.
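As a rough idea of what a non-Java job looks like, Hadoop Streaming passes lines of text to an external mapper and reducer, which exchange tab-separated key/value lines. The sketch below expresses a word-count mapper and reducer as Python generators and simulates the framework's sort step locally. In a real deployment these would be separate scripts reading standard input and writing standard output; the function names and sample lines here are illustrative assumptions.

```python
from itertools import groupby


def mapper(lines):
    """Emit one 'word<TAB>1' line per word, as a streaming mapper would."""
    for line in lines:
        for word in line.split():
            yield f"{word.lower()}\t1"


def reducer(sorted_lines):
    """Sum the counts for each word; input must already be sorted by key."""
    parsed = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for word, group in groupby(parsed, key=lambda pair: pair[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"


# Local simulation: sorted() stands in for the framework's shuffle/sort step.
mapped = sorted(mapper(["the cat saw the mouse", "the mouse ran"]))
print(list(reducer(mapped)))
# ['cat\t1', 'mouse\t2', 'ran\t1', 'saw\t1', 'the\t3']
```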
Therefore the majority of Hadoop's “competitors”—including Spark—have arisen from the need to provide a friendlier and more modern data processing architecture than default MapReduce jobs running over HDFS.
From the same stable as Hadoop, the primary challengers to MapReduce over the course of the last decade have included Spark, Flink (which we will return to later), Storm, and Samza.
Additionally, Apache produces Hive, which can impose SQL queries onto unstructured HDFS data; and HBase, a Java-based database that can extract and query sparse data obtained from distributed file systems like HDFS, and which can utilize SQL-style querying with the aid of Apache Phoenix.
These satellite solutions represent only a few examples of the ecostructure that has grown around Hadoop, both open-source and private. All of them operate at a higher level than MapReduce, and all of them implement some flavor of a streaming solution, at various financial and/or operational costs.
Spark in particular has captured the imagination of businesses, governments, and the research community in recent years. Let's take a look at how Spark approaches HDFS traversal and analysis, and at any disadvantages that might accompany its greater speed and ease-of-use.
The Spark Approach
Spark is a general purpose analysis/compute engine that uses RAM to cache data in memory, instead of intermittently saving progress to disk and reading it back later for further calculation, as occurs with a MapReduce workload running on Hadoop.
Since RAM is significantly faster than the best available disk read times, Spark can run up to 100 times faster than MapReduce when all of the information fits in RAM, and up to ten times faster in cases where the volume of information (or the scope and complexity of the query) is so large as to force Spark to save periodically to disk.
Spark was developed at the UC Berkeley AMPLab in 2009 and open-sourced under a BSD license the following year. In 2013, the initiative was donated to Apache for ongoing development and maintenance.
Written natively in Scala, Spark can additionally integrate with Java, Python, and R, as necessary. Although each language will provide varying toolsets and levels of abstraction, Scala naturally provides the greatest operational speed, while Python developers are currently most numerous.
Spark is not limited to HDFS, and can run over a range of environments, including cluster managers such as Kubernetes and Mesos and data stores such as Apache Cassandra, among many others. It can also run in Standalone Mode over SSH to a collection of non-clustered machines.
Though Spark can integrate with YARN (the resource management architecture that Hadoop created when it became clear that MapReduce was being over-taxed with scheduling and task management), as well as alternatives like the default FIFO scheduler, it is more operationally complex than MapReduce, and thus more difficult to explain by analogy. Nonetheless, we'll take a look at the internal logic Spark adopts to process big data.
Before comparing Spark's approach to that of MapReduce, let's examine its key elements and general architecture.
Spark Core and Resilient Distributed Datasets (RDDs)
RDDs are the heart of Apache Spark. A Resilient Distributed Dataset is an immutable, partitioned collection of records derived from the data. Querying an RDD will create a new related RDD that carries the query information and output.
In the Spark lexicon, this event is called a 'transformation'. If an RDD should cause or be subject to a general programmatic (rather than catastrophic) failure, saved information about the process up until the failure point enables the end-user to resume the process.
As additional queries and/or data are appended to an RDD, a Directed Acyclic Graph (DAG) develops a chain of amended instances, where each RDD in the chain reflects a new transformation. These are dependent objects that relate back to the original RDD (pictured in blue at the front of the stack in the illustration below).
Creating and manipulating this chain of RAM-based transformations can take place far more quickly than equivalent MapReduce operations, which rely instead on reading and writing to disk.
At any point in the data traversal, Spark's routines can return to an earlier iteration, nearer the start of the RDD chain, improving recoverability and expanding the versatility of the orchestrated processes.
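The chain of transformations described above can be modelled with a small toy class. This is not Spark's API: `ToyRDD` and its methods are hypothetical stand-ins that mimic only three properties of an RDD. The data is immutable, each transformation returns a new object that remembers its parent, and a result is computed only when an action (here `collect`) replays the lineage.

```python
class ToyRDD:
    """Hypothetical stand-in for an RDD: immutable, lazy, aware of its lineage."""

    def __init__(self, source=None, parent=None, op=None, name="source"):
        self._source = tuple(source) if source is not None else None
        self.parent = parent
        self._op = op
        self.name = name

    def map(self, fn):
        # Transformation: returns a new ToyRDD, the parent is left untouched.
        return ToyRDD(parent=self, op=lambda rows: [fn(r) for r in rows], name="map")

    def filter(self, pred):
        return ToyRDD(parent=self, op=lambda rows: [r for r in rows if pred(r)], name="filter")

    def lineage(self):
        """List the chain of transformations back to the original data."""
        node, chain = self, []
        while node is not None:
            chain.append(node.name)
            node = node.parent
        return list(reversed(chain))

    def collect(self):
        """Action: replay the lineage from the source to produce a result."""
        if self.parent is None:
            return list(self._source)
        return self._op(self.parent.collect())


base = ToyRDD(source=[1, 2, 3, 4, 5, 6])
evens = base.filter(lambda x: x % 2 == 0)
squares = evens.map(lambda x: x * x)

print(squares.lineage())  # ['source', 'filter', 'map']
print(squares.collect())  # [4, 16, 36]
print(base.collect())     # [1, 2, 3, 4, 5, 6]
```

Because every object keeps a reference to its parent, an intermediate result such as `evens` can be recomputed or reused at any time, which is the property that lets processing return to an earlier point in the chain.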
Spark SQL
Spark SQL provides a native API to impose structured queries on a distributed database, without the additional overhead and possible performance penalties of the secondary frameworks that have grown up around MapReduce in Hadoop.
Not only does the module offer an enviable level of fault-tolerance, but it is also capable of passing on Hive queries without modification or transliteration. This allows Hive to benefit from the massive speed benefits of Spark's in-memory data traversal.
Spark SQL can provide a transactional layer between the RDDs generated during a session and relational tables. A Spark SQL query can address both native RDDs and imported data sources as necessary.
This transactional functionality within Spark SQL is enacted via 'DataFrames'. Similar to the DataFrame model in Python and R, a Spark SQL DataFrame is a highly optimized and versatile dataset that's not unlike a table in a relational database.
Spark Streaming
Spark Streaming enables on-demand data to be added to Spark's analytical framework as necessary. Incoming data from sources such as TCP sockets, Apache Kafka (see below) and Apache Flume, among many others, can be processed by high-level functions and passed on to result-sets for further batch-work, back to an HDFS file system, or directly to end-user systems, such as dashboards.
Since Spark Streaming is a closely integrated extension of the core Spark API, it offers speed as well as versatility. Although this speed boost naturally diminishes if one piles tertiary APIs, frameworks and mechanisms on top of it, Spark Streaming operates far nearer core code speeds than an elaborated Hadoop/MapReduce streaming solution.
Spark Streaming is a core enabling technology of Spark's graphing and machine learning libraries (see below).
GraphX
GraphX is Spark's native library for the on-demand production of graphs and Graph-Parallel Analytics. It comes with a suite of additional Spark operators, as well as algorithms and builders. GraphX can also be initiated from outside the Spark shell (i.e. from higher-level frameworks and applications) as necessary.
GraphX can access graphs from Hive via SQL, and perform in-process preview and data analysis via the Spark REPL shell environment — at least on smaller graphs.
Additionally, GraphX's abstractions make programming of graph generation considerably easier than Giraph over Hadoop/MapReduce.
Machine Learning Library (MLlib)
Spark's dedicated MLlib library supports four important functions in machine learning: regression, collaborative filtering, binary classification and clustering. As with GraphX, MLlib was designed as a core module, with reduced overhead and latency in comparison with the secondary frameworks that provide similar functionality for MapReduce.
The Limitations of Spark
Fast, But Not Real-Time
For all its greater speed, and even if the 2.3 release potentially takes latency down to milliseconds, Apache Spark does not operate in true real time against streaming data. Like MapReduce, Spark uses a batch-processing architecture, albeit a far more atomized and responsive one that can handle mission-critical tasks (such as fraud detection, an important use case of big data in ecommerce) with acceptable operational latency. Genuine real-time solutions can be created with Storm or Apache Kafka over HBase, among other possible combinations from the Apache data streaming portfolio, as well as third-party offerings.
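The difference between this 'atomized' batch model and true event-at-a-time processing is easier to see in code. The sketch below is plain Python rather than Spark Streaming; the `micro_batches` function and the sample events are hypothetical. Events are grouped into fixed time windows, and each window is then processed as a small batch.

```python
from collections import defaultdict


def micro_batches(events, interval):
    """Group (timestamp, value) events into consecutive windows of `interval` seconds.

    Windows that received no events are skipped in this simplified version.
    """
    batches = defaultdict(list)
    for timestamp, value in events:
        batches[int(timestamp // interval)].append(value)
    return [batches[index] for index in sorted(batches)]


events = [(0.2, 5), (0.7, 3), (1.1, 8), (2.4, 1), (2.9, 4)]
totals = [sum(batch) for batch in micro_batches(events, interval=1.0)]
print(totals)  # [8, 8, 5]
```

An event arriving at 0.2 seconds is not reflected in any result until its window closes at 1.0 seconds, so the batch interval sets a floor on latency. Shrinking the interval narrows that gap but does not eliminate it.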
No Native File System
Since Spark has positioned itself as a platform-agnostic analytics framework, it must adapt itself to the requirements and limitations of the system where the target data is hosted. Since Hadoop has captured such a large segment of the market, Spark still runs most frequently over some flavor of Hadoop and HDFS. This brings security benefits (see below) for which Spark has no native provisioning, but also bottlenecks that can reduce the potential efficacy of Spark's RAM-based speed advantage.
Requires HDFS (or Similar) for Best Security Practices
By default, Spark's security contingencies are set to 'off'. The fact that Spark relies so heavily on the security provisions of an external cluster is one of the reasons it is most often used with HDFS, which has wide industry adoption and robust and mature security protocols. As mentioned in the preceding point, Spark is effectively a 'guest' in this scenario, forced to inherit any latency or performance issues associated with file read/writes on the host cluster.
Expensive Hardware Requirements
Though RAM prices have significantly reduced since the advent of Hadoop, RAM still costs more than the arrays of very cheap, redundant disk storage that MapReduce exploits in a Hadoop operation.
Does Not Always Scale Well
High concurrency and high compute loads can cause Spark to perform poorly, compared to the more plodding but sure-footed routines of MapReduce on HDFS. It can also run more slowly under more ambitious workloads, with out-of-memory (OOM) events an occupational hazard that is still not handled well natively, and must be anticipated and mitigated at the query/workload development stage.
Does Not Handle Small Files or Gzipped Files Well
Though Spark's speed opens up the possibility of including networked resources in a workload, some of the most popular cloud solutions are optimized for large files, rather than the many small files that comprise a typical big data batch job. For instance, just getting an Amazon S3 bucket to present a long list of files may take a minute or more, depending on size. The advantages of abstracted directories such as JSON and metadata are almost entirely lost in Spark, which first has to resolve all the data points into actual file locations before it can begin processing them. If data is stored in compressed gzip format on the host system, Spark will also end up dedicating valuable resources to the memory-intensive job of extracting the files.
Diminished Recovery Capabilities (compared to MapReduce)
MapReduce uses persistent storage on the HDFS file system during a workload, which allows a crashed job to resume from the point of failure onward, once the cause has been addressed. Since RAM is volatile memory, Spark can only run at its best speed by sacrificing a certain amount of recoverability. Since big data batch processing can be very time-consuming, this remains a logistic consideration.
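The trade-off can be illustrated with a toy job runner that persists the output of every stage to disk, so that a rerun after a crash picks up at the failed stage instead of starting over. This is a simplified sketch of the idea rather than of Hadoop's actual recovery mechanism; `run_stages`, the stage functions and the checkpoint file format are all hypothetical.

```python
import json
import os
import tempfile


def run_stages(stages, data, checkpoint_path, fail_at=None):
    """Run stages in order, persisting the result of each one to disk."""
    start = 0
    if os.path.exists(checkpoint_path):
        # A previous run left results behind: resume from where it stopped.
        with open(checkpoint_path) as handle:
            saved = json.load(handle)
        start, data = saved["next_stage"], saved["data"]
    for index in range(start, len(stages)):
        if index == fail_at:
            raise RuntimeError(f"simulated crash in stage {index}")
        data = stages[index](data)
        with open(checkpoint_path, "w") as handle:
            json.dump({"next_stage": index + 1, "data": data}, handle)
    return data


stages = [
    lambda rows: [r * 2 for r in rows],  # stage 0
    lambda rows: [r + 1 for r in rows],  # stage 1
    lambda rows: [sum(rows)],            # stage 2
]

with tempfile.TemporaryDirectory() as workdir:
    path = os.path.join(workdir, "checkpoint.json")
    try:
        run_stages(stages, [1, 2, 3], path, fail_at=2)
    except RuntimeError:
        pass  # the results of stages 0 and 1 are already on disk
    result = run_stages(stages, [1, 2, 3], path)  # resumes at stage 2

print(result)  # [15]
```

Every stage pays for a disk write, which is exactly the cost Spark avoids by keeping intermediate results in memory. The price of avoiding it is that a crash can leave nothing to resume from.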
GraphX Limitations vs Giraph
Despite close integration with Spark, which normally leads to an inherited speed boost for a Spark module, GraphX has come under criticism in recent years (not least from Facebook) for laggard performance and hard limitations to the size of graph that can be produced from large volumes of data, in comparison to Giraph on a MapReduce/Hadoop data run. Giraph is also reported to require less memory and less time to create high-scale graphs than GraphX.
Limited Machine Learning Algorithms
While Spark's MLlib operates very near the core code, it offers a reduced set of algorithms compared to the more mature Mahout running on MapReduce.
Difference between Spark and Hadoop: Conclusion
Since the rise of Spark, solutions that were obscure or non-existent at the time have risen to address some of the shortcomings of the project, without the burden of needing to address 'legacy' systems or methodologies. Notable among these is Apache Flink, conceived specifically as a stream processing framework for addressing 'live' data. Though Spark includes a component for streaming data, Flink was engineered from scratch with this model in mind, and has less collateral overhead for this reason.
Other frameworks and tools have likewise arisen that either address the challenges of big data workloads in a more modern and unencumbered way than Spark, or render moot some of the traditional misgivings about adopting MapReduce over Hadoop, by providing accessible, user-friendly APIs that hide the opaque workings of MapReduce in Hadoop/HDFS.
Therefore it is important, when choosing possible frameworks for big data analysis, to balance your big data requirements against the wider landscape, where the 'next Spark' may be emerging.
That said, let's conclude by summarizing the strengths and weaknesses of Hadoop/MapReduce vs Spark:
- Live Data Streaming: Spark
For time-critical systems such as fraud detection, a default installation of MapReduce must concede to Spark's micro-batching and near-real-time capabilities. However, also consider Apache Druid, either as an alternative or adjunct to Spark. Druid was designed to deal with low latency queries, and can also integrate with Spark to accelerate OLAP queries.
- Recoverability: Hadoop/MapReduce
Spark saves a lot of processing time by avoiding read/write operations. However, its memory management capabilities can be overwhelmed by larger jobs, leading to the need to restart the entire batch instead of resuming from the point of failure, which is more easily handled in Hadoop/MapReduce.
- Ease of Use: Spark
Developing MapReduce jobs in Java is arcane work. Although an ecostructure of tools and frameworks has grown up around it to abstract the problem away, accessibility and ease-of-use were founding principles of Spark.
- Scalability: Hadoop/MapReduce
Some of Spark's scaling issues can be addressed by re-partitioning data and paying attention to volatile joins, among other tricks. For the most part, scaling is, by design, a non-issue with MapReduce in a Hadoop HDFS cluster. The tortoise wins!
- Available Development Talent: Spark
Spark's support for Python alone vastly broadens the available developer pool, in comparison to MapReduce. The only caveat is that the swarm of higher-level programming tools for MapReduce in recent years, most of which are written in more popular programming languages, makes this less of an issue than it once was.
- Graphing: Hadoop/Giraph
In one 2018 study comparing various high-volume data graphing systems (including GraphX and Giraph), GraphX emerged as the slowest of the tested systems due to processing overheads such as RDD lineage, checkpointing and shuffling. Considering that GraphX is a core Spark module and Giraph a higher-level adjunct to Hadoop, this is a surprising result. Combine this with the lower memory requirements of Giraph and the graph size limitations mentioned earlier, and Hadoop/Giraph is the clear winner in this category.