Hadoop Overview: A Data Management Powerhouse
Apache Hadoop emerged at the forefront of the big data revolution. In this Hadoop overview, we'll see how Hadoop came to be and where it stands in the current data analysis sector.
Hadoop is an open-source software framework that enables the intensive processing of very large datasets in order to gain insights and to reveal correlations between different sources of information.
Since the advent of Hadoop predates the peak of the cloud computing revolution and the increasing affordability of high-capacity hardware (such as RAM), a raw Hadoop installation is seen by some as arcane, hard to program for, and perhaps more part of big data's history than its future. Nonetheless, Hadoop remains a core industry standard for many of data science consultants on the planet, Iflexion included.
Here we’ll examine the conditions that brought Hadoop into being, provide a Hadoop architecture overview, examine how much the newer technologies in big data analysis owe to it…and ask whether Hadoop really is a 'legacy' approach to high-volume data analysis or rather an implacable and long-term player in the sector.
Hadoop Overview: Why Hadoop Came About
By the beginning of the twenty-first century, the private and public sectors had begun to comprehend the potential gains from accessing, processing and cross-referencing the high volumes of raw digital data that were now emerging after the first wave of digitization throughout the 1990s. Even before the advent of IoT, the potential to interpret, utilize and monetize these new information streams promised a quantum leap in business intelligence.
Data that could reveal end-user patterns, from consumer preferences to urban planning; data from essential sectors including health, emergency services, environmental and urban monitoring systems; data from older digital repositories that had yet to break out from their own local environments and purposes, capable of providing analytical insights so much needed for customer engagement strategies; data with civic, economic and military applications beyond the dreams of analysts and modelers—if these huge and apparently ungovernable stores of information could somehow be pressed into useful service, businesses and governments would tap into completely new opportunities.
Eventually this abstract notion of vast untapped and unconnected knowledge would come to be called 'big data'.
Hadoop Overview: The High Demands of Big Data
By the early 2000s, the problems of mining large datasets of this nature had become apparent. Ten years of free market competition during the networking and computing revolution of the 1990s had resulted in multiple proprietary formats across multiple platforms. Furthermore, some of these formats were dependent on particular types of file system storage. Even where licensing issues could be resolved for the data, navigating the babel of file formats and sheer variety of database systems would prove an additional hurdle.
The hardware requirements of such systems also seemed prohibitive for smaller enterprises, since it takes an extraordinary amount of responsive computing power and fast read times to collate and cross-reference data points in tables with million—or billions—of data points.
Furthermore, since such intense processing operations are likely to wear down host systems, those host systems would need to be well-specced, dedicated and designed for the task, which would make them correspondingly expensive.
The Long Read
Even if the budget permitted, bespoke big data analysis systems had yet another bottleneck to contend with: it can take a very long time to read multiple small files from high-volume storage disks. This is because the 'read head' in the disk has to move a great deal over the surface of the disk to access the many tiny files that need to be correlated during big data analysis.
The volume of read operations is so high in such a process that even a resource-intensive multi-disk approach (such as RAID) can't compensate for it. This pipeline problem is also affected by the limitations of the host system's bus throughput speeds. If the analysis involves data resources that need to be accessed over a network, the bottleneck, naturally, becomes even worse.
An effective, affordable and nimble big data analysis system seemed out of reach, unless resource costs lowered radically, or else some elegant and imaginative workaround could be found.
Government-led big data projects of the era were routinely hampered by bureaucratic and political factors, while private sector solutions would likely come at the cost of vendor lock-in and proprietary formats, as well as issues around data provenance, security, and the ramifications of achieving a potential monopoly in what was predicted to become a pivotal sector of the new century.
In the early 2000s, an acceptable answer began to emerge from the search engine market, which was tackling the problem not in order to develop a saleable product, but rather as an obstacle in the way of its core business—advertising.
Hadoop Architecture Overview
In 2003, faced with overwhelming amounts of user-generated data to analyze in the wake of its sudden global success, Google developed a new file system designed to deliver a high-yield and fault-tolerant storage on cheap, commodity hardware.
The Google File System (GFS) was an imaginative leap in distributed storage, without any of the shortcomings of RAID or similar predecessors. Under GFS, data was split up and mirrored across multiple storage nodes, hosted on an arbitrary number of scalable clusters of affordable, dispensable hardware.
Practically any type of storage could be added to the cluster, without needing to be comparable to other nodes, and without the expectation that all the nodes would need to perform equally well.
A Shorter Journey to the Data
The use of multiple smaller physical drives meant that the data points assigned to each node were physically nearer to each other on the disk, improving read performance compared to systems that attempt to read many small files from larger storage volumes.
Abstract and flexible, GFS operated as a user-space library instead of being tied to a host operating system (and therefore becoming vulnerable to any issues that the host OS might experience, such as downtime or unrelated data corruption).
Since GFS was specifically designed for analyzing existing data, it prioritized seek time (read) operations over write speeds. However, it also maintained acceptable write performance by appending data to existing files, rather than completely overwriting them every time an update was necessary.
Using MapReduce for More Efficient Assignment and Analysis
The following year the search engine giant released details of MapReduce, a complementary programming model that could split up high-volume data into many separate 'worker nodes', each capable of discretely solving its own part of a larger query against the data.
Some nodes would have harder (or more) calculations to perform than others, while other nodes might be operating on less performant hardware. It didn’t matter, since they operated independently of each other, contributing their own percentage of the solution when their calculations were finished.
MapReduce enabled the efficient analysis of even petabytes of data, without the bottleneck of slow access times that would normally occur when attempting to read large data stores from disk and with enviable redundancy and speed.
Though Google obtained a patent for MapReduce, the paradigm behind it was first presented in 1985, and its concepts put into action for the Connection Machine range of experimental supercomputers at MIT during the 1980s—a period where more primitive hardware resources had sparked intense research into parallel computing techniques of this kind.
This Hadoop overview video from Cloudera (which began life as a commercial Hadoop distribution and contributes code significantly to the Hadoop infrastructure) explains how these initial discoveries were eventually utilized by the Hadoop project.
Hadoop Overview: The Project Begins
From 2004 onward, this hive-like approach to big data processing caught the attention of Doug Cutting, then a researcher at Yahoo!, and data scientist Mike Cafarella. Intrigued by potential applications in dedicated data analysis frameworks, the pair began to develop new projects derived from Google's published research.
Initially used as an enabling technology for the Apache open-source search engine Nutch, the broader analytical potential of GFS and MapReduce was soon spun into its own dedicated project—Apache Hadoop.
Named after the favorite toy elephant of Doug Cutting's son, Hadoop would eventually comprise four core elements, described below.
Hadoop Distributed File System (HDFS)
Written in Java, the Hadoop Distributed File System (HDFS) is very similar to Google's original conception as described above: an abstracted platform-agnostic file system split across any number of available drives that the end-user might want to add to the cluster. Replication is deeply embedded in the HDFS model, so that no crippling data loss ensues if one of the drives should fail.
A number of the core concepts of GFS have been reworked in Hadoop to cater to a single end-user rather than the multiple clients envisaged by Google, making HDFS more project-oriented by comparison.
HDFS also has a minimum block size of 128mb (double that of GFS) and takes a different approach to file deletion, snapshots, data integrity checksums, and file appends, among other aspects.
The MapReduce component of Hadoop splits up data processing tasks into as many job lots as there are hardware nodes in the cluster. If one particular task involves a file that's large enough to potentially affect performance, that file can be further divided into a more workable sub-array of 'Hadoop splits'.
In 2012, MapReduce evolved to become a function of the YARN framework that was introduced into Hadoop at that time (see below).
Also known as The Core, Hadoop Common comprises essential components to initialize and run Hadoop, notably providing the abstraction for the HDFS file system. In addition to JAR files and associated scripts, Hadoop Common includes documentation and source code for the project.
Some years into Hadoop's development, it became clear that MapReduce was overburdened with administrative tasks such as scheduling, job tracking, and other process and resource functions. Yahoo! estimated at that time that scalability was also affected by this over-commitment, creating an effective limit of 40,000 concurrent tasks over 5,000 nodes.
To clear the bottleneck, the developers created an additional higher-level administrative layer to take over many of these tasks. The additional architecture was named Yet Another Resource Manager (YARN), perhaps as a reference to Doug Cutting's time with Yahoo!.
This new rationalization of tasks enabled Hadoop to incorporate other processing frameworks into its operations, and to become an ecostructure in its own right, albeit at the cost of some additional complexity and oversight.
Industry Take-Up of Hadoop
Hadoop was initially conceived by one of the most powerful tech entities of all time and designed to take on hyperscale data processing challenges, addressing terabytes or more information at a time.
Therefore it is perhaps less surprising than usual for an open-source project that it would have accrued so many famous end-users, including AWS, LinkedIn, Twitter, Alibaba, eBay, Microsoft, Netflix, IBM, RackSpace, Facebook, and Spotify, among many others.
Only the biggest tech entities and government sectors were likely to have the sheer quantities of data that Hadoop was designed to address. For smaller enterprises, Hadoop is generally considered to be 'over-specced'.
Looking at Hadoop-based Projects
No Hadoop overview is possible without considering the secondary system of tools and full-fledged projects that have grown to support or extend the framework over the last two decades. A comprehensive list is beyond the scope of this article, but here are some of the most notable Hadoop derivatives, adjuncts, and challengers.
After initial development at UC Berkeley in 2009 and through to its establishment as a high-level open-source project within Apache in 2014, Spark came to embody the 'next generation' of big data analysis frameworks.
Spark processes data directly in RAM rather than evolving its calculations by saving data back to disk as Hadoop does. According to the project's own publicity, Spark can therefore run workloads 100 times faster than Hadoop.
Spark's API also offers developers more than 80 high-level operators to access its capabilities, rather than challenging them to create workflows for the more arcane and obscure protocols of MapReduce. End-users can control workloads through popular programming languages (including SQL, Python, Scala, and Java) that have a high level of developer availability.
However, there are some caveats:
- Spark requires the user to throw money at the problem, namely it has more expensive computing resources (a hindrance that Hadoop was specifically designed to address).
- Its main benefits diminish when a workload cannot fit into available RAM memory.
- It can't resume a crashed job as easily as Hadoop, since the job's progress is stored in RAM, and therefore is more volatile.
- Its standards for security are not yet as developed as those of Hadoop, except when Spark is operating as an interface to a Hadoop HDFS file system, and can inherit the stronger security measures of its 'predecessor'.
Though it can operate independently of Hadoop, Spark is still most commonly used as a user-friendly GUI for the Hadoop ecostructure, and is frequently deployed in tandem with other Hadoop-related tools and frameworks.
A basic installation of Hadoop makes no provision for integrating the results of previous workloads into the current workload, a modeling concept known as 'shared state'. Running multiple data processing workloads using shared state allows consecutive jobs to drill down to more granular and meaningful results, without needing to increase the available total resources of the cluster.
The Apache HBase database schema provides Hadoop with a useful 'memory' to apply against subsequent data analysis runs.
HBase is designed for real-time interaction with billions of data points, allowing the user to run less burdensome or time-consuming queries against large HDFS data lakes. However, HBase analyses run against a 'summarized' or more superficial strata of extracted data. Therefore, the quality of the results will depend on the integrity and relevance of the slower MapReduce queries that are initially extracted.
Though it is written in Java as a non-relational distributed database, end-users that prefer to control data queries via SQL can access HBase via the Apache Phoenix database engine.
The Apache Ambari project is yet another initiative to make Hadoop queries easier to develop and administrate. It provides a dashboard with a metrics system and an alert framework, as well as a friendlier way to configure Hadoop cluster services.
Conceived as a facet of the Apache Lucene search engine in 2008, Mahout is now a top-level Apache project that provides scalable data mining and applicable machine learning libraries for Hadoop.
As some indicator of its relationship to Hadoop, Mahout was named after a Hindi term for someone riding on top of an elephant. Nonetheless, it is capable of running on non-Hadoop clusters, and began to demote MapReduce to a legacy technology around 2014.
The launch of the 0.10x release a year later established Mahout as a programming language in its own right, and one leaning toward the growing industry take-up for Spark.
Other Tools and Challengers of Hadoop
There are too many other offshoots, interpretive software layers and putative replacements for Hadoop to examine here in any detail. From Apache itself, these include Storm (a real-time Clojure-based system similar to MapReduce in many ways), Mesos (a cluster management system used by Apple, eBay, and Verizon, among others), Flink (a highly fault-tolerant stream-processing framework), and Hive (an SQL-style querying system for Hadoop).
From other sources, there is Pachyderm (which uses Docker to simplify MapReduce processing), BigQuery (Google's highly scalable big data analytics powerhouse), and Ceph (an object-based parallel file-system which attempts to address some of Hadoop's technical limitations).
The (Arguable) Decline of Hadoop
Hadoop was conceived at the tail end of the peak era of on-premises, 'hand-spun' corporate and civic data centers. Cloud computing, lower RAM costs, newer architectures and the evolution of high-speed connectivity has removed or reduced several of the problems that Hadoop was designed to solve when first released. Consequently, articles heralding the demise of Hadoop are by now an annual feature in tech and business media outlets.
However, nothing has come along in its wake that replaces Hadoop outright. Rather, the project has developed a growing and profitable ecostructure of programs that either rely on some or other facet of the Hadoop paradigm or have built additional or supporting functionality over the architectural foundations of Hadoop.
A further argument for Hadoop's future is its incumbent position and the high level of investment that has already been expended on multi-million-dollar analytical systems that continue to power customer-centric business operations, such as omnichannel marketing, and serve end-users.
Indeed, most of Hadoop's 'rivals' issue from the same stable and attempt to redress its shortcomings (such as accessibility, latency and efficiency) rather than approaching the same problems from scratch with more up-to-date technologies.
Just as many of the world's major cities are built over Victorian (or older) infrastructure that's more economical to maintain than to replace, Hadoop's early lead in big data analytics has embedded it so deeply in the sector that it is, by now, difficult to usurp—no matter how much the tech landscape that defined it has changed in the last twenty years.
New architectures that seek to replace Hadoop outright must challenge a long-established incumbent with a huge install-base of historically profitable data mining pipelines. Other would-be successors have cherry-picked sections of Hadoop's core (such as HDFS) and remain at least partly dependent on that downstream code base.
Others yet are trying to accomplish the same results in a similar manner to Hadoop but to do it faster with better (and more expensive) resources, rather than by applying a novel and imaginative use of the same limited resources that Hadoop sought to address. It could be argued that this is a glib consumer choice rather than a genuine evolutionary leap.
So Hadoop remains a powerful open-source solution to data mining and analysis. The more tools that emerge to make it increasingly accessible and manageable, the more irrelevant its initial complexity and opacity become. By now, Hadoop's core is so buried in the stack of a typical big data analysis project that the average user is often completely shielded from its inner workings.
Reach out to Iflexion’s consultants to discuss which.
Your organization’s data is dirty and damaging, unless you’ve cleaned it recently. Learn about some data cleaning techniques that every organization can employ.
Find out about the ecommerce business intelligence tools to keep an eye on in 2020, and learn more about their features.
Every day, we send 294 billion emails, 500 million tweets, and 65 billion messages over WhatsApp. What can ecommerce organizations do with that rising sea of information? We have some answers.
WANT TO START A PROJECT?