
Hadoop and Spark Basics

Updated May 2026

Apache Hadoop and Apache Spark are the two most influential open-source frameworks for processing big data across clusters of computers. Hadoop pioneered the approach of distributing both storage and computation across commodity hardware, while Spark built on that foundation with dramatically faster in-memory processing. Together, they form the backbone of most large-scale data processing systems in both industry and scientific research.

Apache Hadoop: The Foundation

Hadoop emerged from a pair of research papers published by Google in 2003 and 2004 describing the Google File System and the MapReduce programming model. Doug Cutting and Mike Cafarella created the open-source implementation while working on the Nutch web search project, naming it after a toy elephant belonging to Cutting's son. Hadoop was promoted to a top-level Apache Software Foundation project in 2008 and quickly became the standard platform for big data processing.

Hadoop consists of four core modules: Hadoop Common, HDFS, YARN, and MapReduce. Hadoop Common provides the shared libraries and utilities that the other modules depend on. The Hadoop Distributed File System, known as HDFS, stores data across a cluster by splitting files into large blocks, typically 128 megabytes each, and replicating each block across multiple machines for fault tolerance. The default replication factor is three, meaning every block exists on three different servers. If one server fails, the data remains available from the other copies, and the system automatically creates a new replica to restore the target replication count.
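
The block and replica arithmetic can be made concrete with a small Python sketch. It is only an illustration: the node names, the function names, and the round-robin placement are assumptions made for the example, and real HDFS uses a rack-aware placement policy rather than this one.

```python
import math

BLOCK_SIZE_MB = 128   # typical HDFS block size
REPLICATION = 3       # default replication factor


def plan_blocks(file_size_mb, nodes, block_size_mb=BLOCK_SIZE_MB, replication=REPLICATION):
    """Return a list of (block_index, [nodes holding a replica])."""
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    num_blocks = math.ceil(file_size_mb / block_size_mb)
    placement = []
    for block in range(num_blocks):
        # Round-robin placement: a toy policy, not HDFS's rack-aware one.
        replicas = [nodes[(block + r) % len(nodes)] for r in range(replication)]
        placement.append((block, replicas))
    return placement


def surviving_copies(placement, failed_node):
    """Count the replicas of each block still reachable after one node fails."""
    return {block: sum(1 for node in replicas if node != failed_node)
            for block, replicas in placement}


# Hypothetical five-node cluster storing a 1000 MB file.
nodes = ["node-a", "node-b", "node-c", "node-d", "node-e"]
placement = plan_blocks(file_size_mb=1000, nodes=nodes)
survivors = surviving_copies(placement, "node-b")

print(len(placement))            # blocks needed for the file
print(min(survivors.values()))   # fewest copies left for any block
```

With these numbers the file needs 8 blocks, and every block still has at least two reachable copies after a single node failure, which is what lets the system re-create the missing replica in the background.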

YARN, which stands for Yet Another Resource Negotiator, manages cluster resources and schedules applications. It replaced the original MapReduce-only job scheduler in Hadoop 2, making it possible to run different types of processing frameworks on the same cluster. A central ResourceManager allocates memory and CPU cores to applications, while NodeManagers on each machine monitor resource usage and report back to the ResourceManager.
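
As a rough illustration of the allocation idea, the sketch below grants containers on the first node with enough free memory and cores. The Node class, the allocate function, and the first-fit rule are assumptions made for the example; YARN's actual schedulers are considerably more elaborate.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    free_memory_mb: int
    free_cores: int


def allocate(nodes, memory_mb, cores):
    """Grant a container on the first node with enough free memory and cores."""
    for node in nodes:
        if node.free_memory_mb >= memory_mb and node.free_cores >= cores:
            node.free_memory_mb -= memory_mb
            node.free_cores -= cores
            return node.name
    return None  # the request has to wait until resources are released


# Hypothetical two-node cluster shared by different frameworks.
cluster = [Node("node-a", 8192, 4), Node("node-b", 16384, 8)]
grants = [
    allocate(cluster, memory_mb=6144, cores=2),   # e.g. a MapReduce task
    allocate(cluster, memory_mb=6144, cores=2),   # e.g. a Spark executor
    allocate(cluster, memory_mb=12288, cores=4),  # larger than any node has left
]
print(grants)
```

The third request returns None because no single node has 12288 MB free once the first two containers are running, even though the cluster as a whole does.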

MapReduce is the original processing engine for Hadoop. It expresses computation in two phases: a map phase that processes input records in parallel and produces intermediate key-value pairs, and a reduce phase that groups intermediate results by key and combines them into final output. Between these phases, a shuffle step sorts and transfers data across the network so that all values for the same key arrive at the same reducer. While powerful, MapReduce writes intermediate results to disk after every step, which makes it slow for iterative algorithms that need to pass over the same data multiple times.
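
The three steps can be sketched in plain Python with the classic word count example. This runs on a single machine and only mirrors the shape of the model; the function names are chosen for the illustration and are not part of the Hadoop API.

```python
from collections import defaultdict


def map_phase(record):
    """Emit an intermediate (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in record.split()]


def shuffle(pairs):
    """Group intermediate values by key and sort the keys, as the shuffle step does."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(sorted(grouped.items()))


def reduce_phase(key, values):
    """Combine all values that arrived for one key into a final result."""
    return key, sum(values)


def word_count(records):
    intermediate = [pair for record in records for pair in map_phase(record)]
    grouped = shuffle(intermediate)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


lines = ["spark and hadoop", "hadoop stores data", "spark caches data"]
counts = word_count(lines)
print(counts)
```

In a real cluster the map calls run in parallel on different machines, and the grouped values for each key are sent over the network to whichever reducer owns that key.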

Apache Spark: Speed Through Memory

Apache Spark was developed at UC Berkeley's AMPLab starting in 2009 and became a top-level Apache project in 2014. Its key innovation is the Resilient Distributed Dataset, or RDD, a fault-tolerant collection of data distributed across cluster nodes that can be operated on in parallel. Unlike Hadoop MapReduce, which writes results to disk between every computation step, Spark keeps data in memory as much as possible. For iterative algorithms common in machine learning and graph analysis, this approach can deliver performance improvements of 10 to 100 times compared to MapReduce.
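
The benefit for iterative work can be illustrated by counting how often the data has to be fetched from storage. The sketch below is a toy model in plain Python, not Spark code: CountingSource and mean_by_gradient_descent are hypothetical names, and the "storage" is just a list.

```python
class CountingSource:
    """Pretend storage layer that counts how often the data is read."""

    def __init__(self, values):
        self._values = list(values)
        self.reads = 0

    def read(self):
        self.reads += 1
        return list(self._values)


def mean_by_gradient_descent(load, iterations=50, step=0.1):
    """Estimate the mean by repeatedly passing over the data."""
    estimate = 0.0
    for _ in range(iterations):
        data = load()
        gradient = sum(estimate - x for x in data) / len(data)
        estimate -= step * gradient
    return estimate


# Disk-style: every iteration goes back to storage.
uncached = CountingSource([2.0, 4.0, 6.0])
mean_by_gradient_descent(uncached.read)

# Cache-style: read once, then reuse the in-memory copy on every pass.
cached = CountingSource([2.0, 4.0, 6.0])
in_memory = cached.read()
estimate = mean_by_gradient_descent(lambda: in_memory)

print(uncached.reads, cached.reads)
```

Fifty iterations cost fifty reads in the first case and one read in the second. On a real cluster each avoided read is a pass over disk or network, which is where the speedup for iterative algorithms comes from.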

Spark organizes computation as a directed acyclic graph, or DAG, of transformations. Rather than executing each operation immediately, Spark builds up a plan of transformations and then optimizes the entire pipeline before executing it. This lazy evaluation strategy allows the query optimizer, called Catalyst in Spark SQL, to rearrange operations, combine redundant steps, and choose efficient execution strategies that would be impossible if operations were executed one at a time.
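
Lazy evaluation itself is easy to demonstrate without Spark. The LazyDataset class below is a hypothetical stand-in, not Spark's API: transformations only record a plan, and nothing runs until the collect action is called. It shows deferral and single-pass execution only, not the kind of plan rewriting Catalyst performs.

```python
class LazyDataset:
    """Toy lazily evaluated dataset (hypothetical, not Spark's API)."""

    def __init__(self, source, plan=()):
        self._source = source
        self._plan = plan  # tuple of ("map" | "filter", function) steps

    def map(self, fn):
        return LazyDataset(self._source, self._plan + (("map", fn),))

    def filter(self, fn):
        return LazyDataset(self._source, self._plan + (("filter", fn),))

    def collect(self):
        """Action: run the whole recorded plan in a single pass over the source."""
        out = []
        for item in self._source:
            keep = True
            for kind, fn in self._plan:
                if kind == "map":
                    item = fn(item)
                elif not fn(item):
                    keep = False
                    break
            if keep:
                out.append(item)
        return out


calls = []


def square(x):
    calls.append(x)  # record each invocation so the deferral is visible
    return x * x


pipeline = LazyDataset(range(10)).filter(lambda x: x % 2 == 0).map(square)
calls_before = list(calls)   # still empty: only a plan exists so far
result = pipeline.collect()

print(calls_before)
print(result)
```

Because the full plan is known before anything executes, the filter and the map run together in one pass and square is never called on values the filter rejects.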

The Spark ecosystem includes several specialized libraries built on the same core engine. Spark SQL provides structured data processing with a SQL interface and DataFrames, which are distributed tables with named columns and type information. MLlib offers a library of machine learning algorithms optimized for distributed execution, including classification, regression, clustering, and collaborative filtering. GraphX supports graph-parallel computation for social network analysis, recommendation systems, and other graph-structured problems. Spark Streaming enables processing of continuous data streams using a micro-batch approach that divides incoming data into small batches and processes each batch using the standard Spark engine.
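
The micro-batch idea can be sketched in a few lines of Python. One simplifying assumption to note: this example cuts batches by event count, whereas Spark Streaming cuts them by time interval. The function names and sample readings are made up for the illustration.

```python
from itertools import islice


def micro_batches(stream, batch_size):
    """Yield successive lists of up to batch_size events from a stream."""
    iterator = iter(stream)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def process_batch(batch):
    """Stand-in for a batch job: here, simply the sum of the readings."""
    return sum(batch)


readings = [3, 1, 4, 1, 5, 9, 2, 6]   # hypothetical sensor values
results = [process_batch(batch) for batch in micro_batches(readings, batch_size=3)]
print(results)
```

Each batch is handled by ordinary batch logic, so the same code path serves both historical and incoming data. The price is latency: an event waits until its batch is complete before it is processed.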

Spark can run on multiple cluster managers including YARN, Kubernetes, its own standalone cluster manager, and Apache Mesos, although Mesos support has been deprecated since Spark 3.2. It can read data from HDFS, Amazon S3, Apache Cassandra, Apache HBase, and many other storage systems. This flexibility means Spark can be deployed alongside existing Hadoop infrastructure or as a standalone system without Hadoop at all.

Hadoop vs. Spark: When to Use Each

Hadoop MapReduce excels at large-scale batch processing where fault tolerance and reliability are more important than speed. It handles jobs that process petabytes of data reliably, even on clusters with thousands of nodes where hardware failures are frequent. Because it writes intermediate results to disk, MapReduce uses less memory per node, which can be advantageous when processing volumes are so large that data cannot fit in memory even across a large cluster.

Spark is the better choice for iterative algorithms, interactive data exploration, and workloads that benefit from in-memory processing. Machine learning training, which typically passes over the same dataset dozens or hundreds of times, sees the most dramatic improvements from Spark's approach. Interactive analysis, where a data scientist runs queries against a dataset and refines their approach based on the results, also benefits from Spark's ability to cache data in memory between queries.

In practice, many organizations run both systems on the same cluster. YARN can manage resources for Hadoop MapReduce jobs and Spark applications simultaneously, with each framework handling the workloads it is best suited for. Data stored in HDFS is accessible to both systems, so there is no need to duplicate storage.

The trend in recent years has been strongly toward Spark. Most new big data projects choose Spark as their primary processing engine, and many organizations are migrating legacy MapReduce jobs to Spark for better performance and a more productive programming model. However, MapReduce remains in production at many large organizations where its proven reliability and lower memory requirements make migration unnecessary.

Hadoop and Spark in Scientific Research

Bioinformatics has been an early adopter of both frameworks. The Genome Analysis Toolkit, developed by the Broad Institute, uses Spark through its GATK4 release to parallelize variant calling across distributed clusters. Processing a whole genome sequencing dataset that might take 24 hours on a single workstation can be completed in under an hour on a moderately sized Spark cluster. The ADAM project provides a set of genomics tools specifically designed for Spark, offering distributed implementations of common operations like read alignment and variant annotation.

Astronomy uses Hadoop and Spark for processing the massive imaging datasets produced by modern telescopes. The Zwicky Transient Facility can generate up to roughly a million alerts per night, each representing a potential astronomical event detected by comparing new images against reference images. A Spark-based pipeline filters, classifies, and distributes these alerts to astronomers worldwide within minutes of image capture.

Environmental science relies on the same style of distributed processing for satellite imagery and sensor network data. Google Earth Engine, while using proprietary infrastructure rather than Hadoop or Spark directly, implements similar distributed processing principles to let researchers analyze the entire Landsat archive, which spans more than 50 years of global imagery totaling multiple petabytes.

Particle physics processes data from the Large Hadron Collider using a combination of custom frameworks and increasingly Spark-based tools. The CERN physics community has developed Spark-based analysis frameworks that allow physicists to interactively explore datasets containing billions of collision events, a capability that was previously available only through batch processing jobs that could take hours or days to complete.

The Evolving Ecosystem

The big data processing landscape continues to evolve beyond the original Hadoop and Spark paradigms. Apache Flink provides true streaming processing rather than Spark's micro-batch approach, offering lower latency for applications that need to react to events within milliseconds. Apache Beam provides a unified programming model that can run on multiple execution engines including Spark, Flink, and Google Cloud Dataflow, allowing developers to write processing logic once and deploy it on different platforms.

Cloud-managed services have simplified deployment significantly. Amazon EMR, Google Dataproc, and Azure HDInsight provide Hadoop and Spark clusters that can be launched in minutes and scaled up or down based on workload. Serverless options such as AWS Glue and the serverless compute offered by Databricks remove most cluster management, allowing users to submit processing jobs without provisioning the underlying infrastructure.

The lakehouse architecture, popularized by Databricks with Delta Lake and by the Apache Iceberg and Apache Hudi projects, combines the flexibility of data lakes with the structure and performance of data warehouses. These table formats are most often used with Spark, though other engines can read and write them as well, and they add transaction support, schema enforcement, and time travel capabilities to data stored in open formats on cloud object storage. This evolution represents the next generation of the Hadoop ecosystem, retaining the principles of distributed processing while addressing limitations of the original architecture.
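
Schema enforcement and time travel both follow from keeping an ordered log of commits. The VersionedTable class below is a hypothetical, in-memory toy that shows the idea. It is not the on-disk format or API of Delta Lake, Iceberg, or Hudi, and the column names are invented for the example.

```python
class VersionedTable:
    """Toy append-only table with a commit log (not a real table format)."""

    def __init__(self, columns):
        self._columns = set(columns)
        self._commits = []  # commit i holds the rows added in version i + 1

    def append(self, rows):
        rows = list(rows)
        for row in rows:
            if set(row) != self._columns:
                # Schema enforcement: reject the whole commit, table unchanged.
                raise ValueError(f"row {row} does not match columns {sorted(self._columns)}")
        self._commits.append(rows)
        return len(self._commits)  # the new version number

    def read(self, version=None):
        """Return rows as of a version; the latest version by default."""
        upto = len(self._commits) if version is None else version
        return [row for commit in self._commits[:upto] for row in commit]


table = VersionedTable(["site", "temp_c"])
v1 = table.append([{"site": "A", "temp_c": 11.5}])
v2 = table.append([{"site": "B", "temp_c": 9.0}, {"site": "C", "temp_c": 14.2}])

try:
    table.append([{"site": "D"}])   # missing a column, so the commit is refused
    rejected = False
except ValueError:
    rejected = True

print(len(table.read()), len(table.read(version=v1)), rejected)
```

Reading at version v1 returns the table as it stood after the first commit, and the malformed commit leaves no partial data behind because the rows are validated before anything is written to the log.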

Key Takeaway

Hadoop pioneered distributed big data processing by combining distributed storage with the MapReduce model, while Spark dramatically improved performance through in-memory computation. Most scientific big data projects today use Spark for processing, often running on infrastructure that still relies on Hadoop's storage and resource management foundations.
