Elastic MapReduce (EMR)

EC2 instances for big data processing, based on the Hadoop framework

Essentials

  • Simplifies running big data frameworks (Hadoop, Spark) on AWS

  • Processing of data in large batches

  • A managed service for running Hadoop clusters on EC2

  • EMR is used to analyze and process vast amounts of data

Features and Advantages

  • S3 for storage

    • Load into HDFS or keep in S3 (EMRFS)

  • Transient clusters

    • Can also be persistent

    • The steps or tasks to be completed are defined; the cluster terminates automatically once they finish

    • Saves money, since you pay only while the cluster is running

  • Spot Instances

  • Bootstrapping

    • Customized configuration while launching

  • Preconfigured application frameworks

  • Scalability (add and remove core and task nodes)
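Several of these features come together when launching a transient cluster. As a sketch, the dict below follows the request shape that boto3's EMR `run_job_flow` call accepts; all names, bucket paths, and instance types are hypothetical placeholders, and the dict is only built and inspected here, never sent to AWS:

```python
# Sketch of parameters for a transient EMR cluster, in the shape accepted
# by boto3's EMR client run_job_flow call. Buckets and scripts are made up.
transient_cluster = {
    "Name": "nightly-batch",
    "ReleaseLabel": "emr-6.15.0",
    "Applications": [{"Name": "Hadoop"}, {"Name": "Spark"}],  # preconfigured frameworks
    "LogUri": "s3://my-logs-bucket/emr/",                     # hypothetical bucket
    "JobFlowRole": "EMR_EC2_DefaultRole",
    "ServiceRole": "EMR_DefaultRole",
    "Instances": {
        "MasterInstanceType": "m5.xlarge",
        "SlaveInstanceType": "m5.xlarge",
        "InstanceCount": 3,
        # Transient: the cluster terminates once all steps complete.
        "KeepJobFlowAliveWhenNoSteps": False,
    },
    "BootstrapActions": [{                  # customized configuration at launch
        "Name": "install-deps",
        "ScriptBootstrapAction": {"Path": "s3://my-scripts/bootstrap.sh"},
    }],
    "Steps": [{                             # the work to run before terminating
        "Name": "spark-etl",
        "ActionOnFailure": "TERMINATE_CLUSTER",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["spark-submit", "s3://my-scripts/etl.py"],
        },
    }],
}

# The single flag that makes the cluster transient rather than persistent:
print(transient_cluster["Instances"]["KeepJobFlowAliveWhenNoSteps"])  # → False
```

Setting `KeepJobFlowAliveWhenNoSteps` to `True` instead would keep the cluster running as a persistent cluster after the steps finish.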

Master Node

  • Node that manages the cluster by running software components which coordinate the distribution of data and tasks among other (slave) nodes for processing

  • The master node tracks the status of tasks and monitors the health of the cluster

  • SSH access to the master node is permitted, allowing you to run interactive jobs

Slave Nodes

Two Types

  • Core Node

    • A slave node with software components which run tasks AND store data in the Hadoop Distributed File System (HDFS) on your cluster

    • The core nodes do the "heavy lifting" with the data

  • Task Node

    • A slave node that has software components which only run tasks

    • Task nodes are optional
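The two slave-node types map onto EMR's instance-group configuration (the `InstanceGroups` form that boto3's `run_job_flow` accepts). This is only a sketch; the group names, instance types, counts, and bid price are hypothetical:

```python
# Sketch of an EMR InstanceGroups config: one master, HDFS-bearing core
# nodes, and optional task nodes running on Spot. Values are made up.
instance_groups = [
    {"Name": "master", "InstanceRole": "MASTER",
     "InstanceType": "m5.xlarge", "InstanceCount": 1},
    {"Name": "core", "InstanceRole": "CORE",      # run tasks AND store HDFS data
     "InstanceType": "m5.xlarge", "InstanceCount": 2},
    {"Name": "task", "InstanceRole": "TASK",      # run tasks only; optional
     "InstanceType": "m5.xlarge", "InstanceCount": 4,
     "Market": "SPOT", "BidPrice": "0.10"},       # hypothetical bid price
]

# Only the task group uses Spot capacity here.
spot_roles = [g["InstanceRole"] for g in instance_groups if g.get("Market") == "SPOT"]
print(spot_roles)  # → ['TASK']
```

Because task nodes hold no HDFS data, they are the natural place for Spot capacity: a Spot interruption slows the job down but cannot lose stored blocks.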

Mapping Job

Map Phase

  • Mapping is a function that defines the processing of data for each split

  • The default HDFS block size is 128 MB, which is also the optimal split size.

  • The larger the instance size used in the EMR cluster, the more chunks can be mapped and processed in parallel.

  • If there are more chunks than nodes/mappers, the chunks will queue for processing
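A minimal word-count mapper in plain Python illustrates the map function that is applied to each split (the sample lines are made up; a real Hadoop Streaming mapper would read its split from stdin and print tab-separated pairs):

```python
def mapper(lines):
    """Map function: emit a (word, 1) pair for every word in the split."""
    for line in lines:
        for word in line.strip().split():
            yield word.lower(), 1

# Each mapper handles one split. With 128 MB splits, a 1 GB input file
# produces 1024 / 128 = 8 splits, so up to 8 mappers can run in parallel;
# any splits beyond the available mappers queue for processing.
pairs = list(mapper(["the quick fox", "the fox"]))
print(pairs)  # → [('the', 1), ('quick', 1), ('fox', 1), ('the', 1), ('fox', 1)]
```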

Reduce Phase

  • Reducing is a function that aggregates the data into a single output file

  • Reduced data needs to be stored externally (for example, in S3), as data processed by the EMR cluster is not persistent.
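Continuing the word-count sketch from the map phase, the reduce phase can be illustrated with a reducer that aggregates (word, count) pairs into totals; in a real job this output would then be written to S3, since cluster storage is not persistent:

```python
from itertools import groupby
from operator import itemgetter

def reducer(pairs):
    """Reduce function: sum the counts for each word.
    Hadoop delivers each reducer's input sorted by key, which sorted()
    emulates here for the local sketch."""
    for word, group in groupby(sorted(pairs), key=itemgetter(0)):
        yield word, sum(count for _, count in group)

# Made-up mapper output for illustration.
pairs = [("fox", 1), ("the", 1), ("the", 1)]
totals = dict(reducer(pairs))
print(totals)  # → {'fox': 1, 'the': 2}
```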
