Elastic MapReduce (EMR)

EC2 instances for big data processing, based on the Hadoop framework

Essentials

  • Simplifies running big data frameworks (Hadoop, Spark) on AWS

  • Processing of data in large batches

  • A managed service for running Hadoop clusters on EC2

  • EMR is used to analyze and process vast amounts of data

Features and Advantages

  • S3 for storage

    • Load into HDFS or keep in S3 (EMRFS)

  • Transient clusters

    • Can also be persistent

    • The steps or tasks to be completed are defined; the cluster terminates automatically once they finish

    • Saves money, since you pay only while the cluster is running

  • Spot Instances

  • Bootstrapping

    • Customized configuration while launching

  • Preconfigured application frameworks

  • Scalability (add and remove core and task nodes)
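Several of these features come together when launching a transient cluster. As a sketch, the dict below follows the request shape that boto3's EMR `run_job_flow` call accepts; all names, bucket paths, and instance types are hypothetical placeholders, and the dict is only built and inspected here, never sent to AWS:

```python
# Sketch of parameters for a transient EMR cluster, in the shape accepted
# by boto3's EMR client run_job_flow call. Buckets and scripts are made up.
transient_cluster = {
    "Name": "nightly-batch",
    "ReleaseLabel": "emr-6.15.0",
    "Applications": [{"Name": "Hadoop"}, {"Name": "Spark"}],  # preconfigured frameworks
    "LogUri": "s3://my-logs-bucket/emr/",                     # hypothetical bucket
    "JobFlowRole": "EMR_EC2_DefaultRole",
    "ServiceRole": "EMR_DefaultRole",
    "Instances": {
        "MasterInstanceType": "m5.xlarge",
        "SlaveInstanceType": "m5.xlarge",
        "InstanceCount": 3,
        # Transient: the cluster terminates once all steps complete.
        "KeepJobFlowAliveWhenNoSteps": False,
    },
    "BootstrapActions": [{                  # customized configuration at launch
        "Name": "install-deps",
        "ScriptBootstrapAction": {"Path": "s3://my-scripts/bootstrap.sh"},
    }],
    "Steps": [{                             # the work to run before terminating
        "Name": "spark-etl",
        "ActionOnFailure": "TERMINATE_CLUSTER",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["spark-submit", "s3://my-scripts/etl.py"],
        },
    }],
}

# The single flag that makes the cluster transient rather than persistent:
print(transient_cluster["Instances"]["KeepJobFlowAliveWhenNoSteps"])  # → False
```

Setting `KeepJobFlowAliveWhenNoSteps` to `True` instead would keep the cluster running as a persistent cluster after the steps finish.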

Master Node

  • Node that manages the cluster by running software components which coordinate the distribution of data and tasks among other (slave) nodes for processing

  • The master node tracks the status of tasks and monitors the health of the cluster

  • SSH access to the master node is permitted, allowing you to run interactive jobs

Slave Nodes

Two Types

  • Core Node

    • A slave node with software components which run tasks AND store data in the Hadoop Distributed File System (HDFS) on your cluster

    • The core nodes do the "heavy lifting" with the data

  • Task Node

    • A slave node that has software components which only run tasks

    • Task nodes are optional
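The two slave-node types map onto EMR's instance-group configuration (the `InstanceGroups` form that boto3's `run_job_flow` accepts). This is only a sketch; the group names, instance types, counts, and bid price are hypothetical:

```python
# Sketch of an EMR InstanceGroups config: one master, HDFS-bearing core
# nodes, and optional task nodes running on Spot. Values are made up.
instance_groups = [
    {"Name": "master", "InstanceRole": "MASTER",
     "InstanceType": "m5.xlarge", "InstanceCount": 1},
    {"Name": "core", "InstanceRole": "CORE",      # run tasks AND store HDFS data
     "InstanceType": "m5.xlarge", "InstanceCount": 2},
    {"Name": "task", "InstanceRole": "TASK",      # run tasks only; optional
     "InstanceType": "m5.xlarge", "InstanceCount": 4,
     "Market": "SPOT", "BidPrice": "0.10"},       # hypothetical bid price
]

# Only the task group uses Spot capacity here.
spot_roles = [g["InstanceRole"] for g in instance_groups if g.get("Market") == "SPOT"]
print(spot_roles)  # → ['TASK']
```

Because task nodes hold no HDFS data, they are the natural place for Spot capacity: a Spot interruption slows the job down but cannot lose stored blocks.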

Mapping Job

Map Phase

  • Mapping is a function that defines the processing of data for each split

  • The default HDFS block size is 128 MB, which is also the optimal split size.

  • The larger the instance size used in the EMR cluster, the more chunks can be mapped and processed in parallel.

  • If there are more chunks than nodes/mappers, the chunks will queue for processing
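A minimal word-count mapper in plain Python illustrates the map function that is applied to each split (the sample lines are made up; a real Hadoop Streaming mapper would read its split from stdin and print tab-separated pairs):

```python
def mapper(lines):
    """Map function: emit a (word, 1) pair for every word in the split."""
    for line in lines:
        for word in line.strip().split():
            yield word.lower(), 1

# Each mapper handles one split. With 128 MB splits, a 1 GB input file
# produces 1024 / 128 = 8 splits, so up to 8 mappers can run in parallel;
# any splits beyond the available mappers queue for processing.
pairs = list(mapper(["the quick fox", "the fox"]))
print(pairs)  # → [('the', 1), ('quick', 1), ('fox', 1), ('the', 1), ('fox', 1)]
```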

Reduce Phase

  • Reducing is a function that aggregates the data into a single output file

  • Reduced data needs to be stored externally (for example, in S3), as data processed by the EMR cluster is not persistent.
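Continuing the word-count sketch from the map phase, the reduce phase can be illustrated with a reducer that aggregates (word, count) pairs into totals; in a real job this output would then be written to S3, since cluster storage is not persistent:

```python
from itertools import groupby
from operator import itemgetter

def reducer(pairs):
    """Reduce function: sum the counts for each word.
    Hadoop delivers each reducer's input sorted by key, which sorted()
    emulates here for the local sketch."""
    for word, group in groupby(sorted(pairs), key=itemgetter(0)):
        yield word, sum(count for _, count in group)

# Made-up mapper output for illustration.
pairs = [("fox", 1), ("the", 1), ("the", 1)]
totals = dict(reducer(pairs))
print(totals)  # → {'fox': 1, 'the': 2}
```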
