Elastic MapReduce (EMR)
EC2 instances for big data processing, based on the Hadoop framework
Essentials
Simplifies running big data frameworks (Hadoop, Spark) on AWS
Processing of data in large batches
A managed service for running Hadoop clusters on EC2
EMR is used to analyze and process vast amounts of data
Features and Advantages
S3 for storage
Load into HDFS or keep in S3 (EMRFS)
Transient clusters
Can also be persistent
Steps (tasks) to be completed are defined up front; the cluster terminates when they finish
Saves a lot of money, since you pay only while the cluster is running
Spot Instances
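The transient pattern above can be sketched as the request parameters one might pass to boto3's `emr.run_job_flow`. This is a sketch under assumptions: the cluster name, S3 path, and instance types are hypothetical. `KeepJobFlowAliveWhenNoSteps: False` is what makes the cluster transient, and the task group bids on Spot capacity.

```python
# Sketch of a transient EMR cluster request (hypothetical names throughout).
# Would be passed to boto3 as: boto3.client("emr").run_job_flow(**request)
request = {
    "Name": "nightly-batch",               # hypothetical cluster name
    "ReleaseLabel": "emr-6.15.0",          # example EMR release
    "Applications": [{"Name": "Spark"}],
    "Instances": {
        "InstanceGroups": [
            {"InstanceRole": "MASTER", "InstanceType": "m5.xlarge", "InstanceCount": 1},
            {"InstanceRole": "CORE",   "InstanceType": "m5.xlarge", "InstanceCount": 2},
            # Task nodes on Spot Instances to save money
            {"InstanceRole": "TASK",   "InstanceType": "m5.xlarge", "InstanceCount": 4,
             "Market": "SPOT"},
        ],
        # False => transient: the cluster terminates once all steps complete
        "KeepJobFlowAliveWhenNoSteps": False,
    },
    "Steps": [
        {
            "Name": "process-batch",
            "ActionOnFailure": "TERMINATE_CLUSTER",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": ["spark-submit", "s3://my-bucket/jobs/batch.py"],  # hypothetical path
            },
        }
    ],
    "JobFlowRole": "EMR_EC2_DefaultRole",
    "ServiceRole": "EMR_DefaultRole",
}
```

Setting `KeepJobFlowAliveWhenNoSteps` to `True` instead would give the persistent variant.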
Bootstrapping
Custom configuration applied while the cluster is launching
Preconfigured application frameworks
Scalability (add and remove core and task nodes)
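A bootstrap action is a script in S3 that EMR runs on every node during launch; a minimal sketch of the spec, assuming a hypothetical bucket and script name:

```python
# Sketch of a BootstrapActions entry for run_job_flow (hypothetical S3 path).
# EMR executes the script on each node while the cluster is launching.
bootstrap_actions = [
    {
        "Name": "install-dependencies",
        "ScriptBootstrapAction": {
            "Path": "s3://my-bucket/bootstrap/install.sh",  # hypothetical script
            "Args": ["--with-extras"],                      # hypothetical flag
        },
    }
]
```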
Master Node
Node that manages the cluster by running software components which coordinate the distribution of data and tasks among other (slave) nodes for processing
The master node tracks the status of tasks and monitors the health of the cluster
SSH to the master node is permitted to run interactive jobs
Slave Nodes
Two Types
Core Node
A slave node that has software components which run tasks AND store data in the Hadoop Distributed File System (HDFS) on your cluster
The core nodes do the "heavy lifting" with the data
Task Node
A slave node that has software components which only run tasks
Task nodes are optional
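Because task nodes run tasks but hold no HDFS data, they are the safe group to resize. A sketch of the parameters one might pass to boto3's `emr.modify_instance_groups`, with a hypothetical instance-group ID:

```python
# Sketch of resizing a task instance group (ID is hypothetical).
# Would be passed to boto3 as: boto3.client("emr").modify_instance_groups(**resize)
resize = {
    "InstanceGroups": [
        {
            "InstanceGroupId": "ig-TASKGROUP123",  # hypothetical task group ID
            "InstanceCount": 10,                   # scale task nodes out to 10
        }
    ]
}
```

Shrinking a core group is riskier, since HDFS blocks stored on removed core nodes must be rebalanced first.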
Mapping Job
Map Phase
Mapping is a function that defines the processing of data for each split
The HDFS block size is 128 MB, which is also the optimal split size.
The larger the instance size used in the EMR cluster, the more chunks you can map and process in parallel.
If there are more chunks than nodes/mappers, the chunks will queue for processing
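The split arithmetic above can be sketched in a few lines, assuming the default 128 MB split size:

```python
import math

BLOCK_SIZE_MB = 128  # default HDFS block size, used as the split size

def split_count(file_size_mb: int) -> int:
    """Number of input splits (chunks) for a file of the given size."""
    return math.ceil(file_size_mb / BLOCK_SIZE_MB)

def map_waves(file_size_mb: int, mappers: int) -> int:
    """If there are more chunks than mappers, chunks queue; this is the
    number of sequential 'waves' needed to process them all."""
    return math.ceil(split_count(file_size_mb) / mappers)

# A 1 GB (1024 MB) input yields 8 splits; with 3 mappers it takes 3 waves.
print(split_count(1024), map_waves(1024, 3))  # -> 8 3
```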
Reduce Phase
Reducing is a function that aggregates the data into a single output file
Reduced output needs to be stored externally (e.g., in S3), since data held on the EMR cluster's HDFS is lost when the cluster terminates.
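The two phases can be sketched with a toy word count, the canonical MapReduce example (pure Python, no Hadoop involved):

```python
from collections import defaultdict

def map_phase(split: str):
    """Map: emit a (word, 1) pair for every word in one input split."""
    return [(word, 1) for word in split.split()]

def reduce_phase(pairs):
    """Reduce: aggregate all emitted pairs into a single output (word -> count)."""
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

# Each split is mapped independently (in parallel on the cluster)...
splits = ["big data on emr", "big data pipelines"]
mapped = [pair for split in splits for pair in map_phase(split)]
# ...then the reducer aggregates everything into one result.
print(reduce_phase(mapped))  # {'big': 2, 'data': 2, 'on': 1, 'emr': 1, 'pipelines': 1}
```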