# Elastic MapReduce (EMR)

## Essentials

* Simplifies running big data frameworks (Hadoop, Spark) on AWS
* Processing of data in large batches
* A managed service for running Hadoop clusters on EC2
* EMR is used to analyze and process vast amounts of data

### Features and Advantages

* S3 for storage
  * Load into HDFS or keep in S3 (EMRFS)
* Transient clusters
  * Can also be persistent
  * Define the steps (tasks) to run; the cluster terminates once they complete
  * Saves money because you pay only while the cluster is running
* Spot Instances
* Bootstrapping
  * Customized configuration while launching
* Preconfigured application frameworks
* Scalability (add and remove core and task nodes)
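
As a rough illustration of how a transient cluster, a bootstrap action, and a step fit together, the sketch below builds the keyword arguments for boto3's `run_job_flow` call. The cluster name, release label, instance types, bucket, and script paths are assumptions made up for this example, and the two IAM role names assume the default EMR roles already exist in the account.

```python
def build_transient_cluster_request(bucket, script_uri):
    """Build keyword arguments for the boto3 EMR client's run_job_flow call.

    All concrete values (names, release label, instance types, paths) are
    illustrative assumptions, not recommendations.
    """
    return {
        "Name": "nightly-batch",              # hypothetical cluster name
        "ReleaseLabel": "emr-6.15.0",         # assumed release; pick your own
        "Applications": [{"Name": "Spark"}],  # preconfigured framework
        "LogUri": f"s3://{bucket}/emr-logs/",
        "Instances": {
            "MasterInstanceType": "m5.xlarge",
            "SlaveInstanceType": "m5.xlarge",
            "InstanceCount": 3,
            # False makes the cluster transient: it terminates when no
            # steps are left to run.
            "KeepJobFlowAliveWhenNoSteps": False,
        },
        # Bootstrapping: a script that runs on each node at launch.
        "BootstrapActions": [
            {
                "Name": "install-deps",
                "ScriptBootstrapAction": {
                    "Path": f"s3://{bucket}/bootstrap/install.sh",  # hypothetical
                    "Args": [],
                },
            }
        ],
        # The work to be done before termination.
        "Steps": [
            {
                "Name": "spark-job",
                "ActionOnFailure": "TERMINATE_CLUSTER",
                "HadoopJarStep": {
                    "Jar": "command-runner.jar",
                    "Args": ["spark-submit", script_uri],
                },
            }
        ],
        "JobFlowRole": "EMR_EC2_DefaultRole",  # assumes default roles exist
        "ServiceRole": "EMR_DefaultRole",
    }


request = build_transient_cluster_request("my-bucket", "s3://my-bucket/jobs/etl.py")
print(request["Instances"]["KeepJobFlowAliveWhenNoSteps"])
```

With AWS credentials configured, the request could then be submitted with `boto3.client("emr").run_job_flow(**request)`; that call is left out here so the sketch runs without an AWS account.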

### Master Node

* Node that manages the cluster by running software components which coordinate the distribution of data and tasks among other (slave) nodes for processing
* The master node tracks the status of tasks and monitors the health of the cluster
* SSH to the master node is permitted to run interactive jobs

### Slave Nodes

Two Types

* Core Node
  * A slave node that has software components which run tasks AND store data in the Hadoop Distributed File System (HDFS) on your cluster
  * The core nodes do the "heavy lifting" with the data
* Task Node
  * A slave node that has software components which only run tasks
  * Task nodes are optional
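
The three node roles map onto the `InstanceGroups` structure accepted by boto3's `run_job_flow`. The sketch below is one possible layout, not a prescribed one: instance types and group names are assumptions, and placing task nodes on Spot is a common choice (task nodes hold no HDFS data, so losing one does not lose stored blocks) rather than a requirement.

```python
def build_instance_groups(core_count, task_count=0):
    """Return an InstanceGroups list with one master, N core, and optional task nodes.

    Instance types and group names are illustrative assumptions.
    """
    groups = [
        {
            "Name": "master",
            "InstanceRole": "MASTER",   # coordinates the cluster
            "Market": "ON_DEMAND",
            "InstanceType": "m5.xlarge",
            "InstanceCount": 1,
        },
        {
            "Name": "core",
            "InstanceRole": "CORE",     # runs tasks AND stores HDFS data
            "Market": "ON_DEMAND",
            "InstanceType": "m5.xlarge",
            "InstanceCount": core_count,
        },
    ]
    if task_count > 0:
        # Task nodes are optional, so the group is added only on request.
        groups.append(
            {
                "Name": "task",
                "InstanceRole": "TASK",  # runs tasks only, no HDFS storage
                "Market": "SPOT",
                "InstanceType": "m5.xlarge",
                "InstanceCount": task_count,
            }
        )
    return groups


for group in build_instance_groups(core_count=2, task_count=4):
    print(group["InstanceRole"], group["Market"], group["InstanceCount"])
```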

## Mapping Job

### Map Phase

* Mapping is a function that defines the processing of data for each split
* The default block size for HDFS is 128 MB, which is also the optimum split size
* The larger the instance size used in the EMR cluster, the more chunks can be mapped and processed at the same time
* If there are more chunks than nodes/mappers, the chunks will queue for processing
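
The arithmetic behind splits and queuing can be sketched as follows. This is a simplification: it assumes every split is exactly one 128 MB block and that each mapper handles one split at a time, and the function names are made up for this example.

```python
HDFS_BLOCK_SIZE = 128 * 1024 * 1024  # 128 MB in bytes


def count_splits(file_size_bytes, split_size=HDFS_BLOCK_SIZE):
    """Number of splits (chunks) needed to cover the file, rounding up."""
    return (file_size_bytes + split_size - 1) // split_size


def count_waves(num_splits, concurrent_mappers):
    """Number of rounds needed when splits outnumber available mappers."""
    return (num_splits + concurrent_mappers - 1) // concurrent_mappers


one_gib = 1024 ** 3
splits = count_splits(one_gib)
print(splits)                   # 8 splits for a 1 GiB file
print(count_waves(splits, 3))   # 3 rounds with 3 mappers: 3 + 3 + 2
print(count_waves(splits, 8))   # 1 round when every split has a mapper
```

With only 3 mappers, 5 of the 8 splits wait in the queue after the first round starts, which is the queuing behaviour described in the last bullet.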

### Reduce Phase

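* Reducing is a function that aggregates the mapped data into the final output (one output file per reducer, so a single file when one reducer is used)
* Reduced data needs to be stored durably (for example in S3), because data held on the EMR cluster is lost when the cluster terminates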
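
To make the two phases concrete, here is a minimal word-count sketch in plain Python. It runs in a single process and only mimics the map, shuffle, and reduce stages; it is not how Hadoop itself is implemented, and all function names are made up for this example.

```python
from collections import defaultdict


def map_split(split):
    """Map phase: emit a (word, 1) pair for every word in one split."""
    for word in split.split():
        yield word.lower(), 1


def shuffle(pairs):
    """Group all emitted values by key, as happens between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_counts(key, values):
    """Reduce phase: aggregate all values for one key into a single result."""
    return key, sum(values)


def run_job(splits):
    """Run map over every split, then shuffle, then reduce each key."""
    pairs = [pair for split in splits for pair in map_split(split)]
    return dict(reduce_counts(key, values) for key, values in shuffle(pairs).items())


result = run_job(["the quick fox", "the lazy dog", "The end"])
print(result["the"])  # 3
```

On a real cluster the result would then be written somewhere durable such as S3, since the dictionary here, like data on the cluster, disappears when the process ends.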

