Nic Acton

Elastic MapReduce (EMR)

A managed service for big data processing on EC2 instances, based on the Hadoop framework

Essentials

  • Simplifies running big data frameworks (Hadoop, Spark) on AWS

  • Processing of data in large batches

  • A managed service for running Hadoop clusters on EC2

  • EMR is used to analyze and process vast amounts of data

Features and Advantages

  • S3 for storage

    • Load into HDFS or keep in S3 (EMRFS)

  • Transient clusters

    • Can also be persistent

    • Steps or tasks to be completed are defined up front, and the cluster terminates once they finish

    • Saves money, since you only pay while the cluster is running

  • Spot Instances

  • Bootstrapping

    • Customized configuration while launching

  • Preconfigured application frameworks

  • Scalability (add and remove core and task nodes)
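The features above can be sketched as a single cluster request. This is a minimal sketch, not a definitive setup: it only builds the keyword arguments that boto3's EMR `run_job_flow` call accepts, and the cluster name, bucket paths, instance types, release label, and script names are all hypothetical placeholders.

```python
def build_transient_cluster_request(input_uri: str, output_uri: str) -> dict:
    """Build keyword arguments shaped for boto3's EMR run_job_flow call."""
    return {
        "Name": "example-transient-cluster",  # hypothetical name
        "ReleaseLabel": "emr-5.30.0",  # assumed release; choose a current one
        # Preconfigured application frameworks
        "Applications": [{"Name": "Hadoop"}, {"Name": "Spark"}],
        "LogUri": "s3://example-bucket/logs/",  # hypothetical bucket
        "Instances": {
            "InstanceGroups": [
                {"Name": "master", "InstanceRole": "MASTER",
                 "Market": "ON_DEMAND", "InstanceType": "m5.xlarge",
                 "InstanceCount": 1},
                {"Name": "core", "InstanceRole": "CORE",
                 "Market": "ON_DEMAND", "InstanceType": "m5.xlarge",
                 "InstanceCount": 2},
                # Optional task nodes on Spot Instances
                {"Name": "task", "InstanceRole": "TASK",
                 "Market": "SPOT", "InstanceType": "m5.xlarge",
                 "InstanceCount": 2},
            ],
            # False makes the cluster transient: it terminates after its steps
            "KeepJobFlowAliveWhenNoSteps": False,
        },
        # Bootstrapping: customized configuration while launching
        "BootstrapActions": [{
            "Name": "install-dependencies",
            "ScriptBootstrapAction": {
                "Path": "s3://example-bucket/bootstrap.sh",  # hypothetical
                "Args": [],
            },
        }],
        # Steps are defined up front; input and output stay in S3
        "Steps": [{
            "Name": "spark-job",
            "ActionOnFailure": "TERMINATE_CLUSTER",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": ["spark-submit", "s3://example-bucket/job.py",
                         input_uri, output_uri],
            },
        }],
        "JobFlowRole": "EMR_EC2_DefaultRole",
        "ServiceRole": "EMR_DefaultRole",
    }


request = build_transient_cluster_request(
    "s3://example-bucket/input/", "s3://example-bucket/output/"
)
```

With AWS credentials configured, the dictionary could be passed along as `boto3.client("emr").run_job_flow(**request)`. Setting `KeepJobFlowAliveWhenNoSteps` to `True` instead would give a persistent cluster.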

Master Node

  • Node that manages the cluster by running software components which coordinate the distribution of data and tasks among other (slave) nodes for processing

  • The master node tracks the status of tasks and monitors the health of the cluster

  • You can SSH into the master node to run interactive jobs

Slave Nodes

Two Types

  • Core Node

    • A slave node that has software components which run tasks AND store data in the Hadoop Distributed File System (HDFS) on your cluster

    • The core nodes do the "heavy lifting" with the data

  • Task Node

    • A slave node that has software components which only run tasks

    • Task nodes are optional
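The difference between the two slave node types can be modelled in a few lines. This is an illustrative sketch only; `NodeRole` and `removal_risks_hdfs_data` are hypothetical names, not part of any EMR API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeRole:
    """Hypothetical model of what a slave node type does."""
    name: str
    runs_tasks: bool
    stores_hdfs: bool
    optional: bool


# Core nodes run tasks AND hold HDFS data; task nodes only run tasks.
CORE = NodeRole("core", runs_tasks=True, stores_hdfs=True, optional=False)
TASK = NodeRole("task", runs_tasks=True, stores_hdfs=False, optional=True)


def removal_risks_hdfs_data(role: NodeRole) -> bool:
    """Removing a node only puts HDFS data at risk if it stores HDFS blocks."""
    return role.stores_hdfs


for role in (CORE, TASK):
    print(role.name, "removal risks HDFS data:", removal_risks_hdfs_data(role))
```

Because task nodes hold no HDFS data, they are the natural candidates for scaling in and for Spot Instances.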

Mapping Job

Map Phase

  • Mapping is a function that defines the processing of data for each split

  • The default block size for HDFS is 128 MB, which is generally also the optimal split size.

  • The larger the instances used in the EMR cluster, the more chunks can be mapped and processed at the same time.

  • If there are more chunks than nodes/mappers, the chunks will queue for processing
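The split and queueing arithmetic above can be worked through with a small sketch. It is a simplified estimate that assumes one split per 128 MB block and a fixed number of mappers running at once; the function names and the example numbers are illustrative.

```python
HDFS_BLOCK_SIZE = 128 * 1024 * 1024  # 128 MB block size, in bytes


def count_splits(file_size_bytes: int, split_size: int = HDFS_BLOCK_SIZE) -> int:
    """Number of splits (chunks) a file is divided into, rounding up."""
    return (file_size_bytes + split_size - 1) // split_size


def count_waves(num_splits: int, concurrent_mappers: int) -> int:
    """Rounds of mapping needed when splits must queue for a free mapper."""
    return (num_splits + concurrent_mappers - 1) // concurrent_mappers


one_gib = 1024 * 1024 * 1024
splits = count_splits(one_gib)                     # 8 splits of 128 MB
waves = count_waves(splits, concurrent_mappers=3)  # 3 rounds: 3 + 3 + 2
print(splits, waves)
```

With only 3 mappers available, 5 of the 8 chunks wait in the queue after the first round starts. Larger or more numerous instances raise `concurrent_mappers` and reduce the number of rounds.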

Reduce Phase

  • Reducing is a function that aggregates the mapped data into the final output (a single output file when one reducer is used)

  • Reduced data needs to be stored outside the cluster (for example in S3), because data held on the EMR cluster is lost when the cluster terminates.
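The two phases can be illustrated with a word count run locally in plain Python. This is a sketch of the map, shuffle, and reduce idea only, not how Hadoop is invoked on EMR; the function names and input strings are made up for the example.

```python
from collections import defaultdict


def map_split(split: str) -> list:
    """Map phase: emit a (word, 1) pair for every word in one split."""
    return [(word.lower(), 1) for word in split.split()]


def shuffle(pairs: list) -> dict:
    """Group the emitted values by key so each key reaches one reducer."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_counts(grouped: dict) -> dict:
    """Reduce phase: aggregate the values for each key into one result."""
    return {key: sum(values) for key, values in grouped.items()}


def word_count(splits: list) -> dict:
    pairs = []
    for split in splits:  # on a cluster, each split is mapped in parallel
        pairs.extend(map_split(split))
    return reduce_counts(shuffle(pairs))


result = word_count(["big data on EMR", "big clusters process data"])
print(result)
```

On a real cluster the reduced result would then be written somewhere durable, such as an S3 path, before the cluster terminates.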


Last updated 6 years ago