Data Manipulation

Elastic MapReduce (EMR)

  • Fully managed, on-demand Hadoop framework
  • Use Cases
    • Log processing
    • Clickstream analysis
    • Genomics and life sciences
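
Launching an EMR cluster comes down to one request describing the cluster. A minimal sketch of the parameters for boto3's `emr` client `run_job_flow` call, built as a plain dict; the cluster name, bucket, release label, and instance types are illustrative assumptions, not prescribed values:

```python
# Sketch: parameters for boto3's emr.run_job_flow (hypothetical values).

def build_emr_request(name, log_bucket, workers=2):
    """Return a run_job_flow-style request for a managed Hadoop cluster."""
    return {
        "Name": name,
        "LogUri": f"s3://{log_bucket}/logs/",      # cluster logs land in S3
        "ReleaseLabel": "emr-6.15.0",              # assumed EMR release
        "Applications": [{"Name": "Hadoop"}],
        "Instances": {
            "MasterInstanceType": "m5.xlarge",     # assumed instance types
            "SlaveInstanceType": "m5.xlarge",
            "InstanceCount": 1 + workers,          # 1 primary + workers
            "KeepJobFlowAliveWhenNoSteps": False,  # terminate when idle
        },
        "JobFlowRole": "EMR_EC2_DefaultRole",
        "ServiceRole": "EMR_DefaultRole",
    }

# To actually launch (requires AWS credentials):
# import boto3
# boto3.client("emr").run_job_flow(**build_emr_request("log-processing", "my-bucket"))
```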

Types of Storage

Storage Type Description
HDFS
  • Hadoop Distributed File System (the standard FS that comes with Hadoop)
  • All data is replicated across multiple instances to ensure durability
  • Can use
    • EC2 instance storage → data is lost if the cluster is shut down
    • EBS
EMRFS
  • Implementation of HDFS that stores data on S3
  • Preserves data even if the cluster is shut down
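
The practical difference shows up in the paths jobs read and write. A small sketch of a hypothetical helper (`data_uri` is not an AWS API) building the same dataset path for each storage type:

```python
# Sketch: the same dataset addressed via HDFS vs EMRFS (illustrative helper).

def data_uri(storage, bucket_or_host, path):
    """Build a Hadoop-readable URI for either storage type."""
    if storage == "hdfs":
        return f"hdfs://{bucket_or_host}/{path}"  # lives on the cluster; lost at shutdown
    if storage == "emrfs":
        return f"s3://{bucket_or_host}/{path}"    # lives in S3; survives shutdown
    raise ValueError(f"unknown storage type: {storage}")
```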

Types of Clusters

Cluster Type Description
PERSISTENT
  • Runs 24x7 (when continuous analysis is run on data)
  • Better with HDFS
TRANSIENT
  • Stopped when not in use
  • Use EMRFS
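
The two cluster types map onto EMR's keep-alive flag. A sketch, assuming `cluster_flags` as a hypothetical helper (not an AWS API):

```python
# Sketch: cluster type → EMR keep-alive flag + recommended storage.

def cluster_flags(cluster_type):
    """Return launch settings implied by the cluster type."""
    if cluster_type == "persistent":
        # Runs 24x7; HDFS is fine because the cluster never shuts down
        return {"KeepJobFlowAliveWhenNoSteps": True, "storage": "HDFS"}
    if cluster_type == "transient":
        # Terminates when idle; EMRFS keeps the data in S3 across shutdowns
        return {"KeepJobFlowAliveWhenNoSteps": False, "storage": "EMRFS"}
    raise ValueError(f"unknown cluster type: {cluster_type}")
```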

Access Control

Control Description
Security Groups
  • 2 SGs are set up when launching job flows → neither allows access from external sources
  • 1 for the PRIMARY node: 1 port open for communication with the service, 1 port for SSH
  • 1 for SECONDARY nodes: only allows interaction with the primary
IAM
  • Can set permissions that allow users other than the default hadoop user to submit jobs
  • If an IAM user launches a cluster → the cluster is hidden from other IAM users
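
The two security groups above can be sketched as ingress rule sets in the shape boto3's ec2 `authorize_security_group_ingress` expects. The service port number and group IDs are illustrative assumptions:

```python
# Sketch of the two EMR security groups' ingress rules (assumed port values).

def primary_sg_rules(service_port=8443):
    """PRIMARY node SG: one port for service communication, one for SSH."""
    return [
        {"IpProtocol": "tcp", "FromPort": service_port, "ToPort": service_port},
        {"IpProtocol": "tcp", "FromPort": 22, "ToPort": 22},  # SSH
    ]

def secondary_sg_rules(primary_sg_id):
    """SECONDARY node SG: only traffic originating from the primary's SG."""
    return [{
        "IpProtocol": "-1",  # all protocols, but only from the primary SG
        "UserIdGroupPairs": [{"GroupId": primary_sg_id}],
    }]
```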

Data Pipeline

  • Processes and MOVEs data between AWS storage services, compute services, and on-prem sources at specified intervals
  • Use Cases
    • BATCH MODE ETL process
    • NOT for continuous data streams (use Kinesis)


Component Description
PIPELINE Schedules and runs tasks according to the pipeline definition & interacts with data stored in data nodes
DATA NODES Locations where the pipeline reads input data or writes output data (can be AWS or on-prem)
ACTIVITIES
  • Executed by the pipeline
  • Represent common scenarios (e.g., moving data from one location to another)
  • May require additional resources to run (e.g., EMR, EC2)
  • Support PRECONDITIONS: conditional statements that must be true before an activity can run
  • If an activity fails, it is retried automatically
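
The components above fit together in a pipeline definition. A minimal sketch in the object format used by `put_pipeline_definition`, wiring two S3 data nodes, a copy activity, a precondition, and a retry limit; all ids, paths, and the retry count are illustrative assumptions:

```python
# Sketch: a minimal Data Pipeline definition (data nodes + copy activity
# + precondition), as put_pipeline_definition-style objects.

def copy_pipeline_objects(src_s3, dst_s3):
    """Return pipeline objects that copy src_s3 → dst_s3 once the input exists."""
    return [
        {"id": "Input", "name": "Input",
         "fields": [{"key": "type", "stringValue": "S3DataNode"},
                    {"key": "directoryPath", "stringValue": src_s3}]},
        {"id": "Output", "name": "Output",
         "fields": [{"key": "type", "stringValue": "S3DataNode"},
                    {"key": "directoryPath", "stringValue": dst_s3}]},
        # PRECONDITION: must be true before the activity can run
        {"id": "InputReady", "name": "InputReady",
         "fields": [{"key": "type", "stringValue": "S3KeyExists"}]},
        {"id": "Copy", "name": "Copy",
         "fields": [{"key": "type", "stringValue": "CopyActivity"},
                    {"key": "input", "refValue": "Input"},
                    {"key": "output", "refValue": "Output"},
                    {"key": "precondition", "refValue": "InputReady"},
                    {"key": "maximumRetries", "stringValue": "3"}]},  # automatic retry
    ]
```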


Import/Export

  • Accelerates transferring large amounts of data IN and OUT of AWS using PHYSICAL STORAGE APPLIANCES (bypassing the Internet)
  • Data is copied to a device at the source (AWS or data center), shipped, and then copied to destination
  • Use Cases
    • Storage migration
    • Application migration


Option Description
SNOWBALL
  • Uses Amazon-provided shippable storage appliances
  • Each Snowball is protected by KMS & a physically rugged, tamper-resistant enclosure
  • Snowball Edge offers data processing at the edge before that data is returned to AWS
  • Sizes: 50TB, 80TB
  • Features:
    • Import/Export: on-premises ↔ S3
    • Encryption enforced
    • Manage jobs via its console
IMPORT/EXPORT DISK
  • Transfers data directly onto and off of storage devices you own, using the Amazon high-speed internal network
  • Features:
    • Import data in S3/Glacier/EBS
    • Export data from S3
    • Encryption is OPTIONAL
    • You buy and maintain your HW devices
    • Can't manage jobs
    • Upper limit of 16TB
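
The capacity limits above suggest a simple decision rule. A sketch, assuming `pick_transfer_option` as a hypothetical helper; the thresholds come straight from the notes (16 TB Disk cap, 50 TB and 80 TB Snowball sizes):

```python
# Sketch: choosing a transfer option from the per-device capacity limits.

def pick_transfer_option(data_tb):
    """Map dataset size (TB) to the smallest option that fits it."""
    if data_tb <= 16:
        return "Import/Export Disk"  # your own device, encryption optional
    if data_tb <= 50:
        return "Snowball 50TB"
    if data_tb <= 80:
        return "Snowball 80TB"
    return "multiple Snowballs"      # split the job across appliances
```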