
Data Manipulation

Kinesis

Streaming data platform consisting of 3 services

Use Cases

DATA INGESTION
  • Ensure data is accepted reliably & successfully stored in AWS
REALTIME PROCESSING of massive data streams
  • Act on knowledge gleaned from a big data stream right away
NOT for BATCH Jobs
  • Not appropriate for batch jobs (ETL)

Services

FIREHOSE
  • STORE: Load massive volumes of STREAMING data into AWS
  • Receives stream data & stores it in S3/Redshift/Elasticsearch
  • Just create a Delivery Stream & configure the destination for the data
STREAMS
  • PROCESS: Collect and process large streams of data records in realtime
  • Can create a Kinesis STREAMS App that processes data as it moves through the stream
  • Can scale by distributing incoming data across shards
  • Processing is executed on consumers, which read data from the shards & run the Kinesis Streams App (see the sketch after this list)
ANALYTICS
  • Analyze streaming data in realtime with SQL
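
A minimal boto3 sketch of the producer and consumer side, assuming a stream named "clickstream" and a delivery stream named "clickstream-to-s3" already exist (both names are hypothetical) and AWS credentials are configured:

    import json
    import boto3

    kinesis = boto3.client("kinesis")
    firehose = boto3.client("firehose")

    event = {"user": "u-123", "action": "page_view"}

    # STREAMS: write a record; the PartitionKey determines which shard receives it
    kinesis.put_record(
        StreamName="clickstream",
        Data=json.dumps(event).encode("utf-8"),
        PartitionKey=event["user"],
    )

    # FIREHOSE: hand the record to the delivery stream; Firehose buffers it and
    # loads it into the configured destination (e.g. S3) on its own
    firehose.put_record(
        DeliveryStreamName="clickstream-to-s3",
        Record={"Data": (json.dumps(event) + "\n").encode("utf-8")},
    )

    # STREAMS consumer side (simplified): read records back from one shard
    shard_id = kinesis.describe_stream(StreamName="clickstream")["StreamDescription"]["Shards"][0]["ShardId"]
    iterator = kinesis.get_shard_iterator(
        StreamName="clickstream", ShardId=shard_id, ShardIteratorType="TRIM_HORIZON"
    )["ShardIterator"]
    records = kinesis.get_records(ShardIterator=iterator)["Records"]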

Elastic MapReduce (EMR)

Description
  • Fully managed on-demand HADOOP framework
  • Use Cases
    • Log processing
    • Clickstream analysis
    • Genomics and life sciences
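
A hedged boto3 sketch of launching a cluster on demand (bucket, instance types, and release label are assumptions; EMR_DefaultRole / EMR_EC2_DefaultRole are the standard default roles but must already exist in the account):

    import boto3

    emr = boto3.client("emr")

    cluster = emr.run_job_flow(
        Name="log-processing",
        ReleaseLabel="emr-5.36.0",                 # assumption: any current release label
        Applications=[{"Name": "Hadoop"}],
        LogUri="s3://my-bucket/emr-logs/",         # hypothetical bucket
        ServiceRole="EMR_DefaultRole",
        JobFlowRole="EMR_EC2_DefaultRole",
        Instances={
            "MasterInstanceType": "m5.xlarge",
            "SlaveInstanceType": "m5.xlarge",
            "InstanceCount": 3,
            "KeepJobFlowAliveWhenNoSteps": True,   # keep the cluster up after steps finish
        },
    )
    print(cluster["JobFlowId"])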

Types of Storage

HDFS
  • Hadoop Distributed FS (the standard FS that comes with Hadoop)
  • All data is replicated across multiple instances to ensure durability
  • Can use:
    • EC2 Instance Storage → data is lost if the cluster is shut down
    • EBS
EMRFS
  • Implementation of HDFS that stores data on S3
  • Data is preserved if the cluster is shut down
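
Moving data between the two storage layers is itself just a step; a hedged boto3 sketch using S3DistCp (cluster ID, bucket, and paths are placeholders):

    import boto3

    emr = boto3.client("emr")

    # Copy results out of the cluster-local HDFS into S3 (EMRFS), e.g. so the
    # data survives after a transient cluster is shut down
    emr.add_job_flow_steps(
        JobFlowId="j-XXXXXXXXXXXXX",
        Steps=[{
            "Name": "hdfs-to-s3",
            "ActionOnFailure": "CONTINUE",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": ["s3-dist-cp",
                         "--src", "hdfs:///output/",
                         "--dest", "s3://my-bucket/output/"],
            },
        }],
    )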

Types of Clusters

Persistent
  • Runs 24x7 (when continuous analysis is run on the data)
  • Better suited to HDFS
Transient
  • Shut down when not in use (e.g., terminated once the job's steps finish)
  • Use EMRFS so the data survives the shutdown
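
Roughly, the difference shows up as a single flag in the cluster's Instances config (a hedged sketch; the cluster ID is a placeholder):

    import boto3

    emr = boto3.client("emr")

    # PERSISTENT: the cluster stays up after its steps finish and must be
    # terminated explicitly when it is no longer needed
    persistent = {
        "MasterInstanceType": "m5.xlarge",
        "SlaveInstanceType": "m5.xlarge",
        "InstanceCount": 3,
        "KeepJobFlowAliveWhenNoSteps": True,
    }

    # TRANSIENT: the cluster auto-terminates once all submitted steps have run
    transient = dict(persistent, KeepJobFlowAliveWhenNoSteps=False)

    # Shutting down a persistent cluster by hand
    emr.terminate_job_flows(JobFlowIds=["j-XXXXXXXXXXXXX"])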

Access Control

Security Groups
  • 2 SGs are set up when launching job flows → neither allows external access by default
  • 1 for the PRIMARY node: one port open for communication with the service, one port for SSH
  • 1 for the SECONDARY nodes: only allows interaction with the primary node
IAM
  • Can set permissions that allow users other than the default hadoop user to submit jobs
  • If an IAM user launches a cluster → by default the cluster is hidden from other IAM users
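
That default visibility can be flipped through the API; a hedged boto3 sketch (the cluster ID is a placeholder):

    import boto3

    emr = boto3.client("emr")

    # Make the cluster visible to (and manageable by) other IAM users in the
    # account instead of only the user who launched it
    emr.set_visible_to_all_users(
        JobFlowIds=["j-XXXXXXXXXXXXX"],
        VisibleToAllUsers=True,
    )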

Data Pipeline

Description
  • Process and MOVE data between different AWS storage services, compute services, and on-prem sources, at specified intervals
  • Use Cases
    • BATCH MODE ETL process
    • NOT for continuous data streams (use Kinesis)

Components

PIPELINE
  • Schedules and runs tasks according to the pipeline definition & interacts with data stored in data nodes
DATA NODES
  • Locations where the pipeline reads input data or writes output data (can be AWS or on-prem)
ACTIVITIES
  • Executed by the pipeline
  • Represent common scenarios (e.g., moving data from one location to another)
  • May require additional resources to run (e.g., EMR, EC2)
  • Support PRECONDITIONS: conditional statements that must be true before an activity can run
  • If an activity fails, it is retried automatically
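
A minimal boto3 sketch of the lifecycle (create → define → activate); the pipeline objects below are illustrative rather than a complete definition (a real one also needs roles and a resource for the activity to run on):

    import boto3

    dp = boto3.client("datapipeline")

    # 1. Create an empty pipeline
    pipeline_id = dp.create_pipeline(
        name="daily-s3-copy", uniqueId="daily-s3-copy-001"
    )["pipelineId"]

    # 2. Upload the definition: a schedule, two S3 data nodes, and a copy activity
    objects = [
        {"id": "Default", "name": "Default", "fields": [
            {"key": "scheduleType", "stringValue": "cron"},
            {"key": "schedule", "refValue": "DailySchedule"},
        ]},
        {"id": "DailySchedule", "name": "DailySchedule", "fields": [
            {"key": "type", "stringValue": "Schedule"},
            {"key": "period", "stringValue": "1 day"},
            {"key": "startAt", "stringValue": "FIRST_ACTIVATION_DATE_TIME"},
        ]},
        {"id": "InputNode", "name": "InputNode", "fields": [
            {"key": "type", "stringValue": "S3DataNode"},
            {"key": "directoryPath", "stringValue": "s3://my-bucket/input/"},
        ]},
        {"id": "OutputNode", "name": "OutputNode", "fields": [
            {"key": "type", "stringValue": "S3DataNode"},
            {"key": "directoryPath", "stringValue": "s3://my-bucket/output/"},
        ]},
        {"id": "CopyData", "name": "CopyData", "fields": [
            {"key": "type", "stringValue": "CopyActivity"},
            {"key": "input", "refValue": "InputNode"},
            {"key": "output", "refValue": "OutputNode"},
        ]},
    ]
    dp.put_pipeline_definition(pipelineId=pipeline_id, pipelineObjects=objects)

    # 3. Activate: the activity now runs on the defined schedule, with automatic retries
    dp.activate_pipeline(pipelineId=pipeline_id)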

Import/Export

Description
  • Accelerates transferring large amounts of data IN and OUT of AWS using PHYSICAL STORAGE APPLIANCES (bypassing the Internet)
  • Data is copied to a device at the source (AWS or your data center), shipped, and then copied to the destination
  • Use Cases
    • Storage migration
    • Application migration

Options

SNOWBALL
  • Uses Amazon-provided shippable storage appliances
  • Each Snowball is protected by KMS & physically rugged
  • Snowball Edge offers data processing at the edge before that data is returned to AWS
  • Sizes: 50TB, 80TB
  • Features:
    • Import/Export: on premise ↔ S3
    • Encryption enforced
    • Manage jobs via the AWS Snowball console (jobs can also be created via the API; see the sketch at the end of this section)
IMPORT/EXPORT DISK
  • Transfers data directly onto and off of storage devices you own using Amazon's high-speed internal network
  • Features:
    • Import data into S3/Glacier/EBS
    • Export data from S3
    • Encryption is OPTIONAL
    • You buy and maintain your HW devices
    • Can't manage jobs
    • Upper limit of 16TB per device
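
Snowball jobs can also be created from code rather than the console; a hedged boto3 sketch of an import job (every ARN/ID below is a placeholder, and the shipping address would first be registered with create_address):

    import boto3

    snowball = boto3.client("snowball")

    job = snowball.create_job(
        JobType="IMPORT",                      # ship on-prem data into the S3 bucket below
        Resources={"S3Resources": [{"BucketArn": "arn:aws:s3:::my-bucket"}]},
        AddressId="ADID-00000000-0000-0000-0000-000000000000",
        RoleARN="arn:aws:iam::123456789012:role/snowball-import-role",
        KmsKeyARN="arn:aws:kms:us-east-1:123456789012:key/00000000-0000-0000-0000-000000000000",
        SnowballCapacityPreference="T80",      # the 80TB appliance
        ShippingOption="SECOND_DAY",
        SnowballType="STANDARD",
    )
    print(job["JobId"])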