Data Manipulation
Elastic MapReduce (EMR)
- Description
  - Fully managed, on-demand Hadoop framework
- Use Cases
  - Log processing
  - Clickstream analysis
  - Genomics and life sciences
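To make the log processing use case concrete, the following is a minimal pure-Python sketch of the map, shuffle, and reduce phases that a Hadoop job runs. It is not EMR or Hadoop API code: the function names, the `SAMPLE_LOGS` data, and the simplified "METHOD PATH STATUS" log format are all assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical sample log lines in a simplified "METHOD PATH STATUS" form.
SAMPLE_LOGS = [
    "GET /index.html 200",
    "GET /missing 404",
    "POST /login 200",
    "GET /index.html 200",
    "GET /admin 403",
]


def map_phase(lines):
    """Emit a (status_code, 1) pair for each well-formed log line."""
    for line in lines:
        parts = line.split()
        if len(parts) == 3:  # skip malformed lines
            yield parts[2], 1


def shuffle_phase(pairs):
    """Group emitted values by key, as happens between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(grouped)


def reduce_phase(grouped):
    """Sum the values collected for each key."""
    return {key: sum(values) for key, values in grouped.items()}


def count_status_codes(lines):
    """Run the three phases in order on an iterable of log lines."""
    return reduce_phase(shuffle_phase(map_phase(lines)))


if __name__ == "__main__":
    print(count_status_codes(SAMPLE_LOGS))
```

On a real cluster the map and reduce phases would run in parallel across many nodes; here they run in a single process only to show how the data flows.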
Types of Storage
Storage Type | Description |
---|---|
HDFS | |
EMRFS | |
Types of Clusters
Cluster Type | Description |
---|---|
Persistent | |
Transient | |
Access Control
Control | Description |
---|---|
Security Groups | |
IAM | |
Data Pipeline
- Description
  - Processes and MOVES data between AWS storage services, other AWS services, and on-premises sources at specified intervals
- Use Cases
  - BATCH MODE ETL processes
  - NOT for continuous data streams (use Kinesis)
Components
Component | Description |
---|---|
PIPELINE | Schedules and runs tasks according to the pipeline definition & interacts with data stored in data nodes |
DATA NODES | Locations where the pipeline reads input data or writes output data (can be AWS or on-prem) |
ACTIVITIES | |
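The way these components reference each other can be sketched as a small pipeline definition built in Python. This is an illustrative sketch, not a complete or validated definition: the object types (`Schedule`, `S3DataNode`, `CopyActivity`) and field names follow the Data Pipeline definition file format as best recalled and should be checked against the official syntax reference. The object ids, the S3 paths, and the helper functions are hypothetical, and a real `CopyActivity` would also need a compute resource to run on.

```python
import json


def build_pipeline_definition(source_path, dest_path, period="1 day"):
    """Build a dict shaped like a pipeline definition (ids are hypothetical)."""
    schedule_ref = {"ref": "DailySchedule"}
    return {
        "objects": [
            # Schedule: when the pipeline runs.
            {"id": "DailySchedule", "type": "Schedule", "period": period,
             "startAt": "FIRST_ACTIVATION_DATE_TIME"},
            # Data nodes: where input is read and output is written.
            {"id": "InputNode", "type": "S3DataNode",
             "directoryPath": source_path, "schedule": schedule_ref},
            {"id": "OutputNode", "type": "S3DataNode",
             "directoryPath": dest_path, "schedule": schedule_ref},
            # Activity: the work performed between the data nodes.
            {"id": "CopyData", "type": "CopyActivity",
             "input": {"ref": "InputNode"}, "output": {"ref": "OutputNode"},
             "schedule": schedule_ref},
        ]
    }


def unresolved_refs(definition):
    """Return the set of referenced ids that no object in the definition declares."""
    ids = {obj["id"] for obj in definition["objects"]}
    missing = set()
    for obj in definition["objects"]:
        for value in obj.values():
            if isinstance(value, dict) and "ref" in value and value["ref"] not in ids:
                missing.add(value["ref"])
    return missing


if __name__ == "__main__":
    # Hypothetical bucket names, for illustration only.
    definition = build_pipeline_definition(
        "s3://example-source-bucket/logs", "s3://example-dest-bucket/logs"
    )
    print(json.dumps(definition, indent=2))
    print("unresolved refs:", unresolved_refs(definition))
```

Checking references locally before submitting a definition is a design convenience here, since a dangling `ref` is an easy mistake to make when wiring data nodes to activities.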
Import/Export
- Description
  - Accelerates transferring large amounts of data IN and OUT of AWS using PHYSICAL STORAGE APPLIANCES (bypassing the Internet)
  - Data is copied to a device at the source (AWS or data center), shipped, and then copied to the destination
- Use Cases
  - Storage migration
  - Application migration
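A rough calculation shows why shipping a device can beat the network for large transfers. The sketch below estimates network transfer time from data size and link speed. The 80% link utilization default, the decimal terabyte convention, and the shipping turnaround passed in by the caller are all assumptions for illustration, not AWS figures.

```python
SECONDS_PER_DAY = 86_400


def network_transfer_days(data_tb, link_mbps, utilization=0.8):
    """Estimate days needed to send data_tb terabytes over a link_mbps link.

    Assumes decimal units (1 TB = 10**12 bytes) and that only a fraction
    `utilization` of the link is available for the transfer.
    """
    total_bits = data_tb * 10**12 * 8
    effective_bits_per_second = link_mbps * 10**6 * utilization
    return total_bits / effective_bits_per_second / SECONDS_PER_DAY


def shipping_is_faster(data_tb, link_mbps, shipping_days, utilization=0.8):
    """Compare the network estimate with a caller-supplied shipping turnaround."""
    return shipping_days < network_transfer_days(data_tb, link_mbps, utilization)


if __name__ == "__main__":
    days = network_transfer_days(100, 100)
    print(f"100 TB over 100 Mbps: about {days:.1f} days")
    # The 10-day turnaround is a hypothetical figure chosen for the example.
    print("shipping faster:", shipping_is_faster(100, 100, shipping_days=10))
```

Under these assumptions, 100 TB over a 100 Mbps link takes roughly 116 days, so even a shipping turnaround of a week or two comes out well ahead.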
Options
Option | Description |
---|---|
SNOWBALL | |
IMPORT/EXPORT DISK | |