Data Manipulation
Elastic MapReduce (EMR)
- Description
  - Fully managed, on-demand HADOOP framework
- Use Cases
  - Log processing
  - Clickstream analysis
  - Genomics and life sciences
Types of Storage
| Storage Type | Description |
|---|---|
| HDFS | |
| EMRFS | |
Types of Clusters
| Cluster Type | Description |
|---|---|
| Persistent | |
| Transient | |
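As a rough illustration of how the two cluster types differ in practice, the sketch below builds the keyword arguments one might pass to boto3's `run_job_flow`. It assumes the distinction maps onto the `KeepJobFlowAliveWhenNoSteps` flag: a transient cluster shuts down once its steps finish, while a persistent one keeps running. The helper name, cluster names, instance types, release label, and role names are placeholders rather than values from these notes, and nothing is sent to AWS.

```python
def emr_cluster_request(name, persistent, steps=None):
    """Build kwargs for boto3's EMR run_job_flow call (hypothetical helper)."""
    return {
        "Name": name,
        "ReleaseLabel": "emr-6.15.0",  # assumed release label; use one available to you
        "Instances": {
            "MasterInstanceType": "m5.xlarge",  # placeholder instance types
            "SlaveInstanceType": "m5.xlarge",
            "InstanceCount": 3,
            # False -> cluster terminates when its steps are done (transient)
            # True  -> cluster stays up waiting for more work (persistent)
            "KeepJobFlowAliveWhenNoSteps": persistent,
        },
        "Steps": list(steps or []),
        "JobFlowRole": "EMR_EC2_DefaultRole",  # assumed default role names
        "ServiceRole": "EMR_DefaultRole",
    }


transient = emr_cluster_request("nightly-log-job", persistent=False)
persistent = emr_cluster_request("shared-analytics", persistent=True)
```

With credentials configured, a request like this could be submitted with `boto3.client("emr").run_job_flow(**transient)`; building the dictionary separately keeps the example inspectable without an AWS account.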
Access Control
| Control | Description |
|---|---|
| Security Groups | |
| IAM | |
Data Pipeline
- Description
  - Processes and MOVES data between AWS storage services, AWS compute services, and on-prem sources at specified intervals
- Use Cases
  - BATCH MODE ETL processes
  - NOT for continuous data streams (use Kinesis)
Components
| Component | Description |
|---|---|
| PIPELINE | Schedules and runs tasks according to the pipeline definition and interacts with data stored in data nodes |
| DATA NODES | Locations where the pipeline reads input data or writes output data (can be AWS or on-prem) |
| ACTIVITIES | |
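To make the relationship between the three components concrete, here is a minimal sketch that assembles a pipeline definition as a Python dictionary and serializes it to JSON. It assumes the documented definition-file layout (a top-level `objects` list whose entries refer to each other with `{"ref": ...}`) and uses the `Schedule`, `S3DataNode`, `Ec2Resource`, and `CopyActivity` object types. The function name, object ids, and S3 paths are hypothetical, and the field choices should be checked against the Data Pipeline documentation before use.

```python
import json


def build_pipeline_definition(input_path, output_path, period="1 day"):
    """Return a Data Pipeline definition dict (illustrative sketch only)."""
    schedule_ref = {"ref": "DailySchedule"}
    objects = [
        # When the pipeline runs (batch intervals, not a continuous stream).
        {"id": "DailySchedule", "name": "DailySchedule", "type": "Schedule",
         "period": period, "startAt": "FIRST_ACTIVATION_DATE_TIME"},
        # Data nodes: where input is read from and output is written to.
        {"id": "InputData", "name": "InputData", "type": "S3DataNode",
         "schedule": schedule_ref, "filePath": input_path},
        {"id": "OutputData", "name": "OutputData", "type": "S3DataNode",
         "schedule": schedule_ref, "filePath": output_path},
        # Compute resource the activity runs on.
        {"id": "Worker", "name": "Worker", "type": "Ec2Resource",
         "schedule": schedule_ref, "terminateAfter": "1 Hour"},
        # Activity: the work performed between the two data nodes.
        {"id": "CopyData", "name": "CopyData", "type": "CopyActivity",
         "schedule": schedule_ref, "runsOn": {"ref": "Worker"},
         "input": {"ref": "InputData"}, "output": {"ref": "OutputData"}},
    ]
    return {"objects": objects}


definition = build_pipeline_definition(
    "s3://example-bucket/in/data.csv",   # hypothetical paths
    "s3://example-bucket/out/data.csv",
)
definition_json = json.dumps(definition, indent=2)
```

The activity is the only object that references both data nodes, which mirrors the table above: data nodes describe locations, and activities describe what happens between them on the pipeline's schedule.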
Import/Export
- Description
  - Accelerates transferring large amounts of data IN and OUT of AWS using PHYSICAL STORAGE APPLIANCES (bypassing the Internet)
  - Data is copied to a device at the source (AWS or data center), shipped, and then copied to the destination
- Use Cases
  - Storage migration
  - Application migration
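A quick back-of-the-envelope calculation shows why shipping a device can beat the network for large migrations. The sketch below estimates how long an upload would take over a given link; the function name, the 80% sustained-utilization figure, and the example sizes are assumptions chosen for illustration, and terabytes are treated as decimal (10^12 bytes).

```python
def network_transfer_days(data_tb, link_mbps, utilization=0.8):
    """Estimate days needed to move data_tb terabytes over a link_mbps link."""
    bits = data_tb * 1e12 * 8                       # decimal TB -> bits
    effective_bps = link_mbps * 1e6 * utilization   # usable bits per second
    return bits / effective_bps / 86400             # seconds -> days


# 100 TB over a 1 Gbps link at an assumed 80% sustained utilization
days = network_transfer_days(100, 1000)
print(f"{days:.1f} days")  # prints 11.6 days
```

If the estimate comes out at a week or more, copying to a physical appliance and shipping it is worth comparing against, keeping in mind that this figure ignores the time spent loading and transporting the device.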
Options
| Option | Description |
|---|---|
| SNOWBALL | |
| IMPORT/EXPORT DISK | |