Data Manipulation
Kinesis
Streaming data platform consisting of 3 services
Use Cases
Use Case | Description |
DATA INGESTION | Ensure data is accepted reliably & successfully stored in AWS |
REALTIME PROCESSING of massive data streams | Act on knowledge gleaned from a big data stream right away |
NOT for BATCH Jobs | Not appropriate for batch jobs (ETL) |
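A minimal boto3 sketch of the ingestion use case, assuming a Firehose Delivery Stream named `my-delivery-stream` already exists (Firehose is detailed in the Services table below; the stream name and record payload here are placeholders):

```python
import json
import boto3

firehose = boto3.client("firehose")

# Firehose buffers incoming records and persists them to the
# configured destination (e.g. S3) on our behalf.
event = {"user_id": 42, "event": "click", "ts": "2017-01-01T00:00:00Z"}

firehose.put_record(
    DeliveryStreamName="my-delivery-stream",   # placeholder name
    Record={"Data": (json.dumps(event) + "\n").encode("utf-8")},
)
```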
Services
Service | Description |
FIREHOSE | - STORE: Load massive volumes of STREAMING data into AWS
- Receives stream data & stores it in S3/Redshift/Elasticsearch
- Just create a Delivery Stream & configure destination for data
 |
STREAMS | - PROCESS: Collect and process large streams of data records in realtime
- Can create a Kinesis STREAMS App that processes data as it moves through the stream
- Can scale by distributing incoming data across shards
- Processing executed on consumers which read data from the shards & run the Kinesis Streams App
|
ANALYTICS | - Analyze streaming data in realtime with SQL
|
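A sketch of the Streams side under similar assumptions (a stream named `my-stream` with a single shard; all names are placeholders). A production consumer would normally be a Kinesis Streams application with checkpointing across shards; this only shows the raw producer/consumer calls:

```python
import json
import time
import boto3

kinesis = boto3.client("kinesis")
STREAM = "my-stream"   # placeholder stream name

# Producer: the partition key decides which shard receives the record,
# so adding shards spreads incoming data across more consumers.
kinesis.put_record(
    StreamName=STREAM,
    Data=json.dumps({"event": "click"}).encode("utf-8"),
    PartitionKey="user-42",
)

# Consumer: read one shard from the oldest available record.
shard_id = kinesis.describe_stream(StreamName=STREAM)["StreamDescription"]["Shards"][0]["ShardId"]
iterator = kinesis.get_shard_iterator(
    StreamName=STREAM, ShardId=shard_id, ShardIteratorType="TRIM_HORIZON"
)["ShardIterator"]

while True:
    batch = kinesis.get_records(ShardIterator=iterator, Limit=100)
    for record in batch["Records"]:
        print(json.loads(record["Data"]))   # process each record as it arrives
    iterator = batch["NextShardIterator"]
    time.sleep(1)                           # stay under per-shard read limits
```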
Elastic MapReduce (EMR)
- Description
- Fully managed on-demand HADOOP framework
- Use Cases
- Log processing
- Clickstream analysis
- Genomics and life sciences
Types of Storage
Storage Type | Description |
HDFS | - Hadoop Distributed FS (standard FS that comes with Hadoop)
- All data replicated across multiple instances to ensure durability
- Can use
- EC2 Instance Storage → data is lost if the cluster is shut down
- EBS
|
EMRFS | - Implementation of HDFS that stores data on S3
- Preserves data even if the cluster is shut down
|
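The storage choice mostly shows up in the URIs a job reads and writes. A hedged sketch of adding a step to an existing cluster (the cluster id, bucket, and streaming job are placeholders): input comes from S3 via EMRFS, intermediate output goes to HDFS and disappears with the cluster:

```python
import boto3

emr = boto3.client("emr")

emr.add_job_flow_steps(
    JobFlowId="j-XXXXXXXXXXXXX",   # placeholder cluster id
    Steps=[{
        "Name": "wordcount",
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": [
                "hadoop-streaming",
                "-input",  "s3://my-bucket/input/",    # EMRFS: survives the cluster
                "-output", "hdfs:///tmp/wordcount/",   # HDFS: gone at termination
                "-mapper", "cat",
                "-reducer", "wc",
            ],
        },
    }],
)
```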
Types of Clusters
Cluster Type | Description |
Persistent | - Runs 24x7 (when continuous analysis is run on data)
- Better with HDFS
|
Transient | - Stopped when not in use
- Use EMRFS
|
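A hedged sketch of launching a transient cluster with boto3 (release label, instance types, bucket, and roles are placeholders). `KeepJobFlowAliveWhenNoSteps=False` terminates the cluster once its steps finish, which pairs naturally with EMRFS and an S3 log URI; set it to `True` for a persistent cluster:

```python
import boto3

emr = boto3.client("emr")

response = emr.run_job_flow(
    Name="nightly-etl",                        # placeholder name
    ReleaseLabel="emr-5.12.0",                 # placeholder release
    Applications=[{"Name": "Hadoop"}],
    LogUri="s3://my-bucket/emr-logs/",         # logs persist in S3 after shutdown
    Instances={
        "MasterInstanceType": "m4.large",
        "SlaveInstanceType": "m4.large",
        "InstanceCount": 3,
        "KeepJobFlowAliveWhenNoSteps": False,  # transient: shut down when idle
        "TerminationProtected": False,
    },
    JobFlowRole="EMR_EC2_DefaultRole",         # default EMR roles assumed to exist
    ServiceRole="EMR_DefaultRole",
)
print(response["JobFlowId"])
```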
Access Control
Control | Description |
Security Groups | - 2 SGs are set up when launching job flows → neither allows external access by default
- 1 for the PRIMARY node: 1 port open for communication with the service, 1 port for SSH
- 1 for the SECONDARY nodes: only allows interaction with the primary node
|
IAM | - Can set permissions that allow users other than the default hadoop user to submit jobs
- If an IAM user launches a cluster → the cluster is hidden from other IAM users (see the sketch after this table)
|
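The IAM visibility default can be flipped after launch. A small sketch, assuming the placeholder cluster id below and that your boto3 version exposes `set_visible_to_all_users`:

```python
import boto3

emr = boto3.client("emr")

# A cluster launched by one IAM user is hidden from others by default;
# this makes it visible to all IAM users in the account.
emr.set_visible_to_all_users(
    JobFlowIds=["j-XXXXXXXXXXXXX"],   # placeholder cluster id
    VisibleToAllUsers=True,
)
```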
Data Pipeline
- Description
- Process and MOVE data between different AWS storage services, other AWS services, and on-prem sources, at specified intervals
- Use Cases
- BATCH MODE ETL process
- NOT for continuous data streams (use Kinesis)
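A minimal boto3 sketch of creating and activating a pipeline (name and unique id are placeholders); the definition that actually does the work is sketched after the Components table below:

```python
import boto3

datapipeline = boto3.client("datapipeline")

# uniqueId is an idempotency token: retrying the call with the same
# value will not create a second pipeline.
pipeline_id = datapipeline.create_pipeline(
    name="nightly-copy",           # placeholder name
    uniqueId="nightly-copy-001",   # placeholder idempotency token
)["pipelineId"]

# Once a definition has been attached (see the Components sketch below),
# activation starts scheduled execution.
datapipeline.activate_pipeline(pipelineId=pipeline_id)
```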
Components
Component | Description |
PIPELINE | Schedules and runs tasks according to the pipeline definition & interacts with data stored in data nodes |
DATA NODES | Locations where the pipeline reads input data or writes output data (can be AWS or on-prem) |
ACTIVITIES | - Executed by the pipeline
- Represent common scenarios (e.g., moving data from one location to another)
- May require additional resources to run (e.g., EMR, EC2)
- Supports PRECONDITIONS: conditional statements that must be true before an activity can run
- If an activity fails, it is retried automatically (see the sketch after this table)
|
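A hedged sketch of a pipeline definition wiring these components together (bucket paths, schedule, roles, and object names are placeholders): two S3 data nodes, an EC2 resource for the activity to run on, and a daily CopyActivity.

```python
import boto3

datapipeline = boto3.client("datapipeline")

def obj(obj_id, name, **fields):
    """Build a pipeline object in the API's {key, stringValue|refValue} shape."""
    out = {"id": obj_id, "name": name, "fields": []}
    for key, value in fields.items():
        if key.startswith("ref_"):
            out["fields"].append({"key": key[len("ref_"):], "refValue": value})
        else:
            out["fields"].append({"key": key, "stringValue": value})
    return out

definition = [
    # Default object: settings inherited by every other object in the pipeline.
    obj("Default", "Default",
        scheduleType="cron", ref_schedule="DailySchedule",
        failureAndRerunMode="CASCADE",
        role="DataPipelineDefaultRole", resourceRole="DataPipelineDefaultResourceRole",
        pipelineLogUri="s3://my-bucket/pipeline-logs/"),
    obj("DailySchedule", "DailySchedule", type="Schedule",
        period="1 day", startDateTime="2017-01-01T00:00:00"),
    # Data nodes: where the pipeline reads input and writes output.
    obj("InputNode", "InputNode", type="S3DataNode",
        directoryPath="s3://my-bucket/raw/"),
    obj("OutputNode", "OutputNode", type="S3DataNode",
        directoryPath="s3://my-bucket/processed/"),
    # Additional resource the activity runs on.
    obj("CopyRunner", "CopyRunner", type="Ec2Resource",
        instanceType="m4.large", terminateAfter="1 hour"),
    # Activity: the work itself, wired to its input, output and resource.
    obj("CopyRaw", "CopyRaw", type="CopyActivity",
        ref_input="InputNode", ref_output="OutputNode", ref_runsOn="CopyRunner"),
]

datapipeline.put_pipeline_definition(
    pipelineId="df-XXXXXXXXXXXXX",   # placeholder: use the id returned by create_pipeline
    pipelineObjects=definition,
)
```

Preconditions would be added as further objects of a precondition type (e.g. an S3 key-exists check) referenced from the activity, under the same field conventions.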
Import/Export
- Description
- Accelerates transferring large amounts of data IN and OUT of AWS using PHYSICAL STORAGE APPLIANCES (bypassing the Internet)
- Data is copied to a device at the source (AWS or your data center), shipped, and then copied to the destination
- Use Cases
- Storage migration
- Application migration
Options
Option | Description |
SNOWBALL | - Uses Amazon-provided shippable storage appliances
- Each Snowball is protected by KMS encryption & is physically rugged
- Snowball Edge offers data processing at the edge before the data is returned to AWS
- Sizes: 50TB, 80TB
- Features:
- Import/Export: on-premises ↔ S3
- Encryption enforced
- Manage jobs via its console
|
IMPORT/EXPORT DISK | - Transfers data directly onto and off of storage devices you own using Amazon's high-speed internal network
- Features:
- Import data into S3/Glacier/EBS
- Export data from S3
- Encryption is OPTIONAL
- You buy and maintain your HW devices
- Can't manage jobs
- Upper limit of 16TB
|
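A hedged sketch of creating a Snowball import job with boto3 (the address id, role ARN, and bucket are placeholders that would be set up beforehand via `create_address` and IAM). The rest of the workflow is physical: AWS ships the appliance, you copy data onto it with the Snowball client, ship it back, and the data is loaded into the bucket:

```python
import boto3

snowball = boto3.client("snowball")

job = snowball.create_job(
    JobType="IMPORT",                    # on-premises -> S3
    SnowballCapacityPreference="T80",    # 80TB appliance (T50 also available)
    Resources={"S3Resources": [
        {"BucketArn": "arn:aws:s3:::my-import-bucket"}   # placeholder destination bucket
    ]},
    AddressId="ADID-placeholder",        # shipping address created via create_address
    RoleARN="arn:aws:iam::123456789012:role/snowball-import-role",  # placeholder role
    Description="Datacenter migration batch 1",
)
print(job["JobId"])
```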