Data Manipulation
Kinesis
Streaming data platform consisting of 3 services: Firehose, Streams, and Analytics
Use Cases
| Use Case | Description |
| --- | --- |
| DATA INGESTION | Ensure data is accepted reliably & successfully stored in AWS |
| REALTIME PROCESSING of massive data streams | Act on knowledge gleaned from a big data stream right away |
| NOT for BATCH jobs | Not appropriate for batch jobs (ETL) |
Services
| Service | Description |
| --- | --- |
| FIREHOSE | STORE: load massive volumes of STREAMING data into AWS<br>Receives stream data & stores it in S3/Redshift/Elasticsearch<br>Just create a Delivery Stream & configure the destination for the data |
| STREAMS | PROCESS: collect and process large streams of data records in real time<br>Can create a Kinesis Streams application that processes data as it moves through the stream<br>Scales by distributing incoming data across shards<br>Processing is executed on consumers, which read data from the shards & run the Kinesis Streams application |
| ANALYTICS | Analyze streaming data in real time with SQL |
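
A minimal boto3 sketch of the Firehose vs. Streams split (store vs. process). The stream names, region, shard ID, and payload are illustrative assumptions, and both streams are assumed to already exist.

```python
import json
import boto3

firehose = boto3.client("firehose", region_name="us-east-1")
kinesis = boto3.client("kinesis", region_name="us-east-1")

event = {"user_id": 42, "action": "click", "ts": "2024-01-01T00:00:00Z"}

# FIREHOSE: deliver a record to an existing Delivery Stream; Firehose buffers it
# and loads it into the configured destination (e.g. S3) for you.
firehose.put_record(
    DeliveryStreamName="example-delivery-stream",        # assumed to exist
    Record={"Data": (json.dumps(event) + "\n").encode()},
)

# STREAMS: put a record onto a stream; the partition key determines which shard
# receives it, which is how throughput is spread across shards.
kinesis.put_record(
    StreamName="example-stream",                          # assumed to exist
    Data=json.dumps(event).encode(),
    PartitionKey=str(event["user_id"]),
)

# A consumer (the "Kinesis Streams app") reads from a shard via a shard iterator.
shard_iter = kinesis.get_shard_iterator(
    StreamName="example-stream",
    ShardId="shardId-000000000000",
    ShardIteratorType="TRIM_HORIZON",
)["ShardIterator"]
records = kinesis.get_records(ShardIterator=shard_iter, Limit=100)["Records"]
```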
Elastic MapReduce (EMR)
- Description
  - Fully managed, on-demand HADOOP framework
- Use Cases
  - Log processing
  - Clickstream analysis
  - Genomics and life sciences
Types of Storage
| Storage Type | Description |
| --- | --- |
| HDFS | Hadoop Distributed File System (the standard FS that comes with Hadoop)<br>All data is replicated across multiple instances to ensure durability<br>Can use EC2 instance storage (data is lost if the cluster is shut down) or EBS |
| EMRFS | Implementation of HDFS that stores data on S3<br>Preserves data even if the cluster is shut down |
Types of Clusters
| Cluster Type | Description |
| --- | --- |
| Persistent | Runs 24x7 (when continuous analysis is run on the data)<br>Better suited to HDFS |
| Transient | Stopped when not in use<br>Use EMRFS so data survives shutdown |
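
A minimal boto3 sketch of launching a TRANSIENT cluster that keeps its script, outputs, and logs on S3 (EMRFS), so nothing is lost at shutdown. The bucket names, script path, release label, and instance sizes are assumptions; the default EMR roles are assumed to exist.

```python
import boto3

emr = boto3.client("emr", region_name="us-east-1")

response = emr.run_job_flow(
    Name="transient-log-processing",
    ReleaseLabel="emr-6.15.0",                  # assumed release label
    Applications=[{"Name": "Hadoop"}, {"Name": "Spark"}],
    LogUri="s3://example-bucket/emr-logs/",     # logs land on S3, surviving shutdown
    Instances={
        "MasterInstanceType": "m5.xlarge",
        "SlaveInstanceType": "m5.xlarge",
        "InstanceCount": 3,
        "KeepJobFlowAliveWhenNoSteps": False,   # transient: terminate when the steps finish
    },
    Steps=[{
        "Name": "process-clickstream",
        "ActionOnFailure": "TERMINATE_CLUSTER",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["spark-submit", "s3://example-bucket/jobs/process.py"],
        },
    }],
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
    VisibleToAllUsers=False,                    # hide the cluster from other IAM users
)
print(response["JobFlowId"])
```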
Access Control
| Control | Description |
| --- | --- |
| Security Groups | Two SGs are set up when launching job flows; neither allows access from external sources<br>One for the PRIMARY node: one port open for communication with the service, one port for SSH<br>One for the SECONDARY nodes: only allows interaction with the primary node |
| IAM | Can set permissions that allow users other than the default hadoop user to submit jobs<br>If an IAM user launches a cluster, the cluster is hidden from other IAM users |
Data Pipeline
- Description
  - Processes and MOVES data between different AWS storage services, compute services, and on-prem sources, at specified intervals
- Use Cases
  - BATCH MODE ETL processes
  - NOT for continuous data streams (use Kinesis instead)
Components
| Component | Description |
| --- | --- |
| PIPELINE | Schedules and runs tasks according to the pipeline definition & interacts with data stored in data nodes |
| DATA NODES | Locations where the pipeline reads input data or writes output data (can be AWS or on-prem) |
| ACTIVITIES | Executed by the pipeline<br>Represent common scenarios (e.g., moving data from one location to another)<br>May require additional resources to run (e.g., EMR, EC2)<br>Support PRECONDITIONS: conditional statements that must be true before an activity can run<br>If an activity fails, it is retried automatically |
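
A rough boto3 sketch of defining and activating a daily batch pipeline that copies between two S3 data nodes via a CopyActivity running on an EC2 resource. All IDs, paths, roles, and the schedule are assumptions, and a production definition usually carries more fields.

```python
import boto3

dp = boto3.client("datapipeline", region_name="us-east-1")

# PIPELINE: the container that schedules and runs everything
pipeline_id = dp.create_pipeline(name="daily-s3-copy", uniqueId="daily-s3-copy-001")["pipelineId"]

objects = [
    {"id": "Default", "name": "Default", "fields": [
        {"key": "scheduleType", "stringValue": "cron"},
        {"key": "schedule", "refValue": "DailySchedule"},
        {"key": "role", "stringValue": "DataPipelineDefaultRole"},           # assumed roles
        {"key": "resourceRole", "stringValue": "DataPipelineDefaultResourceRole"},
    ]},
    {"id": "DailySchedule", "name": "DailySchedule", "fields": [
        {"key": "type", "stringValue": "Schedule"},
        {"key": "period", "stringValue": "1 day"},
        {"key": "startDateTime", "stringValue": "2024-01-01T00:00:00"},
    ]},
    # DATA NODES: where the pipeline reads input and writes output
    {"id": "InputNode", "name": "InputNode", "fields": [
        {"key": "type", "stringValue": "S3DataNode"},
        {"key": "directoryPath", "stringValue": "s3://example-bucket/input/"},
    ]},
    {"id": "OutputNode", "name": "OutputNode", "fields": [
        {"key": "type", "stringValue": "S3DataNode"},
        {"key": "directoryPath", "stringValue": "s3://example-bucket/output/"},
    ]},
    # Extra resource the activity runs on (EC2 in this case)
    {"id": "CopyResource", "name": "CopyResource", "fields": [
        {"key": "type", "stringValue": "Ec2Resource"},
        {"key": "instanceType", "stringValue": "t3.micro"},
        {"key": "schedule", "refValue": "DailySchedule"},
    ]},
    # ACTIVITY: the work to do, wired to the data nodes above
    {"id": "CopyData", "name": "CopyData", "fields": [
        {"key": "type", "stringValue": "CopyActivity"},
        {"key": "input", "refValue": "InputNode"},
        {"key": "output", "refValue": "OutputNode"},
        {"key": "runsOn", "refValue": "CopyResource"},
        {"key": "schedule", "refValue": "DailySchedule"},
    ]},
]

dp.put_pipeline_definition(pipelineId=pipeline_id, pipelineObjects=objects)
dp.activate_pipeline(pipelineId=pipeline_id)
```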
Import/Export
- Description
  - Accelerates transferring large amounts of data IN and OUT of AWS using PHYSICAL STORAGE APPLIANCES (bypassing the Internet)
  - Data is copied to a device at the source (AWS or your data center), shipped, and then copied to the destination
- Use Cases
  - Storage migration
  - Application migration
Options
| Option | Description |
| --- | --- |
| SNOWBALL | Uses Amazon-provided shippable storage appliances<br>Each Snowball is protected by KMS & physically rugged<br>Snowball Edge offers data processing at the edge before the data is returned to AWS<br>Sizes: 50TB, 80TB<br>Features:<br>- Import/Export: on-premises ↔ S3<br>- Encryption enforced<br>- Manage jobs via the Snowball console |
| IMPORT/EXPORT DISK | Transfers data directly onto and off of storage devices you own, using Amazon's high-speed internal network<br>Features:<br>- Import data into S3/Glacier/EBS<br>- Export data from S3<br>- Encryption is OPTIONAL<br>- You buy and maintain your own HW devices<br>- No console for managing jobs<br>- Upper limit of 16TB per device |
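
A hedged boto3 sketch of requesting a Snowball IMPORT job that lands data in S3. The bucket ARN, address ID, IAM role, and capacity below are placeholders that would have to exist in your account first (the address comes from a prior create_address call).

```python
import boto3

snowball = boto3.client("snowball", region_name="us-east-1")

job = snowball.create_job(
    JobType="IMPORT",                                        # on-premises -> S3
    Resources={"S3Resources": [{"BucketArn": "arn:aws:s3:::example-import-bucket"}]},
    Description="Migrate on-prem archive to S3",
    AddressId="ADID00000000-0000-0000-0000-000000000000",    # assumed, from create_address()
    RoleARN="arn:aws:iam::123456789012:role/example-snowball-role",
    SnowballCapacityPreference="T80",                        # 80TB appliance
    ShippingOption="SECOND_DAY",
)
print(job["JobId"])
```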