
Data Manipulation

Kinesis

Streaming data platform consisting of 3 services

Use Cases

DATA INGESTION
  • Ensure data is accepted reliably & successfully stored in AWS
REALTIME PROCESSING of massive data streams
  • Act on knowledge gleaned from a big data stream right away
NOT for BATCH Jobs
  • Not appropriate for batch jobs (ETL)

Services

FIREHOSE
  • STORE: Load massive volumes of STREAMING data into AWS
  • Receives stream data & stores it in S3/Redshift/Elasticsearch
  • Just create a Delivery Stream & configure the destination for the data
STREAMS
  • PROCESS: Collect and process large streams of data records in realtime
  • Can create a Kinesis STREAMS App that processes data as it moves through the stream
  • Can scale by distributing incoming data across shards
  • Processing is executed on consumers, which read data from the shards & run the Kinesis Streams App (see the sketch after this list)
ANALYTICS
  • Analyze streaming data in realtime with SQL
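
A minimal boto3 sketch of the producer and consumer side, assuming a stream named "clickstream" and a delivery stream named "clickstream-to-s3" already exist (both names are hypothetical) and AWS credentials are configured:

    import json
    import boto3

    kinesis = boto3.client("kinesis")
    firehose = boto3.client("firehose")

    event = {"user": "u-123", "action": "page_view"}

    # STREAMS: write a record; the PartitionKey determines which shard receives it
    kinesis.put_record(
        StreamName="clickstream",
        Data=json.dumps(event).encode("utf-8"),
        PartitionKey=event["user"],
    )

    # FIREHOSE: hand the record to the delivery stream; Firehose buffers it and
    # loads it into the configured destination (e.g. S3) on its own
    firehose.put_record(
        DeliveryStreamName="clickstream-to-s3",
        Record={"Data": (json.dumps(event) + "\n").encode("utf-8")},
    )

    # STREAMS consumer side (simplified): read records back from one shard
    shard_id = kinesis.describe_stream(StreamName="clickstream")["StreamDescription"]["Shards"][0]["ShardId"]
    iterator = kinesis.get_shard_iterator(
        StreamName="clickstream", ShardId=shard_id, ShardIteratorType="TRIM_HORIZON"
    )["ShardIterator"]
    records = kinesis.get_records(ShardIterator=iterator)["Records"]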

Elastic MapReduce (EMR)

Description
  • Fully managed on-demand HADOOP framework
  • Use Cases
    • Log processing
    • Clickstream analysis
    • Genomics and life sciences
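
A hedged boto3 sketch of launching a cluster on demand (bucket, instance types, and release label are assumptions; EMR_DefaultRole / EMR_EC2_DefaultRole are the standard default roles but must already exist in the account):

    import boto3

    emr = boto3.client("emr")

    cluster = emr.run_job_flow(
        Name="log-processing",
        ReleaseLabel="emr-5.36.0",                 # assumption: any current release label
        Applications=[{"Name": "Hadoop"}],
        LogUri="s3://my-bucket/emr-logs/",         # hypothetical bucket
        ServiceRole="EMR_DefaultRole",
        JobFlowRole="EMR_EC2_DefaultRole",
        Instances={
            "MasterInstanceType": "m5.xlarge",
            "SlaveInstanceType": "m5.xlarge",
            "InstanceCount": 3,
            "KeepJobFlowAliveWhenNoSteps": True,   # keep the cluster up after steps finish
        },
    )
    print(cluster["JobFlowId"])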

Types of Storage

HDFS
  • Hadoop Distributed FS (the standard FS that comes with Hadoop)
  • All data is replicated across multiple instances to ensure durability
  • Can use:
    • EC2 Instance Storage → data is lost if the cluster is shut down
    • EBS
EMRFS
  • Implementation of HDFS that stores data on S3
  • Data is preserved if the cluster is shut down
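
Moving data between the two storage layers is itself just a step; a hedged boto3 sketch using S3DistCp (cluster ID, bucket, and paths are placeholders):

    import boto3

    emr = boto3.client("emr")

    # Copy results out of the cluster-local HDFS into S3 (EMRFS), e.g. so the
    # data survives after a transient cluster is shut down
    emr.add_job_flow_steps(
        JobFlowId="j-XXXXXXXXXXXXX",
        Steps=[{
            "Name": "hdfs-to-s3",
            "ActionOnFailure": "CONTINUE",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": ["s3-dist-cp",
                         "--src", "hdfs:///output/",
                         "--dest", "s3://my-bucket/output/"],
            },
        }],
    )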

Types of Clusters

Persistent
  • Runs 24x7 (when continuous analysis is run on the data)
  • Better suited to HDFS
Transient
  • Shut down when not in use (e.g., terminated once the job's steps finish)
  • Use EMRFS so the data survives the shutdown
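
Roughly, the difference shows up as a single flag in the cluster's Instances config (a hedged sketch; the cluster ID is a placeholder):

    import boto3

    emr = boto3.client("emr")

    # PERSISTENT: the cluster stays up after its steps finish and must be
    # terminated explicitly when it is no longer needed
    persistent = {
        "MasterInstanceType": "m5.xlarge",
        "SlaveInstanceType": "m5.xlarge",
        "InstanceCount": 3,
        "KeepJobFlowAliveWhenNoSteps": True,
    }

    # TRANSIENT: the cluster auto-terminates once all submitted steps have run
    transient = dict(persistent, KeepJobFlowAliveWhenNoSteps=False)

    # Shutting down a persistent cluster by hand
    emr.terminate_job_flows(JobFlowIds=["j-XXXXXXXXXXXXX"])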

Access Control

Security Groups
  • 2 SGs are set up when launching job flows → neither allows external access by default
  • 1 for the PRIMARY node: one port open for communication with the service, one port for SSH
  • 1 for the SECONDARY nodes: only allows interaction with the primary node
IAM
  • Can set permissions that allow users other than the default hadoop user to submit jobs
  • If an IAM user launches a cluster → by default the cluster is hidden from other IAM users
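
That default visibility can be flipped through the API; a hedged boto3 sketch (the cluster ID is a placeholder):

    import boto3

    emr = boto3.client("emr")

    # Make the cluster visible to (and manageable by) other IAM users in the
    # account instead of only the user who launched it
    emr.set_visible_to_all_users(
        JobFlowIds=["j-XXXXXXXXXXXXX"],
        VisibleToAllUsers=True,
    )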

Data Pipeline

Description
  • Process and MOVE data between different AWS storage services, compute services, and on-prem sources, at specified intervals
  • Use Cases
    • BATCH MODE ETL process
    • NOT for continuous data streams (use Kinesis)

Components

PIPELINE
  • Schedules and runs tasks according to the pipeline definition & interacts with data stored in data nodes
DATA NODES
  • Locations where the pipeline reads input data or writes output data (can be AWS or on-prem)
ACTIVITIES
  • Executed by the pipeline
  • Represent common scenarios (e.g., moving data from one location to another)
  • May require additional resources to run (e.g., EMR, EC2)
  • Support PRECONDITIONS: conditional statements that must be true before an activity can run
  • If an activity fails, it is retried automatically
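
A minimal boto3 sketch of the lifecycle (create → define → activate); the pipeline objects below are illustrative rather than a complete definition (a real one also needs roles and a resource for the activity to run on):

    import boto3

    dp = boto3.client("datapipeline")

    # 1. Create an empty pipeline
    pipeline_id = dp.create_pipeline(
        name="daily-s3-copy", uniqueId="daily-s3-copy-001"
    )["pipelineId"]

    # 2. Upload the definition: a schedule, two S3 data nodes, and a copy activity
    objects = [
        {"id": "Default", "name": "Default", "fields": [
            {"key": "scheduleType", "stringValue": "cron"},
            {"key": "schedule", "refValue": "DailySchedule"},
        ]},
        {"id": "DailySchedule", "name": "DailySchedule", "fields": [
            {"key": "type", "stringValue": "Schedule"},
            {"key": "period", "stringValue": "1 day"},
            {"key": "startAt", "stringValue": "FIRST_ACTIVATION_DATE_TIME"},
        ]},
        {"id": "InputNode", "name": "InputNode", "fields": [
            {"key": "type", "stringValue": "S3DataNode"},
            {"key": "directoryPath", "stringValue": "s3://my-bucket/input/"},
        ]},
        {"id": "OutputNode", "name": "OutputNode", "fields": [
            {"key": "type", "stringValue": "S3DataNode"},
            {"key": "directoryPath", "stringValue": "s3://my-bucket/output/"},
        ]},
        {"id": "CopyData", "name": "CopyData", "fields": [
            {"key": "type", "stringValue": "CopyActivity"},
            {"key": "input", "refValue": "InputNode"},
            {"key": "output", "refValue": "OutputNode"},
        ]},
    ]
    dp.put_pipeline_definition(pipelineId=pipeline_id, pipelineObjects=objects)

    # 3. Activate: the activity now runs on the defined schedule, with automatic retries
    dp.activate_pipeline(pipelineId=pipeline_id)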

Import/Export

Description
  • Accelerates transferring large amounts of data IN and OUT of AWS using PHYSICAL STORAGE APPLIANCES (bypassing the Internet)
  • Data is copied to a device at the source (AWS or your data center), shipped, and then copied to the destination
  • Use Cases
    • Storage migration
    • Application migration

Options

SNOWBALL
  • Uses Amazon-provided shippable storage appliances
  • Each Snowball is protected by KMS & physically rugged
  • Snowball Edge offers data processing at the edge before that data is returned to AWS
  • Sizes: 50TB, 80TB
  • Features:
    • Import/Export: on premise ↔ S3
    • Encryption enforced
    • Manage jobs via the AWS Snowball console (jobs can also be created via the API; see the sketch at the end of this section)
IMPORT/EXPORT DISK
  • Transfers data directly onto and off of storage devices you own using Amazon's high-speed internal network
  • Features:
    • Import data into S3/Glacier/EBS
    • Export data from S3
    • Encryption is OPTIONAL
    • You buy and maintain your HW devices
    • Can't manage jobs
    • Upper limit of 16TB per device
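
Snowball jobs can also be created from code rather than the console; a hedged boto3 sketch of an import job (every ARN/ID below is a placeholder, and the shipping address would first be registered with create_address):

    import boto3

    snowball = boto3.client("snowball")

    job = snowball.create_job(
        JobType="IMPORT",                      # ship on-prem data into the S3 bucket below
        Resources={"S3Resources": [{"BucketArn": "arn:aws:s3:::my-bucket"}]},
        AddressId="ADID-00000000-0000-0000-0000-000000000000",
        RoleARN="arn:aws:iam::123456789012:role/snowball-import-role",
        KmsKeyARN="arn:aws:kms:us-east-1:123456789012:key/00000000-0000-0000-0000-000000000000",
        SnowballCapacityPreference="T80",      # the 80TB appliance
        ShippingOption="SECOND_DAY",
        SnowballType="STANDARD",
    )
    print(job["JobId"])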