Databases

Comparison¶

Database	Transport encryption	Encryption at rest	Audit	Authentication	Authorization
RDS	Rooted at global RDS certs, configuration is per-engine	KMS or TDE w/ SQL Server and Oracle - RDS managed key	per-engine log files	per engine user accounts	per engine
DynamoDB	Standard AWS HTTPS endpoint	KMS	CloudTrail	IAM only Cognito possible	IAM identity policies - resources & condition keys
Redshift	ACM managed certificate, redshift specific root	KMS CloudHSM	S3	DB user accounts	SQL
Neptune	Publicly trusted Amazon root	KMS	Console	User accounts; or a limited IAM identity policy mechanism + request signing	Engine-specific; or broad access if using IAM
Aurora	Rooted at global RDS certs	KMS	mysql -> CloudWatch Logs	User accounts; or an IAM authenticated API to obtain short lived passwords to connect	SQL
DocumentDB	Rooted at global RDS certs	KMS	CloudWatch Logs	MongoDB user accounts	MongoDB standard

OLAP vs OLTP

OLAP: Redshift

OLTP: relational DBs

RDS¶

General Info¶

Description

Relational DB Service (RDS): expose a DB instance but does not provide shell access
Operational Benefits
- AWS in charge of: scaling, HA, backups, DB engine patches
- No admin access provided
Pay
- Provisioned compute resources
- Provisioned storage resources

Use Cases

Suitable	Not Suitable
Complex Transactions/Queries Medium-to-High query/write rate No more than a single worker node/shard High durability	Massive read/write rates (e.g., 150K write/s) Sharding Simple GET/PUT requests and queries RDBMS customization

DB Instance

Isolated DB environment deployed in a private network
Each instance runs & manages a DB ENGINE
Configuration


DB Parameter Group	container for engine configuration values that can be applied to 1+ DB instances
DB Option Group	container for engineer features (empty by default)

DB Engines

MySQL

OSS Community Edition with InnoDB
Supports multi-AZ & read replicas

PostgreSQL

Supports multi-AZ & read replicas

Oracle

Edition	Multi-AZ	Encr.	License
Standard One	YES	KMS	included, BYOL
Standard	YES	KMS	BYOL
Enterprise	YES	KMS & TDE	BYOL

MS SQL

Edition	Multi-AZ	Encr.	License
Express	NO	KMS	included
Web	NO	KMS	included
Standard	YES	KMS	included, BYOL
Enterprise	YES	KMS & TDE	BYOL

Aurora

Built using EBS
Scale up to 4-6TB & 30k IOPS
Storage types
- Magnetic
- General Purpose (SSD)
- Provisioned IOPS (SSD)

Operations¶

Backup & Recovery

Backup Mechanisms


AUTOMATED BACKUPS	Continuously tracks changes & backs up the DB Automated backups are turned on by default RDS creates a storage volume snapshot of the complete DB instance (not just individual DBs) Retention Period = 1day by default, max 35 days When DB instance deleted → all automated backup snapshots are deleted Backups occur daily in a 30min maintenance window
MANUAL	At any time Kept until explicitly disabled

Recovery
- You cannot restore from a DB snapshot to an existing DB instance, a new DB instance is created
- Only default DB params & SGs are associated with restored instance

HA

Multi-AZ

Synchronous replication to minimize RPO
Fast failover to minimize RTO
Allows to place a 2ndary copy of DB in another AZ for DISASTER RECOVERY purposes
DNS name that AWS resolves tp IP address: my-app-db.<id>.us-west-2.rds.amazonaws.com

Replication

RDS automatically replicates data from MASTER DB in PRIMARY instance to SLAVE DB/2ndary instance
Read replicas use asynchronous replication, pushing data to the read replicas whenever possible, for improved read performance
AWS provides up to five read replicas for a single database instance, configurable via the AWS console
RDS automatically performs a failover in the event
- loss of availability in primary AZ
- loss of network connectivity to primary DB
- compute unit failure on primary DB
- storage failure on primary DB
Custom failover to standby instance without user intervention

Scaling

Vertical Scalability (UP)

Each DB instance can scale from 5GB to 6TB in provisioned storage
Storage Expansion is supported for all engines except SQL Server

Horizontal Scalability (OUT)

Use cases
- scale beyond capacity of single DB instance
- handle read traffic when source DB is unavailable
- offload reporting or data warehousing
With PARTITIONING (sharding)
- partition DB into multiple instances (shards)
- handle more users/requests but requires additional logic in the app layer
- designed for NoSQL (DynamoDB, Cassandra)
With READ REPLICAS
- offload read transactions from primary DB & increase overall number of transactions
- updates made to the source DB are async copied to the read replica
- supported by: MySQL, PostgreSQL, MariaDB, Aurora
- replicas can span across regions

Security¶

Access Control

When create a DB instance you create a MASTER USER ACCOUNT, a native DB user account that allows to login with ALL DB privileges
Can create additional user accounts
No resource based policies

Network Isolation


Runs in a VPC	DB instance within a `private` subnet Connect to on-premises with VPN `Publicly accessible` option controls whether there is a publicly resolvable DNS name for the instance
DB Subnet Group	Collections of subnets designated for DB instances in a VPC Each subnet group should have at least 1 subnet for each AZ in a region
DB SG	~ EC2 SG Default to `DENY ALL` access Must explicitly authorize network ingress (IP range, EC2 SG) Only allows access to the DB server port
Network ACL	Allow/deny traffic IN/OUT subnet

Encryption

TLS

Encrypt connections App ↔ DB instance
Single root for all RDS database TLS certs

Engine	Description
MySQL & SQL Server	RDS creates a SSL cert & installs it on DB instance when provisioned
MySQL	Launch client with SSL CA pointer to public key
SQL Server	Download public key & import cert within OS
Oracle	Uses Oracle native network encryption

RDS Encryption

Encryption at rest, set during creation, uses KMS
Covers database, backups, replicas, snapshots

Transparent data encryption (TDE)

For SQL Server and Oracle with CloudHSM
Automatically encrypts data before it is written to storage
If you need data at rest with MySQL → app must manage it

DynamoDB¶

General Info¶

Description

NoSQL DB
Provide consistent performance levels by automatically distributing data&traffic for a table over multiple partitions
All data stored on SSD & automatically replicated across multiple AZ → > resilience than RDS, Aurora, Redshift
Supports cross-region replication

Components


`TABLE`	collection of items
`ITEM`	collection of 1+ attributes, with `PRIMARY KEY`
`ATTRIBUTE`	name/value pair

Data Types


`SCALAR`	single value (string, number, binary, boolean, null)
`SET`	single list of 1+ SCALAR each value is unique all must be of same type (string, number, binary)
`DOCUMENT`	nested attributes (`list`, `map`)

Primary Key


`PARTITION KEY`	1 attribute, a partition (hash) key
`PARTITION & SORT KEY`	2 attributes possible to have same `PARTITION KEY`, but those 2 items must have different `SORT KEY`

Secondary Indexes

Creation
- Define 1+ secondary indexes when creating a table with PARTITION & SORT KEY
Kinds


`GLOBAL`	index with `PARTITION & SORT KEY` != from table any number of `GLOBAL` secondary index, created whenever
`LOCAL`	index with = `PARTITION KEY` as primary index with != `SORT KEY` from primary 1! `LOCAL` secondary index & must be created when creating table

Usage
- Updated when an item is modified → consume W capacity units
- GLOBAL → from own provisioned throughput
- LOCAL → from main table

Operations¶

Provisioned Capacity

At creation, provide R/W capacity (R/W capacity units)
Each operation consume capacity units, proportional to size(item)

Example

Table w/o secondary index
Consumes:
- 1 unit for R item <= 4KB
- 1 unit for W item <= 1KB
R item of 110KB → 28 units (ceil(110/4) = ceil(27.5) = 28)
Plus, if strongly consistent → 28 * 2 = 56

Eventual Consistency

DynamoDB stores multiple copies of an item across a region
Eventually Consistent = R request immediately after a W might not show the latest change
Reads
- Eventually Consistent R = might include stale data
- Strongly Consistent R = most up-to-date data

Search


`QUERY`	Primary search operation to find items in a table or secondary index using only `PRIMARY KEY` attribute values Each query requires a `PARTITION KEY ATTRIBUTE NAME` + value to search Results sorted by `PRIMARY KEY` & limited to 1MB
`SCAN`	Read each item in a table or secondary index Returns `ALL` data attributes for each item in the table/index Each request returns <= 1MB data

Managing Data


Atomic Counters	Increment/decrement value & guaranteed to be consistent across multiple concurrent requests
Reading	By default EVENTUALLY CONSISTENT READ `PARTITION KEY` → must be specified `PARTITION & SORT KEY` → must be specified
Scaling	Scale horizontally using PARTITIONS (compute & storage units) Table items stored across multiple partitions Which partition based on the `PARTITION KEY` → items with same `PARTITION KEY` are stored in the same partition
Partition	Provisioned throughput divided evenly across partitions No sharing of throughput across partitions 1 partition <= 10GB data <= 3KB R capacity units OR <= 1K W capacity units After a partition is split, it cannot be merged back together
Streams	List of item modifications for last 24h period `STREAM` = stream records `STREAM RECORD` = single data modification in the table to which the stream belongs & organised into groups (shards) for max 24h
Backups	Full or incremental to a table, in same or different region Copy used for Disaster recovery To federate data across regions to support a multi-region application

Security¶

Encryption

Optional encryption at rest integrated with KMS

Access Control

RESOURCE-level = IAM
- No resource based policies
- Full access to a table requires access to not just the table/<name> resource, but also table/<name>/*
DB-level = fine-grained access controls
- Permissions that allow/deny access to items (rows) and attributes (columns)
- Create them using an IAM policy
- Several condition keys for fine-grained access including: dynamodb:LeadingKeys, dynamodb:Select, dynamodb:Attributes
Each request must contain a valid signature (STS)

Integrations

Get and Put API calls are not logged to CloudTrail
Has a VPC endpoint you can use
Integration with Cognito:
- identity pool with roles configured
- roles have appropriate policy to (a) allow cognito to assume them and (b) perform desired DynamoDB actions

Redshift¶

General Info¶

Description

Data Warehouse service
Relational DB designed for OLAP & optimised for analysis & reporting of large datasets
Based on PostgreSQL
Fast querying capabilities over structured data using standard SQL commands
Data typically spans multiple data repo & at least 2TB
Pay
- Provisioned compute resources
- Storage (charged according to computer resources)

Clusters and Nodes

CLUSTER = LEADER NODE + 1+ COMPUTE NODES
Each cluster contains 1+ DBs
Node Types
- 6 supported
- grouped into 2 categories
  - DENSE COMPUTE = clusters up to 326TB using fast SSD
  - DENSE STORAGE = up to 2PB using large magnetic disks
Slices
- disk storage for a COMPUTE node is divided into a number of slices (>2 & <16)
- nodes participate in parallel query execution

Operations¶

Sort Keys

Enables efficient handling of range-restricted predicates
Types
- COMPOUND = more efficient when query predicates use a PREFIX, subset of sort key column in order
- INTERLEAVED = equal weight to each column in sort key. Query can use any subset of the columns of sort key

Distribution

Distributes & executes query in parallel across all of a cluster's compute nodes
Spreads table data across all compute nodes based on a distribution strategy
All data written to a node is continuously backed up to S3 using snapshots

Distribution Strategy

How to distribute the records across the nodes & slices in a cluster
When run a query, the optimiser shifts the rows to the compute nodes so to perform joins
Goal = minimise impact of redistribution step
3 styles


`EVEN`	Data distributed across slices uniformly (standard)
`KEY`	Rows distributed according to values in 1 column
`ALL`	Full copy of entire table is distributed to each node (lookup tables)

Snapshots

Automated & manual (1-35days & until explicitly deleted)
Used to restore a copy or create a clone

Security¶

Additional Resources

Redshift Security: Attack Surface Explained

Cluster Access


General	By default clusters are CLOSED to everyone AWS account that creates the cluster has FULL ACCESS IAM to create user accounts (RESOURCE-level) No resource based policies
DB-level	DB users are named user accounts that can connect to a DB and are authn when they login to Redshift You grant DB user permissions on a PER-CLUSTER basis (instead of per-table) Users can only see data in the table rows that were GENERATED by them
Owner of DB object	User who creates it By default only a super user of the owner can query/modify/grant permissions Only the owner can modify/delete it

Data Encryption

All data in user-created tables encrypted with AES-256 (all data on disk & any backup)
Uses a 4-tier key-based architecture for encryption

Tier	Description
`DATA ENC KEYS` encrypt data blocks in the cluster	Each block has random AES-256 key
`DB KEY` encrypt `DATA ENC KEYS`	Random AES-256 key Stored on disk in separate network from cluster Stored encrypted by a `MASTER KEY` Kept in memory in the cluster
`CLUSTER KEY` encrypt the `DB KEY`	Stored by AWS or HSM
`MASTER KEY` encrypt the `CLUSTER KEY` if stored in AWS	If not → encrypt the `CLUSTER-KEY-ENCRYPTED DB KEY`

Network Security

SSL
- To comms with S3 or DynamoDB for: COPY, UNLOAD, backup, restore
- Can encrypt client ↔ cluster
Clusters are associated with 1+ security groups
Audit logs
- Disabled by default
- stored to S3 (as well as the standard CloudTrail logs)
- bucket policy has to allow putobject and getacl to a specific user from a redshift AWS account that varies by region: arn:aws:iam::<redshift regional account id>:user/logs

Neptune¶

Encryption

Encryption at rest with KMS
HTTPS access

IAM

Permissions are a subset of RDS permissions, and all the actions are rds
- Can constrain to just neptune with a condition of rds:DatabaseEngine = graphdb
Hybrid model where you can authenticate with IAM, and define identity policies that allow access
- No condition keys, no fine grained access (only a single neptune-db:* action)
- HTTP requests then need to be signed with standard AWS v4 signatures that you construct yourself

Integrations

Interfaces appears in at least two subnets spanning two AZs in a VPC, interfaces have security groups
Optional audit logs
CloudTrail events appear as though they are from the RDS service not Neptune

Aurora¶

Features

Engine type for RDS
Stores 6 copies of the data, across 3 AZ, to ensure failover and disaster recovery → > resilience than RDS
Performance
- Is ~5 times as fast as MySQL and ~3 times as fast as PostgreSQL
- Serverless capability → horizontal scaling
Pay
- Provisioned or actual compute resources
- Actual storage

IAM

Supports IAM database authentication, similar to Neptune
- Attach identity policy to IAM principals that allow rds-db:connect for a particular database user
- Manage user permissions within the DB as per normal (IAM is just for authentication)
- You get a token from the RDS API by specifying the db and user, then use the token in place of the user's password when connecting normally

Integrations

Uses normal VPC security groups to control access within a VPC

DocumentDB¶

Encryption

TLS from the RDS root
KMS encryption at rest

IAM

Master user + mongodb user management
IAM identity policies for management

Integrations

Auditing can be enabled to send events to CloudWatch Logs