
Engineering Decisions

Documentation

Link Notes
Gergely Orosz Scaling Engineering Teams via RFCs: Writing Things Down The power of writing things down, and spreading knowledge across the organization
Gergely Orosz Engineering Planning with RFCs, Design Documents and ADRs What are some successful planning approaches engineering teams use as they grow?
Design Docs at Google Anatomy of a good design doc
Architecture decision record (ADR) An architectural decision record (ADR) is a document that captures an important architectural decision made along with its context and consequences
What is the best way to write a PRD?
  • Anatomy of a good PRD (Product Requirements Document), a document that tells you exactly what you are building
  • Plus, PRD examples from top companies
S.P.A.D.E. Toolkit: How to implement Square's famous decision-making framework A decision-making framework that offers an alternative to consensus, built on accountability and clarity, where the person responsible for executing the decision is the one who decides
Technical Writing Courses for Engineers from Google
  • Design Docs (or RFCs) are a great way to get higher-level feedback on an approach, before starting the work
  • Written to share:
    • context
    • suggested approach
    • bird's eye view of requirements and general architecture of the project
    • tradeoffs
    • and to invite feedback
  • Architecture Decision Records (ADRs) document implementation decisions
  • Written to document decisions, and less for getting feedback on these decisions
  • Usually live in the same repo (living docs)

Cloud

Link Notes
Cloud design patterns Design patterns for building reliable, scalable, secure applications in the cloud by walking through examples based on Microsoft Azure
AWS App-Layer Encryption in AWS
AWS Network access for private clusters Very interesting article going into the problem of providing network connectivity between Kubernetes clusters and other internal tools (like deployment pipelines)
AWS Square Adopting AWS VPC Endpoints at Square Secure communication between data centers and the cloud
AWS Square Providing mTLS Identities to Lambdas Writeup on how Square added support for mutual TLS calls from AWS Lambda into their data center
AWS Square Expanding Secrets Infrastructure to AWS Lambda How Square extended their datacenter-based secrets infrastructure to enable a cloud migration supporting Lambda
AWS Square Connecting Block Business Units with AWS API Gateway How Block enables backend services to securely connect across business unit boundaries using AWS API Gateway
AWS Cloud Encryption is worthless! Click here to see why... When evaluating your cloud security posture priorities, encryption should be at the bottom of your list. First, get your IAM house in order
AWS Building the Next Evolution of Cloud Networks at Slack How Slack's AWS infrastructure evolved from a few hand-built EC2 instances to thousands of instances provisioned across multiple AWS regions
Multicloud failover is almost always a terrible idea Multicloud failover is complex and costly to the point of almost always being impractical, and it's not an especially effective way to address cloud resilience risks

Infrastructure

Link Notes
Uber Why We Leverage Multi-tenancy in Uber's Microservice Architecture
  • Testing in production: make the current production stack multi-tenant and allow both test as well as production traffic to flow through it
  • Canary deployments: a canary can be treated as yet another tenant in a multi-tenant architecture
  • Capture/replay and shadow traffic: replaying previously captured live traffic or replaying a shadow copy of live production traffic in a hermetically safe environment is another use case of multi-tenancy
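The three use cases above share one mechanism: each request carries a tenancy context, and routing consults that context before falling back to the production path. Here is a minimal sketch of that idea, not Uber's implementation. The `Request` type, the tenant names, the backend names, and the `pick_backend` helper are all hypothetical and only illustrate the pattern.

```python
from __future__ import annotations

from dataclasses import dataclass

# Hypothetical default tenant name; the article does not prescribe one.
PRODUCTION = "production"


@dataclass(frozen=True)
class Request:
    path: str
    tenant: str = PRODUCTION  # tenancy context carried with every request


def pick_backend(
    request: Request,
    overrides: dict[str, dict[str, str]],
    default_backends: dict[str, str],
) -> str:
    """Return the backend for a request, preferring a tenant-specific override.

    Tenants without an override for the path fall through to the
    production backend, so only the service under test is isolated.
    """
    tenant_routes = overrides.get(request.tenant, {})
    return tenant_routes.get(request.path, default_backends[request.path])


# Illustrative routing tables (assumed names, not from the article).
default_backends = {"/rides": "rides-v1"}
overrides = {
    "canary": {"/rides": "rides-v2-canary"},  # canary treated as a tenant
    "test": {"/rides": "rides-under-test"},   # test traffic in production
}

prod_backend = pick_backend(Request("/rides"), overrides, default_backends)
canary_backend = pick_backend(Request("/rides", "canary"), overrides, default_backends)
```

With this shape, adding a canary or a test tenant is a routing-table change rather than a new environment, which is the appeal of the multi-tenant approach described in the notes.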
Uber Introducing Domain-Oriented Microservice Architecture This piece explains DOMA, the concerns that led to the adoption of this architecture for Uber, its benefits for platform and product teams, and, finally, some advice for teams who want to adopt this architecture
Uber Crane: Uber’s Next-Gen Infrastructure Stack Post examining the original motivation and some key features behind Uber's multi-year journey to reimagine their infrastructure stack for a hybrid, multi-cloud world
Container technologies at Coinbase: Why Kubernetes is not part of our stack Container technologies also create a large set of challenges that must be overcome to prevent failures
Decentralized GitOps over multiple environments How SAP Artificial Intelligence implements GitOps in their large-scale project spanning multiple environments
How we use HashiCorp Nomad Reliability model of services running in our more than 200 edge cities worldwide
Design Considerations at the Edge of the ServiceMesh Set of design patterns around inbound and outbound traffic to and from a service mesh
A Kubernetes engineer's guide to mTLS What mTLS is, how it relates to ordinary TLS, and why it's relevant to Kubernetes
Lyft Scaling productivity on microservices at Lyft History of development and test environments
monday.com’s Multi-Regional Architecture: A Deep Dive When making a decision to go multi-region, one needs to understand the primary motivation, as the work will vary greatly between performance-first, resilience-first and privacy-first designs
Inside Figma: securing internal web apps A deep-dive into how Figma built a system for securing internal web applications that lets them require SSO authentication, enforce fine-grained authorization (via Okta groups), and support CLI tools, all using ALBs, AWS Cognito, and Okta
Inside Figma: getting out of the (secure) shell A simple solution for zero-trust shell access on AWS, by leveraging AWS SSO and Systems Manager
Building ClickHouse Cloud From Scratch in a Year
  • Have you ever wondered what it takes to build a serverless SaaS offering in under a year?
  • Planning process, design and architecture decisions, security and compliance considerations, global scalability and reliability in the cloud, and some of the lessons learned

Development Environments & CI

Link Notes
AWS Setup
  • My CI/CD pipeline is my release captain: How Amazon continuously releases changes to production by practicing trunk-based development, using CI/CD pipelines to manage deployment artifacts and coordinate releases across multiple production environments, and practicing proactive and automatic rollbacks
  • Automating safe, hands-off deployments: The principles of continuous delivery, the use of blue-green deployments, and how to implement automated canary deployments. It also provides best practices for monitoring and rolling back deployments
Automating Our Infrastructure to Empower Engineers
  • Syncing Dev Environments
  • Mirroring Dev and Prod Environments
  • Developing Locally
  • Deploying to Production
Devpod: Improving Developer Productivity at Uber with Remote Development How Uber improved the daily edit-build-run developer experience using DevPods
Balancing Safety and Velocity in CI/CD at Slack
  • Pre-merge pipeline: high-criticality end-to-end tests execute and must pass before a user's PR can merge to mainline
  • Post-merge pipeline: After merging to mainline, a larger subset (medium-criticality) of end-to-end tests execute against each commit, and these tests give signal to the CD pipeline before deploy
  • Regression pipeline: After merging to mainline, the remainder of end-to-end tests batch execute against batches of commits. These tests give signal to broken low-criticality features on each deploy, and run on a bi-hourly cadence
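The tiering above amounts to a mapping from test criticality to pipeline. The sketch below shows one way to express it, assuming each test belongs to exactly one tier (the notes could also be read as the post-merge set including the pre-merge tests). The `Criticality` enum, the `plan_pipelines` helper, and the test names are hypothetical and not taken from Slack's tooling.

```python
from __future__ import annotations

from enum import Enum


class Criticality(Enum):
    HIGH = "high"
    MEDIUM = "medium"
    LOW = "low"


# Assumed one-to-one mapping from criticality tier to pipeline.
PIPELINE_FOR = {
    Criticality.HIGH: "pre-merge",     # must pass before a PR can merge
    Criticality.MEDIUM: "post-merge",  # runs per commit, signals the CD pipeline
    Criticality.LOW: "regression",     # runs in batches on a schedule
}


def plan_pipelines(tests: dict[str, Criticality]) -> dict[str, list[str]]:
    """Group test names by the pipeline that should run them."""
    plan: dict[str, list[str]] = {name: [] for name in PIPELINE_FOR.values()}
    for test_name, criticality in sorted(tests.items()):
        plan[PIPELINE_FOR[criticality]].append(test_name)
    return plan


# Hypothetical end-to-end tests, for illustration only.
plan = plan_pipelines(
    {
        "login": Criticality.HIGH,
        "send_message": Criticality.HIGH,
        "search": Criticality.MEDIUM,
        "emoji_picker": Criticality.LOW,
    }
)
```

Keeping the mapping in one table makes the safety and velocity tradeoff explicit: promoting a test to a stricter tier is a one-line change, and only the pre-merge tier blocks a merge.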

Various

Link Notes
Why is it so hard to decide to buy?
  1. If it isn't related to your core business
  2. No one else has an offering or one that is any good for you to use
  3. You shouldn't build a better solution; instead, ask yourself why no one else needs to solve this problem
Software Development Waste A taxonomy for any team that's trying to figure out how to be more efficient
The top 10 fallacies in platform engineering
  1. The prioritization fallacy
  2. The visualization fallacy
  3. The "wars you cannot win" fallacy
  4. The "everything and everybody at once" fallacy
  5. The "the new setup isn’t better" fallacy
  6. The abstraction fallacy
  7. The "loudest voice" fallacy
  8. The freedom fallacy
  9. The "Google/Facebook/Netflix" fallacy
  10. The "compete with AWS" fallacy