[Article Review] Mastering Chaos - Netflix's MSA Journey - Josh Evans
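Review
Looking at architectures I have designed myself, I sometimes wonder: will this really work?
I got the impression that this architecture evolved after surviving several major outages.
Someday I would like to build an architecture like this and scale it gracefully.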
Index
Introductions
Microservices Basics
Challenges & Solutions
Organization & Architecture
1. Intro
What Microservices Are Not
Monolithic code Base
Monolithic db
Tightly coupled architecture
2. What is Microservice?
The microservice architectural style is an approach to developing a single application as a suite of small services, each running in its own process and communicating with lightweight mechanisms, often an HTTP resource API.
Separation of Concerns
Modularity, Encapsulation
Scalability
Horizontally Scaling
Workload Partitioning
Virtualization & Elasticity
Automated Ops
On demand Provisioning
What does the Netflix system look like?
Product
Bucket Testing
Subscriber → Customer
Recommendation → movies
Platform
Routing → service-to-service communication
Configuration
Crypto securities
Persistence
Cache
DB
3. Challenges & Solutions
Dependency
Scale
Variance
Change
Dependency
Problem
Intra-Service Requests
Service A → Service B
Network latency, congestion, failure
Logical or Scaling Failure
Cascading Failure
One service down → entire system down → mitigated with Hystrix (or its Kubernetes-era equivalents)
Solutions
Fallback with static return
Isolated thread pool
Circuit breaker
Tests to check the answers
Inoculation with FIT (Fault Injection Testing) [Chaos Monkey]
Synthetic Transactions
Traffic up to 100%
However, limit the testing scope to critical microservices
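The fallback and circuit-breaker ideas listed above can be sketched in a few lines. This is a minimal illustration, not Hystrix itself: the class name `CircuitBreaker`, its parameters, and the two example functions are all hypothetical, and thread-pool isolation is left out to keep the sketch short.

```python
import time


class CircuitBreaker:
    """Toy breaker: closed -> open after N failures -> half-open after a cooldown."""

    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, func, fallback):
        # While open and still cooling down, skip the remote call entirely.
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                return fallback()
            # Cooldown elapsed: half-open, let one trial call through.
        try:
            result = func()
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.max_failures:
                self.opened_at = self.clock()  # trip (or re-trip) the breaker
            return fallback()
        # A success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result


def flaky_recommendations():
    # Hypothetical dependency that is currently down.
    raise ConnectionError("recommendation service unavailable")


def static_recommendations():
    # Static fallback: a canned answer instead of a personalized one.
    return ["popular-title-1", "popular-title-2"]


breaker = CircuitBreaker(max_failures=2, reset_after=60.0)
responses = [breaker.call(flaky_recommendations, static_recommendations) for _ in range(3)]
```

After two failures the breaker opens, so the third call never reaches the failing dependency and the caller still gets a usable (static) response.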
Problem
Client Libraries
Common logic, access code
Building one library for all services means a unified client library, which brings:
Heap Consumption
Logical Defects
Transitive Dependencies
Single point of failure on API Gateway
My thoughts
Generated code and access code should be written in a language-neutral way
(easy API implementation), (access code is easily abstracted away)
Solution
Limit Heap Consumption
Case by case solving problems
CAP Theorem
In the presence of a network partition, you must choose between consistency and availability. Netflix chose availability, with eventual consistency using Cassandra.
Each write goes to more than three nodes → multi-region strategy to keep availability
Once more than three nodes have been written, move on and prioritize other writes
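The idea of accepting a write once enough replicas have acknowledged it can be sketched as a toy model. This is not the Cassandra API: `Replica`, `quorum_write`, and the quorum size used here are hypothetical names and values chosen only for illustration.

```python
class Replica:
    """Hypothetical in-memory replica that can be marked unreachable."""

    def __init__(self, name, up=True):
        self.name = name
        self.up = up
        self.data = {}

    def put(self, key, value):
        if not self.up:
            raise ConnectionError(f"{self.name} unreachable")
        self.data[key] = value


def quorum_write(replicas, key, value, quorum):
    """Try every replica; the write is accepted once `quorum` of them acknowledge."""
    acks = 0
    for replica in replicas:
        try:
            replica.put(key, value)
            acks += 1
        except ConnectionError:
            # An unreachable replica misses this write and must be repaired
            # later; that gap is what makes the system eventually consistent.
            continue
    return acks >= quorum


nodes = [Replica("a"), Replica("b", up=False), Replica("c")]
accepted = quorum_write(nodes, "profile:42", "v2", quorum=2)
```

With one of three replicas down, the write is still accepted, which favors availability over strict consistency.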
Scalability
Use Cases
Stateless services → Not a cache or DB; frequently accessed metadata; loss of a node is a non-event; no instance affinity.
Stateful services → DBs, caches, and custom apps that hold large amounts of data; loss of a node is a notable event. (Avoid holding all business logic in one core service.)
Hybrid Service
In the stateless case, simply respawn the node and serve static resources in the meantime.
Problem
1. Dedicated Shards - An Anti-Pattern
Bulkheading (isolation of thread pools) not done
Cache Failure
Fall back to DBs that cannot handle the load
Solution
Make the nodes redundant and have each write acknowledged by the replicas.
Solutions to the wrong fallback to DBs
Workload partitioning
Request-level caching
Secure token fallback
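Request-level caching, one of the items above, can be sketched as a cache that lives only for the duration of one request, so repeated lookups inside that request do not hit the backing service again. `RequestCache` and `load_subscriber` are hypothetical names for this illustration.

```python
class RequestCache:
    """Memoizes lookups for the lifetime of a single request."""

    def __init__(self, loader):
        self.loader = loader  # function that performs the expensive lookup
        self.values = {}
        self.loads = 0        # how many times the loader was actually called

    def get(self, key):
        if key not in self.values:
            self.values[key] = self.loader(key)
            self.loads += 1
        return self.values[key]


def load_subscriber(subscriber_id):
    # Stand-in for a call to a subscriber service or cache tier.
    return {"id": subscriber_id, "plan": "standard"}


# One cache per incoming request; it is discarded when the request ends.
cache = RequestCache(load_subscriber)
first = cache.get("sub-1")
second = cache.get("sub-1")  # served from the request-level cache
```

Because the cache is scoped to one request, it reduces load on the shared cache tier without introducing long-lived stale data.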
Chaos under load
Auto Scaling Groups & Surviving Failure (Chaos Monkey)
Compute Efficiency
Node Failure
Traffic Spikes
Performance bugs
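How an auto scaling group reacts to a traffic spike can be illustrated with a simple proportional rule. This is a sketch under assumptions, not the actual AWS scaling policy: `desired_instances` and its parameters are hypothetical, and the rule simply scales the fleet so that per-instance load returns to a target.

```python
import math


def desired_instances(current, observed_load, target_load, minimum=1, maximum=100):
    """Scale the fleet proportionally so per-instance load moves back to the target."""
    wanted = math.ceil(current * observed_load / target_load)
    # Clamp to the group's configured bounds.
    return max(minimum, min(maximum, wanted))


# Traffic spike: 10 instances at 90% load with a 60% target.
scale_out = desired_instances(10, 90, 60)
# Quiet period: the same fleet at 30% load.
scale_in = desired_instances(10, 30, 60)
```

The same rule covers node failure: when instances disappear, the load per remaining instance rises and the group grows back toward the target.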
Variance
Use Case (Using Other Tools)
Operational drift (Unintentional Drift)
Polyglot & Containers
Over Time
Alert Thresholds
Timeouts, retries, fallbacks
Throughputs
Across microservices
Reliability Best Practices
Production Ready
Alerts
Apache & Tomcat
Automated canary analysis
Autoscaling
Chaos
Consistent naming
ELB config
Healthcheck
Immutable machine images
Squeeze Testing
Staged, red/black deployments
Timeouts, retries, fallbacks
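The "timeouts, retries, fallbacks" item in the checklist above can be sketched as a small helper. The function name `call_with_retries`, its defaults, and the example callables are hypothetical; the sketch assumes the wrapped call raises `TimeoutError` or `ConnectionError` when the upstream is slow or unreachable.

```python
import time


def call_with_retries(func, fallback, attempts=3, base_delay=0.1, sleep=time.sleep):
    """Retry a failing call with exponential backoff, then fall back."""
    for attempt in range(attempts):
        try:
            return func()
        except (TimeoutError, ConnectionError):
            if attempt < attempts - 1:
                # Back off 0.1s, 0.2s, 0.4s, ... between attempts.
                sleep(base_delay * 2 ** attempt)
    return fallback()


calls = {"n": 0}


def sometimes_slow():
    # Hypothetical upstream that times out twice, then succeeds.
    calls["n"] += 1
    if calls["n"] < 3:
        raise TimeoutError("upstream timed out")
    return "fresh"


delays = []  # record the backoff delays instead of actually sleeping
result = call_with_retries(sometimes_slow, lambda: "cached", sleep=delays.append)
```

Keeping these values consistent across services matters because a retry budget that is too generous in one service can amplify load on another.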
The Paved Road ↔ Off Road
Netflix tech stack vs. Python, Ruby, and other tools
Cost of Variance
Productivity Tooling
Insight & triage capabilities
Base Image Fragmentation
Node management
Lib/platform duplication
Learning curve - production expertise
Solution
Raise awareness of costs
Constrain centralized support
Prioritize by impact
Seek reusable solutions
Change
Problem
It only breaks when we introduce changes
How do we achieve velocity with confidence?
Solution
Conformity checks
Red/black pipelines
Automated canaries
Staged deployments
Squeeze tests
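An automated canary check, one of the solutions above, can be reduced to comparing a metric between the old (baseline) and new (canary) versions. This is a deliberately simplified, single-metric sketch; `canary_passes` and the tolerance value are hypothetical, and a real analysis would likely compare many metrics.

```python
def canary_passes(baseline_errors, baseline_requests,
                  canary_errors, canary_requests, tolerance=0.01):
    """Pass the canary if its error rate is within `tolerance` of the baseline's."""
    baseline_rate = baseline_errors / baseline_requests
    canary_rate = canary_errors / canary_requests
    return canary_rate <= baseline_rate + tolerance


# Baseline: 0.5% errors. A canary at 0.6% is within tolerance and is promoted.
healthy = canary_passes(50, 10_000, 6, 1_000)
# A canary at 4% errors is rejected and the deployment is rolled back.
regressed = canary_passes(50, 10_000, 40, 1_000)
```

Running the comparison on a small slice of traffic first is what lets a staged deployment stop before a bad change reaches every region.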
Hybrid Architecture
Question
What is the right long term architecture?
Do you care about the organizational implications?
Answer
Conway's Law
If you have four teams working on a compiler, you will end up with a four-pass compiler.
Summary
Dependency
Circuit breakers, fallbacks, chaos
Simple clients
Eventual Consistency
Multi-region failover
Scale
Auto-Scaling
Redundancy
Partitioned workloads
Failure-driven design
Chaos under load
Variance
Engineered operations
Understand the cost of variance
Prioritized support by impact
Change
Automated delivery
Integrated Practices
Organization & Architecture
Solutions first, team second