[Article Review] Mastering Chaos - Netflix's MSA Journey - Josh Evans

Review

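When I look at architectures I have designed myself, I sometimes wonder whether they would really work.
This architecture felt like one that evolved after surviving several major outages.
Someday I would like to build an architecture like this and scale it gracefully.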

Index

Introductions
Microservices Basics
Challenges & Solutions
Organization & Architecture

1. Intro

What is not Microservices

Monolithic code Base
Monolithic db
Tightly coupled architecture

2. What is Microservice?

The microservice architectural style is an approach to developing a single app as a suite of small services, each running in its own process and communicating with lightweight mechanisms, often an HTTP resource API.

Separation of Concerns
1. Modularity, Encapsulation
Scalability
1. Horizontally Scaling
2. Workload Partitioning
Virtualization & Elasticity
1. Automated Ops
2. On demand Provisioning

What does the Netflix system look like?

Product
1. Bucket Testing
2. Subscriber → Customer
3. Recommendation → movies
Platform
1. Routing → service-to-service communication
2. Configuration
3. Crypto securities
Persistence
1. Cache
2. DB

3. Challenges & Solutions

Dependency
Scale
Variance
Change

Dependency

Problem

Intra-Service Requests

Service A → Service B

Network latency, congestion, failure
Logical or Scaling Failure
Cascading Failure

One service down → entire system down. Netflix's answer is Hystrix (K8s has equivalents).

Solutions

Fallback with static return
Isolated thread pool
Circuit breaker
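
The static fallback and circuit breaker listed above can be sketched roughly as follows. This is a minimal illustration and not Hystrix's API: `CircuitBreaker`, `flaky_service_b`, and `static_fallback` are hypothetical names, the thresholds are arbitrary, and thread pool isolation is left out.

```python
import time


class CircuitBreaker:
    """Toy circuit breaker: opens after repeated failures, retries after a timeout."""

    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, func, fallback):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                return fallback()  # open: skip the remote call entirely
            # timeout elapsed: half-open, let one trial call through
        try:
            result = func()
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open (or re-open) the circuit
            return fallback()
        # success closes the circuit and clears the failure count
        self.failures = 0
        self.opened_at = None
        return result


calls = {"count": 0}


def flaky_service_b():
    """Hypothetical downstream call that always fails."""
    calls["count"] += 1
    raise ConnectionError("service B is down")


def static_fallback():
    """Static return value served when service B is unavailable."""
    return {"recommendations": []}


breaker = CircuitBreaker(failure_threshold=3, reset_timeout=30.0)
results = [breaker.call(flaky_service_b, static_fallback) for _ in range(5)]
```

After three failed attempts the breaker opens, so the last two calls return the static fallback without touching service B at all. That is what stops one failing dependency from dragging its callers down with it.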

Tests to verify the solutions

Inoculation with FIT (Failure Injection Testing) [Chaos Monkey]
Synthetic Transactions
Percentage of live traffic, up to 100%
However, keep the testing scope limited to critical microservices
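
A rough sketch of scoped failure injection, assuming requests are plain dicts with a `device` field. `with_fault_injection`, `InjectedFailure`, and `get_recommendations` are names made up for this illustration; this is not the actual FIT tooling.

```python
import random


class InjectedFailure(Exception):
    """Raised in place of a real call when a fault is injected."""


def with_fault_injection(func, target_devices, failure_rate, rng=None):
    """Wrap func so that a fraction of in-scope requests fail on purpose."""
    rng = rng or random.Random()

    def wrapper(request):
        in_scope = request.get("device") in target_devices
        # rng.random() is in [0.0, 1.0), so failure_rate=1.0 fails every in-scope call
        if in_scope and rng.random() < failure_rate:
            raise InjectedFailure("simulated failure")
        return func(request)

    return wrapper


def get_recommendations(request):
    """Hypothetical service call used only for this example."""
    return ["movie-1", "movie-2"]


faulty = with_fault_injection(get_recommendations, {"test-tv"}, failure_rate=1.0)

normal = faulty({"device": "phone"})  # out of scope, never fails
try:
    faulty({"device": "test-tv"})  # in scope, fails at a 100% rate
    injected = False
except InjectedFailure:
    injected = True
```

Raising `failure_rate` gradually and widening `target_devices` mirrors the idea of starting with synthetic transactions and moving toward a larger share of live traffic.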

Problem

Client Libraries
1. Common logic, access code
2. A unified client library shared by all services causes problems of its own:
  1. Heap Consumption
  2. Logical Defects
  3. Transitive Dependencies
  4. Single point of failure on API Gateway

My thoughts

Code generation and access code should be written in a language-neutral way

(API implementation becomes easy, and access code is easily abstracted away)

Solution

Limit Heap Consumption
Solve problems case by case

CAP Theorem

In the presence of a network partition, you must choose between consistency and availability. Netflix chose availability, with eventual consistency, using Cassandra.

Data is written to more than three nodes → a multi-region strategy to keep availability

If more than three nodes have been written, make other writes the priority
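
A toy sketch of the availability-first write described above: the write counts as successful once enough replicas acknowledge it, so one unreachable node does not block it. `Replica`, `quorum_write`, and `required_acks` are made-up names for illustration; this is not Cassandra's client API.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    """In-memory stand-in for one storage node."""
    name: str
    up: bool = True
    data: dict = field(default_factory=dict)

    def write(self, key, value):
        if not self.up:
            raise ConnectionError(f"{self.name} is unreachable")
        self.data[key] = value


def quorum_write(replicas, key, value, required_acks):
    """Return True if at least required_acks replicas accepted the write."""
    acks = 0
    for replica in replicas:
        try:
            replica.write(key, value)
            acks += 1
        except ConnectionError:
            continue  # a real store would repair the missed replica later
    return acks >= required_acks


replicas = [Replica("a"), Replica("b", up=False), Replica("c")]
ok = quorum_write(replicas, "user:1", "plan=premium", required_acks=2)
```

Here node `b` is down, yet the write still succeeds with two acknowledgements. Node `b` is stale until it catches up, which is the eventual consistency trade-off.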

Scalability

Use Cases

Stateless services → Not a cache or db, Frequently accessed metadata, Loss of a node is a non-event. No instance affinity.
Stateful services → DBs, caches, and custom apps that hold large amounts of data. Loss of a node is a notable event. (Avoid holding all business logic in one core service.)
Hybrid Service

In the stateless case, simply respawn the node and serve static resources if needed.

Problem

1. Dedicated Shards - An Anti-Pattern

Bulkheading (isolation of thread pools) was not done

Cache Failure
Fallback to DBs that cannot handle the load

Solution

Make them redundant and require writes to be acknowledged.

Solutions to the wrong fallback to DBs
Workload partitioning
Request-level caching
Secure token fallback
Chaos under load
Auto Scaling Groups & Surviving Failure (Chaos Monkey)
Compute Efficiency
Node Failure
Traffic Spikes
Performance bugs
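
Request-level caching from the list above might look roughly like this: the cache lives only for one incoming request, so repeated lookups inside that request hit the backing service once. `RequestContext` and `load_subscriber` are hypothetical names, not Netflix code.

```python
class RequestContext:
    """Holds a cache that lives only as long as one incoming request."""

    def __init__(self):
        self._cache = {}

    def get_or_load(self, key, loader):
        # Load on first access, then reuse the value for the rest of the request.
        if key not in self._cache:
            self._cache[key] = loader(key)
        return self._cache[key]


calls = []


def load_subscriber(subscriber_id):
    """Hypothetical expensive call to the subscriber service."""
    calls.append(subscriber_id)
    return {"id": subscriber_id, "plan": "standard"}


ctx = RequestContext()
first = ctx.get_or_load("sub-42", load_subscriber)
second = ctx.get_or_load("sub-42", load_subscriber)
```

Both lookups return the same object while `load_subscriber` runs only once. If the shared cache tier fails, this keeps a single request from multiplying the load that falls through to the service behind it.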

Variance

Use Case (Using Other Tools)

Operational drift (Unintentional Drift)
Polyglot & Containers
Over Time
Alert Thresholds
Timeouts, retries, fallbacks
Throughput
Across microservices
Reliability Best Practices

Production Ready

Alerts
Apache & Tomcat
Automated canary analysis
Autoscaling
Chaos
Consistent naming
ELB config
Healthcheck
Immutable machine images
Squeeze Testing
Staged, red/black deployments
Timeouts, retries, fallbacks
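
As one concrete instance of the "Timeouts, retries, fallbacks" item, here is a small sketch of retries with exponential backoff. `call_with_retries` and its defaults are assumptions for illustration; timeouts and fallbacks would be layered on top.

```python
import time


def call_with_retries(func, attempts=3, base_delay=0.1, sleep=time.sleep):
    """Call func, retrying on any exception with exponentially growing delays."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(attempts):
        try:
            return func()
        except Exception as error:
            last_error = error
            if attempt < attempts - 1:
                sleep(base_delay * (2 ** attempt))  # 0.1s, 0.2s, 0.4s, ...
    raise last_error


state = {"calls": 0}
delays = []


def sometimes_failing():
    """Hypothetical call that fails twice, then succeeds."""
    state["calls"] += 1
    if state["calls"] < 3:
        raise TimeoutError("slow dependency")
    return "ok"


# Record delays instead of actually sleeping so the example runs instantly.
result = call_with_retries(sometimes_failing, attempts=3, sleep=delays.append)
```

Unbounded retries can amplify load on a struggling service. Bounding the attempt count and backing off between attempts is what keeps retry settings from drifting into that problem.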

The Paved Road ↔ Off Road

The Netflix tech stack vs. Python, Ruby, and other tools

Cost of Variance

Productivity Tooling
Insight & triage capabilities
Base Image Fragmentation
Node management
Lib/platform duplication
Learning curve - production expertise

Solution

Raise awareness of costs
Constrain centralized support
Prioritize by impact
Seek reusable solutions

Change

Problem

It only breaks when we introduce changes
How do we achieve velocity with confidence?

Solution

Conformity checks
Red/black pipelines
Automated canaries
Staged deployments
Squeeze tests
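
The automated canary step can be pictured as a comparison between a baseline group and a canary group running the new code. This sketch assumes a single error-rate metric and arbitrary thresholds. `canary_passes`, `max_ratio`, and `min_allowed_rate` are hypothetical, and real canary analysis compares many more signals.

```python
def canary_passes(baseline_errors, baseline_requests,
                  canary_errors, canary_requests,
                  max_ratio=1.5, min_allowed_rate=0.001):
    """Return True if the canary error rate is acceptably close to the baseline."""
    if baseline_requests == 0 or canary_requests == 0:
        return False  # not enough traffic to judge
    baseline_rate = baseline_errors / baseline_requests
    canary_rate = canary_errors / canary_requests
    # The floor keeps a zero-error baseline from rejecting every canary error.
    allowed = max(baseline_rate * max_ratio, min_allowed_rate)
    return canary_rate <= allowed


healthy = canary_passes(10, 10_000, 1, 1_000)   # canary matches the baseline rate
degraded = canary_passes(10, 10_000, 5, 1_000)  # canary has 5x the baseline rate
```

A failing check would stop the rollout before the new version receives more traffic. In this sketch, that is how velocity and confidence go together.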

Hybrid Architecture

Question

What is the right long term architecture?
Do you care about the organizational implications?

Answer

Conway's Law

If you have four teams working on a compiler you will end up with a four pass compiler

Summary

Dependency
1. Circuit breakers, fallbacks, chaos
2. Simple clients
3. Eventual Consistency
4. Multi-region failover
Scale
1. Auto-Scaling
2. Redundancy
3. Partitioned workloads
4. Failure-driven design
5. Chaos under load
Variance
1. Engineered operations
2. Understand the cost of variance
3. Prioritized support by impact
Change
1. Automated delivery
2. Integrated Practices
Organization & Architecture
1. Solutions first, team second
