[Article Review] Mastering Chaos - Netflix's MSA Journey - Josh Evans

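Review

  1. Looking at the architectures I have designed myself, I sometimes wonder whether they would really work.

  2. I got the impression that this architecture evolved after surviving several major outages.

  3. Someday I would like to build an architecture like this and scale it gracefully.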

Index


  1. Introductions

  2. Microservices Basics

  3. Challenges & Solutions

  4. Organization & Architecture

1. Intro


What is not Microservices

  1. Monolithic code Base

  2. Monolithic db

  3. Tightly coupled architecture

2. What is Microservice?


The microservice architectural style is an approach to developing a single application as a suite of small services, each running in its own process and communicating with lightweight mechanisms, often an HTTP resource API.

  1. Separation of Concerns

    1. Modularity, Encapsulation

  2. Scalability

    1. Horizontally Scaling

    2. Workload Partitioning

  3. Virtualization & Elasticity

    1. Automated Ops

    2. On demand Provisioning

What does the Netflix system look like?

  1. Product

    1. Bucket Testing

    2. Subscriber → Customer

    3. Recommendation → movies

  2. Platform

    1. Routing → service-to-service communication

    2. Configuration

    3. Crypto securities

  3. Persistence

    1. Cache

    2. DB

3. Challenge & Solutions


  1. Dependency

  2. Scale

  3. Variance

  4. Change

Dependency


Problem

  1. Intra-Service Requests

Service A → Service B

  1. Network latency, congestion, failure

  2. Logical or Scaling Failure

  3. Cascading Failure

One service goes down → the entire system goes down → prevented with Hystrix (or a K8s equivalent)

Solutions

  1. Fallback with static return

  2. Isolated thread pool

  3. Circuit breaker
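To make the fallback and circuit-breaker ideas above concrete, here is a minimal sketch. It is not Hystrix's API; the `CircuitBreaker` class, its parameters, and the two demo functions are hypothetical names chosen for illustration, and thread-pool isolation is left out to keep it short.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitBreaker:
    """Opens after `max_failures` consecutive failures, retries after `reset_after` seconds."""

    def __init__(self, max_failures: int = 3, reset_after: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at: Optional[float] = None

    def is_open(self) -> bool:
        if self.opened_at is None:
            return False
        # Once the cool-down has passed, let a trial call through (half-open).
        return self.clock() - self.opened_at < self.reset_after

    def call(self, fn: Callable[[], T], fallback: Callable[[], T]) -> T:
        if self.is_open():
            return fallback()  # fail fast without touching the dependency
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            return fallback()
        # A successful call closes the circuit again.
        self.failures = 0
        self.opened_at = None
        return result


def flaky_service_b() -> list:
    # Stand-in for "Service A -> Service B" when B is down.
    raise ConnectionError("Service B is down")


def static_recommendations() -> list:
    # Static fallback: a fixed, non-personalized answer.
    return ["popular-title-1", "popular-title-2"]


breaker = CircuitBreaker(max_failures=2, reset_after=30.0)
results = [breaker.call(flaky_service_b, static_recommendations) for _ in range(3)]
```

After the second failure the breaker opens, so the third call returns the static fallback without calling Service B at all. That is the point of the pattern: a failing dependency stops consuming the caller's resources.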

Tests to check the answers

  1. Inoculation with FIT (Fault Injection Testing) [Chaos Monkey]

  2. Synthetic Transactions

  3. Traffic up to 100%

  4. However, scope the testing to the critical microservices
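The fault-injection idea above can be sketched roughly as follows. This is only an illustration, not Netflix's FIT tooling; `call_with_fit`, `render_home`, and the service names are hypothetical, and the split into a critical service (playback) and a non-critical one (recommendations) is my assumption.

```python
from typing import Callable, Dict, Set


class InjectedFault(Exception):
    """Raised instead of calling a dependency that is marked as failing."""


def call_with_fit(service: str, fn: Callable[[], str], failing: Set[str]) -> str:
    # If the service is in the injected-failure set, fail before calling it.
    if service in failing:
        raise InjectedFault(f"injected failure for {service}")
    return fn()


def render_home(failing: Set[str]) -> Dict[str, str]:
    page: Dict[str, str] = {}
    # Critical dependency: no fallback, a failure here fails the page.
    page["playback"] = call_with_fit("playback", lambda: "play button", failing)
    # Non-critical dependency: degrade to a static answer.
    try:
        page["recommendations"] = call_with_fit(
            "recommendations", lambda: "personalized rows", failing)
    except InjectedFault:
        page["recommendations"] = "static popular rows"
    return page


healthy = render_home(set())
degraded = render_home({"recommendations"})
```

With the non-critical service failed on purpose, the page still renders with static rows, which is what the test is meant to prove.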

Problem

  1. Client Libraries

    1. Common logic, access code

    2. Unifying the client library across all services brings its own problems:

      1. Heap Consumption

      2. Logical Defects

      3. Transitive Dependencies

      4. Single point of failure on API Gateway

My thoughts

Code generation and access code should be written in a language-neutral way

(easy API implementation; access code is easily abstracted away)

Solution


  1. Limit heap consumption

  2. Solve problems case by case

CAP Theorem


In the presence of a network partition, you must choose between consistency and availability. Netflix chose availability, with eventual consistency, using Cassandra.

Data is written to more than three nodes → a multi-region strategy to keep availability

Once a write has reached more than 3 nodes, other writes can be prioritized
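One way to picture the availability-first write path is a quorum write: the write succeeds once enough replicas acknowledge it, and the replicas that were down catch up later. The sketch below is a toy model, not Cassandra's API; `Replica`, `quorum_write`, the replica count of three, and the majority rule are all assumptions for illustration.

```python
from typing import Dict, List


class Replica:
    """Toy stand-in for a storage node (hypothetical, in-memory only)."""

    def __init__(self, name: str, up: bool = True) -> None:
        self.name = name
        self.up = up
        self.data: Dict[str, str] = {}

    def write(self, key: str, value: str) -> bool:
        if not self.up:
            return False  # node unreachable, no acknowledgement
        self.data[key] = value
        return True


def quorum_write(replicas: List[Replica], key: str, value: str) -> bool:
    """Return True when a majority of replicas acknowledged the write."""
    needed = len(replicas) // 2 + 1
    acks = sum(1 for replica in replicas if replica.write(key, value))
    return acks >= needed


replicas = [Replica("a"), Replica("b"), Replica("c", up=False)]
ok = quorum_write(replicas, "profile:42", "v2")   # 2 of 3 acks is enough
stale_before_repair = "profile:42" not in replicas[2].data

# Later the third node comes back and is brought up to date (eventual consistency).
replicas[2].up = True
replicas[2].write("profile:42", "v2")
```

The write is accepted while one node is down, at the cost of that node being stale until it is repaired.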

Scalability


Use Cases


  1. Stateless services → not a cache or DB; hold frequently accessed metadata; no instance affinity; loss of a node is a non-event.

  2. Stateful services → DBs, caches, and custom apps that hold large amounts of data; loss of a node is a notable event. (Avoid holding all business logic in one core service.)

  3. Hybrid services

In case 1, simply respawn the node and serve static resources in the meantime.

Problem


  1. Dedicated Shards - An Anti-Pattern

    1. Bulkheading (isolation of thread pools) was not done

    2. Cache failure

    3. Fallback to DBs that cannot handle the load

Solution


  1. Make the caches redundant and have writes acknowledged by the replicas.

  1. Solutions to the wrong fallback to DBs

  2. Workload partitioning

  3. Request-level caching

  4. Secure token fallback

  5. Chaos under load

  6. Auto Scaling Groups & Surviving Failure (Chaos Monkey)

  7. Compute Efficiency

  8. Node Failure

  9. Traffic Spikes

  10. Performance bugs
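Of the items above, request-level caching is the easiest to show in code: within one incoming request, the same lookup is done once and reused, so repeated calls do not reach the cache tier or the DB. A minimal sketch, with `RequestContext` and `load_subscriber` as hypothetical names:

```python
from typing import Callable, Dict, List


class RequestContext:
    """Lives for a single incoming request; its cache is discarded with it."""

    def __init__(self) -> None:
        self._cache: Dict[str, str] = {}

    def get(self, key: str, loader: Callable[[str], str]) -> str:
        # Only the first lookup of a key in this request calls the loader.
        if key not in self._cache:
            self._cache[key] = loader(key)
        return self._cache[key]


calls: List[str] = []


def load_subscriber(key: str) -> str:
    # Stand-in for a remote call to a subscriber service.
    calls.append(key)
    return f"subscriber-record-for-{key}"


ctx = RequestContext()
first = ctx.get("user-1", load_subscriber)
second = ctx.get("user-1", load_subscriber)  # served from the request cache
```

Because the cache is tied to one request, there is no invalidation problem: a new request starts with an empty `RequestContext`.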

Variance


Use Case (Using Other Tools)


  1. Operational drift (unintentional drift)

  2. Polyglot & Containers

    Over Time

    Alert Thresholds

    Timeouts, retries, fallbacks

    Throughputs

    Across microservices

    Reliability Best Practices

Production Ready


  1. Alerts

  2. Apache & Tomcats

  3. Automated canary analysis

  4. Autoscaling

  5. Chaos

  6. Consistent naming

  7. ELB config

  8. Healthcheck

  9. Immutable machine images

  10. Squeeze Testing

  11. Staged, red/black deployments

  12. Timeouts, retries, fallbacks
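A checklist like this lends itself to an automated check per service. The sketch below is only a guess at the shape of such a check; the practice keys and the `missing_practices` function are hypothetical and cover just a subset of the list.

```python
from typing import Dict, List

# Hypothetical keys for a few of the checklist items above.
REQUIRED_PRACTICES = [
    "alerts",
    "autoscaling",
    "healthcheck",
    "canary_analysis",
    "timeouts_retries_fallbacks",
]


def missing_practices(service_config: Dict[str, bool]) -> List[str]:
    """List the checklist items a service has not enabled yet."""
    return [p for p in REQUIRED_PRACTICES if not service_config.get(p, False)]


service = {"alerts": True, "autoscaling": True, "healthcheck": False}
gaps = missing_practices(service)
```

Here `gaps` lists `healthcheck`, `canary_analysis`, and `timeouts_retries_fallbacks`, which is the kind of report a team could use to see how far a service has drifted from the checklist.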

The Paved Road ↔ Off Road

The Netflix tech stack vs. Python, Ruby, and other tools

Cost of Variance


  1. Productivity Tooling

  2. Insight & triage capabilities

  3. Base Image Fragmentation

  4. Node management

  5. Lib/platform duplication

  6. Learning curve - production expertise

Solution


  1. Raise awareness of costs

  2. Constrain centralized support

  3. Prioritize by impact

  4. Seek reusable solutions

Change


Problem


  1. It only breaks when we introduce changes

  2. How do we achieve velocity with confidence?

Solution


  1. Conformity checks

  2. Red/black pipelines

  3. Automated canaries

  4. Staged deployments

  5. Squeeze tests
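The automated canary step can be pictured as comparing the new version's metrics against the current baseline before rolling out further. This is a simplified sketch: the `Metrics` type, the error-rate-only comparison, and the 1% tolerance are assumptions, and a real canary analysis would look at many more signals.

```python
from dataclasses import dataclass


@dataclass
class Metrics:
    requests: int
    errors: int

    @property
    def error_rate(self) -> float:
        # Guard against division by zero when no traffic was served.
        return self.errors / self.requests if self.requests else 0.0


def canary_passes(baseline: Metrics, canary: Metrics, tolerance: float = 0.01) -> bool:
    """Pass if the canary's error rate is within `tolerance` of the baseline's."""
    return canary.error_rate <= baseline.error_rate + tolerance


good = canary_passes(Metrics(1000, 5), Metrics(1000, 8))    # 0.8% vs 0.5% baseline
bad = canary_passes(Metrics(1000, 5), Metrics(1000, 40))    # 4.0% vs 0.5% baseline
```

If the check fails, the deployment stops at the canary stage instead of reaching all instances, which is how velocity and confidence can coexist.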

Hybrid Architecture


Question


  1. What is the right long term architecture?

  2. Do you care about the organizational implications?

Answer

Conway's Law

If you have four teams working on a compiler, you will end up with a four-pass compiler.

Summary


  1. Dependency

    1. Circuit breakers, fallbacks, chaos

    2. Simple clients

    3. Eventual Consistency

    4. Multi-region failover

  2. Scale

    1. Auto-Scaling

    2. Redundancy

    3. Partitioned workloads

    4. Failure-driven design

    5. Chaos under load

  3. Variance

    1. Engineered operations

    2. Understand the cost of variance

    3. Prioritized support by impact

  4. Change

    1. Automated delivery

    2. Integrated Practices

  5. Organization & Architecture

    1. Solutions first & team second
