[Article Review] The Future of Data Engineering

Takeaways

  1. These architectures, too, start from one small app and evolve gradually.

  2. I expect a roadmap like this will inform the architectures I design in the future.

  3. I'm in awe of the engineers who broke through the realtime barrier.

Index


Data Infra & Data Engineering Introduction

  • Airflow (Python-based workflow scheduler)

  • BigQuery (Data Warehouse)

  • Kafka (Logging)

  1. Data engineers help move & process data. They build:

    1. tools

    2. infra

    3. frameworks

    4. services

    Why? See Aida's article.

    How they built the data warehouse:

    1. Redshift + Airflow

    2. BigQuery + Airflow

  2. Six stages of data pipeline maturity

    1. Stage 0: None

    2. Stage 1: Batch

    3. Stage 2: Realtime

    4. Stage 3: Integration

    5. Stage 4: Automation

    6. Stage 5: Decentralization

Stage 0

Requirements


  1. You have no data warehouse

  2. You have a monolithic architecture

  3. You need a data warehouse up and running

  4. Data engineering isn't your job

Problems


  1. Queries began timing out

  2. Users were impacting each other

  3. MySQL was missing complex analytical SQL functions (e.g., window functions; see the sketch after this list)

  4. Report Generation was breaking

  5. You're ready for batch if:

    1. You have a monolithic architecture

    2. You're exceeding DB capacity
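
Problem 3 refers to analytical SQL that an OLTP MySQL (before 8.0) simply lacked, such as window functions. A minimal sketch of such a query running in a warehouse instead, using the google-cloud-bigquery client (the project and table names are hypothetical):

```python
# Sketch: an analytical query using a window function.
# MySQL before 8.0 had no window functions, which is one reason
# analytics moved off the OLTP database into a warehouse.
from google.cloud import bigquery  # pip install google-cloud-bigquery

client = bigquery.Client()

# `orders` is a hypothetical table; RANK() OVER (...) is the kind of
# analytical function the OLTP database could not run.
query = """
    SELECT
        customer_id,
        order_total,
        RANK() OVER (PARTITION BY customer_id
                     ORDER BY order_total DESC) AS rnk
    FROM `my_project.analytics.orders`
"""
for row in client.query(query).result():
    print(row.customer_id, row.order_total, row.rnk)
```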

Stage 1: Batch

Requirements


  1. You have a monolithic architecture

  2. Data Engineering is your part-time job

  3. Queries are timing out

  4. Exceeding DB capacity

  5. Need Complex analytical SQL functions

  6. Need reports, charts, and business intelligence

Problems


  1. A large number of Airflow jobs was needed to load all the tables

  2. Missing or inaccurate create_time & modify_time columns

  3. DBA operations (schema changes) were impacting the pipeline

  4. Hard deletes weren't propagating (batch loads only see surviving rows; see the DAG sketch after this list)

  5. MySQL replication latency was causing data quality issues

  6. Periodic loads caused occasional MySQL timeouts
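
A minimal sketch of this stage, assuming Airflow 2.x and a hypothetical MySQL-to-BigQuery load (all names are illustrative). Incremental extracts keyed on modify_time are also why hard deletes don't propagate: a deleted row simply stops appearing in the extract.

```python
# Sketch: one periodic batch-load DAG per source table (Stage 1 style).
# Incremental extracts keyed on modify_time never see hard-deleted rows,
# and each scheduled run puts load on the source MySQL.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def load_orders(**context):
    # Hypothetical load: SELECT rows whose modify_time is newer than the
    # last run from the MySQL replica, then append them to the warehouse.
    ...


with DAG(
    dag_id="load_orders_mysql_to_bigquery",  # one of many such DAGs
    start_date=datetime(2024, 1, 1),
    schedule="@hourly",  # every tick queries the MySQL replica
    catchup=False,
):
    PythonOperator(task_id="load_orders", python_callable=load_orders)
```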

Stage 2: Realtime


Requirements


  1. Loads are taking too long

  2. Pipeline is no longer stable

  3. Many complicated workflows

  4. Data latency is becoming an issue

  5. Data engineering is your full-time job

  6. You already have Apache Kafka in your organization

Problems


  1. Pipeline for Datastore was still on Airflow

  2. No pipeline at all for Cassandra or Bigtable

  3. BigQuery needed logging data

  4. Elasticsearch needed data

  5. GraphDB needed data (one streaming consumer per sink; see the sketch below)
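
A minimal sketch of the realtime pattern, assuming change events already flow through Kafka (via Debezium or similar); the broker, topic, and table names are hypothetical:

```python
# Sketch: consume CDC events from Kafka and stream them into BigQuery.
# One consumer like this per sink (BigQuery, Elasticsearch, GraphDB, ...)
# replaces the periodic batch load.
import json

from confluent_kafka import Consumer
from google.cloud import bigquery

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",  # hypothetical broker
    "group.id": "bigquery-sink",
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["mysql.shop.orders"])   # hypothetical CDC topic

bq = bigquery.Client()
TABLE = "my_project.analytics.orders"       # hypothetical destination

while True:
    msg = consumer.poll(1.0)
    if msg is None or msg.error():
        continue
    row = json.loads(msg.value())           # one change event -> one row
    errors = bq.insert_rows_json(TABLE, [row])
    if errors:
        print("insert failed:", errors)
```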

Stage 3: Integration


Requirements


  1. You have microservices

  2. You have a diverse database ecosystem

  3. You have many specialized derived data systems

  4. You have a team of data engineers

  5. You have a mature SRE organization

Metcalfe's law

The value of a network increases with the number of connected nodes.
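
Concretely, with n systems the number of possible pairwise connections grows quadratically, which is why each system added to a shared pipeline multiplies its value (and, conversely, why wiring every pair by hand stops scaling):

```latex
% Metcalfe's law: network value scales with the number of node pairs.
V(n) \propto \binom{n}{2} = \frac{n(n-1)}{2} \approx \frac{n^2}{2}
```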

Problems


  1. Add a new channel to the MySQL replica

  2. Create and configure Kafka topics

  3. Add a new Debezium connector to Kafka Connect (see the sketch after this list)

  4. Create the destination dataset in BigQuery

  5. Add a new KCBQ (Kafka Connect BigQuery) connector to Kafka Connect

  6. Create BigQuery views

  7. Configure data quality checks for new tables

  8. Grant access to BigQuery dataset

  9. Deploy stream processors or workflows
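
Most of this checklist is config plus API calls. A minimal sketch of step 3, registering a Debezium MySQL connector through the Kafka Connect REST API (hosts, credentials, and names are hypothetical, and the config is abridged):

```python
# Sketch: register a Debezium MySQL source connector via the
# Kafka Connect REST API (step 3 of the checklist above).
import requests  # pip install requests

connector = {
    "name": "orders-mysql-source",             # hypothetical connector
    "config": {
        "connector.class": "io.debezium.connector.mysql.MySqlConnector",
        "database.hostname": "mysql-replica",  # hypothetical host
        "database.port": "3306",
        "database.user": "debezium",
        "database.password": "secret",
        "database.server.id": "184054",
        "topic.prefix": "mysql.shop",
        "table.include.list": "shop.orders",
        # (abridged; real deployments also need schema-history settings)
    },
}

resp = requests.post(
    "http://kafka-connect:8083/connectors",    # hypothetical Connect host
    json=connector,
    timeout=10,
)
resp.raise_for_status()
print(resp.json())
```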

Stage 4: Automation


Requirements


  1. Your SREs can't keep up

  2. You're spending a lot of time on manual toil

  3. You don't have time for the fun stuff

Automated Ops


  1. Terraform

  2. Ansible

  3. Helm

  4. Salt

  5. CloudFormation

  6. Chef

  7. Puppet

  8. Spinnaker
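
All of these are declarative infrastructure-as-code tools. A rough Python sketch of the same declared-state idea, reconciling Kafka topics against a desired list (broker and topic names are hypothetical):

```python
# Sketch: idempotent, declared-state ops -- create only the Kafka topics
# that are missing, the same reconcile loop Terraform et al. automate.
from confluent_kafka.admin import AdminClient, NewTopic

DESIRED_TOPICS = ["mysql.shop.orders", "mysql.shop.users"]  # hypothetical

admin = AdminClient({"bootstrap.servers": "localhost:9092"})
existing = set(admin.list_topics(timeout=10).topics)
missing = [t for t in DESIRED_TOPICS if t not in existing]

if missing:
    futures = admin.create_topics(
        [NewTopic(t, num_partitions=6, replication_factor=3)
         for t in missing]
    )
    for topic, future in futures.items():
        future.result()  # raises if creation failed
        print("created", topic)
else:
    print("already in desired state")
```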

Problem


  1. Regulations

Setting up a data catalog


  1. Location

  2. Schema

  3. Ownership

  4. Lineage

  5. Encryption

  6. Versioning
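
A minimal sketch of what a single catalog entry might carry, one field per item above; the structure is hypothetical rather than any particular catalog product:

```python
# Sketch: one data-catalog entry carrying the metadata listed above.
from dataclasses import dataclass, field


@dataclass
class CatalogEntry:
    location: str        # e.g. "bigquery://my_project.analytics.orders"
    schema: dict         # column name -> type
    owner: str           # owning team or service account
    lineage: list = field(default_factory=list)  # upstream datasets
    encrypted: bool = True                       # encryption at rest
    version: int = 1                             # schema version


entry = CatalogEntry(
    location="bigquery://my_project.analytics.orders",
    schema={"order_id": "INT64", "customer_id": "INT64",
            "total": "NUMERIC"},
    owner="data-eng@example.com",
    lineage=["mysql://shop/orders"],
)
```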

Why are we doing this?

  1. Who gets access to this data?

  2. How long can this data persist?

  3. Is this data allowed in the system?

  4. Which geographies must data be persisted in?

  5. Should columns be masked?

Configure your access


  1. RBAC

  2. IAM

  3. ACL (Access Control List) → Kafka Airbag

Configure your policies


  1. Role-based access controls

  2. Identity access management

  3. Access control lists

Automate management


  1. New user access

  2. New data access

  3. Service account access

  4. Temporary access

  5. Unused Access
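
A minimal sketch of automating the first item, granting a new user read access to a BigQuery dataset with the google-cloud-bigquery client (project, dataset, and email are hypothetical):

```python
# Sketch: automate "new user access" -- grant dataset-level read access
# in BigQuery instead of handling each request by hand.
from google.cloud import bigquery

client = bigquery.Client()
dataset = client.get_dataset("my_project.analytics")  # hypothetical

entries = list(dataset.access_entries)
entries.append(
    bigquery.AccessEntry(
        role="READER",
        entity_type="userByEmail",
        entity_id="new.analyst@example.com",          # hypothetical user
    )
)
dataset.access_entries = entries
client.update_dataset(dataset, ["access_entries"])
```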

Detect Violations


  1. Auditing

  2. Data loss prevention

Progress


Users can find the data that they need

Automated data management and ops

Problems

  1. Data engineering still manages configs and deployments

Stage 5: Decentralization


Requirements


  1. You have a fully automated realtime data pipeline

  2. People still come to you to get data loaded

Partial Decentralization


  1. Raw tools are exposed to other engineering teams

  2. Requires Git, YAML, JSON, pull requests, and Terraform commands (a hypothetical spec is sketched below)
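
A hypothetical example of that raw interface: a team opens a pull request adding a YAML spec, and data-engineering tooling turns it into topics, connectors, and datasets. The spec format below is illustrative only:

```python
# Sketch: a PR-driven pipeline spec (partial decentralization).
# Teams commit YAML like this; tooling applies it with Terraform etc.
import yaml  # pip install pyyaml

SPEC = """
pipeline: orders-to-bigquery     # hypothetical spec format
source:
  type: mysql
  database: shop
  table: orders
sink:
  type: bigquery
  dataset: analytics
owner: checkout-team@example.com
"""

spec = yaml.safe_load(SPEC)
print(spec["pipeline"], "->", spec["sink"]["dataset"])
```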

Full Decentralization


  1. Polished tools are exposed to everyone

  2. Security and compliance manage access and policy

  3. Data engineering manages data tooling and infra

  4. Everyone manages data pipelines and data warehouse
