[Article Review] The Future of Data Engineering
Review
These architectures, too, start from a single small app and evolve step by step.
I expect this roadmap to help with the architectures I design in the future.
I'm in awe of the engineers who broke through the realtime barrier.
Index
Data Infra & Data Engineering Introduction
Airflow (workflow scheduler)
BigQuery (Data Warehouse)
Kafka (Logging)
A data engineer needs to help move & process data
build tools
infra
frameworks
services
Why? Aida's article
How they built the data warehouse
Redshift + Airflow
BigQuery + Airflow
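A minimal sketch of this warehouse-loading pattern, assuming hypothetical table names, placeholder extract/load logic, and an Airflow 2.x install: one DAG that dumps a table from the MySQL replica once a day and loads it into BigQuery (or Redshift).

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract_orders(**context):
    # e.g. dump changed rows from the MySQL replica for the execution date
    # (a real task would use a MySQL hook and stage the file in GCS/S3)
    pass


def load_orders(**context):
    # e.g. load the staged file into a BigQuery (or Redshift) table
    pass


with DAG(
    dag_id="orders_daily_load",             # hypothetical pipeline name
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",             # named `schedule` on newer Airflow releases
    catchup=False,
) as dag:
    extract = PythonOperator(task_id="extract_orders", python_callable=extract_orders)
    load = PythonOperator(task_id="load_orders", python_callable=load_orders)
    extract >> load
```

One DAG like this per table is exactly what the later Stage 1 notes flag as a problem.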
Six stages of data pipeline maturity
Stage 0: None
Stage 1: Batch
Stage 2: Realtime
Stage 3: Integration
Stage 4: Automation
Stage 5: Decentralization
Stage 0: None
Requirements
You have no data warehouse
You have a monolithic architecture
You need a data warehouse up and running
Data engineering isn't your job
Problems
Queries began timing out
Users were impacting each other
MySQL was missing complex analytical SQL functions (see the query sketch after this list)
Report Generation was breaking
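For instance, the "complex analytical SQL" older MySQL versions lacked usually means window functions (MySQL only added them in 8.0), which a warehouse like BigQuery supports natively. A minimal sketch, assuming a hypothetical analytics.orders table and GCP credentials already configured:

```python
from google.cloud import bigquery

client = bigquery.Client()  # picks up the configured GCP project/credentials

# Running total per user: a window function, i.e. the kind of analytical SQL
# that pushed reporting off the transactional MySQL instance.
query = """
SELECT user_id, order_id, amount,
       SUM(amount) OVER (PARTITION BY user_id ORDER BY created_at) AS running_total
FROM analytics.orders
"""
for row in client.query(query).result():
    print(row.user_id, row.order_id, row.running_total)
```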
Ready for batch if
you have a monolithic architecture
Exceeding DB capacity
Stage 1: Batch
Requirements
You have a monolithic architecture
Data Engineering is your part-time job
Queries are timing out
Exceeding DB capacity
Need Complex analytical SQL functions
Need reports, charts, and business intelligence
Problems
Large number of Airflow jobs needed for loading all tables
Missing and inaccurate create_time & modify_time
DBA ops impacting the pipeline (schemas)
Hard deletes weren't propagating (batch loads; see the sketch after this list)
MySQL replication latency was causing data quality issues
Periodic loads caused occasional MySQL timeouts
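A sketch of why several of these problems show up, assuming a hypothetical orders table with a modify_time column and a DB-API connection to the MySQL replica: the batch job only sees rows whose modify_time moved past the watermark.

```python
import datetime

INCREMENTAL_QUERY = """
SELECT * FROM orders
WHERE modify_time >= %(watermark)s
"""

def load_increment(conn, watermark: datetime.datetime):
    """Pull rows changed since the last run from the MySQL replica."""
    with conn.cursor() as cur:
        cur.execute(INCREMENTAL_QUERY, {"watermark": watermark})
        rows = cur.fetchall()
    # Rows with a missing or inaccurate modify_time never match the predicate,
    # hard-deleted rows simply stop appearing (so the warehouse keeps them),
    # and replica lag can hide recent changes from this read entirely.
    return rows
```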
Stage 2: Realtime
Requirements
Loads are taking too long
Pipeline is no longer stable
Many complicated workflows
Data latency is becoming an issue
Data engineering is your full-time job
You already have Apache Kafka in your organization
Problems
Pipeline for Datastore was still on Airflow
No pipeline at all for Cassandra or Bigtable
BigQuery needed logging data
Elasticsearch needed data (see the consumer sketch after this list)
GraphDB needed data
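A minimal sketch of the realtime side, assuming a hypothetical CDC topic name, a local broker, and kafka-python installed; the exact event envelope depends on how the source connector and converters are configured.

```python
import json

from kafka import KafkaConsumer  # kafka-python

consumer = KafkaConsumer(
    "mysql.shop.orders",                      # hypothetical change-data-capture topic
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda v: json.loads(v) if v else None,
    auto_offset_reset="earliest",
)

for message in consumer:
    event = message.value
    if event is None:
        continue  # a tombstone record marks a delete
    # Each change event can be fanned out to BigQuery, Elasticsearch, a graph DB,
    # etc., with seconds of latency instead of a nightly batch window.
    print(event)
```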
Stage 3: Integration
Requirements
You have microservices
You have a diverse database ecosystem
You have many specialized derived data systems
You have a team of data engineers
You have a mature SRE organization
Metcalfe's law
The value of a network increases with the number of nodes, so each new system integrated into the pipeline adds value.
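The usual quantitative form of the law (standard statement, not spelled out in these notes): with n integrated systems there are n(n-1)/2 possible pairwise connections, so

```latex
V \propto \frac{n(n-1)}{2} \approx n^{2}
```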
Problems
Add a new channel to the MySQL replica
Create and configure Kafka topics
Add a new Debezium connector to Kafka Connect (see the sketch after this list)
Create the destination dataset in BigQuery
Add a new KCBQ (Kafka Connect BigQuery) connector to Kafka Connect
Create BigQuery views
Configure data quality checks for new tables
Grant access to BigQuery dataset
Deploy stream processors or workflows
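Two of these steps, the Debezium source and the KCBQ sink, go through the Kafka Connect REST API. A minimal sketch for the source side, with hypothetical hosts, credentials, and table names (the exact Debezium config keys vary by version):

```python
import requests

CONNECT_URL = "http://kafka-connect:8083/connectors"  # Kafka Connect REST API

debezium_source = {
    "name": "mysql-shop-orders",                       # hypothetical connector name
    "config": {
        "connector.class": "io.debezium.connector.mysql.MySqlConnector",
        "database.hostname": "mysql-replica",
        "database.port": "3306",
        "database.user": "cdc",
        "database.password": "********",
        "table.include.list": "shop.orders",
        # ...plus server id, topic prefix and schema-history settings,
        # which differ across Debezium versions
    },
}

resp = requests.post(CONNECT_URL, json=debezium_source, timeout=10)
resp.raise_for_status()
print("created connector:", resp.json()["name"])
```

Stage 4 is about having tooling generate every one of these steps instead of running them by hand.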
Stage 4: Automation
Requirements
Your SREs can't keep up
You're spending a lot of time on manual toil
You don't have time for the fun stuff
Automated Ops
Terraform
Ansible
Helm
Salt
CloudFormation
Chef
Puppet
Spinnaker
Problem
Regulations
Setting up a data catalog
Location
Schema
Ownership
Lineage
Encryption
Versioning
Why are we doing this?
Who gets access to this data?
How long can this data persist?
Is this data allowed in the system?
Which geographies must data be persisted in?
Should columns be masked?
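A sketch of what a catalog entry needs to carry to answer the questions above; all field names here are assumptions, not any specific catalog product's schema.

```python
from dataclasses import dataclass, field


@dataclass
class CatalogEntry:
    name: str                                             # e.g. "bigquery.shop_analytics.orders"
    location: str                                         # physical location / project / region
    schema: dict                                          # column name -> type
    owner: str                                            # owning team or service
    lineage: list = field(default_factory=list)           # upstream datasets
    encrypted: bool = True
    version: int = 1
    retention_days: int = 365                             # how long the data may persist
    allowed_regions: list = field(default_factory=list)   # where it may be stored
    masked_columns: list = field(default_factory=list)    # columns requiring masking
```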
Configure your access
RBAC
IAM
ACL (Access Control List) → Kafka ACLs
Configure your policies
Role-based access controls
Identity and access management
Access control lists
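However the controls are spelled (RBAC, IAM, or ACLs), the policy ultimately reduces to data that tooling can evaluate. A tiny sketch with made-up roles and datasets:

```python
# role -> dataset -> allowed actions (all names are hypothetical)
ROLES = {
    "analyst":       {"shop_analytics": {"read"}},
    "data_engineer": {"shop_analytics": {"read", "write"}, "raw_cdc": {"read", "write"}},
}

def is_allowed(role: str, dataset: str, action: str) -> bool:
    return action in ROLES.get(role, {}).get(dataset, set())

assert is_allowed("analyst", "shop_analytics", "read")
assert not is_allowed("analyst", "raw_cdc", "read")
```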
Automate management
New user access
New data access
Service account access
Temporary access
Unused Access
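A sketch of automating two of the chores above, with in-memory grant records standing in for whatever IAM backend is actually in use: temporary access expires on schedule and long-unused grants get flagged.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical grant records pulled from the IAM backend
grants = [
    {"principal": "alice@example.com", "dataset": "shop_analytics",
     "expires_at": datetime(2024, 1, 31, tzinfo=timezone.utc),
     "last_used_at": datetime(2024, 1, 2, tzinfo=timezone.utc)},
]

now = datetime.now(timezone.utc)
for grant in grants:
    if grant["expires_at"] < now:
        print("revoke (temporary access expired):", grant["principal"], grant["dataset"])
    elif now - grant["last_used_at"] > timedelta(days=90):
        print("flag (unused for 90+ days):", grant["principal"], grant["dataset"])
```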
Detect Violations
Auditing
Data loss prevention
Progress
Users can find the data that they need
Automated data management and ops
Problems
Data engineering still manages configs and deployments
Stage 5: Decentralization
Requirements
You have a fully automated realtime data pipeline
People still come to you to get data loaded
Partial Decentralization
Raw tools are exposed to other engineering teams
Requires Git, YAML, JSON, pull requests, and Terraform commands
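A sketch of that handoff, assuming a made-up spec format and PyYAML: a product team submits a small YAML file by pull request, and data-engineering-owned tooling turns it into topics, connectors, and datasets.

```python
import yaml  # PyYAML

# What a team might submit in a pull request (format is hypothetical)
SPEC = """
pipeline: shop-orders
owner: team-commerce
source:
  type: mysql
  table: shop.orders
destinations:
  - bigquery: shop_analytics.orders
"""

spec = yaml.safe_load(SPEC)
print(f"provisioning pipeline '{spec['pipeline']}' for {spec['owner']}")
# ...from here the tooling would generate the Terraform / Kafka Connect calls
```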
Full Decentralization
Polished tools are exposed to everyone
Security and compliance manage access and policy
Data engineering manages data tooling and infra
Everyone manages data pipelines and data warehouse