[Article Review] The Future of Data Engineering

Takeaways

  1. These architectures, too, start from one small app and evolve gradually.

  2. I expect a roadmap like this will inform the architectures I design in the future.

  3. I'm in awe of the engineers who broke through the realtime barrier.

Index


Data Infra & Data Engineering Introduction

  • Airflow (Python-based workflow scheduler)

  • BigQuery (Data Warehouse)

  • Kafka (Logging)

  1. Data engineers help move & process data. They build:

    1. tools

    2. infra

    3. frameworks

    4. services

    Why? See Aida's article.

    How they built the data warehouse:

    1. Redshift + Airflow

    2. BigQuery + Airflow

  2. Six stages of data pipeline maturity

    1. Stage 0: None

    2. Stage 1: Batch

    3. Stage 2: Realtime

    4. Stage 3: Integration

    5. Stage 4: Automation

    6. Stage 5: Decentralization

Stage 0

Requirements


  1. You have no data warehouse

  2. You have a monolithic architecture

  3. You need a data warehouse up and running

  4. Data engineering isn't your job

Problems


  1. Queries began timing out

  2. Users were impacting each other

  3. MySQL was missing complex analytical SQL functions (e.g., window functions; see the sketch after this list)

  4. Report Generation was breaking

  5. You're ready for batch if:

    1. You have a monolithic architecture

    2. You're exceeding DB capacity
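
Problem 3 refers to analytical SQL that an OLTP MySQL (before 8.0) simply lacked, such as window functions. A minimal sketch of such a query running in a warehouse instead, using the google-cloud-bigquery client (the project and table names are hypothetical):

```python
# Sketch: an analytical query using a window function.
# MySQL before 8.0 had no window functions, which is one reason
# analytics moved off the OLTP database into a warehouse.
from google.cloud import bigquery  # pip install google-cloud-bigquery

client = bigquery.Client()

# `orders` is a hypothetical table; RANK() OVER (...) is the kind of
# analytical function the OLTP database could not run.
query = """
    SELECT
        customer_id,
        order_total,
        RANK() OVER (PARTITION BY customer_id
                     ORDER BY order_total DESC) AS rnk
    FROM `my_project.analytics.orders`
"""
for row in client.query(query).result():
    print(row.customer_id, row.order_total, row.rnk)
```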

Stage 1: Batch

Requirements


  1. You have a monolithic architecture

  2. Data Engineering is your part-time job

  3. Queries are timing out

  4. Exceeding DB capacity

  5. Need Complex analytical SQL functions

  6. Need reports, charts, and business intelligence

Problems


  1. A large number of Airflow jobs was needed to load all the tables

  2. Missing or inaccurate create_time & modify_time columns

  3. DBA operations (schema changes) were impacting the pipeline

  4. Hard deletes weren't propagating (batch loads only see surviving rows; see the DAG sketch after this list)

  5. MySQL replication latency was causing data quality issues

  6. Periodic loads caused occasional MySQL timeouts
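
A minimal sketch of this stage, assuming Airflow 2.x and a hypothetical MySQL-to-BigQuery load (all names are illustrative). Incremental extracts keyed on modify_time are also why hard deletes don't propagate: a deleted row simply stops appearing in the extract.

```python
# Sketch: one periodic batch-load DAG per source table (Stage 1 style).
# Incremental extracts keyed on modify_time never see hard-deleted rows,
# and each scheduled run puts load on the source MySQL.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def load_orders(**context):
    # Hypothetical load: SELECT rows whose modify_time is newer than the
    # last run from the MySQL replica, then append them to the warehouse.
    ...


with DAG(
    dag_id="load_orders_mysql_to_bigquery",  # one of many such DAGs
    start_date=datetime(2024, 1, 1),
    schedule="@hourly",  # every tick queries the MySQL replica
    catchup=False,
):
    PythonOperator(task_id="load_orders", python_callable=load_orders)
```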

Stage 2: Realtime


Requirements


  1. Loads are taking too long

  2. Pipeline is no longer stable

  3. Many complicated workflows

  4. Data latency is becoming an issue

  5. Data engineering is your full-time job

  6. You already have Apache Kafka in your organization

Problems


  1. Pipeline for Datastore was still on Airflow

  2. No pipeline at all for Cassandra or Bigtable

  3. BigQuery needed logging data

  4. Elasticsearch needed data

  5. GraphDB needed data (one streaming consumer per sink; see the sketch below)
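
A minimal sketch of the realtime pattern, assuming change events already flow through Kafka (via Debezium or similar); the broker, topic, and table names are hypothetical:

```python
# Sketch: consume CDC events from Kafka and stream them into BigQuery.
# One consumer like this per sink (BigQuery, Elasticsearch, GraphDB, ...)
# replaces the periodic batch load.
import json

from confluent_kafka import Consumer
from google.cloud import bigquery

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",  # hypothetical broker
    "group.id": "bigquery-sink",
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["mysql.shop.orders"])   # hypothetical CDC topic

bq = bigquery.Client()
TABLE = "my_project.analytics.orders"       # hypothetical destination

while True:
    msg = consumer.poll(1.0)
    if msg is None or msg.error():
        continue
    row = json.loads(msg.value())           # one change event -> one row
    errors = bq.insert_rows_json(TABLE, [row])
    if errors:
        print("insert failed:", errors)
```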

Stage 3: Integration


Requirements


  1. You have microservices

  2. You have a diverse database ecosystem

  3. You have many specialized derived data systems

  4. You have a team of data engineers

  5. You have a mature SRE organization

Metcalfe's law

The value of a network increases with the number of connected nodes.
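
Concretely, with n systems the number of possible pairwise connections grows quadratically, which is why each system added to a shared pipeline multiplies its value (and, conversely, why wiring every pair by hand stops scaling):

```latex
% Metcalfe's law: network value scales with the number of node pairs.
V(n) \propto \binom{n}{2} = \frac{n(n-1)}{2} \approx \frac{n^2}{2}
```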

Problems


  1. Add a new channel to the MySQL replica

  2. Create and configure Kafka topics

  3. Add a new Debezium connector to Kafka Connect (see the sketch after this list)

  4. Create the destination dataset in BigQuery

  5. Add a new KCBQ (Kafka Connect BigQuery) connector to Kafka Connect

  6. Create BigQuery views

  7. Configure data quality checks for new tables

  8. Grant access to BigQuery dataset

  9. Deploy stream processors or workflows
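
Most of this checklist is config plus API calls. A minimal sketch of step 3, registering a Debezium MySQL connector through the Kafka Connect REST API (hosts, credentials, and names are hypothetical, and the config is abridged):

```python
# Sketch: register a Debezium MySQL source connector via the
# Kafka Connect REST API (step 3 of the checklist above).
import requests  # pip install requests

connector = {
    "name": "orders-mysql-source",             # hypothetical connector
    "config": {
        "connector.class": "io.debezium.connector.mysql.MySqlConnector",
        "database.hostname": "mysql-replica",  # hypothetical host
        "database.port": "3306",
        "database.user": "debezium",
        "database.password": "secret",
        "database.server.id": "184054",
        "topic.prefix": "mysql.shop",
        "table.include.list": "shop.orders",
        # (abridged; real deployments also need schema-history settings)
    },
}

resp = requests.post(
    "http://kafka-connect:8083/connectors",    # hypothetical Connect host
    json=connector,
    timeout=10,
)
resp.raise_for_status()
print(resp.json())
```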

Stage 4: Automation


Requirements


  1. Your SREs can't keep up

  2. You're spending a lot of time on manual toil

  3. You don't have time for the fun stuff

Automated Ops


  1. Terraform

  2. Ansible

  3. Helm

  4. Salt

  5. CloudFormation

  6. Chef

  7. Puppet

  8. Spinnaker
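
All of these are declarative infrastructure-as-code tools. A rough Python sketch of the same declared-state idea, reconciling Kafka topics against a desired list (broker and topic names are hypothetical):

```python
# Sketch: idempotent, declared-state ops -- create only the Kafka topics
# that are missing, the same reconcile loop Terraform et al. automate.
from confluent_kafka.admin import AdminClient, NewTopic

DESIRED_TOPICS = ["mysql.shop.orders", "mysql.shop.users"]  # hypothetical

admin = AdminClient({"bootstrap.servers": "localhost:9092"})
existing = set(admin.list_topics(timeout=10).topics)
missing = [t for t in DESIRED_TOPICS if t not in existing]

if missing:
    futures = admin.create_topics(
        [NewTopic(t, num_partitions=6, replication_factor=3)
         for t in missing]
    )
    for topic, future in futures.items():
        future.result()  # raises if creation failed
        print("created", topic)
else:
    print("already in desired state")
```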

Problem


  1. Regulations

Setting up a data catalog


  1. Location

  2. Schema

  3. Ownership

  4. Lineage

  5. Encryption

  6. Versioning
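
A minimal sketch of what a single catalog entry might carry, one field per item above; the structure is hypothetical rather than any particular catalog product:

```python
# Sketch: one data-catalog entry carrying the metadata listed above.
from dataclasses import dataclass, field


@dataclass
class CatalogEntry:
    location: str        # e.g. "bigquery://my_project.analytics.orders"
    schema: dict         # column name -> type
    owner: str           # owning team or service account
    lineage: list = field(default_factory=list)  # upstream datasets
    encrypted: bool = True                       # encryption at rest
    version: int = 1                             # schema version


entry = CatalogEntry(
    location="bigquery://my_project.analytics.orders",
    schema={"order_id": "INT64", "customer_id": "INT64",
            "total": "NUMERIC"},
    owner="data-eng@example.com",
    lineage=["mysql://shop/orders"],
)
```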

Why are we doing this?

  1. Who gets access to this data?

  2. How long can this data persist?

  3. Is this data allowed in the system?

  4. Which geographies must data be persisted in?

  5. Should columns be masked?

Configure your access


  1. RBAC

  2. IAM

  3. ACL (Access Control List) → Kafka Airbag

Configure your policies


  1. Role-based access controls

  2. Identity access management

  3. Access control lists

Automate management


  1. New user access

  2. New data access

  3. Service account access

  4. Temporary access

  5. Unused Access
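
A minimal sketch of automating the first item, granting a new user read access to a BigQuery dataset with the google-cloud-bigquery client (project, dataset, and email are hypothetical):

```python
# Sketch: automate "new user access" -- grant dataset-level read access
# in BigQuery instead of handling each request by hand.
from google.cloud import bigquery

client = bigquery.Client()
dataset = client.get_dataset("my_project.analytics")  # hypothetical

entries = list(dataset.access_entries)
entries.append(
    bigquery.AccessEntry(
        role="READER",
        entity_type="userByEmail",
        entity_id="new.analyst@example.com",          # hypothetical user
    )
)
dataset.access_entries = entries
client.update_dataset(dataset, ["access_entries"])
```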

Detect Violations


  1. Auditing

  2. Data loss prevention

Progress


Users can find the data that they need

Automated data management and ops

Problems

  1. Data engineering still manages configs and deployments

Stage 5: Decentralization


Requirements


  1. You have a fully automated realtime data pipeline

  2. People still come to you to get data loaded

Partial Decentralization


  1. Raw tools are exposed to other engineering teams

  2. Requires Git, YAML, JSON, pull requests, and Terraform commands (a hypothetical spec is sketched below)
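
A hypothetical example of that raw interface: a team opens a pull request adding a YAML spec, and data-engineering tooling turns it into topics, connectors, and datasets. The spec format below is illustrative only:

```python
# Sketch: a PR-driven pipeline spec (partial decentralization).
# Teams commit YAML like this; tooling applies it with Terraform etc.
import yaml  # pip install pyyaml

SPEC = """
pipeline: orders-to-bigquery     # hypothetical spec format
source:
  type: mysql
  database: shop
  table: orders
sink:
  type: bigquery
  dataset: analytics
owner: checkout-team@example.com
"""

spec = yaml.safe_load(SPEC)
print(spec["pipeline"], "->", spec["sink"]["dataset"])
```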

Full Decentralization


  1. Polished tools are exposed to everyone

  2. Security and compliance manage access and policy

  3. Data engineering manages data tooling and infra

  4. Everyone manages data pipelines and data warehouse
