Data processing guide

What data processing means for DevOps and SRE

You process data whenever you parse application logs, aggregate metrics, transform an API export, or prepare a file for import. The goal is reliable extract → transform → load work: pull data from a source, shape it correctly, and put it where downstream systems can use it. Mistakes show up as broken dashboards, failed imports, or silent data corruption — so validation and reproducible pipelines matter as much as one-off shell one-liners.

Common data formats

Formats trade human readability, schema enforcement, size, and tooling support. SadServers scenarios emphasize text formats you can inspect and fix on a Linux VM; large-scale analytics stacks add binary columnar formats.

CSV

Comma-separated values — rows and columns in plain text. Ubiquitous for spreadsheets, database exports, and vendor dumps. Weaknesses: ambiguous quoting, embedded commas/newlines, no nested structure, encoding issues (UTF-8 vs Latin-1). Always check headers, delimiter, and whether the first row is data or column names.
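
As a minimal sketch of those checks, Python's standard csv module parses quoted fields (including embedded commas and newlines) that a naive split on commas would break. The sample data and the expected header are made up for illustration.

```python
import csv
import io

# Hypothetical sample: one quoted field contains a comma, another a newline.
SAMPLE = 'id,name,note\n1,"Smith, Jane","line one\nline two"\n2,Bob,ok\n'

EXPECTED_HEADER = ["id", "name", "note"]  # assumed column names


def read_rows(text, expected_header):
    """Parse CSV text and verify that the first row is the expected header."""
    reader = csv.reader(io.StringIO(text, newline=""))
    header = next(reader)
    if header != expected_header:
        raise ValueError(f"unexpected header: {header!r}")
    rows = list(reader)
    # Every data row should have as many fields as the header.
    bad = [r for r in rows if len(r) != len(header)]
    if bad:
        raise ValueError(f"{len(bad)} row(s) with wrong field count")
    return rows


rows = read_rows(SAMPLE, EXPECTED_HEADER)
print(len(rows))   # number of data rows
print(rows[0][1])  # the field with an embedded comma
```

When reading a real file, open it with newline="" and an explicit encoding (for example encoding="utf-8") so that quoted newlines and non-ASCII characters survive intact.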

JSON

JavaScript Object Notation — the lingua franca of modern APIs and DevOps tooling. Config files, REST responses, Kubernetes manifests, CI artifacts, and log shippers often emit JSON. Supports nesting, arrays, and typed values (string, number, boolean, null, object, array).

JSON in DevOps and APIs: you will constantly curl an endpoint and pipe through jq; parse kubectl get ... -o json; read structured logs from Fluent Bit or CloudWatch; validate webhook payloads; and diff deployment configs. Learn paths (.items[].metadata.name), filtering (select()), and how to handle missing keys vs empty arrays. Invalid JSON (trailing commas, single quotes, NaN) is a top cause of pipeline failures — validate early.
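
The same ideas can be sketched in Python with the standard json module. Note that json.loads accepts NaN and Infinity by default, so this sketch passes a parse_constant hook to reject them. The helper names parse_strict and item_names and the sample document are illustrative assumptions.

```python
import json


def _reject_constant(name):
    # Called by the parser for NaN, Infinity and -Infinity, which are not valid JSON.
    raise ValueError(f"non-standard JSON constant: {name}")


def parse_strict(text):
    """Return the parsed document, or raise ValueError if it is not strict JSON."""
    return json.loads(text, parse_constant=_reject_constant)


def item_names(doc):
    """Similar in spirit to the jq path .items[].metadata.name, skipping items without a name."""
    names = []
    for item in doc.get("items") or []:  # missing key, null, and [] all yield no items
        name = (item.get("metadata") or {}).get("name")
        if name is not None:
            names.append(name)
    return names


doc = parse_strict('{"items": [{"metadata": {"name": "web"}}, {"metadata": {}}]}')
print(item_names(doc))

for bad in ('{"a": 1,}', "{'a': 1}", '{"a": NaN}'):
    try:
        parse_strict(bad)
    except ValueError as exc:  # json.JSONDecodeError is a subclass of ValueError
        print("invalid:", exc)
```

Treating a missing key, a null, and an empty array the same way is a design choice; if your pipeline must distinguish "no items key" from "zero items", check for the key explicitly instead.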

Parquet

Columnar binary format for analytics. Excellent compression and fast scans of selected columns — standard in Spark, DuckDB, BigQuery, and data lakes. Not human-readable on the shell; use parquet-tools, DuckDB, or Python (pandas/pyarrow) to inspect. Schema is embedded in the file.

Avro

Row-based binary format whose schema is written in JSON and either travels with the data file or lives in a schema registry. Common in Kafka pipelines and Hadoop ecosystems. Its schema evolution rules matter when producers and consumers upgrade independently. Less common in ad-hoc shell work than JSON or CSV.

Protocol Buffers (Protobuf)

IDL plus binary serialization: define messages in .proto files, generate code, and serialize efficiently on the wire. Used heavily in gRPC microservices and some observability stacks. Not something you hand-edit; you decode with generated stubs or reflection tools. Strongly typed, and schemas stay backward compatible as long as existing field numbers are never changed or reused.

Format comparison (quick reference)

Human-readable on VM: CSV, JSON — yes; Parquet, Avro, Protobuf — no
Nested data: JSON, Avro, Protobuf — yes; CSV — poor (flatten or JSON column)
Analytics at scale: Parquet — best; CSV/JSON — workable but heavy
API / DevOps default: JSON (and YAML for config, parsed similarly)

Manipulating files on the command line

SadServers scenarios train practical skills you use daily:

Filter — keep rows/objects matching a condition (jq 'select(...)', awk, grep, csvgrep)
Extract — pull one field or column (jq '.field', cut, csvcut)
Validate — confirm structure (jq empty, JSON Schema, row counts, header checks)
Transform — map, flatten, aggregate (jq, mlr, SQL via SQLite/DuckDB)
Join / merge — combine datasets on keys (SQL JOIN, mlr join)
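
The operations above can also be sketched in Python when the shell tools are not available. This is an illustration under assumed data: the records, table names, and the 500 threshold are hypothetical, and the join uses an in-memory SQLite database from the standard library.

```python
import sqlite3

# Hypothetical records, as they might look after parsing a JSON log export.
requests = [
    {"host": "web1", "status": 200, "ms": 12},
    {"host": "web1", "status": 500, "ms": 340},
    {"host": "web2", "status": 500, "ms": 95},
]
owners = [("web1", "team-a"), ("web2", "team-b")]

# Filter: keep records matching a condition (like jq 'select(.status >= 500)').
errors = [r for r in requests if r["status"] >= 500]

# Extract: pull a single field (like jq '.host').
error_hosts = [r["host"] for r in errors]

# Transform and join: load both datasets into SQLite and use SQL.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE requests (host TEXT, status INTEGER, ms INTEGER)")
con.execute("CREATE TABLE owners (host TEXT PRIMARY KEY, team TEXT)")
con.executemany("INSERT INTO requests VALUES (:host, :status, :ms)", requests)
con.executemany("INSERT INTO owners VALUES (?, ?)", owners)

per_team = con.execute(
    """
    SELECT o.team, COUNT(*) AS errors, MAX(r.ms) AS slowest_ms
    FROM requests AS r
    JOIN owners AS o ON o.host = r.host
    WHERE r.status >= 500
    GROUP BY o.team
    ORDER BY o.team
    """
).fetchall()

print(error_hosts)
print(per_team)
```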

Also practice the related json, csv, and sql scenario tags on SadServers.

ETL pipelines

ETL (Extract, Transform, Load) is a core data architecture pattern:

Extract — pull from sources: databases, APIs, log archives, S3 buckets, message queues
Transform — clean, deduplicate, type-cast, join, aggregate, apply business rules
Load — write to a warehouse, database, dashboard, or downstream file
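
The three stages can be sketched as three small functions. This is a toy example under stated assumptions: the raw CSV text, the column names, and the JSON Lines output format are all hypothetical, and an in-memory buffer stands in for the real destination.

```python
import csv
import io
import json

# Hypothetical raw export: stray whitespace, a duplicate row, numbers as text.
RAW = "user_id,amount\n 7 ,10.50\n8,3\n 7 ,10.50\n"


def extract(text):
    """Extract: read source rows as dictionaries keyed by the header."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: trim, type-cast, and drop exact duplicates, keeping first-seen order."""
    seen = set()
    out = []
    for row in rows:
        record = (int(row["user_id"].strip()), float(row["amount"]))
        if record not in seen:
            seen.add(record)
            out.append({"user_id": record[0], "amount": record[1]})
    return out


def load(records, sink):
    """Load: write one JSON object per line to a file-like sink."""
    for rec in records:
        sink.write(json.dumps(rec) + "\n")
    return len(records)


sink = io.StringIO()  # stands in for a file, table, or bucket object
loaded = load(transform(extract(RAW)), sink)
print(loaded)
print(sink.getvalue(), end="")
```

Keeping each stage a separate function makes it easy to test the transform on its own and to swap the source or sink later.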

ELT is a variant: load raw data first into a warehouse (e.g. Snowflake, BigQuery), then transform with SQL inside the platform. Same idea — separate ingestion, transformation, and serving — different execution order.

Good pipelines are idempotent (re-running does not duplicate data), partitioned (by date or tenant for incremental runs), and monitored (row counts, null rates, SLA alerts).
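
One common way to get idempotence, sketched here with SQLite, is to replace a whole partition inside a transaction so that a re-run overwrites rather than appends. The table name daily_usage, its columns, and the sample batch are assumptions for illustration.

```python
import sqlite3

con = sqlite3.connect(":memory:")
# The primary key (day, tenant) guards against accidental duplicate rows.
con.execute(
    "CREATE TABLE daily_usage (day TEXT, tenant TEXT, events INTEGER, "
    "PRIMARY KEY (day, tenant))"
)


def load_partition(con, day, rows):
    """Replace one day's partition atomically; rows is a list of (tenant, events)."""
    with con:  # commits on success, rolls back on error
        con.execute("DELETE FROM daily_usage WHERE day = ?", (day,))
        con.executemany(
            "INSERT INTO daily_usage (day, tenant, events) VALUES (?, ?, ?)",
            [(day, tenant, events) for tenant, events in rows],
        )
    # Monitoring hook: return the row count so the caller can alert on anomalies.
    return con.execute(
        "SELECT COUNT(*) FROM daily_usage WHERE day = ?", (day,)
    ).fetchone()[0]


batch = [("acme", 120), ("globex", 45)]  # hypothetical data
first = load_partition(con, "2024-06-01", batch)
second = load_partition(con, "2024-06-01", batch)  # re-run of the same job
total = con.execute("SELECT COUNT(*) FROM daily_usage").fetchone()[0]
print(first, second, total)
```

Running the job twice for the same day leaves the table with the same rows as running it once, which is the property the paragraph above describes.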

Pipeline tooling

Tools orchestrate when and how jobs run; they do not replace understanding your data.

Apache Airflow — DAG-based workflow scheduler; Python tasks; huge ecosystem; self-hosted
Apache Spark — distributed batch and streaming engine for large datasets; Scala/Python API
dbt — SQL-centric transforms in the warehouse; versioned models and tests
Prefect / Dagster — modern Python orchestrators with strong observability
Apache Flink / Kafka Streams — stream processing on unbounded data
Apache Beam — unified batch/stream API; runners include Spark and Flink
DuckDB / SQLite — single-node SQL on files — ideal for VM-scale work and SadServers-style tasks

Data lake

A data lake is storage (usually object storage like S3 or GCS) holding raw and processed datasets in open formats, often Parquet or JSON, organized by paths such as s3://bucket/raw/events/year=2024/month=06/. Unlike a warehouse with a rigid schema, a lake accepts diverse sources first; governance and schema discipline are applied in transformation layers (Spark, dbt, Trino/Presto). A data warehouse is more curated and SQL-optimized; many organizations use both: lake for raw/archive, warehouse for analytics.
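
The key=value path layout can be built and parsed with plain string handling. The sketch below is illustrative: the function names are hypothetical, and the base prefix simply reuses the example path from the paragraph above.

```python
from datetime import date


def partition_prefix(base, dataset, day):
    """Build a key=value partition prefix for one month of a dataset."""
    parts = [base.rstrip("/"), dataset, f"year={day.year:04d}", f"month={day.month:02d}"]
    return "/".join(parts) + "/"


def parse_partition(path):
    """Recover the key=value partition columns from a path as strings."""
    return dict(part.split("=", 1) for part in path.split("/") if "=" in part)


prefix = partition_prefix("s3://bucket/raw", "events", date(2024, 6, 15))
print(prefix)
print(parse_partition(prefix))
```

Zero-padding the month keeps prefixes sorted lexicographically in date order, which helps when listing objects or selecting a date range.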

Batch processing vs streaming

Batch processing handles finite chunks on a schedule — hourly CSV imports, nightly Spark jobs, Airflow DAGs at 2 AM. Higher latency, simpler reasoning, easier replay. Fits reporting, billing, and historical backfills.

Stream processing handles continuous events — Kafka topics, click streams, metrics, logs. Lower latency, more complex (late data, ordering, state). Tools: Flink, Kafka Streams, Spark Structured Streaming, ksqlDB. Use when dashboards or alerts must reflect events within seconds or minutes.
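
To make "late data" concrete, here is a simplified model of event-time tumbling windows with a watermark. It is a teaching sketch, not how Flink or any other engine implements watermarks; the window size, allowed lateness, and sample events are assumed values.

```python
from collections import defaultdict

WINDOW = 60            # window size in seconds (assumed)
ALLOWED_LATENESS = 30  # how far behind the newest event time we still accept (assumed)


def tumbling_counts(events):
    """Count events per event-time window.

    events is an iterable of (event_time_seconds, key) in arrival order.
    Returns (counts, dropped): counts maps (window_start, key) to a count,
    dropped lists events that arrived after their window was closed.
    """
    counts = defaultdict(int)
    dropped = []
    max_seen = 0
    for ts, key in events:
        max_seen = max(max_seen, ts)
        watermark = max_seen - ALLOWED_LATENESS
        window_start = ts - ts % WINDOW
        if window_start + WINDOW <= watermark:
            dropped.append((ts, key))  # too late: its window is already closed
        else:
            counts[(window_start, key)] += 1
    return dict(counts), dropped


# Hypothetical arrival order: the event at t=58 arrives slightly late and is
# still counted; the event at t=5 arrives far too late and is dropped.
arrivals = [(10, "login"), (61, "login"), (58, "login"), (130, "login"), (5, "login")]
counts, dropped = tumbling_counts(arrivals)
print(counts)
print(dropped)
```

A batch job over the same events would simply group by window after all data has arrived, which is why batch is easier to reason about and replay.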

Lambda architecture (batch + speed layers) and kappa architecture (stream-only, reprocess from log) are two design patterns for serving both historical and fresh results; many teams run batch for correctness and streaming for freshness on the same event log.

Other fundamental concepts

Schema — column names, types, nullability; enforce early to avoid garbage in
Schema evolution — adding fields without breaking consumers (Avro, Protobuf excel here)
Data quality — uniqueness, ranges, referential checks; fail the job or quarantine bad rows
Partitioning — split by date/region so jobs process increments, not full history
Columnar vs row storage — columnar (Parquet) for analytics scans; row (CSV, Avro) for record-at-a-time
Compression — gzip, snappy, zstd — trade CPU for disk and network
Lineage — track which job produced which table (debugging and compliance)
Exactly-once vs at-least-once — delivery semantics in distributed pipelines
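
Several of these concepts (schema, nullability, data quality, quarantine) fit in one small sketch. The schema, the non-negative amount rule, and the sample rows below are assumptions chosen for illustration, not a standard.

```python
# Assumed schema: column name -> (type, nullable)
SCHEMA = {"order_id": (int, False), "region": (str, False), "amount": (float, True)}


def check_row(row, seen_ids):
    """Return a list of problems for one row; an empty list means the row is clean."""
    problems = []
    for col, (typ, nullable) in SCHEMA.items():
        value = row.get(col)
        if value is None:
            if not nullable:
                problems.append(f"{col} is null")
        elif not isinstance(value, typ):
            problems.append(f"{col} is not {typ.__name__}")
    amount = row.get("amount")
    if isinstance(amount, float) and amount < 0:
        problems.append("amount out of range")
    if row.get("order_id") in seen_ids:
        problems.append("duplicate order_id")
    return problems


def split_rows(rows):
    """Separate clean rows from quarantined (row, problems) pairs."""
    clean, quarantined, seen_ids = [], [], set()
    for row in rows:
        problems = check_row(row, seen_ids)
        if problems:
            quarantined.append((row, problems))
        else:
            clean.append(row)
            seen_ids.add(row["order_id"])
    return clean, quarantined


rows = [
    {"order_id": 1, "region": "eu", "amount": 9.5},
    {"order_id": 1, "region": "eu", "amount": 9.5},   # duplicate key
    {"order_id": 2, "region": None, "amount": -1.0},  # null and out of range
    {"order_id": 3, "region": "us", "amount": None},  # nullable column, accepted
]
clean, quarantined = split_rows(rows)
print(len(clean), len(quarantined))
```

Whether to quarantine bad rows or fail the whole job depends on the consumer: quarantine keeps the pipeline flowing, while failing fast prevents partial results from being trusted.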

Learning resources

jq manual — jqlang.org/manual
Miller (mlr) — miller.readthedocs.io (CSV/JSON/tabular processing)
Apache Airflow — airflow.apache.org/docs
Apache Spark — spark.apache.org/docs
DuckDB — duckdb.org/docs (SQL on files)
Parquet format — parquet.apache.org

Practice scenarios

Hands-on Data Processing scenarios on live Linux VMs: data processing
