Data processing guide
What data processing means for DevOps and SRE
You process data whenever you parse application logs, aggregate metrics, transform an API export, or prepare a file for import. The goal is reliable extract → transform → load work: pull data from a source, shape it correctly, and put it where downstream systems can use it. Mistakes show up as broken dashboards, failed imports, or silent data corruption — so validation and reproducible pipelines matter as much as one-off shell one-liners.
Common data formats
Formats trade human readability, schema enforcement, size, and tooling support. SadServers scenarios emphasize text formats you can inspect and fix on a Linux VM; large-scale analytics stacks add binary columnar formats.
CSV
Comma-separated values — rows and columns in plain text. Ubiquitous for spreadsheets, database exports, and vendor dumps. Weaknesses: ambiguous quoting, embedded commas/newlines, no nested structure, encoding issues (UTF-8 vs Latin-1). Always check headers, delimiter, and whether the first row is data or column names.
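The checks above can be sketched with Python's standard csv module. This is a minimal illustration, not a full validator: the sample export, the expected column names, and the read_rows helper are all hypothetical.

```python
import csv
import io

# Hypothetical vendor export: one quoted field contains a comma,
# another contains an embedded newline.
RAW = (
    "id,name,notes\n"
    '1,"Smith, Jane",ok\n'
    '2,Bob,"line one\nline two"\n'
)

EXPECTED_HEADER = ["id", "name", "notes"]  # assumed column names


def read_rows(text, expected_header):
    """Parse CSV text and confirm the first row is the header we expect."""
    # newline="" leaves line endings untouched so the csv module can
    # handle newlines inside quoted fields itself.
    reader = csv.reader(io.StringIO(text, newline=""))
    header = next(reader)
    if header != expected_header:
        raise ValueError(f"unexpected header: {header}")
    rows = list(reader)
    # Every record should have as many fields as the header.
    for number, row in enumerate(rows, start=1):
        if len(row) != len(header):
            raise ValueError(f"record {number} has {len(row)} fields")
    return rows


rows = read_rows(RAW, EXPECTED_HEADER)
print(rows)
```

Splitting on commas or counting lines by hand would miscount both sample records, which is why a real CSV parser is usually the safer choice. For files on disk, opening with an explicit encoding (for example encoding="utf-8") makes a UTF-8 vs Latin-1 mismatch fail loudly instead of silently.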
JSON
JavaScript Object Notation — the lingua franca of modern APIs and DevOps tooling. Config files, REST responses, Kubernetes manifests, CI artifacts, and log shippers often emit JSON. Supports nesting, arrays, and typed values (string, number, boolean, null, object, array).
JSON in DevOps and APIs: you will constantly curl an endpoint and pipe through jq; parse kubectl get ... -o json; read structured logs from Fluent Bit or CloudWatch; validate webhook payloads; and diff deployment configs. Learn paths (.items[].metadata.name), filtering (select()), and how to handle missing keys vs empty arrays. Invalid JSON (trailing commas, single quotes, NaN) is a top cause of pipeline failures — validate early.
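As a rough sketch of "validate early", the snippet below uses Python's standard json module. The helper names (parse_strict, item_names, is_valid) and the sample payload are hypothetical. One detail worth knowing: json.loads accepts NaN and Infinity by default, so the sketch rejects them through the parse_constant hook.

```python
import json


def reject_constant(name):
    # json.loads calls this hook for NaN, Infinity and -Infinity, which
    # the JSON specification does not allow but Python accepts by default.
    raise ValueError(f"non-standard JSON constant: {name}")


def parse_strict(text):
    """Return the parsed document, or raise ValueError if it is not valid JSON."""
    return json.loads(text, parse_constant=reject_constant)


def item_names(doc):
    """Collect metadata.name from each item, tolerating missing keys."""
    names = []
    # A missing "items" key and an empty list both yield no names here;
    # test `"items" in doc` if the caller needs to tell them apart.
    for item in doc.get("items", []):
        name = item.get("metadata", {}).get("name")
        if name is not None:
            names.append(name)
    return names


def is_valid(text):
    try:
        parse_strict(text)
        return True
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return False


# Hypothetical payload shaped like `kubectl get ... -o json` output.
payload = '{"items": [{"metadata": {"name": "web-1"}}, {"metadata": {}}]}'
print(item_names(parse_strict(payload)))
print(is_valid('{"a": 1,}'), is_valid("{'a': 1}"), is_valid('{"a": NaN}'))
```

On the shell, jq empty plays a similar gatekeeping role before the rest of a pipeline runs.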
Parquet
Columnar binary format for analytics. Excellent compression and fast scans of selected columns — standard in Spark, DuckDB, BigQuery, and data lakes. Not human-readable on the shell; use parquet-tools, DuckDB, or Python (pandas/pyarrow) to inspect. Schema is embedded in the file.
Avro
Row-based binary format with a compact schema (often in a schema registry). Common in Kafka pipelines and Hadoop ecosystems. Good schema evolution rules — important when producers and consumers upgrade independently. Less common in ad-hoc shell work than JSON or CSV.
Protocol Buffers (Protobuf)
IDL + binary serialization — define messages in .proto files, generate code, serialize efficiently on the wire. Used heavily in gRPC microservices and some observability stacks. Not something you hand-edit; you decode with generated stubs or reflection tools. Strong typing and backward-compatible field numbering when schemas evolve carefully.
Format comparison (quick reference)
- Human-readable on VM: CSV, JSON — yes; Parquet, Avro, Protobuf — no
- Nested data: JSON, Avro, Protobuf — yes; CSV — poor (flatten or JSON column)
- Analytics at scale: Parquet — best; CSV/JSON — workable but heavy
- API / DevOps default: JSON (and YAML for config, parsed similarly)
Manipulating files on the command line
SadServers scenarios train practical skills you use daily:
- Filter — keep rows/objects matching a condition (jq 'select(...)', awk, grep, csvgrep)
- Extract — pull one field or column (jq '.field', cut, csvcut)
- Validate — confirm structure (jq empty, JSON Schema, row counts, header checks)
- Transform — map, flatten, aggregate (jq, mlr, SQL via SQLite/DuckDB)
- Join / merge — combine datasets on keys (SQL JOIN, mlr join)
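The skills in the list above often combine in a single step. The following is a minimal sketch that takes the "SQL via SQLite" route using Python's standard sqlite3 and csv modules. The two sample datasets, the table and column names, and the load_csv helper are hypothetical, and the helper trusts the header names it is given.

```python
import csv
import io
import sqlite3

# Hypothetical inputs: a request log and a lookup table of service owners.
REQUESTS_CSV = (
    "service,status,latency_ms\n"
    "api,200,12\n"
    "api,500,340\n"
    "web,200,25\n"
    "web,200,31\n"
)
OWNERS_CSV = "service,team\napi,platform\nweb,frontend\n"


def load_csv(conn, table, text):
    """Create a table from CSV text; every column is stored as TEXT."""
    reader = csv.reader(io.StringIO(text, newline=""))
    header = next(reader)
    columns = ", ".join(f'"{name}" TEXT' for name in header)
    placeholders = ", ".join("?" for _ in header)
    conn.execute(f'CREATE TABLE "{table}" ({columns})')
    conn.executemany(f'INSERT INTO "{table}" VALUES ({placeholders})', reader)


conn = sqlite3.connect(":memory:")
load_csv(conn, "requests", REQUESTS_CSV)
load_csv(conn, "owners", OWNERS_CSV)

# Filter (WHERE), extract (selected columns), transform (CAST and
# aggregate), and join (on the service key) in one query.
query = """
    SELECT o.team,
           COUNT(*) AS ok_requests,
           AVG(CAST(r.latency_ms AS INTEGER)) AS avg_latency_ms
    FROM requests AS r
    JOIN owners AS o ON o.service = r.service
    WHERE r.status = '200'
    GROUP BY o.team
    ORDER BY o.team
"""
summary = conn.execute(query).fetchall()
print(summary)
```

Loading everything as TEXT and casting in the query keeps the loader simple; a stricter pipeline would type the columns at load time.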
Also practice related tags: json, csv, and sql scenarios on SadServers.
ETL pipelines
ETL (Extract, Transform, Load) is a core data architecture pattern:
- Extract — pull from sources: databases, APIs, log archives, S3 buckets, message queues
- Transform — clean, deduplicate, type-cast, join, aggregate, apply business rules
- Load — write to a warehouse, database, dashboard, or downstream file
ELT is a variant: load raw data first into a warehouse (e.g. Snowflake, BigQuery), then transform with SQL inside the platform. Same idea — separate ingestion, transformation, and serving — different execution order.
Good pipelines are idempotent (re-running does not duplicate data), partitioned (by date or tenant for incremental runs), and monitored (row counts, null rates, SLA alerts).
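One common way to get idempotent, partitioned loads is to replace a whole partition inside a single transaction. The sketch below assumes a hypothetical daily_counts table keyed by day and service, and uses SQLite only because it is easy to run on a VM; the same pattern applies to other databases.

```python
import sqlite3


def load_daily_counts(conn, day, counts):
    """Load one day's partition; re-running the same day replaces it."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS daily_counts ("
        "day TEXT, service TEXT, requests INTEGER, "
        "PRIMARY KEY (day, service))"
    )
    with conn:  # one transaction: the delete and inserts commit together
        conn.execute("DELETE FROM daily_counts WHERE day = ?", (day,))
        conn.executemany(
            "INSERT INTO daily_counts (day, service, requests) VALUES (?, ?, ?)",
            [(day, service, n) for service, n in counts.items()],
        )
    # A basic monitoring signal: how many rows the partition now holds.
    return conn.execute(
        "SELECT COUNT(*) FROM daily_counts WHERE day = ?", (day,)
    ).fetchone()[0]


conn = sqlite3.connect(":memory:")
first = load_daily_counts(conn, "2024-06-01", {"api": 120, "web": 80})
second = load_daily_counts(conn, "2024-06-01", {"api": 120, "web": 80})
total = conn.execute("SELECT COUNT(*) FROM daily_counts").fetchone()[0]
print(first, second, total)
```

Running the load twice leaves the table with the same two rows, and the returned row count is the kind of number a monitoring check can compare against expectations.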
Pipeline tooling
Tools orchestrate when and how jobs run; they do not replace understanding your data.
- Apache Airflow — DAG-based workflow scheduler; Python tasks; huge ecosystem; self-hosted or run as a managed service
- Apache Spark — distributed batch and streaming engine for large datasets; Scala, Java, Python, R, and SQL APIs
- dbt — SQL-centric transforms in the warehouse; versioned models and tests
- Prefect / Dagster — modern Python orchestrators with strong observability
- Apache Flink / Kafka Streams — stream processing on unbounded data
- Apache Beam — unified batch/stream API; runners include Spark and Flink
- DuckDB / SQLite — single-node SQL on files — ideal for VM-scale work and SadServers-style tasks
Data lake
A data lake is storage (usually object storage like S3 or GCS) holding raw and processed datasets in open formats — often Parquet or JSON — organized by paths such as s3://bucket/raw/events/year=2024/month=06/. Unlike a rigid warehouse schema, lakes accept diverse sources first; governance and schema discipline are applied in transformation layers (Spark, dbt, Trino/Presto). A data warehouse is more curated and SQL-optimized; many organizations use both: lake for raw/archive, warehouse for analytics.
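The key=value path layout shown above (often called Hive-style partitioning) is easy to build and parse by hand. This is a small sketch; the function names and the raw/events prefix are illustrative, and the bucket itself is left out.

```python
from datetime import date
from pathlib import PurePosixPath


def partition_prefix(base, dataset, day):
    """Build a key prefix such as raw/events/year=2024/month=06/."""
    path = PurePosixPath(base, dataset, f"year={day:%Y}", f"month={day:%m}")
    return str(path) + "/"


def parse_partition(prefix):
    """Recover the key=value partition columns from a prefix."""
    segments = prefix.strip("/").split("/")
    pairs = (segment.split("=", 1) for segment in segments if "=" in segment)
    return {key: value for key, value in pairs}


prefix = partition_prefix("raw", "events", date(2024, 6, 15))
print(prefix)
print(parse_partition(prefix))
```

Zero-padding the month keeps prefixes sortable as plain strings, which helps when listing objects or selecting a date range.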
Batch processing vs streaming
Batch processing handles finite chunks on a schedule — hourly CSV imports, nightly Spark jobs, Airflow DAGs at 2 AM. Higher latency, simpler reasoning, easier replay. Fits reporting, billing, and historical backfills.
Stream processing handles continuous events — Kafka topics, click streams, metrics, logs. Lower latency, more complex (late data, ordering, state). Tools: Flink, Kafka Streams, Spark Structured Streaming, ksqlDB. Use when dashboards or alerts must reflect events within seconds or minutes.
Lambda architecture (batch + speed layers) and kappa architecture (stream-only, reprocess from log) are design patterns for combining both; many teams run batch for correctness and streaming for freshness on the same event log.
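The difference between the two models, including the late-data problem, can be sketched in plain Python. This is a toy model rather than how Flink or Kafka Streams work internally: events are bare integer timestamps in seconds, the 60-second window size is an assumption, and "late" simply means the event's window was already emitted.

```python
from collections import Counter

WINDOW_SECONDS = 60  # assumed tumbling-window size


def window_start(timestamp):
    return timestamp - timestamp % WINDOW_SECONDS


def batch_counts(events):
    """Batch: the whole finite input is available, so arrival order is irrelevant."""
    return dict(Counter(window_start(ts) for ts in events))


def stream_counts(events, allowed_lateness=0):
    """Stream: consume events one at a time and emit a window once the
    largest timestamp seen has passed its end plus the allowed lateness.
    Events for windows that were already emitted are set aside as late."""
    open_windows = Counter()
    emitted = {}
    late = []
    max_seen = None
    for ts in events:
        start = window_start(ts)
        if start in emitted:
            late.append(ts)
            continue
        open_windows[start] += 1
        max_seen = ts if max_seen is None else max(max_seen, ts)
        for w in sorted(open_windows):
            if w + WINDOW_SECONDS + allowed_lateness <= max_seen:
                emitted[w] = open_windows.pop(w)
    emitted.update(open_windows)  # flush whatever is still open at end of input
    return emitted, late


# Out-of-order arrival: 50 and 58 belong to the first window but arrive late.
events = [5, 20, 65, 50, 130, 58]
print(batch_counts(events))
print(stream_counts(events))
print(stream_counts(events, allowed_lateness=30))
```

With no allowed lateness the streaming result undercounts the first window compared with the batch result; waiting longer recovers one of the two late events at the cost of higher latency. That trade-off is one reason replaying the log in batch is often treated as the source of truth.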
Other fundamental concepts
- Schema — column names, types, nullability; enforce early to avoid garbage in
- Schema evolution — adding fields without breaking consumers (Avro, Protobuf excel here)
- Data quality — uniqueness, ranges, referential checks; fail the job or quarantine bad rows
- Partitioning — split by date/region so jobs process increments, not full history
- Columnar vs row storage — columnar (Parquet) for analytics scans; row (CSV, Avro) for record-at-a-time reads and writes
- Compression — gzip, snappy, zstd — trade CPU for disk and network
- Lineage — track which job produced which table (debugging and compliance)
- Exactly-once vs at-least-once — delivery semantics in distributed pipelines
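Several of these concepts (schema, data quality, quarantining bad rows) fit in a few lines of Python. The sketch below is illustrative only: the column names, the 0 to 60000 ms latency range, and the sample rows are assumptions, and rows are plain dicts of strings as csv.DictReader would produce.

```python
REQUIRED_COLUMNS = ("id", "region", "latency_ms")  # assumed schema


def check_row(row, seen_ids):
    """Return a list of problems for one row; an empty list means it passed."""
    problems = []
    # Schema: required columns must be present and non-empty.
    for column in REQUIRED_COLUMNS:
        if row.get(column) in (None, ""):
            problems.append(f"missing {column}")
    # Uniqueness on the key column.
    if row.get("id") in seen_ids:
        problems.append("duplicate id")
    # Type and range check on latency_ms.
    raw = row.get("latency_ms")
    if raw not in (None, ""):
        try:
            latency = int(raw)
        except ValueError:
            problems.append("latency_ms is not an integer")
        else:
            if not 0 <= latency <= 60_000:
                problems.append("latency_ms out of range")
    return problems


def split_rows(rows):
    """Separate rows that pass all checks from rows to quarantine."""
    good, quarantined, seen_ids = [], [], set()
    for row in rows:
        problems = check_row(row, seen_ids)
        if problems:
            quarantined.append((row, problems))
        else:
            seen_ids.add(row["id"])
            good.append(row)
    return good, quarantined


rows = [
    {"id": "1", "region": "eu", "latency_ms": "12"},
    {"id": "1", "region": "us", "latency_ms": "30"},   # duplicate id
    {"id": "2", "region": "", "latency_ms": "abc"},    # two problems
    {"id": "3", "region": "us", "latency_ms": "-5"},   # out of range
    {"id": "4", "region": "us", "latency_ms": "40"},
]
good, quarantined = split_rows(rows)
print([row["id"] for row in good])
print([problems for _, problems in quarantined])
```

Whether to quarantine bad rows or fail the whole job is a policy decision; quarantining keeps the pipeline moving but needs someone to review what was set aside.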
Learning resources
- jq manual — jqlang.org/manual
- Miller (mlr) — miller.readthedocs.io (CSV/JSON/tabular processing)
- Apache Airflow — airflow.apache.org/docs
- Apache Spark — spark.apache.org/docs
- DuckDB — duckdb.org/docs (SQL on files)
- Parquet format — parquet.apache.org
Practice scenarios
Hands-on Data Processing scenarios on live Linux VMs: data processing