Data processing cheatsheet
JSON — jq (DevOps essential)
| Command | Description |
|---|---|
jq . file.json | Pretty-print JSON |
jq empty file.json | Validate syntax (exit 0 if valid) |
jq '.items[].name' file.json | Extract nested field |
jq 'select(.status == "ok")' file.json | Filter objects |
jq -r '.[].id' file.json | Raw strings (no quotes) |
curl -s api/host | jq '.data' | Parse API response |
jq -s 'add' *.json | Merge JSON arrays from files |
JSON — flatten and reshape
# Flatten nested object to key paths
jq -r 'paths(scalars) as $p | [($p | map(tostring) | join(".")), getpath($p)] | @tsv' file.json
# Kubernetes-style: all pod names
kubectl get pods -o json | jq -r '.items[].metadata.name'
CSV — inspect and slice
| Command | Description |
|---|---|
head -5 file.csv | Preview header and rows |
wc -l file.csv | Row count (includes header) |
cut -d, -f1,3 file.csv | Columns 1 and 3 (simple CSV) |
csvcut -c name,email file.csv | Extract columns by name |
csvgrep -c status -m active file.csv | Filter rows |
csvstat file.csv | Column stats and types |
CSV / JSON — Miller (mlr)
mlr --csv filter '$status == "active"' data.csv
mlr --json put '$total = $qty * $price' orders.json
mlr --csv join -l id -r uid then values.csv keys.csv
SQL on files
# SQLite: import CSV then query
sqlite3 :memory: <<'EOF'
.mode csv
.import users.csv users
SELECT country, COUNT(*) FROM users GROUP BY country;
EOF
# DuckDB: query CSV/Parquet directly
duckdb -c "SELECT * FROM 'data.csv' WHERE amount > 100"
duckdb -c "SELECT * FROM 'lake/events.parquet' LIMIT 10"
Validation quick checks
| Command | Description |
|---|---|
jq empty file.json | JSON syntax valid? |
python3 -m json.tool < file.json > /dev/null | Alternate JSON validate |
head -1 file.csv | tr ',' '\n' | wc -l | Column count from header |
awk -F, 'NF != expected {print NR}' file.csv | Find ragged CSV rows |
file -bi file.csv | Guess encoding/MIME |
Binary formats — inspect
| Command | Description |
|---|---|
duckdb -c "DESCRIBE SELECT * FROM 'f.parquet'" | Parquet schema |
parquet-tools meta file.parquet | Parquet metadata (if installed) |
protoc --decode=Msg schema.proto < bin.pb | Decode Protobuf message |
awk — quick row processing
awk -F, '$3 > 100 {sum += $3} END {print sum}' sales.csv
awk -F, 'NR==1 || $2 == "error"' log.csv # header + matching rows
Pipeline vocabulary
| Term | Meaning |
|---|---|
| ETL | Extract → Transform → Load |
| Batch | Scheduled jobs on bounded datasets |
| Streaming | Continuous processing of event streams |
| Data lake | Object storage + open formats (often Parquet) |
| Idempotent | Re-run safe — no duplicate loads |
Pro tips
- Validate JSON with
jq emptybefore downstream steps — fail fast - CSV with commas inside fields must be quoted — prefer
csvkitormlrover naivecut - API work:
curl -s+jqis the DevOps default; learn-rfor bare strings - For joins and aggregates on VM-scale data, SQL (SQLite/DuckDB) beats nested shell loops
- Practice json, csv, and sql SadServers scenarios
Practice scenarios
Hands-on Data Processing scenarios on live Linux VMs: data processing