SadServers
  • Scenarios
  • Labs
    All Labs Linux & Bash Web Servers Databases Data Processing Docker Kubernetes CI/CD Infrastructure as Code Tooling / Applications
  • Dashboard
  • Solutions
    For Individuals For Businesses
  • Ranking
  • Newsletter
  • Documentation
    FAQ Support Pro Accounts Pro+ Accounts Business Accounts Gift API CLI/TUI Privacy Troubleshooting Interviews
  • Blog
  • Pricing
  • Gift
    Gift Purchase Gift Redeem
  • About
Log In - Sign Up
  1. Labs
  2. Data Processing
  3. Cheatsheet

Guide

Concepts and learning path

Troubleshooting

Failure modes and fixes

Cheatsheet

Commands to keep handy

Data processing cheatsheet

JSON — jq (DevOps essential)

CommandDescription
jq . file.jsonPretty-print JSON
jq empty file.jsonValidate syntax (exit 0 if valid)
jq '.items[].name' file.jsonExtract nested field
jq 'select(.status == "ok")' file.jsonFilter objects
jq -r '.[].id' file.jsonRaw strings (no quotes)
curl -s api/host | jq '.data'Parse API response
jq -s 'add' *.jsonMerge JSON arrays from files

JSON — flatten and reshape

# Flatten nested object to key paths jq -r 'paths(scalars) as $p | [($p | map(tostring) | join(".")), getpath($p)] | @tsv' file.json # Kubernetes-style: all pod names kubectl get pods -o json | jq -r '.items[].metadata.name'

CSV — inspect and slice

CommandDescription
head -5 file.csvPreview header and rows
wc -l file.csvRow count (includes header)
cut -d, -f1,3 file.csvColumns 1 and 3 (simple CSV)
csvcut -c name,email file.csvExtract columns by name
csvgrep -c status -m active file.csvFilter rows
csvstat file.csvColumn stats and types

CSV / JSON — Miller (mlr)

mlr --csv filter '$status == "active"' data.csv mlr --json put '$total = $qty * $price' orders.json mlr --csv join -l id -r uid then values.csv keys.csv

SQL on files

# SQLite: import CSV then query sqlite3 :memory: <<'EOF' .mode csv .import users.csv users SELECT country, COUNT(*) FROM users GROUP BY country; EOF # DuckDB: query CSV/Parquet directly duckdb -c "SELECT * FROM 'data.csv' WHERE amount > 100" duckdb -c "SELECT * FROM 'lake/events.parquet' LIMIT 10"

Validation quick checks

CommandDescription
jq empty file.jsonJSON syntax valid?
python3 -m json.tool < file.json > /dev/nullAlternate JSON validate
head -1 file.csv | tr ',' '\n' | wc -lColumn count from header
awk -F, 'NF != expected {print NR}' file.csvFind ragged CSV rows
file -bi file.csvGuess encoding/MIME

Binary formats — inspect

CommandDescription
duckdb -c "DESCRIBE SELECT * FROM 'f.parquet'"Parquet schema
parquet-tools meta file.parquetParquet metadata (if installed)
protoc --decode=Msg schema.proto < bin.pbDecode Protobuf message

awk — quick row processing

awk -F, '$3 > 100 {sum += $3} END {print sum}' sales.csv awk -F, 'NR==1 || $2 == "error"' log.csv # header + matching rows

Pipeline vocabulary

TermMeaning
ETLExtract → Transform → Load
BatchScheduled jobs on bounded datasets
StreamingContinuous processing of event streams
Data lakeObject storage + open formats (often Parquet)
IdempotentRe-run safe — no duplicate loads

Pro tips

  • Validate JSON with jq empty before downstream steps — fail fast
  • CSV with commas inside fields must be quoted — prefer csvkit or mlr over naive cut
  • API work: curl -s + jq is the DevOps default; learn -r for bare strings
  • For joins and aggregates on VM-scale data, SQL (SQLite/DuckDB) beats nested shell loops
  • Practice json, csv, and sql SadServers scenarios

Practice scenarios

Hands-on Data Processing scenarios on live Linux VMs: data processing

SadServersSadServers

Real-world Linux and DevOps scenarios for hands-on learning and technical assessment.

Uptime Robot ratio (30 days)
Product
  • Scenarios
  • For Individuals
  • For Businesses
  • Pricing
Resources
  • FAQ
  • Blog
  • Newsletter
Company
  • About Us
  • Support
  • Privacy Policy
  • Terms of Service
  • Contact
Connect With Us
info@sadservers.com

Made in Canada 🇨🇦
Updated: 2026-06-13 16:06 UTC – 2d2950a