Data processing troubleshooting

Invalid JSON / parse error from jq

Common causes: trailing commas, single-quoted keys, NaN/Infinity, HTML error page instead of JSON, or truncated download. Validate: jq empty file.json or python3 -m json.tool. Inspect the first bytes: head -c 200 file.json. If from an API, check HTTP status and Content-Type before parsing. Fix upstream or sanitize with a strict parser — do not regex-repair complex JSON.

jq returns null / empty for expected field

Path may be wrong or structure varies between records. Explore with jq 'keys' on one object or jq 'type' on the root. Use optional access: .field?. For arrays, check whether you need .items[] vs .data. API versions change shapes — jq 'paths' helps discover available paths.

CSV columns misaligned / ragged rows

Unquoted commas, embedded newlines, or wrong delimiter (semicolon in EU files). Preview: head file.csv. Find bad rows: awk -F, 'NF != N {print NR, $0}' file.csv (set N to header column count). Use csvstat or mlr --csv check. Re-export with proper quoting from the source, or parse with a real CSV library — not plain split(',').

Encoding / garbled characters in CSV or JSON

UTF-8 vs Latin-1 mismatch shows mojibake (e.g. Ã© instead of é). Check: file -bi file.csv. Convert: iconv -f ISO-8859-1 -t UTF-8 file.csv -o fixed.csv. Ensure export tools and LANG=UTF-8 are consistent. JSON must be UTF-8 per spec.

SQL import or query fails on CSV

SQLite .import needs consistent columns and escaped quotes. Import failures often mean header mismatch or embedded delimiters. Try: .mode csv, explicit table schema, or DuckDB read_csv_auto('file.csv') which infers types. Cast columns explicitly when numbers are stored as strings with currency symbols.

Out of memory on large JSON or CSV

Loading multi-GB files into one jq or Python list exhausts RAM. Stream instead: jq -c '.[]' per line for NDJSON; process CSV with mlr --csv stream or chunked SQL loads. For analytics scale, convert to Parquet and use DuckDB/Spark. On a VM, split files: split -l 100000 big.csv chunk_.

Duplicate or missing rows after ETL

Non-idempotent loads re-insert on retry. Use primary keys and INSERT OR REPLACE / merge semantics. Missing rows: check filter predicates, timezone boundaries on date partitions, and inner joins that drop non-matching keys. Compare row counts at each stage — extract vs transform vs load.

Schema drift between pipeline runs

Producers add columns or change types; consumers break silently. Validate schema at ingest (required fields, type checks). Version API responses. For Avro/Protobuf, use schema registry rules. For CSV/JSON, fail the job when unexpected columns appear or document counts drop sharply.

Batch job slow; streaming alternative?

If latency requirements are minutes/hours, batch is fine — optimize partition pruning, avoid full-table scans, and compress inputs. If sub-minute freshness is required, move ingestion to a log (Kafka) and use stream processing. Many issues labeled “streaming” are fixed by incremental batch (process only where date = yesterday).

Debugging workflow

1. Identify format and sample

file data.*
head -3 data.csv
jq . data.json 2>&1 | head -20

2. Validate structure

jq empty data.json
csvstat data.csv
wc -l data.csv

3. Isolate transform step

# Run each stage with row-count checks
jq 'select(...)' raw.json | wc -l
sqlite3 :memory: ".import --csv t data.csv" "SELECT COUNT(*) FROM t"

Practice scenarios

Hands-on Data Processing scenarios on live Linux VMs: data processing

Cheatsheet →