Data processing troubleshooting
Invalid JSON / parse error from jq
Common causes: trailing commas, single-quoted keys, NaN/Infinity, HTML error
page instead of JSON, or truncated download. Validate:
jq empty file.json or python3 -m json.tool.
Inspect the first bytes: head -c 200 file.json. If from an API,
check HTTP status and Content-Type before parsing. Fix upstream
or sanitize with a strict parser — do not regex-repair complex JSON.
jq returns null / empty for expected field
Path may be wrong or structure varies between records. Explore with
jq 'keys' on one object or jq 'type' on the root.
Use optional access: .field?. For arrays, check whether you need
.items[] vs .data. API versions change shapes —
jq 'paths' helps discover available paths.
CSV columns misaligned / ragged rows
Unquoted commas, embedded newlines, or wrong delimiter (semicolon in EU files).
Preview: head file.csv. Find bad rows:
awk -F, 'NF != N {print NR, $0}' file.csv (set N to header column count).
Use csvstat or mlr --csv check. Re-export with proper
quoting from the source, or parse with a real CSV library — not plain
split(',').
Encoding / garbled characters in CSV or JSON
UTF-8 vs Latin-1 mismatch shows mojibake (e.g. é instead of
é). Check: file -bi file.csv. Convert:
iconv -f ISO-8859-1 -t UTF-8 file.csv -o fixed.csv. Ensure
export tools and LANG=UTF-8 are consistent. JSON must be UTF-8 per spec.
SQL import or query fails on CSV
SQLite .import needs consistent columns and escaped quotes. Import
failures often mean header mismatch or embedded delimiters. Try:
.mode csv, explicit table schema, or DuckDB
read_csv_auto('file.csv') which infers types. Cast columns explicitly
when numbers are stored as strings with currency symbols.
Out of memory on large JSON or CSV
Loading multi-GB files into one jq or Python list exhausts RAM.
Stream instead: jq -c '.[]' per line for NDJSON;
process CSV with mlr --csv stream or chunked SQL loads. For
analytics scale, convert to Parquet and use DuckDB/Spark. On a VM, split files:
split -l 100000 big.csv chunk_.
Duplicate or missing rows after ETL
Non-idempotent loads re-insert on retry. Use primary keys and
INSERT OR REPLACE / merge semantics. Missing rows: check filter
predicates, timezone boundaries on date partitions, and inner joins that drop
non-matching keys. Compare row counts at each stage — extract vs transform vs load.
Schema drift between pipeline runs
Producers add columns or change types; consumers break silently. Validate schema at ingest (required fields, type checks). Version API responses. For Avro/Protobuf, use schema registry rules. For CSV/JSON, fail the job when unexpected columns appear or document counts drop sharply.
Batch job slow; streaming alternative?
If latency requirements are minutes/hours, batch is fine — optimize partition
pruning, avoid full-table scans, and compress inputs. If sub-minute freshness is
required, move ingestion to a log (Kafka) and use stream processing. Many issues
labeled “streaming” are fixed by incremental batch (process only
where date = yesterday).
Debugging workflow
1. Identify format and sample
file data.*
head -3 data.csv
jq . data.json 2>&1 | head -202. Validate structure
jq empty data.json
csvstat data.csv
wc -l data.csv3. Isolate transform step
# Run each stage with row-count checks
jq 'select(...)' raw.json | wc -l
sqlite3 :memory: ".import --csv t data.csv" "SELECT COUNT(*) FROM t"Practice scenarios
Hands-on Data Processing scenarios on live Linux VMs: data processing