ELK stack troubleshooting

Filebeat not shipping logs

Run filebeat test config and filebeat test output first. Check that the input paths match real files and that the account Filebeat runs as (root or the filebeat user) can read them. A corrupt registry can cause lines to be skipped or re-sent; inspect /var/lib/filebeat/registry. Verify that Logstash is listening on 5044: ss -tlnp | grep 5044. A firewall between hosts can drop Beats traffic silently; test from the Filebeat host with telnet logstash-host 5044 or nc -zv logstash-host 5044.
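The permission and connectivity checks above can be sketched as two small shell helpers. The names check_readable and check_port are hypothetical, and the host and paths you pass in are your own; this is an illustration, not part of Filebeat.

```shell
# Report which log files the current user cannot read.
# Run this as the account Filebeat uses, because -r tests the caller's access.
check_readable() {
  rc=0
  for f in "$@"; do
    if [ -r "$f" ]; then
      echo "ok: $f"
    else
      echo "unreadable: $f"
      rc=1
    fi
  done
  return "$rc"
}

# Test TCP reachability of the Beats port (5044 unless a second argument is given).
check_port() {
  nc -zv -w 3 "$1" "${2:-5044}"
}
```

For example, check_readable /var/log/nginx/access.log followed by check_port logstash-host separates a local permission problem from a network one.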

Logstash pipeline not starting

Run /usr/share/logstash/bin/logstash --path.settings /etc/logstash -t to check the pipeline config for syntax errors. A grok pattern mismatch does not stop the pipeline; it leaves the message field unparsed and tags the event with _grokparsefailure. A JVM heap that is too small causes out-of-memory errors; tune -Xms/-Xmx in /etc/logstash/jvm.options. Check journalctl -u logstash -e for plugin errors or permission errors on pipeline config files.
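To see which heap flags are actually in effect, it can help to print only the uncommented -Xms/-Xmx lines. The sketch below assumes the default package path from the text; heap_settings is a hypothetical helper name.

```shell
# Print the active (uncommented) heap flags from a jvm.options file.
# Defaults to the Logstash package location; pass another path to override.
heap_settings() {
  grep -E '^[[:space:]]*-Xm[sx]' "${1:-/etc/logstash/jvm.options}"
}
```

If -Xms and -Xmx differ, or no line is printed at all, the heap is probably not set the way you expect. Grok failures can be counted once events reach Elasticsearch, for instance with curl -s 'localhost:9200/logs-*/_count?q=tags:_grokparsefailure'.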

Elasticsearch cluster red or yellow

Run curl 'localhost:9200/_cluster/health?pretty'. Red means at least one primary shard is unassigned; investigate immediately. Yellow on a single node is expected, because a replica shard cannot be assigned to the same node as its primary. In production, fix yellow by adding nodes; on a single-node dev cluster, set number_of_replicas: 0 on the indices instead. Use _cat/shards?v and _cluster/allocation/explain to diagnose stuck shards. Crossing the flood-stage disk watermark makes indices read-only; see the disk full section below.
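A minimal sketch of the health check as script helpers, assuming an unsecured node at localhost:9200 (override ES_URL otherwise). The function names are hypothetical, and the status is extracted with sed so that jq is not required.

```shell
ES_URL="${ES_URL:-http://localhost:9200}"   # assumed address; adjust for your cluster

# Extract the "status" value (green, yellow or red) from _cluster/health JSON on stdin.
health_status() {
  sed -n 's/.*"status"[[:space:]]*:[[:space:]]*"\([a-z]*\)".*/\1/p' | head -n 1
}

# List shards that are not STARTED, with the reason they are unassigned.
# The header row is kept because it does not contain the word STARTED.
unassigned_shards() {
  curl -s "$ES_URL/_cat/shards?v&h=index,shard,prirep,state,unassigned.reason" | grep -v STARTED
}
```

Typical use is curl -s "$ES_URL/_cluster/health" | health_status, followed by unassigned_shards when the result is not green.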

No documents in Elasticsearch

Trace the pipeline: Filebeat → Logstash → ES. Confirm that the Filebeat registry is advancing and that Logstash is receiving events (add a temporary stdout output). Check that the index name in the Logstash output matches the index pattern in your search query. Mapping conflicts reject documents; check the Logstash and ES logs for mapper_parsing_exception. A timestamp in the future puts documents outside your search time window. Indices are near-real-time: call /_refresh or wait about 1s after indexing before searching.
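The refresh-then-count step can be sketched as follows, again assuming an unsecured node at localhost:9200. doc_count and count_after_refresh are hypothetical helper names; the index pattern you pass is your own.

```shell
ES_URL="${ES_URL:-http://localhost:9200}"   # assumed address; adjust for your cluster

# Extract the numeric "count" field from a _count API response on stdin.
doc_count() {
  sed -n 's/.*"count"[[:space:]]*:[[:space:]]*\([0-9][0-9]*\).*/\1/p' | head -n 1
}

# Force a refresh on an index pattern, then print how many documents it holds.
count_after_refresh() {
  curl -s -X POST "$ES_URL/$1/_refresh" >/dev/null
  curl -s "$ES_URL/$1/_count" | doc_count
}
```

Running count_after_refresh 'logs-*' twice, a few seconds apart, shows whether documents are still arriving. A count that stays at 0 points back to the Logstash output or the index name.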

Disk full / flood stage watermark

Elasticsearch blocks writes to indices when disk usage crosses the flood-stage watermark. Delete old indices, add disk, or raise the watermarks in elasticsearch.yml (a stopgap, not a long-term fix). Implement ILM or a cron job to drop indices older than your retention period. Monitor /_cat/allocation?v and host disk usage with df -h. See the disk volumes lab.
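For the cron approach, one possible sketch is a filter that picks out indices whose names end in a date older than a cutoff. It assumes daily indices named with a -YYYY.MM.DD suffix (for example logs-2024.01.15); that naming is an assumption, not something the cluster guarantees, and old_indices is a hypothetical helper name.

```shell
# Read index names on stdin and print those whose -YYYY.MM.DD suffix is
# strictly older than the cutoff date given as the first argument.
# Names without such a suffix (system indices and the like) are ignored.
old_indices() {
  cutoff="$1"
  while IFS= read -r idx; do
    d="${idx##*-}"
    case "$d" in
      [0-9][0-9][0-9][0-9].[0-9][0-9].[0-9][0-9])
        # Zero-padded dates sort chronologically as plain strings.
        first=$(printf '%s\n%s\n' "$d" "$cutoff" | sort | head -n 1)
        if [ "$first" = "$d" ] && [ "$d" != "$cutoff" ]; then
          echo "$idx"
        fi
        ;;
    esac
  done
}
```

A typical pipeline would be curl -s 'localhost:9200/_cat/indices?h=index' | old_indices 2024.01.15. Review the printed list before passing any name to curl -X DELETE, since deletion cannot be undone.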

Authentication / TLS errors

With security enabled, Filebeat and Logstash need a username/password or an API key in their output blocks, plus the CA certificate for TLS. Typical errors in the Beat and Logstash logs are 401 Unauthorized and certificate verify failed. Test ES auth with curl -u elastic:PASS localhost:9200 (use https:// and --cacert with your CA file when TLS is enabled on the HTTP layer). Align cipher suites and hostname verification between components.
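When scripting the auth test, it can help to reduce the response to its HTTP status code and map that to a likely cause. This is a sketch: probe_es and explain_http_code are hypothetical names, and ES_USER, ES_PASS and CA_CERT are placeholders you must supply.

```shell
# Map an HTTP status code returned by Elasticsearch to a likely cause.
explain_http_code() {
  case "$1" in
    200) echo "ok: credentials accepted" ;;
    401) echo "unauthorized: missing or wrong username/password or API key" ;;
    403) echo "forbidden: authenticated, but lacking the required privileges" ;;
    000) echo "no HTTP response: connection refused, TLS failure, or wrong scheme" ;;
    *)   echo "unexpected status: $1" ;;
  esac
}

# Probe a URL and print only the status code (curl prints 000 when it gets no response).
# ES_USER, ES_PASS and CA_CERT are placeholders, not defaults.
probe_es() {
  curl -s -o /dev/null -w '%{http_code}' --cacert "$CA_CERT" -u "$ES_USER:$ES_PASS" "$1"
}
```

For example, explain_http_code "$(probe_es https://localhost:9200)" distinguishes a bad password (401) from a certificate or connectivity problem (000).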

Grok / parsing wrong fields

When the log format changes (nginx custom format, JSON logs), update the grok pattern or switch to the json { source => "message" } filter. Use the Grok Debugger (Elastic docs) or stdout { codec => rubydebug } to inspect raw events. Multiline stack traces must be joined before parsing: use the multiline options on the Filebeat input, or codec => multiline on a Logstash input that is not fed by Beats.

Elasticsearch won’t start

Common causes: vm.max_map_count too low on Linux (set it to at least 262144), wrong Java version, corrupt data path, or port 9200 already in use. Check the logs in /var/log/elasticsearch/ or with journalctl -u elasticsearch -e. Bootstrap checks can also fail on memory locking and file descriptor limits; follow the hints in the error message.
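The vm.max_map_count requirement is easy to check before starting the service. The sketch below uses hypothetical helper names and reads the current value from /proc, which assumes a Linux host.

```shell
REQUIRED_MAP_COUNT=262144   # minimum named in the text above

# Succeed if the given value meets the minimum.
map_count_ok() {
  [ "$1" -ge "$REQUIRED_MAP_COUNT" ]
}

# Read the live kernel setting and report whether it is high enough.
check_map_count() {
  current=$(cat /proc/sys/vm/max_map_count)
  if map_count_ok "$current"; then
    echo "vm.max_map_count=$current is sufficient"
  else
    echo "vm.max_map_count=$current is too low; need at least $REQUIRED_MAP_COUNT"
    return 1
  fi
}
```

If the check fails, sudo sysctl -w vm.max_map_count=262144 raises the value until the next reboot. Add the setting to /etc/sysctl.conf or a file under /etc/sysctl.d/ to make it persistent.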

Debugging workflow

1. Cluster health and indices

curl -s 'localhost:9200/_cluster/health?pretty'
curl -s 'localhost:9200/_cat/indices?v' | head

2. Filebeat → Logstash path

filebeat test output
journalctl -u filebeat -n 30 --no-pager
ss -tlnp | grep 5044

3. Logstash pipeline and ES output

/usr/share/logstash/bin/logstash --path.settings /etc/logstash -t
journalctl -u logstash -n 50 --no-pager
curl -s 'localhost:9200/logs-*/_count'
