logstash-output-opensearch
[BUG] Logstash hangs when writing in a read-only index
Describe the bug Logstash 7.6.13 with the OpenSearch output plugin (8.4 does the same) stops processing its queue when it receives a 429 error from OpenSearch. In my case, the error was generated when Logstash tried to write into a read-only index. Logstash then retries the bulk request indefinitely and no longer processes new events. If I remove the read-only block from the index, Logstash operates normally again.
To Reproduce Steps to reproduce the behavior:
- Set the read-only flag on an OpenSearch index:
PUT <myindex>/_settings
{
"index.blocks.read_only_allow_delete": true
}
- Ship some logs to Logstash that are supposed to be written to the read-only index
- Check Logstash logs. A 429 error should appear like this:
[2022-12-02T14:15:01,887][INFO ][logstash.outputs.opensearch][main][opensearch_output_with_project] Retrying failed action {:status=>429, :action=>["index", {:_id=>nil, :_index=>"project-docker-prod-syslog-2022.08.08", :routing=>nil}, {"syslog_severity_code"=>5, "structure"=>"syslog", "syslog_severity"=>"notice", "prospector"=>{"type"=>"log"}, "env_type"=>"real", "@timestamp"=>2022-08-08T04:00:04.000Z, "beat"=>{"version"=>"6.8.23", "hostname"=>"e8912732644d"}, "source"=>"/var/log/messages", "env"=>"prod", "aws"=>{"instance"=>{"id"=>"i-0123456789abcdefg"}}, "syslog_facility"=>"user-level", "syslog_facility_code"=>1, "syslog_timestamp"=>"Aug 08 04:00:04", "tags"=>["beats_input_codec_plain_applied"], "syslog_hostname"=>"serveurid123456", "@version"=>"1", "syslog_message"=>"Test4.", "log"=>{"file"=>{"path"=>"/var/log/messages"}}, "offset"=>4100, "market"=>"docker", "syslog_program"=>"systemd", "input"=>{"type"=>"log"}, "project"=>"project"}], :error=>{"type"=>"cluster_block_exception", "reason"=>"index [project-docker-prod-syslog-2022.08.08] blocked by: [TOO_MANY_REQUESTS/12/disk usage exceeded flood-stage watermark, index has read-only-allow-delete block];"}}
- Ship more logs to the same index or another one; nothing will happen, as Logstash keeps trying to re-send the failed bulk request:
[2022-12-02T14:07:49,021][INFO ][logstash.outputs.opensearch][main][opensearch_output_with_project] Retrying individual bulk actions that failed or were rejected by the previous bulk request {:count=>1}
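As noted in the description, Logstash recovers once the read-only block is removed. A sketch of the corresponding settings call (setting the value to null resets it to its default, false):
PUT <myindex>/_settings
{
  "index.blocks.read_only_allow_delete": null
}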
Expected behavior The plugin should put this event in another queue and keep processing the main queue to avoid interruption. Instead, the OpenSearch plugin retries the bulk request indefinitely because a 429 is treated as a retryable error: https://www.rubydoc.info/gems/logstash-output-opensearch/LogStash/Outputs/OpenSearch
Plugins
logstash-codec-avro
logstash-codec-cef
logstash-codec-collectd
logstash-codec-dots
logstash-codec-edn
logstash-codec-edn_lines
logstash-codec-es_bulk
logstash-codec-fluent
logstash-codec-graphite
logstash-codec-json
logstash-codec-json_lines
logstash-codec-line
logstash-codec-msgpack
logstash-codec-multiline
logstash-codec-netflow
logstash-codec-plain
logstash-codec-rubydebug
logstash-filter-aggregate
logstash-filter-anonymize
logstash-filter-cidr
logstash-filter-clone
logstash-filter-csv
logstash-filter-date
logstash-filter-de_dot
logstash-filter-dissect
logstash-filter-dns
logstash-filter-drop
logstash-filter-elasticsearch
logstash-filter-fingerprint
logstash-filter-geoip
logstash-filter-grok
logstash-filter-http
logstash-filter-json
logstash-filter-json_encode
logstash-filter-kv
logstash-filter-memcached
logstash-filter-metrics
logstash-filter-mutate
logstash-filter-prune
logstash-filter-ruby
logstash-filter-sleep
logstash-filter-split
logstash-filter-syslog_pri
logstash-filter-throttle
logstash-filter-translate
logstash-filter-truncate
logstash-filter-urldecode
logstash-filter-useragent
logstash-filter-uuid
logstash-filter-xml
logstash-input-azure_event_hubs
logstash-input-beats
└── logstash-input-elastic_agent (alias)
logstash-input-cloudwatch_logs
logstash-input-couchdb_changes
logstash-input-dead_letter_queue
logstash-input-elasticsearch
logstash-input-exec
logstash-input-file
logstash-input-ganglia
logstash-input-gelf
logstash-input-generator
logstash-input-graphite
logstash-input-heartbeat
logstash-input-http
logstash-input-http_poller
logstash-input-imap
logstash-input-jms
logstash-input-pipe
logstash-input-redis
logstash-input-s3
logstash-input-snmp
logstash-input-snmptrap
logstash-input-sqs
logstash-input-stdin
logstash-input-syslog
logstash-input-tcp
logstash-input-twitter
logstash-input-udp
logstash-input-unix
logstash-integration-elastic_enterprise_search
├── logstash-output-elastic_app_search
└── logstash-output-elastic_workplace_search
logstash-integration-jdbc
├── logstash-input-jdbc
├── logstash-filter-jdbc_streaming
└── logstash-filter-jdbc_static
logstash-integration-kafka
├── logstash-input-kafka
└── logstash-output-kafka
logstash-integration-rabbitmq
├── logstash-input-rabbitmq
└── logstash-output-rabbitmq
logstash-output-cloudwatch
logstash-output-csv
logstash-output-elasticsearch
logstash-output-email
logstash-output-file
logstash-output-graphite
logstash-output-http
logstash-output-lumberjack
logstash-output-nagios
logstash-output-null
logstash-output-opensearch
logstash-output-pipe
logstash-output-redis
logstash-output-s3
logstash-output-sns
logstash-output-sqs
logstash-output-stdout
logstash-output-tcp
logstash-output-udp
logstash-output-webhdfs
logstash-patterns-core
Host/Environment (please complete the following information):
- OS: CentOS Linux 7 (AltArch)
- Version 7
- Container in AWS ECS Fargate
Additional context Some instances run Filebeat and ship logs to Logstash OSS, which ships them to the OpenSearch cluster. When the error occurs, Logstash keeps re-sending the bulk request to OpenSearch and no longer acknowledges events to Filebeat, so Filebeat stops sending logs.
Restarting Logstash works for a while, until it runs into the same error again. Making the target index writable again also solves the issue for a time, but after a while a policy migrates the index to warm storage, which makes it read-only again, and we don't want to remove that policy.
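The exact policy is not included in this issue; for context, a minimal sketch of the kind of OpenSearch ISM policy being described, with a purely illustrative policy name, age threshold, and action list:
PUT _plugins/_ism/policies/warm-after-7d
{
  "policy": {
    "description": "Illustrative only: migrate indices to warm storage and mark them read-only",
    "default_state": "hot",
    "states": [
      {
        "name": "hot",
        "actions": [],
        "transitions": [
          { "state_name": "warm", "conditions": { "min_index_age": "7d" } }
        ]
      },
      {
        "name": "warm",
        "actions": [
          { "warm_migration": {} },
          { "read_only": {} }
        ],
        "transitions": []
      }
    ]
  }
}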
It looks like the plugin should detect if the documents failed to write because the index is read-only.
@Kluk , I'd like to better understand your suggested behavior. If the index is read-only, then all the events will fail to write, correct? What events would you expect to process?
@dlvenable thank you for your reply. I use Logstash to write to multiple indices. If one of them is read-only, Logstash hangs and stops writing even to the others, which are still writable.
We are facing the same issue.
We are moving hot indices to warm storage after a certain age. It happens that some log batches get delayed, and Logstash then tries to send them to warm (read-only) indices.
OpenSearch returns 429 errors and Logstash hangs in an infinite loop, retrying every few minutes. It stops processing any other batch and will not recover without manual intervention.
We looked at implementing a DLQ using the dlq_custom_codes option, as documented for the Elasticsearch output plugin. By default, the DLQ only handles 400 and 404 responses. However, it seems the dlq_custom_codes option is not supported by the OpenSearch output plugin.
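For reference, the syntax documented for the Elasticsearch output looks roughly like this (values are illustrative, it requires the dead letter queue to be enabled in logstash.yml, and whether a retryable 429 would actually be diverted this way is exactly what is unclear):
output {
  elasticsearch {
    hosts => ["https://example:9200"]     # placeholder endpoint
    index => "my-index-%{+YYYY.MM.dd}"
    # Hypothetical: ask the plugin to send 429 responses to the DLQ instead of retrying them
    dlq_custom_codes => [429]
  }
}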
Is there a way to treat 429s (or errors related to read-only indices) as non-retryable errors and send the documents to a DLQ?
Hello @AdrienKuhn,
As this issue is still open after a few months, I came up with a workaround.
I assume that your index names are generated from the timestamp provided with the log events. I did the same, and since I also had delayed logs, Logstash tried to write to read-only indices.
To work around that, I no longer use the date from the logs; I use the Logstash host date instead. That way, events are always written to the latest index (as long as Logstash runs on a server with the correct date configured). To do that, I created a filter in the pipeline that is applied to every event. I called it filter-date.conf:
filter {
  # Create the date fields:
  # [@metadata][log][date] is a metadata field (not forwarded to OpenSearch) and is only used to build the index name
  # [log][received_at] is the date and time at which Logstash received the event
  mutate {
    id => "log_date_weekly"
    add_field => [ "[log][received_at]", "%{@timestamp}" ]
    add_field => [ "[@metadata][log][date]", "%{+xxxx.}w%{+ww}" ]
  }
}
As you can see, I added two fields. If you don't want to ship the Logstash date to OpenSearch, you can create only the second one: it's a metadata field, and metadata fields are not shipped to OpenSearch. I use weekly indices, which is why I use the format %{+xxxx.}w%{+ww}: xxxx is the week year and ww is the week of that week year. Of course, you can use the usual YYYY.MM.dd for daily indices.
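For daily indices, a minimal variant of the same filter could look like this (the id value is made up here):
filter {
  mutate {
    id => "log_date_daily"
    add_field => [ "[@metadata][log][date]", "%{+YYYY.MM.dd}" ]
  }
}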
And then, I use only the metadata field [@metadata][log][date] in the opensearch output:
index => "my-index-%{[@metadata][log][date]}"
This way, you keep the original log event date in the document shipped to OpenSearch, but the event is written to a writable index. I hope this helps!
Hi @Kluk,
Thank you, it's helping a lot! We implemented something similar.
We are leveraging Filebeat to read logs from multiple sources and to forward logs to Logstash.
In the case of the docker/container logs forwarded by Filebeat, the Logstash @timestamp field of the event is the original timestamp from the log, not the timestamp at which Logstash received the event. It seems the docker Filebeat input may already be doing some processing at the source. However, this doesn't always seem to be the case for logs coming from other Filebeat inputs.
To make sure we are always using the Logstash host date, we ended up with the following:
filter {
  ruby {
    code => "
      event.set('[logstash][processed_at]', DateTime.now.iso8601(3))
      event.set('[@metadata][logstash][date]', DateTime.now.strftime('%Y.%m.%d'))
    "
  }
}
and passing the metadata field to the output:
index => "my-index-%{[@metadata][logstash][date]}"
We now have visibility into the processing delay by comparing @timestamp and [logstash][processed_at] on each log.
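A hypothetical follow-up to the ruby filter above that records this delay directly on each event (the [logstash][ingest_delay_seconds] field name is made up here):
filter {
  ruby {
    code => "
      # Parse the processed_at string back into a Time and compare it with @timestamp
      processed = DateTime.parse(event.get('[logstash][processed_at]')).to_time
      event.set('[logstash][ingest_delay_seconds]', (processed.to_f - event.get('@timestamp').to_f).round(3))
    "
  }
}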
I am wondering if it is expected for OpenSearch to return 429 / Too Many Requests errors when sending logs to a read-only index, but this workaround should fix the issue on our side!
Hi @AdrienKuhn,
I'm glad it helped!
I'm using Filebeat as well. It can process the logs differently depending on how it's configured.
I keep both dates to get the processing delay as well ;)
I wondered the same. A 429 does not seem like the right error code for OpenSearch to send in this case. But I actually saw the same behaviour with Elasticsearch. As you probably know, OpenSearch is a fork of Elasticsearch, so I suppose this behaviour was inherited from there and simply never revisited after the fork.