
Index failing with `connection reset by peer`

Open · bnewbold opened this issue 5 years ago · 1 comment

I twice attempted to import over 140 million documents into a local, single-node ES 6.8 cluster using a command like the following:

zcat /srv/fatcat/snapshots/release_export_expanded.json.gz |  pv -l | parallel -j20 --linebuffer --round-robin --pipe ./fatcat_transform.py elasticsearch-releases - - | esbulk -verbose -size 10000 -id ident -w 6 -index qa_release_v03b -type release

This is with esbulk 0.5.1. I will retry with the latest 0.6.0.

The index almost completed, but after more than 100 million documents it failed with an error like:

2020/01/31 11:49:40 Post http://localhost:9200/_bulk: net/http: HTTP/1.x transport connection broken: write tcp [::1]:56970->[::1]:9200: write: connection reset by peer                                                                      
Warning: unable to close filehandle properly: Broken pipe during global destruction

(the "Warning" part might be one of the other pipeline commands)

I suspect this is actually a problem on the Elasticsearch side, maybe something like a GC pause. I looked in the ES logs and saw garbage collections up until the time of the failure and none after, but no particularly large or noticeable GC right around the failure itself.
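For reference, I checked the GC monitor lines roughly like this, assuming a default package install with logs under /var/log/elasticsearch/ (adjust the path for your setup):

# show the most recent GC monitor entries from the ES logs
grep -i '\[gc\]' /var/log/elasticsearch/*.log | tail -n 50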

I would expect the esbulk HTTP retries to resolve any such transient issues; I assume that in this case all of the retries failed. Perhaps longer delays, more attempts, or exponential back-off would help. Unfortunately, I suspect that this failure may be difficult to reproduce reliably, as it has only occurred with these very large imports.
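To illustrate what I mean by exponential back-off, here is a rough sketch of the retry behaviour (this is not esbulk's actual code; batch.ndjson stands in for a single bulk payload):

for attempt in 1 2 3 4 5; do
    # POST one bulk payload; -f makes curl exit non-zero on HTTP errors so the retry triggers
    curl -sf -H 'Content-Type: application/x-ndjson' \
         -XPOST http://localhost:9200/_bulk --data-binary @batch.ndjson && break
    # back off 2, 4, 8, 16, 32 seconds between attempts
    sleep $((2 ** attempt))
done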

esbulk has been really useful, thank you for making it available and any maintenance time you can spare!

bnewbold commented on Jan 31, 2020

As a follow-up on this issue: if I recall correctly, the root issue was individual batches that were too large (in bytes, not in number of documents), which ES would refuse. I worked around this by decreasing the batch size.
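Concretely, that meant re-running with a smaller -size value than the original 10000, something like the following (the value of 1000 here is just an example, not the exact number I settled on):

zcat /srv/fatcat/snapshots/release_export_expanded.json.gz | pv -l | parallel -j20 --linebuffer --round-robin --pipe ./fatcat_transform.py elasticsearch-releases - - | esbulk -verbose -size 1000 -id ident -w 6 -index qa_release_v03b -type release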

bnewbold commented on Mar 27, 2020