eland
understanding to_csv and memory usage
Hi, I need to store an elasticsearch index, which is way larger than the memory available, to a csv file. I thought eland would do this efficiently for me under the hood, but it appears this is not the case. The memory usage increases to the point where it's killed by the OS. I think I must be missing something. Can you help me understand how to do this?
- Does eland indeed do this automagically or not?
- With pandas this can be achieved through the `chunksize` parameter. I've tried a lot of values, but this appears to do nothing; memory always grows in the same way.
- The last line I get in the logs before the OS kills the process is the `DELETE` request:
[lq0jg ](https://swarmpit.soccrates.xyz/#/tasks/lq0jgu4ewxdcoqcauan7yr2nu?log=1) 2022-03-29 13:44:48,182 - INFO: POST https://es-coordinator:9200/_search [status:200 request:0.378s]
[lq0jg ](https://swarmpit.soccrates.xyz/#/tasks/lq0jgu4ewxdcoqcauan7yr2nu?log=1) 2022-03-29 13:44:49,169 - INFO: POST https://es-coordinator:9200/_search [status:200 request:0.158s]
[lq0jg ](https://swarmpit.soccrates.xyz/#/tasks/lq0jgu4ewxdcoqcauan7yr2nu?log=1) 2022-03-29 13:44:49,884 - INFO: POST https://es-coordinator:9200/_search [status:200 request:0.006s]
[lq0jg ](https://swarmpit.soccrates.xyz/#/tasks/lq0jgu4ewxdcoqcauan7yr2nu?log=1) 2022-03-29 13:44:49,887 - INFO: DELETE https://es-coordinator:9200/_pit [status:200 request:0.003s]
Hi @falcorocks, we paginate the results from ES in batches and then construct one huge pandas DataFrame here: https://github.com/elastic/eland/blob/15a300728876022b206161d71055c67b500a0192/eland/operations.py#L1220-L1228 and only then call `to_csv`. This can result in huge memory usage, because the entire index is loaded into a pandas DataFrame before conversion.
The best option would be to write to disk as each chunk of data is retrieved from ES.
I am hoping Python/pandas wouldn't load the entire file before appending, which would overload memory again.
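A minimal sketch of the "write each chunk as it arrives" idea, using only the standard library. This is not eland's implementation: `write_batches_to_csv` and `fake_batches` are hypothetical names, and `fake_batches` merely stands in for the paginated search results. The file is opened once and each batch is written and then discarded, so only one batch is held in memory at a time.

```python
import csv
from typing import Dict, Iterable, Iterator, List


def write_batches_to_csv(
    batches: Iterable[List[Dict]], path: str, fieldnames: List[str]
) -> int:
    """Stream batches of row dicts to a CSV file, one batch at a time.

    Hypothetical helper for illustration; returns the number of rows written.
    """
    rows_written = 0
    with open(path, "w", newline="") as f:
        # Keys not listed in fieldnames are skipped; missing keys become "".
        writer = csv.DictWriter(f, fieldnames=fieldnames, extrasaction="ignore")
        writer.writeheader()
        for batch in batches:
            # Each batch is written out and can then be garbage collected.
            writer.writerows(batch)
            rows_written += len(batch)
    return rows_written


def fake_batches(total: int, batch_size: int) -> Iterator[List[Dict]]:
    """Stand-in for paginated search hits (an assumption, not a real ES call)."""
    for start in range(0, total, batch_size):
        stop = min(start + batch_size, total)
        yield [{"id": i, "value": i * 2} for i in range(start, stop)]


if __name__ == "__main__":
    n = write_batches_to_csv(fake_batches(10, 4), "out.csv", ["id", "value"])
    print(f"wrote {n} rows")
```

With pandas, a comparable effect should be possible by calling `DataFrame.to_csv(path, mode="a", header=False)` on each batch after the first. Opening a file in append mode does not read its existing contents, so appending by itself should not pull the whole file back into memory.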
@sethmlarson any thoughts?
@falcorocks The `to_csv` improvements have been continued in #579 and released in 8.11.0. This may resolve your issue.