
understanding to_csv and memory usage

Open falcorocks opened this issue 2 years ago • 1 comments

Hi, I need to store an Elasticsearch index, which is far larger than the available memory, in a CSV file. I thought eland would do this efficiently for me under the hood, but that does not appear to be the case: memory usage grows until the process is killed by the OS. I think I must be missing something. Can you help me understand how to do this?

  1. Does eland indeed do this automagically or not?
  2. With pandas this can be achieved through the chunksize parameter. I've tried many values, but it appears to have no effect: memory always grows in the same way.
  3. The last line I get in the logs before the OS kills the process is the DELETE request:
2022-03-29 13:44:48,182 - INFO: POST https://es-coordinator:9200/_search [status:200 request:0.378s]
2022-03-29 13:44:49,169 - INFO: POST https://es-coordinator:9200/_search [status:200 request:0.158s]
2022-03-29 13:44:49,884 - INFO: POST https://es-coordinator:9200/_search [status:200 request:0.006s]
2022-03-29 13:44:49,887 - INFO: DELETE https://es-coordinator:9200/_pit [status:200 request:0.003s]

falcorocks • Mar 29 '22 11:03

Hi @falcorocks, we paginate the results from ES in batches, then build one large pandas DataFrame from them (https://github.com/elastic/eland/blob/15a300728876022b206161d71055c67b500a0192/eland/operations.py#L1220-L1228), and only then call to_csv. This can result in very high memory usage, because the entire index is loaded into a pandas DataFrame before it is converted to CSV.

The best option would be to write each chunk to disk as soon as it is retrieved from ES. I am hoping Python / pandas wouldn't load the entire file before appending, which would overload memory again.
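
A minimal sketch of that chunk-by-chunk idea, assuming each batch arrives as a pandas DataFrame. Here `fetch_batches` and `write_batches_to_csv` are hypothetical names for illustration and are not part of eland; `fetch_batches` only fakes the paginated ES results with generated data. Opening a file in append mode does not read its existing contents, so only the current batch should be held in memory.

```python
from typing import Iterable, Iterator

import pandas as pd


def fetch_batches(total_rows: int, batch_size: int) -> Iterator[pd.DataFrame]:
    """Hypothetical stand-in for paginated ES results (generated data only)."""
    for start in range(0, total_rows, batch_size):
        stop = min(start + batch_size, total_rows)
        yield pd.DataFrame(
            {
                "id": list(range(start, stop)),
                "value": [i * 2 for i in range(start, stop)],
            }
        )


def write_batches_to_csv(batches: Iterable[pd.DataFrame], path: str) -> int:
    """Write batches to `path` one at a time and return the row count.

    The first batch creates the file and writes the header; every later
    batch is appended without a header.
    """
    rows_written = 0
    for i, batch in enumerate(batches):
        first = i == 0
        batch.to_csv(path, mode="w" if first else "a", header=first, index=False)
        rows_written += len(batch)
    return rows_written
```

For example, `write_batches_to_csv(fetch_batches(10, 4), "export.csv")` would write three batches (4, 4 and 2 rows) into a single file with one header row. This assumes every batch has the same columns in the same order; if the columns vary between batches, the appended rows will not line up with the header.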

@sethmlarson any thoughts?

V1NAY8 • Mar 29 '22 12:03

@falcorocks The to_csv improvements have been continued in #579 and released in 8.11.0. This may resolve your issue.

bartbroere • Dec 01 '23 16:12