
Huge CSV file occupies huge memory while indexing a document

Open · rayshaw001 opened this issue 11 months ago · 1 comment

Self Checks

  • [x] This is only for bug reports; if you would like to ask a question, please head to Discussions.
  • [x] I have searched for existing issues, including closed ones.
  • [x] I confirm that I am using English to submit this report (I have read and agree to the Language Policy).
  • [x] [FOR CHINESE USERS] Please be sure to submit issues in English, otherwise they will be closed. Thank you! :)
  • [x] Please do not modify this template :) and fill in all the required fields.

Dify version

0.15.1

Cloud or Self Hosted

Self Hosted (Docker)

Steps to reproduce

  1. A 4-core/8 GB instance
  2. A huge CSV file (350,000 lines)
  3. Upload the CSV file
  4. indexing_processor takes a huge amount of memory while embedding the segments
  5. The Docker container restarts due to running out of memory

✔️ Expected Behavior

The document status turns to available. [screenshot]

❌ Actual Behavior

The Docker container restarts due to running out of memory; the document status stays pending at the indexing step. [screenshot]

10,000 embeddings take 10,076 * 1,536 * 8 bytes / 1024 / 1024 ≈ 118 MB of memory (a float takes 8 bytes of memory in Python), so 350,000 embeddings will take about 118 MB * 35 / 1024 ≈ 4 GB of memory.
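
A quick way to sanity-check this arithmetic in Python, assuming 1,536-dimensional vectors and 8 bytes per float as above:

```python
# Back-of-the-envelope estimate of embedding memory, using the same
# assumptions as the report: 1,536-dimensional vectors, 8 bytes per float.
DIM = 1536
BYTES_PER_FLOAT = 8

def embedding_memory_mb(num_segments: int) -> float:
    return num_segments * DIM * BYTES_PER_FLOAT / 1024 / 1024

print(f"{embedding_memory_mb(10_076):.0f} MB")   # ~118 MB
print(f"{embedding_memory_mb(350_000):.0f} MB")  # ~4101 MB, i.e. ~4 GB
```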

  • Log of embeddings memory usage: [screenshot]
  • objgraph embeddings result, key result: [screenshot]
  • objgraph embeddings result, full result: [screenshot]

rayshaw001 · Jan 21 '25

The issue you're experiencing with high memory usage during the indexing of a large CSV file in Dify is likely due to the memory requirements for handling a large number of embeddings. Here are a few suggestions to help manage memory usage:

  1. Batch Processing: Instead of processing all 350,000 lines at once, consider processing the CSV file in smaller batches. This can reduce peak memory usage by only loading a subset of the data into memory at any given time (a minimal sketch follows this list).

  2. Increase Memory Allocation: If possible, increase the memory allocation for your Docker container. This can be done by adjusting the Docker settings to allow more memory for the container running Dify.

  3. Optimize Embedding Size: If feasible, consider reducing the dimensionality of the embeddings. This will decrease the memory required for each embedding, thus reducing the overall memory footprint.

  4. Use Swap Space: Configure your system to use swap space, which can help handle memory spikes by temporarily using disk space as additional memory. However, this may slow down processing due to the slower read/write speeds of disk storage compared to RAM.

  5. Monitor and Log Memory Usage: Continue using tools like objgraph to monitor memory usage and identify potential memory leaks or inefficiencies in the code (a minimal usage sketch appears at the end of this reply).
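
For point 1, here is a minimal sketch of the batching idea in Python. This is not Dify's actual indexing code: `embed`, `store`, and the batch size of 1,000 are illustrative placeholders.

```python
import csv
from typing import Iterable, Iterator, List

Row = List[str]

def embed(batch: List[Row]) -> List[List[float]]:
    """Placeholder for the real embedding call (one vector per row)."""
    return [[0.0] * 1536 for _ in batch]

def store(vectors: List[List[float]]) -> None:
    """Placeholder for writing vectors to the vector store."""
    pass

def chunked(rows: Iterable[Row], size: int = 1000) -> Iterator[List[Row]]:
    """Yield rows in fixed-size batches so only one batch is in memory."""
    batch: List[Row] = []
    for row in rows:
        batch.append(row)
        if len(batch) >= size:
            yield batch
            batch = []
    if batch:
        yield batch

def index_csv(path: str) -> None:
    with open(path, newline="") as f:
        for batch in chunked(csv.reader(f), size=1000):
            store(embed(batch))
            # The batch and its vectors go out of scope each iteration,
            # so peak memory is bounded by one batch, not the whole file.
```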

Implementing these strategies can help mitigate the memory issues you're facing during the document indexing process.
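
For the monitoring suggestion in point 5, objgraph's standard growth check looks roughly like this; `run_indexing_batch` is a hypothetical stand-in for one batch of embedding work:

```python
import objgraph

def run_indexing_batch() -> None:
    """Hypothetical stand-in for one batch of embedding work."""
    pass

objgraph.show_growth(limit=10)  # first call records the baseline counts
run_indexing_batch()
objgraph.show_growth(limit=10)  # second call prints which object types grew
# Counts of list/float objects that keep growing across batches suggest
# embeddings are being retained rather than released.
```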

To continue talking to Dosu, mention @dosu.



dosubot[bot] · Jan 21 '25

Regarding this:

Batch Processing: Consider processing data in smaller batches to reduce peak memory usage.

The whole point of batch processing in the first place is to throw a (huge) bunch of files at something and let the application do the rest. So why should a human have to resize the batches? That would require manual, recurring intervention. Why can't we throw 500 files at Dify and let it handle working through the individual files, since Dify surely knows best which resources are currently available? (A sketch of such adaptive batch sizing follows.)
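
As a sketch of what this could look like, an application might derive the batch size from the memory available at runtime. This assumes psutil and reuses the per-embedding estimate from the report; the 25% budget and the floor of 100 are arbitrary illustrative choices:

```python
import psutil

EMBEDDING_DIM = 1536
BYTES_PER_EMBEDDING = EMBEDDING_DIM * 8  # 8 bytes per Python float, as above

def adaptive_batch_size(budget_fraction: float = 0.25, floor: int = 100) -> int:
    """Pick a batch size whose embeddings fit in a fraction of free memory."""
    available = psutil.virtual_memory().available  # bytes currently free
    budget = int(available * budget_fraction)
    return max(floor, budget // BYTES_PER_EMBEDDING)

print(adaptive_batch_size())  # larger batches on idle machines, smaller under load
```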

mamema · Apr 15 '25