
feat/Guard against excessive memory usage when partitioning PDFs

Open flash1293 opened this issue 2 years ago • 3 comments

Is your feature request related to a problem? Please describe.

When partitioning PDFs with the OCR strategy, some files cause a large amount of memory to be allocated, more than is available in every environment (e.g. when running on Google Cloud Run with limited resources).

For example, the following 23MB PDF causes memory usage of >10GB when partitioning: https://drive.google.com/file/d/1lr-Pwh3QTVfdY4F6R-fk4tVU9FNSK27p/view?usp=sharing

Describe the solution you'd like

Unstructured should employ sensible defaults to avoid this kind of situation (e.g. a maximum size for a page when rendered in memory). The limit could also be configurable as an optional argument on the partitioning method.

In cases where this isn't feasible, the partitioning method should raise a descriptive exception so the caller can handle the situation gracefully instead of crashing the process.
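To make the request concrete, here is a rough sketch of what such a guard could look like. It is not unstructured's API: `PageTooLargeError`, `check_page_render_size`, and the default values are hypothetical names and numbers chosen for illustration. It assumes the page is rendered to an uncompressed bitmap with one byte per channel, so the footprint can be estimated from the page size in points (72 points per inch) and the render DPI before any rendering happens.

```python
import math


class PageTooLargeError(ValueError):
    """Hypothetical exception: rendering a page would exceed the memory budget."""


def check_page_render_size(width_pt, height_pt, dpi=200, channels=3,
                           max_bytes=512 * 1024**2):
    """Estimate the rendered bitmap size of a page and fail early if too large.

    All parameter names and defaults are illustrative assumptions.
    """
    # PDF page dimensions are given in points; 72 points equal one inch.
    width_px = math.ceil(width_pt / 72 * dpi)
    height_px = math.ceil(height_pt / 72 * dpi)
    # Uncompressed bitmap, one byte per channel per pixel.
    estimated = width_px * height_px * channels
    if estimated > max_bytes:
        raise PageTooLargeError(
            f"Rendering a {width_pt}x{height_pt} pt page at {dpi} DPI needs "
            f"about {estimated} bytes, above the limit of {max_bytes} bytes"
        )
    return estimated
```

A caller could catch `PageTooLargeError` and skip the page, lower the DPI, or fall back to a non-OCR strategy instead of having the whole process killed.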

The most important aspect is giving a way to limit the amount of memory unstructured will use during partitioning.

Describe alternatives you've considered

Alternatively, the partitioning can be run in a separate memory-limited process which is controlled by another process. In case the partitioning process runs out of memory, the orchestration process can handle the situation.
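A minimal sketch of this workaround, assuming a Unix-like host where `resource.RLIMIT_AS` is enforced and the `fork` start method is available. `run_with_memory_limit` and `PartitionMemoryError` are hypothetical names, not part of unstructured. The child process caps its own address space, runs the given function, and reports the outcome over a pipe; the parent turns a failure into an exception it can handle.

```python
import multiprocessing as mp
import resource


class PartitionMemoryError(RuntimeError):
    """Hypothetical exception: the worker hit its memory limit or died."""


def _worker(conn, limit_bytes, func, args):
    # Cap the address space of this child process only.
    _, hard = resource.getrlimit(resource.RLIMIT_AS)
    soft = limit_bytes if hard == resource.RLIM_INFINITY else min(limit_bytes, hard)
    resource.setrlimit(resource.RLIMIT_AS, (soft, hard))
    try:
        conn.send(("ok", func(*args)))
    except MemoryError:
        conn.send(("oom", None))
    except Exception as exc:  # report other failures separately
        conn.send(("error", repr(exc)))
    finally:
        conn.close()


def run_with_memory_limit(func, args=(), limit_bytes=2 * 1024**3):
    """Run func(*args) in a forked child with a capped address space."""
    ctx = mp.get_context("fork")  # assumption: Unix-only start method
    parent_conn, child_conn = ctx.Pipe(duplex=False)
    proc = ctx.Process(target=_worker, args=(child_conn, limit_bytes, func, args))
    proc.start()
    child_conn.close()  # so recv() sees EOF if the child is killed
    try:
        status, value = parent_conn.recv()
    except EOFError:
        status, value = "died", None
    proc.join()
    if status == "ok":
        return value
    if status == "error":
        raise RuntimeError(f"worker failed: {value}")
    raise PartitionMemoryError(
        f"worker exceeded {limit_bytes} bytes or was killed (status={status})"
    )
```

The orchestrating process would pass its own partitioning call as `func`, for example a small wrapper around `partition_pdf`, and catch `PartitionMemoryError`. The downside is that every caller has to build and tune this themselves, which is why a built-in limit would be preferable.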

flash1293, Nov 20 '23 17:11