
Optimize load_documents function with multiprocessing

Open jiangzhuo opened this pull request 1 year ago • 2 comments

This pull request introduces multiprocessing in the load_documents() function to improve performance by utilizing multiple cores for document loading.

Key Changes

  • The load_documents() function now uses Python's multiprocessing module, allowing for concurrent document loading across multiple CPU cores.
  • This modification takes advantage of machines with multiple cores and/or hyper-threading, potentially yielding significant improvements in document loading times.
  • It enhances performance, especially when dealing with a large number of documents.

This update is expected to significantly speed up the document loading process, contributing to overall system efficiency and user experience. Please review and provide your feedback.

^^^This commit message is generated by ChatGPT^^^
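
A rough sketch of the idea (assuming a per-file helper such as load_single_document(); the names and structure here are illustrative, not the exact patch):

    import os
    from multiprocessing import Pool

    def load_documents(source_dir: str) -> list:
        # Gather every file path under the source directory
        file_paths = [
            os.path.join(root, name)
            for root, _dirs, names in os.walk(source_dir)
            for name in names
        ]
        # One worker per CPU core; each call loads a single file
        with Pool(processes=os.cpu_count()) as pool:
            documents = pool.map(load_single_document, file_paths)
        return documents

Sizing the pool with os.cpu_count() uses every available core, which matches the preference stated in the next comment.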

jiangzhuo avatar May 18 '23 17:05 jiangzhuo

Seems duplicated with https://github.com/imartinez/privateGPT/pull/255, but I prefer to use os.cpu_count() and all cores.

jiangzhuo avatar May 18 '23 18:05 jiangzhuo

Enhanced the load_documents() function by adding a progress bar using the tqdm library.

This change improves user experience by providing real-time feedback on the progress of document loading. Now, users can easily track the progress of this operation, especially when loading a large number of documents.

This also solves https://github.com/imartinez/privateGPT/issues/257
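
A sketch of how the progress bar could be wired in, again assuming the hypothetical load_single_document() helper from the sketch above (illustrative, not the actual diff). imap_unordered() hands back results as each worker finishes, so tqdm can advance in real time:

    import os
    from multiprocessing import Pool

    from tqdm import tqdm

    def load_in_parallel(file_paths: list) -> list:
        documents = []
        with Pool(processes=os.cpu_count()) as pool, \
                tqdm(total=len(file_paths), desc="Loading new documents") as pbar:
            # Results arrive as soon as each worker finishes, so the bar
            # tracks real progress rather than submission order
            for doc in pool.imap_unordered(load_single_document, file_paths):
                documents.append(doc)
                pbar.update()
        return documents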

jiangzhuo avatar May 18 '23 18:05 jiangzhuo

FYI, this breaks ingestion when importing Markdown files, for instance.

[nltk_data] Downloading package punkt to /home/privategpt/nltk_data...
[nltk_data] Downloading package punkt to /home/privategpt/nltk_data...
[nltk_data] Downloading package punkt to /home/privategpt/nltk_data...
[nltk_data] Downloading package punkt to /home/privategpt/nltk_data...
[nltk_data] Downloading package punkt to /home/privategpt/nltk_data...
[nltk_data] Downloading package punkt to /home/privategpt/nltk_data...
[nltk_data] Downloading package punkt to /home/privategpt/nltk_data...
[nltk_data] Downloading package punkt to /home/privategpt/nltk_data...
[nltk_data] Downloading package punkt to /home/privategpt/nltk_data...
[nltk_data] Downloading package punkt to /home/privategpt/nltk_data...
[nltk_data] Downloading package punkt to /home/privategpt/nltk_data...
[nltk_data] Downloading package punkt to /home/privategpt/nltk_data...
[nltk_data] Downloading package punkt to /home/privategpt/nltk_data...
[nltk_data] Downloading package punkt to /home/privategpt/nltk_data...
[nltk_data] Downloading package punkt to /home/privategpt/nltk_data...
[nltk_data] Downloading package punkt to /home/privategpt/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data]   Unzipping tokenizers/punkt.zip.
Loading new documents:   0%|                             | 0/33 [00:04<?, ?it/s]
multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/site-packages/unstructured/nlp/tokenize.py", line 21, in _download_nltk_package_if_not_present
    nltk.find(f"{package_category}/{package_name}")
  File "/usr/local/lib/python3.10/site-packages/nltk/data.py", line 583, in find
    raise LookupError(resource_not_found)
LookupError:
**********************************************************************
  Resource punkt not found.
  Please use the NLTK Downloader to obtain the resource:

  >>> import nltk
  >>> nltk.download('punkt')

  For more information see: https://www.nltk.org/data.html

  Attempted to load tokenizers/punkt

  Searched in:
    - '/home/privategpt/nltk_data'
    - '/usr/local/nltk_data'
    - '/usr/local/share/nltk_data'
    - '/usr/local/lib/nltk_data'
    - '/usr/share/nltk_data'
    - '/usr/local/share/nltk_data'
    - '/usr/lib/nltk_data'
    - '/usr/local/lib/nltk_data'
**********************************************************************

That first error can be fixed with:

    # Avoid conflicts during parallel download
    import logging

    import nltk

    logger = logging.getLogger(__name__)

    try:
        nltk.data.find("tokenizers/punkt")
    except LookupError:
        try:
            nltk.download("punkt")
        except FileExistsError as error:
            # Another worker process created the file first
            logger.debug(f"NLTK punkt tokenizer already downloaded? Error message: {error}")

but then other errors occur:

[nltk_data]     /home/privategpt/nltk_data...
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /home/privategpt/nltk_data...
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /home/privategpt/nltk_data...
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /home/privategpt/nltk_data...
...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.
...
  File "/usr/local/lib/python3.10/site-packages/nltk/data.py", line 755, in load
    resource_val = pickle.load(opened_resource)
_pickle.UnpicklingError: invalid load key, '\x00'.
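
One way to sidestep this class of race entirely (a sketch, not the project's actual fix) is to fetch the NLTK resources once in the parent process, before the worker pool is created, so no two workers ever download or unzip the same package at the same time:

    import nltk

    def ensure_nltk_resources() -> None:
        # Run in the parent process, before multiprocessing.Pool is created,
        # so only one process ever writes to the nltk_data directory
        for resource, path in [
            ("punkt", "tokenizers/punkt"),
            ("averaged_perceptron_tagger", "taggers/averaged_perceptron_tagger"),
        ]:
            try:
                nltk.data.find(path)
            except LookupError:
                nltk.download(resource)

Calling this once before the pool is constructed means each worker simply finds the data already on disk.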

mdeweerd avatar May 21 '23 22:05 mdeweerd

Same issue here with .odt files as well. :(

aceilort avatar Jun 04 '23 22:06 aceilort