private-gpt
Optimize load_documents function with multiprocessing
This pull request introduces multiprocessing in the load_documents()
function to improve performance by utilizing multiple cores for document loading.
Key Changes
- The load_documents() function now uses Python's multiprocessing module, allowing documents to be loaded concurrently across multiple CPU cores.
- This modification takes advantage of machines with multiple cores and/or hyper-threading capabilities, which can significantly reduce document loading times.
- The performance gain is most noticeable when loading a large number of documents.
This update is expected to significantly speed up the document loading process, contributing to overall system efficiency and user experience. Please review and provide your feedback.
^^^This commit message is generated by ChatGPT^^^
This seems duplicated with https://github.com/imartinez/privateGPT/pull/255, but I prefer to use os.cpu_count() and all cores.
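A minimal sketch of what an os.cpu_count()-based load_documents() could look like. This is illustrative only, not the PR's exact code: load_single_document here is a hypothetical stand-in for the project's per-extension loaders.

```python
import os
from multiprocessing import Pool


def load_single_document(file_path: str) -> str:
    # Hypothetical stand-in: the real project dispatches to a loader
    # based on the file extension (PDF, Markdown, ...).
    with open(file_path, encoding="utf-8", errors="ignore") as f:
        return f.read()


def load_documents(file_paths: list[str]) -> list[str]:
    # One worker per logical core, as suggested above; Pool.map
    # preserves input order in its results.
    with Pool(processes=os.cpu_count()) as pool:
        return pool.map(load_single_document, file_paths)
```

Note that each worker imports the module fresh, so any per-process setup (such as NLTK data downloads triggered by unstructured) runs once per worker rather than once overall.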
Enhanced the load_documents() function by adding a progress bar using the tqdm library.
This change improves user experience by providing real-time feedback on the progress of document loading. Now, users can easily track the progress of this operation, especially when loading a large number of documents.
This also solves https://github.com/imartinez/privateGPT/issues/257
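A sketch of how tqdm can report per-document progress over a worker pool, assuming Pool.imap so the bar advances as each worker finishes (names are illustrative, not the PR's exact code):

```python
import os
from multiprocessing import Pool

from tqdm import tqdm


def load_single_document(file_path: str) -> str:
    # Illustrative loader; the real one dispatches on file extension.
    with open(file_path, encoding="utf-8", errors="ignore") as f:
        return f.read()


def load_documents(file_paths: list[str]) -> list[str]:
    results = []
    with Pool(processes=os.cpu_count()) as pool:
        # imap yields results as workers finish, so the bar advances
        # per document instead of jumping to 100% at the end.
        with tqdm(total=len(file_paths), desc="Loading new documents") as pbar:
            for doc in pool.imap(load_single_document, file_paths):
                results.append(doc)
                pbar.update()
    return results
```

Unlike imap_unordered, plain imap keeps results in input order, at the cost of occasionally waiting on a slow document before the bar can advance.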
FYI, this breaks ingestion when importing Markdown files, for instance:
```
[nltk_data] Downloading package punkt to /home/privategpt/nltk_data...
[nltk_data] Downloading package punkt to /home/privategpt/nltk_data...
[nltk_data] Downloading package punkt to /home/privategpt/nltk_data...
[... same download line repeated once per worker process ...]
[nltk_data] Unzipping tokenizers/punkt.zip.
[nltk_data] Unzipping tokenizers/punkt.zip.
Loading new documents:   0%|          | 0/33 [00:04<?, ?it/s]
multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/site-packages/unstructured/nlp/tokenize.py", line 21, in _download_nltk_package_if_not_present
    nltk.find(f"{package_category}/{package_name}")
  File "/usr/local/lib/python3.10/site-packages/nltk/data.py", line 583, in find
    raise LookupError(resource_not_found)
LookupError:
**********************************************************************
  Resource punkt not found.
  Please use the NLTK Downloader to obtain the resource:

  >>> import nltk
  >>> nltk.download('punkt')

  For more information see: https://www.nltk.org/data.html

  Attempted to load tokenizers/punkt

  Searched in:
    - '/home/privategpt/nltk_data'
    - '/usr/local/nltk_data'
    - '/usr/local/share/nltk_data'
    - '/usr/local/lib/nltk_data'
    - '/usr/share/nltk_data'
    - '/usr/local/share/nltk_data'
    - '/usr/lib/nltk_data'
    - '/usr/local/lib/nltk_data'
**********************************************************************
```
That first error can be fixed with:

```python
import nltk

# Avoid conflicts when several workers download in parallel
try:
    nltk.data.find("tokenizers/punkt")
except LookupError:
    try:
        nltk.download("punkt")
    except FileExistsError as error:
        logger.debug(f"NLTK punkt tokenizer already downloaded? Error message: {error}")
```
but then other errors occur:

```
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /home/privategpt/nltk_data...
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /home/privategpt/nltk_data...
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /home/privategpt/nltk_data...
...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.
...
  File "/usr/local/lib/python3.10/site-packages/nltk/data.py", line 755, in load
    resource_val = pickle.load(opened_resource)
_pickle.UnpicklingError: invalid load key, '\x00'.
```
Same issue here with .odt files as well. :(