private-gpt
Cannot ingest markdown files
Describe the bug and how to reproduce it
I was trying to ingest Markdown files from one of my documentation sets. I have tried several different Markdown files, and they all fail with the same error. I am using the latest commit from the main branch.
This is what my .env looks like:
PERSIST_DIRECTORY=/Users/victor/pyprojects/privateGPT/db
EMBEDDINGS_MODEL_NAME=all-mpnet-base-v2
MODEL_TYPE=GPT4All
MODEL_PATH=/Users/victor/local_llms/ggml-gpt4all-j-v1.3-groovy.bin
MODEL_N_CTX=1000
Expected behavior
- Ingest runs through without issues.
Environment (please complete the following information):
- OS / hardware: macOS 13.4 (Intel i9)
- Python version 3.11.3
Additional context
Creating new vectorstore
Loading documents from source_documents
Loading new documents: 0%| | 0/1 [00:05<?, ?it/s]
multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
File "/Users/victor/.pyenv/versions/3.11.3/lib/python3.11/multiprocessing/pool.py", line 125, in worker
result = (True, func(*args, **kwds))
^^^^^^^^^^^^^^^^^^^
File "/Users/victor/pyprojects/privateGPT/ingest.py", line 89, in load_single_document
return loader.load()[0]
^^^^^^^^^^^^^
File "/Users/victor/.pyenv/versions/3.11.3/lib/python3.11/site-packages/langchain/document_loaders/unstructured.py", line 70, in load
elements = self._get_elements()
^^^^^^^^^^^^^^^^^^^^
File "/Users/victor/.pyenv/versions/3.11.3/lib/python3.11/site-packages/langchain/document_loaders/markdown.py", line 12, in _get_elements
from unstructured.partition.md import partition_md
File "/Users/victor/.pyenv/versions/3.11.3/lib/python3.11/site-packages/unstructured/partition/md.py", line 9, in
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/Users/victor/pyprojects/privateGPT/ingest.py", line 167, in
Same issue on Mac M1, also with python 3.11.3. Renamed every .md file to .txt and that works.
Same issue linux mint 21.1, python 3.10.
I think this error affects both UnstructuredHTMLLoader and UnstructuredMarkdownLoader.
https://github.com/hwchase17/langchain/issues/5264
I created a Markdown loader by copying TextLoader. It uses marko to convert the Markdown to HTML and then BeautifulSoup to extract the text. It seems to be working for me:
https://github.com/abhishekbhakat/privateGPT/tree/main
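For anyone who wants the idea without pulling the fork: the convert-to-HTML-then-strip-tags step can be sketched with only the standard library. Here `html.parser` stands in for BeautifulSoup, and the HTML input is hand-written rather than produced by `marko.convert()`; the `TextExtractor` and `html_to_text` names are illustrative, not taken from the fork.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect the text content of an HTML document,
    roughly like BeautifulSoup's get_text()."""
    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Called for each run of text between tags.
        self.chunks.append(data)

def html_to_text(html: str) -> str:
    parser = TextExtractor()
    parser.feed(html)
    return "".join(parser.chunks)

# In the fork, the HTML would come from marko.convert(markdown_source);
# here we feed a hand-written snippet instead.
print(html_to_text("<h1>Title</h1><p>Hello <em>world</em></p>"))
# -> TitleHello world
```

The stdlib parser is more forgiving of malformed HTML than lxml-based paths, which is part of why this sidesteps the crash inside unstructured.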
> Same issue on Mac M1, also with python 3.11.3. Renamed every .md file to .txt and that works.
You can also just replace `".md": (UnstructuredMarkdownLoader, {}),` with `".md": (TextLoader, {}),` inside ingest.py, which is effectively the same thing, and you aren't renaming your files.
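If your checkout still uses a `LOADER_MAPPING` dict in ingest.py (the dict name and surrounding entries may differ between commits), the swap looks roughly like this; the `encoding` kwarg is an assumption mirroring the existing `.txt` entry, not required for the fix:

```python
# ingest.py — treat Markdown as plain text so ingestion never reaches
# unstructured's partition_md import, at the cost of Markdown-aware parsing.
# TextLoader is already imported at the top of ingest.py.
LOADER_MAPPING = {
    # ...
    # ".md": (UnstructuredMarkdownLoader, {}),  # fails as in the tracebacks above
    ".md": (TextLoader, {"encoding": "utf8"}),  # workaround
    # ...
}
```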
That worked for me
I'm actually getting a different error, but renaming to .txt seems to fix this as well. I copied the current README.md for the project into the source_documents folder to test this. I just cloned the project this morning and I am running python 3.11.3 with an M1 Mac.
Here is the error I'm getting:
Creating new vectorstore
Loading documents from source_documents
Loading new documents: 67%|██████████████▋ | 2/3 [00:02<00:01, 1.40s/it]
multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
File "/Users/josh/.pyenv/versions/3.11.3/lib/python3.11/multiprocessing/pool.py", line 125, in worker
result = (True, func(*args, **kwds))
^^^^^^^^^^^^^^^^^^^
File "/Users/josh/Projects/Coding/privateGPT/ingest.py", line 89, in load_single_document
return loader.load()[0]
^^^^^^^^^^^^^
File "/Users/josh/.pyenv/versions/3.11.3/lib/python3.11/site-packages/langchain/document_loaders/unstructured.py", line 71, in load
elements = self._get_elements()
^^^^^^^^^^^^^^^^^^^^
File "/Users/josh/.pyenv/versions/3.11.3/lib/python3.11/site-packages/langchain/document_loaders/markdown.py", line 25, in _get_elements
return partition_md(filename=self.file_path, **self.unstructured_kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/josh/.pyenv/versions/3.11.3/lib/python3.11/site-packages/unstructured/partition/md.py", line 52, in partition_md
return partition_html(
^^^^^^^^^^^^^^^
File "/Users/josh/.pyenv/versions/3.11.3/lib/python3.11/site-packages/unstructured/partition/html.py", line 91, in partition_html
layout_elements = document_to_element_list(document, include_page_breaks=include_page_breaks)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/josh/.pyenv/versions/3.11.3/lib/python3.11/site-packages/unstructured/partition/common.py", line 73, in document_to_element_list
num_pages = len(document.pages)
^^^^^^^^^^^^^^
File "/Users/josh/.pyenv/versions/3.11.3/lib/python3.11/site-packages/unstructured/documents/xml.py", line 52, in pages
self._pages = self._read()
^^^^^^^^^^^^
File "/Users/josh/.pyenv/versions/3.11.3/lib/python3.11/site-packages/unstructured/documents/html.py", line 116, in _read
element = _parse_tag(tag_elem)
^^^^^^^^^^^^^^^^^^^^
File "/Users/josh/.pyenv/versions/3.11.3/lib/python3.11/site-packages/unstructured/documents/html.py", line 222, in _parse_tag
return _text_to_element(text, tag_elem.tag, ancestortags)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/josh/.pyenv/versions/3.11.3/lib/python3.11/site-packages/unstructured/documents/html.py", line 237, in _text_to_element
elif is_narrative_tag(text, tag):
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/josh/.pyenv/versions/3.11.3/lib/python3.11/site-packages/unstructured/documents/html.py", line 265, in is_narrative_tag
return tag not in HEADING_TAGS and is_possible_narrative_text(text)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/josh/.pyenv/versions/3.11.3/lib/python3.11/site-packages/unstructured/partition/text_type.py", line 86, in is_possible_narrative_text
if (sentence_count(text, 3) < 2) and (not contains_verb(text)) and language == "en":
^^^^^^^^^^^^^^^^^^^
File "/Users/josh/.pyenv/versions/3.11.3/lib/python3.11/site-packages/unstructured/partition/text_type.py", line 189, in contains_verb
pos_tags = pos_tag(text)
^^^^^^^^^^^^^
File "/Users/josh/.pyenv/versions/3.11.3/lib/python3.11/site-packages/unstructured/nlp/tokenize.py", line 57, in pos_tag
parts_of_speech.extend(_pos_tag(tokens))
^^^^^^^^^^^^^^^^
File "/Users/josh/.pyenv/versions/3.11.3/lib/python3.11/site-packages/nltk/tag/__init__.py", line 165, in pos_tag
tagger = _get_tagger(lang)
^^^^^^^^^^^^^^^^^
File "/Users/josh/.pyenv/versions/3.11.3/lib/python3.11/site-packages/nltk/tag/__init__.py", line 107, in _get_tagger
tagger = PerceptronTagger()
^^^^^^^^^^^^^^^^^^
File "/Users/josh/.pyenv/versions/3.11.3/lib/python3.11/site-packages/nltk/tag/perceptron.py", line 169, in __init__
self.load(AP_MODEL_LOC)
File "/Users/josh/.pyenv/versions/3.11.3/lib/python3.11/site-packages/nltk/tag/perceptron.py", line 252, in load
self.model.weights, self.tagdict, self.classes = load(loc)
^^^^^^^^^
File "/Users/josh/.pyenv/versions/3.11.3/lib/python3.11/site-packages/nltk/data.py", line 755, in load
resource_val = pickle.load(opened_resource)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
_pickle.UnpicklingError: pickle data was truncated
"""
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/Users/josh/Projects/Coding/privateGPT/ingest.py", line 167, in <module>
main()
File "/Users/josh/Projects/Coding/privateGPT/ingest.py", line 157, in main
texts = process_documents()
^^^^^^^^^^^^^^^^^^^
File "/Users/josh/Projects/Coding/privateGPT/ingest.py", line 119, in process_documents
documents = load_documents(source_directory, ignored_files)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/josh/Projects/Coding/privateGPT/ingest.py", line 108, in load_documents
for i, doc in enumerate(pool.imap_unordered(load_single_document, filtered_files)):
File "/Users/josh/.pyenv/versions/3.11.3/lib/python3.11/multiprocessing/pool.py", line 873, in next
raise value
_pickle.UnpicklingError: pickle data was truncated
I'm getting the same error as @joshrouwhorst when I try to ingest .html files. A single PDF ingests fine, but adding these additional files doesn't work. Renaming the files to .txt also fixes the problem.
I'm getting a similar result with EPUB files.
Hi @joshrouwhorst and @andrewchch, I found a Stack Overflow question that solves your problem: https://stackoverflow.com/questions/56049033/what-can-be-the-reasons-of-having-an-unpicklingerror-while-running-pos-tag-fro Just run
import nltk
nltk.download('averaged_perceptron_tagger')
in a Python shell. It works well for me on Ubuntu 20.04 when ingesting MS Word documents. I'm not sure whether it also works for .zip files, since my data is in .docx format. You can try it, @vicdotdevelop.
Cannot upload EPUB files to work with.
Error: Please install extra dependencies that are required for the EpubReader:
pip install EbookLib html2text
Even when I install the required dependencies mentioned above, I still get the same error on every attempt.