Cannot load python files for Directory Loader
System Info
I am using LangChain version 0.0.171.
I am running on a 2021 M1 Mac with macOS Ventura. Most LangChain operations work without errors, except for this issue. Python 3.11 is installed through pyenv.
aiohttp==3.8.4
aiosignal==1.3.1
anyio==3.6.2
appnope==0.1.3
argilla==1.7.0
argon2-cffi==21.3.0
argon2-cffi-bindings==21.2.0
arrow==1.2.3
asttokens==2.2.1
async-timeout==4.0.2
attrs==23.1.0
backcall==0.2.0
backoff==2.2.1
beautifulsoup4==4.12.2
-e git+ssh://[email protected]/mad-start/big-macs-llm.git@2998ca685b68d74ef20a12fe74c0f4cab6e48dcb#egg=big_macs_llm
bleach==6.0.0
certifi==2023.5.7
cffi==1.15.1
charset-normalizer==3.1.0
click==8.1.3
comm==0.1.3
commonmark==0.9.1
contourpy==1.0.7
cryptography==40.0.2
cycler==0.11.0
dataclasses-json==0.5.7
datasets==2.12.0
debugpy==1.6.7
decorator==5.1.1
defusedxml==0.7.1
Deprecated==1.2.13
dill==0.3.6
einops==0.6.1
et-xmlfile==1.1.0
executing==1.2.0
fastjsonschema==2.16.3
filelock==3.12.0
fonttools==4.39.4
fqdn==1.5.1
frozenlist==1.3.3
fsspec==2023.5.0
h11==0.14.0
httpcore==0.16.3
httpx==0.23.3
huggingface-hub==0.14.1
idna==3.4
iniconfig==2.0.0
ipykernel==6.23.1
ipython==8.13.2
ipython-genutils==0.2.0
ipywidgets==8.0.6
isoduration==20.11.0
jedi==0.18.2
Jinja2==3.1.2
joblib==1.2.0
jsonpointer==2.3
jsonschema==4.17.3
jupyter==1.0.0
jupyter-console==6.6.3
jupyter-events==0.6.3
jupyter_client==8.2.0
jupyter_core==5.3.0
jupyter_server==2.5.0
jupyter_server_terminals==0.4.4
jupyterlab-pygments==0.2.2
jupyterlab-widgets==3.0.7
kiwisolver==1.4.4
langchain==0.0.171
lxml==4.9.2
Markdown==3.4.3
MarkupSafe==2.1.2
marshmallow==3.19.0
marshmallow-enum==1.5.1
matplotlib==3.7.1
matplotlib-inline==0.1.6
mistune==2.0.5
monotonic==1.6
mpmath==1.3.0
msg-parser==1.2.0
multidict==6.0.4
multiprocess==0.70.14
mypy-extensions==1.0.0
nbclassic==1.0.0
nbclient==0.7.4
nbconvert==7.4.0
nbformat==5.8.0
nest-asyncio==1.5.6
networkx==3.1
nltk==3.8.1
notebook==6.5.4
notebook_shim==0.2.3
numexpr==2.8.4
numpy==1.23.5
olefile==0.46
openai==0.27.6
openapi-schema-pydantic==1.2.4
openpyxl==3.1.2
packaging==23.1
pandas==1.5.3
pandocfilters==1.5.0
parso==0.8.3
pdf2image==1.16.3
pdfminer.six==20221105
pexpect==4.8.0
pickleshare==0.7.5
Pillow==9.5.0
platformdirs==3.5.1
pluggy==1.0.0
prometheus-client==0.16.0
prompt-toolkit==3.0.38
psutil==5.9.5
ptyprocess==0.7.0
pure-eval==0.2.2
pyarrow==12.0.0
pycparser==2.21
pydantic==1.10.7
Pygments==2.15.1
pypandoc==1.11
pyparsing==3.0.9
pyrsistent==0.19.3
pytest==7.3.1
python-dateutil==2.8.2
python-docx==0.8.11
python-dotenv==1.0.0
python-json-logger==2.0.7
python-magic==0.4.27
python-pptx==0.6.21
pytz==2023.3
PyYAML==6.0
pyzmq==25.0.2
qtconsole==5.4.3
QtPy==2.3.1
regex==2023.5.5
requests==2.30.0
responses==0.18.0
rfc3339-validator==0.1.4
rfc3986==1.5.0
rfc3986-validator==0.1.1
rich==13.0.1
scikit-learn==1.2.2
scipy==1.10.1
Send2Trash==1.8.2
sentence-transformers==2.2.2
sentencepiece==0.1.99
six==1.16.0
sniffio==1.3.0
soupsieve==2.4.1
SQLAlchemy==2.0.13
stack-data==0.6.2
sympy==1.12
tabulate==0.9.0
tenacity==8.2.2
terminado==0.17.1
text-generation==0.5.2
threadpoolctl==3.1.0
tiktoken==0.4.0
tinycss2==1.2.1
tokenizers==0.13.3
torch==2.0.1
torchvision==0.15.2
tornado==6.3.2
tqdm==4.65.0
traitlets==5.9.0
transformers==4.29.2
typer==0.9.0
typing-inspect==0.8.0
typing_extensions==4.5.0
tzdata==2023.3
unstructured==0.6.8
uri-template==1.2.0
urllib3==2.0.2
wcwidth==0.2.6
webcolors==1.13
webencodings==0.5.1
websocket-client==1.5.1
widgetsnbextension==4.0.7
wrapt==1.14.1
XlsxWriter==3.1.0
xxhash==3.2.0
yarl==1.9.2
Who can help?
@eyurtsev Thank you:
I got this code from the LangChain instructions here. I can load and split a single Python file at a time, but I cannot do so with a DirectoryLoader whose glob pattern contains *.py. I tested this without LangChain and it worked fine.
from langchain.document_loaders.text import TextLoader
from langchain.document_loaders.directory import DirectoryLoader
loader = DirectoryLoader('../../../src', glob="**/*.py", loader_cls=TextLoader)
directory_loader.load()
and
from langchain.document_loaders.directory import DirectoryLoader
from langchain.document_loaders import PythonLoader
loader = DirectoryLoader('../../../../../', glob="**/*.py", loader_cls=PythonLoader)
directory_loader.load()
yields an error:
ValueError: Invalid file ../../../src/my_library/__init__.py. The FileType.UNK file type is not supported in partition.
I looked up this error on the unstructured issues page, then ran the following code with unstructured directly. It did not raise an error and displayed the contents of the Python module.
from unstructured.partition.text import partition_text
elements = partition_text(filename='setup.py')
print("\n\n".join([str(el) for el in elements]))
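For comparison, reading the same files with neither LangChain nor unstructured can be sketched with the standard library alone. This is only an illustration of the "without LangChain" check mentioned above: the function name `read_python_sources` and the assumption that the files are UTF-8 are mine, not from the original report.

```python
from pathlib import Path


def read_python_sources(root: str, pattern: str = "**/*.py") -> dict[str, str]:
    """Return {path: source text} for every file under root that matches pattern."""
    sources: dict[str, str] = {}
    for path in sorted(Path(root).glob(pattern)):
        if path.is_file():
            # Assumption: sources are UTF-8. Undecodable bytes are replaced
            # instead of raising, so one odd file does not stop the walk.
            sources[str(path)] = path.read_text(encoding="utf-8", errors="replace")
    return sources
```

Calling `read_python_sources('../../../src')` with the same root as the first snippet would show whether every `*.py` file, including `__init__.py`, is readable as plain text.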
Information
- [ ] The official example notebooks/scripts
- [ ] My own modified scripts
Related Components
- [ ] LLMs/Chat Models
- [ ] Embedding Models
- [ ] Prompts / Prompt Templates / Prompt Selectors
- [ ] Output Parsers
- [X] Document Loaders
- [ ] Vector Stores / Retrievers
- [ ] Memory
- [ ] Agents / Agent Executors
- [ ] Tools / Toolkits
- [ ] Chains
- [ ] Callbacks/Tracing
- [ ] Async
Reproduction
More detail is given above, but the error can be reproduced as follows:
from langchain.document_loaders.directory import DirectoryLoader
from langchain.document_loaders import PythonLoader
loader = DirectoryLoader('../../../../../', glob="**/*.py", loader_cls=PythonLoader)
directory_loader.load()
Expected behavior
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
Cell In[73], line 6
2 from langchain.document_loaders import PythonLoader
5 loader = DirectoryLoader('../../../../../', glob="**/*.py", loader_cls=PythonLoader)
----> 6 directory_loader.load()
File ~/.pyenv/versions/3.11.3/envs/big-macs-llm/lib/python3.11/site-packages/langchain/document_loaders/directory.py:103, in DirectoryLoader.load(self)
101 else:
102 for i in items:
--> 103 self.load_file(i, p, docs, pbar)
105 if pbar:
106 pbar.close()
File ~/.pyenv/versions/3.11.3/envs/big-macs-llm/lib/python3.11/site-packages/langchain/document_loaders/directory.py:69, in DirectoryLoader.load_file(self, item, path, docs, pbar)
67 logger.warning(e)
68 else:
---> 69 raise e
70 finally:
71 if pbar:
File ~/.pyenv/versions/3.11.3/envs/big-macs-llm/lib/python3.11/site-packages/langchain/document_loaders/directory.py:63, in DirectoryLoader.load_file(self, item, path, docs, pbar)
61 if _is_visible(item.relative_to(path)) or self.load_hidden:
62 try:
---> 63 sub_docs = self.loader_cls(str(item), **self.loader_kwargs).load()
64 docs.extend(sub_docs)
65 except Exception as e:
File ~/.pyenv/versions/3.11.3/envs/big-macs-llm/lib/python3.11/site-packages/langchain/document_loaders/unstructured.py:70, in UnstructuredBaseLoader.load(self)
68 def load(self) -> List[Document]:
69 """Load file."""
---> 70 elements = self._get_elements()
71 if self.mode == "elements":
72 docs: List[Document] = list()
File ~/.pyenv/versions/3.11.3/envs/big-macs-llm/lib/python3.11/site-packages/langchain/document_loaders/unstructured.py:104, in UnstructuredFileLoader._get_elements(self)
101 def _get_elements(self) -> List:
102 from unstructured.partition.auto import partition
--> 104 return partition(filename=self.file_path, **self.unstructured_kwargs)
File ~/.pyenv/versions/3.11.3/envs/big-macs-llm/lib/python3.11/site-packages/unstructured/partition/auto.py:206, in partition(filename, content_type, file, file_filename, url, include_page_breaks, strategy, encoding, paragraph_grouper, headers, ssl_verify, ocr_languages, pdf_infer_table_structure, xml_keep_tags)
204 else:
205 msg = "Invalid file" if not filename else f"Invalid file {filename}"
--> 206 raise ValueError(f"{msg}. The {filetype} file type is not supported in partition.")
208 for element in elements:
209 element.metadata.url = url
ValueError: Invalid file ../../../src/__init__.py. The FileType.UNK file type is not supported in partition.
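One possible reading of this traceback, offered as a hypothesis rather than a confirmed diagnosis: the frames run through `UnstructuredFileLoader._get_elements` and `unstructured.partition.auto.partition`, and none of them mention `PythonLoader` or `TextLoader`. The snippets assign the new loader to `loader` but call `directory_loader.load()`, and the failing path begins with `../../../src` rather than `../../../../../`. So the call may be reaching an older `DirectoryLoader` object left over from an earlier notebook cell. The sketch below extracts the file and function name from each frame line so the loader class that actually ran is easy to spot. `list_frames`, `mentions_loader`, and the regular expression are illustrative helpers of my own, not part of LangChain or unstructured.

```python
import re

# Matches IPython-style frame headers such as:
#   File ~/.../directory.py:103, in DirectoryLoader.load(self)
FRAME_RE = re.compile(
    r"^File (?P<path>\S+):(?P<line>\d+), in (?P<func>[\w.]+)", re.MULTILINE
)


def list_frames(traceback_text: str) -> list[tuple[str, str]]:
    """Return (file name, qualified function name) for each frame header."""
    return [
        (m["path"].rsplit("/", 1)[-1], m["func"])
        for m in FRAME_RE.finditer(traceback_text)
    ]


def mentions_loader(traceback_text: str, loader_name: str) -> bool:
    """True if any frame's function name contains loader_name."""
    return any(loader_name in func for _, func in list_frames(traceback_text))
```

If `mentions_loader(tb, "PythonLoader")` is false for the pasted traceback while `mentions_loader(tb, "UnstructuredFileLoader")` is true, that is consistent with the variable-name mismatch. In that case, calling `.load()` on the object just assigned to `loader` would be the thing to try.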