langchain
DirectoryLoader doesn't support including unix file patterns
Problem
The current DirectoryLoader class relies on Python's glob and rglob utilities to collect the file paths. These utilities accept a single pattern and don't support advanced file patterns, such as matching files with several different extensions. For example, consider a sample directory with these files:
- a.py
- b.js
- c.json
- d.yml
Currently, there is no way to load only the files with a .py or .yml extension.
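The limitation can be reproduced with the standard library alone. The sketch below recreates the sample directory in a temporary folder and assumes the loader hands one pattern string to pathlib's Path.glob. The glob_many helper is hypothetical and only illustrates a manual workaround; it is not part of LangChain.

```python
import tempfile
from pathlib import Path


def glob_many(root: Path, patterns: list[str]) -> list[str]:
    """Hypothetical helper: merge the results of several single-pattern globs."""
    matched = set()
    for pattern in patterns:
        matched.update(p.name for p in root.glob(pattern))
    return sorted(matched)


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for name in ("a.py", "b.js", "c.json", "d.yml"):
        (root / name).touch()

    # A single glob pattern selects one extension at a time.
    only_py = sorted(p.name for p in root.glob("*.py"))

    # pathlib treats braces literally, so shell-style alternation matches nothing here.
    braces = sorted(p.name for p in root.glob("*.{py,yml}"))

    # Workaround: run one glob per pattern and merge the results.
    py_and_yml = glob_many(root, ["*.py", "*.yml"])
```

Merging several globs works, but every caller has to write that loop themselves, which is the gap this issue is about.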
Proposed Solution
Preferred
Include the wcmatch library as a dependency to replace the built-in glob and rglob. It supports the Unix-style options for specifying file patterns. For example, with wcmatch, users can pass a pattern list like ['*.py', '*.yml'] to include files with a .py or .yml extension.
Alternate
Add an include or exclude list to the DirectoryLoader interface, so that users can specify the file patterns to include or exclude.
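As a rough illustration of the alternate approach, here is a minimal sketch of include/exclude filtering built on the standard library's fnmatch. The function name filter_paths and its include and exclude parameters are assumptions made for this example; they are not the actual DirectoryLoader interface.

```python
import tempfile
from fnmatch import fnmatch
from pathlib import Path


def filter_paths(root, include=("*",), exclude=()):
    """Hypothetical helper: keep files matching any include pattern and no exclude pattern."""
    selected = []
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file():
            continue
        name = path.name
        # A file must match at least one include pattern...
        if not any(fnmatch(name, pattern) for pattern in include):
            continue
        # ...and must not match any exclude pattern.
        if any(fnmatch(name, pattern) for pattern in exclude):
            continue
        selected.append(path)
    return selected


with tempfile.TemporaryDirectory() as tmp:
    for name in ("a.py", "b.js", "c.json", "d.yml"):
        Path(tmp, name).touch()

    included = [p.name for p in filter_paths(tmp, include=["*.py", "*.yml"])]
    excluded = [p.name for p in filter_paths(tmp, exclude=["*.js", "*.json"])]
```

Both calls select a.py and d.yml from the sample directory. Matching on the file name only keeps the sketch short; matching on relative paths would be needed to include or exclude whole subdirectories.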
Hi @3coins! I was actually facing the same issue and am glad I'm not the only one wishing for multi-pattern globbing :)
I took the liberty of creating a PR to address this issue. I love the solutions you have proposed, and in my PR I actually went with the alternate solution. While I agree wcmatch is pretty neat, I hesitate to introduce new dependencies into the project when we could handle it ourselves. (To be fair though, wcmatch seems to be a single-dependency library and we wouldn't expect it to break much.) Maybe @eyurtsev can advise?
Let me know what you think and happy to receive any feedback and/or edits!
Just some food for thought I had while thinking about this issue... I'd imagine that one might usually incorporate LangChain in a larger application, and in terms of system design, one might actually prefer to handle the globbing or filtering of file types at the upstream application level? Basically abstracting it away from LangChain.
What do you think?
Still, no harm in having this issue as a nice-to-have feature :)
Hi, @3coins! I'm Dosu, and I'm helping the LangChain team manage their backlog. I wanted to let you know that we are marking this issue as stale.
From what I understand, the issue is about the DirectoryLoader class not supporting advanced file patterns. Two solutions have been proposed: adding the wcmatch library as a dependency, or adding an include or exclude list to the DirectoryLoader interface. User marcusyatim has created a pull request to address the issue and also suggests that file filtering could be handled at the upstream application level.
Before we close this issue, we wanted to check with you if it is still relevant to the latest version of the LangChain repository. If it is, please let us know by commenting on the issue. Otherwise, feel free to close the issue yourself or it will be automatically closed in 7 days.
Thank you for your contribution and understanding!