DirectoryLoader doesn't support including unix file patterns

Open 3coins opened this issue 1 year ago • 2 comments

Problem

The current DirectoryLoader class relies on Python's glob and rglob utilities to collect file paths. These utilities don't support advanced file patterns, such as matching files with any of several extensions. For example, consider a sample directory with these files:

- a.py
- b.js
- c.json
- d.yml

Currently, there is no way to load only the files with .py or .yml extension.
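To make the limitation concrete, here is a minimal sketch using only the standard library's pathlib, which provides glob and rglob. It recreates the sample directory in a temporary folder. The helper names make_sample_dir and glob_names are hypothetical and exist only for this illustration. A single pattern works, a brace pattern is treated literally and matches nothing, and the only way to get both extensions is to run one glob per pattern and merge the results by hand.

```python
import tempfile
from pathlib import Path
from typing import List


def make_sample_dir(root: Path) -> None:
    """Create the four sample files from the issue (hypothetical helper)."""
    for name in ("a.py", "b.js", "c.json", "d.yml"):
        (root / name).touch()


def glob_names(root: Path, pattern: str) -> List[str]:
    """Return the sorted file names that Path.glob matches for one pattern."""
    return sorted(p.name for p in root.glob(pattern))


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    make_sample_dir(root)

    # A single-extension pattern works as expected.
    single = glob_names(root, "*.py")

    # Shell-style brace expansion is not supported by pathlib:
    # the braces are matched literally, so nothing is found.
    braces = glob_names(root, "*.{py,yml}")

    # Workaround: one glob call per pattern, merged manually.
    union = sorted(set(glob_names(root, "*.py")) | set(glob_names(root, "*.yml")))

print(single, braces, union)
```

Since DirectoryLoader takes a single glob string, this manual merge is not something a caller can express through the loader itself, which is the gap this issue describes.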

Proposed Solution

Preferred

Include the wcmatch library as a dependency to replace the built-in glob and rglob; it supports the Unix-style options for specifying file patterns. For example, with wcmatch, users can pass a pattern list like ['*.py', '*.yml'] to include files with a .py or .yml extension.

Alternate

Add an include or exclude list to the DirectoryLoader interface, so that users can specify the file patterns to include or exclude.
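As a rough sketch of what an include/exclude filter could look like without a new dependency, the snippet below uses the standard library's fnmatch to test file names against lists of patterns. The function name filter_paths and its include, exclude, and recursive parameters are assumptions made for illustration. They are not DirectoryLoader's actual interface, nor necessarily what any pull request implements.

```python
import tempfile
from fnmatch import fnmatch
from pathlib import Path
from typing import List, Sequence


def filter_paths(
    root: Path,
    include: Sequence[str] = ("*",),
    exclude: Sequence[str] = (),
    recursive: bool = False,
) -> List[Path]:
    """Hypothetical helper: keep files whose name matches any include
    pattern and no exclude pattern."""
    candidates = root.rglob("*") if recursive else root.glob("*")
    selected = []
    for path in candidates:
        if not path.is_file():
            continue  # skip directories
        if not any(fnmatch(path.name, pat) for pat in include):
            continue  # matched none of the include patterns
        if any(fnmatch(path.name, pat) for pat in exclude):
            continue  # explicitly excluded
        selected.append(path)
    return sorted(selected)


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for name in ("a.py", "b.js", "c.json", "d.yml"):
        (root / name).touch()

    # Only .py or .yml files.
    included = [p.name for p in filter_paths(root, include=["*.py", "*.yml"])]

    # Everything except .json files.
    not_json = [p.name for p in filter_paths(root, exclude=["*.json"])]

print(included, not_json)
```

Matching on the file name alone keeps the sketch short; patterns that need to match directory components would have to be compared against the path relative to root instead.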

3coins avatar Apr 18 '23 05:04 3coins

Hi @3coins ! I was actually facing the same issue and am glad I'm not the only one to wish for multi patterns globbing :)

I took the liberty of creating a PR to address this issue. I love the solutions you have proposed and in my PR I actually went with the alternate solution. While I agree wcmatch is pretty neat, I hesitate at introducing new dependencies into the project, when we could handle it ourselves? (To be fair though, wcmatch seems to be a single dependency library and we wouldn't expect it to break much). Maybe @eyurtsev can advise?

Let me know what you think and happy to receive any feedback and/or edits!

marcusyatim avatar May 17 '23 09:05 marcusyatim

Just some food for thought I had while thinking about this issue... I'd imagine that one might usually incorporate LangChain in a larger application, and in terms of system design, one might actually prefer to handle the globbing or filtering of file types at the upstream application level? Basically abstracting it away from LangChain.

What do you think?

Still, no harm in having this issue as a nice-to-have feature :)

marcusyatim avatar May 17 '23 10:05 marcusyatim

Hi, @3coins! I'm Dosu, and I'm helping the LangChain team manage their backlog. I wanted to let you know that we are marking this issue as stale.

From what I understand, the issue is about the DirectoryLoader class not supporting advanced file patterns. Two solutions have been proposed: adding the wcmatch library as a dependency, or adding an include or exclude list to the DirectoryLoader interface. User marcusyatim has created a pull request to address the issue and also suggests that file filtering could be handled at the upstream application level.

Before we close this issue, we wanted to check with you if it is still relevant to the latest version of the LangChain repository. If it is, please let us know by commenting on the issue. Otherwise, feel free to close the issue yourself or it will be automatically closed in 7 days.

Thank you for your contribution and understanding!

dosubot[bot] avatar Sep 08 '23 16:09 dosubot[bot]