
Add multi file patterns globbing for DirectoryLoader()

Open marcusyatim opened this issue 1 year ago • 2 comments


This PR replaces the old glob arg with a new arg, file_pattern: Optional[set] = None, which specifies the file pattern(s) you want to glob, e.g. {".pdf"} or {".pdf", ".docx"}.

Or, if you want to load all files in the directory, you can simply leave out the arg.

The globbing is still done with Path.glob() or Path.rglob(), as before. The added algorithm globs the directory only once instead of once per pattern, so the cost of walking the directory does not grow with the number of patterns.
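For illustration, here is a minimal sketch of the single-pass idea described above. It is not the code from this PR: the function name iter_matching_files and its recursive flag are hypothetical, and it assumes the patterns are file suffixes such as ".pdf" that are compared case-sensitively against Path.suffix.

```python
from pathlib import Path
from typing import Iterator, Optional, Set


def iter_matching_files(
    directory: str,
    file_pattern: Optional[Set[str]] = None,
    recursive: bool = False,
) -> Iterator[Path]:
    """Yield files under `directory` whose suffix is in `file_pattern`.

    Hypothetical helper: the directory is globbed exactly once, and each
    path is checked against the whole set of suffixes with a set lookup.
    """
    root = Path(directory)
    # One glob call in total, regardless of how many patterns were given.
    walker = root.rglob("*") if recursive else root.glob("*")
    for path in walker:
        if not path.is_file():
            continue  # skip directories and other non-file entries
        # file_pattern=None means "load everything", mirroring the PR text.
        if file_pattern is None or path.suffix in file_pattern:
            yield path
```

Calling iter_matching_files("docs", {".pdf", ".docx"}, recursive=True) would walk docs once and keep only the PDF and DOCX files, whereas calling Path.rglob() once per pattern would walk the tree twice.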

@hwchase17 @eyurtsev

marcusyatim avatar May 17 '23 09:05 marcusyatim

Linking to original issue

marcusyatim avatar May 17 '23 09:05 marcusyatim

@marcusyatim thanks for helping out with the feature request! I outlined a few places where changes are required before we can merge this in.

eyurtsev avatar May 17 '23 15:05 eyurtsev

stale

baskaryan avatar Aug 11 '23 21:08 baskaryan