langchain docs: document_loaders classification

Problem statement: the document_loaders section is too long and hard to comprehend. Proposal: group the document_loaders into 3 classes (see the Files changed tab).

UPDATE: I've completely reworked the document_loader classification. Now this PR changes only one file!

FYI @eyurtsev @hwchase17

May 03 '23 21:05 leo-gan

The idea is that a "knowledge loader" works with storage that we do not control: something that can be used as a "tool" (in LangChain terms), that can be accessed with queries, and that can be considered a source of "external" knowledge. We can allow the LLM to make queries and get information, or we can download documents and use them in a more controllable way. "Formatters" can be as simple as transformers for CSV, SQL, etc., but they can also be cloud services or app stores. They can be hosted outside our control, but the information inside is under our control.

hmm, when i hear formatter i think csv or word... but not something like google drive. google drive could have csv files in it

i would be down to split out the ones which relate to a certain file type, eg csv/pdf/ppt/etc. and then the other ones could load in from various locations (eg from drive or a website etc) and use formatters under the hood

this may be related to some of the stuff @eyurtsev is working on?
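To make the split above concrete, here is a minimal sketch, assuming a location-style loader that picks a format-specific formatter by file extension. All names (`format_csv`, `format_txt`, `FORMATTERS`, `load_from_location`) are hypothetical and are not LangChain API; the `files` dict merely stands in for a remote location such as a drive folder.

```python
import csv
import io
import os
from typing import Callable, Dict, Iterator, List, Tuple

# Hypothetical type: a formatter turns raw bytes of one file into text documents.
Formatter = Callable[[bytes], List[str]]


def format_csv(raw: bytes) -> List[str]:
    """Produce one text document per CSV data row."""
    reader = csv.DictReader(io.StringIO(raw.decode("utf-8")))
    return [", ".join(f"{key}: {value}" for key, value in row.items()) for row in reader]


def format_txt(raw: bytes) -> List[str]:
    """Produce a single text document from a plain-text file."""
    return [raw.decode("utf-8")]


# Registry of file-type formatters, keyed by lowercase file extension.
FORMATTERS: Dict[str, Formatter] = {".csv": format_csv, ".txt": format_txt}


def load_from_location(files: Dict[str, bytes]) -> Iterator[Tuple[str, str]]:
    """Yield (file name, document text) pairs for every file with a known format.

    `files` is an assumption standing in for whatever a real location
    (drive, website, bucket) would return.
    """
    for name, raw in files.items():
        extension = os.path.splitext(name)[1].lower()
        formatter = FORMATTERS.get(extension)
        if formatter is None:
            continue  # unknown file type: skipped in this sketch
        for text in formatter(raw):
            yield name, text
```

In this shape, the location code knows nothing about CSV, and the CSV formatter knows nothing about where the bytes came from, so a drive folder holding CSV files needs no special case.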

May 04 '23 05:05 hwchase17

How about splitting it into 3 classes?

formatters: CSV, PDF, ...

controllable sources: Google Drive, Microsoft Word, Facebook Chat, ...

external sources: Gutenberg, iFixit, ...

I still don't like the class names. That means the mental picture is not good.

what is the definition of those categories? eg why is microsoft word (.docx) not a format?

May 05 '23 04:05 hwchase17

Hello @leo-gan :wave:

Thanks for helping with the docs!

I am slowly making changes to implement the plan that's outlined here: https://github.com/hwchase17/langchain/pull/2833#issuecomment-1509097756

The high-level idea is to decouple the code that loads raw data (bytes) from the code that parses the raw data to generate documents.

It'll still be possible to define arbitrary document loaders, but it'll also become easier to re-use existing parsers in a document loader (or even existing blob loaders). Not sure that this would change the documentation much.
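As a rough illustration of that decoupling, the sketch below composes a loader of raw bytes with a separate parser. It is a sketch under stated assumptions, not the planned implementation: `Blob`, `Document`, `in_memory_blobs`, `line_parser`, and `load_documents` are hypothetical names defined here for illustration only.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator


@dataclass
class Blob:
    """Raw data plus a label for where it came from (hypothetical type)."""
    source: str
    data: bytes


@dataclass
class Document:
    """Parsed text plus the source it was parsed from (hypothetical type)."""
    page_content: str
    source: str


# Stage 1 only fetches bytes; stage 2 only turns bytes into documents.
BlobLoader = Callable[[], Iterable[Blob]]
BlobParser = Callable[[Blob], Iterable[Document]]


def in_memory_blobs() -> Iterator[Blob]:
    """Stand-in blob loader; a real one might read files or call an API."""
    yield Blob(source="notes.txt", data=b"first line\n\nsecond line")


def line_parser(blob: Blob) -> Iterator[Document]:
    """Stand-in parser: one document per non-empty line."""
    for line in blob.data.decode("utf-8").splitlines():
        if line.strip():
            yield Document(page_content=line, source=blob.source)


def load_documents(loader: BlobLoader, parser: BlobParser) -> Iterator[Document]:
    """Compose any blob loader with any parser, lazily."""
    for blob in loader():
        yield from parser(blob)
```

Because the two stages only meet in `load_documents`, the same parser can be reused with a different blob loader, and vice versa.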

May 05 '23 16:05 eyurtsev

@hwchase17 I've completely reworked the document_loader classification. Please, check it out. One good side effect: Now this PR changes only one file.

May 08 '23 15:05 leo-gan

@hwchase17 any comments? If you are busy, maybe @dev2049 can help? TNX

May 12 '23 17:05 leo-gan
