langchain
docs: document_loaders classification
Problem statement: the document_loaders section is too long and hard to comprehend.
Proposal: group document_loaders into 3 classes (see the Files changed tab).
UPDATE: I've completely reworked the document_loader classification. Now this PR changes only one file!
FYI @eyurtsev @hwchase17
The idea is that a "knowledge loader" works with storage that we do not control: something that can be used as a "tool" (in LangChain terms), accessed with queries, and treated as a source of "external" knowledge. We can either let the LLM make queries and get information, or download the documents and use them in a more controlled way. "Formatters" can be as simple as transformers for CSV, SQL, etc., but they can also be cloud services or app stores. Those may be hosted outside our control, but the information inside them is under our control.
hmm, when i think formatter i think csv or word... but not something like google drive. google drive could have csv files in it, for example
i would be down to split out the ones which relate to a certain file type, eg csv/pdf/ppt/etc. the other ones could then load from various locations (eg from drive or a website) and use formatters under the hood
this may be related to some of the stuff @eyurtsev is working on?
How about splitting it into 3 classes?
formatters: CSV, PDF, ...
controllable sources: Google Drive, Microsoft Word, Facebook Chat, ...
external sources: Gutenberg, iFixit, ...
I still don't like the class names. That means the mental picture is not good.
what is the definition of those categories? eg why is microsoft word (.docx) not a format?
Hello @leo-gan :wave:
Thanks for helping with the docs!
I am slowly making changes to implement the plan that's outlined here: https://github.com/hwchase17/langchain/pull/2833#issuecomment-1509097756
The high level is to decouple the code that loads raw data (bytes) from the code that parses the raw data to generate documents.
It'll still be possible to define arbitrary document loaders, but it'll also become easier to re-use existing parsers in a document loader (or even existing blob loaders). Not sure that this would change the documentation much.
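To make the decoupling concrete, here is a minimal sketch of the idea, not the actual implementation from the linked plan. Every name in it (`Blob`, `Document`, `parse_csv`, `parse_text`, `in_memory_blobs`, `load_documents`) is hypothetical and chosen only for illustration. One side yields raw bytes, and the other side turns bytes into documents based on the file suffix.

```python
import csv
import io
from dataclasses import dataclass, field
from typing import Callable, Dict, Iterable, Iterator, List


@dataclass
class Blob:
    """Raw bytes plus where they came from (hypothetical type)."""
    data: bytes
    source: str


@dataclass
class Document:
    """Parsed text plus metadata (simplified stand-in, not the library class)."""
    page_content: str
    metadata: Dict[str, str] = field(default_factory=dict)


# A parser only knows about a format; it never touches storage.
Parser = Callable[[Blob], Iterator[Document]]


def parse_csv(blob: Blob) -> Iterator[Document]:
    """Yield one document per CSV row, formatted as 'column: value' lines."""
    reader = csv.DictReader(io.StringIO(blob.data.decode("utf-8")))
    for i, row in enumerate(reader):
        content = "\n".join(f"{key}: {value}" for key, value in row.items())
        yield Document(content, {"source": blob.source, "row": str(i)})


def parse_text(blob: Blob) -> Iterator[Document]:
    """Fallback parser: treat the whole blob as one UTF-8 text document."""
    yield Document(blob.data.decode("utf-8"), {"source": blob.source})


def in_memory_blobs(files: Dict[str, bytes]) -> Iterator[Blob]:
    """A blob loader only knows about a location; here, a dict stands in
    for a drive, a bucket, or a website."""
    for name, data in files.items():
        yield Blob(data=data, source=name)


def load_documents(blobs: Iterable[Blob], parsers: Dict[str, Parser]) -> List[Document]:
    """Combine any blob source with any set of parsers, keyed by file suffix."""
    docs: List[Document] = []
    for blob in blobs:
        suffix = blob.source.rsplit(".", 1)[-1].lower()
        parser = parsers.get(suffix, parse_text)
        docs.extend(parser(blob))
    return docs
```

Under these assumptions, a "location" loader (drive, website, ...) would only need to produce blobs, and a "format" parser (csv, pdf, ...) could be reused across all of them, which matches the split suggested earlier in this thread.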
@hwchase17 I've completely reworked the document_loader classification. Please, check it out. One good side effect: Now this PR changes only one file.
@hwchase17 any comments? If you are busy, maybe @dev2049 can help? TNX