langchain
Add SharePoint Loader
- Added a loader (SharePointLoader) that can pull documents (pdf, docx, doc) from the SharePoint Document Library.
- Added a base loader (O365BaseLoader) to be used by all loaders that rely on the O365 package.
- Refactored OneDriveLoader to use the new O365BaseLoader.
@netoferraz Thank you for the contribution! As a heads up I'm in the process of adding a few abstractions to the document flow to decouple loading from parsing code.
General strategy is here: https://github.com/hwchase17/langchain/pull/2833#issuecomment-1509097756
TL;DR: If you're able to implement a BlobLoader (interface and file system implementation shown below), it'll make it easier for users to add arbitrary parsers on top of the loading interface.
    class BlobLoader(ABC):
        """Abstract interface for blob loaders implementation.

        Implementer should be able to load raw content from a storage system according
        to some criteria and return the raw content lazily as a stream of blobs.
        """

        @abstractmethod
        def yield_blobs(
            self,
        ) -> Iterable[Blob]:
            """A lazy loader for raw data represented by LangChain's Blob object.

            Returns:
                A generator over blobs
            """
Implementation for local file system:
https://github.com/hwchase17/langchain/blob/master/langchain/document_loaders/blob_loaders/file_system.py#L39
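For illustration only, here is a minimal, self-contained sketch of what implementing this interface could look like. The `Blob` dataclass is a simplified stand-in, not LangChain's actual class, and `DirectoryBlobLoader` and `parse_as_text` are hypothetical names introduced for this example. It is not the linked file system implementation.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from pathlib import Path
from typing import Iterable, Iterator, Optional


@dataclass
class Blob:
    """Simplified stand-in for LangChain's Blob (assumption, not the real class)."""

    data: bytes
    source: Optional[str] = None


class BlobLoader(ABC):
    """Same shape as the interface quoted in the comment above."""

    @abstractmethod
    def yield_blobs(self) -> Iterable[Blob]:
        """Lazily yield raw blobs."""


class DirectoryBlobLoader(BlobLoader):
    """Hypothetical loader: yields one blob per file matching a glob pattern."""

    def __init__(self, root: str, pattern: str = "*") -> None:
        self.root = Path(root)
        self.pattern = pattern

    def yield_blobs(self) -> Iterator[Blob]:
        # sorted() keeps the order deterministic; each file is read only
        # when the consumer asks for the next blob.
        for path in sorted(self.root.rglob(self.pattern)):
            if path.is_file():
                yield Blob(data=path.read_bytes(), source=str(path))


def parse_as_text(blobs: Iterable[Blob]) -> Iterator[str]:
    """Hypothetical parser that can be layered on top of any BlobLoader."""
    for blob in blobs:
        yield blob.data.decode("utf-8")
```

Because the parser only sees blobs, it does not need to know whether they came from a local directory or a remote store, which is the decoupling described above.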
Hi @eyurtsev ! I'll try to bring those concepts to this loader and implement it.
Hey @eyurtsev and @hwchase17. I finally had a chance to work on this PR again. I decoupled the loading and parsing steps using FileSystemBlobLoader and BaseBlobParser. Could you guys review it?
Are you able to solve the issue? =]
Hi @HoiDam. Are you talking about that comment (https://github.com/hwchase17/langchain/pull/4284#pullrequestreview-1417033243)? That comment is outdated because the file does not even exist anymore.
Hi @hwchase17 and @eyurtsev can this be reviewed and merged soon?
Did anyone have a chance to look at this? Would love to have this merged :-)
@netoferraz great initiative!
Is this loader adding metadata from SPO? It's possible in the native O365 module.
Any thoughts on resurrecting this so it can be merged? I'd like to send a PR to improve some of the auth handling and didn't want to step on this because I'd also like Sharepoint support. Thanks!
yes! working to revive
Yeah! @baskaryan ! Are you guys willing to accept this PR? Please, just clarify to me what needs to be done, ok? By the way, thank you @guidorietbroek
It would be great if we can move forward with this PR! I'm not sure if the maintainers have any intentions to accept this work.
The latest updates on your projects. Learn more about Vercel for Git ↗︎
Name | Status | Preview | Comments | Updated (UTC) |
---|---|---|---|---|
langchain | ✅ Ready (Inspect) | Visit Preview | 💬 Add feedback | Aug 19, 2023 0:55am |
updated, would love one more review @netoferraz and @eyurtsev!
@netoferraz 👋 thank you for the contribution! I left a few comments on the PR; overall it looks good to me, so I'm okay merging as is. (cc @baskaryan)
There's a BlobLoader abstraction in the codebase that would fit the requirements here pretty well, with an implementation for the file system called FileSystemBlobLoader that can be replicated here. The way it would look would be to declare something like O365BlobLoader; it would take a bunch of attributes in the init, like auth and filters, and yield blobs.
Then one could compose it with GenericLoader to apply any sort of parser to content that can be fetched from O365.
Not a requirement for merging this PR, as we can re-use the existing code at a later point. :)
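As a rough sketch of that suggestion, under stated assumptions: `Blob` and `Document` below are simplified stand-ins rather than LangChain's classes, `ComposedLoader` only mirrors the idea of GenericLoader and is not its API, and `fetch_files` is a hypothetical callable standing in for an already-authenticated O365 client. None of these names come from the PR.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Iterable, Iterator, Sequence, Tuple


@dataclass
class Blob:
    """Simplified stand-in for LangChain's Blob (assumption)."""

    data: bytes
    source: str


@dataclass
class Document:
    """Simplified stand-in for LangChain's Document (assumption)."""

    page_content: str
    metadata: Dict[str, str] = field(default_factory=dict)


# Hypothetical transport: given a folder path, return (file name, raw bytes)
# pairs. A real loader would authenticate and call the O365 package here.
FetchFn = Callable[[str], Iterable[Tuple[str, bytes]]]
Parser = Callable[[Blob], Iterable[Document]]


class O365BlobLoader:
    """Hypothetical blob loader: holds auth/filter settings and yields blobs."""

    def __init__(
        self,
        fetch_files: FetchFn,
        folder_path: str,
        file_extensions: Sequence[str] = ("pdf", "docx", "doc"),
    ) -> None:
        self.fetch_files = fetch_files
        self.folder_path = folder_path
        self.file_extensions = {ext.lower() for ext in file_extensions}

    def yield_blobs(self) -> Iterator[Blob]:
        for name, data in self.fetch_files(self.folder_path):
            # Filter on the lower-cased extension; names without a dot are skipped.
            ext = name.rsplit(".", 1)[-1].lower() if "." in name else ""
            if ext in self.file_extensions:
                yield Blob(data=data, source=f"{self.folder_path}/{name}")


class ComposedLoader:
    """Stand-in for the composition idea: any blob loader plus any parser."""

    def __init__(self, blob_loader: O365BlobLoader, parser: Parser) -> None:
        self.blob_loader = blob_loader
        self.parser = parser

    def lazy_load(self) -> Iterator[Document]:
        for blob in self.blob_loader.yield_blobs():
            yield from self.parser(blob)
```

With this split, swapping the parser (for example, a PDF parser instead of a plain-text one) would not require touching the code that talks to O365.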
Thank you, @eyurtsev! @baskaryan, if you think we need to do some additional work based on @eyurtsev's review, let me know, ok? Otherwise, it seems we could move to approve this work.