Add SharePoint Loader

Open netoferraz opened this issue 1 year ago • 2 comments

  • Added a loader (SharePointLoader) that can pull documents (pdf, docx, doc) from the SharePoint Document Library.
  • Added a Base Loader (O365BaseLoader) to be used for all Loaders that use O365 Package
  • Code refactoring on OneDriveLoader to use the new O365BaseLoader.

netoferraz avatar May 07 '23 12:05 netoferraz

@netoferraz Thank you for the contribution! As a heads up I'm in the process of adding a few abstractions to the document flow to decouple loading from parsing code.

General strategy is here: https://github.com/hwchase17/langchain/pull/2833#issuecomment-1509097756

TLDR; If you're able to implement a BlobLoader (interface and file system implementation shown below), it'll make it easier for users to add arbitrary parsers on top of the loading interface.

class BlobLoader(ABC):
    """Abstract interface for blob loaders implementation.

    Implementer should be able to load raw content from a storage system according
    to some criteria and return the raw content lazily as a stream of blobs.
    """

    @abstractmethod
    def yield_blobs(
        self,
    ) -> Iterable[Blob]:
        """A lazy loader for raw data represented by LangChain's Blob object.

        Returns:
            A generator over blobs
        """

Implementation for local file system:

https://github.com/hwchase17/langchain/blob/master/langchain/document_loaders/blob_loaders/file_system.py#L39
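
For illustration only (not code from this PR or from LangChain itself): a minimal, self-contained sketch of what a concrete implementation of the interface above could look like. The `Blob` dataclass here is a simplified stand-in for LangChain's Blob object, and `SuffixBlobLoader` is a hypothetical name; the real file system implementation is the one linked above.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from pathlib import Path
from typing import Iterable, Iterator, Optional, Sequence, Union


@dataclass
class Blob:
    """Simplified stand-in for LangChain's Blob (assumption, not the real class)."""

    path: Path
    mimetype: Optional[str] = None

    def as_bytes(self) -> bytes:
        # Content is only read when a parser actually asks for it.
        return self.path.read_bytes()


class BlobLoader(ABC):
    """Same shape as the interface quoted above."""

    @abstractmethod
    def yield_blobs(self) -> Iterable[Blob]:
        """Lazily yield blobs from some storage system."""


class SuffixBlobLoader(BlobLoader):
    """Hypothetical loader: walks a directory and yields files by suffix."""

    def __init__(
        self,
        root: Union[str, Path],
        suffixes: Sequence[str] = (".pdf", ".docx", ".doc"),
    ) -> None:
        self.root = Path(root)
        self.suffixes = tuple(s.lower() for s in suffixes)

    def yield_blobs(self) -> Iterator[Blob]:
        # Generator: nothing is listed or read until the caller iterates.
        for candidate in sorted(self.root.rglob("*")):
            if candidate.is_file() and candidate.suffix.lower() in self.suffixes:
                yield Blob(path=candidate)
```

The loader only decides which raw files to hand out; turning bytes into documents is left to whatever parser is applied afterwards, which is the decoupling described above.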

eyurtsev avatar May 09 '23 02:05 eyurtsev

Hi @eyurtsev ! I'll try to bring those concepts to this loader and implement it.

netoferraz avatar May 10 '23 12:05 netoferraz

Hey @eyurtsev and @hwchase17. I finally had a chance to work on this PR again. I decoupled the loading and parsing steps using FileSystemBlobLoader and BaseBlobParser. Could you guys review it?

netoferraz avatar Jun 05 '23 01:06 netoferraz

Are you able to solve the issue? =]

HoiDam avatar Jun 09 '23 08:06 HoiDam

Hi @HoiDam. Are you talking about this comment (https://github.com/hwchase17/langchain/pull/4284#pullrequestreview-1417033243)? That comment is outdated because the file does not even exist anymore.

netoferraz avatar Jun 09 '23 10:06 netoferraz

Hi @hwchase17 and @eyurtsev can this be reviewed and merged soon?

laveshnk-crypto avatar Jun 13 '23 09:06 laveshnk-crypto

Did anyone have a chance to look at this? Would love to have this merged :-)

willemmulder avatar Jul 03 '23 18:07 willemmulder

@netoferraz great initiative!

Does this loader add metadata from SPO? It's possible in the native O365 module.

guidorietbroek avatar Jul 04 '23 04:07 guidorietbroek

Any thoughts on resurrecting this so it can be merged? I'd like to send a PR to improve some of the auth handling and didn't want to step on this because I'd also like Sharepoint support. Thanks!

vicondoa avatar Aug 16 '23 18:08 vicondoa

yes! working to revive

baskaryan avatar Aug 18 '23 18:08 baskaryan

Yeah! @baskaryan ! Are you guys willing to accept this PR? Please, just clarify to me what needs to be done, ok? By the way, thank you @guidorietbroek

netoferraz avatar Aug 18 '23 19:08 netoferraz

It would be great if we can move forward with this PR! I'm not sure if the maintainers have any intentions to accept this work.

netoferraz avatar Aug 18 '23 19:08 netoferraz

updated, would love one more review @netoferraz and @eyurtsev!

baskaryan avatar Aug 19 '23 00:08 baskaryan

@netoferraz :wave: thank you for the contribution! I left a few comments on the PR, overall looks good to me, so okay merging as is. (cc @baskaryan )

There's a BlobLoader abstraction in the codebase that would fit the requirements here pretty well, with a file system implementation called FileSystemBlobLoader that can be replicated here. The way it would look would be to declare something like an O365BlobLoader, which takes a bunch of attributes in its init, like auth and filters, and yields blobs.

Then one could compose it with GenericLoader to apply any sort of parser to content that can be fetched from O365.

Not a requirement for merging this PR as we can re-use the existing code at a later point. :)
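
A rough sketch of that composition, for illustration only. Everything here is a stand-in so the snippet runs on its own: `Blob`, `Document`, `Utf8Parser` and `GenericLoaderSketch` are simplified, hypothetical versions of the corresponding LangChain pieces, and `fetch_files` is an assumed callable standing in for whatever the O365 package would use to list and download files. None of it is the code in this PR.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, Iterable, Iterator, Sequence, Tuple


@dataclass
class Blob:
    """Stand-in for LangChain's Blob: raw bytes plus where they came from."""

    data: bytes
    source: str


@dataclass
class Document:
    """Stand-in for LangChain's Document."""

    page_content: str
    metadata: Dict[str, str]


class O365BlobLoader:
    """Hypothetical loader: takes auth and filters in init, yields blobs."""

    def __init__(
        self,
        fetch_files: Callable[[Any], Iterable[Tuple[str, bytes]]],
        auth: Any,
        suffixes: Sequence[str] = (".pdf", ".docx", ".doc"),
    ) -> None:
        self.fetch_files = fetch_files  # assumed hook, not a real O365 call
        self.auth = auth
        self.suffixes = tuple(s.lower() for s in suffixes)

    def yield_blobs(self) -> Iterator[Blob]:
        for name, data in self.fetch_files(self.auth):
            if name.lower().endswith(self.suffixes):
                yield Blob(data=data, source=name)


class Utf8Parser:
    """Toy parser: one blob becomes one text document."""

    def lazy_parse(self, blob: Blob) -> Iterator[Document]:
        yield Document(
            page_content=blob.data.decode("utf-8"),
            metadata={"source": blob.source},
        )


class GenericLoaderSketch:
    """Simplified take on GenericLoader: pair any blob loader with any parser."""

    def __init__(self, blob_loader: O365BlobLoader, blob_parser: Utf8Parser) -> None:
        self.blob_loader = blob_loader
        self.blob_parser = blob_parser

    def lazy_load(self) -> Iterator[Document]:
        for blob in self.blob_loader.yield_blobs():
            yield from self.blob_parser.lazy_parse(blob)
```

With this split, swapping the parser (PDF, docx, plain text) would not touch the O365 fetching code, and the same parser could be reused with a file system loader.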

eyurtsev avatar Aug 19 '23 03:08 eyurtsev

Thank you, @eyurtsev! @baskaryan If you think we need to do some additional work based on @eyurtsev's review, let me know, ok? Otherwise, it seems we could move to approve this work.

netoferraz avatar Aug 19 '23 12:08 netoferraz