Add Spark DataFrame as a Document Loader
This is a work-in-progress PR adding Spark DataFrames as a Document Loader (tests haven't been added yet). LangChain already has a Pandas DataFrame loader, so extending support to Spark seemed like the natural next step. The core issue is that a Spark DataFrame is usually distributed across workers rather than stored on a single one, so instead of making a major change to let Document Loaders yield results lazily, I check how much memory is available and cap the size of the loaded document list at a fraction of it. The fraction is currently set to 1/2, but it should probably be something like 1/10 or 1/20 for regular usage.
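For illustration, here is a minimal sketch of a memory-capped loader along those lines. The class name `SparkDataFrameLoader`, the `page_content_column` and `fraction_of_memory` parameters, and the per-row size estimate are assumptions made for this sketch, not the PR's actual code; it only assumes `psutil` and `pyspark` are installed.

```python
# Illustrative sketch only: names and import paths are assumptions, not the
# code in this PR. Rows are collected to the driver, capped so the resulting
# list stays under a fraction of the driver's currently available memory.
import sys

import psutil

from langchain.docstore.document import Document
from langchain.document_loaders.base import BaseLoader


class SparkDataFrameLoader(BaseLoader):
    """Load a Spark DataFrame into Documents, bounded by available memory."""

    def __init__(self, spark_df, page_content_column: str = "text",
                 fraction_of_memory: float = 0.1):
        self.df = spark_df
        self.page_content_column = page_content_column
        self.fraction_of_memory = fraction_of_memory

    def load(self) -> list:
        # Roughly estimate per-row size from one sample row, then cap the
        # number of collected rows at `fraction_of_memory` of available RAM.
        sample = self.df.limit(1).collect()
        if not sample:
            return []
        row_size = sys.getsizeof(sample[0].asDict())
        budget = psutil.virtual_memory().available * self.fraction_of_memory
        max_rows = max(1, int(budget // max(row_size, 1)))

        docs = []
        for row in self.df.limit(max_rows).collect():
            data = row.asDict()
            content = str(data.pop(self.page_content_column, ""))
            docs.append(Document(page_content=content, metadata=data))
        return docs
```

Exposing the memory fraction as a constructor argument would also make it easy to move the default from 1/2 down to 1/10 or 1/20 as suggested above, without further code changes.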
Thanks for the contribution! Here is a reference for how to add tests with optional dependencies:
https://github.com/hwchase17/langchain/blob/master/.github/CONTRIBUTING.md#working-with-optional-dependencies
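As a rough sketch of the optional-dependency pattern (the real convention is spelled out in the CONTRIBUTING guide linked above), a test can skip cleanly when `pyspark` isn't installed; the test name and assertions here are placeholders, not the PR's actual tests:

```python
# Sketch of an optional-dependency test: skip if pyspark is not installed.
import pytest


def test_spark_dataframe_available() -> None:
    pytest.importorskip("pyspark")
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.master("local[1]").getOrCreate()
    df = spark.createDataFrame([("hello world", 1)], ["text", "id"])

    # The real test would exercise the new loader on this DataFrame; here we
    # only confirm the optional dependency is importable and usable.
    assert df.count() == 1
    spark.stop()
```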