Add Spark DataFrame as a Document Loader

Open · rithwik-db opened this issue 2 years ago • 1 comment

This is a work-in-progress PR that adds Spark DataFrames as a Document Loader (tests haven't been added yet). LangChain already has a Pandas DataFrame loader, so extending support to Spark seemed like the next step. The core issue is that a Spark DataFrame is usually partitioned across multiple workers rather than stored on one. Instead of making a major code change to let Document Loaders yield documents, I simply check how much memory is available and cap the size of this loader's document list at a fraction of it. The fraction is currently set to 1/2, but it should be something like 1/10 or 1/20 for regular usage.
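The memory-cap idea described above can be sketched roughly as follows. This is not the PR's actual code: the `Document` dataclass is a stand-in for LangChain's class, `load_with_memory_cap` and its parameters are hypothetical names, and `sys.getsizeof` is only a shallow, approximate size estimate.

```python
import sys
from dataclasses import dataclass, field
from typing import Any, Dict, Iterable, List


@dataclass
class Document:
    """Stand-in for LangChain's Document (assumption, not the real class)."""
    page_content: str
    metadata: Dict[str, Any] = field(default_factory=dict)


def load_with_memory_cap(
    rows: Iterable[Dict[str, Any]],
    page_content_column: str,
    available_bytes: int,
    fraction: float = 0.1,
) -> List[Document]:
    """Collect rows into Documents until a fraction of memory is used up.

    `available_bytes` is passed in by the caller; one way to obtain it is
    psutil.virtual_memory().available (assumes psutil is installed).
    """
    budget = available_bytes * fraction
    used = 0
    docs: List[Document] = []
    for row in rows:
        text = str(row[page_content_column])
        # Every column other than the content column becomes metadata.
        metadata = {k: v for k, v in row.items() if k != page_content_column}
        # Shallow estimate only: nested values are not counted.
        size = sys.getsizeof(text) + sys.getsizeof(metadata)
        if used + size > budget:
            break  # stop before exceeding the budget; remaining rows are dropped
        used += size
        docs.append(Document(page_content=text, metadata=metadata))
    return docs
```

With PySpark, the rows could be streamed to the driver with something like `(r.asDict() for r in df.toLocalIterator())`, so the full DataFrame is never collected at once. Note that truncating silently loses rows; a loader that yields documents lazily would avoid that trade-off.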

rithwik-db · May 26 '23 17:05

Thanks for the contribution! Here is a reference for how to add tests with optional dependencies:

https://github.com/hwchase17/langchain/blob/master/.github/CONTRIBUTING.md#working-with-optional-dependencies

eyurtsev · May 26 '23 17:05