langchain
Create BaseBlobSplitter and BaseBlobTransformer
This PR creates a BaseBlobSplitter and a BaseBlobTransformer, which can be used to split and transform blobs. It adds a concrete BaseBlobTransformer called YoutubeToAudioTransformer, which downloads the audio for a YouTube URL, and a concrete BaseBlobSplitter called AudioSplitter, which splits an audio file into smaller blobs. Together they support an end-to-end workflow: (1) YouTube link to audio, (2) audio to audio splits, and (3) OpenAIWhisperParser to create Documents from the splits.
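To make the shape of the two abstractions concrete, here is a minimal sketch. It is not the code from this PR: the `Blob` dataclass, the `transform` and `split` method names and signatures, and the two toy subclasses (`UppercaseTransformer`, `FixedSizeSplitter`) are all assumptions made for illustration.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from typing import Iterator


@dataclass
class Blob:
    """Hypothetical stand-in for a raw-data container (not the library's class)."""
    data: bytes
    metadata: dict = field(default_factory=dict)


class BaseBlobTransformer(ABC):
    """Assumed interface: turn one blob into another blob."""

    @abstractmethod
    def transform(self, blob: Blob) -> Blob:
        ...


class BaseBlobSplitter(ABC):
    """Assumed interface: lazily split one blob into smaller blobs."""

    @abstractmethod
    def split(self, blob: Blob) -> Iterator[Blob]:
        ...


class UppercaseTransformer(BaseBlobTransformer):
    """Toy transformer standing in for something like a URL-to-audio step."""

    def transform(self, blob: Blob) -> Blob:
        return Blob(blob.data.upper(), dict(blob.metadata))


class FixedSizeSplitter(BaseBlobSplitter):
    """Toy splitter that cuts the payload into fixed-size byte chunks."""

    def __init__(self, chunk_size: int) -> None:
        if chunk_size <= 0:
            raise ValueError("chunk_size must be positive")
        self.chunk_size = chunk_size

    def split(self, blob: Blob) -> Iterator[Blob]:
        for index, start in enumerate(range(0, len(blob.data), self.chunk_size)):
            piece = blob.data[start:start + self.chunk_size]
            # Carry the parent metadata forward and record the chunk index.
            yield Blob(piece, {**blob.metadata, "chunk": index})
```

Chaining the two mirrors steps (1) and (2) of the workflow: `FixedSizeSplitter(3).split(UppercaseTransformer().transform(blob))` yields the blobs a parser would then consume in step (3).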
@rlancemartin looking great, left suggestions to rename a few things.
I want to propose something radical -- let's take out all file I/O for the first iteration, so there are no file artifacts generated during the entire process and everything streams in memory instead.
Did you add file IO to avoid re-downloading / re-processing content or for another reason?
@rlancemartin left a few comments for minor changes, we're looking pretty good -- let me know if that makes sense -- you should be able to implement them and re-lint, and then we can merge
We merged a refactor here: 1/ loading audio in memory will require something other than yt_dlp. 2/ exposing the splitter is probably not required because there are no obvious use cases for it; it's more of an internal transformation for the parser, needed because of the OpenAI Whisper API size limit.
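As a rough illustration of point 2/, the sketch below keeps the splitting step private to a parser and works entirely on in-memory bytes. Everything in it is hypothetical: `ChunkingParser`, `iter_chunks`, the injected `transcribe` callable, and the `max_bytes` value are placeholders, not the merged implementation. Slicing raw bytes only shows the control flow; real audio would need decoder-aware splitting so that each piece is still a valid audio file.

```python
import io
from typing import BinaryIO, Callable, Iterator, List


def iter_chunks(stream: BinaryIO, max_bytes: int) -> Iterator[bytes]:
    """Yield successive pieces of at most max_bytes from a binary stream."""
    if max_bytes <= 0:
        raise ValueError("max_bytes must be positive")
    while True:
        chunk = stream.read(max_bytes)
        if not chunk:
            break
        yield chunk


class ChunkingParser:
    """Hypothetical parser that splits internally to respect a per-request limit."""

    def __init__(self, transcribe: Callable[[bytes], str], max_bytes: int) -> None:
        # transcribe is injected so the sketch makes no real API calls.
        self._transcribe = transcribe
        self._max_bytes = max_bytes

    def parse(self, data: bytes) -> List[str]:
        # Wrap the payload in an in-memory buffer; nothing touches disk.
        stream = io.BytesIO(data)
        return [self._transcribe(c) for c in iter_chunks(stream, self._max_bytes)]
```

Because callers only see `parse`, the chunking strategy can change later without affecting any public interface.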