
Create BaseBlobSplitter and BaseBlobTransformer

Open rlancemartin opened this issue 2 years ago • 1 comments

This PR creates a BaseBlobSplitter and a BaseBlobTransformer, which can be used to split and transform blobs. It adds a concrete BaseBlobTransformer called YoutubeToAudioTransformer, which downloads the audio for a YouTube URL. It also adds a concrete BaseBlobSplitter called AudioSplitter, which splits an audio file into smaller blobs. Both are useful for an end-to-end workflow that combines (1) YouTube link to audio, (2) audio to audio splits, and (3) OpenAIWhisperParser to create Documents from the splits.
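A minimal sketch of how the two abstractions described above might fit together. Only the names BaseBlobSplitter and BaseBlobTransformer come from the PR; the `Blob` dataclass, the method names `transform` and `split`, and the toy subclasses are assumptions made for illustration and are not the PR's actual code. The toy classes operate on plain bytes so the example runs without network access or audio libraries.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from typing import Iterator, List


@dataclass
class Blob:
    """Hypothetical in-memory blob: raw bytes plus free-form metadata."""
    data: bytes
    metadata: dict = field(default_factory=dict)


class BaseBlobTransformer(ABC):
    """Maps one blob to one blob (method name is an assumption)."""

    @abstractmethod
    def transform(self, blob: Blob) -> Blob:
        ...


class BaseBlobSplitter(ABC):
    """Maps one blob to a lazy sequence of smaller blobs (method name is an assumption)."""

    @abstractmethod
    def split(self, blob: Blob) -> Iterator[Blob]:
        ...


class UppercaseTransformer(BaseBlobTransformer):
    """Toy stand-in for a transformer such as YoutubeToAudioTransformer."""

    def transform(self, blob: Blob) -> Blob:
        return Blob(blob.data.upper(), dict(blob.metadata))


class FixedSizeSplitter(BaseBlobSplitter):
    """Toy stand-in for a splitter such as AudioSplitter: cuts every chunk_size bytes."""

    def __init__(self, chunk_size: int) -> None:
        if chunk_size <= 0:
            raise ValueError("chunk_size must be positive")
        self.chunk_size = chunk_size

    def split(self, blob: Blob) -> Iterator[Blob]:
        for index, start in enumerate(range(0, len(blob.data), self.chunk_size)):
            piece = blob.data[start:start + self.chunk_size]
            # Carry the parent metadata forward and record the chunk position.
            yield Blob(piece, {**blob.metadata, "chunk": index})


def run_pipeline(blob: Blob, transformer: BaseBlobTransformer,
                 splitter: BaseBlobSplitter) -> List[Blob]:
    """Transform first, then split, mirroring steps (1) and (2) of the workflow."""
    return list(splitter.split(transformer.transform(blob)))
```

In this sketch, step (3) of the workflow would consume the list returned by `run_pipeline` and hand each blob to a parser.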

rlancemartin avatar Jun 02 '23 20:06 rlancemartin

@rlancemartin looking great, left suggestions to rename a few things.

I want to propose something radical -- let's take out all file I/O for the first iteration, so there are no file artifacts generated during the entire process and instead everything is streamed in memory.

Did you add file IO to avoid re-downloading / re-processing content or for another reason?

eyurtsev avatar Jun 03 '23 01:06 eyurtsev

@rlancemartin left a few comments for minor changes, we're looking pretty good -- let me know if that makes sense -- you should be able to implement them, re-lint, and then we can merge

eyurtsev avatar Jun 06 '23 14:06 eyurtsev

We merged a refactor here: 1/ loading audio in memory will require something other than yt_dlp. 2/ exposing the splitter is probably not required b/c there are no obvious use-cases for it; it's more of an internal transformation for the parser due to the OpenAI Whisper API size limit.
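To illustrate point 2/, here is a rough sketch of a parser that keeps the splitting step private instead of exposing a public splitter. The class name `ChunkingParser`, the `max_bytes` parameter, and the injected `transcribe` callable are all hypothetical and do not reflect the merged refactor. The sketch also cuts at arbitrary byte offsets, which would not be safe for real encoded audio; a real implementation would have to split on audio-aware boundaries.

```python
from typing import Callable, Iterator


class ChunkingParser:
    """Hypothetical parser that splits oversized input internally."""

    def __init__(self, transcribe: Callable[[bytes], str], max_bytes: int) -> None:
        if max_bytes <= 0:
            raise ValueError("max_bytes must be positive")
        # transcribe stands in for a call to a speech-to-text API.
        self._transcribe = transcribe
        # max_bytes stands in for the upstream API's upload size limit.
        self._max_bytes = max_bytes

    def _chunks(self, data: bytes) -> Iterator[bytes]:
        """Private helper: yield consecutive slices no larger than max_bytes."""
        for start in range(0, len(data), self._max_bytes):
            yield data[start:start + self._max_bytes]

    def lazy_parse(self, data: bytes) -> Iterator[str]:
        """Yield one transcript per chunk, without writing anything to disk."""
        for chunk in self._chunks(data):
            yield self._transcribe(chunk)
```

Callers only see `lazy_parse`; the chunk size becomes an implementation detail that can change if the upstream limit changes.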

rlancemartin avatar Jun 07 '23 04:06 rlancemartin