
Switched video backend to use yt-dlp

Open skyler14 opened this issue 1 year ago • 0 comments

This builds on the globbing branch PR I submitted earlier at:

https://github.com/emcf/thepipe/pull/26

It replaces pytube and broadly expands the number of websites supported for automatic video scraping.

I also attach some basic video metadata to the scraped output, configurable via the YoutubeEnum.

It additionally extends the text_only flag API in a backwards-compatible way. It can still be used as a standard true/false flag, but you can also specify whether to check for existing captions on the platform (and, if a platform distinguishes between uploaded transcripts and AI-generated ones, use that information). Whisper is the failsafe in all cases. The default behavior in text_only mode is to search for English uploaded captions, then uploaded captions in any language, then English AI-generated captions, then AI-generated captions in any language, and finally to transcribe with Whisper. Passing --text_only transcribe disables all subtitle downloading and forces Whisper only.
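To make the fallback order above concrete, here is a minimal sketch of the selection logic. It is not the code from the PR: `pick_transcript_source` and its return values are hypothetical names for illustration. The two dict arguments are assumed to have the shape of the `subtitles` and `automatic_captions` fields of a yt-dlp info dict (language code mapped to a list of tracks), and English is assumed to be any code that is `en` or starts with `en-`.

```python
from typing import Mapping, Optional, Sequence, Tuple, Union

# Hypothetical marker meaning "no usable captions, run Whisper instead".
WHISPER: Tuple[str, Optional[str]] = ("whisper", None)


def _first_english(langs: Sequence[str]) -> Optional[str]:
    """Return the first English language code (e.g. 'en', 'en-US'), if any."""
    for lang in langs:
        if lang == "en" or lang.startswith("en-"):
            return lang
    return None


def pick_transcript_source(
    subtitles: Mapping[str, list],
    automatic_captions: Mapping[str, list],
    text_only: Union[bool, str] = True,
) -> Tuple[str, Optional[str]]:
    """Choose where the transcript comes from in text_only mode.

    Order: English uploaded, any uploaded, English AI-generated,
    any AI-generated, then Whisper as the failsafe.
    """
    # "transcribe" skips all caption lookups and forces Whisper.
    if text_only == "transcribe":
        return WHISPER

    for source, tracks in (
        ("uploaded", subtitles),
        ("ai-generated", automatic_captions),
    ):
        # Ignore languages that have no actual tracks attached.
        langs = [lang for lang, entries in tracks.items() if entries]
        if not langs:
            continue
        english = _first_english(langs)
        return (source, english if english is not None else langs[0])

    return WHISPER
```

Because uploaded captions in any language are checked before AI-generated ones, a video with only French uploaded captions and English AI-generated captions yields the French track under this ordering.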

Finally, scrape tweet checks for a video and, if one is found, tries to download it as well. It's only been minimally tested though.

While not tested, part of this stack should support parsing entire playlists, I believe.
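As a rough illustration of what playlist handling could look like, the sketch below walks a yt-dlp style result and yields one info dict per video. It assumes the result has the shape returned by `yt_dlp.YoutubeDL(...).extract_info(url, download=False)`, where a playlist carries an `entries` list and unavailable items may appear as `None`. `iter_video_entries` is a hypothetical helper, not something in this PR.

```python
from typing import Any, Dict, Iterator


def iter_video_entries(info: Dict[str, Any]) -> Iterator[Dict[str, Any]]:
    """Yield individual video info dicts from a single video or a playlist."""
    entries = info.get("entries")
    if entries is None:
        # No "entries" key: treat the dict as a single video.
        yield info
        return
    for entry in entries:
        if entry is None:
            # Assumed: entries that failed to extract show up as None.
            continue
        # Recurse so nested playlists are flattened too.
        yield from iter_video_entries(entry)
```

Each yielded dict could then go through the same per-video scraping path as a single URL.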

Future plans for branches that build on this PR, once it is merged:

  • examine, optimize, and improve playlist parsing
  • make scrape_youtube natively invocable for websites that mix text and embedded videos
  • compare YouTubeMetadata against the info natively available from the yt-dlp API to see what other data can/should be attached to the chunks
  • implement a more flexible keyframe mechanism for downloaded videos

skyler14, Sep 06 '24 00:09