markitdown
fix: Implement retry logic for YouTube transcript fetching and fix URL decoding issue
Hi! I had a problem with the YouTube part, especially when pasting a URL into the terminal (macOS: zsh, bash; Linux: bash). Also, the README doesn't mention YouTube, so I added a note about it until the documentation is complete.
What’s Changed:
This PR introduces several important updates to improve the reliability and functionality of the YouTube transcript fetching process and URL handling:
- Retry Logic for YouTube Transcript Fetching:
  - I've added a retry mechanism around the YouTube transcript fetching operation. This helps to handle intermittent failures or network issues more gracefully by retrying the operation a few times before failing.
- Fixed URL Decoding Issue:
  - There was an issue where YouTube URLs with escape characters (like `\?` and `\=`) were not being processed correctly, especially when pasted from the terminal. This fix ensures that URLs are properly decoded using `urllib.parse.unquote()`, so URLs like `https://www.youtube.com/watch\?v\=videoID` are handled properly.
- Improved Metadata and Description Extraction:
  - I’ve also improved the logic for extracting metadata and descriptions from YouTube pages. This makes the extraction process more reliable, particularly when dealing with different YouTube page layouts.
- Error Handling Improvements:
  - Enhanced error handling for the YouTube transcript fetching process, so the system can recover better from failures or missing data.
- Refactored `_findKey` Function:
  - The `_findKey` function has been refactored to simplify its code and make it more efficient by using `json.items()` for dictionary iteration instead of a more complex recursive method.
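To illustrate the URL decoding item above, here is a minimal sketch of how a pasted URL could be cleaned up before parsing. The helper names `normalize_pasted_url` and `video_id` are hypothetical and are not taken from the PR diff. Note that `urllib.parse.unquote()` only decodes percent-escapes, so this sketch assumes the shell-style backslashes are stripped in a separate step.

```python
from typing import Optional
from urllib.parse import parse_qs, unquote, urlparse


def normalize_pasted_url(url: str) -> str:
    """Undo percent-encoding and shell-style backslash escapes in a pasted URL."""
    # unquote() decodes percent-escapes such as %3F ("?") and %3D ("=").
    url = unquote(url)
    # unquote() leaves backslashes untouched, so remove the ones a shell
    # puts in front of "?", "=" and "&" (assumed set of escaped characters).
    for ch in "?=&":
        url = url.replace("\\" + ch, ch)
    return url


def video_id(url: str) -> Optional[str]:
    """Return the value of the "v" query parameter, or None if it is absent."""
    query = parse_qs(urlparse(normalize_pasted_url(url)).query)
    values = query.get("v")
    return values[0] if values else None
```

With this sketch, a URL pasted as `https://www.youtube.com/watch\?v\=videoID` is normalized before the query string is parsed, so the `v` parameter can be read as usual.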
Why This Change is Needed:
- Reliability: The retry mechanism will improve the reliability of fetching transcripts, which can fail due to network issues or API rate limiting.
- Correct URL Processing: With the URL decoding fix, users can now paste URLs directly from the terminal without worrying about escape sequences, ensuring URLs are parsed correctly.
- Better Metadata Handling: The improvements to metadata and description extraction will ensure that we get more accurate data from YouTube pages.
- Resiliency: The improved error handling will help the application deal with temporary issues without failing entirely, making the process more robust.
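As a rough illustration of the retry mechanism motivated above, the sketch below wraps an arbitrary fetch callable and retries it a fixed number of times with a pause between attempts. The name `fetch_with_retries` and the `retries` and `delay` parameters are assumptions for this example; the actual names and defaults in the PR may differ.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def fetch_with_retries(fetch: Callable[[], T], retries: int = 3, delay: float = 2.0) -> T:
    """Call fetch() up to `retries` times, sleeping `delay` seconds between attempts."""
    if retries < 1:
        raise ValueError("retries must be at least 1")
    for attempt in range(1, retries + 1):
        try:
            return fetch()
        except Exception:
            # Re-raise the last error once all attempts are used up.
            if attempt == retries:
                raise
            time.sleep(delay)
    raise AssertionError("unreachable")
```

A caller would pass a zero-argument callable that performs the transcript request, for example a `lambda` wrapping whatever fetch function the converter uses. Catching every `Exception` keeps the sketch short; narrowing it to the errors that are actually transient would be a reasonable refinement.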
@iw4p please read the following Contributor License Agreement (CLA). If you agree with the CLA, please reply with the following information.
@microsoft-github-policy-service agree [company="{your company}"]
Options:
- (default - no company specified) I have sole ownership of intellectual property rights to my Submissions and I am not making Submissions in the course of work for my employer.
@microsoft-github-policy-service agree
- (when company given) I am making Submissions in the course of work for my employer (or my employer has intellectual property rights in my Submissions by contract or applicable law). I have permission from my employer to make Submissions and enter into this Agreement on behalf of my employer. By signing below, the defined term “You” includes me and my employer.
@microsoft-github-policy-service agree company="Microsoft"
@microsoft-github-policy-service agree
Thanks. This looks good. There appears to be a test error unrelated to this PR. I will fix it, then re-run the tests and merge.
Hi! Thank you.