Added language detection for non-english youtube videos
Description
Invoking app.add() on non-english youtube videos would cause a NoTranscriptFound error. There is already a pull request that fixes the issue by specify an additional argument in the config file, but it has not been merged until today. Ideally app.add() also automatically detects the language in the youtube transcript, instead having to manually specify the language in the beginning.
Fixes #385
Type of change
Please delete options that are not relevant.
- [x] New feature (non-breaking change which adds functionality)
How Has This Been Tested?
I simply invoked app.add() on various non-english youtube videos, and the NoTranscriptFound error no longer appears.
Please delete options that are not relevant.
- [x] Unit Test
Checklist:
- [x] My code follows the style guidelines of this project
- [x] I have performed a self-review of my own code
- [ ] I have commented my code, particularly in hard-to-understand areas
- [ ] I have made corresponding changes to the documentation
- [x] My changes generate no new warnings
- [ ] I have added tests that prove my fix is effective or that my feature works
- [x] New and existing unit tests pass locally with my changes
- [x] Any dependent changes have been merged and published in downstream modules
- [x] I have checked my code and corrected any misspellings
Maintainer Checklist
- [ ] closes #xxxx (Replace xxxx with the GitHub issue number)
- [ ] Made sure Checks passed
@anantoj thanks for providing that fix! I ran & reviewed your code locally ... it looks good to me.
However in my case, the test_youtube_video/test_load_data fails. Is this the same for you too?
@LeonieFreisinger Thanks for pointing out the issue. I was not aware that the test suite uses a dummy VIDEO_ID, which breaks the video_id extraction. There's also another issue caused a recent commit 78301ee that modifies the metadata. I fixed all the issues in my recent commit and test_youtube_video/test_load_data passed on my end. Let me know if it's all good for you
@anantoj thanks for your fast answer!
I tested your code with this url: https://www.youtube.com/watch?v=RG7Edrbhdpk. I get an error "Could not retrieve a transcript for the video". Can you have a look into it?
Besides that, your code works fine for me. Thanks!
@LeonieFreisinger it seems that your video just doesn't have a transcript, which causes the issue.
It's also a warning raised by youtube_transcript_api instead of an error. The error, at least on my end, is a ValueError on embedchain's Youtube video loader here which is raised when the transcript is not found.
@anantoj You are right! So it should be fine. @deshraj @taranjeet The changes in the PR look good to me. Can you please re-trigger the workflows and merge the PR?
@anantoj Can you please resolve the merge conflicts so that we can merge the PR?
Hi @Dev-Khant, I've resolved the conflicts
@anantoj can you please fix the lint issues in the PR? You can run make lint format to format the code. Once that is fixed, we are ready to merge the PR.
Hi @deshraj, I fixed the lint issues.