mem0 icon indicating copy to clipboard operation
mem0 copied to clipboard

Added language detection for non-english youtube videos

Open anantoj opened this issue 1 year ago • 2 comments

Description

Invoking app.add() on non-english youtube videos would cause a NoTranscriptFound error. There is already a pull request that fixes the issue by specify an additional argument in the config file, but it has not been merged until today. Ideally app.add() also automatically detects the language in the youtube transcript, instead having to manually specify the language in the beginning.

Fixes #385

Type of change

Please delete options that are not relevant.

  • [x] New feature (non-breaking change which adds functionality)

How Has This Been Tested?

I simply invoked app.add() on various non-english youtube videos, and the NoTranscriptFound error no longer appears.

Please delete options that are not relevant.

  • [x] Unit Test

Checklist:

  • [x] My code follows the style guidelines of this project
  • [x] I have performed a self-review of my own code
  • [ ] I have commented my code, particularly in hard-to-understand areas
  • [ ] I have made corresponding changes to the documentation
  • [x] My changes generate no new warnings
  • [ ] I have added tests that prove my fix is effective or that my feature works
  • [x] New and existing unit tests pass locally with my changes
  • [x] Any dependent changes have been merged and published in downstream modules
  • [x] I have checked my code and corrected any misspellings

Maintainer Checklist

  • [ ] closes #xxxx (Replace xxxx with the GitHub issue number)
  • [ ] Made sure Checks passed

anantoj avatar May 09 '24 06:05 anantoj

@anantoj thanks for providing that fix! I ran & reviewed your code locally ... it looks good to me. However in my case, the test_youtube_video/test_load_data fails. Is this the same for you too?

LeonieFreisinger avatar May 13 '24 17:05 LeonieFreisinger

@LeonieFreisinger Thanks for pointing out the issue. I was not aware that the test suite uses a dummy VIDEO_ID, which breaks the video_id extraction. There's also another issue caused a recent commit 78301ee that modifies the metadata. I fixed all the issues in my recent commit and test_youtube_video/test_load_data passed on my end. Let me know if it's all good for you

anantoj avatar May 14 '24 03:05 anantoj

@anantoj thanks for your fast answer! I tested your code with this url: https://www.youtube.com/watch?v=RG7Edrbhdpk. I get an error "Could not retrieve a transcript for the video". Can you have a look into it? Besides that, your code works fine for me. Thanks!

LeonieFreisinger avatar May 17 '24 07:05 LeonieFreisinger

@LeonieFreisinger it seems that your video just doesn't have a transcript, which causes the issue.

It's also a warning raised by youtube_transcript_api instead of an error. The error, at least on my end, is a ValueError on embedchain's Youtube video loader here which is raised when the transcript is not found.

anantoj avatar May 17 '24 07:05 anantoj

@anantoj You are right! So it should be fine. @deshraj @taranjeet The changes in the PR look good to me. Can you please re-trigger the workflows and merge the PR?

LeonieFreisinger avatar May 17 '24 07:05 LeonieFreisinger

@anantoj Can you please resolve the merge conflicts so that we can merge the PR?

Dev-Khant avatar Jun 04 '24 06:06 Dev-Khant

Hi @Dev-Khant, I've resolved the conflicts

anantoj avatar Jun 04 '24 08:06 anantoj

@anantoj can you please fix the lint issues in the PR? You can run make lint format to format the code. Once that is fixed, we are ready to merge the PR.

deshraj avatar Jun 05 '24 17:06 deshraj

Hi @deshraj, I fixed the lint issues.

anantoj avatar Jun 06 '24 06:06 anantoj