edx-platform icon indicating copy to clipboard operation
edx-platform copied to clipboard

fix importing transcript from YouTube TNL-9460

Open iamCristYe opened this issue 3 years ago • 26 comments

Description

This PR fixes TNL-9460. In studio, teachers can now import subtitles from YouTube.

https://user-images.githubusercontent.com/12642018/175310264-d2d0de87-4e1e-4283-b030-a0d53ab12a9d.mp4

Supporting information

TNL-9460

Testing instructions

Insert a video in Studio, and try importing subtitles from YouTube.

Deadline

None

iamCristYe avatar Jun 23 '22 13:06 iamCristYe

Thanks for the pull request, @iamCristYe! Please note that it may take us up to several weeks or months to complete a review and merge your PR.

Feel free to add as much of the following information to the ticket as you can:

  • supporting documentation
  • Open edX discussion forum threads
  • timeline information ("this must be merged by XX date", and why that is)
  • partner information ("this is a course on edx.org")
  • any other information that can help Product understand the context for the PR

All technical communication about the code itself will be done via the GitHub pull request interface. As a reminder, our process documentation is here.

Please let us know once your PR is ready for our review and all tests are green.

openedx-webhooks avatar Jun 23 '22 13:06 openedx-webhooks

@iamCristYe Thanks so much for opening this. @connorhaugh FYI this PR

natabene avatar Jul 20 '22 19:07 natabene

@connorhaugh Any updates?

iamCristYe avatar Aug 09 '22 05:08 iamCristYe

It looks like there's an issue with the CLA :thinking: @natabene

mariajgrimaldi avatar Sep 13 '22 19:09 mariajgrimaldi

@mariajgrimaldi CLA is actually fine, it is green now when I look at it.

natabene avatar Sep 14 '22 13:09 natabene

Oh, maybe there was some error on my part @natabene

mariajgrimaldi avatar Sep 14 '22 13:09 mariajgrimaldi

Any updates?

iamCristYe avatar Oct 06 '22 07:10 iamCristYe

I can review this one. But first, can you update your branch and fix the tests? Thanks

mariajgrimaldi avatar Oct 10 '22 18:10 mariajgrimaldi

Oops, try fetching the latest commits accidentally closed the PR

iamCristYe avatar Oct 11 '22 03:10 iamCristYe

@mariajgrimaldi Could you please initiate the workflows?

iamCristYe avatar Oct 11 '22 03:10 iamCristYe

Done @iamCristYe! But they're still failing. Let me know if I have to approve the execution again

mariajgrimaldi avatar Oct 11 '22 15:10 mariajgrimaldi

Also, can you link TNL-9460?

mariajgrimaldi avatar Oct 11 '22 15:10 mariajgrimaldi

@mariajgrimaldi Looks like you need to approve the workflows every time I update something☹. Also, I don't understand the tests about "good_id_2", as I don't see a YouTube video with that ID. I'm not clear how I can link it with TNL-9460 as well.

iamCristYe avatar Oct 12 '22 02:10 iamCristYe

@iamCristYe Here you go.

natabene avatar Oct 12 '22 02:10 natabene

@natabene Thank you. But the tests are going to fail anyway because of that "good_id_2" YouTube video. I really don't know what's that about.

iamCristYe avatar Oct 12 '22 02:10 iamCristYe

Do you have the TNL-9460 issue at hand? @iamCristYe, so this PR context is public? Unfortunately, I haven't been able to find it. I'll test this locally and then take a look into the tests.

mariajgrimaldi avatar Oct 12 '22 12:10 mariajgrimaldi

@mariajgrimaldi thanks for your comments. For more contexts, you may refer to https://github.com/openedx/edx-platform/pull/28509 and https://openedx.atlassian.net/browse/TNL-9460

Basically the old API is no longer working, so I introduced a fix using the new approach. Let me know if you have further questions. Thanks

iamCristYe avatar Oct 13 '22 04:10 iamCristYe

@iamCristYe: hello! Do these recommendations make sense to you? Let me know.

mariajgrimaldi avatar Oct 19 '22 13:10 mariajgrimaldi

@mariajgrimaldi Yes I'm working on it

iamCristYe avatar Oct 19 '22 13:10 iamCristYe

@mariajgrimaldi Sorry for the delay. Let me see if I can introduce tests after other parts are accepted.

iamCristYe avatar Nov 01 '22 08:11 iamCristYe

@iamCristYe I work in the same group as Connor Haugh and have started looking at this PR, especially with regards to unit tests. I'm well along with a unit test for the new functionality in get_transcript_link_from_youtube(), and hope to be able to share it by mid-day tomorrow. Have you started on this already? If so, we should compare notes to avoid duplication of effort.

That method is responsible for scraping the HTML returned for the youtube video page to gain subtitle information for a youtube video. The URL used to be available from an unsupported timedtext API call, but this capability was lost last December. get_transcript_link_from_youtube() works around this loss by scraping the URL from the HTML, for the required timedtext magic incantation. While the API call remains largely the same, it now calls for query parameters that can't be independently generated and must be scraped.

The unit test I'm working on mocks the GET operation, returning canned HTML, and verifies that ``get_transcript_link_from_youtube()` and its underlying regex machinery & \0026 (ampersand) decoding correctly extract the desired URL.

bszabo avatar Nov 01 '22 18:11 bszabo

@bszabo thanks for sharing the updates. I haven't started working on the unit tests. So maybe we can integrate your unit tests into my code as a single PR?

iamCristYe avatar Nov 02 '22 01:11 iamCristYe

@iamCristYe I tried pushing my unit tests onto a new branch on this PR's repo, but don't appear to have appropriate permissions. Can you help with this?

Do you have access to https://github.com/openedx/edx-platform/pull/31235? This is a feature branch I created off of the edx-platform repo. The new unit tests file is contained in that PR and is named xmodule/tests/test_transcripts_utils.py

bszabo avatar Nov 03 '22 17:11 bszabo

@bszabo I've sent an invitation to you so that you can have write access to https://github.com/iamCristYe/edx-platform/tree/fix-youtube

If that doesn't work, please let me know and maybe I can manually merge your edits?

iamCristYe avatar Nov 04 '22 04:11 iamCristYe

I just created PR https://github.com/iamCristYe/edx-platform/pull/1 with my proposed unit tests. @iamCristYe is specified as a reviewer, but I was unable to add @mariajgrimaldi as one. The PR is for a peer branch in this repo, with a request to merge onto the branch being reviewed here (fix-youtube).

bszabo avatar Nov 04 '22 14:11 bszabo

Thank you @bszabo Let me review the details of the tests

iamCristYe avatar Nov 04 '22 14:11 iamCristYe

@iamCristYe @mariajgrimaldi Where do we stand with regards to this PR? I've run a branch with these changes in it against devstack and confirmed that for a small set of youtube videos the Studio successfully obtains English-language transcript files from youtube videos. (And that Studio could not do so without the fix). Given that the underlying mechanism for these changes is regex-based HTML scraping, the possibility exists that minor changes will be required once the feature is rolled out and tested with a broader set of videos. This just suggests that the earlier the changes are rolled out, the sooner we'll arrive at a robust solution.

This is a gentle nudge to wrap up the PR phase of this work so we can enter the production feedback stage. Thanks!

bszabo avatar Nov 07 '22 19:11 bszabo

Hello again @iamCristYe @bszabo. Sorry for the delay, I was away for a couple of days. I left just a few more comments. Also, can we add the tests to this PR or is there something left to do?

mariajgrimaldi avatar Nov 10 '22 14:11 mariajgrimaldi

Hi @mariajgrimaldi. I, too, am trying to understand what work remains. From your perspective, other than having the new comments you posted addressed, and having the unit tests merged, is there any other outstanding work blocking a merge?

bszabo avatar Nov 10 '22 16:11 bszabo

@bszabo: I don't think so!

mariajgrimaldi avatar Nov 10 '22 16:11 mariajgrimaldi