
Download transcripts of Khan Academy

Open huu4ontocord opened this issue 2 years ago • 30 comments

Is it possible for someone who has some experience with YouTube or other video scraping to create transcripts of free/open-source lectures from YouTube or other video sites, preferably where there is a single speaker? https://www.youtube.com/@khanacademy/playlists

Khan Academy is a 501(c)(3) nonprofit organization with the mission of providing a free, world-class education for anyone, anywhere. Our interactive practice problems, articles, and videos help students succeed in math, biology, chemistry, physics, history, economics, finance, grammar, and many other topics.

In this kind of format potentially https://karpathy.ai/lexicap/

You could use youtube-dl or https://github.com/m1guelpf/yt-whisper.

Please connect with Rallio and share the results.

huu4ontocord avatar Dec 30 '22 22:12 huu4ontocord

Sounds interesting. I've never scraped video before, but I'm about to look into it now. I'll post if I get a solution. Anyone else with a concrete solution, feel free to pick this up, though.

Shtoner avatar Dec 30 '22 23:12 Shtoner

Would the CreativeCommons NonCommercial ShareAlike license on their videos be a problem? https://support.khanacademy.org/hc/en-us/articles/202262954-Can-I-use-Khan-Academy-s-videos-name-materials-links-in-my-project-

It seems like the ShareAlike license on the Khan Academy videos might entangle the transcripts and anything that incorporates them. https://www.theregister.com/2022/10/19/github_copilot_copyright/

CryptoFewka avatar Dec 31 '22 00:12 CryptoFewka

@ontocord there's a pip package youtube-transcript-api which lets us fetch the transcript of a video along with time stamps.

shreydan avatar Dec 31 '22 07:12 shreydan

I think it's also possible to use whisper https://openai.com/blog/whisper/ for getting transcripts. I'll give it a try.
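
A minimal sketch of what that could look like with the open-source whisper package (pip install openai-whisper; ffmpeg required). The audio file name here is just a placeholder:

# transcribe a single audio file with Whisper
import whisper

model = whisper.load_model("base")        # "small"/"medium" are more accurate but slower
result = model.transcribe("lecture.mp3")  # placeholder file; returns text plus timestamped segments

print(result["text"])
for seg in result["segments"]:
    print(seg["start"], seg["end"], seg["text"])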

gokaykucuk avatar Dec 31 '22 14:12 gokaykucuk

Working on this for another project right now. Here is some code for this functionality.

References:

  • Modified code from: https://www.geeksforgeeks.org/python-downloading-captions-from-youtube/
# importing modules
from youtube_transcript_api import YouTubeTranscriptApi

# fetch the transcript as a list of dictionaries
# ({'text': ..., 'start': ..., 'duration': ...}) via .get_transcript()
yt_id = "Sqqt1kU52I8"
srt = YouTubeTranscriptApi.get_transcript(yt_id)

# create (or overwrite) an output file named after the video id
filename_out = f'yt_subtitles_{yt_id}.txt'
with open(filename_out, "w") as f:
    # iterate through each transcript segment
    for line in srt:
        # print just the text, but write the full segment (text + timestamps) to the file
        print(line['text'])
        f.write("{}\n".format(line))

jon-chun avatar Dec 31 '22 14:12 jon-chun

You'll also need to get all the YouTube video IDs if you want to scrape a particular YT user's channel.

References:

  • https://stackoverflow.com/questions/73827182/youtube-api-get-all-videos-of-channel-with-more-than-500-videos
  • https://developers.google.com/youtube/v3/getting-started
  • https://www.youtube.com/channel/UCvShfJtvC2owV0AFi_qyy
import pandas as pd
import requests
import datetime

api_key = '***'
# channel_id = 'UCXDi1F7Q-cJQ4mGGavtfYRQ'
channel_id = 'UCvShfJtvC2owV0AFi_qyykA'  # The Helix Center

# build dataframe
df = pd.DataFrame(columns=['channel_id',
                           'video_id',
                           'video_title',
                           'published_date',
                           'type'])


# first request
my_url = 'https://youtube.googleapis.com/youtube/v3/search?part=snippet&channelId=' + channel_id + '&maxResults=50&order=date&type=video&key=' + api_key
response = requests.get(url=my_url).json()
print(my_url)
total_results = response['pageInfo']['totalResults']

# save the channel_id and video_id in a dataframe
for i in response['items']:

    channel_id = i['snippet']['channelId']
    video_id = i['id']['videoId']
    published_date = i['snippet']['publishedAt']
    video_title = i['snippet']['title']
    vid_type = i['id']['kind']

    df = pd.concat([df, pd.DataFrame([{
        'channel_id': channel_id,
        'video_id': video_id,
        'video_title': video_title,
        'published_date': published_date,
        'type': vid_type
    }])], ignore_index=True)

# page backwards through the channel using publishedBefore; stop once a page
# adds no new video (the last two rows repeat the same video_id)
while df['video_id'][len(df)-1] != df['video_id'][len(df)-2]:
    url = 'https://youtube.googleapis.com/youtube/v3/search?part=snippet&channelId=' + channel_id + '&maxResults=50&order=date&type=video&publishedBefore=' + published_date + '&key=' + api_key

    response = requests.get(url=url).json()
    total_results = response['pageInfo']['totalResults']

    for i in response['items']:
        channel_id = i['snippet']['channelId']
        video_id = i['id']['videoId']
        published_date = i['snippet']['publishedAt']
        video_title = i['snippet']['title']
        vid_type = i['id']['kind']

        df = pd.concat([df, pd.DataFrame([{
            'channel_id': channel_id,
            'video_id': video_id,
            'video_title': video_title,
            'published_date': published_date,
            'type': vid_type
        }])], ignore_index=True)

# because the last row is a duplicate we need to delete the last row
df.drop(df.tail(1).index, inplace=True)

# df.to_csv('C:\\Users\\...\\data\\video_ids_' + datetime.datetime.now().strftime('%Y-%m-%d') + '.csv')

df.to_csv('./data/video_ids_' + datetime.datetime.now().strftime('%Y-%m-%d') + '.csv')

jon-chun avatar Dec 31 '22 14:12 jon-chun

I am currently comparing OpenAI Whisper transcription with YouTube-generated transcription.

Research suggests that OpenAI Whisper is better, but this is largely hearsay. I suspect Google is constantly upgrading its transcription models, so any opinion may quickly become out of date.

jon-chun avatar Dec 31 '22 14:12 jon-chun

I came here to post an issue about this idea. Seems you guys are already on track. What I had in mind was Whisper and a lot of podcast content, as it's one of the low-hanging fruits in terms of Q&A. Let's see what @yk has to say about this. I envision us having a working prototype that extracts Q&A from speech.

futurisold avatar Jan 01 '23 09:01 futurisold

@leoentersthevoid That is a wonderful idea.

In fact, I just looked into the possibility, and this is definitely feasible.

I wonder whether, from a licensing perspective, we can use YouTube-based podcasts or scrape podcast audio data from podcast platforms like Apple/Google Podcasts for training.

musabgultekin avatar Jan 01 '23 14:01 musabgultekin

@christophschuhmann do you have ideas about what the license situation of transcribing YT videos and podcasts looks like?

Also, I definitely see potential in collecting diverse QA data. Podcasts, interviews, and the like seem like a good source, except they might be a bit too lengthy and too informal, but I guess that can be fixed.

We just need to make sure that we also go beyond this QA data. ChatGPT can not only answer questions but also write emails & code, be your friend, etc., and the task diversity of the training data needs to reflect that.

yk avatar Jan 01 '23 20:01 yk

This concept would be beneficial as part of the whole if the chatbot took data from the user and related new material to information the user already has a base understanding of. It could then find related concepts within the educational videos in the scraped data and pass new information to the user so that they can quickly grasp material they normally wouldn't look for. I believe learning by relating new ideas to familiar ones is the fastest and most efficient way of learning. If you already have the user's information in a secure location, scraped from all the sources the user agrees to, that would speed this up dramatically.

Alternatively, there could be a platform/webpage that individual users link their accounts to, which would speed up the data harvest and store the data for future iterations. It could be equipped with web tools that show a user's progress and the available learning paths based on the information available.

Panda20090 avatar Jan 02 '23 01:01 Panda20090

@Panda20090 fully agree. The best place for this might actually be once we add retrieval to the assistant. then the entire licensing problem also vanishes.

yk avatar Jan 02 '23 11:01 yk

@marianna13 has code for doing YT subtitles too and has already scraped some. We need this data to create augmented Q/A training data for immediate experiments. @Shtoner Please discuss with @Rallio67 in the Discord if you can get this data to him.

Re long-term plans for scraping infra, @Panda20090, please open another issue - or better, discuss in the LAION/video2dataset repo and Discord channel. Very cool ideas.

Re licensing, it depends on the country. Where LAION is located, it is, as a research non-profit, relying on the text and data mining exception, as I am told. @christophschuhmann

huu4ontocord avatar Jan 05 '23 18:01 huu4ontocord

#259

Shtoner avatar Jan 07 '23 15:01 Shtoner

We can mass-download the YouTube subtitles associated with the videos. That way we will also have translations of the same English text in multiple languages, so there is no need to transcribe them automatically. We should also get the annotations if possible and overwrite the parts of the speech where Khan makes a slight error (those annotations usually overlay on-screen errors so the video does not have to be re-recorded).

I think it would also be worthwhile to scrape the user questions and answers below the videos (after filtering for quality). Those are already in a chat Q&A format.

I am currently going through their Terms of Service to ensure it is fine to use their content for training language models.

sedthh avatar Jan 19 '23 10:01 sedthh

YT-DLP has the option to save subtitles
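
For example, a rough sketch using yt-dlp's Python API (untested here; the option keys below are yt-dlp embedding options):

# fetch subtitles for a whole channel without downloading the videos
import yt_dlp

ydl_opts = {
    'skip_download': True,        # metadata/subtitles only
    'writesubtitles': True,       # uploader-provided subtitle tracks
    'writeautomaticsub': True,    # fall back to auto-generated captions
    'subtitleslangs': ['en'],     # or ['all'] for every available language
}
with yt_dlp.YoutubeDL(ydl_opts) as ydl:
    ydl.download(['https://www.youtube.com/@khanacademy'])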

SnappierSoap318 avatar Jan 19 '23 14:01 SnappierSoap318

Nice, I was thinking about https://pytube.io/en/latest/user/captions.html
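
Something like this sketch with pytube's caption API (per the docs linked above; pytube's caption support has been flaky across versions, and the video id is reused from the example earlier in the thread):

# pull an English caption track for a single video with pytube
from pytube import YouTube

yt = YouTube('https://www.youtube.com/watch?v=Sqqt1kU52I8')
caption = yt.captions.get_by_language_code('en')  # may be None if no English track exists
if caption:
    print(caption.generate_srt_captions())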

sedthh avatar Jan 19 '23 14:01 sedthh

via https://www.khanacademy.org/about/tos

8. Prohibited Conduct
YOU AGREE NOT TO:
[...]
8.7. develop, support or use software, devices, scripts, robots, or any other means or processes (including crawlers, browser plugins and add-ons, or any other technology) to scrape the Services or otherwise copy lessons and other data from the Services;
8.8. Use bots or other automated methods to access the Services;

So we should contact either [email protected] or [email protected] for approval before scraping their data.

sedthh avatar Jan 19 '23 18:01 sedthh

Oh it means we can't scrape their data?

marianna13 avatar Jan 19 '23 18:01 marianna13

Yt-dlp is better. It has more options and fewer limitations.

marianna13 avatar Jan 19 '23 18:01 marianna13

We should ask for their approval first. And even in the case of approval, we should be extremely careful when scraping:

  1. sponsored videos (they have different licences)
Unless expressly indicated on the Services that a particular item of Licensed Educational Content is made available to Users under alternate license terms, you may not download, distribute, sell, lease, modify, or otherwise provide access to the Licensed Educational Content to any third party.
  2. user-generated data (most of the users are children)

sedthh avatar Jan 19 '23 18:01 sedthh

Aren't we technically scraping their content from YouTube and not from their site, so doesn't it come under YouTube's ToS?

SnappierSoap318 avatar Jan 19 '23 18:01 SnappierSoap318

I guess the third party's ToS would take precedence in that case, so scraping YouTube only should be OK?

Does anyone else have a take on this?

sedthh avatar Jan 19 '23 19:01 sedthh

#! /bin/bash

CHANNEL='https://www.youtube.com/@khanacademy'

# list every video URL on the channel (flat playlist, one JSON object per video)
VIDEO_URLS=$(yt-dlp -j --flat-playlist "$CHANNEL" | jq -r '.url')

# download all available subtitle tracks without downloading the videos themselves
for VIDEO_URL in $VIDEO_URLS
do
    youtube-dl --write-sub --all-subs --skip-download "$VIDEO_URL"
done

By the way, yt-dlp does not do the trick.

fcolecumberri avatar Jan 25 '23 01:01 fcolecumberri

@fcolecumberri yt-dlp works from the command line for me. I am currently downloading Khan Academy's audio files. Planning to use Whisper for the text.

Shtoner avatar Feb 05 '23 07:02 Shtoner

@Shtoner I meant that yt-dlp didn't work well with the --write-sub --all-subs flags.

fcolecumberri avatar Feb 05 '23 15:02 fcolecumberri

I think it's also possible to use whisper https://openai.com/blog/whisper/ for getting transcripts. I'll give it a try.

I've tried this with a few videos for translation, and it seems to work for a few minutes, then gets stuck repeating the same thing over and over. Dunno if anyone has had a better experience with it, but I couldn't get it to work without breaking the files into segments first, and it messed things up when the files were broken in the wrong places.
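
A rough sketch of that segment-first workflow, assuming pydub (plus ffmpeg) for the splitting; the chunk length is arbitrary, and naive fixed-length cuts are exactly what causes the "wrong places" problem mentioned above:

# split audio into fixed-length chunks with pydub, then transcribe each chunk with Whisper
from pydub import AudioSegment
import whisper

model = whisper.load_model("base")
audio = AudioSegment.from_file("lecture.mp3")   # placeholder file name

chunk_ms = 5 * 60 * 1000   # 5-minute chunks; shorter chunks make repetition loops less costly
texts = []
for start in range(0, len(audio), chunk_ms):
    chunk = audio[start:start + chunk_ms]
    chunk.export("chunk.mp3", format="mp3")
    texts.append(model.transcribe("chunk.mp3")["text"])

print(" ".join(texts))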

bitplane avatar Feb 24 '23 12:02 bitplane

I think it would be legal to scrape the Khan Academy site for anything that doesn't require a login to access. This 9th Circuit decision against LinkedIn is pretty clear: a company was legally allowed to scrape all no-login-required data from LinkedIn even when this was against the ToS. There seems to be a good amount of videos and content on Khan Academy that doesn't require a login. https://www.shrm.org/resourcesandtools/hr-topics/technology/pages/scraping-public-data-from-linkedin-is-legal.aspx

"The 9th Circuit's latest decision relied on the Supreme Court's determination in Van Buren that when information is publicly accessible, no authorization to use that data is required.

The appellate court distinguished between access to publicly available profile information on LinkedIn, which cannot be "unauthorized," and access to information on sites which are restricted to users who sign in to the site with a username and password.

Tse said that what it boils down to is that companies that maintain publicly available information on their websites cannot rely on the CFAA to prohibit others from scraping that data, even if the companies subsequently revoke access to the information, or if data scraping is a violation of the websites' terms of use."

escottgoodwin avatar Apr 18 '23 01:04 escottgoodwin

@Shtoner This issue is currently assigned to you. Are you still working on it?

andreaskoepf avatar May 05 '23 12:05 andreaskoepf

No, I just left a comment about legality of downloading transcripts.

escottgoodwin avatar May 05 '23 15:05 escottgoodwin