Open-Assistant
Download transcripts of Khan Academy
Is it possible for someone who has some experience with YouTube or other video scraping to create transcripts of free/open-source lectures from YouTube or other video sites, preferably where there is a single speaker? https://www.youtube.com/@khanacademy/playlists
Khan Academy is a 501(c)(3) nonprofit organization with the mission of providing a free, world-class education for anyone, anywhere. Our interactive practice problems, articles, and videos help students succeed in math, biology, chemistry, physics, history, economics, finance, grammar, and many other topics.
In this kind of format potentially https://karpathy.ai/lexicap/
You could use youtube-dl or https://github.com/m1guelpf/yt-whisper
Please connect with rallio with your results.
Sounds interesting. I've never scraped video before, but I'm about to look into it now. Will post if I get a solution. Anyone else with a concrete solution, feel free to pick this up, though.
Would the Creative Commons NonCommercial-ShareAlike license on their videos be a problem? https://support.khanacademy.org/hc/en-us/articles/202262954-Can-I-use-Khan-Academy-s-videos-name-materials-links-in-my-project-
It seems like the ShareAlike license on the Khan Academy videos might entangle the transcripts and things that incorporate them. https://www.theregister.com/2022/10/19/github_copilot_copyright/
@ontocord there's a pip package youtube-transcript-api which lets us fetch the transcript of a video along with time stamps.
I think it's also possible to use whisper https://openai.com/blog/whisper/ for getting transcripts. I'll give it a try.
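For reference, here's a minimal sketch of what that could look like with the openai-whisper Python package; the model size and audio file name are placeholders, not something tested on Khan Academy content yet:

import whisper

# load one of the pretrained models; larger models are slower but more accurate
model = whisper.load_model("base")

# transcribe a locally downloaded audio file (placeholder path)
result = model.transcribe("khan_academy_lecture.mp3")
print(result["text"])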
Working on this for another project right now. Here is some code for this functionality:
References:
- Modified code from: https://www.geeksforgeeks.org/python-downloading-captions-from-youtube/
# importing modules
from youtube_transcript_api import YouTubeTranscriptApi

# srt is a list of dictionaries (each with 'text', 'start', 'duration')
# returned by the .get_transcript() function
yt_id = "Sqqt1kU52I8"
srt = YouTubeTranscriptApi.get_transcript(yt_id)

# creating or overwriting the output file inside the context manager
filename_out = f'yt_subtitles_{yt_id}.txt'
with open(filename_out, "w") as f:
    # iterating through each element of the list srt
    for line in srt:
        # writing the text of each subtitle entry on a new line
        print(line['text'])
        f.write("{}\n".format(line['text']))
You'll also need to get all YouTube Video IDs if you want to scrape a particular YT User's Channel
References:
- https://stackoverflow.com/questions/73827182/youtube-api-get-all-videos-of-channel-with-more-than-500-videos
- https://developers.google.com/youtube/v3/getting-started
- https://www.youtube.com/channel/UCvShfJtvC2owV0AFi_qyykA
import pandas as pd
import requests
import datetime

api_key = '***'
channel_id = 'UCXDi1F7Q-cJQ4mGGavtfYRQ'
channel_id = 'UCvShfJtvC2owV0AFi_qyykA'  # The Helix Center

# build dataframe
df = pd.DataFrame(columns=['channel_id',
                           'video_id',
                           'video_title',
                           'published_date',
                           'type'])

# first request
my_url = 'https://youtube.googleapis.com/youtube/v3/search?part=snippet&channelId=' + channel_id + '&maxResults=50&order=date&type=video&key=' + api_key
response = requests.get(url=my_url).json()
print(my_url)
total_results = response['pageInfo']['totalResults']

# save the channel_id and video_id in a dataframe
for i in response['items']:
    channel_id = i['snippet']['channelId']
    video_id = i['id']['videoId']
    published_date = i['snippet']['publishedAt']
    video_title = i['snippet']['title']
    vid_type = i['id']['kind']
    df = pd.concat([df, pd.DataFrame([{
        'channel_id': channel_id,
        'video_id': video_id,
        'video_title': video_title,
        'published_date': published_date,
        'type': vid_type
    }])], ignore_index=True)

# keep paginating with publishedBefore until the same last video is returned twice
while df['video_id'][len(df)-1] != df['video_id'][len(df)-2]:
    url = 'https://youtube.googleapis.com/youtube/v3/search?part=snippet&channelId=' + channel_id + '&maxResults=50&order=date&type=video&publishedBefore=' + published_date + '&key=' + api_key
    response = requests.get(url=url).json()
    total_results = response['pageInfo']['totalResults']
    for i in response['items']:
        channel_id = i['snippet']['channelId']
        video_id = i['id']['videoId']
        published_date = i['snippet']['publishedAt']
        video_title = i['snippet']['title']
        vid_type = i['id']['kind']
        df = pd.concat([df, pd.DataFrame([{
            'channel_id': channel_id,
            'video_id': video_id,
            'video_title': video_title,
            'published_date': published_date,
            'type': vid_type
        }])], ignore_index=True)

# because the last row is a duplicate we need to delete the last row
df.drop(df.tail(1).index, inplace=True)

# df.to_csv('C:\\Users\\...\\data\\video_ids_' + datetime.datetime.now().strftime('%Y-%m-%d') + '.csv')
df.to_csv('./data/video_ids_' + datetime.datetime.now().strftime('%Y-%m-%d') + '.csv')
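If the search endpoint turns out to miss videos or burn too much quota, an alternative sketch (untested here) is to page through the channel's uploads playlist with the playlistItems endpoint; by convention the UU... playlist ID is the UC... channel ID with the prefix swapped, and api_key stays a placeholder:

import requests

api_key = '***'  # placeholder, as above
channel_id = 'UCvShfJtvC2owV0AFi_qyykA'
uploads_playlist = 'UU' + channel_id[2:]  # a channel's uploads playlist ID

video_ids = []
page_token = ''
while True:
    url = ('https://youtube.googleapis.com/youtube/v3/playlistItems'
           '?part=snippet&maxResults=50'
           '&playlistId=' + uploads_playlist +
           '&pageToken=' + page_token +
           '&key=' + api_key)
    response = requests.get(url).json()
    for item in response.get('items', []):
        video_ids.append(item['snippet']['resourceId']['videoId'])
    page_token = response.get('nextPageToken')
    if not page_token:
        break

print(len(video_ids), 'video ids collected')

This also avoids the duplicate-row workaround above, since nextPageToken tells you exactly when to stop.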
I am currently comparing OpenAI Whisper transcription vs. YouTube-generated transcription.
Research suggests that OpenAI Whisper is better, but this is largely hearsay. I suspect Google is constantly upgrading its transcription models, so any opinion may be out of date quickly.
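One way to make the comparison less anecdotal is to score both transcripts against a hand-corrected reference using word error rate; a rough sketch with the jiwer package, where the three strings are placeholder examples rather than real output:

import jiwer

# placeholder strings: a manually corrected reference and the two candidate transcripts
reference = "the mitochondria is the powerhouse of the cell"
whisper_hypothesis = "the mitochondria is the powerhouse of the cell"
youtube_hypothesis = "the mitochondria is the power house of the cell"

# lower WER means closer to the reference
print("whisper WER:", jiwer.wer(reference, whisper_hypothesis))
print("youtube WER:", jiwer.wer(reference, youtube_hypothesis))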
I came here to post an issue about this idea. Seems you guys are already on track. What I had in mind is Whisper and a lot of podcast content, as it's one of the low-hanging fruits in terms of Q&A. Let's see what @yk has to say about this. I envision us having a working prototype that extracts Q&A from speech.
@leoentersthevoid That is a wonderful idea.
In fact I just looked into the possibility and this is definitely feasible.
I wonder if we can use YouTube-based podcasts, or scrape podcast audio from platforms like Apple/Google Podcasts, for training, from a licensing perspective.
@christophschuhmann do you have ideas about what the license situation of transcribing YT videos and podcasts looks like?
Also, I definitely see potential in collecting diverse QA data. Podcasts, interviews, and the like seem like a good source, except they might be a bit too lengthy, and too informal, but I guess that can be fixed.
We just need to make sure that we also go beyond this QA data. ChatGPT can not only answer questions, but also write email & code, be your friend, etc. and the task diversity of the training data needs to reflect that.
This concept with the chatbot would be beneficial as part of the whole if it would take the data from the user and relate it to the information the user already has a base understanding of. Then it can find relational concepts within educational videos from the scraped data and pass new information to the user, so that they can quickly grasp new information they normally wouldn't look for. I do believe relatability learning is the fastest and most efficient way of learning things. If you already have the information from the user in a secure location, by scraping it from all sources that the user agrees to, then this would speed things up dramatically.
Alternatively, there could be a platform/webpage that individual users could link their accounts to, to speed up the data harvest, and the data could be stored there across iterations. This could be equipped with web tools to show a user's progress and available paths of information based on the information available.
@Panda20090 Fully agree. The best place for this might actually be once we add retrieval to the assistant. Then the entire licensing problem also vanishes.
@marianna13 has code for doing YT subtitles too and has already done some scraping. We need this data to create augmented Q/A training data for immediate experiments. @Shtoner Please discuss with @Rallio67 in the Discord if you can get this data to him.
Re long term plans for scraping infra @Panda20090 , please open another issue - or better discuss in the LAION/video2dataset repo and discord channel. Very cool ideas.
Re licensing, it depends on the country. Where LAION is located, as a research non-profit it relies on the text and data mining exception, as I am told. @christophschuhmann
#259
We can mass download the YouTube subtitles associated with the videos. That way we will also have the translations of the same English text in multiple languages, so there is no need to automatically transcribe them. We should also get the annotations if possible and overwrite the parts of the speech where Khan makes a slight error (those annotations usually overlap errors made on screen so the video does not have to be rerecorded).
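For the multi-language part, youtube-transcript-api can also enumerate which manual and auto-generated tracks a video exposes; a rough sketch, assuming the list_transcripts interface of the package version current at the time of writing:

from youtube_transcript_api import YouTubeTranscriptApi

yt_id = "Sqqt1kU52I8"  # same example video as above

# enumerate the subtitle tracks YouTube exposes for this video
for transcript in YouTubeTranscriptApi.list_transcripts(yt_id):
    kind = "auto-generated" if transcript.is_generated else "manual"
    print(transcript.language_code, kind)
    # transcript.fetch() would return the actual subtitle entries,
    # and transcript.translate('de') a translated track where available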
I think it would be worthwhile to also scrape the user questions and answers below the videos (after filtering for quality). Those are already in a chat Q&A format.
I am currently going through their Terms of Service to ensure it is fine to use their content for training language models.
yt-dlp has the option to save subtitles.
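For example, a minimal sketch via yt-dlp's Python interface, reusing the example video ID from earlier in the thread:

import yt_dlp

# download only the subtitle files, not the video itself
ydl_opts = {
    'skip_download': True,       # no video/audio
    'writesubtitles': True,      # manually created subtitles
    'writeautomaticsub': True,   # auto-generated subtitles
    'subtitleslangs': ['en'],    # restrict to English tracks
}

with yt_dlp.YoutubeDL(ydl_opts) as ydl:
    ydl.download(['https://www.youtube.com/watch?v=Sqqt1kU52I8'])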
Nice, I was thinking about https://pytube.io/en/latest/user/captions.html
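Something like this, assuming pytube's Caption interface still works against YouTube's current pages (I haven't verified it):

from pytube import YouTube

# example video from earlier in the thread
yt = YouTube('https://www.youtube.com/watch?v=Sqqt1kU52I8')

# captions are keyed by track code, e.g. 'en', or 'a.en' for auto-generated
caption = yt.captions.get_by_language_code('en')
if caption is not None:
    print(caption.generate_srt_captions())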
via https://www.khanacademy.org/about/tos
8. Prohibited Conduct
YOU AGREE NOT TO:
[...]
8.7. develop, support or use software, devices, scripts, robots, or any other means or processes (including crawlers, browser plugins and add-ons, or any other technology) to scrape the Services or otherwise copy lessons and other data from the Services;
8.8. Use bots or other automated methods to access the Services;
So we should contact either [email protected] or [email protected] for approval before scraping their data.
Oh it means we can't scrape their data?
yt-dlp is better. It has more options and fewer limitations.
We should ask for their approval first. And even in the case of an approval we should be extremely careful when scraping:
- sponsored videos (they have different licences)
Unless expressly indicated on the Services that a particular item of Licensed Educational Content is made available to Users under alternate license terms, you may not download, distribute, sell, lease, modify, or otherwise provide access to the Licensed Educational Content to any third party.
- user generated data (most of the users are children)
Aren't we technically scraping their content from YouTube and not from their site, so doesn't it come under YouTube's ToS?
I guess the third party's ToS would take precedence in that case, and scraping YouTube only should be OK?
Does anyone else have a take on this?
#! /bin/bash

CHANNEL='https://www.youtube.com/@khanacademy'
VIDEO_URLS=$(yt-dlp -j --flat-playlist "$CHANNEL" | jq -r '.url')

for VIDEO_URL in $VIDEO_URLS
do
    youtube-dl --write-sub --all-subs --skip-download "$VIDEO_URL"
done
By the way, yt-dlp does not do the trick.
@fcolecumberri yt-dlp works from the command line for me. I am currently downloading Khan Academy's audio files. Planning to use Whisper for the text.
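Roughly along these lines, sketched with yt-dlp's Python API rather than the exact command I ran; it requires ffmpeg on PATH, and the URL is just the example video from above:

import yt_dlp

# fetch the best audio stream and convert it to mp3 for Whisper
ydl_opts = {
    'format': 'bestaudio/best',
    'outtmpl': '%(id)s.%(ext)s',
    'postprocessors': [{
        'key': 'FFmpegExtractAudio',
        'preferredcodec': 'mp3',
    }],
}

with yt_dlp.YoutubeDL(ydl_opts) as ydl:
    ydl.download(['https://www.youtube.com/watch?v=Sqqt1kU52I8'])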
@Shtoner I meant that yt-dlp didn't work well with the --write-sub --all-subs flags.
I think it's also possible to use whisper https://openai.com/blog/whisper/ for getting transcripts. I'll give it a try.
I've tried this with a few videos for translation, and it seems to work for a few minutes, then gets stuck repeating the same thing over and over. Don't know if anyone has had a better experience with it, but I couldn't get it to work without breaking the files into segments first, and that ended up breaking them in the wrong places.
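For anyone trying the same thing, one possible way to do the splitting is ffmpeg's segment muxer; the 10-minute chunk length and file names below are arbitrary, and fixed-length cuts can still land mid-sentence, which is probably where the "wrong places" problem comes from:

import subprocess

# split a long recording into 10-minute chunks without re-encoding
subprocess.run([
    'ffmpeg', '-i', 'lecture.mp3',
    '-f', 'segment',          # use the segment muxer
    '-segment_time', '600',   # chunk length in seconds
    '-c', 'copy',             # stream copy, no re-encode
    'lecture_%03d.mp3',
], check=True)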
I think it would be legal to scrape the Khan Academy site for anything that doesn't require a login to access. This 9th Circuit decision against LinkedIn is pretty clear: a company was legally allowed to scrape all no-login-required data from LinkedIn even when this was against the ToS. There seems to be a good amount of videos and content on Khan Academy that doesn't require a login. https://www.shrm.org/resourcesandtools/hr-topics/technology/pages/scraping-public-data-from-linkedin-is-legal.aspx
"The 9th Circuit's latest decision relied on the Supreme Court's determination in Van Buren that when information is publicly accessible, no authorization to use that data is required.
The appellate court distinguished between access to publicly available profile information on LinkedIn, which cannot be "unauthorized," and access to information on sites which are restricted to users who sign in to the site with a username and password.
Tse said that what it boils down to is that companies that maintain publicly available information on their websites cannot rely on the CFAA to prohibit others from scraping that data, even if the companies subsequently revoke access to the information, or if data scraping is a violation of the websites' terms of use."
@Shtoner This issue is currently assigned to you. Are you still working on it?
No, I just left a comment about the legality of downloading transcripts.