
Train on Youtube Video Subtitles

wakaztahir opened this issue 3 years ago • 44 comments

I had this idea:

YouTube video subtitles are the written form of human speech, and there are so many great videos out there: courses and content that are very precise in teaching and solving different problems, and very human-like.

I think YouTube subtitles could be a great addition to the training dataset. Programming courses would be a great addition.

wakaztahir avatar Jan 02 '23 05:01 wakaztahir

+1 from me. If it becomes a task, I'd like to contribute.

totuta avatar Jan 02 '23 07:01 totuta

I like it, there is also #184 related. The issue is that we're trying to build an assistant, not a general language model, so we'll have to be very careful to filter heavily down to real "instruction-fulfillment" interactions. @totuta I've assigned you, @wakaztahir would you also be up for contributing actively?

The output of this task should not be the data itself, but code & instructions to scrape, filter, clean, & process that data.

@christophschuhmann what are the license considerations here?

yk avatar Jan 02 '23 12:01 yk

@wakaztahir, I will work to come up with an MVP solution, v0.1.

Although I will start working locally, where would be the right place for the scripts to reside, @yk?

totuta avatar Jan 04 '23 05:01 totuta

Hi everyone! I already have a working pipeline for YouTube subtitle extraction, and I already got subtitles for 1.5M+ videos. Regarding copyright, I don't think subtitles are subject to strict copyright restrictions (see here).

marianna13 avatar Jan 04 '23 09:01 marianna13

@marianna13 that's great. Can you share the script and possibly the result? And anything I can help here?

totuta avatar Jan 04 '23 21:01 totuta

When capturing this information, try to think of how you can convert the dialog into instruction->answer pairs if at all possible. So not just one instruction->answer, but a series. For example, if this is a how-to (see HowTo100M), there might be dialog like: "How might we make this cake? Well first, we need to get the ingredients. As you can see, I'm doing XYZ. What do we do next? We need to mix, etc." You can then break that dialog up into [instruction->answer][instruction->answer]. If you have a notebook to test it and to show how it is done, that would be good. Please put it into the data_aug folder. Also, the goal is to eventually create code or a recipe for the full pipeline [scrape -> convert to instructions -> post-process/filter -> convert into training data]. Please ping the Discord data channels if you want to discuss.
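That segmentation can be sketched with one simple rule: each question sentence starts a new instruction->answer pair, and the sentences that follow (until the next question) form the answer. A rough illustration only, not the actual pipeline:

```python
import re

def split_into_qa_pairs(transcript: str):
    """Split a how-to transcript into (instruction, answer) pairs.

    Heuristic: every sentence ending in '?' starts a new pair; the
    sentences that follow (until the next question) become the answer.
    """
    # Split on sentence boundaries, keeping the terminal punctuation.
    sentences = re.findall(r"[^.?!]+[.?!]", transcript)
    pairs, question, answer = [], None, []
    for s in (s.strip() for s in sentences):
        if s.endswith("?"):
            if question and answer:
                pairs.append((question, " ".join(answer)))
            question, answer = s, []
        elif question:
            answer.append(s)
    if question and answer:
        pairs.append((question, " ".join(answer)))
    return pairs

demo = ("How might we make this cake? Well first, we need to get the "
        "ingredients. What do we do next? We need to mix the batter.")
print(split_into_qa_pairs(demo))
```

A real pipeline would need more than this (speaker turns, rhetorical questions, questions the narrator never answers), but it shows the shape of the [instruction->answer][instruction->answer] output.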

huu4ontocord avatar Jan 05 '23 18:01 huu4ontocord

I'd really love to work on this, but I'm quite busy at the moment.

I am really interested in this project.

I also have some other ideas I'd like to add.

I recently scraped the Open Trivia Database through its API, using a Go script from OpenAI, and I now have thousands of trivia question-answer pairs. They are very straightforward: each question has incorrect answers and a single correct answer, and some are just true/false.

I was also thinking that OpenAI could be used to help with better data scraping, since proper data is everything. What better way of training an AI than using another AI?

wakaztahir avatar Jan 05 '23 19:01 wakaztahir

Hi @totuta! I'm just using yt-dlp and multithreading, but if you want I will share the full script :)
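The actual script isn't shown in this thread; a minimal sketch of that kind of yt-dlp + thread-pool subtitle fetcher might look like the following (the option keys are real yt-dlp options, but the output template, language list, and worker count are assumptions):

```python
from concurrent.futures import ThreadPoolExecutor

# yt-dlp options: fetch subtitles only, skip the video/audio itself.
SUB_OPTS = {
    "skip_download": True,       # subtitles only, no media download
    "writesubtitles": True,      # uploader-provided subtitles
    "writeautomaticsub": True,   # fall back to auto-generated captions
    "subtitleslangs": ["en"],
    "subtitlesformat": "vtt",
    "outtmpl": "subs/%(id)s.%(ext)s",  # hypothetical output layout
    "quiet": True,
}

def fetch_subtitles(url: str) -> None:
    # Imported lazily so the rest of the module works without yt-dlp installed.
    from yt_dlp import YoutubeDL
    with YoutubeDL(SUB_OPTS) as ydl:
        ydl.download([url])

def fetch_many(urls, workers: int = 8) -> None:
    # Subtitle downloads are I/O-bound, so threads parallelize well.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        pool.map(fetch_subtitles, urls)
```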

marianna13 avatar Jan 06 '23 17:01 marianna13

Hey @ontocord, what do you mean by "convert dialog to instruction"? Do you mean we can build an additional language model specifically for this task?

marianna13 avatar Jan 06 '23 17:01 marianna13

@marianna13 I'm not sure which ones you've already transcribed, but I was just starting on some of the JRE clips and getting them to text files. I talked to @Rallio67 and know they have been doing Lex Fridman and Stephen Wolfram videos also. Maybe there could be a list somewhere or someone who keeps track of what's already been/being transcribed?

Shtoner avatar Jan 07 '23 02:01 Shtoner

@marianna13 Sure, could you please share the script? Where in this repo do you think is the best place to keep it?

@Shtoner I agree with you. We have to come up with a systematic way of keeping track of already-transcribed data. @yk What do you suggest?

totuta avatar Jan 07 '23 06:01 totuta

@Shtoner I also have downloaded JRE and Lex Fridman as well as other podcasts and shows. I can send you a list of channels I scraped if you want, but it's quite long ;) @totuta sure! Maybe we will add a folder like data scripts

marianna13 avatar Jan 07 '23 07:01 marianna13

Also, I've got a Python script that scrapes a site for NBA player stats from a certain year, which I might be able to modify or share with you guys to include other stats, maybe from different websites. Also just linking this here so these kinds of similar issues will be seen at or around the same time.

Shtoner avatar Jan 07 '23 15:01 Shtoner

@totuta thanks for hanging on :) I'd roughly follow what's outlined here. The description is very extensive. Take what you need and leave the rest :)

yk avatar Jan 07 '23 21:01 yk

@totuta thanks for hanging on :) I'd roughly follow what's outlined here. The description is very extensive. Take what you need and leave the rest :)

Thanks @yk! btw,

@totuta sure! Maybe we will add a folder like data scripts

What is the best practice for me and @marianna13 to have a folder and work there?

  • make a branch from the upstream (LAION-AI/Open-Assistant, not a forked one) and work there together

I would like to set up directories (under openassistant/), work there together, and eventually push to origin/master. What is the recommended way?

totuta avatar Jan 08 '23 03:01 totuta

@marianna13

Hey @ontocord, what do you mean by "convert dialog to instruction"? Do you mean we can build an additional language model specifically for this task?

If it is a bunch of dialog about a particular subject, such as fixing a car, you could use a question-generator and answerer model to generate a QA pair from the background data. You could also detect phrases, using either keywords or an embedder, such as "Now to do XYZ, you need to ABC..." and so forth, and convert that to an instruction either by simple rules or few-shot -> "Human: Please give me the steps for doing XYZ. Assistant: Ok, you do ABC".
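The simple-rules route can be sketched in a few lines; the regex and the Human/Assistant templates below are purely illustrative assumptions, and a real version would need many such patterns (or the few-shot approach instead):

```python
import re

# Hypothetical rule: sentences shaped like "[Now] to <goal>, you need to
# <steps>." become one Human request plus one Assistant answer.
PATTERN = re.compile(
    r"(?:now\s+)?to\s+(?P<goal>[^,]+),\s*you need to\s+(?P<steps>[^.]+)\.",
    re.IGNORECASE,
)

def dialog_to_instructions(text: str):
    pairs = []
    for m in PATTERN.finditer(text):
        pairs.append({
            "prompt": f"Human: Please give me the steps to {m.group('goal').strip()}.",
            "response": f"Assistant: Ok, you {m.group('steps').strip()}.",
        })
    return pairs

print(dialog_to_instructions(
    "Now to change the oil, you need to drain the old oil first."
))
```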

huu4ontocord avatar Jan 10 '23 05:01 huu4ontocord

I would like to set up directories (under openassistant/), work there together, and eventually push to origin/master. What is the recommended way?

I'm not sure where the "not a fork" comes from. I'd recommend creating a fork, working there in a directory under openassistant/datasets/ or so, and then raising a PR to origin/main when you're ready.

yk avatar Jan 10 '23 10:01 yk

Where are we on this project, @Shtoner @totuta @marianna13? Have we found some good candidate YouTube transcripts to convert to dialog?

huu4ontocord avatar Jan 15 '23 05:01 huu4ontocord

The problem with YT subtitles is that they're just plain text with no punctuation. Is there a way to convert them to dialog? @ontocord

marianna13 avatar Jan 15 '23 12:01 marianna13

We should use AI for that. Maybe ChatGPT would be great: "Hey ChatGPT, convert this plain text to an instruction and response that could be fed into another model." But since ChatGPT isn't available, we can use their other OpenAI models (text-davinci).

This way we can convert thousands of text samples, any text samples, to instruction and response pairs, though we cannot verify the quality of the data.

wakaztahir avatar Jan 15 '23 16:01 wakaztahir

I converted some good (human-generated) captions from Lex Fridman to dialogue. Here's what I got:

A: Right, but I would say, bring it on. If you gaze into the abyss long enough, you see the light, not the darkness. 
B: Are you sure about that? 
A: I'm betting my life on it. 
B: The following is a conversation with Jordan Peterson, an influential psychologist, lecturer, podcast host, and author of "Maps of Meaning", 12 Rules For Life" and "Beyond Order." This is the Lex Fridman podcast, to support it please check out our sponsors in the description. And now dear friends, here's Jordan Peterson. Dostoevsky wrote in "The Idiot", spoken through the character of Prince Myshkin that beauty will save the world. Solzhenitsyn actually mentioned this in his Nobel Prize acceptance speech. What do you think does Dostoevsky meant by that? Was he right? 
A: Well, I guess it's the divine that saves the world, let's say, you could say that by definition. And then you might say, "Well, are there pointers to that which will save the world or that which eternally saves the world?" And the answer to that in all likelihood is yes. And that's maybe truth and love and justice and the classical virtues, beauty perhaps in some sense foremost among them. That's a difficult case to make, but definitely a pointer. 
B: Which direction is the arrow pointing? 
A: Well, the arrow's pointing up. And no, I think that that which it points to is, what beauty points to, it transcends beauty. It's more than beauty. 
B: And that speaks to the divine. 

Then I'm going to use this data to fine-tune a model to generate similar stuff for arbitrary captions. What do you think? What model should I use for that?

marianna13 avatar Jan 15 '23 17:01 marianna13

I used text-davinci-003. Here's my prompt (https://beta.openai.com/playground):

I will provide you some text , You must understand it and then convert it into question and answer format in json , everything that you learned , here's the text

"A: Right, but I would say, bring it on. If you gaze into the abyss long enough, you see the light, not the darkness. 
B: Are you sure about that? 
A: I'm betting my life on it. 
B: The following is a conversation with Jordan Peterson, an influential psychologist, lecturer, podcast host, and author of "Maps of Meaning", 12 Rules For Life" and "Beyond Order." This is the Lex Fridman podcast, to support it please check out our sponsors in the description. And now dear friends, here's Jordan Peterson. Dostoevsky wrote in "The Idiot", spoken through the character of Prince Myshkin that beauty will save the world. Solzhenitsyn actually mentioned this in his Nobel Prize acceptance speech. What do you think does Dostoevsky meant by that? Was he right? 
A: Well, I guess it's the divine that saves the world, let's say, you could say that by definition. And then you might say, "Well, are there pointers to that which will save the world or that which eternally saves the world?" And the answer to that in all likelihood is yes. And that's maybe truth and love and justice and the classical virtues, beauty perhaps in some sense foremost among them. That's a difficult case to make, but definitely a pointer. 
B: Which direction is the arrow pointing? 
A: Well, the arrow's pointing up. And no, I think that that which it points to is, what beauty points to, it transcends beauty. It's more than beauty. 
B: And that speaks to the divine. "

Here's the answer:

Question: "What does Dostoevsky mean when he said that beauty will save the world?"
Answer: "Dostoevsky was suggesting that the divine will save the world, and that beauty is a pointer to that which will save the world - such as truth, love, justice, and the classical virtues." 

JSON: 
{
    "question": "What does Dostoevsky mean when he said that beauty will save the world?",
    "answer": "Dostoevsky was suggesting that the divine will save the world, and that beauty is a pointer to that which will save the world - such as truth, love, justice, and the classical virtues."
}

The data provided is not really technical; let's see what happens when it's given a Wikipedia paragraph.

Prompt: I will provide you some text , You must understand it and then convert it into question and answer format in json , everything that you learned , here's the text

"The front end analyzes the source code to build an internal representation of the program, called the intermediate representation (IR). It also manages the symbol table, a data structure mapping each symbol in the source code to associated information such as location, type and scope.

While the frontend can be a single monolithic function or program, as in a scannerless parser, it was traditionally implemented and analyzed as several phases, which may execute sequentially or concurrently. This method is favored due to its modularity and separation of concerns. Most commonly today, the frontend is broken into three phases: lexical analysis (also known as lexing or scanning), syntax analysis (also known as scanning or parsing), and semantic analysis. Lexing and parsing comprise the syntactic analysis (word syntax and phrase syntax, respectively), and in simple cases, these modules (the lexer and parser) can be automatically generated from a grammar for the language, though in more complex cases these require manual modification. The lexical grammar and phrase grammar are usually context-free grammars, which simplifies analysis significantly, with context-sensitivity handled at the semantic analysis phase. The semantic analysis phase is generally more complex and written by hand, but can be partially or fully automated using attribute grammars. These phases themselves can be further broken down: lexing as scanning and evaluating, and parsing as building a concrete syntax tree (CST, parse tree) and then transforming it into an abstract syntax tree (AST, syntax tree). In some cases additional phases are used, notably line reconstruction and preprocessing, but these are rare."

Answer given by text-davinci:

{
    "qnaPairs": [
        {
            "question": "What is the intermediate representation (IR) of a program?",
            "answer": "The intermediate representation (IR) of a program is an internal representation of the program built by the front end from the source code."
        },
        {
            "question": "What is a symbol table?",
            "answer": "A symbol table is a data structure mapping each symbol in the source code to associated information such as location, type and scope."
        },
        {
            "question": "What is the traditional method for implementing a frontend?",
            "answer": "The traditional method for implementing a frontend is to break it into several phases which may execute sequentially or concurrently. This method is favored due to its modularity and separation of concerns."
        },
        {
            "question": "What are the three phases of the frontend most commonly used today?",
            "answer": "The three phases of the frontend most commonly used today are lexical analysis (also known as lexing or scanning), syntax analysis (also known as scanning or parsing), and semantic analysis."
        },
        {
            "question": "What are the two components of syntactic analysis?",
            "answer": "The two components of syntactic analysis are lexing and parsing."
        },
        {
            "question": "What are the two types of grammars usually used for syntactic analysis?",
            "answer": "The two types of grammars usually used for syntactic analysis are context-free grammars and attribute grammars."
        },
        {
            "question": "What are the two steps of parsing?",
            "answer": "The two steps of parsing are building a concrete syntax tree (CST, parse tree) and then transforming it into an abstract syntax tree (AST, syntax tree)."
        },
        {
            "question": "Are there any additional phases used in the frontend process?",
            "answer": "In some cases additional phases are used in the frontend process, notably line reconstruction and preprocessing, but these are rare."
        }
    ]
}
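Before feeding model output like this into training data, it would help to validate and flatten it, since the model sometimes returns a single pair and sometimes a qnaPairs list. A small sketch (the qnaPairs key matches the response above; everything else is an assumption):

```python
import json

def extract_qa_pairs(raw: str):
    """Parse a model response and return a clean list of QA dicts.

    Accepts either {"qnaPairs": [...]} or a single {"question", "answer"}
    object; drops entries with missing or empty fields.
    """
    data = json.loads(raw)
    entries = data.get("qnaPairs", [data]) if isinstance(data, dict) else data
    pairs = []
    for e in entries:
        q = e.get("question", "").strip()
        a = e.get("answer", "").strip()
        if q and a:
            pairs.append({"question": q, "answer": a})
    return pairs

raw = ('{"qnaPairs": [{"question": "What is an IR?", '
       '"answer": "An internal representation."}, '
       '{"question": "", "answer": "dropped"}]}')
print(extract_qa_pairs(raw))
```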

wakaztahir avatar Jan 15 '23 18:01 wakaztahir

Oh, I think it's actually cool. Do you think we can use some open-source models instead of davinci (e.g. BLOOM)?

marianna13 avatar Jan 15 '23 18:01 marianna13

Never tested BLOOM, gonna give it a try. We can't make a million requests to OpenAI.

wakaztahir avatar Jan 15 '23 18:01 wakaztahir

Yeah, I don't think openAI will be suitable for us at scale :)

marianna13 avatar Jan 15 '23 18:01 marianna13

Just checked bloom , looks nice , should be able to do the job

wakaztahir avatar Jan 15 '23 18:01 wakaztahir

what prompt did you use?

marianna13 avatar Jan 15 '23 18:01 marianna13

I mean, I didn't test it. I just looked it up and searched around; I'm making an assumption here, since it's also a really large language model.

wakaztahir avatar Jan 15 '23 18:01 wakaztahir

Oh I see! I tried your prompt with bloom and it didn't work ...

marianna13 avatar Jan 15 '23 18:01 marianna13

There is a data augmentation script that takes TSV format, which someone built in the "scripts" section of Open-Assistant (https://github.com/LAION-AI/Open-Assistant/blob/main/scripts/data_augment/data_augment.py), and another one that is also supposed to create QA format in the "notebooks" section (https://github.com/LAION-AI/Open-Assistant/blob/main/notebooks/closed-book-qa/T5_closed_book_QA_generators.py). I haven't run either yet, but I know the second one requires some pretty good hardware specs. To remove timestamps from a VTT, https://web.microsoftstream.com/VTTCleaner/CleanVTT.html works well. I'm currently seeing whether clean-VTT-to-TSV is plausible.

I still believe we need to pick a place to list which data is being collected. I have opened the PR to store datasets in the repo directly, but it is not yet approved, and everyone will have to wait until their own PR has been merged to update what is being done in data collection. This could slow people down, not only because of waiting for merges, but also because people will be expecting fully scrubbed, ready-to-use data, while the people collecting and formatting it just need to know what else needs to be scraped and made ready. Maybe a specific Discord channel, data-collection, solely for listing which data has already been collected and is being formatted, and we can use the current data-discussion for new data ideas and discussion?
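The timestamp-stripping step is simple enough to do locally; here's a small sketch (a rough approximation of what a VTT cleaner does, not the VTTCleaner page's actual code, and the cue-number heuristic would drop any caption line that is purely a number):

```python
import re

# Matches the start of a WebVTT cue timing line, e.g.
# "00:00:01.000 --> 00:00:03.000" (optionally followed by cue settings).
TIMESTAMP = re.compile(r"^\d{2}:\d{2}:\d{2}[.,]\d{3}\s+-->")

def clean_vtt(vtt: str) -> str:
    """Strip the WEBVTT header, cue numbers, and timing lines,
    keeping only the spoken text joined into one string."""
    kept = []
    for line in vtt.splitlines():
        line = line.strip()
        if (not line or line == "WEBVTT" or line.isdigit()
                or TIMESTAMP.match(line)):
            continue
        kept.append(line)
    return " ".join(kept)

sample = """WEBVTT

1
00:00:01.000 --> 00:00:03.000
Hello and welcome

2
00:00:03.000 --> 00:00:05.000
to the show"""
print(clean_vtt(sample))  # Hello and welcome to the show
```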

Shtoner avatar Jan 16 '23 00:01 Shtoner