websummit19-transcripts
Transcripts from WebSummit 2019 - Extracted from Otter.ai
WebSummit 2019 Transcripts
At WebSummit 2019, 46 talks (mainly from the Central Stage) were automatically transcribed in real time using the Otter.ai platform. This repo provides the transcribed text, as well as the code to re-download and preprocess it.
With this dataset, you can run statistical analyses on the transcript text, or even train a neural network model to produce your very own WebSummit speech.
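As a starting point for that kind of analysis, here is a minimal sketch of a word-frequency count over a single transcript. The function names wordFrequencies and topWords are hypothetical and not part of this repo, and the tokenizer assumes English text made of ASCII letters and digits.

```typescript
// Hypothetical helpers for illustration; they are not part of this repo.

// Count how often each word occurs in a transcript (case-insensitive).
function wordFrequencies(text: string): Map<string, number> {
  const counts = new Map<string, number>();
  // Assumption: words are runs of ASCII letters/digits, optionally joined
  // by an inner apostrophe (e.g. "don't").
  const words = text.toLowerCase().match(/[a-z0-9]+(?:'[a-z0-9]+)*/g) || [];
  for (const word of words) {
    counts.set(word, (counts.get(word) || 0) + 1);
  }
  return counts;
}

// Return the n most frequent words, most frequent first.
// Ties are broken alphabetically so the result is deterministic.
function topWords(text: string, n: number): Array<[string, number]> {
  return Array.from(wordFrequencies(text).entries())
    .sort((a, b) => b[1] - a[1] || a[0].localeCompare(b[0]))
    .slice(0, n);
}
```

Calling topWords(text, 20) on the contents of one of the .txt files would give a rough picture of that talk's vocabulary. Filtering out common stop words first is usually worthwhile, but is left out here to keep the sketch short.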
How to use it
Each speech is an individual .txt file inside the plain-texts folder (e.g. 224STLFR2BIGPLOD.txt). All the speeches are named after their id on the otter.ai platform. If you have a different naming scheme to propose, I'm all ears! :D
Apart from that, the plain-texts folder also contains a data.json file that includes all the raw data from otter.ai.
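To work with this layout from code, something like the following sketch could load the files. The helper names loadTranscripts and loadRawData are hypothetical, and since the structure of data.json is not described here, the sketch returns it untyped instead of guessing at its fields.

```typescript
import * as fs from "fs";
import * as path from "path";

// Hypothetical helpers for illustration; they are not part of this repo.

// Read every .txt file in `dir` and key its text by the otter.ai id,
// i.e. the file name without the .txt extension.
function loadTranscripts(dir: string): Map<string, string> {
  const transcripts = new Map<string, string>();
  for (const file of fs.readdirSync(dir)) {
    if (path.extname(file) !== ".txt") continue; // skips data.json
    const id = path.basename(file, ".txt");
    transcripts.set(id, fs.readFileSync(path.join(dir, file), "utf8"));
  }
  return transcripts;
}

// Parse data.json from the same folder. Its schema is an unknown here,
// so the caller has to inspect the value before relying on any field.
function loadRawData(dir: string): unknown {
  return JSON.parse(fs.readFileSync(path.join(dir, "data.json"), "utf8"));
}
```

For example, loadTranscripts("plain-texts").get("224STLFR2BIGPLOD") would return the text of that speech when run from the repo root.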
How to generate the data again:
Prepare the code:
npm install
npm run compile
Run the scripts:
npm run datagen (This downloads the transcripts from otter.ai and produces the data.json file.)
npm run preprocess (This reads the data.json file and creates the various .txt files.)
Contributions
Please fork, copy, share and contribute!