TED-Multilingual-Parallel-Corpus
Why not an English corpus?
Dear Sir, why don't you prepare an English-to-foreign-language corpus? I think this is the most commonly needed corpus for developers.
Regards
I also missed the English corpus. However, it is easy to download any talk transcript in JSON format: wget --no-clobber --no-check-certificate 'https://www.ted.com/talks/2695/transcript.json?language=en'
I agree with @ttjslbz. @405cddd83a828cec, how can I know whether the transcript is aligned with the other transcripts?
I am reopening the project and going to update the corpus soon.
How did you align the current corpus? Most alignment tools are based on dictionaries or translations:
- https://github.com/danielvarga/hunalign
- https://github.com/rsennrich/bleualign
or are complex, like https://github.com/anoidgit/yasa. More promising is a multilingual embedding, but this seems to be very hardware intensive: https://github.com/facebookresearch/LASER/tree/master/tasks/bucc
Although the timestamps do not match 100%, you can use them to align the texts.
I did that for the English-Hungarian pair to reconstruct the aligned sentences, and it works pretty well for any language. There is no need for dictionaries or other tools.
Here is an example: {"time":676000,"text":"A tuberkulózis előfordulási aránya Pine Ridge-ben"}, {"time":676814,"text":"The tuberculosis rate on Pine Ridge"}
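The timestamp idea above could be sketched roughly like this in Python. This is only a minimal illustration, not the code used for the corpus or for the index: it assumes each transcript has already been flattened into a list of cues shaped like the example ({"time": ..., "text": ...}), and the function name `align_by_time` and the `tolerance_ms` threshold are made up for the sketch. It pairs cues, not full sentences, so reconstructing sentences would still need an extra merging step.

```python
from bisect import bisect_left

def align_by_time(src_cues, tgt_cues, tolerance_ms=2000):
    """Pair each source cue with the target cue nearest in time.

    Hypothetical helper: cues are dicts with "time" (ms) and "text" keys,
    as in the example above. Pairs further apart than tolerance_ms are dropped.
    """
    tgt_sorted = sorted(tgt_cues, key=lambda c: c["time"])
    tgt_times = [c["time"] for c in tgt_sorted]
    pairs = []
    for cue in sorted(src_cues, key=lambda c: c["time"]):
        i = bisect_left(tgt_times, cue["time"])
        # Only the neighbours just before and just after can be the nearest.
        candidates = [j for j in (i - 1, i) if 0 <= j < len(tgt_sorted)]
        if not candidates:
            continue
        best = min(candidates, key=lambda j: abs(tgt_times[j] - cue["time"]))
        if abs(tgt_times[best] - cue["time"]) <= tolerance_ms:
            pairs.append((cue["text"], tgt_sorted[best]["text"]))
    return pairs

hu = [{"time": 676000, "text": "A tuberkulózis előfordulási aránya Pine Ridge-ben"}]
en = [{"time": 676814, "text": "The tuberculosis rate on Pine Ridge"}]
pairs = align_by_time(en, hu)
```

With the two cues from the example, the 814 ms gap is inside the assumed 2000 ms tolerance, so they end up as one pair. Note that several source cues can map to the same target cue when one language splits a line differently; whether to merge those is a separate decision.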
And here are some of the search results for 'tuberculosis' from my index:
TEXTS:
- But let's stick first to TUBERCULOSIS. -> De maradjunk a tbc-nél. (bart weetjens how i taught rats to sniff out land mines.txt)
- Let's consider the big three: HIV, malaria, TUBERCULOSIS. -> Nézzük csak meg a nagy hármast: HIV, malária, tuberkulózis. (mark kendall demo a needle free vaccine patch that s safer and way cheaper.txt)
- This is more than HIV/AIDS, malaria and TUBERCULOSIS combined. -> Ez több, mint a HIV/AIDS, malária és tuberkolózis együtt. (josette sheeran ending hunger now.txt)
- She herself was suffering from HIV; she was suffering from TUBERCULOSIS. -> A lány szenvedett a HIV-től, szenvedett a tuberkulózistól. (gordon brown.txt)
- I began documenting the close connection between HIV/AIDS and TUBERCULOSIS. -> Elkezdtem dokumentálni a szoros kapcsolatot HIV/AIDS és tüdőbaj fertőzés között. (james nachtwey s searing pictures of war.txt)
- It would give us an unfair advantage against battling HIV/AIDS, TUBERCULOSIS and other epidemics. -> Ez hallatlanul nagy előnyhöz juttatna minket a HIV/AIDS, a tuberkulózis és más járványok elleni harcban. (andreas raptopoulos no roads there s a drone for that.txt)
- So it was the spread of TUBERCULOSIS and the spread of cholera that I was responsible for inhibiting. -> Így a tuberkulózis és a kolera terjedésének megállításáért voltam felelős. (gary slutkin let s treat violence like a contagious disease.txt)
- The TUBERCULOSIS rate on Pine Ridge is approximately eight times higher than the US national average. -> A tuberkulózis előfordulási aránya Pine Ridge-ben nagyjából nyolcszor magasabb, mint az amerikai nemzeti átlag. (aaron huey.txt)
- He was haunted by the loss of his mother and his wife, who both died of TUBERCULOSIS at the age of 24. -> Az édesanyja és a felesége halála kísértette, akik mindketten tuberkulózisban haltak meg, 24 évesen. (scott peeples why should you read edgar allan poe.txt)
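A listing like the one above could be produced from aligned pairs with a few lines of Python. This is a guess at the general shape, not the actual index: the `(source, target, filename)` tuple layout and the name `search_pairs` are assumptions, and the lowercase source sentences are presumed originals of the highlighted lines shown above.

```python
import re

def search_pairs(pairs, term):
    """Return formatted hits where the source text contains term.

    Hypothetical helper: pairs is a list of (source, target, filename)
    tuples. The match is case-insensitive and is upper-cased in the output.
    """
    pattern = re.compile(re.escape(term), re.IGNORECASE)
    hits = []
    for src, tgt, talk in pairs:
        if pattern.search(src):
            highlighted = pattern.sub(lambda m: m.group(0).upper(), src)
            hits.append(f"- {highlighted} -> {tgt} ({talk})")
    return hits

aligned = [
    ("But let's stick first to tuberculosis.",
     "De maradjunk a tbc-nél.",
     "bart weetjens how i taught rats to sniff out land mines.txt"),
    ("So it was the spread of tuberculosis and the spread of cholera that I was responsible for inhibiting.",
     "Így a tuberkulózis és a kolera terjedésének megállításáért voltam felelős.",
     "gary slutkin let s treat violence like a contagious disease.txt"),
]
hits = search_pairs(aligned, "tuberculosis")
```

A linear scan like this is fine for a few thousand talks; a real index would probably tokenize and store the pairs in something searchable instead.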
I agree with @ttjslbz. @405cddd83a828cec, how can I know whether the transcript is aligned with the other transcripts?
The transcript is meant to be aligned to the speaker's voice. My experience is that the English-Hungarian pair is pretty much aligned. I suppose the other ones are too; it is easy to verify.
I am reopening the project and going to update the corpus soon.
Is there any update to this that I am failing to find?