COVID-QA
Data collection for different languages
Find official data sources for FAQ about COVID-19 in different languages and scrape them.
I already have a script for the RKI FAQ. Will share it later!
If someone needs a starting point, I already wrote scrapers for WHO and some pages of CDC: https://github.com/deepset-ai/COVID-QA/tree/master/data/scrapers
I will do some scraping for Romanian
I'll add Italian
I will look into some more German pages.
@tkh42 let me know which ones, so we are not doing double work. This one would probably make sense: https://www.infektionsschutz.de/coronavirus/faqs-coronaviruscovid-19.html
@HenrykBorzymowski OK. Yes, I have thought about doing that one too. I think I will start with https://www.bmas.de/DE/Presse/Meldungen/2020/corona-virus-arbeitsrechtliche-auswirkungen.html
Perfect, people, this is taking off rather quickly :D I can invite you to our Slack crawler group if you tell me your wirvsvirus Slack names.
I would also suggest that you create small issues stating which website you want to work on, so we do not duplicate work or write the same crawler twice. State the website in the title so GitHub can find related issues easily! Thanks
Here is a Google spreadsheet in which we can track which pages we already have a scraper for, etc. Please fill it in and change it if necessary: https://docs.google.com/spreadsheets/d/1er-7sDvgMZ484FRhPL7X6rl1fgRIRtA7fJfj-gLp3jg/edit?usp=sharing
@tkh42 Can I somehow help or motivate you to create scrapers for German sites? :D
We already started the labeling process and need more questions!
@Timoeller I am finished with the BMAS one. I will create the pull request and continue with the next one. :)
One way to "easily" get multilingual data is to machine-translate.
pip install googletrans
(and then use Translator(service_urls=["translate.google.com/gen204"]))
These are older Google Translate versions with worse quality than prod, but they are free. The lower-quality translations would only be used in the background, though, and not shown to the user.
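A rough sketch of how the background pre-translation could look. The FAQ record layout (`question`/`answer` keys) and the helper names `make_google_translate` and `pretranslate_faq` are made up for illustration, not taken from the repo. It also assumes a googletrans release with a synchronous `Translator.translate` (newer releases may expose an async API instead), and the googletrans part needs network access, so the translate function is passed in and can be swapped for any other backend.

```python
from typing import Callable, Dict, List


def make_google_translate(dest: str) -> Callable[[str], str]:
    """Build a translate function backed by googletrans (needs network access).

    Assumption: a googletrans release with a synchronous translate() method.
    """
    # Imported lazily so the rest of the sketch works without the package.
    from googletrans import Translator

    translator = Translator(service_urls=["translate.google.com/gen204"])
    return lambda text: translator.translate(text, dest=dest).text


def pretranslate_faq(
    faq: List[Dict[str, str]],
    translate: Callable[[str], str],
    lang: str,
) -> List[Dict[str, str]]:
    """Translate only the questions; answers stay in their original language."""
    return [
        {
            "question": translate(item["question"]),  # used for matching only
            "lang": lang,
            "source_question": item["question"],
            "answer": item["answer"],  # original answer, shown to the user
        }
        for item in faq
    ]
```

Usage would be something like `pretranslate_faq(faq, make_google_translate("es"), "es")`, run once offline per target language and stored next to the scraped originals.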
A workflow like this could then work for the user: Type query in Spanish -> QA system detects Spanish query -> QA system matches with Spanish original and/or from-English-translated questions/answers -> QA system shows answers in original language with option to web-translate with Google
This would be easier than real-time translation and/or getting sufficient data in many languages.
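To make the workflow above a bit more concrete, here is a minimal sketch of the routing step. Everything in it is a placeholder: the `FaqEntry` layout, the stopword-based `detect_language` toy detector, and the `difflib` string similarity, which just stands in for whatever matching the real QA system uses.

```python
import difflib
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class FaqEntry:
    question: str     # question in `lang` (original or machine-translated)
    lang: str         # language code of `question`
    answer: str       # answer in its original language
    answer_lang: str  # language code of `answer`


# Toy stopword lists; a real system would use a proper language detector.
STOPWORDS = {
    "es": {"qué", "cómo", "el", "la", "los", "es", "un", "una"},
    "en": {"what", "how", "the", "is", "a", "an", "are"},
}


def detect_language(query: str) -> str:
    """Pick the language whose stopwords overlap most with the query."""
    tokens = set(query.lower().replace("¿", " ").replace("?", " ").split())
    return max(STOPWORDS, key=lambda lang: len(tokens & STOPWORDS[lang]))


def answer_query(query: str, entries: List[FaqEntry]) -> Optional[FaqEntry]:
    """Match the query against questions in its own language only."""
    lang = detect_language(query)
    candidates = [e for e in entries if e.lang == lang]
    if not candidates:
        return None

    def similarity(entry: FaqEntry) -> float:
        matcher = difflib.SequenceMatcher(
            None, query.lower(), entry.question.lower()
        )
        return matcher.ratio()

    return max(candidates, key=similarity)
```

The returned entry carries the answer in its original language (`answer_lang`), so the frontend can show it as is and offer the web-translate option when `answer_lang` differs from the detected query language.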
Multilingual resources can also easily be found by using Linguee and checking the sources of the sentences found for a language pair, e.g. for DE: https://www.linguee.com/english-german/search?source=auto&query=coronavirus