zamia-speech

Est_republicaine Corpus not found

Open BaderEddineB opened this issue 3 years ago • 6 comments

Hello, I'm trying to download the est_republicain corpus to train the French language model with KenLM. When I click on the link, I get this error: "nginx error! The page you are looking for is not found". Any idea where I can get this corpus? Thanks.

BaderEddineB, Aug 25 '20 12:08

This seems to be a problem on the https://cnrtl.fr/ side. I just mailed them a bug report.

svenha, Aug 26 '20 08:08

OK, thanks. I just found another download link: https://repository.ortolang.fr/api/content/export?&path=/est_republicain/4/&filename=est_republicain&scope=YW5vbnltb3Vz3 Is this the same corpus as the one from cnrtl.fr?

BaderEddineB, Aug 26 '20 08:08

[Two screenshots attached: est_repeb2, est_repeb]

BaderEddineB, Aug 27 '20 07:08

Someone from cnrtl.fr answered my question. The new official website for this corpus is https://www.ortolang.fr/market/corpora/est_republicain and the latest version is version 4, from 2020-07-22.

svenha, Aug 28 '20 09:08

Thank you very much. It looks similar to the one I found (see the screenshots above). But when I run

    xmllint --xpath '//*[local-name()="div"][@type="article"]//*[local-name()="p" or local-name()="head"]/text()' Year*/*.xml | perl -pe 's/^ +//g; s/^(.+)/$1\n/g; chomp' > est_republicain.txt

to extract the titles and paragraphs into the text file "est_republicain.txt", the extraction does not go well.

Here is an example of the resulting "est_republicain.txt" file: [screenshot: Capturekk]

Is this normal? What is the problem?

BaderEddineB, Aug 28 '20 14:08
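
For comparison, the extraction step can also be sketched in Python with the standard library. This is only an illustration, not the project's method: the sample document, its namespace and the helper names `local_name` and `extract_article_text` are assumptions made for the example. It mimics the XPath above by matching on local names, but it collects all text below each `p` or `head` element, not just the direct text nodes. ElementTree resolves the predefined XML entities and numeric character references while parsing, so the extracted strings contain plain characters.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip a leading '{namespace}' from an ElementTree tag name."""
    return tag.rsplit("}", 1)[-1]


def extract_article_text(xml_source):
    """Yield the text of <p> and <head> elements found inside
    <div type="article"> elements, whatever their namespace."""
    root = ET.fromstring(xml_source)
    for div in root.iter():
        if local_name(div.tag) != "div" or div.get("type") != "article":
            continue
        for elem in div.iter():
            if local_name(elem.tag) in ("p", "head"):
                # itertext() also includes text of nested child elements
                text = "".join(elem.itertext()).strip()
                if text:
                    yield text


# Hypothetical sample; the real corpus files may be structured differently.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0"><text><body>
<div type="article"><head>Titre &amp; sous-titre</head><p>  Premier paragraphe.</p></div>
<div type="other"><p>Ignored</p></div>
</body></text></TEI>"""

lines = list(extract_article_text(SAMPLE))
for line in lines:
    print(line)
```

Writing one yielded string per line to a file would give a result comparable to "est_republicain.txt".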

The file format might have changed. The idea is to extract the text only, and what you get is nearly what we need. You still need to replace all SGML entities with the characters they stand for.

See https://serverfault.com/questions/440805/how-can-i-easily-convert-html-special-entities-from-a-standard-input-stream-in-l

pguyot, Jun 12 '22 07:06
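
One way to do the entity replacement suggested above is a small Python filter built on the standard library's `html.unescape`. This is a minimal sketch: the function names `decode_entities` and `decode_stream` and the sample input are made up for illustration, and it assumes the leftover entities are ones that `html.unescape` recognizes (named HTML entities and numeric character references).

```python
import html
import io


def decode_entities(text):
    """Replace entity references such as &amp; or &#8364; with their characters."""
    return html.unescape(text)


def decode_stream(src, dst):
    """Copy src to dst line by line, decoding entities on the way."""
    for line in src:
        dst.write(decode_entities(line))


# Hypothetical sample input, not taken from the corpus.
src = io.StringIO("L&apos;Est R&eacute;publicain &amp; Cie\nprix : 10 &#8364;\n")
dst = io.StringIO()
decode_stream(src, dst)
print(dst.getvalue(), end="")
```

To use it as a filter at the end of the shell pipeline, one could call `decode_stream(sys.stdin, sys.stdout)` from a small script, after importing `sys`.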