json-wikipedia
how to convert dumps other than EN or IT?
Can you add documentation to the README on how to extend this solution to other languages? FR and ES did not work. For example:
$ ./scripts/convert-xml-dump-to-json.sh fr /u01/wikip/dumps.wikipedia/frwiki/frwiki-latest-pages-articles.xml.bz2 ./frwiki-latest-pages-articles.json
Converting mediawiki xml dump to json dump (./frwiki-latest-pages-articles.json)
2021-12-15 00:25:50,990 1086 [main] ERROR it.cnr.isti.hpc.wikipedia.cli.MediawikiToJsonCLI - Parsing the mediawiki
java.lang.IllegalArgumentException: No enum constant it.cnr.isti.hpc.wikipedia.article.Language.FR
at java.base/java.lang.Enum.valueOf(Enum.java:240) ~[na:na]
at it.cnr.isti.hpc.wikipedia.article.Language.valueOf(Language.java:8) ~[json-wikipedia-2.0.0-SNAPSHOT.jar:na]
at it.cnr.isti.hpc.wikipedia.parser.ArticleParser.
@fabriziorizzo thanks for reporting the issue.
There are two issues actually:
Spanish (ES) is supported, but some time ago I introduced a regression (#32) that I only noticed thanks to your comment. Could you please check out this PR branch https://github.com/diegoceccarelli/json-wikipedia/tree/language, compile it, and check whether it fixes the problem?
French is not supported, and you are right, I should add documentation on how to add a new language! I'll do that. To support a new language you have to:
- Provide the mapping of the xml-wikipedia dump in that particular language (e.g., what keyword is used to indicate a disambiguation page in French? What keyword indicates a redirect? etc.). You provide the mapping by writing a property file called locale-fr.properties and putting it in the lang folder, like for example: https://github.com/diegoceccarelli/json-wikipedia/blob/language/src/main/resources/lang/locale-es.properties
- Once you have added the property file to the folder, open src/main/avro/article.avsc and add FR to the list of languages, as I did for ES in #61.
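To illustrate why the converter fails for fr: the stack trace shows the language code being resolved through Language.valueOf, which throws IllegalArgumentException when the enum has no matching constant. Below is a minimal Java sketch of that behaviour. It uses a hypothetical stand-in enum (the real Language type is presumably generated from src/main/avro/article.avsc), and the names LanguageLookupSketch and parse are invented here for illustration, not part of json-wikipedia.

```java
import java.util.Locale;
import java.util.Optional;

public class LanguageLookupSketch {

    // Hypothetical stand-in for the project's Language enum.
    // Assumption: FR is absent, as the error message in the report suggests.
    enum Language { EN, IT, ES }

    // Resolve a language code to an enum constant.
    // Enum.valueOf throws IllegalArgumentException for an unknown name,
    // which is the exception seen in the stack trace; here it is caught
    // and turned into an empty Optional.
    static Optional<Language> parse(String code) {
        try {
            return Optional.of(Language.valueOf(code.toUpperCase(Locale.ROOT)));
        } catch (IllegalArgumentException e) {
            return Optional.empty();
        }
    }

    public static void main(String[] args) {
        System.out.println(parse("es").isPresent()); // prints "true"
        System.out.println(parse("fr").isPresent()); // prints "false"
    }
}
```

Adding FR to the stand-in enum makes the second lookup succeed as well, which mirrors what adding FR to the language list in article.avsc is meant to achieve once the project is recompiled.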
Please let me know if it works and, if you write the French mapping, it would be great if you could contribute it. Cheers