tuja-vortaro

Add Wiktionary as a source

Open pizzamaker opened this issue 9 years ago • 5 comments

Wiktionary can be a powerful tool here. How are you making the sources machine-readable?

pizzamaker, Jan 12 '16 21:01

Manual DTD and XML parsing; see, for example, https://github.com/sstangl/tuja-vortaro/blob/master/revo/convert-to-js.py.
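For readers unfamiliar with this kind of conversion, here is a minimal sketch of the general idea. It is not the contents of convert-to-js.py. The sample article is a hypothetical, heavily simplified ReVo-style fragment, and the function names (`headword`, `extract_entries`, `to_js`) and the output variable name `revo` are assumptions made up for illustration. Real ReVo files are richer and depend on a DTD for their entities, which this sketch does not handle.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical, simplified ReVo-style article (assumption, for illustration only).
SAMPLE = """<vortaro>
  <art>
    <kap><rad>hund</rad>/o</kap>
    <drv>
      <kap><tld/>o</kap>
      <trd lng="en">dog</trd>
      <trd lng="de">Hund</trd>
    </drv>
  </art>
</vortaro>"""


def headword(kap, root):
    """Rebuild a headword, replacing each <tld/> placeholder with the root."""
    parts = [kap.text or ""]
    for child in kap:
        if child.tag == "tld":
            parts.append(root)
        # Text following a child element belongs to the headword as well.
        parts.append(child.tail or "")
    return "".join(parts).strip()


def extract_entries(xml_text, lang="en"):
    """Return (Esperanto headword, translation) pairs for one language."""
    entries = []
    for art in ET.fromstring(xml_text).iter("art"):
        root = art.findtext("kap/rad", default="")
        for drv in art.iter("drv"):
            kap = drv.find("kap")
            if kap is None:
                continue
            word = headword(kap, root)
            for trd in drv.iter("trd"):
                if trd.get("lng") == lang and trd.text:
                    entries.append((word, trd.text.strip()))
    return entries


def to_js(entries, name="revo"):
    """Serialize the entries as a JavaScript array literal."""
    return "var %s = %s;" % (name, json.dumps(entries, ensure_ascii=False))


if __name__ == "__main__":
    print(to_js(extract_entries(SAMPLE)))
```

Passing `ensure_ascii=False` keeps letters such as ŭ readable in the generated file instead of escaping them.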

Wiktionary would be a good source in theory. It is much easier to edit than ReVo, and its data licensing would permit this program to be AGPLv3+, which I would like.

On the other hand, the data quality is much lower than that of either ReVo or ESPDIC. The entries that exist are extremely ill-specified, and basic words like taŭga are not found at all.

So I don't think switching the data source is a good idea until there is a team actively working on improving dictionary quality. Currently that momentum exists with ReVo, even though it is slight. I would very much like to see a libre version of PIV.

sstangl, Jan 24 '16 19:01

Isn't it possible to add more than one source and eliminate duplicates?

pizzamaker, Jan 24 '16 20:01

That would be more work, but it would be possible. I'm not opposed. It would certainly have the benefit that editing Wiktionary would become the quickest way to improve the quality of this dictionary, especially its translations.
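A rough sketch of what merging several sources and dropping duplicates could look like, assuming each source has already been reduced to (Esperanto word, translation) pairs. The function names, the sample data, and the normalization rule (case-folding and collapsing whitespace) are all assumptions for illustration; nothing like this exists in the project as far as this thread shows.

```python
def normalize(text):
    """Case-fold and collapse whitespace so trivial variants compare equal."""
    return " ".join(text.casefold().split())


def merge_sources(sources):
    """Merge entries from several sources, keeping one copy of each pair.

    `sources` is a list of (source_name, entries) tuples in priority order,
    where `entries` is a list of (eo_word, translation) pairs. The first
    source to provide a pair decides its spelling; later sources are only
    recorded as additional attestations.
    """
    merged = {}
    for name, entries in sources:
        for eo, translation in entries:
            key = (normalize(eo), normalize(translation))
            if key not in merged:
                merged[key] = {"eo": eo, "translation": translation, "sources": [name]}
            elif name not in merged[key]["sources"]:
                merged[key]["sources"].append(name)
    return list(merged.values())


if __name__ == "__main__":
    # Made-up sample data, not taken from the real dictionaries.
    revo = [("hundo", "dog"), ("kato", "cat")]
    espdic = [("hundo", "dog"), ("hundo", "hound"), ("Kato", "Cat")]
    for entry in merge_sources([("ReVo", revo), ("ESPDIC", espdic)]):
        print(entry)
```

Recording which sources attest each pair also gives a crude reliability signal: a translation found in two dictionaries is less likely to be a mistake than one found only in the lowest-quality source.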

sstangl, Jan 24 '16 20:01

In my view, the more sources, the better, since cross-checking them makes the results more reliable. I'm also going to try to check with Yves Nevelsteen regarding the licensing of Komputeko. That would be a major, excellent addition.

pizzamaker, Jan 24 '16 20:01

I just looked into adding Wiktionary as a source. The quality is so low that it is difficult to tell what even is an Esperanto word -- there are entries for words in various languages. The pages themselves also do not have a consistent structure. It would be very difficult to make something useful out of this.

sstangl, Jan 24 '16 20:01