ety-python

dataset missing words

Open blueforesticarus opened this issue 6 months ago • 2 comments

The data set seems to be missing some very common words, such as "metal", plus a lot of simple words like "it" and "the".
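
A quick way to measure gaps like this is to compare a list of everyday words against whatever the dataset contains. The sketch below is only an illustration: `missing_words` is a hypothetical helper, and `known` stands in for the dataset's word list, since how ety-python exposes that list is not shown in this thread.

```python
from typing import Iterable, List, Set


def missing_words(candidates: Iterable[str], known: Set[str]) -> List[str]:
    """Return the candidate words that are absent from `known`.

    The comparison is case-insensitive, and the order of `candidates`
    is preserved in the result.
    """
    known_lower = {w.lower() for w in known}
    return [w for w in candidates if w.lower() not in known_lower]


# Toy stand-in for the dataset's word list (an assumption, not real data).
known = {"potato", "aerodynamic"}
print(missing_words(["metal", "it", "the", "potato"], known))
```

Running the same check against a frequency-ranked word list would give a rough idea of how many common words are affected.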

blueforesticarus avatar Jun 01 '25 20:06 blueforesticarus

> The data set seems to be missing some very common words, such as "metal", plus a lot of simple words like "it" and "the".

Interesting 🤔 thanks for letting me know @blueforesticarus

Unfortunately, not much can be done while we're using the current dataset. A while back I thought about coming back to this project and building a whole new, continuously updatable dataset for it from the ground up (maybe directly from Wiktionary data), but it's hard for me to justify the time spent at the moment!

Is this something you (or anyone else who happens to be reading this!) would like to do? I would happily feed it back into ety-python if someone were to make such a thing!

jmsv avatar Jun 10 '25 22:06 jmsv

Seems this is an unsolved problem. Wiktionary is not too friendly to scraping.

There is the DBnary project (https://kaiko.getalp.org/about-dbnary/), but its dataset has a number of issues, including syntax errors, mistakes and cycles in the etym data, and the file being a 2 GB RDF file that nothing can load...
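
For anyone who wants to check a candidate dataset for the cycle problem, here is a minimal sketch. It assumes the etymology data has already been parsed into a plain dict mapping each word to the list of words it derives from. That shape and the `find_cycle` name are assumptions for illustration, not DBnary's actual format.

```python
from typing import Dict, List, Optional


def find_cycle(origins: Dict[str, List[str]]) -> Optional[List[str]]:
    """Return one cycle as a list of words (first == last), or None.

    Uses an iterative depth-first search so deep etymology chains
    do not hit Python's recursion limit.
    """
    UNSEEN, IN_PROGRESS, DONE = 0, 1, 2
    state: Dict[str, int] = {}

    for start in origins:
        if state.get(start, UNSEEN) != UNSEEN:
            continue
        state[start] = IN_PROGRESS
        path = [start]  # words on the current DFS path
        stack = [(start, iter(origins.get(start, [])))]
        while stack:
            node, parents = stack[-1]
            for parent in parents:
                seen = state.get(parent, UNSEEN)
                if seen == IN_PROGRESS:
                    # Reached a word already on the current path: a cycle.
                    return path[path.index(parent):] + [parent]
                if seen == UNSEEN:
                    state[parent] = IN_PROGRESS
                    path.append(parent)
                    stack.append((parent, iter(origins.get(parent, []))))
                    break
            else:
                # Every origin of `node` has been explored without a cycle.
                state[node] = DONE
                stack.pop()
                path.pop()
    return None


# Placeholder entries, not real etymology data.
print(find_cycle({"a": ["b"], "b": ["c"], "c": ["a"]}))
print(find_cycle({"a": ["b"], "b": ["c"]}))
```

Entries that sit on a reported cycle could then be dropped or flagged before the data is used to build etymology trees.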

I've burned as much time on this as I'm willing to for the time being.

For anyone who comes along: https://github.com/droher/etymology-db/tree/master seems like another option.

blueforesticarus avatar Jun 11 '25 07:06 blueforesticarus