kristian-clausal comments

Results: 51 comments of kristian-clausal

Parsing non-English Wiktionaries

That looks like it might be the exact same problem I've had with Lua/Lupa crashes. Tatu took a look at it yesterday; it might be a version mismatch issue: requirements.txt...

form_of including trailing gloss text

This is an issue that can't easily be fixed through coding. I'll take some time tomorrow to go through a list of obvious problematic cases we generated today and fix...

form_of including trailing gloss text

I spent the whole day going through a _short_ list of error candidates for this; basically, "form of" entries that are suspicious and have a native language word + English language...

Handle pronunciation tags properly in nested structures

> On the plus side, the sense-specific pronunciations of [micrometer](https://kaikki.org/dictionary/All%20languages%20combined/meaning/m/mi/micrometer.html) are correctly separated, so I suppose this is only an issue with POS-specific pronunciations.

That's because the original Wiktionary article...

Handle pronunciation tags properly in nested structures

Fixed with 54a058bb77df73dfc3b638a750da87a0b91ed9c4, 05f0a5f2834655ba32b440cbd7445f242cd3c68a, and 23a7d8f20f44a7322eeb3482ba2d03a22cd9a4a7. With these commits, which didn't break anything in our tests because I don't think we have tests for Pronunciation sections, the two issues are...

Explicitly note JSON Lines output format

At first I thought we were breaking some kind of requirement by not naming the files with .jsonl, but it turns out that's just a "suggestion"; the three requirements are UTF-8...
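For anyone reading the files, here is a minimal sketch of consuming JSON Lines output, assuming only that each line holds one complete JSON value in UTF-8. The `iter_jsonl` helper and the sample entries are hypothetical and made up for illustration; they are not part of the extractor.

```python
import io
import json


def iter_jsonl(stream):
    """Yield one parsed JSON value per non-empty line of a text stream.

    Hypothetical helper for illustration. Blank lines are skipped to be
    tolerant of trailing newlines at the end of a file.
    """
    for line in stream:
        line = line.strip()
        if line:
            yield json.loads(line)


# Made-up sample data standing in for a real file; with a file on disk you
# would use open(path, encoding="utf-8") instead of io.StringIO.
sample = io.StringIO(
    '{"word": "cat", "pos": "noun"}\n'
    '{"word": "run", "pos": "verb"}\n'
)
entries = list(iter_jsonl(sample))
```

Because each line parses on its own, the file extension does not matter to the reader, and a large dump can be processed one entry at a time without loading it all into memory.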

crash with --num-threads set

I've merged your pull request; it seemed like a reasonable workaround.

Support for zh-pron pronunciations

Good catch, Oskari is taking a look at it, and I think the issue is pretty clearly with the "Pronunciation 1" and "Pronunciation 2" pseudo-etymology blocks. From 垃圾, what's missing...

pre-extracted data in .tsv format

Unfortunately, the data we provide is not suitable for straightforward use as .tsv or .csv. The JSON data is hierarchical, with big and reasonably sprawling word structures that contain...
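To illustrate why the conversion is lossy, here is a rough sketch of flattening one hierarchical entry into tab-separated rows. The field names (`word`, `pos`, `senses`, `glosses`, `tags`) and the example entry are assumptions chosen for illustration, and `entry_to_rows` and `write_tsv` are hypothetical helpers, not something the project ships.

```python
import csv
import io


def entry_to_rows(entry):
    """Yield one [word, pos, gloss] row per gloss of an entry.

    Anything that does not fit the three columns (here, the "tags" list)
    is simply dropped, which is the cost of flattening.
    """
    word = entry.get("word", "")
    pos = entry.get("pos", "")
    for sense in entry.get("senses", []):
        for gloss in sense.get("glosses", []):
            yield [word, pos, gloss]


def write_tsv(entries, out):
    """Write a header plus the flattened rows of all entries to `out`."""
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(["word", "pos", "gloss"])
    for entry in entries:
        writer.writerows(entry_to_rows(entry))


# Made-up entry for illustration only.
example = {
    "word": "cat",
    "pos": "noun",
    "senses": [
        {"glosses": ["a small domesticated feline"], "tags": ["countable"]},
        {"glosses": ["a spiteful person"]},
    ],
}

buf = io.StringIO()
write_tsv([example], buf)
tsv_text = buf.getvalue()
```

A single entry fans out into several rows, and nested fields either vanish or need columns of their own, so any flat export has to decide up front which parts of the structure it keeps.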

pre-extracted data in .tsv format

> > * program a script that will do that translation by reading the json file object-by-object and then outputting it into .tsv
> >
> > I think [pyglossary](https://github.com/ilius/pyglossary) supports conversion...