David Mark Nemeskey

38 comments by David Mark Nemeskey

@adbar thanks, that would be great. Let me know if I can be of any help.

Unfortunately not, it was quite some time ago... I encountered this issue while processing Common Crawl data, but I do understand that having to download & parse a billion pages...

The Hungarian list is "broken" as well: it contains lots of content words (e.g. civil war, nighttime, humanity), proper nouns (Adam, Spain), words with commas or dots attached to them, lowercase...

I am seeing missing numbers in both Hungarian and English WP outputs. In Hungarian, the template `{{szám|384402}}` is not expanded; an English example is `{{cvt|384402|km|mi}}` from the [Moon's WP page](https://en.wikipedia.org/wiki/Moon)....
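
To make the report easier to reproduce, here is a minimal sketch that locates such template calls in raw wikitext and pulls out their first positional argument. It is not part of the extractor: the function name `find_number_templates`, the regex, and the sample sentence are all assumptions made for illustration, and the pattern only handles simple, non-nested templates.

```python
import re

# Matches simple, non-nested template calls such as
# {{szám|384402}} or {{cvt|384402|km|mi}}.
TEMPLATE_RE = re.compile(r"\{\{\s*([^{}|]+?)\s*\|([^{}]*)\}\}")


def find_number_templates(wikitext, names=("szám", "cvt")):
    """Return (template_name, first_argument) pairs for the given templates.

    Hypothetical helper: it only reports where the templates occur,
    it does not expand them the way MediaWiki would.
    """
    found = []
    for match in TEMPLATE_RE.finditer(wikitext):
        name = match.group(1)
        if name.lower() in names:
            # Positional arguments are separated by '|'; keep the first one.
            first_arg = match.group(2).split("|")[0].strip()
            found.append((name, first_arg))
    return found


# Illustrative sample text, not taken from an actual dump.
sample = "Hungarian: {{szám|384402}} km; English: {{cvt|384402|km|mi}}."
print(find_number_templates(sample))
```

Running this over a few extracted pages and comparing against the output should show whether the numbers are dropped exactly where these templates occur.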

As I understand it, this script takes as input a single bz2 file, so I just gave it the full pages-articles bz2 dump. That one includes the template pages, and they...

Just to chime in with another alternative: I ended up using the [Kiwix Wikipedia dumps](https://dumps.wikimedia.org/other/kiwix/zim/wikipedia/). The dump consists of ZIM archives, which contain the pages in HTML format. This makes...

I just ran into this issue. Nothing in the documentation mentions how to delete assignments, and I thought that keeping the source directory is OK (I purged all the others)....

Will take a look in a few days, thanks!

From README.md: "Python 3 and PyTorch 0.4 are required for the current codebase." Try running it under Python 3.

@sdraper-CS I am very curious as to what this magic actually does and why it is needed. Could you elaborate on that?