Parsing bug with multiple workers
Hey,
I've found a bug that occurs when using multiple workers.
Take for example the tinywiki dataset.
When I run the following code:
```js
const dumpster = require('dumpster-dive');

const options = {
  file: process.argv[2],
  db: 'tinywiki',
  skip_redirects: false,
  skip_disambig: false,
  batch_size: 1000,
  workers: 4,
  custom: function(doc) {
    console.log(doc.title(), doc.text().length);
    return {};
  }
};

dumpster(options, () => console.log('Parsing is Done!'));
```
Here argv[2] is the path to the tinywiki XML file, ./tests/tinywiki-latest-pages-articles.xml.
When I run it with 1 worker, I get the following output:
```
Hello 49
Toronto 524
Duplicate title 32
Duplicate title 26
Big Page 788
Redirect page 0
Disambiguation page 238
Bodmin 7921
```
In contrast, when I run it with 4 workers I get this (note what happens to the Big Page and Bodmin text lengths):
```
Redirect page 0
Hello 49
Toronto 524
Duplicate title 32
Duplicate title 26
Big Page 0
Disambiguation page 238
Bodmin 0
```
I haven't looked at how the work is divided among the workers, but my guess is that the file is getting chopped in the middle of pages, making their text unreadable by the parser.
Thanks!
Yeah, I've seen this too. I think it's an artifact of the file being small. When we split the file, we probably split it in the middle of an article, so to save that article we bump the margins a little bit each time.
If you can think of a smarter method for this, I'm all for it. I've seen this before and also thought it was a bug.
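For anyone following along, the margin idea could look something like this toy sketch. To be clear, this is not dumpster-dive's actual code, and `splitWithMargin` is a made-up name: it just shows how padding each worker's byte range makes a boundary-straddling article fully visible to at least one worker.

```javascript
// Illustrative sketch: split a file of `size` bytes into `workers` ranges,
// padding each range by `margin` bytes on both sides so that an article
// straddling a boundary lands completely inside one worker's range.
function splitWithMargin(size, workers, margin) {
  const chunk = Math.floor(size / workers);
  const ranges = [];
  for (let i = 0; i < workers; i++) {
    const start = Math.max(0, i * chunk - margin);
    // the last worker reads to the end; the others read a bit past their chunk
    const end = i === workers - 1 ? size : Math.min(size, (i + 1) * chunk + margin);
    ranges.push({ start, end });
  }
  return ranges;
}
```

The trade-off is that pages inside the overlap can be parsed twice, which is presumably why duplicates would then need to be caught downstream (e.g. at the mongo write).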
Hmm, let me take a look. Can you point me to the file and line of the function that does this? It would save me some time.
Sorry, I looked briefly and couldn't find it. I may be wrong.
I don't believe, though, that this affects a file larger than a few pages. Please let me know if you discover anything.
The file-reader is here, and dumpster-dive uses percentages, so it could be a rounding error too. Cheers
From my understanding, it picks a specific line in the file, say at the 25% mark (for the 2nd of 4 workers). So it may well not land exactly on a <page> line but on some other line, possibly in the middle of a text XML tag. In that case it can lose the entry, because it doesn't have all the data required and can't detect the beginning of the entry, which is marked by the <page> line.
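To make the guess above concrete, here's a toy version of picking a percentage offset and scanning forward to the next <page> tag. This is hypothetical, not the library's reader (`seekToNextPage` is a made-up name, and the real reader works on file streams, not strings); the point is that whatever is between the raw offset and the next <page> belongs to a page whose start the worker never saw.

```javascript
// Illustrative sketch: given a fractional offset into the dump, advance
// to the next '<page>' tag so the worker starts at an article boundary.
// Everything between the raw offset and that tag is the tail of a page
// this worker cannot parse on its own.
function seekToNextPage(xml, pct) {
  const raw = Math.floor(xml.length * pct);
  const idx = xml.indexOf('<page>', raw);
  return idx === -1 ? xml.length : idx;
}
```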
I did find this occurring in a large wikidump, the Simple English Wikipedia.
Ah, ok, shoot. I didn't think it was happening, because duplicate pages throw errors on mongo writes, and I didn't see any. Let's try to isolate it.
I don't think it's a matter of duplicates. Rather, a page gets split between two workers; neither worker gets all the information it needs, so each just skips ahead to the next <page> tag, and the entry ends up missing from both.
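Here's a tiny self-contained demonstration of that failure mode (hypothetical logic, not the library's code): if each side keeps only the <page> blocks that are complete within its slice, the page cut by the split point vanishes from both sides.

```javascript
// Keep only the <page>...</page> blocks that appear complete in a slice;
// a page cut by the split boundary matches in neither slice.
function completePages(slice) {
  return slice.match(/<page>[\s\S]*?<\/page>/g) || [];
}

const dump = '<page>A</page><page>BB</page><page>CCC</page>';
const mid = Math.floor(dump.length / 2); // falls inside the second page
const left = completePages(dump.slice(0, mid));   // only the 'A' page
const right = completePages(dump.slice(mid));     // only the 'CCC' page
// the 'BB' page is dropped by both sides
```

With overlapping margins (or by seeking to a <page> boundary before splitting), the straddling page would appear whole in at least one slice.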