
The vast majority of documents are tiny

Open danluu opened this issue 8 years ago • 11 comments

If we look at the wikipedia dump currently hosted on Azure, the modal number of postings per document is 5, and things drop off rapidly from there:

Postings,Count
0,5
1,9013
2,161034
3,490873
4,752513
5,795627
6,458944
7,297922
8,187495
9,159601
10,122515
11,98068
12,93155
13,82168
14,80742
15,74154
16,69059
17,64268
18,67888
19,63546
20,63112
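
A minimal sketch of how the mode and the "N or fewer postings" counts can be read off a histogram like the one above. The function names (`load_histogram`, `mode`, `count_at_most`) are made up for illustration, and the table is only the head of the distribution (0 to 20 postings), so any totals are for that truncated range only.

```python
import csv
import io

# The histogram exactly as posted above (postings per document, document count).
HISTOGRAM_CSV = """Postings,Count
0,5
1,9013
2,161034
3,490873
4,752513
5,795627
6,458944
7,297922
8,187495
9,159601
10,122515
11,98068
12,93155
13,82168
14,80742
15,74154
16,69059
17,64268
18,67888
19,63546
20,63112
"""


def load_histogram(text):
    """Parse 'Postings,Count' CSV text into a {postings: count} dict."""
    reader = csv.DictReader(io.StringIO(text))
    return {int(row["Postings"]): int(row["Count"]) for row in reader}


def mode(hist):
    """Return the postings value that the most documents have."""
    return max(hist, key=hist.get)


def count_at_most(hist, limit):
    """Return how many documents have `limit` or fewer postings."""
    return sum(count for postings, count in hist.items() if postings <= limit)


hist = load_histogram(HISTOGRAM_CSV)
print(mode(hist))              # 5
print(count_at_most(hist, 5))  # 2209065
```

The same `count_at_most` helper would cover questions like "how many documents have 25 or fewer postings" once the full histogram is available.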

danluu avatar Dec 06 '16 20:12 danluu

The vast majority of the really small documents (2 or 3 postings) are list documents. See, for example, https://en.wikipedia.org/?curid=1333 which is a page about the day "August 8." This page contains three words. The title, "August 8" and the body words "August" and "8". This problem should go away if we rerun wikiextractor with the --lists option. We should investigate the other options at https://github.com/attardi/wikiextractor/blob/master/README.md.

MikeHopcroft avatar Dec 08 '16 06:12 MikeHopcroft

https://en.wikipedia.org/?curid=35348 is an example of a document with one posting. This is also a list document. The only posting is "130s" in the title.

MikeHopcroft avatar Dec 08 '16 07:12 MikeHopcroft

This document turns out to be 0 sized, which seems a bit surprising. It has content in it, and the content has been there for years, so it's not that we got some old empty version. The document has the following text:

A & A is a computer virus which infects COM files. It changes an infected program’s time and date stamp to the date and time of infection. When activated, the virus clears and reprints blocks of the screen. The infection code contains the string {A&A}

danluu avatar Dec 08 '16 07:12 danluu

BTW, here are the chunk files with 0 lengths after filtering:

-rw-rw-r-- 1 danluu danluu     27 Dec  7 23:37 Chunk-1361.chunk
-rw-rw-r-- 1 danluu danluu     27 Dec  7 23:35 Chunk-288.chunk
-rw-rw-r-- 1 danluu danluu     27 Dec  7 23:44 Chunk-4016.chunk
-rw-rw-r-- 1 danluu danluu     27 Dec  7 23:47 Chunk-5677.chunk
-rw-rw-r-- 1 danluu danluu     27 Dec  7 23:49 Chunk-6139.chunk
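
A rough sketch of how files like these could be found automatically. It assumes that 27 bytes is what a chunk with no surviving documents occupies, which is read off the listing above and is not a documented constant; the helper name `effectively_empty_chunks` and the demo files are hypothetical.

```python
import tempfile
from pathlib import Path

# Assumption: 27 bytes is the size of a chunk with no surviving documents,
# taken from the listing above; it is not a documented constant.
EMPTY_CHUNK_BYTES = 27


def effectively_empty_chunks(directory, max_bytes=EMPTY_CHUNK_BYTES):
    """Return sorted names of *.chunk files no larger than max_bytes."""
    return sorted(
        path.name
        for path in Path(directory).glob("*.chunk")
        if path.stat().st_size <= max_bytes
    )


# Demo on throwaway files (the contents here are made up, not real chunks).
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "Chunk-288.chunk").write_bytes(b"\0" * 27)
    (Path(tmp) / "Chunk-289.chunk").write_bytes(b"\0" * 4096)
    demo_result = effectively_empty_chunks(tmp)

print(demo_result)  # ['Chunk-288.chunk']
```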

danluu avatar Dec 09 '16 00:12 danluu

I rebuilt the first chunk of wikipedia using the --list parameter to wikiextractor. This reduced the number of short documents significantly. Data below shows number of short documents without --list (on the left) and with --list (on the right):

[image: number of short documents without --list (left) and with --list (right)]

MikeHopcroft avatar Dec 14 '16 05:12 MikeHopcroft

Now I'm investigating remaining short documents.

11291 (length 2) is a stub for Floccinaucinihilipilification.
11839 (length 2) is a soft redirect page for Wikipedia:GNUStufF.
12296 (length 4) is a stub for List of German proverbs.
12409 (length 4) is a stub for Wikipedia:GNE Project Files.
24922 (length 4) is a stub for List of Polish proverbs.

18247 (length 5) is an Index of philosophy articles (A–C). Most of the content for this page is not actually in the wikipedia dump source, just the title.

MikeHopcroft avatar Dec 14 '16 05:12 MikeHopcroft

Does that fix change 36699652? It shouldn't be zero length anymore if the list is included, but it looks like it shouldn't have been zero length in the first place.

danluu avatar Dec 14 '16 05:12 danluu

Just rebuilt the first chunk of wikipedia, adding the -s (preserve sections) and --filter_disambig_pages flags. Here are the results:

no flags: 1536 documents with 25 or fewer postings
--list: 119 documents with 25 or fewer postings
--list -s --filter_disambig_pages: 60 documents with 25 or fewer postings

[image]

MikeHopcroft avatar Dec 14 '16 06:12 MikeHopcroft

Here are documents with 10 or fewer postings in the first chunk now:

5216: 3
11291: 2
11477: 8
12296: 4
18247: 5
18546: 6
21899: 10
24922: 4

Most of these are lists. 5216 is concerning because the page itself has lots of text (see Khmer Language).

MikeHopcroft avatar Dec 14 '16 07:12 MikeHopcroft

I investigated the Khmer Language page. There is text in the wikipedia dump file but wikiextractor loses nearly all of the text:

<doc id="5216" url="https://en.wikipedia.org/wiki?curid=5216" title="Khmer language">
Khmer language


</doc>
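
A minimal sketch for spotting other documents that were stripped the same way, i.e. where nothing but the title survives extraction. It assumes the output looks like the `<doc ...>` block above, with the title repeated as the first body line; the regex, the function name `body_word_counts`, and the second sample document are illustrative only, and real output may escape characters in the `title` attribute, which this does not handle.

```python
import re

# Matches one <doc ...> ... </doc> block in the layout shown above.
DOC_RE = re.compile(
    r'<doc id="(?P<id>\d+)" url="(?P<url>[^"]*)" title="(?P<title>[^"]*)">\n'
    r"(?P<body>.*?)</doc>",
    re.DOTALL,
)


def body_word_counts(text):
    """Yield (doc_id, title, words_after_title) for each <doc> block."""
    for match in DOC_RE.finditer(text):
        lines = match.group("body").splitlines()
        # The title is repeated as the first body line; skip it if present.
        if lines and lines[0].strip() == match.group("title"):
            lines = lines[1:]
        words = sum(len(line.split()) for line in lines)
        yield int(match.group("id")), match.group("title"), words


# First block is the Khmer language output above; the second is a made-up
# document with some body text, included only for contrast.
SAMPLE = '''<doc id="5216" url="https://en.wikipedia.org/wiki?curid=5216" title="Khmer language">
Khmer language


</doc>
<doc id="1" url="https://example.invalid/1" title="Example">
Example

Some body text here.
</doc>
'''

suspicious = [(doc_id, title) for doc_id, title, words in body_word_counts(SAMPLE) if words == 0]
print(suspicious)  # [(5216, 'Khmer language')]
```

A title-only document is not necessarily a bug (stubs and redirects look the same), so anything flagged this way still needs to be checked against the source page by hand.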

MikeHopcroft avatar Dec 14 '16 07:12 MikeHopcroft

After the last set of fixes, the mode has changed from 5 to 24.

We no longer have any 0-length documents and the number of 1-length documents went down from 9013 to 88.

It seems likely that we still have problem documents, but it sounds like we're not going to go after them right now.

danluu avatar Dec 16 '16 04:12 danluu