
Slow loading of the Wikidata .bz2 dump

Open · kermitt2 opened this issue on Jun 12, 2020 · 2 comments

The Wikidata dump has become very large, with 1.2 billion statements, which makes the initial loading of the .bz2 dump into lmdb particularly slow.

To speed-up this step, we could try:

  • instead of making two passes on the dump, one to get the properties and one to get the statements, we could do both in a single pass and resolve the properties afterwards against the db

  • instead of reading line by line, try reading larger buffered blocks (see the sketch after this list)
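
For the second point, here is a minimal sketch of what reading the dump with larger buffers could look like, using Apache Commons Compress. This is not the actual entity-fishing loader; the class name, the 8 MB buffer size and the command-line argument are only illustrative assumptions:

import java.io.BufferedInputStream;
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

import org.apache.commons.compress.compressors.bzip2.BZip2CompressorInputStream;

public class DumpReaderSketch {
    public static void main(String[] args) throws Exception {
        String dumpPath = args[0]; // e.g. the Wikidata latest-all.json.bz2 dump

        // large buffers both below and above the bzip2 decompressor, so the
        // file is read in big blocks rather than line-sized chunks
        final int bufferSize = 8 * 1024 * 1024; // 8 MB, to be tuned

        try (BufferedInputStream fileIn =
                 new BufferedInputStream(Files.newInputStream(Paths.get(dumpPath)), bufferSize);
             BZip2CompressorInputStream bzIn =
                 new BZip2CompressorInputStream(fileIn, true);
             BufferedReader reader =
                 new BufferedReader(new InputStreamReader(bzIn, StandardCharsets.UTF_8), bufferSize)) {

            long nbLines = 0;
            String line;
            while ((line = reader.readLine()) != null) {
                // each line of the Wikidata JSON dump is one entity document:
                // parse it and load it into lmdb here
                nbLines++;
            }
            System.out.println("read " + nbLines + " lines");
        }
    }
}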

kermitt2 commented on Jun 12, 2020, 14:06

Complementary info:

  • full loading of Wikidata plus 5 Wikipedia languages from compiled csv files now takes 22h45m from a mechanical hard drive... (ok, I should use an SSD...)

  • the Wikidata statement database in particular went up from 17GB to 62GB in a bit more than 2 years.

The good news is that the increase in Wikidata volume does not affect runtime, only the storage size.

kermitt2 commented on Jun 12, 2020, 22:06

Hi Patrice,

According to https://www.wikidata.org/wiki/Wikidata:Statistics, one of the main reasons for the explosion of the statement db is that most published scientific articles now have an entry in Wikidata. They currently represent more than 22M concepts out of 71M.

I understand the interest of being able to build graphs between authors and articles, but it is not very useful for entity-fishing, given that these scholarly articles have no associated Wikipedia pages and have long titles that cannot be recognized by the current entity-fishing mention recognizers. Take the "Attention Is All You Need" paper for example: https://www.wikidata.org/wiki/Q30249683

So one possible optimization of the statement db size would be to filter out some classes ("scholarly article" being one of them) when initially building the lmdb database. Let's imagine you can define such a filtering constraint somewhere (or hard-code it?), for example in the kb.yaml file:

#dataDirectory: /home/lopez/resources/wikidata/

# Exclude scholarly articles from the statement db
excludedConceptStatements:
  - conceptId:
    propertyId: P31
    value: Q13442814
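
On the configuration side, here is a minimal sketch of how such entries could be mapped, assuming the kb.yaml is deserialized with Jackson's YAML module; the class and field names are hypothetical and not part of the current entity-fishing configuration:

import java.io.File;
import java.util.List;

import com.fasterxml.jackson.databind.ObjectMapper;
import com.fasterxml.jackson.dataformat.yaml.YAMLFactory;

// Hypothetical mapping of the kb.yaml exclusion rules; the field names mirror
// the YAML keys above.
public class KBConfiguration {
    public static class ExcludedConceptStatement {
        public String conceptId;   // null = match any subject concept
        public String propertyId;  // e.g. P31 ("instance of")
        public String value;       // e.g. Q13442814 ("scholarly article")
    }

    public List<ExcludedConceptStatement> excludedConceptStatements;

    public static KBConfiguration load(File yamlFile) throws Exception {
        ObjectMapper mapper = new ObjectMapper(new YAMLFactory());
        return mapper.readValue(yamlFile, KBConfiguration.class);
    }
}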

When filling the statement db, if I detect a concept matching the constraint ("instance of" "scholarly article" for example), I skip this concept and don't store its statements:

        if ((propertyId != null) && (value != null)) {
            // check whether this statement matches one of the configured exclusion rules;
            // a null field in a rule acts as a wildcard
            if (excludedConceptStatements != null) {
                for (Statement excludedConceptStatement : excludedConceptStatements) {
                    exclude = (excludedConceptStatement.getConceptId() == null || excludedConceptStatement.getConceptId().equals(itemId)) &&
                            (excludedConceptStatement.getPropertyId() == null || excludedConceptStatement.getPropertyId().equals(propertyId)) &&
                            (excludedConceptStatement.getValue() == null || excludedConceptStatement.getValue().equals(value));
                    if (exclude)
                        break;
                }
            }
            Statement statement = new Statement(itemId, propertyId, value);
            if (!statements.contains(statement))
                statements.add(statement);
        }
...
...
            // only persist the statements of concepts that did not match an exclusion rule
            if (statements.size() > 0 && !exclude) {
                try {
                    db.put(tx, KBEnvironment.serialize(itemId), KBEnvironment.serialize(statements));
                    nbToAdd++;
                    nbTotalAdded++;
                } catch(Exception e) {
                    e.printStackTrace();
                }
            }

I think this could considerably reduce the size of the statement db.

I can even propose a PR for such a mechanism.

Best regards,
Olivier

oterrier commented on Oct 21, 2020, 06:10