
Virtuoso wikidata import performance - virtuoso wikidata endpoints as part of snapquery wikidata mirror network

Open WolfgangFahl opened this issue 1 year ago • 7 comments

@TallTed

Tim Holzheim has successfully imported Wikidata into a Virtuoso instance; see https://cr.bitplan.com/index.php/Wikidata_import_2024-10-28_Virtuoso and https://wiki.bitplan.com/index.php/Wikidata_import_2024-10-28_Virtuoso for the documentation.

The endpoint is available at https://virtuoso.wikidata.dbis.rwth-aachen.de/sparql/ and we would love to integrate this and other Virtuoso endpoints into our snapquery infrastructure (https://github.com/WolfgangFahl/snapquery).

Ted suggested that I should open a ticket to get the discussion going about how Virtuoso endpoints could be made part of the snapquery Wikidata mirror infrastructure. The idea is to use named parameterized queries that hide the details of the endpoints, so that it does not matter whether you use Blazegraph, QLever, Jena, Virtuoso, Stardog, ... you name it. Queries should just work as specified and be monitored proactively for non-functional aspects.
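
As a purely illustrative sketch (not part of the original discussion): because all of these backends speak the standard SPARQL 1.1 protocol, the same query can be sent unchanged to any of the mirrors. The query below is just a placeholder, and query.wikidata.org stands in for the canonical Wikidata Query Service.

$ QUERY='SELECT ?p (COUNT(*) AS ?c) WHERE { <http://www.wikidata.org/entity/Q42> ?p ?o } GROUP BY ?p LIMIT 5'
$ for endpoint in https://query.wikidata.org/sparql \
                  https://virtuoso.wikidata.dbis.rwth-aachen.de/sparql/ ; do
      curl -sG -H 'Accept: application/sparql-results+json' \
           --data-urlencode "query=$QUERY" "$endpoint"
  done

snapquery adds the naming, parameterization, and monitoring layer on top of exactly this kind of interchangeable access.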

WolfgangFahl avatar Nov 04 '24 06:11 WolfgangFahl

Note that we (OpenLink Software [1], [2]) have also loaded Wikidata into a live Virtuoso instance, available at https://wikidata.demo.openlinksw.com/sparql.

I'm not sure whether I'm the "Ted" referenced in the last paragraph; if so, regrettably, I've forgotten the specifics of that conversation. Could you provide more detail about the "question" being asked by this issue, especially to benefit others who may have more to contribute to the "answer" than I?

TallTed avatar Nov 04 '24 15:11 TallTed

https://etherpad.wikimedia.org/p/Search_Platform_Office_Hours has the info as well as https://www.wikidata.org/wiki/Wikidata:Scholia/Events/Hackathon_October_2024

We are well aware of the Virtuoso endpoint; it is already configured in the default https://github.com/WolfgangFahl/snapquery/blob/main/snapquery/samples/endpoints.yaml file.

The question here is how we quickly get a Virtuoso endpoint that is as up-to-date as possible. We intend to "rotate" images based on dumps as long as streaming updates are not possible, so currently that would be roughly weekly; https://github.com/ad-freiburg/qlever-control/discussions/82 is an example.

This is just an initial issue to start the communication, as suggested by Ted in the online meeting of the Wikidata Search Platform mentioned above. Depending on how the Virtuoso open source project is going to be involved, we might need multiple tickets for the different aspects. I suggest sticking with the import performance issue in this ticket for the time being and waiting for Tim's comment.

WolfgangFahl avatar Nov 04 '24 15:11 WolfgangFahl

wait for Tim's comment

Is Tim a GitHub user? Tagging their handle seems appropriate, if so. If not, I wonder how they are to comment here? (Also if not a GitHub user, it might make sense to instead raise these threads on the OpenLink Community Forum. They would need to register there, but this could be done using various third-party IdPs.)

TallTed avatar Nov 05 '24 15:11 TallTed

The import took ~4 days and the Virtuoso instance was configured with the recommendation for 64 GB RAM (the highest recommendation available in the documentation). Dump used:

  • file: latest-all.nt.bz2
  • size: 166GB

To improve the import performance I want to try:

  • increasing the RAM configuration
  • splitting the dump file into smaller subsets as recommended on some doc pages for the bulk load

Is there a recommendation for a configuration that would allow the import of the dump on a single day?

I noticed that once the RAM was full, a lot of write-lock log messages occurred (or "waiting to write"; unfortunately I did not save the import logs). To avoid this, the RAM configuration will be increased for the next try, e.g. to 300 GB.

tholzheim avatar Nov 07 '24 10:11 tholzheim

NumberOfBuffers

The virtuoso.ini config file traditionally has a table with some pre-calculated settings for NumberOfBuffers and MaxDirtyBuffers as a starting point based on the amount of free memory space.

Say you have 64 GB of free memory in your system, which corresponds to 64 * 1024 * 1024 * 1024 / 8192 = 8388608 maximum NumberOfBuffers. As you also need memory for related caches, transactions, etc., we recommend using about 2/3 of the maximum, or 5592405, which is rounded down to 5450000 in the table.

The MaxDirtyBuffers is normally set to around 75% of the NumberOfBuffers.
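
For example, assuming the 8 KB buffer size used in the division above, the 64 GB case can be checked with plain shell arithmetic:

$ RAM_BYTES=$(( 64 * 1024 * 1024 * 1024 ))
$ MAX_BUFFERS=$(( RAM_BYTES / 8192 ))                  # one buffer per 8 KB page -> 8388608
$ NUMBER_OF_BUFFERS=$(( MAX_BUFFERS * 2 / 3 ))         # ~2/3 of the maximum -> 5592405 (table value: 5450000)
$ MAX_DIRTY_BUFFERS=$(( NUMBER_OF_BUFFERS * 3 / 4 ))   # ~75% of NumberOfBuffers -> 4194303
$ echo "$NUMBER_OF_BUFFERS $MAX_DIRTY_BUFFERS"
5592405 4194303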

In commit b6845d14a352a0fcf46047a4ef75d463261429bb, we enhanced the way you can specify the number of buffers:

Say you have a machine with 300GB of free memory, and you want to use about 250GB of that for database buffers, leaving around 50GB for Virtuoso's caches, transactions, etc. Instead of performing the above calculation(s), we can simply use the following settings:

[Parameters]
...
NumberOfBuffers = 250G    ; calculate max NumberOfBuffers that will fit in 250GB memory
MaxDirtyBuffers = 75%        ; allow up to 75% of NumberOfBuffers to be dirty

Or in your Dockerfile or docker-compose.yml file:

environment:
      - VIRT_PARAMETERS_NUMBEROFBUFFERS=250G
      - VIRT_PARAMETERS_MAXDIRTYBUFFERS=75%

Splitting latest-all.nt.bz2

Splitting a big dump to smaller files will inevitably take some time; however, depending on the number of cores and threads in your CPU, such a split can greatly reduce the time it takes to bulk-load this dataset using multiple Virtuoso threaded loaders in parallel.

You can try the following Perl script, perl-split.pl, to split the data into chunks of roughly the same size. Depending on how the split program you used before was written, our script may be fractionally faster.

#!/usr/bin/perl
#
#  Simple perl script to split n-triple files like the one from Wikidata
#  into parts.
#
#  Copyright (C) OpenLink Software
#

use strict;
use warnings;

#
#  Vars
#
my $counter = 0;
my $in_file = $ARGV[0];


#
#  Number of bytes to read
#
my $chunk_sz = 500000000;


#
#  Open the source file
#
open (FH, "bzip2 -cd $in_file.nt.bz2 |  ") or die "Could not open source file. $!";

while (1) {
    my $chunk;
    my $out_file;

    #
    #  Open the next part
    #
    print "processing part $counter\n";
    $out_file = sprintf ("wikidata/%s-part-%05d.nt.gz", $in_file, $counter);
    open(OUT, "| gzip -2 >$out_file") or die "Could not open destination file";
    $counter++;

    #
    #  Read the next chunk_sz bytes
    #
    if (!eof(FH)) {
        read(FH, $chunk, $chunk_sz);
        print OUT $chunk;
    }

    #
    #  read upto next \n to complete the part
    #
    if (!eof(FH)) {
        $chunk = <FH>;
        print OUT $chunk;
    }

    close(OUT);
    last if eof(FH);
}

To use it, you can run:

$ wget https://dumps.wikimedia.org/wikidatawiki/entities/latest-all.nt.bz2
$ mkdir wikidata
$ perl perl-split.pl latest-all

This will inevitably take a few hours, after which you can remove the latest-all.nt.bz2 file and use the content of the wikidata directory during bulk-load.

Bulk-loading using multiple threads

Using the scripts in the initdb.d directory, loading is done single-threaded, which of course is not ideal on a machine with lots of memory, cores, and threads.
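
For reference, and only as a rough sketch, a manual parallel bulk load with the standard Virtuoso bulk loader (ld_dir, rdf_loader_run, checkpoint) could look roughly like this; the port, credentials, directory, and loader count are placeholders to adapt to your setup:

$ isql 1111 dba dba exec="ld_dir('/data/wikidata', '*.nt.gz', 'http://www.wikidata.org/');"
$ # start one loader per spare core; 8 loaders shown here as an example
$ for i in $(seq 1 8); do isql 1111 dba dba exec="rdf_loader_run();" & done
$ wait
$ isql 1111 dba dba exec="checkpoint;"

Note that the directory passed to ld_dir must be listed in DirsAllowed in virtuoso.ini for the server to read it.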

I will think about a slightly different way this can be automated.

pkleef avatar Nov 07 '24 11:11 pkleef

@tholzheim thanks, Tim, for showing up and bringing the discussion forward.

WolfgangFahl avatar Nov 07 '24 14:11 WolfgangFahl

see also https://community.openlinksw.com/t/virtuoso-wikidata-mirrors-as-part-of-the-snapquery-mirror-infrastructure/4676

WolfgangFahl avatar Nov 07 '24 14:11 WolfgangFahl