hdt-java HDTManager::generateHDT is not deterministic.

Open jmkeil opened this issue 6 years ago • 3 comments

I noticed that HDTManager::generateHDT does not generate the same output for the same input. This is not just caused by different timestamps in the header. For the same local file, five runs produced these three different sets of header data:

_:dictionary <http://purl.org/HDT/hdt#dictionarysizeStrings> "121121" .
_:statistics <http://purl.org/HDT/hdt#hdtSize> "150874" .
_:statistics <http://purl.org/HDT/hdt#originalSize> "1976817" .

_:dictionary <http://purl.org/HDT/hdt#dictionarysizeStrings> "121096" .
_:statistics <http://purl.org/HDT/hdt#hdtSize> "150849" .
_:statistics <http://purl.org/HDT/hdt#originalSize> "1975553" .

_:dictionary <http://purl.org/HDT/hdt#dictionarysizeStrings> "121071" .
_:statistics <http://purl.org/HDT/hdt#hdtSize> "150824" .
_:statistics <http://purl.org/HDT/hdt#originalSize> "1974289" .
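
To make the pattern in these numbers easier to see, here is a small sketch that computes the step between consecutive variants. The values are copied from the three header blocks above; the class and method names are only illustrative and not part of hdt-java.

```java
import java.util.Arrays;

final class HeaderDeltas {
    // Values copied from the three header variants above, in the order listed.
    static final long[] DICTIONARY_SIZE_STRINGS = {121121, 121096, 121071};
    static final long[] HDT_SIZE = {150874, 150849, 150824};
    static final long[] ORIGINAL_SIZE = {1976817, 1975553, 1974289};

    // Differences between consecutive entries: values[i] - values[i + 1].
    static long[] steps(long[] values) {
        long[] out = new long[values.length - 1];
        for (int i = 0; i < out.length; i++) {
            out[i] = values[i] - values[i + 1];
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println("dictionarysizeStrings: " + Arrays.toString(steps(DICTIONARY_SIZE_STRINGS)));
        System.out.println("hdtSize:               " + Arrays.toString(steps(HDT_SIZE)));
        System.out.println("originalSize:          " + Arrays.toString(steps(ORIGINAL_SIZE)));
        // hdtSize minus dictionarysizeStrings, per variant.
        for (int i = 0; i < HDT_SIZE.length; i++) {
            System.out.println("hdtSize - dictionarysizeStrings: " + (HDT_SIZE[i] - DICTIONARY_SIZE_STRINGS[i]));
        }
    }
}
```

The steps are 25 and 25 for both dictionarysizeStrings and hdtSize, and 1264 and 1264 for originalSize. hdtSize minus dictionarysizeStrings is 29753 in all three variants, which suggests the variation in hdtSize comes entirely from the dictionary.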

Besides the different headers, there are also differences in the serialization of the dictionary.

What is the reason for this?

Jan 18 '19 10:01 jmkeil

Do you remember if you were using an RDF file/stream with blank nodes? @jmkeil

Apr 08 '22 15:04 ate47

I do not recall exactly, but it is likely that I tested it using https://github.com/HajoRijgersberg/OM/blob/d5a3326e2f0f15f69272f3ce147b469fd90a1dc2/om-2.0.rdf. Does that match the hdt:originalSize value?

Apr 08 '22 15:04 jmkeil

My guess was that it is due to the random naming of the BNodes, which would explain the difference in size. However, the size is computed from the random bnode names, and those are usually of the form _:anUuid, so they have a fixed size. So I think it’s not only that.
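
To illustrate the fixed-size point, here is a minimal sketch. It assumes the labels look like "_:" followed by a java.util.UUID string. That format is an assumption for illustration only; the actual labels depend on the RDF parser in use. The class and method names are hypothetical.

```java
import java.util.UUID;

final class BNodeLabelLength {
    // Hypothetical label format: "_:" + random UUID, following the "_:anUuid" description.
    // This is an assumption, not the confirmed behaviour of hdt-java or its parser.
    static String randomLabel() {
        return "_:" + UUID.randomUUID();
    }

    public static void main(String[] args) {
        // The text of the label changes on every call, but its length does not.
        for (int i = 0; i < 5; i++) {
            String label = randomLabel();
            System.out.println(label + " -> " + label.length() + " characters");
        }
    }
}
```

Under that assumption every label is 38 characters long (2 for the prefix plus 36 for the UUID string), so random naming alone would change the content of the dictionary but not the reported sizes.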

Apr 08 '22 16:04 ate47
