
HDTManager::generateHDT is not deterministic.

Open jmkeil opened this issue 6 years ago • 3 comments

I noticed that HDTManager::generateHDT does not generate the same output for the same input. This is not caused only by different timestamps in the header. For the same local file, five runs produced these three different sets of header data:

_:dictionary <http://purl.org/HDT/hdt#dictionarysizeStrings> "121121" .
_:statistics <http://purl.org/HDT/hdt#hdtSize> "150874" .
_:statistics <http://purl.org/HDT/hdt#originalSize> "1976817" .

_:dictionary <http://purl.org/HDT/hdt#dictionarysizeStrings> "121096" .
_:statistics <http://purl.org/HDT/hdt#hdtSize> "150849" .
_:statistics <http://purl.org/HDT/hdt#originalSize> "1975553" .

_:dictionary <http://purl.org/HDT/hdt#dictionarysizeStrings> "121071" .
_:statistics <http://purl.org/HDT/hdt#hdtSize> "150824" .
_:statistics <http://purl.org/HDT/hdt#originalSize> "1974289" .

Besides the differing headers, the serialization of the dictionary also differs between runs.

What is the reason for this?

jmkeil, Jan 18 '19 10:01
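As a quick sanity check on the numbers reported in the issue, the sketch below computes the step between consecutive header variants. The class and method names are made up for illustration; only the nine values come from the issue. It does not explain the cause, it only shows that the variants differ by a constant step and that the whole hdtSize variation coincides with the dictionarysizeStrings variation.

```java
import java.util.Arrays;

// Hypothetical helper, not part of hdt-java: plain arithmetic on the reported values.
final class HeaderDeltas {
    // Values copied from the three header variants quoted in the issue.
    static final long[] DICTIONARY_SIZE_STRINGS = {121121L, 121096L, 121071L};
    static final long[] HDT_SIZE = {150874L, 150849L, 150824L};
    static final long[] ORIGINAL_SIZE = {1976817L, 1975553L, 1974289L};

    // Difference between each value and the one that follows it.
    static long[] steps(long[] values) {
        long[] out = new long[values.length - 1];
        for (int i = 0; i < out.length; i++) {
            out[i] = values[i] - values[i + 1];
        }
        return out;
    }

    // Element-wise a[i] - b[i]; both arrays must have the same length.
    static long[] minus(long[] a, long[] b) {
        long[] out = new long[a.length];
        for (int i = 0; i < a.length; i++) {
            out[i] = a[i] - b[i];
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(steps(DICTIONARY_SIZE_STRINGS))); // [25, 25]
        System.out.println(Arrays.toString(steps(HDT_SIZE)));                // [25, 25]
        System.out.println(Arrays.toString(steps(ORIGINAL_SIZE)));           // [1264, 1264]
        // hdtSize minus dictionarysizeStrings is the same in every variant.
        System.out.println(Arrays.toString(minus(HDT_SIZE, DICTIONARY_SIZE_STRINGS)));
        // [29753, 29753, 29753]
    }
}
```

So each variant is 25 bytes smaller in both dictionarysizeStrings and hdtSize than the previous one, and 1264 smaller in originalSize.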

Do you remember if you were using an RDF file/stream with blank nodes? @jmkeil

ate47, Apr 08 '22 15:04

I do not recall exactly, but it is likely I tested it using https://github.com/HajoRijgersberg/OM/blob/d5a3326e2f0f15f69272f3ce147b469fd90a1dc2/om-2.0.rdf. Does that match the hdt:originalSize value?

jmkeil, Apr 08 '22 15:04

My guess was that it is due to the randomness of the blank node naming, which would explain the differences in size. However, the size is computed from the random blank node names, and those usually have the form _:anUuid, so they have a fixed size. So I don't think that is the only cause.

ate47, Apr 08 '22 16:04
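To illustrate the fixed-size argument from the last comment, here is a minimal sketch that assumes blank node labels of the form "_:" followed by a random UUID. That format is an assumption taken from the "_:anUuid" remark; the labels the RDF parser actually generates may look different. The class and method names are hypothetical.

```java
import java.util.UUID;

// Hypothetical illustration, not hdt-java code.
final class BlankNodeLabelLength {
    // Assumed label format: "_:" + random UUID (36 characters in canonical form).
    static String randomLabel() {
        return "_:" + UUID.randomUUID();
    }

    public static void main(String[] args) {
        for (int i = 0; i < 5; i++) {
            String label = randomLabel();
            // The label text changes on every call, its length does not.
            System.out.println(label + " -> " + label.length() + " chars");
        }
    }
}
```

Under this assumption every label is 38 ASCII characters long, so random naming alone changes the bytes of the dictionary but not its total size. A size difference between runs would then have to come from something else, such as labels of varying length or a varying number of distinct labels.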