hdt-java
HDTManager::generateHDT is not deterministic.
I noticed that HDTManager::generateHDT does not generate the same output for the same input.
This is not just caused by different timestamps in the header. For the same local file, five different runs produced these three different sets of header data:
_:dictionary <http://purl.org/HDT/hdt#dictionarysizeStrings> "121121" .
_:statistics <http://purl.org/HDT/hdt#hdtSize> "150874" .
_:statistics <http://purl.org/HDT/hdt#originalSize> "1976817" .

_:dictionary <http://purl.org/HDT/hdt#dictionarysizeStrings> "121096" .
_:statistics <http://purl.org/HDT/hdt#hdtSize> "150849" .
_:statistics <http://purl.org/HDT/hdt#originalSize> "1975553" .

_:dictionary <http://purl.org/HDT/hdt#dictionarysizeStrings> "121071" .
_:statistics <http://purl.org/HDT/hdt#hdtSize> "150824" .
_:statistics <http://purl.org/HDT/hdt#originalSize> "1974289" .
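
A quick way to compare the three variants is to look at the differences between them. The sketch below uses only the numbers quoted above; the class and method names are made up for illustration and are not part of hdt-java. It suggests that the variants, in the order listed, are separated by constant steps, and that hdtSize changes by exactly the same amount as dictionarysizeStrings.

```java
import java.util.Arrays;

class HeaderDeltas {
    // Values copied from the three header variants quoted in this issue.
    static final long[] DICTIONARY_SIZE_STRINGS = {121121L, 121096L, 121071L};
    static final long[] HDT_SIZE = {150874L, 150849L, 150824L};
    static final long[] ORIGINAL_SIZE = {1976817L, 1975553L, 1974289L};

    // Differences between consecutive values, in the order listed.
    static long[] steps(long[] values) {
        long[] out = new long[values.length - 1];
        for (int i = 1; i < values.length; i++) {
            out[i - 1] = values[i - 1] - values[i];
        }
        return out;
    }

    // Element-wise difference a[i] - b[i] of two equally long arrays.
    static long[] minus(long[] a, long[] b) {
        long[] out = new long[a.length];
        for (int i = 0; i < a.length; i++) {
            out[i] = a[i] - b[i];
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println("dictionarysizeStrings steps: "
                + Arrays.toString(steps(DICTIONARY_SIZE_STRINGS))); // [25, 25]
        System.out.println("hdtSize steps:               "
                + Arrays.toString(steps(HDT_SIZE)));                // [25, 25]
        System.out.println("originalSize steps:          "
                + Arrays.toString(steps(ORIGINAL_SIZE)));           // [1264, 1264]
        // The part of hdtSize outside the dictionary strings is the same in all variants.
        System.out.println("hdtSize - dictionarysizeStrings: "
                + Arrays.toString(minus(HDT_SIZE, DICTIONARY_SIZE_STRINGS)));
    }
}
```

If this reading is right, the variation in hdtSize comes entirely from the dictionary strings, while originalSize varies by a different, larger step.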
Besides the differing headers, the serialization of the dictionary also differs between runs.
What is the reason for this?
Do you remember if you were using an RDF file/stream with blank nodes? @jmkeil
I do not recall exactly, but it is likely I tested it using https://github.com/HajoRijgersberg/OM/blob/d5a3326e2f0f15f69272f3ce147b469fd90a1dc2/om-2.0.rdf. Does that fit the hdt:originalSize value?
My guess was that it is due to the randomness of the blank node naming, which would explain a difference in size. However, the size is computed from the random bnode names, and those are usually of the form _:anUuid, so they have a fixed length. So I don't think that is the only cause.
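
To illustrate the fixed-length argument, here is a minimal sketch that assumes blank node labels of the form "_:" followed by a random UUID. That label scheme and the class name are assumptions for illustration only; the labels actually produced by the parser behind hdt-java may look different.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.UUID;

class BNodeLabelLength {
    // Hypothetical label scheme: "_:" + random UUID (not taken from hdt-java).
    static String randomLabel() {
        return "_:" + UUID.randomUUID();
    }

    // Collects the distinct label lengths seen over a number of samples.
    static Set<Integer> observedLengths(int samples) {
        Set<Integer> lengths = new HashSet<>();
        for (int i = 0; i < samples; i++) {
            lengths.add(randomLabel().length());
        }
        return lengths;
    }

    public static void main(String[] args) {
        // UUID.toString() always yields 36 characters, so every label has 38.
        System.out.println(observedLengths(10_000));
    }
}
```

Under that assumption the labels differ in content between runs but never in length, so random naming alone would change the dictionary bytes without changing the reported sizes.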