hdtCat error in LongArrayDisk with large files

Open balhoff opened this issue 1 year ago • 6 comments

I'm trying to merge two HDT files using hdtCat.sh. Each file has more than 13 billion triples:

  • file 1 has 13736601325 triples
  • file 2 has 13827925785 triples

After about 25 hours I get this error:

Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: Index -4 out of bounds for length 29
	at org.rdfhdt.hdt.util.disk.LongArrayDisk.get(LongArrayDisk.java:116)
	at org.rdfhdt.hdt.dictionary.impl.utilCat.CatMappingBack.set(CatMappingBack.java:77)
	at org.rdfhdt.hdt.dictionary.impl.FourSectionDictionaryCat.cat(FourSectionDictionaryCat.java:244)
	at org.rdfhdt.hdt.hdt.impl.HDTImpl.cat(HDTImpl.java:486)
	at org.rdfhdt.hdt.hdt.HDTManagerImpl.doHDTCat(HDTManagerImpl.java:329)
	at org.rdfhdt.hdt.hdt.HDTManager.catHDT(HDTManager.java:642)
	at org.rdfhdt.hdt.tools.HDTCat.cat(HDTCat.java:82)
	at org.rdfhdt.hdt.tools.HDTCat.execute(HDTCat.java:116)
	at org.rdfhdt.hdt.tools.HDTCat.main(HDTCat.java:184)

I tried both v3.0.10 and v3.0.9 with the same result. I can provide these files, but each is about 170 GB. I haven't run into this issue with any smaller files.

balhoff avatar Jun 20 '24 15:06 balhoff

Hi, could you try out this:

https://github.com/the-qa-company/qEndpoint/wiki/qEndpoint-CLI-commands#hdtdiffcat-qep-specific

It is an evolution of the tool ....

D063520 avatar Jun 21 '24 06:06 D063520

@D063520 thank you for pointing that out, I hadn't come across it yet. I'm trying it now.

balhoff avatar Jun 21 '24 16:06 balhoff

@D063520 the qEndpoint tool worked! It seems a good bit faster as well, but it uses quite a bit more RAM. I had originally been using a max heap of 150 GB, but ended up increasing it 3 times until it worked with a 400 GB heap. Now I've got an HDT file containing 27.5 billion triples.

balhoff avatar Jun 24 '24 17:06 balhoff

@D063520 actually I used hdtCat.sh from your package, rather than hdtDiffCat. Are these different?

balhoff avatar Jun 24 '24 17:06 balhoff

@ate47

D063520 avatar Jun 26 '24 11:06 D063520

If you pass -kcat, it's the same. Otherwise, by default the qep CLI uses the disk-optimized version and the rdfhdt CLI uses the memory version. The memory one is slow and inefficient.

ate47 avatar Jun 26 '24 11:06 ate47
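
The thread does not identify the root cause of the original exception. As a general illustration only, a negative index into a chunked on-disk array is the kind of symptom that appears when a 64-bit position is narrowed to a 32-bit int somewhere upstream. The sketch below is not taken from hdt-java: the class name, the chunk size, and both methods are hypothetical, and it only shows how such a narrowing turns a valid position into a negative chunk index.

```java
public class ChunkIndexSketch {
    // Hypothetical chunk size: 2^30 bytes per mapped region (an assumption, not read from hdt-java).
    static final long CHUNK_BYTES = 1L << 30;

    // Maps an element position to the chunk that holds it, using 64-bit arithmetic throughout.
    public static int chunkIndex(long elementIndex) {
        long byteOffset = elementIndex * Long.BYTES; // 8 bytes per stored long
        return (int) (byteOffset / CHUNK_BYTES);
    }

    // Simulates a hypothetical upstream bug: the position is narrowed to an int and widened again.
    public static long truncateToInt(long elementIndex) {
        return (int) elementIndex; // values above Integer.MAX_VALUE wrap around
    }

    public static void main(String[] args) {
        long position = 3_750_000_000L; // fits in a long, but not in an int

        // Correct 64-bit path: a small, non-negative chunk index.
        System.out.println("64-bit path:    " + chunkIndex(position));

        // Narrowed path: the wrapped position is negative, and so is the chunk index.
        System.out.println("narrowed path:  " + chunkIndex(truncateToInt(position)));
    }
}
```

If something like this were the cause, it would also fit the observation that smaller files never trigger the error, since their positions stay below the int range. Confirming it would require inspecting the code at the line numbers in the stack trace.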