Ahmet Arslan

Results 14 comments of Ahmet Arslan

Here is what I do for the spam rankings: I split the huge (15 GB) spamFusion file into chunks; these chunks (spam scores) are saved into a directory structure that is identical...

My ClueWeb09B_SpamFusion directory (which contains the chunks) is 1.4G in size. The indexer loads a single chunk file per WARC file, so memory won't be a problem. But preparing these chunks (...

It looks like we can resolve the WARC folder path for a given docid deterministically, e.g. docid = clueweb09-en0000-00-35369, path = ClueWeb09_English_1/en0000/00.warc.gz. Then we can create miniature fusion files from the...
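A minimal sketch of that docid-to-path resolution, assuming docids follow the `clueweb09-<segment>-<file>-<record>` pattern from the example and that WARC files carry the `.warc.gz` extension. The class name `WarcPathResolver`, the method `resolve`, and the idea of passing the top-level directory (`ClueWeb09_English_1` here) as a parameter are all hypothetical choices for illustration, not taken from the actual code.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: maps a ClueWeb09 docid to the WARC file that holds it.
final class WarcPathResolver {

    // Assumed docid layout: clueweb09-<segment>-<file>-<record>,
    // e.g. clueweb09-en0000-00-35369
    private static final Pattern DOCID =
            Pattern.compile("clueweb09-([a-z0-9]+)-(\\d+)-(\\d+)");

    // root is the top-level directory; the caller is assumed to know it.
    static String resolve(String root, String docid) {
        Matcher m = DOCID.matcher(docid);
        if (!m.matches()) {
            throw new IllegalArgumentException("Unexpected docid: " + docid);
        }
        String segment = m.group(1); // e.g. en0000
        String file = m.group(2);    // e.g. 00
        return root + "/" + segment + "/" + file + ".warc.gz";
    }

    public static void main(String[] args) {
        System.out.println(resolve("ClueWeb09_English_1", "clueweb09-en0000-00-35369"));
    }
}
```

Because the path depends only on the docid, a per-WARC chunk of spam scores could be located the same way by swapping the root directory and file extension.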

Aha, I see. So you just want to percolate the result list? Then we need the ability to query an arbitrary document id. I cannot think of a solution without a key-value...

Let me try fastutil tomorrow. If it does not blow the memory that would be the best solution.

I played with `Object2IntOpenHashMap`; however, the following program, run with `java -server -Xmx20g`, resulted in an out-of-memory error. I think, even if we don't insert into a map, just sequentially traversing this...

I have 64 GB :) Is there a maximum `-Xmx` value we should aim for here? Can you try the loading code? I wonder how much heap it will take.

With 80GB, 503,903,810 entries were loaded into the map in 00:42:27. If you think this resource usage is reasonable, I can replace voldemort with a fastutil map in the code that percolates...

I found a better data structure, `ReferenceOpenHashSet`, for the task. I am abandoning voldemort for myself too. The program will take three arguments: spam threshold, submission file and...

Thanks @gladyscarrizales for the Spanish translation! For your future contributions, can you please mention the JIRA issue (e.g. `CONNECTORS-1256`) in the pull request title? We would like to test whether...