Ahmet Arslan comments

Results: 14 comments of Ahmet Arslan

Integrate Waterloo spam scores and other static priors into index

Here is what I do for the spam rankings: I split the huge (15 GB) spamFusion file into chunks, and these chunks (spam scores) are saved into a directory structure that is identical...

Integrate Waterloo spam scores and other static priors into index

My ClueWeb09B_SpamFusion directory (which contains the chunks) is 1.4G in size. The indexer loads a single chunk file per warc file, so memory won't be a problem. But preparing these chunks (...

Integrate Waterloo spam scores and other static priors into index

It looks like we can resolve the warc folder path for a given docid deterministically, e.g. docid = clueweb09-en0000-00-35369, path = ClueWeb09_English_1/en0000/00.warc.gz. Then we can create miniature fusion files from the...
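The docid-to-path mapping described in this comment can be sketched roughly as follows. This is a minimal illustration, not the code from the issue: the class name `WarcPathResolver` and the method `relativeWarcPath` are hypothetical, the docid layout `clueweb09-<segment>-<file>-<record>` is inferred from the single example given, and the file extension is assumed to be `.warc.gz`. The top-level directory (`ClueWeb09_English_1` in the example) is not derived here; only the part of the path below it is returned.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class WarcPathResolver {
    // Assumed docid layout: clueweb09-<segment>-<file>-<record>,
    // e.g. clueweb09-en0000-00-35369 (inferred from one example only).
    private static final Pattern DOCID =
            Pattern.compile("clueweb09-([a-z0-9]+)-(\\d+)-(\\d+)");

    // Returns the warc path relative to the top-level collection directory.
    // Hypothetical helper; the ".warc.gz" extension is an assumption.
    static String relativeWarcPath(String docid) {
        Matcher m = DOCID.matcher(docid);
        if (!m.matches()) {
            throw new IllegalArgumentException("Unexpected docid: " + docid);
        }
        return m.group(1) + "/" + m.group(2) + ".warc.gz";
    }

    public static void main(String[] args) {
        System.out.println(relativeWarcPath("clueweb09-en0000-00-35369"));
    }
}
```

With a mapping like this, a per-warc chunk of spam scores could be located by the same relative path under a separate chunk directory, which is presumably what keeps the per-file memory footprint small.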

Integrate Waterloo spam scores and other static priors into index

Aha, I see. So you just want to percolate the result list? Then we need the ability to query an arbitrary document id. I cannot think of a solution without a key-value...

Integrate Waterloo spam scores and other static priors into index

Let me try fastutil tomorrow. If it does not blow up the memory, that would be the best solution.

Integrate Waterloo spam scores and other static priors into index

I played with `Object2IntOpenHashMap`; however, the following program, run with `java -server -Xmx20g`, resulted in an out-of-memory error. I think, even if we don't insert into a map, just sequentially traversing this...

Integrate Waterloo spam scores and other static priors into index

I have 64 GB :) Is there a maximum `-Xmx` value we should aim for here? Can you try the loading code? I wonder how much heap it will take.

Integrate Waterloo spam scores and other static priors into index

With 80 GB, 503903810 entries were loaded into the map in 00:42:27. If you think this resource requirement is reasonable, I can replace Voldemort with a fastutil map in the code that percolates...

Integrate Waterloo spam scores and other static priors into index

I found a better data structure, `ReferenceOpenHashSet`, for the task. I am abandoning Voldemort for myself too. The program will take three arguments: spam threshold, submission file and...
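A set-based filter of this kind might look roughly like the sketch below. It is only an illustration under several assumptions: it uses `java.util.HashSet` instead of fastutil's `ReferenceOpenHashSet` so that it needs no external dependency, it assumes the spam score lines have the form `<percentile> <docid>`, it assumes documents scoring below the threshold are the ones to drop, and it assumes the submission file is a TREC-style run with the docid in the third column. The class and method names are hypothetical, and the third program argument, which is not shown in the comment, is not modeled.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class SpamFilterSketch {
    // Collects docids whose score is below the threshold.
    // Assumed line format: "<percentile> <docid>".
    static Set<String> loadSpamIds(List<String> scoreLines, int threshold) {
        Set<String> spam = new HashSet<>();
        for (String line : scoreLines) {
            String[] parts = line.trim().split("\\s+");
            if (parts.length != 2) continue; // skip blank or malformed lines
            if (Integer.parseInt(parts[0]) < threshold) spam.add(parts[1]);
        }
        return spam;
    }

    // Keeps run lines whose docid (assumed to be the third column) is not spam.
    static List<String> filterRun(List<String> runLines, Set<String> spam) {
        List<String> kept = new ArrayList<>();
        for (String line : runLines) {
            String[] cols = line.trim().split("\\s+");
            if (cols.length >= 3 && !spam.contains(cols[2])) kept.add(line);
        }
        return kept;
    }

    public static void main(String[] args) {
        // Made-up sample data for illustration only.
        List<String> scores = Arrays.asList(
                "5 clueweb09-en0000-00-00001",
                "93 clueweb09-en0000-00-00002");
        List<String> run = Arrays.asList(
                "1 Q0 clueweb09-en0000-00-00001 1 12.5 demo",
                "1 Q0 clueweb09-en0000-00-00002 2 11.0 demo");
        Set<String> spam = loadSpamIds(scores, 70);
        for (String line : filterRun(run, spam)) System.out.println(line);
    }
}
```

Storing only the docids below the threshold, rather than a full docid-to-score map, is one plausible reason a set would need less memory than the map discussed in the earlier comments.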

Spanish

Thanks @gladyscarrizales for the Spanish translation! For your future contributions, can you please mention the JIRA issue (e.g. `CONNECTORS-1256`) in the pull request title? We would like to test whether...