Integrate Waterloo spam scores and other static priors into index

lintool opened this issue 9 years ago • 16 comments

We should develop a generic mechanism to store and use Waterloo spam scores, PageRank, HITS, and other static priors.

@iorixxx Do you have some code to contribute along these lines?

lintool avatar Oct 21 '15 13:10 lintool

Here is what I do for the spam rankings: I split the huge (15 GB) spamFusion file into chunks, and these chunks (spam scores) are saved into a directory structure identical to the ClueWeb09B *.warc directory structure.

To perform this operation, I rely on Voldemort to store the docID/spamScore pairs for category B.

However, when dealing with category A, Voldemort could be the only source during indexing; there would be no need to dump the data into files in a directory structure that mirrors CW09B.

What do you think about storing spam scores in a key-value database such as Voldemort, for fast retrieval during indexing and/or searching?

Is this feasible? Is it generic enough?

iorixxx avatar Oct 22 '15 11:10 iorixxx

Hrm... that's pretty heavyweight and requires an external dependency. I suppose for catB everything can fit in memory. Perhaps we can assume the same for catA? 500M * (2 bytes for value + 4 bytes for key) = 3 GB of raw data, though Java object overhead will inflate that considerably... reasonable on a server?

lintool avatar Oct 22 '15 13:10 lintool

My ClueWeb09B_SpamFusion directory (containing the chunks) is 1.4 GB in size. The indexer loads a single chunk file per warc file, so memory won't be a problem. But preparing these chunks (aligned with the warc files) is heavy; I again rely on Voldemort to produce the chunks.

By "chunk" I mean a miniature spam-ranking file covering just a single warc file.

iorixxx avatar Oct 22 '15 13:10 iorixxx

It looks like we can resolve the warc folder path for a given docid deterministically, e.g., docid = clueweb09-en0000-00-35369 resolves to path = ClueWeb09_English_1/en0000/00.warc.gz.
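
A minimal sketch of that resolution in Java; it assumes docids of the form clueweb09-&lt;segment&gt;-&lt;file&gt;-&lt;record&gt; and hard-codes the category B disk directory (a full implementation would need the segment-to-disk table):

    static String resolveWarcPath(String docid) {
        // e.g. docid = "clueweb09-en0000-00-35369"
        String[] parts = docid.split("-");
        String segment = parts[1]; // "en0000"
        String file = parts[2];    // "00"
        return "ClueWeb09_English_1/" + segment + "/" + file + ".warc.gz";
    }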

Then we could create the miniature fusion files directly from clueweb09spam.Fusion?
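
A rough sketch of that splitting step, reusing resolveWarcPath from above; the ".spam" output layout is hypothetical, and a real implementation would cache open writers rather than appending line by line:

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.StandardOpenOption;

    static void splitFusionFile(Path fusion, Path outputRoot) throws IOException {
        try (BufferedReader reader = Files.newBufferedReader(fusion, StandardCharsets.US_ASCII)) {
            String line;
            while ((line = reader.readLine()) != null) {
                // line format: percentile-score clueweb-docid
                String docid = line.split("\\s+")[1];
                // one miniature file per warc file, mirroring the warc layout
                Path chunk = outputRoot.resolve(resolveWarcPath(docid).replace(".warc.gz", ".spam"));
                Files.createDirectories(chunk.getParent());
                Files.write(chunk, (line + "\n").getBytes(StandardCharsets.US_ASCII),
                        StandardOpenOption.CREATE, StandardOpenOption.APPEND);
            }
        }
    }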

Will the spam scores be used for skipping documents (given a threshold) during indexing?

iorixxx avatar Oct 22 '15 14:10 iorixxx

I'd rather index everything and use spam as a feature during retrieval. That way we don't need to settle on a cutoff.

lintool avatar Oct 22 '15 14:10 lintool

Aha, I see. So you just want to percolate the result list? Then we need the ability to query an arbitrary document id. I cannot think of a solution without a key-value database or something similar. How about we index the spam rankings with Lucene, for arbitrary lookup?
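
For illustration, a minimal sketch of the Lucene-as-key-value-store idea: index each (docid, percentile) pair as a tiny document and look scores up with a TermQuery. The field names here are made up:

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.StoredField;
    import org.apache.lucene.document.StringField;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.TermQuery;
    import org.apache.lucene.search.TopDocs;

    // index one (docid, percentile) pair as a tiny document
    static void indexScore(IndexWriter writer, String docid, int percentile) throws java.io.IOException {
        Document doc = new Document();
        doc.add(new StringField("docid", docid, Field.Store.NO));
        doc.add(new StoredField("percentile", percentile));
        writer.addDocument(doc);
    }

    // look up the percentile for an arbitrary docid
    static int lookup(IndexSearcher searcher, String docid) throws java.io.IOException {
        TopDocs hits = searcher.search(new TermQuery(new Term("docid", docid)), 1);
        if (hits.scoreDocs.length == 0) return -1; // unknown docid
        return searcher.doc(hits.scoreDocs[0].doc).getField("percentile").numericValue().intValue();
    }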

iorixxx avatar Oct 22 '15 17:10 iorixxx

Just a big hashmap we load into memory at startup? Using fastutil, for example?

lintool avatar Oct 22 '15 17:10 lintool

Let me try fastutil tomorrow. If it does not blow up the memory, that would be the best solution.

iorixxx avatar Oct 22 '15 17:10 iorixxx

I played with Object2IntOpenHashMap<String>, but the following program, run with java -server -Xmx20g, resulted in an out-of-memory error. I think even if we don't insert into a map, just sequentially traversing this big file will take time. What is the preferred course of action here?

import it.unimi.dsi.fastutil.objects.Object2IntOpenHashMap;

import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

    /**
     * Try to load the clueweb09spam.Fusion (15 GB) file into memory.
     *
     * @param clueweb09spam spam file name
     * @throws IOException if the file cannot be read
     */
    public static void loadSpamFusion(String clueweb09spam) throws IOException {

        Object2IntOpenHashMap<String> map = new Object2IntOpenHashMap<>();

        Path clueweb09spamFusion = Paths.get(clueweb09spam);

        // isRegularFile already implies existence, so a separate exists() check is redundant
        if (!Files.isRegularFile(clueweb09spamFusion) || !Files.isReadable(clueweb09spamFusion))
            throw new IllegalArgumentException(clueweb09spamFusion + " does not exist or is not a readable file");

        try (BufferedReader reader = Files.newBufferedReader(clueweb09spamFusion, StandardCharsets.US_ASCII)) {

            String line;
            while ((line = reader.readLine()) != null) {
                // each line has the format: percentile-score clueweb-docid
                String[] parts = line.split("\\s+");
                map.put(parts[1], Integer.parseInt(parts[0]));
            }
        }

        System.out.println(map.size() + " entries loaded into the map");
        map.clear();
    }

iorixxx avatar Oct 25 '15 18:10 iorixxx

How much memory do you have on your machine? The machine I use at UMD has 0.75 TB RAM :)

lintool avatar Oct 25 '15 21:10 lintool

I have 64 GB :) Is there a maximum -Xmx value we should aim for here? Can you try the loading code? I wonder how much heap it will take.

iorixxx avatar Oct 27 '15 08:10 iorixxx

Try using max heap?

lintool avatar Oct 27 '15 12:10 lintool

With 80 GB of heap, 503,903,810 entries were loaded into the map in 00:42:27. If you think this resource usage is reasonable, I can replace Voldemort with a fastutil map in the code that percolates the TREC submission file.

iorixxx avatar Nov 16 '15 16:11 iorixxx

I found a better data structure for the task: ReferenceOpenHashSet<String>. I am abandoning Voldemort myself too. The program will take three arguments: a spam threshold, a submission file, and the Waterloo spam scores file/folder. It will then remove the spammiest documents from the submission file. Does this sound reasonable?
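
For concreteness, a minimal sketch of that percolation step, assuming the scores were loaded into the fastutil map shown earlier (the set-based variant would instead store just the spammy docids and test membership); lower percentiles are spammier:

    import it.unimi.dsi.fastutil.objects.Object2IntOpenHashMap;

    import java.io.BufferedReader;
    import java.io.BufferedWriter;
    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;

    static void percolate(Object2IntOpenHashMap<String> spam, Path submission, Path out, int threshold)
            throws IOException {
        try (BufferedReader reader = Files.newBufferedReader(submission);
             BufferedWriter writer = Files.newBufferedWriter(out)) {
            String line;
            while ((line = reader.readLine()) != null) {
                // TREC run format: topic Q0 docid rank score tag
                String docid = line.split("\\s+")[2];
                // treat unknown docids as non-spam (percentile 100)
                int percentile = spam.containsKey(docid) ? spam.getInt(docid) : 100;
                if (percentile >= threshold) {
                    writer.write(line);
                    writer.newLine();
                }
            }
        }
    }

A real implementation would also renumber the ranks after filtering.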

iorixxx avatar Nov 25 '15 21:11 iorixxx

Hi @iorixxx, sorry for the late reply - I was at TREC and am now digging out of a backlog. Yes, this seems reasonable!

lintool avatar Nov 29 '15 03:11 lintool

This is probably the right way to implement this: https://lucene.apache.org/core/8_11_0/core/org/apache/lucene/document/FeatureField.html
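
For reference, a minimal sketch of how FeatureField could carry the spam prior; the field/feature names and the 0.5 boost are illustrative, not Anserini's actual implementation:

    import java.io.IOException;

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.FeatureField;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.TextField;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.search.BooleanClause;
    import org.apache.lucene.search.BooleanQuery;
    import org.apache.lucene.search.BoostQuery;
    import org.apache.lucene.search.Query;

    // index time: store the static prior alongside the document text
    static void addDocument(IndexWriter writer, String contents, int spamPercentile)
            throws IOException {
        Document doc = new Document();
        doc.add(new TextField("contents", contents, Field.Store.NO));
        // FeatureField values must be strictly positive, so shift the 0-99 percentile by one
        doc.add(new FeatureField("features", "spam", spamPercentile + 1f));
        writer.addDocument(doc);
    }

    // query time: blend the text score with the prior
    static Query withSpamPrior(Query textQuery) {
        Query prior = FeatureField.newSaturationQuery("features", "spam");
        return new BooleanQuery.Builder()
                .add(textQuery, BooleanClause.Occur.MUST)
                .add(new BoostQuery(prior, 0.5f), BooleanClause.Occur.SHOULD)
                .build();
    }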

lintool avatar May 29 '22 21:05 lintool