
Index too large

Open fulmicoton opened this issue 4 years ago • 5 comments

The search benchmark consists of indexing all documents in the English Wikipedia. To level the playing field, we merge all segments down to a single segment.

I was happy to see that rucene also implemented force_merge with the blocking option.

Unfortunately, after the merge finishes, I end up with an index of 24 GB. (Tantivy and Lucene both end up with an index of 3 GB.)

fulmicoton avatar Dec 23 '19 01:12 fulmicoton

Apologies: I found one of the problems: I was indexing with term vector information! I'll reindex and report here whether that solves the problem.

fulmicoton avatar Dec 23 '19 01:12 fulmicoton

Correction: 6.6 GB.

This is a bit more than twice the size I would have expected. I think the files that existed before the merge are simply not deleted.

fulmicoton avatar Dec 23 '19 02:12 fulmicoton

Hi Paul, thanks for reporting these issues. We will try to fix them and let you know when we are done.

sunxiaoguang avatar Dec 23 '19 02:12 sunxiaoguang

I am mostly blocked on issue #3.

fulmicoton avatar Dec 23 '19 06:12 fulmicoton

@tongjianlin Can you double-check whether we return from the blocking force_merge before the old segments are reclaimed?

sunxiaoguang avatar Dec 23 '19 15:12 sunxiaoguang