Indexing beer-sample takes very long
I've now been indexing the beer-sample bucket through TAP, and I totally understand that performance is not a goal yet.
That said, I'm 2 hours in and at 5k docs, which is more than a second per doc on average. I noticed that in the beginning it was around 3 docs/s, but then it slowed down and it has probably now stalled completely.
Two questions here:
- Is it expected to be that slow right now?
- If so, I'm curious: what is holding it up in the indexing part?
It's not that slow for me. Do you know how many PIndexes you're creating?
When I put everything in 1 PIndex it takes ~30 seconds to index 7k.
How can I find that out? I was just sticking with all the defaults it presented me.
It's in the configuration somewhere. I could see in the stack trace on the other bug that it had all 64 vbuckets in a single PIndex, though, so that must not be the reason it's so slow for you...
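For reference, in cbgt/cbft the grouping of vbuckets into PIndexes is controlled by the plan params in the index definition; a hedged sketch of the relevant JSON (field names follow later cbgt conventions and may not match this build exactly):
{ "planParams": { "maxPartitionsPerPIndex": 64, "numReplicas": 0 } }
With 64 vbuckets and maxPartitionsPerPIndex at 64, everything lands in a single PIndex, which matches what the stack trace showed.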
I just tried it again. It took about 2 minutes (which is too long). A few observations:
- cbft outputs a LOT of console IO. In my experience this can be a problem.
- It did seem to proceed relatively quickly to about 3k items, then slowed quite a bit.
- Directly indexing the same dataset from files in the bleve example app "beer-search" takes only 37 seconds (this too uses boltdb). See the timings below, and a rough sketch of the batched indexing approach after them:
$ ./beer-search
2014/12/05 15:54:25 GOMAXPROCS: 1
2014/12/05 15:54:25 Creating new index...
2014/12/05 15:54:25 Listening on :8094
2014/12/05 15:54:26 Indexing...
2014/12/05 15:54:30 Indexed 1000 documents, in 4.53s (average 4.53ms/doc)
2014/12/05 15:54:34 Indexed 2000 documents, in 8.47s (average 4.23ms/doc)
2014/12/05 15:54:39 Indexed 3000 documents, in 12.98s (average 4.33ms/doc)
2014/12/05 15:54:44 Indexed 4000 documents, in 17.89s (average 4.47ms/doc)
2014/12/05 15:54:49 Indexed 5000 documents, in 23.27s (average 4.65ms/doc)
2014/12/05 15:54:54 Indexed 6000 documents, in 28.71s (average 4.79ms/doc)
2014/12/05 15:55:01 Indexed 7000 documents, in 34.82s (average 4.97ms/doc)
2014/12/05 15:55:03 Indexed 7303 documents, in 37.33s (average 5.11ms/doc)
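For comparison, the batched indexing that beer-search does boils down to roughly the sketch below. This is not the actual beer-search code; the index path, batch size, and document source are illustrative.

package main

import (
	"log"

	"github.com/blevesearch/bleve"
)

// indexDocs indexes the given documents into a new bleve index in batches.
func indexDocs(docs map[string]interface{}) error {
	idx, err := bleve.New("beer-search.bleve", bleve.NewIndexMapping())
	if err != nil {
		return err
	}
	defer idx.Close()

	batch := idx.NewBatch()
	count := 0
	for id, doc := range docs {
		if err := batch.Index(id, doc); err != nil {
			return err
		}
		count++
		// Flush every 100 docs (illustrative batch size).
		if count%100 == 0 {
			if err := idx.Batch(batch); err != nil {
				return err
			}
			batch = idx.NewBatch()
		}
	}
	return idx.Batch(batch) // flush the remainder
}

func main() {
	// Loading the beer-sample JSON files from disk is omitted here.
	if err := indexDocs(map[string]interface{}{}); err != nil {
		log.Fatal(err)
	}
}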
So this suggests to me that some part of either DCP or cbft is causing the slowdown.
Is it disk bound? I ran it on an HDD iMac...
Yeah that is part of it. Also, I think we determined the batch sizes are different.
Wanted to leave some tips & tricks info, originally from Marty Schoch (2015/02/19). I haven't experimented with these myself yet, though...
I played around with some different leveldb configs with bleve-bench. It looks like the following config works best:
{ "write_buffer_size": 536870912, "lru_cache_capacity": 536870912, "bloom_filter_bits_per_key": 10 }
In my tests, the indexing in batches is almost 2x as fast with these settings. ... these settings seems to affect the query speed very much. ... (perhaps) the way we lay out data means that (bleve doesn't) end up getting much benefit from (leveldb's) bloom filter, because we end up needing something from each file anyway. ...
...
... For LevelDB, the configuration is at the database level. Above, the two caches are 512MB each, i.e. roughly 1GB per shard. So if your number of shards * 1GB is greater than the available RAM, you should probably reduce them.
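A hedged sketch of passing those LevelDB settings to bleve when creating an index directly. It assumes a bleve build with a LevelDB KV store registered under the name "leveldb" (e.g. via blevex); the NewUsing signature shown here varies across bleve versions.

package main

import (
	"log"

	"github.com/blevesearch/bleve"
)

func main() {
	// The same settings quoted above, passed as the KV store config.
	kvconfig := map[string]interface{}{
		"write_buffer_size":         536870912, // 512MB
		"lru_cache_capacity":        536870912, // 512MB
		"bloom_filter_bits_per_key": 10,
	}
	idx, err := bleve.NewUsing("beer-sample.bleve", bleve.NewIndexMapping(),
		bleve.Config.DefaultIndexType, "leveldb", kvconfig)
	if err != nil {
		log.Fatal(err)
	}
	defer idx.Close()
	// Each index (shard) opened this way reserves roughly 1GB for the write
	// buffer plus LRU cache, hence the shards * 1GB guidance above.
}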