pdk
Add an option to set import strategy
The Go client supports importing using different strategies. It may be useful to have an option to specify which strategy to use.
I'd rather not add more API surface area right now (either here, or in the Go client). I'm not convinced that having multiple strategies to choose from is a good user experience. I'm thinking that we can combine the two (batch and timeout) so that normally batch is used, but there is a timer that fires if data has been sitting buffered for too long so that there is a cap on the lag between data coming in and being indexed.
Adding to this a little bit, I ran into an interesting case importing batches of records which did not behave the way I was expecting (not wrong necessarily, but unexpected).
Importing 3M records (columns 0 -> 3M-1) with BatchSize = 1M
resulted in the following batch pattern. It was unexpected because I was expecting every post to the Pilosa server to contain 1M records. But what actually occurs is that 1M records are mapped to the appropriate slice, and then all slices are posted. This resulted in posts containing various sizes.
IMPORT: slice: 0, records: 1000000
IMPORT: slice: 0, records: 48576
IMPORT: slice: 1, records: 951424
IMPORT: slice: 0, records: 0
IMPORT: slice: 1, records: 97152
IMPORT: slice: 2, records: 902848
(note that https://github.com/pilosa/go-pilosa/pull/142 fixes the 0 records issue)
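To make the arithmetic behind those numbers explicit, here is a small standalone simulation. It assumes a slice width of 1,048,576 (2^20) columns, which I believe is the default, and it mimics the behavior described above (buffer `BatchSize` records, group them by slice, post every non-empty slice). `simulateImport` and `post` are hypothetical names; this does not call the real client, and it does not reproduce the 0-record post.

```go
package main

import (
	"fmt"
	"sort"
)

// sliceWidth is assumed to be the default number of columns per slice.
const sliceWidth = 1 << 20 // 1,048,576

// post describes one simulated import request to the server.
type post struct {
	Slice   uint64
	Records int
}

// simulateImport (hypothetical) walks columns 0..total-1 in batches of
// batchSize, groups each batch by slice, and records one post per
// non-empty slice, in ascending slice order.
func simulateImport(total, batchSize uint64) []post {
	var posts []post
	for start := uint64(0); start < total; start += batchSize {
		end := start + batchSize
		if end > total {
			end = total
		}

		// Count how many records of this batch land in each slice.
		counts := make(map[uint64]int)
		for col := start; col < end; col++ {
			counts[col/sliceWidth]++
		}

		// Sort the slice numbers so the output order is deterministic.
		slices := make([]uint64, 0, len(counts))
		for s := range counts {
			slices = append(slices, s)
		}
		sort.Slice(slices, func(i, j int) bool { return slices[i] < slices[j] })

		for _, s := range slices {
			posts = append(posts, post{Slice: s, Records: counts[s]})
		}
	}
	return posts
}

func main() {
	for _, p := range simulateImport(3000000, 1000000) {
		fmt.Printf("IMPORT: slice: %d, records: %d\n", p.Slice, p.Records)
	}
}
```

The second batch (columns 1,000,000 to 1,999,999) straddles the slice 0/1 boundary at column 1,048,576, so it splits into 48,576 + 951,424 records; the third batch splits the same way across slices 1 and 2.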
I realize that waiting until a slice has 1M records before posting may not be ideal either, especially in the case where columns are set randomly across many slices.
So I agree with @jaffee: we need to think through the batch strategy and be smart about it ourselves. Putting it on the user will likely result in unexpected and/or poor performance.