
How to generate file outside pilosa

Open sydt2014 opened this issue 5 years ago • 2 comments

I have a 200T CSV file, and it is impossible to generate the bitmap files inside Pilosa. Could I generate the bitmap files outside Pilosa with another engine such as Spark or Tez, and then load them into Pilosa?

sydt2014 avatar Nov 21 '19 08:11 sydt2014

This is very important for using Pilosa in production. In practice the original data is very large, so the bitmap files cannot be generated by Pilosa itself and must be produced by a big-data engine such as Spark or MapReduce.

sydt2014 avatar Nov 21 '19 08:11 sydt2014

@sydt2014 sorry for the delay here — the short answer is that while this is possible, you'd probably have to do quite a bit of custom work to make it happen.

If you have a really huge amount of data, you might look into Molecula — we're building a lot of tooling and new capabilities around Pilosa to deal with large datasets and enterprise needs. If that doesn't look to be an option, I can give you some pointers on file format and such.

Here's a snippet from api.go which discusses the import-roaring endpoint and our file format:

// ImportRoaring is a low level interface for importing data to Pilosa when
// extremely high throughput is desired. The data must be encoded in a
// particular way which may be unintuitive (discussed below). The data is merged
// with existing data.
//
// It takes as input a roaring bitmap which it uses as the data for the
// indicated index, field, and shard. The bitmap may be encoded according to the
// official roaring spec (https://github.com/RoaringBitmap/RoaringFormatSpec),
// or to the pilosa roaring spec which supports 64 bit integers
// (https://www.pilosa.com/docs/latest/architecture/#roaring-bitmap-storage-format).
//
// The data should be encoded the same way that Pilosa stores fragments
// internally. A bit "i" being set in the input bitmap indicates that the bit is
// set in Pilosa row "i/ShardWidth", and in column
// (shard*ShardWidth)+(i%ShardWidth). That is to say that "data" represents all
// of the rows in this shard of this field concatenated together in one long
// bitmap.
func (api *API) ImportRoaring(ctx context.Context, indexName, fieldName string, shard uint64, remote bool, req *ImportRoaringRequest) 

jaffee avatar Dec 03 '19 04:12 jaffee
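
The bit layout described in that comment can be sketched in Go as follows. This is only an illustration of the arithmetic, not code from Pilosa: the function names `FragmentPosition` and `RowColumn` are hypothetical, and `ShardWidth` is assumed to be 2^20 (the default, as far as I know; check the value your build uses). An external job (Spark, MapReduce, etc.) would compute these positions per shard and then serialize them into a roaring bitmap for the import-roaring endpoint.

```go
package main

import "fmt"

// ShardWidth is ASSUMED to be 2^20 here; confirm it against your Pilosa build.
const ShardWidth uint64 = 1 << 20

// FragmentPosition (hypothetical helper) maps a (row, column) pair to the
// shard that owns the column and the bit position "i" inside that shard's
// bitmap, following the layout in the ImportRoaring comment:
// all rows of the shard are concatenated into one long bitmap.
func FragmentPosition(row, col uint64) (shard, pos uint64) {
	shard = col / ShardWidth
	pos = row*ShardWidth + col%ShardWidth
	return shard, pos
}

// RowColumn (hypothetical helper) is the inverse mapping stated in the
// comment: row = i/ShardWidth, column = (shard*ShardWidth)+(i%ShardWidth).
func RowColumn(shard, pos uint64) (row, col uint64) {
	row = pos / ShardWidth
	col = shard*ShardWidth + pos%ShardWidth
	return row, col
}

func main() {
	// Row 3, column 5 of the third shard (shard index 2).
	shard, pos := FragmentPosition(3, 2*ShardWidth+5)
	fmt.Println(shard, pos) // prints "2 3145733" under the assumed ShardWidth

	// Round trip back to the original row and column.
	row, col := RowColumn(shard, pos)
	fmt.Println(row, col) // prints "3 2097157"
}
```

Grouping input records by the computed shard first means each worker can build one bitmap per (field, shard) independently, which matches the per-shard signature of `ImportRoaring`.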