Develop CSV importer

Open alanbernstein opened this issue 7 years ago • 4 comments

For the use case work, we put together a CSV import system that is specific to the two use cases, but lays some groundwork for working with more general data sources. The scope is limited to well-formatted, well-defined tabular data, so users will be responsible for providing clean data.

alanbernstein avatar Mar 09 '17 21:03 alanbernstein

Mind sharing those use cases and how a CSV file would map to the structure of an index?

bruth avatar Aug 12 '18 18:08 bruth

The mapping for relational data is outlined in our docs at https://www.pilosa.com/docs/latest/data-model/#relational-analogy, and we have a few use case writeups at https://www.pilosa.com/use-cases/. I believe the two referenced in this ticket are transportation and network traffic. Note that these pages are overdue for some updates; you can see up-to-date PDK use case code in the repo: https://github.com/pilosa/pdk/tree/master/usecase.
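
As a rough illustration of that relational analogy (each record becomes a column, each attribute becomes a field, and each attribute value becomes a row in that field), here is a minimal Python sketch. It is not the PDK importer; the function name `csv_to_bits`, the sample header names, and the assumption that the CSV already carries an integer ID column are all hypothetical.

```python
import csv
import io


def csv_to_bits(text, id_column):
    """Yield (field, row_value, column_id) triples from CSV text.

    Assumes clean, well-formed input with a header line and an
    integer ID column; no validation is attempted.
    """
    reader = csv.DictReader(io.StringIO(text))
    for record in reader:
        column_id = int(record[id_column])  # one record -> one column
        for field, value in record.items():
            if field == id_column:
                continue
            # one attribute value -> one row within the field
            yield (field, value, column_id)


# Hypothetical sample data, loosely modeled on a transportation use case.
sample = "id,cab_type,passenger_count\n0,yellow,1\n1,green,2\n"
bits = list(csv_to_bits(sample, "id"))
```

Each triple says "set the bit at this row and column in this field". Row values are left as strings here, so a real import would still need to map them to row IDs or rely on keys.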

alanbernstein avatar Aug 13 '18 03:08 alanbernstein

Thanks. I found the table in the Python notebook you put together helpful, as well as the suggestion for binning strategies. Is the general recommendation that row IDs be contiguous, to optimize the bitmap compression (via Roaring)? And is this handled automatically if a field is created that supports keys?

bruth avatar Aug 13 '18 12:08 bruth

@bruth it isn't as crucial that row IDs be contiguous, but column IDs should be as close to contiguous as possible. This is handled for you if you use keys.
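
To illustrate the general idea of why keys give you dense IDs, here is a minimal Python sketch of a key-to-ID translator that hands out sequential integers in order of first appearance. This is only a conceptual sketch, not Pilosa's actual key translation; the class name `KeyTranslator` and the sample keys are hypothetical.

```python
class KeyTranslator:
    """Assign each distinct string key the next unused integer ID."""

    def __init__(self):
        self._ids = {}

    def translate(self, key):
        # A new key gets the next sequential ID, so IDs stay dense
        # no matter how sparse or arbitrary the original keys are.
        if key not in self._ids:
            self._ids[key] = len(self._ids)
        return self._ids[key]


translator = KeyTranslator()
keys = ["user-9f3", "user-a01", "user-9f3", "user-77c"]
ids = [translator.translate(k) for k in keys]
# ids == [0, 1, 0, 2]
```

Because IDs are assigned in arrival order, the resulting column IDs form a gap-free range starting at 0, which is the property that matters for compression.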

jaffee avatar Aug 24 '18 16:08 jaffee