pdk
Develop CSV importer
For the use case work, we put together a CSV import system that is specific to the two use cases, but lays some groundwork for working with more general data sources. The scope is limited to well-formatted, well-defined tabular data, so users will be responsible for providing clean data.
Mind sharing those use cases and how a CSV file would map to the structure of an index?
The mapping for relational data is outlined in our docs at https://www.pilosa.com/docs/latest/data-model/#relational-analogy, and we have a few use case writeups at https://www.pilosa.com/use-cases/. I believe the two referenced in this ticket are transportation and network traffic. Note that these pages are overdue for some updates; you can see up-to-date PDK use case code in the repo: https://github.com/pilosa/pdk/tree/master/usecase.
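To make the relational analogy concrete, here is a minimal Python sketch of how a clean CSV could be flattened into bits, assuming each CSV record becomes a column, each CSV header becomes a field, and each distinct value becomes a row in that field. The function name `csv_to_bits` and the sample data are hypothetical and are not part of PDK or any Pilosa client; this only illustrates the mapping, not the importer's actual code.

```python
import csv
import io


def csv_to_bits(text):
    """Flatten CSV text into (field, row_value, column_id) triples.

    Assumed mapping (illustrative only):
      - each CSV record  -> one column, numbered in file order
      - each CSV header  -> one field
      - each cell value  -> one row within that field
    """
    reader = csv.DictReader(io.StringIO(text))
    bits = []
    for column_id, record in enumerate(reader):
        for field, value in record.items():
            bits.append((field, value, column_id))
    return bits


# Hypothetical sample loosely modeled on the transportation use case.
SAMPLE = "cab_type,passenger_count\nyellow,1\ngreen,2\n"
bits = csv_to_bits(SAMPLE)
for bit in bits:
    print(bit)
```

Because the column ID is simply the record's position in the file, the column IDs produced this way are contiguous by construction.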
Thanks. I found the table in the Python notebook you put together helpful, as well as the suggestions for binning strategies. Is the general recommendation that row IDs be contiguous, to optimize the bitmap compression (via Roaring)? And is this handled automatically if a field is created with key support?
@bruth it isn't as crucial that row IDs be contiguous, but column IDs should be as close to contiguous as possible. This is handled for you if you use keys.
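As a rough illustration of why keys take care of this, here is a minimal Python sketch of key translation, assuming string keys are assigned integer IDs in order of first appearance. The `KeyTranslator` class is hypothetical and is not Pilosa's actual implementation; it only shows how such a mapping yields densely packed IDs regardless of what the keys look like.

```python
class KeyTranslator:
    """Assign dense integer IDs to string keys (illustrative only)."""

    def __init__(self):
        self._ids = {}

    def translate(self, key):
        # A new key gets the next unused ID, so IDs stay gap-free.
        if key not in self._ids:
            self._ids[key] = len(self._ids)
        return self._ids[key]


translator = KeyTranslator()
# Hypothetical column keys, e.g. IP addresses from network traffic data.
keys = ["10.0.0.7", "192.168.1.20", "10.0.0.7", "172.16.4.1"]
ids = [translator.translate(k) for k in keys]
print(ids)  # [0, 1, 0, 2]
```

A repeated key maps back to its existing ID, so the set of IDs in use always runs from 0 up to the number of distinct keys minus one.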