stateline
Write chain outputs more efficiently
Use run-length encoding to reduce space. Maybe switch to a binary format, or even use a compression library such as snappy (https://github.com/google/snappy).
Options to include: a binary format (HDF5?) and run-length encoding.
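To make the run-length idea concrete, here is a minimal sketch in Python. It assumes that a chain repeats the previous sample whenever a proposal is rejected, so consecutive duplicates are common. The names `rle_encode` and `rle_decode` are hypothetical and not part of stateline.

```python
from itertools import groupby


def rle_encode(samples):
    """Collapse runs of consecutive identical samples into (sample, count) pairs."""
    return [(sample, sum(1 for _ in run)) for sample, run in groupby(samples)]


def rle_decode(pairs):
    """Expand (sample, count) pairs back into the full chain."""
    chain = []
    for sample, count in pairs:
        chain.extend([sample] * count)
    return chain


# Hypothetical chain: each tuple is one sample; repeats stand for rejected proposals.
chain = [(0.1, 2.0), (0.1, 2.0), (0.1, 2.0), (0.3, 1.5), (0.3, 1.5), (0.2, 1.7)]
encoded = rle_encode(chain)
decoded = rle_decode(encoded)
```

The saving depends entirely on the acceptance rate: a chain that accepts most proposals has few repeats and gains little from this.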
multiple files for long chains?
I think multiple files is a good idea. I'm still not sure about binary vs text, since the more complicated the file format, the harder it is to read in other languages (e.g. if I'm using an R binding I would expect to easily read the file in R). If the format is too complicated then each language binding would have to provide its own chain file reading functions.
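As a rough sketch of the multiple-files idea, the writer below starts a new file after a fixed number of rows. It uses plain CSV only because that is trivially readable from R or Python; the class name `ChunkedChainWriter`, the `chain_00000.csv` naming scheme and the `rows_per_file` parameter are all assumptions made for illustration.

```python
import csv
import os
import tempfile


class ChunkedChainWriter:
    """Write samples to numbered CSV files, starting a new file every rows_per_file rows."""

    def __init__(self, directory, rows_per_file=1000):
        self.directory = directory
        self.rows_per_file = rows_per_file
        self.file_index = 0
        self.rows_in_file = 0
        self._file = None
        self._writer = None
        os.makedirs(directory, exist_ok=True)

    def _open_next(self):
        # Close the current chunk (if any) and open the next numbered file.
        if self._file is not None:
            self._file.close()
        name = f"chain_{self.file_index:05d}.csv"
        self._file = open(os.path.join(self.directory, name), "w", newline="")
        self._writer = csv.writer(self._file)
        self.file_index += 1
        self.rows_in_file = 0

    def write(self, sample):
        if self._file is None or self.rows_in_file >= self.rows_per_file:
            self._open_next()
        self._writer.writerow(sample)
        self.rows_in_file += 1

    def close(self):
        if self._file is not None:
            self._file.close()
            self._file = None


# Demo: five samples with two rows per file end up in three files.
demo_dir = tempfile.mkdtemp()
writer = ChunkedChainWriter(demo_dir, rows_per_file=2)
for i in range(5):
    writer.write([i, i * 0.5])
writer.close()
files = sorted(os.listdir(demo_dir))
```

Zero-padded file names keep the chunks in order under a plain alphabetical sort, so a reader in any language can just concatenate them.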
Embedded vs server: http://stackoverflow.com/questions/3108437/when-to-use-an-embedded-database
It's basically a toss-up between a server with a binary protocol and an embedded binary DB. The only real difference is that the server runs in a separate process, while the embedded DB runs inside our process in a separate thread.
Embedded DBs:
- Raw ostream: Hard to implement atomicity etc. by hand.
- CSV: text protocol is too slow
- LevelDB: We used this before, but it's not a standard format so Python can't read it.
- HDF5: A bit overkill?
- https://en.wikipedia.org/wiki/Embedded_database#Comparisons_of_database_storage_engines
Server DB:
- InfluxDB: I couldn't find the binary protocol. https://github.com/influxdata/influxdb/issues/139 ("the text protocol with gzip already saturates the storage engine")
- Graphite: text protocol
- Memcached: in-memory key-value store
- Redis: in-memory key-value store
- Interestingly...Postgres: https://news.ycombinator.com/item?id=8368509
Requirements:
- Fast: a binary protocol would be faster than text
- Atomic: so we can Ctrl-C without corrupting the output. If our writes are not atomic, we could always trap Ctrl-C signals (as we do now) and only stop when we're not in the middle of writing to a file.
- Standard, easy-to-parse format: so that something like Python can read it without needing a C++ executable to extract the data.
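The fallback mentioned under the atomicity requirement (trap Ctrl-C and only stop between writes) could look roughly like this in Python. This is a sketch of the general pattern, not stateline's actual signal handling; `DeferredInterrupt` is a hypothetical name, and it assumes it is used from the main thread, since Python only allows signal handlers to be installed there.

```python
import signal


class DeferredInterrupt:
    """Context manager that postpones Ctrl-C until the protected block has finished."""

    def __enter__(self):
        self.pending = False
        # Swap in a handler that only records that SIGINT arrived.
        self._old_handler = signal.signal(signal.SIGINT, self._remember)
        return self

    def _remember(self, signum, frame):
        self.pending = True

    def __exit__(self, exc_type, exc, tb):
        # Restore the previous handler, then act on a deferred Ctrl-C.
        signal.signal(signal.SIGINT, self._old_handler)
        if self.pending and exc_type is None:
            raise KeyboardInterrupt
        return False


# Demo: a simulated Ctrl-C in the middle of a "write" does not cut it short.
written = []
interrupted = False
try:
    with DeferredInterrupt():
        written.append("first half")
        signal.raise_signal(signal.SIGINT)  # stands in for the user pressing Ctrl-C
        written.append("second half")
except KeyboardInterrupt:
    interrupted = True
```

This only guards against Ctrl-C. A crash or a kill -9 mid-write can still leave a partial record, which is where a storage engine with real atomic writes would help.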