
Base datasets on streaming

danijar opened this issue 9 years ago · 2 comments

Goals:

  • Read, write and modify datasets that do not fit into main memory.
  • Streams should be first-class citizens. Base all datasets on streaming to have a single interface.

Possible solutions:

  • Write our own data format based on NumPy blocks living in a tar container. Too much work.
  • Dask (Documentation, Example): comes with a distributed compute graph and scheduler, and supports NumPy and Pandas interfaces, which is nice. Still overkill for us.
  • HDF5 (Documentation, Example): looks good for our use case and is well supported. It also makes sets more widely applicable, since results can be used from any language or from cluster-computing systems like Spark.

We will go with HDF5:

  • Use h5py library.
  • h5py file handles will replace our Dataset class completely.
  • What was a column before now becomes an HDF5 dataset within the container file. Don't use HDF5 groups.
  • We no longer enforce equal length for all columns. Also, stop calling them columns; use file, destination, etc. depending on context.
  • Steps should take the filename for their output in the constructor.
  • Strings should always be Unicode. This is not enforced but is a convention for dataset parsers.
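
A minimal sketch of that layout with h5py; the container path and the dataset names tokens and labels are just illustrative, not part of the sets API:

import h5py
import numpy as np

# One HDF5 container file, one top-level dataset per former column,
# no groups. Lengths may differ between datasets.
with h5py.File('/tmp/example.hdf5', 'w') as container:
    container.create_dataset('tokens', data=np.arange(100))
    container.create_dataset('labels', data=np.arange(10))

with h5py.File('/tmp/example.hdf5', 'r') as container:
    # Slicing reads only the requested part from disk.
    print(container['tokens'][:5])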

Next steps:

  • Get familiar with HDF5 and h5py, especially variable-length text data (see the sketch below).
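
A first sketch of variable-length Unicode text with h5py; h5py.string_dtype is the current API, while older h5py versions spell it h5py.special_dtype(vlen=str):

import h5py

# Variable-length UTF-8 strings; on old h5py versions use
# h5py.special_dtype(vlen=str) instead of string_dtype.
dt = h5py.string_dtype(encoding='utf-8')

with h5py.File('/tmp/articles.hdf5', 'w') as container:
    # Resizable dataset so articles can be appended while streaming.
    articles = container.create_dataset(
        'articles', shape=(0,), maxshape=(None,), dtype=dt)
    for text in ['First article.', 'Second, much longer article.']:
        articles.resize(len(articles) + 1, axis=0)
        articles[-1] = text

with h5py.File('/tmp/articles.hdf5', 'r') as container:
    # .asstr() (h5py 3.x) decodes the stored bytes back to str.
    print(container['articles'].asstr()[0])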

danijar · May 16 '16 16:05

Draft of a usage example:

Wikipedia()('/dataset/wikipedia.hdf5/articles')

Tokenize()(
    '/dataset/wikipedia.hdf5/articles',
    '/dataset/target.hdf5/tokens', overwrite=True)

for batch in Batcher(100)('/dataset/target.hdf5/tokens'):
    pass
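
A hypothetical sketch of how Batcher could stream batches out of the container without loading everything into memory; the path splitting and the class internals are assumptions, not the actual sets implementation:

import h5py

class Batcher:

    def __init__(self, size):
        self._size = size

    def __call__(self, path):
        # Split '/dataset/target.hdf5/tokens' into container and dataset.
        container_path, name = path.rsplit('/', 1)
        with h5py.File(container_path, 'r') as container:
            data = container[name]
            for start in range(0, len(data), self._size):
                # Each slice reads only one batch from disk.
                yield data[start:start + self._size]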

danijar · Jun 17 '16 13:06

We now have a preliminary implementation in https://github.com/danijar/sets/pull/16. Things to improve:

General:

  • Update all other steps.

Embedding:

  • Rename to Encode.
  • Give _get_shape() a better name (there is already shape, which does a different thing).
  • Give batch the depth option.
  • Use np.ndenumerate in apply() (see the sketch below).
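
A minimal sketch of apply() built on np.ndenumerate; the signature and the element-wise mapping are assumptions about the sets code, not taken from it:

import numpy as np

# Hypothetical apply(): visit every element together with its index,
# which works for arrays of any dimensionality.
def apply(function, array):
    result = np.empty(array.shape, dtype=object)
    for index, value in np.ndenumerate(array):
        result[index] = function(value)
    return result

print(apply(str.upper, np.array([['a', 'b'], ['c', 'd']])))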

MapStep:

  • Fix all docstrings (formatting and language).
  • Is there a way to actually operate on batches in HDF5? (See the sketch below.)
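
Slicing may already cover this: reading and assigning a slice of an h5py dataset should touch only that part on disk. A minimal sketch of a batched read-modify-write pass; the batch size and the doubling function are just for illustration:

import h5py
import numpy as np

with h5py.File('/tmp/numbers.hdf5', 'w') as container:
    container.create_dataset('values', data=np.arange(1000))

size = 100
with h5py.File('/tmp/numbers.hdf5', 'r+') as container:
    values = container['values']
    for start in range(0, len(values), size):
        # Each slice reads and writes only one batch.
        values[start:start + size] = values[start:start + size] * 2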

Step:

  • Remove unnecessary urllib import.
  • Fix docstring formatting.

SemEval:

  • Sort imports.
  • Parameters of _parse_train() should be container and dataset.
  • Fix all docstrings.

IndexEncode:

  • Why not inherit from Embedding?

Tokenize:

  • Better naming (for example, the ds variable).

Tests:

  • Review tests.
  • Rename QuadraticStep to SquareStep.
  • Fix blank lines in classes.

danijar · Jul 05 '16 21:07