sets
Base datasets on streaming
Goals:
- Read, write and modify datasets that do not fit into main memory.
- Streams should be first-class citizens. Base all datasets on streaming to have a single interface.
Possible solutions:
- Write our own data format based on NumPy blocks living in a tar container. Too much work.
- Dask (Documentation, Example): comes with a distributed compute graph and scheduler. Supports NumPy and Pandas interfaces, which is nice. Still overkill for us.
- HDF5 (Documentation, Example): looks good for our use case and is well supported. This also makes sets more widely applicable, since results can be used from any language or from cluster computing systems like Spark.
We will go with HDF5:
- Use h5py library.
- h5py file handles will replace our `Dataset` class completely.
- What was a column before now becomes a file within the HDF5 container. Don't use HDF5 groups.
- We no longer enforce equal length for all columns. Also, don't call them columns anymore but file, destination, etc., depending on context.
- Steps should ask for the filename for their output in the constructor.
- Strings should always be Unicode. This is not enforced but rather a convention for dataset parsers.
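A minimal sketch of the agreed layout with h5py: one top-level dataset per former column, no groups, and no equal-length requirement. File and dataset names here are made up for illustration, not taken from the sets codebase:

```python
import h5py
import numpy as np

# One HDF5 dataset per former column, stored at the top level of the
# container (no groups). Lengths are allowed to differ.
with h5py.File('example.hdf5', 'w') as container:
    container.create_dataset('data', data=np.arange(10))
    container.create_dataset('target', data=np.ones(5))  # different length is fine

# Reopening gives us handles that stream from disk instead of loading
# everything into main memory.
with h5py.File('example.hdf5', 'r') as container:
    print(container['data'].shape)    # (10,)
    print(container['target'].shape)  # (5,)
```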
Next steps:
- Get familiar with HDF5 and h5py. Especially variable-length text data.
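As a starting point for the variable-length text question: h5py exposes a special string dtype for variable-length UTF-8 data. This uses the API of recent h5py versions (`string_dtype` and `.asstr()` exist in h5py 3.x); names below are illustrative:

```python
import h5py

# Variable-length UTF-8 strings via h5py's special string dtype.
dt = h5py.string_dtype(encoding='utf-8')
with h5py.File('text.hdf5', 'w') as container:
    tokens = container.create_dataset('tokens', shape=(3,), dtype=dt)
    tokens[:] = ['hello', 'wörld', 'a longer sentence']

with h5py.File('text.hdf5', 'r') as container:
    # h5py 3.x returns bytes by default; .asstr() decodes to str on read.
    print(container['tokens'].asstr()[1])
```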
Draft of usage example:

```python
Wikipedia()('/dataset/wikipedia.hdf5/articles')
Tokenize()(
    '/dataset/wikipedia.hdf5/articles',
    '/dataset/target.hdf5/tokens', overwrite=True)
for batch in Batcher(100)('/dataset/target.hdf5/tokens'):
    pass
```
We now have a preliminary implementation with https://github.com/danijar/sets/pull/16. Things to improve:
General:
- Update all other steps.
Embedding:
- Rename to `Encode`.
- Give `_get_shape()` a better name (there is already `shape` and it does a different thing).
- Bring back the `depth` option.
- Use np.ndenumerate in `apply()`.
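For reference, np.ndenumerate yields an index tuple plus the value for arrays of any rank, which is why it fits `apply()`: one loop handles any shape.

```python
import numpy as np

# np.ndenumerate iterates in C order over every element, giving the
# full index tuple alongside the value.
data = np.array([[1, 2], [3, 4]])
for index, value in np.ndenumerate(data):
    print(index, value)  # (0, 0) 1 ... (1, 1) 4
```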
MapStep:
- Fix all docstrings (formatting and language).
- Is there a way to actually operate on batches in HDF5?
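A partial answer, as far as h5py goes: datasets support NumPy-style slicing that only touches the requested region on disk, so a map step could read and write fixed-size slices instead of single elements. A minimal sketch with illustrative names:

```python
import h5py
import numpy as np

with h5py.File('batches.hdf5', 'w') as container:
    source = container.create_dataset('source', data=np.arange(1000))
    result = container.create_dataset('result', shape=(1000,), dtype='i8')
    size = 100
    for start in range(0, len(source), size):
        chunk = source[start:start + size]       # reads only this slice
        result[start:start + size] = chunk * 2   # writes only this slice
```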
Step:
- Remove unnecessary `urllib` import.
- Fix docstring formatting.
SemEval:
- Sort imports.
- Parameters of `_parse_train()` should be `container` and `dataset`.
- Fix all docstrings.
IndexEncode:
- Why not inherit from Embedding?
Tokenize:
- Better naming (`ds`).
Tests:
- Review tests.
- Rename `QuadraticStep` to `SquareStep`.
- Fix blank lines in classes.