datasets
Add Quickdraw Sketch RNN Dataset
Add the Quickdraw Dataset used to train Sketch RNN.
Closes one of the TODOs in #337.
Caveats:
- I took the liberty of creating a sequence directory to hold this data
- I added the test but it is still red
- Manually testing the dataset from TF worked as intended
Thanks for the dataset.
You need to add fake_examples.
I am also changing the number of shards per split from 1-1-1 to 20-5-5; I will update the gist accordingly.
@us How should fake examples work? Are they simply placeholders or should I add 3 complete .npz files?
You can add .npz files to default. Think of that directory as your extracted dir. You can also give the _split_generators outputs
details
Updated the dataset_info.json with the correct BibTeX citation.
One thing I noticed during testing: if we want to be able to reproduce the technique of the original paper, we should probably invert the notation for the end of the drawing. In the original paper, the authors pad each sketch with (0,0,0,0,1). If we want to leverage the tf.data.Dataset pipeline, it could be more useful to use (0,0,0,0,0) as the end stroke; doing this allows using tf.data.Dataset.padded_batch and potentially tf.data.experimental.bucket_by_sequence_length.
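To make the two conventions concrete, here is a minimal, framework-free sketch in plain Python. It assumes the stroke-5 row layout (dx, dy, p1, p2, p3) from the paper; `pad_sketch` and the sample values are hypothetical and are not code from this PR. Zero padding is what padded_batch produces by default for numeric tensors, which is why an all-zero end stroke fits the tf.data pipeline more naturally.

```python
# Hypothetical helper for illustration only; not part of this PR.
PAPER_PAD = (0, 0, 0, 0, 1)  # end-of-sketch row used for padding in the paper
ZERO_PAD = (0, 0, 0, 0, 0)   # all-zero row, identical to default zero padding


def pad_sketch(strokes, max_len, pad_row):
    """Return a copy of `strokes` right-padded with `pad_row` up to `max_len` rows."""
    if len(strokes) > max_len:
        raise ValueError("sketch is longer than max_len")
    padding = [tuple(pad_row)] * (max_len - len(strokes))
    return [tuple(s) for s in strokes] + padding


# Made-up two-point sketch in stroke-5 format.
sketch = [(3, 1, 1, 0, 0), (-2, 4, 0, 1, 0)]

paper_style = pad_sketch(sketch, 4, PAPER_PAD)
zero_style = pad_sketch(sketch, 4, ZERO_PAD)
```

With the zero convention the padded rows carry no information of their own, so a model or loss has to mask them out by sequence length rather than rely on the p3 flag.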
Contrary to what I said above, I have moved the padding step into the example generation; the end user will thus only need tf.data.Dataset.filter() to select the labels they want.
I have modified the preprocessing step, adding the stroke that signals the start of a sketch, as they do here.
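For reference, a minimal sketch of that step in plain Python. It assumes the start token is (0, 0, 1, 0, 0), the S_0 row described in the Sketch RNN paper; `add_start_stroke` is a hypothetical name and not the function used in this PR.

```python
# Assumed start-of-sketch token in stroke-5 format (dx, dy, p1, p2, p3).
START_STROKE = (0, 0, 1, 0, 0)


def add_start_stroke(strokes):
    """Return a new list with the start token prepended; the input is not modified."""
    return [START_STROKE] + [tuple(s) for s in strokes]


# Made-up two-point sketch in stroke-5 format.
sketch = [(3, 1, 1, 0, 0), (-2, 4, 0, 1, 0)]
with_start = add_start_stroke(sketch)
```

The sequence grows by one row, so any maximum length used for padding has to account for the extra token.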
EDIT: :thinking: apparently the py2-tf2 test is failing. Also updated the dataset info since I have increased the training set shards from 20 to 30.
Fixed an error in the padding function. The TF 2 Py2.7 Test will fail due to a known Keras issue. https://stackoverflow.com/a/55903975/8050556
@mr-ubik are you still continuing with this?
I had stopped because of the NumPy and Keras issue I referenced earlier. In the meantime I have been working on making sure the data format and pre-processing accurately reflect what was done in the paper.
Did you open an issue that tags this PR?
The issue should be fixed now. I will look at the code next week and start pushing new updates again. :heart:
Okay! It'll be awesome :)
@mr-ubik hey, don't forget!
@us Just sorting through issues at work, contributions will resume ASAP
Hi there! This PR looks great, is it still active?
Hi @ageron! I actually had to stop working on it due to other priorities, but I'd like to resume it. It should probably be updated with the new API of tfds, if I am not mistaken.
There's also a discussion to be had on whether to embed the preprocessing done in Sketch-RNN into the tfds pipeline or leave it up to the user.
In an internal fork of tfds that we are working on at @zurutech/ml, we are going to see whether we can implement this kind of behavior via several BUILDER_CONFIGS entries; if this works out, it could be used for this dataset as well.
Is this PR active?