
Add Quickdraw Sketch RNN Dataset

Open mr-ubik opened this issue 5 years ago • 19 comments

dataset_info.json

Add the Quickdraw Dataset used to train Sketch RNN.

Closes one of the TODOs of #337.

Caveats:

  • I took the liberty of creating a sequence directory to hold this data
  • I added the test, but it is still red (failing)
  • Manually testing the dataset from TF worked as intended

mr-ubik avatar Mar 27 '19 09:03 mr-ubik

Thanks for the dataset.

You need to add fake_examples.

us avatar Mar 27 '19 10:03 us

I am also changing the number of shards per split from 1-1-1 to 20-5-5; I will update the gist accordingly.

mr-ubik avatar Mar 27 '19 10:03 mr-ubik

@us How should fake examples work? Are they simply placeholders, or should I add 3 complete .npz files?

mr-ubik avatar Mar 27 '19 10:03 mr-ubik

You can add .npz files under default; think of that as your extracted dir. You can also give details of the _split_generators outputs.
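
For illustration, a minimal sketch of how such a fake .npz file could be generated (the directory layout, class name, and stroke values below are made up, not the actual test fixtures):

```python
import os

import numpy as np

# A tiny sketch in stroke-3 format (dx, dy, pen_lifted), as stored in the
# real Quickdraw Sketch-RNN .npz archives; the values here are invented.
fake_sketch = np.array([[10, 12, 0], [-5, 3, 0], [0, 0, 1]], dtype=np.int16)

def ragged(sketches):
    """Pack variable-length sketches into a 1-D object array, like the
    real archives do."""
    out = np.empty(len(sketches), dtype=object)
    for i, s in enumerate(sketches):
        out[i] = s
    return out

# One tiny file per class, with train/valid/test keys like the real data.
os.makedirs("fake_examples/quickdraw_sketch_rnn", exist_ok=True)
np.savez_compressed(
    "fake_examples/quickdraw_sketch_rnn/cat.npz",
    train=ragged([fake_sketch] * 3),
    valid=ragged([fake_sketch]),
    test=ragged([fake_sketch]),
)
```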

us avatar Mar 27 '19 11:03 us

Updated the dataset_info.json with the correct BibTeX citation.

mr-ubik avatar Mar 28 '19 09:03 mr-ubik

One thing I have seen during testing: if we want to be able to reproduce the technique of the original paper, we should probably invert the notation for the end of the drawing. In the original paper, the authors pad each sketch with (0, 0, 0, 0, 1). If we want to leverage the power of the tf.data.Dataset pipeline, it could be more useful to use (0, 0, 0, 0, 0) as the end stroke, since this allows using tf.data.Dataset.padded_batch and potentially tf.data.experimental.bucket_by_sequence_length.
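
A minimal sketch of why the all-zero end stroke helps (the two example sketches and their values are made up; stroke-5 format (dx, dy, p1, p2, p3) is assumed):

```python
import tensorflow as tf

# Two hypothetical sketches of different lengths in stroke-5 format.
sketches = [
    tf.constant([[10., 12., 1., 0., 0.],
                 [-5.,  3., 0., 1., 0.]]),
    tf.constant([[ 2.,  2., 1., 0., 0.]]),
]

ds = tf.data.Dataset.from_generator(
    lambda: iter(sketches),
    output_types=tf.float32,
    output_shapes=tf.TensorShape([None, 5]),
)

# padded_batch pads with zeros by default, so with a (0, 0, 0, 0, 0) end
# stroke the padding itself doubles as the end-of-sketch marker.
for batch in ds.padded_batch(2, padded_shapes=[None, 5]):
    print(batch.numpy())  # the shorter sketch ends in all-zero rows
```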

mr-ubik avatar Apr 02 '19 14:04 mr-ubik

> One thing I have seen during testing: if we want to be able to reproduce the technique of the original paper, we should probably invert the notation for the end of the drawing. In the original paper, the authors pad each sketch with (0, 0, 0, 0, 1). If we want to leverage the power of the tf.data.Dataset pipeline, it could be more useful to use (0, 0, 0, 0, 0) as the end stroke, since this allows using tf.data.Dataset.padded_batch and potentially tf.data.experimental.bucket_by_sequence_length.

Contrary to what I said above, I have moved the padding step into example generation; the end user will thus just need tf.data.Dataset.filter() to filter for the various labels.
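
For illustration, user-side filtering might look like this (the dataset name and feature keys are assumptions about the final builder, not its confirmed API):

```python
import tensorflow as tf
import tensorflow_datasets as tfds

# Hypothetical: load the builder and keep only one class via filter().
ds = tfds.load("quickdraw_sketch_rnn", split="train")
CAT = 5  # hypothetical integer id for the "cat" label
cats = ds.filter(lambda ex: tf.equal(ex["label"], CAT))
```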

mr-ubik avatar Apr 09 '19 14:04 mr-ubik

I have modified the preprocessing step, adding the stroke signaling the start of a sketch, as they do here.

EDIT: :thinking: apparently the py2-tf2 test is failing. I also updated the dataset info, since I increased the training-set shards from 20 to 30.
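
For reference, the Sketch-RNN paper uses S0 = (0, 0, 1, 0, 0) as the start-of-sequence stroke; a minimal sketch of prepending it (the function name is illustrative, not the code in this PR):

```python
import numpy as np

def add_start_stroke(sketch):
    """Prepend the Sketch-RNN start token S0 = (0, 0, 1, 0, 0) to an
    [n, 5] stroke-5 array."""
    s0 = np.array([[0, 0, 1, 0, 0]], dtype=sketch.dtype)
    return np.concatenate([s0, sketch], axis=0)

# Example: a one-stroke sketch grows to two rows, starting with S0.
print(add_start_stroke(np.array([[10, 12, 1, 0, 0]])))
```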

mr-ubik avatar Apr 18 '19 10:04 mr-ubik

Fixed an error in the padding function. The TF2/Py2.7 test will fail due to a known Keras issue: https://stackoverflow.com/a/55903975/8050556

mr-ubik avatar May 07 '19 09:05 mr-ubik

@mr-ubik are you still working on this?

us avatar Jul 17 '19 14:07 us

I had stopped due to the NumPy/Keras issue I referenced earlier; in the meantime, I have been working on making sure the data format and pre-processing accurately reflect what is done in the paper.

mr-ubik avatar Jul 21 '19 10:07 mr-ubik

Did you open an issue tagging this PR?

us avatar Jul 21 '19 19:07 us

The issue should be fixed now. I will look at the code next week and start pushing new updates again. :heart:

mr-ubik avatar Jul 24 '19 11:07 mr-ubik

Okay! It'll be awesome :)

us avatar Jul 24 '19 11:07 us

@mr-ubik hey, don't forget!

us avatar Jul 29 '19 22:07 us

@us Just sorting through issues at work; contributions will resume ASAP.

mr-ubik avatar Jul 31 '19 10:07 mr-ubik

Hi there! This PR looks great, is it still active?

ageron avatar Mar 21 '20 06:03 ageron

Hi @ageron! I actually had to stop working on it due to other priorities, but I'd like to resume it. It should probably be updated to the new tfds API, if I am not mistaken. There's also a discussion to be had on whether to embed the preprocessing done in Sketch-RNN into the tfds pipeline or leave it up to the user.

In an internal fork of tfds that we are working on at @zurutech/ml, we are going to try to implement this kind of behavior via several BUILDER_CONFIGs; if this works out, it could be used for this dataset as well.
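
A rough outline of the BUILDER_CONFIG idea (class names, config names, and the preprocessing flag are all hypothetical; the usual _info/_split_generators/_generate_examples methods are omitted from this sketch):

```python
import tensorflow_datasets as tfds

class QuickdrawConfig(tfds.core.BuilderConfig):
    """One config per preprocessing variant (hypothetical)."""

    def __init__(self, *, sketch_rnn_preprocessing, **kwargs):
        super().__init__(**kwargs)
        self.sketch_rnn_preprocessing = sketch_rnn_preprocessing

class QuickdrawSketchRnn(tfds.core.GeneratorBasedBuilder):
    """Skeleton builder; users would pick a variant at load time."""

    BUILDER_CONFIGS = [
        QuickdrawConfig(
            name="raw",
            version=tfds.core.Version("1.0.0"),
            description="Raw stroke sequences.",
            sketch_rnn_preprocessing=False,
        ),
        QuickdrawConfig(
            name="sketch_rnn",
            version=tfds.core.Version("1.0.0"),
            description="Padded, Sketch-RNN-style preprocessed sequences.",
            sketch_rnn_preprocessing=True,
        ),
    ]
```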

mr-ubik avatar Jul 17 '20 08:07 mr-ubik

Is this PR still active?

osbm avatar Jul 16 '22 22:07 osbm