Add embeddings to datasets?

Open cleong110 opened this issue 1 year ago • 1 comments

One thing I have done a number of times, manually:

  1. Download a video dataset such as ASL Citizen, usually directly from the source rather than with this library, so that I have the .mp4 files.
  2. Run pose estimation on all of them: foo1.mp4, foo2.mp4, etc.
  3. Put the poses through SignCLIP, saving the embeddings as foo1-embedded-using-asl-citizen-model.npy, foo1-embedded-using-sem-lex-model.npy, etc.
  4. Back up those files somewhere.
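
To make the naming scheme in step 3 concrete, here is a minimal sketch of helpers that build and parse those file names. The pattern is inferred only from the two example names above, and `embedding_filename`, `parse_embedding_filename`, and `EmbeddingFile` are hypothetical names, not part of sign-language-datasets or SignCLIP.

```python
import re
from typing import NamedTuple, Optional

# Assumed convention, inferred from the example names in step 3:
#   <video_id>-embedded-using-<model>-model.npy
_EMBEDDING_NAME = re.compile(
    r"^(?P<video_id>.+?)-embedded-using-(?P<model>.+)-model\.npy$"
)


class EmbeddingFile(NamedTuple):
    """Hypothetical record describing one saved embedding file."""
    video_id: str
    model: str


def embedding_filename(video_id: str, model: str) -> str:
    """Build the file name for one video embedded with one model."""
    return f"{video_id}-embedded-using-{model}-model.npy"


def parse_embedding_filename(name: str) -> Optional[EmbeddingFile]:
    """Recover (video_id, model) from a file name, or None if it does not match."""
    match = _EMBEDDING_NAME.match(name)
    if match is None:
        return None
    return EmbeddingFile(match["video_id"], match["model"])
```

Encoding the model name in the file name is what lets several embeddings of the same video sit side by side in one folder, though a sidecar metadata file would work just as well.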

It would be nice to have a consistent, documented way to bring all this into the sign-language-datasets ecosystem. Is there a standard way to save the embeddings, load them back in, and so on?

Perhaps something like...

import tensorflow_datasets as tfds

# Current usage, without embeddings
ds = tfds.load("asl-citizen")

# Proposed: if they're hosted somewhere and the dataloader knows it
ds_with_embeddings = tfds.load("asl-citizen", embeddings="signclip_asl_citizen")

# Proposed: if they're hosted locally
ds_with_embeddings = tfds.load("asl-citizen", embeddings="/path/to/folder/with/embeddings")
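
As a rough illustration of what the local-folder case might do under the hood, the sketch below indexes a folder of .npy files for one model and attaches each embedding to the matching example. Everything here is an assumption: `index_embeddings` and `attach_embeddings` are hypothetical helpers, the file-name convention is the one from the steps above, and examples are assumed to be dicts with a string `"id"` key equal to the video name.

```python
from pathlib import Path
from typing import Any, Callable, Dict, Iterable, Iterator

# Assumed naming convention: <video_id>-embedded-using-<model>-model.npy
SUFFIX_TEMPLATE = "-embedded-using-{model}-model.npy"


def _load_npy(path: Path) -> Any:
    # NumPy is imported lazily so that indexing alone works without it.
    import numpy as np
    return np.load(path)


def index_embeddings(folder: str, model: str) -> Dict[str, Path]:
    """Map video_id -> embedding file path for one model's embeddings."""
    suffix = SUFFIX_TEMPLATE.format(model=model)
    return {
        path.name[: -len(suffix)]: path
        for path in sorted(Path(folder).iterdir())
        if path.name.endswith(suffix) and len(path.name) > len(suffix)
    }


def attach_embeddings(
    examples: Iterable[Dict[str, Any]],
    index: Dict[str, Path],
    load: Callable[[Path], Any] = _load_npy,
) -> Iterator[Dict[str, Any]]:
    """Yield a copy of each example with an "embedding" key added.

    Examples without a matching file get embedding=None.
    """
    for example in examples:
        path = index.get(example["id"])
        embedding = load(path) if path is not None else None
        yield {**example, "embedding": embedding}
```

With real data this could be called as `attach_embeddings(tfds.as_numpy(ds), index_embeddings(folder, "asl-citizen"))`, with the caveat that ids coming out of `tfds.as_numpy` are typically bytes and would need decoding before the lookup.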

See also: https://www.tensorflow.org/datasets/catalog/sift1m, which is a TFDS dataset that ships pretrained embeddings

See also also: https://www.tensorflow.org/datasets/catalog/laion400m

cleong110, Dec 12 '24 15:12

I think this could be useful!

AmitMY, Dec 13 '24 19:12