
Make IMDB Reviews dataset consistent

Open josevalim opened this issue 3 years ago • 2 comments

Currently the IMDB Reviews download functions return a map and still support transforms. Is it possible to normalize them so they return a result similar to the other datasets? For example, return a tuple of {{input_binary, input_type, input_shape}, {label_binary, label_type, label_shape}}?

josevalim avatar Jan 05 '22 16:01 josevalim

Thanks for flagging! I'll open a PR to get rid of the transforms.

As for normalizing, we could truncate or pad the binaries to make them uniform in shape (as @seanmor5 suggested). That way we can provide an input_shape that describes the data exactly. Otherwise, we could make the input_shape something like {25000, nil} to indicate that the length of each binary varies. Or do you have another suggestion?
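For illustration, the pad-or-truncate option could look roughly like the sketch below. The module and function names (PadSketch, pad_or_truncate/2, to_input/2) are hypothetical and not part of Scidata, the zero-byte padding and the fixed length are arbitrary choices, and the {:u, 8} type is only assumed to match the {binary, type, shape} convention of the other datasets.

```elixir
defmodule PadSketch do
  # Hypothetical helper, not part of Scidata.
  # Pads a review with zero bytes, or truncates it, to exactly `len` bytes.
  def pad_or_truncate(review, len) when is_binary(review) and len >= 0 do
    case byte_size(review) do
      size when size >= len -> binary_part(review, 0, len)
      size -> review <> :binary.copy(<<0>>, len - size)
    end
  end

  # Concatenates the fixed-length reviews into one binary and returns
  # a {binary, type, shape} triple. The {:u, 8} type is an assumption.
  def to_input(reviews, len) when is_list(reviews) do
    binary =
      reviews
      |> Enum.map(&pad_or_truncate(&1, len))
      |> IO.iodata_to_binary()

    {binary, {:u, 8}, {length(reviews), len}}
  end
end
```

With this sketch, PadSketch.to_input(["good", "terrible film"], 8) yields a 16-byte binary and the shape {2, 8}. Truncation here is by byte, so it can split a multi-byte UTF-8 character; a real implementation would more likely tokenize first and pad the token sequences.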

We may want to return the map that IMDB.download/1 currently returns in a new IMDB.to_columns/1 function for use with Explorer, similar to Squad.to_columns/1.

t-rutten avatar Jan 16 '22 23:01 t-rutten

I see… Hrm. It would be nice to see how this dataset would be used with Axon or Explorer then before making a decision.

josevalim avatar Jan 17 '22 07:01 josevalim