fuel
H5PYDataset returns ndarray instead of list
With variable-length data, get_data returns an object-dtype ndarray whose elements are the individual data arrays; for images, that is an ndarray containing the image arrays. Other components do not handle this format. RandomFixedSizeCrop, for example, requires a list of images or a 4D array:
    if isinstance(source, list) and all(isinstance(b, numpy.ndarray) and
                                        b.ndim == 3 for b in source):
        ....
    raise ValueError("uninterpretable batch format; expected a list "
                     "of arrays with ndim = 3, or an array with "
                     "ndim = 4")
The ServerDataStream also gave me problems. I think we should make this consistent, probably by returning lists instead of ndarrays from H5PYDataset.get_data in this case.
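To make the mismatch concrete, here is a minimal sketch that does not use Fuel itself. It builds an object array by hand as a stand-in for what get_data is described as returning, and is_list_of_images is a hypothetical name wrapping the list check quoted above.

```python
import numpy

# Three "images" of different sizes, laid out as (channels, height, width).
images = [numpy.zeros((3, h, w)) for h, w in [(4, 5), (6, 7), (8, 9)]]

# Hand-built stand-in for the batch described in this issue: a 1-D
# object-dtype array whose elements are the individual image arrays.
batch = numpy.empty(len(images), dtype=object)
for i, image in enumerate(images):
    batch[i] = image


def is_list_of_images(source):
    """Hypothetical wrapper around the list check quoted above."""
    return isinstance(source, list) and all(
        isinstance(b, numpy.ndarray) and b.ndim == 3 for b in source)


# The object array is not a list, so the check rejects it, even though
# every element is a valid 3-D image.
rejected = not is_list_of_images(batch)

# Converting to a plain list is enough to satisfy the check.
as_list = list(batch)
accepted = is_list_of_images(as_list)
```

The object array also has ndim 1, so it would not match a 4D-array check either, which is why it falls through to the ValueError.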
@vdumoulin What's your thought on this? I'm running into this as well.
In general, I think transformers should be agnostic as to which one of the two they're getting (like e.g. MinimumImageDimensions), but there's still the question as to whether we should prefer one format over the other. H5PYDataset returns NumPy objects because fancy indexing was used I guess, and that's a big advantage of dealing with NumPy arrays. On the other hand, transformers can have simpler code if they can just use return [f(x) for x in batch], although I guess we could write a helper function map_likewise that applies a function over a list or NumPy array, and returns the same object it got as an input.
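A rough sketch of what such a helper might look like. Only the name map_likewise comes from this thread; the signature and the handling of non-object arrays are assumptions, and nothing here is existing Fuel code.

```python
import numpy


def map_likewise(function, batch):
    """Apply `function` to every example and return the same kind of
    container that was passed in (hypothetical helper, not part of Fuel).
    """
    results = [function(example) for example in batch]
    if not isinstance(batch, numpy.ndarray):
        return results
    if batch.dtype == object:
        # Fill the object array element by element so that NumPy does not
        # try to stack equally shaped results into one regular array.
        output = numpy.empty(len(results), dtype=object)
        for i, result in enumerate(results):
            output[i] = result
        return output
    # Assumption: for a regular array, all results share one shape.
    return numpy.asarray(results)
```

With this, a transformer could write return map_likewise(f, batch) instead of the list comprehension and get a list back for list input and an ndarray back for ndarray input.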
I think having something like map_likewise would be a good idea. Maybe we could even make it a decorator so that it's easy to enable the behaviour by default?
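One way the decorator form could look, as a sketch only. The name maps_likewise and the example function flip_horizontally are made up for illustration, and the sketch only covers lists and 1-D object arrays.

```python
import functools

import numpy


def maps_likewise(example_function):
    """Turn a per-example function into a per-batch function that returns
    the same container type it received (hypothetical, not part of Fuel).
    """
    @functools.wraps(example_function)
    def batch_function(batch):
        results = [example_function(example) for example in batch]
        if isinstance(batch, numpy.ndarray):
            # Rebuild an object array so ndarray input gives ndarray output.
            output = numpy.empty(len(results), dtype=object)
            for i, result in enumerate(results):
                output[i] = result
            return output
        return results
    return batch_function


@maps_likewise
def flip_horizontally(image):
    # Reverse the last (width) axis of a (channels, height, width) image.
    return image[..., ::-1]
```

The decorated function is written for a single example, and the container handling lives in one place rather than in every transformer.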
We should also have something like ToNumpy or ToList transformers, should this behaviour become the norm, so that people can explicitly force changing type.
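The conversions themselves are small. The sketch below shows them as plain functions rather than real transformers; to_list and to_object_array are hypothetical names, and wiring them into Fuel's transformer classes is left out.

```python
import numpy


def to_list(batch):
    """Return the examples of `batch` as a plain Python list."""
    # Iterating over an ndarray yields its first-axis elements, so this
    # works for both 1-D object arrays and regular 4-D arrays.
    return list(batch)


def to_object_array(batch):
    """Return the examples of `batch` as a 1-D object-dtype ndarray."""
    output = numpy.empty(len(batch), dtype=object)
    for i, example in enumerate(batch):
        # Element-wise assignment keeps each example as its own array,
        # even when all examples happen to share one shape.
        output[i] = example
    return output
```

Filling the object array element by element is deliberate: passing a list of equally shaped arrays straight to numpy.array would stack them into one regular array instead.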