fuel icon indicating copy to clipboard operation
fuel copied to clipboard

Move the methods that work on the state of a Dataset in the stateful object itself

Open fvisin opened this issue 8 years ago • 6 comments

It seems very weird to me that a Dataset object returns a state through Dataset.open and then manipolates the state itself with, e.g., Dataset.get_data(state).

It seems much better to have a static stateless Dataset providing only the properties of the dataset and being a Factory for stateful objects, e.g., State, that are self contained and have all the methods required to interact with their own state.

This would allow to do something like:

dataset = IterableDataset(
...     iterables=OrderedDict([('features', features), ('targets', targets)]),
...     axis_labels=OrderedDict([('features', ('batch', 'height', 'width')),
...                              ('targets', ('batch', 'index'))]))
print('Sources are {}.'.format(dataset.sources))
state = dataset.open()
print(state.get_data())
state.reset()
print(state.get_data())
dataset.close(state=state)

instead of this:

dataset = IterableDataset(
...     iterables=OrderedDict([('features', features), ('targets', targets)]),
...     axis_labels=OrderedDict([('features', ('batch', 'height', 'width')),
...                              ('targets', ('batch', 'index'))]))
print('Sources are {}.'.format(dataset.sources))
state = dataset.open()
print(dataset.get_data(state=state))
state = dataset.reset(state)
print(dataset.get_data(state=state))
dataset.close(state=state)

Furthermore, the State object I am proposing and the DataStreams objects could be merged into one single class.

fvisin avatar Sep 17 '15 18:09 fvisin

@bartvm and @janchorowski should have a look at this. I think there are probably good reasons to structure it as it is but I don't know what they are off-hand.

I'd also be -0.5 on major API reshuffles that break people's code rather significantly (more than providing extra arguments that used to be optional, simple renames, or things of that nature that can be coped with relatively straightforwardly) without a really compelling reason. Preferably it would do at least two of a) making things easier to work with, b) significantly reduce lines of code necessary/complexity of implementation, c) enable new functionality that was not permitted by the old interface. And if we do do this, we should probably freeze a stable version before proceeding with it.

dwf avatar Sep 17 '15 19:09 dwf

From the top of my head: state can be a file object, e.g. in TextFile it is. It doesn't have get_data. Also for IterableDataset the state is an arbitrary iterable.

That being said, I find it convenient to have zero assumptions about state interface, and make the Dataset responsible for accessing it.

rizar avatar Sep 17 '15 19:09 rizar

@dwf I share your view on unnecessary major refactors. For what concerns a), IMHO the code this would make the code easier to read (and to write), but I agree that at least b) or c) are also needed to justify a major modification of the API. I cannot tell the effect of this modification on these last 2 points though, probably @bartvm or @janchorowski can answer on that.

@rizar I don't see any difference in that sense. I might be missing your point. Whatever is now in Dataset a part from the properties of the dataset would be moved to the State (or if you want DataStream) class. If TextFile doesn't have get_data, the same would be for the new object.

fvisin avatar Sep 17 '15 19:09 fvisin

For the record, we have had a discussion with @fvisin in which he changed his proposal, though still did not convince me.

rizar avatar Sep 17 '15 20:09 rizar

I spoke with @rizar about my proposal and apparently I wasn't very clear about it so I will try to rephrase it. He also pointed out that Fuel proved to be able to handle well different kind of data and needs, so an internal mechanism like this (which is transparent to the user) should not be questioned too much. He might be right. As I said before though, if the new interface does not give a significant advantage on either b) or c) I am more than fine to drop it. If it does give an advantage, I am happy it helped.

The modification to the interface I am proposing is to make IndexableDataset , IterableDataset, etc DataStreams. The real-world dataset implementations (e.g., fuel.datasets.MNIST, fuel.datasets.CIFAR10, etc, ..) would subclass one of this Datastreams. Their constructor would give back a stateful object with all the methods to access and operate on its own state.

An usage example would be:

from fuel.datasets import MNIST
from fuel.schemes import SequentialScheme

train_features = MNIST(sources='features', split='train', 
                       iterationScheme=SequentialScheme(examples=8, batch_size=4))
valid_features = MNIST(sources='features', split='valid', 
                       iterationScheme=SequentialScheme(examples=8, batch_size=4))

print(train_features.get_data())
print(valid_features.get_data())
train_features.reset()
valid_features.reset()

@rizar pointed out that in this way when the dataset can be loaded in memory I would end up loading the data multiple times, one for each stateful object.

To avoid loading the data in memory twice, it would suffice to keep a reference to the data in a class-level variable in MNIST (and similarly in every other DataStream class) operating on it in a singleton fashion, i.e.:

class MNIST(H5PYDataStream):
    filename = 'mnist.hdf5'

    def __init__(self, which_sets, **kwargs):
        kwargs.setdefault('load_in_memory', True)
        super(MNIST, self).__init__(**kwargs)

class H5PYDataStream(DataStream):
data = None
    # I omit the constructor, because it's not relevant

    def open(self):
        # `type` is needed to access the variable of the inheriting class
        if type(self).data is None:
            type(self).data = load_data()
        return type(self).data

Finally, @dmitriy-serdyuk pointed out that this Singleton behavior might confuse users because the first loading might take a long time while the next ones would be very fast. I don't think this is a issue and IMHO it would suffice to document it properly. Nonetheless, it is possible to solve this problem by having a Dataset class with the reference to the data to be instantiated once and for all (as in current implementation) and passing its instance to the DataStream constructor so that the DataStream object can access the data as in my proposal.

fvisin avatar Sep 17 '15 23:09 fvisin

I'm not convinced yet by any solution, but I share @fvisin's concern with the current status of the state object. My main concern is that it is an object without semantics except when combined with the specific Dataset instance that generated it. In fact, you never need the state object by itself, you always need a (dataset, state) pair, where state was created by state = dataset.open(). This is what you need to call reset(), next_epoch(), close(), or get_data(). It is cumbersome for the user, because they have to keep track of two objects instead of one. It is also more prone to errors, since they could mix up Dataset instances and states, which may work for a while (there is no check) and then get broken, or appear to work but give incorrect results. I think that if a state cannot be used or interpreted without its Dataset creator, then it should have a pointer back to that object, for instance. I understand that, most of the time, regular users will not deal with those abstractions, since they are already encapsulated in DataStream. However, there are places, in particular in tests, and in the online documentation, that explicitly call dataset.open() and use the returned "handle".

A naïve solution may be to have a new class, basically a glorified (dataset, state) pair. This would be the return type of Dataset.open(), could be passed to reset(), next_epoch(), etc. (in order not to break the compatibility), but would also have the corresponding methods. For instance:

def reset(self):
    self.dataset.reset(state=self.state)
    # or self.dataset.reset(state=self),
    # depending on what we want Dataset.reset to accept

def close(self):
    self.dataset.close(state=self.state)

def next_epoch(self):
    self.dataset.next_epoch(state=self.state)

def get_data(self, request=None):
    self.dataset.get_data(state=self.state, request=request)

Pros:

  • clearer semantics for a user
  • less risk of mixing up states and their datasets

Cons:

  • one additional object type

We would have to find a name (State? Handle?) distinguishing it from what is currently called state or handle

lamblin avatar Sep 29 '15 20:09 lamblin