Use huggingface dataset repositories for rollouts, not model repositories
Problem
We recently added a feature to load rollouts from the HuggingFace Hub here. The new function assumes that rollouts are stored in a `rollouts.npz` file in the repository of a model.
Right now I see the following issues with this solution:
- [major] HuggingFace model repositories should be used for models, not for datasets. There are dedicated dataset repositories on the HuggingFace Hub and I think we should use them for this purpose.
- [minor] The function is located in the package `imitation.policies.serialize` while it has nothing to do with serializing policies.
- [minor] The function assumes a `rollout.npz` file is stored with the model. Placing it there is mostly a manual process, which is very brittle. Training and uploading models is at least semi-automated by the zoo.
Related to this is #408 where we half decided to consistently name rollouts "trajectories".
Solution
- Move the rollouts to HuggingFace datasets and change the function to load them from there. I am not sure if `.npz` files are supported right now.
- Move the function to something like `imitation.data.serialize`.
- Automate the generation and upload of the rollouts in a similar way to how the zoo automates training and upload.
Thanks for flagging this! Didn't know about HuggingFace datasets, I agree we should switch to that.
I support automation in the form of "there's a command we can run to do this given a set of policies", but we may want to do quality control on the policies at least, so it probably can't be fully automated end-to-end.
In the long run I would like to make the best use of the existing HuggingFace Dataset infrastructure, but I realized that this would require some major changes to the imitation library. I broke it down into a number of smaller steps, each of which already provides an improvement over the status quo. Scroll down for a diagram illustrating the steps.
Current State
Right now, experts generate lists of `TrajectoryWithRewards` via the `rollout()` function. Those trajectories are fed to the imitation learning algorithms. To store the trajectories, we concatenate them all into one long trajectory and store that, together with the split positions, in a numpy `.npz` file, which is basically a dict of numpy arrays.
This `.npz` file is manually uploaded to a HuggingFace model repository. There are functions to automatically retrieve and load it from there.
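To make the concatenate-and-split scheme concrete, here is a minimal sketch of roughly how such an `.npz` file could be written and read back. The key names (`obs`, `acts`, `rews`, `indices`) and the assumption that all fields have equal per-trajectory length are illustrative, not imitation's exact format:

```python
import numpy as np


def save_rollouts_npz(path, trajectories):
    """Concatenate per-trajectory arrays and remember where to split them again."""
    lengths = [len(traj.acts) for traj in trajectories]
    np.savez_compressed(
        path,
        # One long array per field...
        obs=np.concatenate([traj.obs for traj in trajectories]),
        acts=np.concatenate([traj.acts for traj in trajectories]),
        rews=np.concatenate([traj.rews for traj in trajectories]),
        # ...plus the positions at which to split them back into trajectories.
        indices=np.cumsum(lengths)[:-1],
    )


def load_rollouts_npz(path):
    data = np.load(path, allow_pickle=True)
    return [
        dict(obs=o, acts=a, rews=r)
        for o, a, r in zip(
            np.split(data["obs"], data["indices"]),
            np.split(data["acts"], data["indices"]),
            np.split(data["rews"], data["indices"]),
        )
    ]
```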
Step 1: Using the HuggingFace Dataset Repository with the .npz format
A first step would be to move the `.npz` files to HuggingFace dataset repositories.
HuggingFace datasets are backed by Apache Arrow, which, as far as I understand, does not natively support the `.npz` format. Therefore we need to provide a special dataset loading script in order to be compatible with the HuggingFace `datasets` library.
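For illustration, a loading script along these lines could live in the dataset repository. The class name, field names, dtypes, and the flat-observation assumption are hypothetical; the real schema would depend on the environment:

```python
# Hypothetical loading script (e.g. `rollouts.py` inside the dataset repo) that
# exposes the legacy `.npz` file via `datasets.load_dataset`.
import datasets
import numpy as np


class Rollouts(datasets.GeneratorBasedBuilder):
    def _info(self):
        return datasets.DatasetInfo(
            features=datasets.Features(
                {
                    "obs": datasets.Sequence(datasets.Sequence(datasets.Value("float32"))),
                    "acts": datasets.Sequence(datasets.Value("int64")),
                    "rews": datasets.Sequence(datasets.Value("float32")),
                }
            ),
        )

    def _split_generators(self, dl_manager):
        # The `.npz` file is assumed to sit next to this script in the repo.
        path = dl_manager.download("rollouts.npz")
        return [
            datasets.SplitGenerator(name=datasets.Split.TRAIN, gen_kwargs={"path": path})
        ]

    def _generate_examples(self, path):
        data = np.load(path, allow_pickle=True)
        # Undo the concatenation: split the flat arrays at the stored positions.
        obs = np.split(data["obs"], data["indices"])
        acts = np.split(data["acts"], data["indices"])
        rews = np.split(data["rews"], data["indices"])
        for i, (o, a, r) in enumerate(zip(obs, acts, rews)):
            yield i, {"obs": o.tolist(), "acts": a.tolist(), "rews": r.tolist()}
```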
Advantages:
- We use the HuggingFace Hub as intended
- Datasets are explicitly discoverable
- We can have versions of datasets
- We now have a cache that also works for transformations on the dataset
- Backwards-compatible to the old file format
- All changes are hidden behind the `load_rollouts_from_huggingface()` function. No changes downstream of that.
Step 2a: Switching to a file format supported by Apache Arrow
In the long run I don't want the imitation library to contain details like the file format for storing trajectories. To me this counts as "infrastructure" and as such should be upstreamed to either SB3 or the huggingface dataset library (I contacted @osanseviero from HuggingFace to discuss this).
Towards this I would like to get rid of the numpy format and use something that Apache Arrow natively supports. I would also like to re-think the conversion between the HF Dataset object and our Trajectory format. Maybe there are smarter/simpler ways to solve this than concatenating/splitting.
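One option, sketched below, would be to store one row per trajectory with variable-length list columns, which Arrow handles natively. The attribute names loosely follow imitation's trajectory fields, but the exact schema and repo name are only assumptions for illustration:

```python
import datasets


def trajectories_to_dataset(trajectories):
    # One row per trajectory; Arrow stores the variable-length lists natively,
    # so no concatenation and no split indices are needed.
    return datasets.Dataset.from_dict(
        {
            "obs": [traj.obs.tolist() for traj in trajectories],
            "acts": [traj.acts.tolist() for traj in trajectories],
            "rews": [traj.rews.tolist() for traj in trajectories],
            "terminal": [traj.terminal for traj in trajectories],
        }
    )


# Uploading then becomes a one-liner (repo name is made up):
# trajectories_to_dataset(trajs).push_to_hub("HumanCompatibleAI/cartpole-expert-rollouts")
```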
Advantages
- We can now stream datasets; partial downloads become possible for large datasets.
- No more special purpose loading script
- No more thinking about how a dataset gets stored on disk. Arrow does that for us now.
Disadvantages
- Lock-In to Apache Arrow?
Step 2b: Switching to a PyTorch DataLoader
In the past I already considered migrating all the algorithms to take PyTorch `DataLoader` objects as input and made a draft implementation for this.
The PyTorch `DataLoader` could be fed from an HF Dataset (this conversion is already provided by HF), from an expert (for online learning), or from any other data source that is compatible with the PyTorch `DataLoader`.
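As a rough sketch of that conversion path (the dataset repo and column names are assumptions), an HF `Dataset` can already be wrapped in a `DataLoader` like this:

```python
import datasets
from torch.utils.data import DataLoader

# Assumes a transition-level dataset (one row per step); per-trajectory rows with
# variable lengths would additionally need a custom collate_fn.
hf_dataset = datasets.load_dataset("HumanCompatibleAI/cartpole-expert-rollouts", split="train")
loader = DataLoader(hf_dataset.with_format("torch"), batch_size=32, shuffle=True)

for batch in loader:
    obs, acts = batch["obs"], batch["acts"]  # tensors ready to feed into an algorithm
    break
```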
Advantages
- Trajectories can transparently be loaded or generated on the fly.
- The PyTorch `DataLoader` interface opens up lots of possibilities to load data from, without introducing any other dependency.
- The PyTorch `DataLoader` makes use of parallelization.
- The data pipeline is heavily simplified.
- The Dataset -> DataLoader conversion is already provided by HF
Disadvantages
- Major changes to the Algo API necessary
Step 3: Merge
Steps 2a and 2b can be executed independently. Once both are done, we should be able to merge the changes easily.
Diagram of the steps
Green: Objects in RAM
Blue: Objects on HD
Yellow: Objects in Cloud

Hi @ernestum thanks for looking into this and laying out such a clear plan.
I'm on board with step 1 and step 2(a). As you say, it's worth thinking a bit about how to convert from `Dataset` to `Trajectory`, but ultimately I don't expect the conversion code to be that complex, so as long as we're not losing any information in the conversion there's probably no need to overthink this.
Step 2(b) I'm more hesitant about. A `DataLoader` is much more abstract than our current transition types -- will this work with all our algorithms? `MCEIRL.set_demonstrations` currently has different logic depending on the transition type, and I don't see a way to push this just into a `DataLoader` without losing functionality. However, at first glance AIRL/GAIL/BC/DAgger do all seem to just make a data loader internally with `base.make_data_loader`, so if it's just MCE IRL that's the odd one out then I support this change (we can special-case MCE IRL, it's an awkward algorithm in a bunch of ways).
Talking to @osanseviero and @simoninithomas, and some playing around with the HuggingFace `datasets` library, revealed that it is super easy to map between our current format and their datasets.
I now would propose the following steps/PRs:
1. Add basic support for HuggingFace `datasets`
   - Make `imitation.types.load` support HuggingFace datasets.
   - Make `imitation.types.save` store to a HuggingFace dataset by default.
2. Add extended support for HuggingFace `datasets`
   - Ensure that `imitation.types.<load/save>` supports all of the weird action/observation spaces with unit tests (I think right now we only support discrete and box spaces).
   - Write a utility to sample trajectories from a model, record a video of some of them, and upload everything to the HuggingFace Hub with a model card showing the video and indicating from what model the trajectories originated (a rough sketch of the upload part follows after this list).
3. Deprecate `imitation.policies.serialize.load_rollouts_from_huggingface`
- Migrate all datasets currently stored in a model repo to their own dataset repo.
- Remove `imitation.policies.serialize.load_rollouts_from_huggingface`.
4. Refactor the way we handle demonstrations inside of `imitation`
Skimming over the code, it looks like we spend way too much LOC on supporting and converting between different trajectory formats (with or without rewards, transitions, transitions with `next_obs` and `dones`). I have a vague hunch that there is a lot of potential to reduce complexity and LOC, and even improve performance, by using the HuggingFace `datasets` library.
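For the utility mentioned in step 2, the upload part could roughly look like the following. The repo id, folder layout, and the use of `huggingface_hub` here are assumptions, and video recording / model card generation are omitted:

```python
from huggingface_hub import HfApi


def upload_rollouts(folder, repo_id):
    """Push a folder with rollouts, a recorded video and a README (model card) to a dataset repo."""
    api = HfApi()
    # Dataset repos (not model repos) are the right home for trajectories.
    api.create_repo(repo_id=repo_id, repo_type="dataset", exist_ok=True)
    api.upload_folder(folder_path=folder, repo_id=repo_id, repo_type="dataset")


# upload_rollouts("out/cartpole_rollouts", "HumanCompatibleAI/cartpole-expert-rollouts")
```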
> Talking to @osanseviero and @simoninithomas and some playing around with the HuggingFace `datasets` library revealed that it is super easy to map between our current format and their datasets.
That's good news!
> 1. Add basic support for HuggingFace `datasets`
>    - Make `imitation.types.load` support HuggingFace datasets.
>    - Make `imitation.types.save` store to a HuggingFace dataset by default.
This seems fine to me.
> 2. Add extended support for HuggingFace `datasets`
>    - Ensure that `imitation.types.<load/save>` supports all of the weird action/observation spaces with unit tests (I think right now we only support discrete and box spaces).
Do any of our algorithms work with these "weird" action/observation spaces? If not I don't see the point of adding support at the type level at this point.
> - Write utility to sample trajectories from a model, record a video of some of them and upload everything to the HuggingFace Hub with a model card showing the video and indicating from what model the trajectories originated from.
Good idea, it'd be nice to have a model card making it user-friendly to browse different demos. There's some overlap here with the existing `eval_policy` script.
> 3. Deprecate `imitation.policies.serialize.load_rollouts_from_huggingface`
>    - Migrate all datasets currently stored in a model repo to their own dataset repo.
>    - Remove `imitation.policies.serialize.load_rollouts_from_huggingface`.
This sounds good, much cleaner.
> 4. Refactor the way we handle demonstrations inside of `imitation`
>
> Skimming over the code it looks like we spend way too much LOC on supporting and converting between different trajectory formats (with or without rewards, transitions, transitions with next_obs and dones). I have the vague hunch that there is a lot of potential to reduce complexity, LOC and even improve performance by using the HuggingFace `datasets` library.
I don't currently have your intuition, but it's probably best to just discuss once you've got a more fleshed-out proposal there.
> - Ensure that `imitation.types.<load/save>` supports all of the weird action/observation spaces with unit tests (I think right now we only support discrete and box spaces).
>
> Do any of our algorithms work with these "weird" action/observation spaces? If not I don't see the point of adding support at the type level at this point.
Good point. I don't know what spaces we support right now. Do we want to clarify this? Do we want to support all of them? I think when we support all spaces with the datasets then probably the algos will support them too?
> Good point. I don't know what spaces we support right now. Do we want to clarify this? Do we want to support all of them? I think when we support all spaces with the datasets then probably the algos will support them too?
Dict spaces I think will break most (all?) of our current algorithms as we assume observations/actions are NumPy arrays/tensors all over the place.
Provided it's an array/tensor, I think we should be OK so long as the underlying RL algorithm supports it.
Do we properly one-hot encode discrete spaces? We could easily flatten/unflatten spaces using these util functions: https://github.com/Farama-Foundation/Gymnasium/blob/76559ed553cffc3e5082227b0a8acea461273e84/gymnasium/spaces/utils.py#L114
> Do we properly one-hot encode discrete spaces? We could easily flatten/unflatten spaces using these util functions: https://github.com/Farama-Foundation/Gymnasium/blob/76559ed553cffc3e5082227b0a8acea461273e84/gymnasium/spaces/utils.py#L114
I think the `stable_baselines3.common.preprocessing.preprocess_obs` function should take care of this.
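To illustrate the flatten/unflatten utilities referenced above (a small sketch; the concrete space is just an example):

```python
from gymnasium.spaces import Discrete
from gymnasium.spaces.utils import flatten, unflatten

space = Discrete(4)
one_hot = flatten(space, 2)            # [0, 0, 1, 0] — one-hot encoding of action 2
assert unflatten(space, one_hot) == 2  # round-trips back to the original action
```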