[RMP] Expand supported datasets in Merlin Models (to support the tutorial?)
Problem:
A clear description of the problem and how it impacts the customer experience of Merlin users. Why is this important? Why/Should we prioritize this work?
Goal:
- What is the goal of this work. Note that this is goal singular. Please try to ensure that we're trying to solve one problem and not many.
- This can also include anti-goals of what this work does not include.
- Ideally this takes the form of a bulleted list.
Constraints:
- What are the constraints that might impact the choice of solution?
- This can also include non-constraints to clarify if something that would normally be a constraint is not a consideration.
- Ideally this takes the form of a bulleted list.
Scope:
- Support both Retrieval and Ranking
- Datasets: Ali-CCP, H&M and LastFM (used for retrieval research)
Starting Point:
Enabler tasks from NVTabular
- [ ] NVIDIA-Merlin/NVTabular#1504
- [x] https://github.com/NVIDIA-Merlin/NVTabular/issues/1484
Specific dataset tasks
- [ ] NVIDIA-Merlin/models#342
- [ ] NVIDIA-Merlin/models#343
All datasets
- [ ] Add synthetic creation methods for these datasets
- [ ] Integrate the example notebook workflows into the MM library.
- [ ] Identify which dataset will be used for the tutorial
Documentation
- [ ] Release notes
- [ ] docs update?
See also NVIDIA-Merlin/models#345 - out of scope for 22.05
There are preprocessing notebooks drafted by me and @rnyak for H&M and LastFM for the retrieval experiments in the research private repo. Those notebooks are specific to the evaluation protocol chosen for retrieval models experiments, but they can work as a basis for the implementation of a general preprocessing of those datasets within Merlin Models.
Are these datasets planned to be used in the KDD/RecSys tutorials, @rnyak? If so, that makes adding them a higher priority; if not, this becomes a nice to have compared to the other work that does directly feed into the tutorials.
@karlhigley most likely we will use the Ecom-REES46 dataset for the KDD'22 tutorial and the real H&M dataset for the RecSys'22 tutorial, provided we get permission to use it and share it on our DLI platform, which we don't have yet.
If you are asking about adding code for downloading and preprocessing H&M and LastFM, it would be useful to add examples around that, but it is not urgent. The code already exists, so it won't be much work.
Nevertheless, it'd be useful to add synthetic creation methods for the H&M dataset, which can be used in the hands-on tutorials.
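As a rough illustration of what a synthetic creation method could look like, here is a minimal sketch that uses only the Python standard library. The function name `make_synthetic_interactions`, its parameters, and the column names (`customer_id`, `article_id`, `price`) are assumptions chosen for illustration; this is not the Merlin Models API, and the real H&M schema may differ.

```python
import random


def make_synthetic_interactions(num_users=100, num_items=50, num_rows=1000, seed=0):
    """Hypothetical helper: generate random purchase-like rows.

    Column names are assumed for illustration and are not taken from
    Merlin Models or from a confirmed H&M schema.
    """
    rng = random.Random(seed)  # seeded so the output is reproducible
    rows = []
    for _ in range(num_rows):
        rows.append(
            {
                "customer_id": rng.randrange(num_users),
                "article_id": rng.randrange(num_items),
                "price": round(rng.uniform(1.0, 100.0), 2),
            }
        )
    return rows


# Small sample, e.g. for a hands-on exercise that cannot ship the real data.
rows = make_synthetic_interactions(num_rows=5)
```

Seeding the generator keeps tutorial outputs stable across runs; a real implementation would probably draw values from the dataset's schema rather than hard-coding columns.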
@marcromeyn, could you please fill in the Problem, Goal, and Constraints sections above? You may have already provided these details in the comments; if so, please summarize them at the top. Let me know if you run into any difficulties.
Datasets used will be example driven. Closing.