[RMP] Expand supported datasets in Merlin Models (to support the tutorial?)

Problem:

A clear description of the problem and how it impacts the customer experience of Merlin users. Why is this important? Why/Should we prioritize this work?

Goal:

What is the goal of this work? Note that this is goal, singular. Please try to ensure that we're trying to solve one problem and not many.
This can also include anti-goals of what this work does not include.
Ideally this takes the form of a bulleted list.

Constraints:

What are the constraints that might impact the choice of solution?
This can also include non-constraints to clarify if something that would normally be a constraint is not a consideration.
Ideally this takes the form of a bulleted list.

Scope:

Support both Retrieval and Ranking
Datasets: Ali-CCP, H&M and LastFM (used for retrieval research)

Starting Point:

Enabler tasks from NVTabular

[ ] NVIDIA-Merlin/NVTabular#1504
[x] https://github.com/NVIDIA-Merlin/NVTabular/issues/1484

Specific dataset tasks

[ ] NVIDIA-Merlin/models#342
[ ] NVIDIA-Merlin/models#343

All datasets

[ ] Add synthetic creation methods for these datasets
[ ] Integrate the example notebook workflows into the Merlin Models (MM) library
[ ] Identify which dataset will be used for the tutorial
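
As a rough illustration of the "synthetic creation methods" task above, here is a minimal sketch of what such a generator could look like. It uses only the Python standard library. The column spec `SPEC`, the function `generate_synthetic_rows`, and the column names and value ranges are all hypothetical placeholders, not the Merlin Models API or the real Ali-CCP, H&M, or LastFM schemas.

```python
import random

# Hypothetical column spec (an assumption for illustration only):
# name -> (kind, parameter). Not taken from any real dataset schema.
SPEC = {
    "user_id": ("categorical", 1000),       # ids drawn from 1..1000
    "item_id": ("categorical", 5000),       # ids drawn from 1..5000
    "price": ("continuous", (0.0, 100.0)),  # uniform in [low, high]
    "click": ("binary", 0.1),               # target, positive with prob 0.1
}


def generate_synthetic_rows(spec, num_rows, seed=0):
    """Generate random rows that match the column spec."""
    rng = random.Random(seed)  # seeded so the output is reproducible
    rows = []
    for _ in range(num_rows):
        row = {}
        for name, (kind, param) in spec.items():
            if kind == "categorical":
                row[name] = rng.randint(1, param)
            elif kind == "continuous":
                low, high = param
                row[name] = rng.uniform(low, high)
            elif kind == "binary":
                row[name] = int(rng.random() < param)
            else:
                raise ValueError(f"unknown column kind: {kind}")
        rows.append(row)
    return rows


rows = generate_synthetic_rows(SPEC, num_rows=5)
```

A real implementation would presumably derive the spec from each dataset's schema rather than hard-coding it, so the synthetic data stays in sync with the preprocessing workflow.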

Documentation

[ ] Release notes
[ ] docs update?

Apr 06 '22 16:04 gabrielspmoreira

See also NVIDIA-Merlin/models#345 - out of scope for 22.05

Apr 11 '22 16:04 benfred

There are preprocessing notebooks drafted by me and @rnyak for H&M and LastFM for the retrieval experiments in the research private repo. Those notebooks are specific to the evaluation protocol chosen for retrieval models experiments, but they can work as a basis for the implementation of a general preprocessing of those datasets within Merlin Models.

Jun 16 '22 00:06 gabrielspmoreira

Are these datasets planned to be used in the KDD/RecSys tutorials, @rnyak? If so, that makes adding them a higher priority; if not, this becomes a nice to have compared to the other work that does directly feed into the tutorials.

Jun 16 '22 18:06 karlhigley

@karlhigley most likely we will use the Ecom-REES46 dataset for the KDD'22 tutorial and the real H&M dataset for the RecSys'22 tutorial, provided we can get permission to use it and share it on our DLI platform, which we don't have yet.

If you are asking about adding code for downloading and preprocessing H&M and LastFM, it would be useful to add examples around that, but it is not urgent. The code is already there, so it won't be much work to do so.

Nevertheless, it'd be useful to add synthetic creation methods for the H&M dataset, which can be used in the hands-on tutorials.

Jun 22 '22 17:06 rnyak

@marcromeyn, could you please help provide more information in the Problem, Goal, and Constraints sections above? You may have already provided these details in the comments; if so, please summarize them at the top. Let me know if you run into any difficulties.

Jun 29 '22 22:06 viswa-nvidia

Datasets used will be example driven. Closing.

Oct 12 '22 16:10 EvenOldridge
