[RMP] Expand supported datasets in Merlin Models (to support the tutorial?)

Open gabrielspmoreira opened this issue 2 years ago • 5 comments

Problem:

A clear description of the problem and how it impacts the customer experience of Merlin users. Why is this important? Why/Should we prioritize this work?

Goal:

  • What is the goal of this work? Note that this is a single goal. Please try to ensure that we're solving one problem and not many.
  • This can also include anti-goals of what this work does not include.
  • Ideally this takes the form of a bulleted list.

Constraints:

  • What are the constraints that might impact the choice of solution?
  • This can also include non-constraints to clarify if something that would normally be a constraint is not a consideration.
  • Ideally this takes the form of a bulleted list.

Scope:

  • Support both Retrieval and Ranking
  • Datasets: Ali-CCP, H&M and LastFM (used for retrieval research)

Starting Point:

Enabler tasks from NVTabular

  • [ ] NVIDIA-Merlin/NVTabular#1504
  • [x] https://github.com/NVIDIA-Merlin/NVTabular/issues/1484

Specific dataset tasks

  • [ ] NVIDIA-Merlin/models#342
  • [ ] NVIDIA-Merlin/models#343

All datasets

  • [ ] Add synthetic creation methods for these datasets
  • [ ] Integrate the example notebook workflows into the Merlin Models (MM) library
  • [ ] Identify which dataset will be used for the tutorial
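
As a rough illustration of what a synthetic creation method could look like, here is a minimal sketch that uses only the Python standard library. It is not the Merlin Models API: the function name `generate_synthetic_interactions`, the column names, and the 30% positive rate are all assumptions made for this example. Each row pairs a user with an item (usable for retrieval) and carries a binary `click` target (usable for ranking).

```python
import random
from datetime import date, timedelta


def generate_synthetic_interactions(num_rows, num_users=100, num_items=50, seed=0):
    """Return a list of dict rows that mimic a user-item interaction log.

    Hypothetical helper for illustration only; column names are assumptions
    and do not reflect the real schema of Ali-CCP, H&M or LastFM.
    """
    rng = random.Random(seed)  # seeded so the output is reproducible
    start = date(2020, 1, 1)
    rows = []
    for _ in range(num_rows):
        rows.append(
            {
                "user_id": rng.randrange(num_users),
                "item_id": rng.randrange(num_items),
                # random day within one year of the start date
                "day": (start + timedelta(days=rng.randrange(365))).isoformat(),
                # binary ranking target, positive roughly 30% of the time
                "click": int(rng.random() < 0.3),
            }
        )
    return rows


rows = generate_synthetic_interactions(1000)
```

Seeding the generator keeps notebooks and tests reproducible. A real implementation would presumably derive cardinalities and value ranges from each dataset's schema rather than from hard-coded defaults.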

Documentation

  • [ ] Release notes
  • [ ] Docs update?

gabrielspmoreira avatar Apr 06 '22 16:04 gabrielspmoreira

See also NVIDIA-Merlin/models#345 - out of scope for 22.05

benfred avatar Apr 11 '22 16:04 benfred

@rnyak and I drafted preprocessing notebooks for H&M and LastFM for the retrieval experiments in the private research repo. Those notebooks are specific to the evaluation protocol chosen for the retrieval model experiments, but they can serve as a basis for implementing general preprocessing of those datasets within Merlin Models.

gabrielspmoreira avatar Jun 16 '22 00:06 gabrielspmoreira

Are these datasets planned to be used in the KDD/RecSys tutorials, @rnyak? If so, that makes adding them a higher priority; if not, this becomes a nice-to-have compared to the other work that directly feeds into the tutorials.

karlhigley avatar Jun 16 '22 18:06 karlhigley

@karlhigley most likely we will use the Ecom-REES46 dataset for the KDD'22 tutorial and the real H&M dataset for the RecSys'22 tutorial, if we can get permission to use it and share it on our DLI platform, which we don't have yet.

If you are asking about adding code for downloading and preprocessing H&M and LastFM, it'd be useful to add examples around that, but it's not urgent. The code is already there, so it won't be much work to do so.

Nevertheless, it'd be useful to add synthetic creation methods for the H&M dataset, which can be used in the hands-on tutorials.

rnyak avatar Jun 22 '22 17:06 rnyak

@marcromeyn, could you please provide more information in the Problem, Goal and Constraints sections above? You may have already provided these details in the comments; if so, please summarize them at the top. Let me know if you run into any difficulties.

viswa-nvidia avatar Jun 29 '22 22:06 viswa-nvidia

Datasets used will be example driven. Closing.

EvenOldridge avatar Oct 12 '22 16:10 EvenOldridge