Add Merlin Models example notebooks as integration tests with real data
- [ ] Add integration tests to merlin models notebooks using Testbook
- [ ] Define a process where outputs from the notebooks are consistent and can be captured in ASV
- [ ] Add notebook tests to CI and ensure that CI is working correctly
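For the ASV item, here is one hedged sketch of how a notebook output could be captured (the class and metric names are hypothetical, not existing Merlin code): ASV records the return value of any `track_*` benchmark method across commits, so a notebook's final metric could be tracked like this:

```python
# Hypothetical ASV benchmark sketch: asv discovers methods named track_*
# and records their return values over time, which is one way notebook
# outputs could be "captured in ASV" as the checklist suggests.
class NotebookOutputs:
    def track_retrieval_recall(self):
        # In a real benchmark this would execute the notebook (e.g. with
        # Testbook) and read the final recall metric out of it; a fixed
        # placeholder stands in here.
        recall_at_10 = 0.42  # placeholder, not a real measurement
        return recall_at_10
```

This keeps the "consistent outputs" requirement concrete: whatever number the notebook produces becomes the tracked value, and ASV surfaces regressions across runs.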
On real data, not synthetic
We have three notebooks that use synthetic data as an example: 03-Exploring-different-models.ipynb, 04-Exporting-ranking-models.ipynb, and 05-Retrieval-Model.ipynb
Do we want to update them to use real data in CI?
@karlhigley @benfred and @jperez999, as a follow-up question to Benedikt's: how will CI work with the real ALI-CCP data? We need to sign up to download that dataset. How will the dataset be downloaded?
I have no idea (nor do I particularly care whether we run integration tests with real or synthetic data). I've just written down what people kept telling me was supposed to happen, but if we need to revise the scope or the issues (or both), we should totally do that. Basically, I'm trying to write down what I keep hearing in order to make sure conversations like this one happen, so that we can clarify what's actually workable and do that instead. 😃
@jperez999 (+ not sure who else can help).
For the movielens-1m notebooks, the tests expect the following data structure on the CI machine:
/raid/data/movielens/
├── ml-1m
│   ├── README
│   ├── movies.dat
│   ├── ratings.dat
│   └── users.dat
└── ml-1m.zip
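As a sketch of how a test could guard against missing data before executing the notebook (the helper name is hypothetical; the paths mirror the layout above):

```python
from pathlib import Path

# Expected MovieLens 1M files, mirroring the CI machine layout described above.
EXPECTED_ML1M_FILES = ("README", "movies.dat", "ratings.dat", "users.dat")

def missing_movielens_files(root="/raid/data/movielens"):
    """Return the expected files that are absent under root (hypothetical helper)."""
    base = Path(root) / "ml-1m"
    missing = [name for name in EXPECTED_ML1M_FILES if not (base / name).is_file()]
    if not (Path(root) / "ml-1m.zip").is_file():
        missing.append("ml-1m.zip")
    return missing
```

An integration test could call this first and skip (rather than fail) when the CI machine doesn't have the dataset staged, which keeps the failure mode distinguishable from an actual notebook regression.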
For aliccp-raw, the tests expect the following data structure on the CI machine:
/raid/data/aliccp/raw/
├── train
│   ├── train_0.parquet
│   ├── train_1.parquet
│   ├── train_2.parquet
│   ├── train_3.parquet
│   ├── train_4.parquet
│   ├── train_5.parquet
│   ├── train_6.parquet
│   └── train_7.parquet
└── valid
    ├── test_0.parquet
    ├── test_1.parquet
    ├── test_2.parquet
    ├── test_3.parquet
    ├── test_4.parquet
    ├── test_5.parquet
    ├── test_6.parquet
    └── test_7.parquet
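The same kind of guard could cover the ALI-CCP layout (a hypothetical helper, mirroring the 8 train and 8 valid shards listed above; not an existing Merlin utility):

```python
from pathlib import Path

# Hypothetical check for the ALI-CCP raw layout above: 8 train shards named
# train_{i}.parquet and 8 valid shards named test_{i}.parquet.
def missing_aliccp_shards(root="/raid/data/aliccp/raw"):
    """Return paths of expected parquet shards that are absent under root."""
    expected = [Path(root, "train", f"train_{i}.parquet") for i in range(8)]
    expected += [Path(root, "valid", f"test_{i}.parquet") for i in range(8)]
    return [str(p) for p in expected if not p.is_file()]
```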
@jperez999 have you had a chance to add the dataset to the CI environment? Is there something we can help with?
What do you all think of breaking the full-dataset runs out from the notebook examples? They could be jobs scheduled to run daily on the latest docker images and latest main
branches of each library. Some reasons it could be useful are:
- We can run the Merlin notebook integration tests (on sampled data) faster, and therefore can run them in PRs for each of the other libraries to make sure a change to merlin-models or merlin-systems doesn't break the demo notebooks.
- By running the full-dataset jobs daily, we'll still get quick-ish feedback (via pagerduty and/or slack) if code changes introduce an issue at scale.
- By separating integration test jobs from the example notebooks, we can write more specific jobs to test edge cases and more advanced functionality that doesn't make sense in the examples.
- Once separated from CI, there's no reason to try to get the test jobs to finish in a reasonable amount of time (except for cost, of course), so we can really scale up the amount of data used to test future features like https://github.com/NVIDIA-Merlin/Merlin/issues/107
- Regularly scheduled jobs are a more realistic implementation of a Merlin-based workflow than notebooks.
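The daily scheduled job could be sketched roughly like this (a GitHub Actions fragment; the workflow name, runner labels, image tag, and pytest marker are all assumptions for illustration, not existing Merlin CI config):

```yaml
# Hypothetical daily full-dataset job; every name below is illustrative.
name: daily-full-dataset-tests
on:
  schedule:
    - cron: "0 6 * * *"          # once per day
jobs:
  integration:
    runs-on: [self-hosted, gpu]  # assumed self-hosted GPU runner with the datasets staged
    container:
      image: nvcr.io/nvidia/merlin/merlin-tensorflow:latest  # hypothetical image tag
    steps:
      - uses: actions/checkout@v4
      - run: pip install testbook pytest
      - run: pytest tests/integration -m fulldataset  # hypothetical marker
```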
Yeah, I think that makes sense as the next incremental step from where we are now