
About Results

Porphy opened this issue 7 months ago · 1 comment

Hello, and thank you for this excellent repository—your work makes it much easier for newcomers to reproduce and build upon those baseline models!

  1. I set up the environment exactly as specified in requirements.txt and attempted to reproduce the MovieLens results. For most models, my results fall within an acceptable margin of the numbers reported in the paper. However, for STAR and M2M, I observed performance substantially lower than the published results (each run 10 times with seeds in range(2020, 2030)). Could this be a robustness issue with STAR and M2M, or might the hyperparameter settings be suboptimal? [screenshot of results attached]

  2. I also noticed that no method consistently outperforms the shared-bottom architecture across all six datasets. Does this suggest that MSR models may be highly dataset-dependent, similar to how, in tabular learning, tree-based models like XGBoost still often outperform so-called SOTA deep methods?

  3. The differences among all methods on each dataset are quite small. Is this a limitation of the datasets (e.g., lack of complexity or noise), or could it indicate that we have reached a performance plateau on these benchmarks?

Porphy · Apr 11 '25

Hi @Porphy, thank you for the thoughtful questions. Let me address them one at a time.

Question 1: If your initial runs fall short of expectations, try adjusting key hyper-parameters and rerunning the experiment. From my tests, M2M’s performance is quite stable, whereas STAR is noticeably more sensitive and benefits from finer-grained tuning.
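For illustration, here is a minimal sketch of the kind of seed + hyper-parameter sweep I have in mind. The search space and the `train_and_eval` wrapper are hypothetical placeholders, not this repo's actual API; you would wire the wrapper to the benchmark's real training entry point:

```python
# Sketch of a seed + hyper-parameter sweep for STAR on MovieLens.
# `train_and_eval` is a hypothetical wrapper; replace its body with a call
# to the benchmark's actual training code.
import itertools
import statistics

def train_and_eval(model, dataset, lr, embed_dim, seed):
    """Train `model` on `dataset` with the given settings and return test AUC."""
    raise NotImplementedError("wire this to the repo's training entry point")

learning_rates = [1e-3, 5e-4, 1e-4]   # assumed search space
embed_dims = [16, 32, 64]             # assumed search space
seeds = range(2020, 2030)             # same seeds as in the question

best = None
for lr, dim in itertools.product(learning_rates, embed_dims):
    aucs = [train_and_eval("STAR", "movielens", lr, dim, s) for s in seeds]
    mean_auc, std_auc = statistics.mean(aucs), statistics.stdev(aucs)
    print(f"lr={lr}, embed_dim={dim}: AUC {mean_auc:.4f} ± {std_auc:.4f}")
    if best is None or mean_auc > best[0]:
        best = (mean_auc, lr, dim)

print("best config (mean AUC, lr, embed_dim):", best)
```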

Question 2: You’re absolutely right. In my experimental experience, the scenario-related feature is often the single most influential feature and has a strong impact on final model performance, which also suggests that performance is quite dataset-dependent.
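One quick way to check this yourself is a permutation test on the scenario indicator: shuffle that column at evaluation time and see how much AUC drops. A minimal sketch, assuming a trained model with a `predict` method, a pandas test DataFrame, and a column named "scenario_id" (all placeholders rather than this repo's actual interfaces):

```python
# Minimal permutation-importance check for the scenario feature.
# Assumes: `model.predict(X)` returns click probabilities, `X_test` is a
# pandas DataFrame containing a "scenario_id" column, `y_test` holds labels.
import numpy as np
from sklearn.metrics import roc_auc_score

def scenario_importance(model, X_test, y_test, seed=0):
    rng = np.random.default_rng(seed)

    base_auc = roc_auc_score(y_test, model.predict(X_test))

    # Break the scenario signal by shuffling only the scenario-ID column.
    X_perm = X_test.copy()
    X_perm["scenario_id"] = rng.permutation(X_perm["scenario_id"].to_numpy())
    perm_auc = roc_auc_score(y_test, model.predict(X_perm))

    # A large drop indicates the model relies heavily on the scenario feature.
    return base_auc - perm_auc
```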

Question 3: This remains an open question in this area. Dataset availability is indeed a key bottleneck, which is why I built several domain-specific datasets manually. As for a "performance plateau", I don’t think we’re there yet. Recent papers report impressive gains from LLM-based and prototype-based methods, so I think it is worth exploring additional approaches to see whether we can push the state of the art further. I also welcome pull requests if you would like to contribute to our benchmark. Thanks again for your questions.

I hope this clarifies things; feel free to let me know if you’d like more detail on any point.

Xiaopengli1 · May 30 '25