
[Bug]: sample size limitation in machine learning modules

Open rd24911 opened this issue 1 year ago • 9 comments

JASP Version

0.17.2.1

Commit ID

No response

JASP Module

Machine Learning

What analysis are you seeing the problem on?

SVM and random forest (and possibly other methods in the module)

What OS are you seeing the problem on?

macOS Intel

Bug Description

I have a sample of around 2500 observations, but each time I try to use the machine learning models in the JASP module, they only utilize 2383 data points (exactly that number). I'm not sure whether this is a bug or a limitation of my computer. It is a bit of a problem because the data are aggregated in a certain order, so the last several observations are all of the same type and were dropped by the model. Thank you for your help!

Expected Behaviour

Full utilization of the entire sample.

Steps to Reproduce

I've tried different analyses (SVM and random forest), and both only use 2383 data points.

Log (if any)

No response

Final Checklist

  • [X] I have included a screenshot showcasing the issue, if possible.
  • [X] I have included a JASP file (zipped) or data file that causes the crash/bug, if applicable.
  • [X] I have accurately described the bug, and steps to reproduce it.

rd24911 avatar Jul 23 '23 18:07 rd24911

@rd24911, thanks for taking the time to create this issue. If possible (and applicable), please upload to the issue website (https://github.com/jasp-stats/jasp-issues/issues/2238) a screenshot showcasing the problem, and/or a compressed (zipped) .jasp file or the data file that causes the issue. If you would prefer not to make your data publicly available, you can send your file(s) directly to us, [email protected]

github-actions[bot] avatar Jul 23 '23 18:07 github-actions[bot]

Hi @rd24911,

This is probably due to your 'Data Split Preferences' options. These options can be found under the corresponding expandable section in all supervised analyses. The default approach in all ML analyses is to use 20% of the entire data set as a test set to assess the model's performance. The other 80% is used for training the model (of which 20% is optionally used to optimize the model parameters). You can add a binary variable called "test set indicator" to the data and specify it in the data split preferences to use (almost) all the data for training, or change the data split options to reflect this, but it is highly recommended to use a good portion of the data as the test set to assess performance. Let me know if this clarifies things!

koenderks avatar Jul 23 '23 19:07 koenderks
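For illustration, here is a minimal sketch of the split arithmetic described above, assuming 2383 usable rows (the figure from this issue) and simple rounding; the exact numbers JASP reports may differ slightly:

```python
# Sketch of the default data split proportions (rounding here is an assumption).
# 20% of the usable rows form the test set, the remaining 80% the training set,
# and optionally 20% of the training rows are held out to tune parameters.

n_complete = 2383                  # rows without missing values (from this issue)

n_test = round(n_complete * 0.20)  # held out to assess performance
n_train = n_complete - n_test      # used for training the model
n_valid = round(n_train * 0.20)    # optionally used to optimize parameters
n_fit = n_train - n_valid          # rows actually used for fitting

print(f"test: {n_test}, train: {n_train} (validation: {n_valid}, fit: {n_fit})")
# test: 477, train: 1906 (validation: 381, fit: 1525)
```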

Thank you very much! I have tried the "add generated indicator" function, but it still only generated 2383 indicators: only the first 2383 rows received an indicator, and when I ran the ML analysis using those indicators the sample size dropped a little further. I saved the predictions and reviewed the whole data set, and I realized that the difference between 2383 and the full sample (~2500) is due to missing values in some items. The ML model only uses rows without missing values, which is 2383 rows, but the indicators were assigned to the first 2383 rows of the data set. Is it possible to generate an indicator for the whole data set in advance, so that it is not based on the smaller number of rows left after dropping cases? Or should I handle the missing values in advance? I'm sorry if this question does not belong in this forum. Thank you for your reply; I understand now that this is not a bug.

rd24911 avatar Jul 23 '23 19:07 rd24911
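One way to work around the indicator mismatch described above is to generate the test set indicator outside JASP, over the complete cases only, and then load the augmented file. A minimal pandas sketch, with hypothetical file and column names (this is not a JASP feature, just an illustration):

```python
import numpy as np
import pandas as pd

# Assign a test-set indicator to 20% of the complete cases only, so the
# indicator matches the rows the model can actually use. File and column
# names below are assumptions for illustration.

rng = np.random.default_rng(seed=1)
df = pd.read_csv("my_data.csv")            # ~2500 rows, some with missing values

complete = df.dropna().index               # the ~2383 complete cases
n_test = round(len(complete) * 0.20)       # 20% of complete cases as test set
test_rows = rng.choice(complete, size=n_test, replace=False)

df["test_indicator"] = 0                   # 0 = training
df.loc[test_rows, "test_indicator"] = 1    # 1 = test set

df.to_csv("my_data_with_indicator.csv", index=False)
```

The resulting test_indicator column can then be supplied as the "test set indicator" in the Data Split Preferences.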

@rd24911 There is work on imputation to handle missing values within JASP, planned for the upcoming version 0.19; see https://github.com/jasp-stats/jasp-issues/issues/2437. I hope this will help here! Can you help us test the feature once it is out in beta?

tomtomme avatar Feb 08 '24 21:02 tomtomme
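Until the built-in imputation mentioned above is available, a crude stopgap is to impute missing values outside JASP before loading the file. A minimal pandas sketch using simple median/mode imputation (much simpler than proper imputation procedures; the file handling and choice of imputation are assumptions):

```python
import pandas as pd

# Simple single imputation as a stopgap: numeric columns get their median,
# other columns their most frequent value. This ignores the uncertainty
# about the imputed values, unlike multiple imputation.

df = pd.read_csv("my_data.csv")

for col in df.columns:
    if pd.api.types.is_numeric_dtype(df[col]):
        df[col] = df[col].fillna(df[col].median())
    else:
        df[col] = df[col].fillna(df[col].mode().iloc[0])

df.to_csv("my_data_imputed.csv", index=False)
```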

Sure! Thank you for developing these useful features. I will check the feature once it is available!

rd24911 avatar Feb 09 '24 02:02 rd24911

Great! Please remind me here in 2 or 3 weeks to send you a link to the beta; maybe a beta will be out by then. If you are adventurous, you can check out the nightly builds, published once a week at https://static.jasp-stats.org/Nightlies/

But those might be very unstable, and imputation is not yet included!

tomtomme avatar Feb 09 '24 08:02 tomtomme

Sorry, I don't know if this is a bit late, but I'm available for testing now! If you need the beta version tested, please send me a link. Looking forward to the next update!

rd24911 avatar Mar 27 '24 14:03 rd24911

@rd24911 Thx for the reminder! I would love to share a link, but missing value stuff is not yet in the current beta afaik. Sorry!

tomtomme avatar Mar 27 '24 14:03 tomtomme

@rd24911 Sadly the imputation stuff had to be delayed to the next major version 0.20. I will keep you up to date when there is a beta for that. Thx so far.

tomtomme avatar Jul 08 '24 09:07 tomtomme

For progress on missing data handling please see: https://github.com/jasp-stats/jasp-issues/issues/2437

tomtomme avatar Sep 06 '24 10:09 tomtomme