LightGBM icon indicating copy to clipboard operation
LightGBM copied to clipboard

[WIP] support dataset rows more then max(int32_t)

Open junpeng0715 opened this issue 2 years ago • 12 comments

Description

This is a WIP PR to allow more than MAX(int32_t) of dataset rows.

  • change data_size_t to int64_t
  • change int32_t to int64_t for related places
  • modify test code to allow int64 dataset.
  • change unmatch type in c_api.h(int) and c_api.cpp(int32_t) for ncol https://github.com/microsoft/LightGBM/blob/2b8fe8b4bdc00a2611442fdee4c45316f08b1c4b/include/LightGBM/c_api.h#L1221-L1231
  • change the thread run logic in Threading.h, when nblock = 0, set left_cnt = 0 https://github.com/microsoft/LightGBM/blob/2b8fe8b4bdc00a2611442fdee4c45316f08b1c4b/include/LightGBM/utils/threading.h#L162-L168

Additional Comments

  • Need some experienced developers to review this PR, and make some suggestions.
  • When I build in java wrapper(swig), I got two errors,
  ubuntu: 20.04
  SWIG:4.0.2
  gcc: GNU 9.4.0
  G++: GNU 9.4.0
  JVM: 1.8.0_342

It seem that SWIG will convert the parameter which defined as const int64_t* to long long const*, could you give some suggestions for this error?

 [ 96%] Building CXX object CMakeFiles/_lightgbm_swig.dir/java/lightgbmlibJAVA_wrap.cxx.o
/home/ubuntu/projects/LightGBM/build/java/lightgbmlibJAVA_wrap.cxx: In function ‘jint Java_com_microsoft_ml_lightgbm_lightgbmlibJNI_LGBM_1DatasetCreateFromSampledColumn(JNIEnv*, jclass, jlong, jlong, jint, jlong, jlong, jlong, jlong, jstring, jlong)’:
/home/ubuntu/projects/LightGBM/build/java/lightgbmlibJAVA_wrap.cxx:1285:68: error: invalid conversion from ‘const long long int*’ to ‘const int64_t*’ {aka ‘const long int*’} [-fpermissive]
 1285 |   result = (int)LGBM_DatasetCreateFromSampledColumn(arg1,arg2,arg3,(long long const *)arg4,arg5,arg6,arg7,(char const *)arg8,arg9);
      |                                                                    ^~~~~~~~~~~~~~~~~~~~~~~
      |                                                                    |
      |                                                                    const long long int*
In file included from /home/ubuntu/projects/LightGBM/build/java/lightgbmlibJAVA_wrap.cxx:239:
/home/ubuntu/projects/LightGBM/include/../include/LightGBM/c_api.h:130:74: note:   initializing argument 4 of ‘int LGBM_DatasetCreateFromSampledColumn(double**, int**, int32_t, const int64_t*, int64_t, int64_t, int64_t, const char*, void**)’
  130 |                                                           const int64_t* num_per_col,
      |                                                           ~~~~~~~~~~~~~~~^~~~~~~~~~~
/home/ubuntu/projects/LightGBM/build/java/lightgbmlibJAVA_wrap.cxx: In function ‘jint Java_com_microsoft_ml_lightgbm_lightgbmlibJNI_LGBM_1DatasetGetSubset(JNIEnv*, jclass, jlong, jlong, jlong, jstring, jlong)’:
/home/ubuntu/projects/LightGBM/build/java/lightgbmlibJAVA_wrap.cxx:1561:44: error: invalid conversion from ‘const long long int*’ to ‘const int64_t*’ {aka ‘const long int*’} [-fpermissive]
 1561 |   result = (int)LGBM_DatasetGetSubset(arg1,(long long const *)arg2,arg3,(char const *)arg4,arg5);
      |                                            ^~~~~~~~~~~~~~~~~~~~~~~
      |                                            |
      |                                            const long long int*
In file included from /home/ubuntu/projects/LightGBM/build/java/lightgbmlibJAVA_wrap.cxx:239:
/home/ubuntu/projects/LightGBM/include/../include/LightGBM/c_api.h:316:60: note:   initializing argument 2 of ‘int LGBM_DatasetGetSubset(DatasetHandle, const int64_t*, int64_t, const char*, void**)’
  316 |                                             const int64_t* used_row_indices,

junpeng0715 avatar Aug 31 '22 05:08 junpeng0715

will this affect the training speed? any benchmarks?

guolinke avatar Sep 01 '22 15:09 guolinke

will this affect the training speed? any benchmarks?

@guolinke

  • I found a benchmarks in you repo, boosting_tree_benchmarks, is it fine to confirm training speed?
  • Can you give me some feedback on the question for SWIG that I raised? Or who is the main developer for SWIG part to ask for advice.

junpeng0715 avatar Sep 02 '22 01:09 junpeng0715

yeah, it is fine to benchmark.

@imatiach-msft , @AlbertoEAF, @svotaw can you help for SWIG ?

guolinke avatar Sep 02 '22 03:09 guolinke

It seem that SWIG will convert the parameter which defined as const int64_t* to long long const*, could you give some suggestions for this error?

I believe this is related: #4091.

StrikerRUS avatar Sep 04 '22 15:09 StrikerRUS

yeah, it is fine to benchmark.

@imatiach-msft , @AlbertoEAF, @svotaw can you help for SWIG ?

@guolinke Tested Higgs Dataset in AWS Instrance : r5.2xlarge (vCPUs 8/Memory 64GiB)

# Master Branch Modified
Accuracy 310.827840 seconds, iteration 500, AUC:0.845874 427.753441 seconds, iteration 500, AUC:0.845874
Speed 294.647278 seconds, iteration 500 303.744985 seconds, iteration 500

junpeng0715 avatar Sep 05 '22 02:09 junpeng0715

It seem that SWIG will convert the parameter which defined as const int64_t* to long long const*, could you give some suggestions for this error?

I believe this is related: #4091.

https://github.com/microsoft/LightGBM/blob/2e54d5f518a4c7db721483347bec6676a8d20c89/swig/ChunkedArray_API_extensions.i#L16-L21 Yes @StrikerRUS, I saw this source comment. I'm finding a way to avoid the SWIG issue.

  • Is there any tasks are working on this issue?
  • I use SynapseML to run LightGBM. First build lightgbm JAVA wrapper once, then modify the error in built source from long long const* to long int const*, then rebuild, it will be success. SynapseML can use the built package to run, I am not sure it is a temp solution for test. Could you raise some suggesion for this issue?

junpeng0715 avatar Sep 05 '22 03:09 junpeng0715

@junpeng0715 given the performance loss, I think a better solution is to make the long indices an option, e.g. make it a flag (define in CMake) in the compilation.

guolinke avatar Sep 05 '22 13:09 guolinke

@junpeng0715 given the performance loss, I think a better solution is to make the long indices an option, e.g. make it a flag (define in CMake) in the compilation.

@guolinke Re-ran the Higgs benchmark, and got a new result below. It is faster this time.It seems the instance is not stable. Anyway, I also agree with you, it's fine to add an option.

# Master Branch Modified
Accuracy 309.241788 seconds, iteration 500, AUC:0.845874 318.976365 seconds, iteration 500, AUC:0.845874
Speed 399.465431 seconds, iteration 500 304.222913 seconds, iteration 500

e.g. make it a flag (define in CMake) in the compilation.

It feels like changing the modified int64_t to data_size_t , then add a CMAKE option to switch data_size_t . But how to handle the APIs(C, Python, etc)?

junpeng0715 avatar Sep 05 '22 15:09 junpeng0715

It seem that SWIG will convert the parameter which defined as const int64_t* to long long const*, could you give some suggestions for this error?

I believe this is related: #4091.

https://github.com/microsoft/LightGBM/blob/2e54d5f518a4c7db721483347bec6676a8d20c89/swig/ChunkedArray_API_extensions.i#L16-L21

Yes @StrikerRUS, I saw this source comment. I'm finding a way to avoid the SWIG issue.

  • Is there any tasks are working on this issue?
  • I use SynapseML to run LightGBM. First build lightgbm JAVA wrapper once, then modify the error in built source from long long const* to long int const*, then rebuild, it will be success. SynapseML can use the built package to run, I am not sure it is a temp solution for test. Could you raise some suggesion for this issue?

Hi @imatiach-msft , @AlbertoEAF, @svotaw , could you help for this?

junpeng0715 avatar Sep 06 '22 01:09 junpeng0715

cc @shiyu1994 to confirm the performance test.

guolinke avatar Sep 07 '22 14:09 guolinke

It seem that SWIG will convert the parameter which defined as const int64_t* to long long const*, could you give some suggestions for this error?

I believe this is related: #4091.

https://github.com/microsoft/LightGBM/blob/2e54d5f518a4c7db721483347bec6676a8d20c89/swig/ChunkedArray_API_extensions.i#L16-L21

Yes @StrikerRUS, I saw this source comment. I'm finding a way to avoid the SWIG issue.

  • Is there any tasks are working on this issue?
  • I use SynapseML to run LightGBM. First build lightgbm JAVA wrapper once, then modify the error in built source from long long const* to long int const*, then rebuild, it will be success. SynapseML can use the built package to run, I am not sure it is a temp solution for test. Could you raise some suggesion for this issue?

Hi @imatiach-msft , @AlbertoEAF, @svotaw , could you help for this?

Hi @imatiach-msft , @AlbertoEAF, @svotaw, @guolinke Could you explain my doubt for SWIG part?

LightGBM is using the SWIG's stdint.i for int64_t, https://github.com/microsoft/LightGBM/blob/952458a9e0890825634194905e919bd772551b67/swig/lightgbmlib.i#L21

the defination for int64_t in SWIG need to enable SWIGWORDSIZE64 when Word size is 64-bit, https://github.com/swig/swig/blob/ad1688055dea4f07cd4023d5e0301d97f7759151/Lib/stdint.i#L19-L24

So for default, in LightGBM's swig part, int64_t is being long long(SWIGWORDSIZE64 not defined), Is that expected?Or unsuitable?

junpeng0715 avatar Sep 11 '22 15:09 junpeng0715

Hi guys, Is there any update on this ticket?

junpeng0715 avatar Sep 21 '22 12:09 junpeng0715

CLA assistant check
Thank you for your submission, we really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.

:x: junpeng0715 sign now
You have signed the CLA already but the status is still pending? Let us recheck it.

ghost avatar Sep 22 '22 05:09 ghost

@junpeng0715 sorry for the delay. I check with @shiyu1994 , and confirmed the int64_t would affect the speed of the data partition algorithm. So for now, the solution is "make it a flag (define in CMake) in the compilation.", and users need to recompile if they need the int64_t version.

guolinke avatar Oct 09 '22 05:10 guolinke

@junpeng0715 sorry for the delay. I check with @shiyu1994 , and confirmed the int64_t would affect the speed of the data partition algorithm. So for now, the solution is "make it a flag (define in CMake) in the compilation.", and users need to recompile if they need the int64_t version.

Thank you so mush! Ok, will do that!

junpeng0715 avatar Oct 11 '22 08:10 junpeng0715

I raised a new ticket for "make it a flag (define in CMake) in the compilation." #5540

junpeng0715 avatar Oct 13 '22 07:10 junpeng0715

Based on https://github.com/microsoft/LightGBM/pull/5540#issuecomment-1281741251, #5540 replaces this PR. I'm going to close this one.

jameslamb avatar Oct 18 '22 03:10 jameslamb

This pull request has been automatically locked since there has not been any recent activity since it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues including a reference to this.

github-actions[bot] avatar Nov 15 '23 00:11 github-actions[bot]