LightGBM [WIP] support dataset rows more then max(int32

Description

This is a WIP PR to allow more than MAX(int32_t) of dataset rows.

change data_size_t to int64_t
change int32_t to int64_t for related places
modify test code to allow int64 dataset.
change unmatch type in c_api.h(int) and c_api.cpp(int32_t) for ncol https://github.com/microsoft/LightGBM/blob/2b8fe8b4bdc00a2611442fdee4c45316f08b1c4b/include/LightGBM/c_api.h#L1221-L1231
change the thread run logic in Threading.h, when nblock = 0, set left_cnt = 0 https://github.com/microsoft/LightGBM/blob/2b8fe8b4bdc00a2611442fdee4c45316f08b1c4b/include/LightGBM/utils/threading.h#L162-L168

Additional Comments

Need some experienced developers to review this PR, and make some suggestions.
When I build in java wrapper(swig), I got two errors,

  ubuntu: 20.04
  SWIG:4.0.2
  gcc: GNU 9.4.0
  G++: GNU 9.4.0
  JVM: 1.8.0_342

It seem that SWIG will convert the parameter which defined as const int64_t* to long long const*, could you give some suggestions for this error?

 [ 96%] Building CXX object CMakeFiles/_lightgbm_swig.dir/java/lightgbmlibJAVA_wrap.cxx.o
/home/ubuntu/projects/LightGBM/build/java/lightgbmlibJAVA_wrap.cxx: In function ‘jint Java_com_microsoft_ml_lightgbm_lightgbmlibJNI_LGBM_1DatasetCreateFromSampledColumn(JNIEnv*, jclass, jlong, jlong, jint, jlong, jlong, jlong, jlong, jstring, jlong)’:
/home/ubuntu/projects/LightGBM/build/java/lightgbmlibJAVA_wrap.cxx:1285:68: error: invalid conversion from ‘const long long int*’ to ‘const int64_t*’ {aka ‘const long int*’} [-fpermissive]
 1285 |   result = (int)LGBM_DatasetCreateFromSampledColumn(arg1,arg2,arg3,(long long const *)arg4,arg5,arg6,arg7,(char const *)arg8,arg9);
      |                                                                    ^~~~~~~~~~~~~~~~~~~~~~~
      |                                                                    |
      |                                                                    const long long int*
In file included from /home/ubuntu/projects/LightGBM/build/java/lightgbmlibJAVA_wrap.cxx:239:
/home/ubuntu/projects/LightGBM/include/../include/LightGBM/c_api.h:130:74: note:   initializing argument 4 of ‘int LGBM_DatasetCreateFromSampledColumn(double**, int**, int32_t, const int64_t*, int64_t, int64_t, int64_t, const char*, void**)’
  130 |                                                           const int64_t* num_per_col,
      |                                                           ~~~~~~~~~~~~~~~^~~~~~~~~~~
/home/ubuntu/projects/LightGBM/build/java/lightgbmlibJAVA_wrap.cxx: In function ‘jint Java_com_microsoft_ml_lightgbm_lightgbmlibJNI_LGBM_1DatasetGetSubset(JNIEnv*, jclass, jlong, jlong, jlong, jstring, jlong)’:
/home/ubuntu/projects/LightGBM/build/java/lightgbmlibJAVA_wrap.cxx:1561:44: error: invalid conversion from ‘const long long int*’ to ‘const int64_t*’ {aka ‘const long int*’} [-fpermissive]
 1561 |   result = (int)LGBM_DatasetGetSubset(arg1,(long long const *)arg2,arg3,(char const *)arg4,arg5);
      |                                            ^~~~~~~~~~~~~~~~~~~~~~~
      |                                            |
      |                                            const long long int*
In file included from /home/ubuntu/projects/LightGBM/build/java/lightgbmlibJAVA_wrap.cxx:239:
/home/ubuntu/projects/LightGBM/include/../include/LightGBM/c_api.h:316:60: note:   initializing argument 2 of ‘int LGBM_DatasetGetSubset(DatasetHandle, const int64_t*, int64_t, const char*, void**)’
  316 |                                             const int64_t* used_row_indices,

Aug 31 '22 05:08 junpeng0715

will this affect the training speed? any benchmarks?

Sep 01 '22 15:09 guolinke

will this affect the training speed? any benchmarks?

@guolinke

I found a benchmarks in you repo, boosting_tree_benchmarks, is it fine to confirm training speed?
Can you give me some feedback on the question for SWIG that I raised? Or who is the main developer for SWIG part to ask for advice.

Sep 02 '22 01:09 junpeng0715

yeah, it is fine to benchmark.

@imatiach-msft , @AlbertoEAF, @svotaw can you help for SWIG ?

Sep 02 '22 03:09 guolinke

It seem that SWIG will convert the parameter which defined as const int64_t* to long long const*, could you give some suggestions for this error?

I believe this is related: #4091.

Sep 04 '22 15:09 StrikerRUS

yeah, it is fine to benchmark.

@imatiach-msft , @AlbertoEAF, @svotaw can you help for SWIG ?

@guolinke Tested Higgs Dataset in AWS Instrance : r5.2xlarge (vCPUs 8/Memory 64GiB)

#	Master Branch	Modified
Accuracy	310.827840 seconds, iteration 500, AUC:0.845874	427.753441 seconds, iteration 500, AUC:0.845874
Speed	294.647278 seconds, iteration 500	303.744985 seconds, iteration 500

Sep 05 '22 02:09 junpeng0715

It seem that SWIG will convert the parameter which defined as const int64_t* to long long const*, could you give some suggestions for this error?

I believe this is related: #4091.

https://github.com/microsoft/LightGBM/blob/2e54d5f518a4c7db721483347bec6676a8d20c89/swig/ChunkedArray_API_extensions.i#L16-L21 Yes @StrikerRUS, I saw this source comment. I'm finding a way to avoid the SWIG issue.

Is there any tasks are working on this issue?
I use SynapseML to run LightGBM. First build lightgbm JAVA wrapper once, then modify the error in built source from long long const* to long int const*, then rebuild, it will be success. SynapseML can use the built package to run, I am not sure it is a temp solution for test. Could you raise some suggesion for this issue?

Sep 05 '22 03:09 junpeng0715

@junpeng0715 given the performance loss, I think a better solution is to make the long indices an option, e.g. make it a flag (define in CMake) in the compilation.

Sep 05 '22 13:09 guolinke

@junpeng0715 given the performance loss, I think a better solution is to make the long indices an option, e.g. make it a flag (define in CMake) in the compilation.

@guolinke Re-ran the Higgs benchmark, and got a new result below. It is faster this time.It seems the instance is not stable. Anyway, I also agree with you, it's fine to add an option.

#	Master Branch	Modified
Accuracy	309.241788 seconds, iteration 500, AUC:0.845874	318.976365 seconds, iteration 500, AUC:0.845874
Speed	399.465431 seconds, iteration 500	304.222913 seconds, iteration 500

e.g. make it a flag (define in CMake) in the compilation.

It feels like changing the modified int64_t to data_size_t , then add a CMAKE option to switch data_size_t . But how to handle the APIs(C, Python, etc)?

Sep 05 '22 15:09 junpeng0715

It seem that SWIG will convert the parameter which defined as const int64_t* to long long const*, could you give some suggestions for this error?

I believe this is related: #4091.

https://github.com/microsoft/LightGBM/blob/2e54d5f518a4c7db721483347bec6676a8d20c89/swig/ChunkedArray_API_extensions.i#L16-L21

Yes @StrikerRUS, I saw this source comment. I'm finding a way to avoid the SWIG issue.

Is there any tasks are working on this issue?

I use SynapseML to run LightGBM. First build lightgbm JAVA wrapper once, then modify the error in built source from long long const* to long int const*, then rebuild, it will be success. SynapseML can use the built package to run, I am not sure it is a temp solution for test. Could you raise some suggesion for this issue?

Hi @imatiach-msft , @AlbertoEAF, @svotaw , could you help for this?

Sep 06 '22 01:09 junpeng0715

cc @shiyu1994 to confirm the performance test.

Sep 07 '22 14:09 guolinke

It seem that SWIG will convert the parameter which defined as const int64_t* to long long const*, could you give some suggestions for this error?

I believe this is related: #4091.

https://github.com/microsoft/LightGBM/blob/2e54d5f518a4c7db721483347bec6676a8d20c89/swig/ChunkedArray_API_extensions.i#L16-L21

Yes @StrikerRUS, I saw this source comment. I'm finding a way to avoid the SWIG issue.

Is there any tasks are working on this issue?

I use SynapseML to run LightGBM. First build lightgbm JAVA wrapper once, then modify the error in built source from long long const* to long int const*, then rebuild, it will be success. SynapseML can use the built package to run, I am not sure it is a temp solution for test. Could you raise some suggesion for this issue?

Hi @imatiach-msft , @AlbertoEAF, @svotaw , could you help for this?

Hi @imatiach-msft , @AlbertoEAF, @svotaw, @guolinke Could you explain my doubt for SWIG part?

LightGBM is using the SWIG's stdint.i for int64_t, https://github.com/microsoft/LightGBM/blob/952458a9e0890825634194905e919bd772551b67/swig/lightgbmlib.i#L21

the defination for int64_t in SWIG need to enable SWIGWORDSIZE64 when Word size is 64-bit， https://github.com/swig/swig/blob/ad1688055dea4f07cd4023d5e0301d97f7759151/Lib/stdint.i#L19-L24

So for default, in LightGBM's swig part, int64_t is being long long(SWIGWORDSIZE64 not defined), Is that expected?Or unsuitable?

Sep 11 '22 15:09 junpeng0715

Hi guys, Is there any update on this ticket?

Sep 21 '22 12:09 junpeng0715

Thank you for your submission, we really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.

:x: junpeng0715 sign now
_{You have signed the CLA already but the status is still pending? Let us recheck it.}

Sep 22 '22 05:09 ghost

@junpeng0715 sorry for the delay. I check with @shiyu1994 , and confirmed the int64_t would affect the speed of the data partition algorithm. So for now, the solution is "make it a flag (define in CMake) in the compilation.", and users need to recompile if they need the int64_t version.

Oct 09 '22 05:10 guolinke

@junpeng0715 sorry for the delay. I check with @shiyu1994 , and confirmed the int64_t would affect the speed of the data partition algorithm. So for now, the solution is "make it a flag (define in CMake) in the compilation.", and users need to recompile if they need the int64_t version.

Thank you so mush! Ok, will do that!

Oct 11 '22 08:10 junpeng0715

I raised a new ticket for "make it a flag (define in CMake) in the compilation." #5540

Oct 13 '22 07:10 junpeng0715

Based on https://github.com/microsoft/LightGBM/pull/5540#issuecomment-1281741251, #5540 replaces this PR. I'm going to close this one.

Oct 18 '22 03:10 jameslamb

This pull request has been automatically locked since there has not been any recent activity since it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues including a reference to this.

Nov 15 '23 00:11 github-actions[bot]

LightGBM
LightGBM copied to clipboard

[WIP] support dataset rows more then max(int32_t)

Description

Additional Comments

LightGBM LightGBM copied to clipboard

[WIP] support dataset rows more then max(int32_t)

Description

Additional Comments

LightGBM
LightGBM copied to clipboard