LightGBM
[WIP] support dataset rows more than max(int32_t)
Description
This is a WIP PR to allow datasets with more than max(int32_t) rows.
- Change `data_size_t` to `int64_t`.
- Change `int32_t` to `int64_t` in the related places.
- Modify the test code to allow int64 datasets.
- Change the mismatched type of `ncol` between `c_api.h` (`int`) and `c_api.cpp` (`int32_t`):
  https://github.com/microsoft/LightGBM/blob/2b8fe8b4bdc00a2611442fdee4c45316f08b1c4b/include/LightGBM/c_api.h#L1221-L1231
- Change the thread-run logic in `Threading.h`: when `nblock = 0`, set `left_cnt = 0` (see the sketch after this list):
  https://github.com/microsoft/LightGBM/blob/2b8fe8b4bdc00a2611442fdee4c45316f08b1c4b/include/LightGBM/utils/threading.h#L162-L168
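To make the last bullet concrete, below is a minimal sketch of the guard it describes. The function and its parameters are illustrative, not the actual `Threading.h` code; the point is only that with a widened `data_size_t`, the empty-input case (`nblock = 0`) must report `left_cnt = 0` rather than an undefined boundary.

```cpp
#include <algorithm>
#include <cstdint>

// Illustrative only: the row-index type widened as proposed in this PR.
typedef int64_t data_size_t;

// Hypothetical partition helper mirroring the guard described above: split
// [0, cnt) into up to num_threads blocks and return the boundary of the
// first ("left") block. When cnt == 0 there are no blocks (nblock == 0),
// so the left count is reported as 0 instead of an uninitialized value.
static data_size_t PartitionLeftCount(int num_threads, data_size_t cnt) {
  int nblock = static_cast<int>(std::min<data_size_t>(num_threads, cnt));
  if (nblock <= 0) {
    return 0;  // the change described in the last bullet: left_cnt = 0
  }
  data_size_t block_size = (cnt + nblock - 1) / nblock;  // ceiling division
  return std::min<data_size_t>(block_size, cnt);
}
```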
Additional Comments
- This PR needs review from experienced developers; any suggestions are welcome.
- When I build the Java wrapper (SWIG), I get two errors. Build environment:
  - Ubuntu: 20.04
  - SWIG: 4.0.2
  - gcc: GNU 9.4.0
  - g++: GNU 9.4.0
  - JVM: 1.8.0_342

  It seems that SWIG converts the parameter declared as `const int64_t*` to `long long const*`. Could you give some suggestions for this error? (A minimal illustration of the type mismatch follows the build log below.)
[ 96%] Building CXX object CMakeFiles/_lightgbm_swig.dir/java/lightgbmlibJAVA_wrap.cxx.o
/home/ubuntu/projects/LightGBM/build/java/lightgbmlibJAVA_wrap.cxx: In function ‘jint Java_com_microsoft_ml_lightgbm_lightgbmlibJNI_LGBM_1DatasetCreateFromSampledColumn(JNIEnv*, jclass, jlong, jlong, jint, jlong, jlong, jlong, jlong, jstring, jlong)’:
/home/ubuntu/projects/LightGBM/build/java/lightgbmlibJAVA_wrap.cxx:1285:68: error: invalid conversion from ‘const long long int*’ to ‘const int64_t*’ {aka ‘const long int*’} [-fpermissive]
1285 | result = (int)LGBM_DatasetCreateFromSampledColumn(arg1,arg2,arg3,(long long const *)arg4,arg5,arg6,arg7,(char const *)arg8,arg9);
| ^~~~~~~~~~~~~~~~~~~~~~~
| |
| const long long int*
In file included from /home/ubuntu/projects/LightGBM/build/java/lightgbmlibJAVA_wrap.cxx:239:
/home/ubuntu/projects/LightGBM/include/../include/LightGBM/c_api.h:130:74: note: initializing argument 4 of ‘int LGBM_DatasetCreateFromSampledColumn(double**, int**, int32_t, const int64_t*, int64_t, int64_t, int64_t, const char*, void**)’
130 | const int64_t* num_per_col,
| ~~~~~~~~~~~~~~~^~~~~~~~~~~
/home/ubuntu/projects/LightGBM/build/java/lightgbmlibJAVA_wrap.cxx: In function ‘jint Java_com_microsoft_ml_lightgbm_lightgbmlibJNI_LGBM_1DatasetGetSubset(JNIEnv*, jclass, jlong, jlong, jlong, jstring, jlong)’:
/home/ubuntu/projects/LightGBM/build/java/lightgbmlibJAVA_wrap.cxx:1561:44: error: invalid conversion from ‘const long long int*’ to ‘const int64_t*’ {aka ‘const long int*’} [-fpermissive]
1561 | result = (int)LGBM_DatasetGetSubset(arg1,(long long const *)arg2,arg3,(char const *)arg4,arg5);
| ^~~~~~~~~~~~~~~~~~~~~~~
| |
| const long long int*
In file included from /home/ubuntu/projects/LightGBM/build/java/lightgbmlibJAVA_wrap.cxx:239:
/home/ubuntu/projects/LightGBM/include/../include/LightGBM/c_api.h:316:60: note: initializing argument 2 of ‘int LGBM_DatasetGetSubset(DatasetHandle, const int64_t*, int64_t, const char*, void**)’
316 | const int64_t* used_row_indices,
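For context on the errors above: on an LP64 Linux toolchain such as the one in this log, glibc defines `int64_t` as `long`, while the SWIG-generated cast uses `long long`; the two are distinct types of the same width, so the pointer conversion is rejected. A minimal stand-alone illustration, assuming an LP64 platform:

```cpp
#include <cstdint>

// On LP64 Linux, int64_t is 'long int'. 'long long int' has the same size
// but is a distinct type, so pointers to the two do not convert implicitly.
void TakesInt64(const int64_t* /*p*/) {}

int main() {
  const long long* q = nullptr;
  // TakesInt64(q);  // error: invalid conversion from 'const long long int*'
  //                 //        to 'const int64_t*' {aka 'const long int*'}
  TakesInt64(reinterpret_cast<const int64_t*>(q));  // an explicit cast compiles
  return 0;
}
```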
will this affect the training speed? any benchmarks?
> will this affect the training speed? any benchmarks?

@guolinke
- I found a benchmark in your repo, boosting_tree_benchmarks. Is it fine to use it to confirm the training speed?
- Can you give me some feedback on the SWIG question I raised? Or who is the main developer for the SWIG part whom I should ask for advice?
yeah, it is fine to benchmark.
@imatiach-msft, @AlbertoEAF, @svotaw, can you help with the SWIG part?
> It seems that SWIG converts the parameter declared as `const int64_t*` to `long long const*`. Could you give some suggestions for this error?

I believe this is related: #4091.
> yeah, it is fine to benchmark.
> @imatiach-msft, @AlbertoEAF, @svotaw, can you help with the SWIG part?

@guolinke Tested the Higgs dataset on an AWS instance: r5.2xlarge (8 vCPUs / 64 GiB memory).

| # | Master Branch | Modified |
|---|---|---|
| Accuracy | 310.827840 seconds, iteration 500, AUC: 0.845874 | 427.753441 seconds, iteration 500, AUC: 0.845874 |
| Speed | 294.647278 seconds, iteration 500 | 303.744985 seconds, iteration 500 |
> It seems that SWIG converts the parameter declared as `const int64_t*` to `long long const*`. Could you give some suggestions for this error?
>
> I believe this is related: #4091.

https://github.com/microsoft/LightGBM/blob/2e54d5f518a4c7db721483347bec6676a8d20c89/swig/ChunkedArray_API_extensions.i#L16-L21

Yes @StrikerRUS, I saw this source comment. I'm finding a way to avoid the SWIG issue.
- Are there any tasks already in progress for this issue?
- I use SynapseML to run LightGBM. I first build the LightGBM Java wrapper once, then change the failing cast in the generated source from `long long const*` to `long int const*` and rebuild; that build succeeds, and SynapseML can run with the resulting package. I am not sure whether this is an acceptable temporary solution for testing. Could you give some suggestions for this issue?
@junpeng0715 given the performance loss, I think a better solution is to make the long indices an option, e.g. make it a flag (define in CMake) in the compilation.
> @junpeng0715 given the performance loss, I think a better solution is to make the long indices an option, e.g. make it a flag (define in CMake) in the compilation.

@guolinke Re-ran the Higgs benchmark and got a new result below. It is faster this time; it seems the instance is not stable. Anyway, I also agree with you: it's fine to add an option.

| # | Master Branch | Modified |
|---|---|---|
| Accuracy | 309.241788 seconds, iteration 500, AUC: 0.845874 | 318.976365 seconds, iteration 500, AUC: 0.845874 |
| Speed | 399.465431 seconds, iteration 500 | 304.222913 seconds, iteration 500 |
> e.g. make it a flag (define in CMake) in the compilation.

It feels like I should change the modified `int64_t` back to `data_size_t`, then add a CMake option to switch the underlying type of `data_size_t`.
But how should the APIs (C, Python, etc.) be handled? (See the sketch below for one possible shape of the switch.)
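A minimal sketch of the kind of compile-time switch being discussed, assuming a hypothetical CMake-defined macro (here called `USE_INT64_DATA_SIZE`; the name and placement are illustrative and not part of this PR):

```cpp
#include <cstdint>

// Hypothetical compile-time switch for the row-index type. A CMake option
// could add -DUSE_INT64_DATA_SIZE to the compile definitions when enabled.
#ifdef USE_INT64_DATA_SIZE
typedef int64_t data_size_t;  // allows more than max(int32_t) rows, at some cost
#else
typedef int32_t data_size_t;  // current default
#endif
```

The open question raised above remains the exported C API: its signatures are consumed by the Python and SWIG wrappers, so they would either have to keep fixed-width `int64_t` parameters regardless of the flag, or each wrapper would have to be built against a matching configuration.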
Hi @imatiach-msft, @AlbertoEAF, @svotaw, could you help with the SWIG issue discussed above?
cc @shiyu1994 to confirm the performance test.
Hi @imatiach-msft, @AlbertoEAF, @svotaw, @guolinke, could you address my question about the SWIG part?
LightGBM uses SWIG's `stdint.i` for `int64_t`: https://github.com/microsoft/LightGBM/blob/952458a9e0890825634194905e919bd772551b67/swig/lightgbmlib.i#L21
SWIG's definition of `int64_t` requires `SWIGWORDSIZE64` to be enabled when the word size is 64-bit: https://github.com/swig/swig/blob/ad1688055dea4f07cd4023d5e0301d97f7759151/Lib/stdint.i#L19-L24
So by default, in LightGBM's SWIG part, `int64_t` becomes `long long` (`SWIGWORDSIZE64` is not defined). Is that expected, or is it unsuitable?
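As a rough C++ analogue of the quoted `stdint.i` selection (illustrative only; the real file is SWIG interface code, and `swig_int64_t` is just a stand-in name):

```cpp
// Sketch of how SWIG's stdint.i picks the C type behind int64_t: without
// SWIGWORDSIZE64 it falls back to 'long long', which on LP64 Linux is a
// different type from glibc's int64_t ('long'), producing the cast mismatch
// seen in the build log earlier in this thread.
#if defined(SWIGWORDSIZE64)
typedef long int swig_int64_t;       // matches int64_t on LP64 platforms
#else
typedef long long int swig_int64_t;  // SWIG's default choice
#endif
```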
Hi guys, is there any update on this ticket?
@junpeng0715 sorry for the delay. I checked with @shiyu1994 and confirmed that the `int64_t` change would affect the speed of the data partition algorithm. So for now, the solution is to "make it a flag (define in CMake) in the compilation", and users will need to recompile if they need the `int64_t` version.
> @junpeng0715 sorry for the delay. I checked with @shiyu1994 and confirmed that the `int64_t` change would affect the speed of the data partition algorithm. So for now, the solution is to "make it a flag (define in CMake) in the compilation", and users will need to recompile if they need the `int64_t` version.

Thank you so much! OK, will do that!
I raised a new ticket for "make it a flag (define in CMake) in the compilation": #5540
Based on https://github.com/microsoft/LightGBM/pull/5540#issuecomment-1281741251, #5540 replaces this PR. I'm going to close this one.
This pull request has been automatically locked since there has not been any recent activity since it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues including a reference to this.