SynapseML
feat: [LightGBM] support dataset rows more than max(int32_t)
Related Issues/PRs
What changes are proposed in this pull request?
Allow datasets with more than max(int32_t) rows to be used with LightGBM via SynapseML.
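For background on why this change is needed: once a row count passes `Int.MaxValue`, it can only be held in a `Long`, and narrowing it back to `Int` wraps around silently. A minimal, self-contained Scala sketch (illustration only, not SynapseML code):

```scala
object RowCountOverflow {
  // max(int32_t) == Int.MaxValue == 2147483647
  val maxInt32: Long = Int.MaxValue.toLong
  // One row past the int32 limit is representable only as a Long
  val bigRowCount: Long = maxInt32 + 1L

  def main(args: Array[String]): Unit = {
    // JVM narrowing wraps to a negative value instead of failing
    // prints: Long: 2147483648, narrowed Int: -2147483648
    println(s"Long: $bigRowCount, narrowed Int: ${bigRowCount.toInt}")
  }
}
```

This is why row counts passed through the bindings have to be widened to 64-bit types end to end rather than cast back down.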
How is this patch tested?
- [X] I tested this in our Databricks environments using a dataset with more than max(int32) rows.
Does this PR change any dependencies?
- [X] No. You can skip this section.
- [ ] Yes. Make sure the dependencies are resolved correctly, and list changes here.
Does this PR add a new feature? If so, have you added samples on website?
- [X] No. You can skip this section.
- [ ] Yes. Make sure you have added samples following below steps.
Hey @junpeng0715 :wave:! Thank you so much for contributing to our repository :raised_hands:. Someone from the SynapseML team will be reviewing this pull request soon.
We use semantic commit messages to streamline the release process. Before your pull request can be merged, you should make sure your first commit and PR title start with a semantic prefix. This helps us to create release messages and credit you for your hard work!
Examples of commit messages with semantic prefixes:
- fix: Fix LightGBM crashes with empty partitions
- feat: Make HTTP on Spark back-offs configurable
- docs: Update Spark Serving usage
- build: Add codecov support
- perf: improve LightGBM memory usage
- refactor: make python code generation rely on classes
- style: Remove nulls from CNTKModel
- test: Add test coverage for CNTKModel
To test your commit locally, please follow our guide on building from source. Check out the developer guide for additional guidance on testing your change.
/azp run
Azure Pipelines successfully started running 1 pipeline(s).
Codecov Report
:white_check_mark: All modified and coverable lines are covered by tests.
:white_check_mark: Project coverage is 20.65%. Comparing base (448f6b7) to head (45e427b).
:warning: Report is 460 commits behind head on master.
:exclamation: There is a different number of reports uploaded between BASE (448f6b7) and HEAD (45e427b). Click for more details.
HEAD has 28 uploads less than BASE
| | BASE (448f6b7) | HEAD (45e427b) |
| --- | --- | --- |
| Uploads | 39 | 11 |
Additional details and impacted files
@@ Coverage Diff @@
## master #1684 +/- ##
===========================================
- Coverage 80.63% 20.65% -59.99%
===========================================
Files 272 244 -28
Lines 14224 11971 -2253
Branches 732 570 -162
===========================================
- Hits 11470 2473 -8997
- Misses 2754 9498 +6744
:umbrella: View full report in Codecov by Sentry.
hello. thanks for this. I will review when I get a chance, but a couple of points:
- We are moving to a totally different method of loading Datasets that hopefully will be checked in here in a few weeks (in last stages of testing), and DatasetAggregator will basically become a deprecated file. We are moving to a streaming way of loading Datasets that does not take any memory (DatasetAggregator ends up using a ton of memory, and the new method uses more than an order of magnitude less). I'm not sure what these changes would look like for the new code.
- Note that LightGBM itself does not fully support Datasets that have more than max(int_32) rows (I'm on a PR there for someone trying to improve that to max(int_64), but currently they are planning to make it a build option).
Hi @svotaw, thank you!
> hello. thanks for this. I will review when I get a chance, but a couple of points:
> - We are moving to a totally different method of loading Datasets that hopefully will be checked in here in a few weeks (in last stages of testing), and DatasetAggregator will basically become a deprecated file. We are moving to a streaming way of loading Datasets that does not take any memory (DatasetAggregator ends up using a ton of memory, and the new method uses more than an order of magnitude less). I'm not sure what these changes would look like for the new code.
Is there an issue ticket to refer to?
> - Note that LightGBM itself does not fully support Datasets that have more than max(int_32) rows (I'm on a PR there for someone trying to improve that to max(int_64), but currently they are planning to make it a build option).
Yes, that is the PR I made in LightGBM repo. https://github.com/microsoft/LightGBM/pull/5540
Hi @svotaw
> - We are moving to a totally different method of loading Datasets that hopefully will be checked in here in a few weeks (in last stages of testing), and DatasetAggregator will basically become a deprecated file. We are moving to a streaming way of loading Datasets that does not take any memory (DatasetAggregator ends up using a ton of memory, and the new method uses more than an order of magnitude less). I'm not sure what these changes would look like for the new code.
Will that support dataset rows over max(int32)?
@junpeng0715 https://github.com/microsoft/SynapseML/pull/1580 is the PR that will enable using "streaming" mode for lightgbm. It is currently waiting for a bug fix to sparse data that is pending in this LightGBM PR: https://github.com/microsoft/LightGBM/pull/5551.
I have not specifically looked into int64 support for the new code. We try to make it so, but not all the LightGBM APIs support it as you know.
You have errors in the pipeline:
[error] LightGBMUtils.validate(lightgbmlib.LGBM_DatasetGetNumData(datasetPtr, numDataPtr), "DatasetGetNumData")
[error] ^
[error] /home/vsts/work/1/s/lightgbm/src/main/scala/com/microsoft/azure/synapse/ml/lightgbm/dataset/LightGBMDataset.scala:118:77: type mismatch;
[error]  found : Long
[error]  required: Int
[error] lightgbmlib.LGBM_DatasetSetField(datasetPtr, fieldName, colAsVoidPtr, numRows, data32bitType),
[error] ^
[error] /home/vsts/work/1/s/lightgbm/src/main/scala/com/microsoft/azure/synapse/ml/lightgbm/dataset/LightGBMDataset.scala:152:77: type mismatch;
[error]  found : Long
[error]  required: Int
[error] lightgbmlib.LGBM_DatasetSetField(datasetPtr, fieldName, colAsVoidPtr, numRows, data64bitType),
[error] ^
[error] /home/vsts/work/1/s/lightgbm/src/main/scala/com/microsoft/azure/synapse/ml/lightgbm/dataset/LightGBMDataset.scala:166:79: type mismatch;
[error]  found : Long
[error]  required: Int
[error] lightgbmlib.LGBM_DatasetSetField(datasetPtr, fieldName, colAsVoidPtr, numRows, data32bitType),
[error] ^
[error] 7 errors found
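The type mismatches above come from passing 64-bit row counts (`Long`) into SWIG bindings whose generated signatures still take `Int`; they go away once the bindings from the dependent LightGBM PR are regenerated. As an illustration only (this helper is hypothetical, not part of the PR), a guarded narrowing at the binding boundary makes the int32 limit explicit instead of wrapping silently:

```scala
object RowCountGuard {
  // Hypothetical helper: fail fast when a 64-bit row count cannot be
  // narrowed to the 32-bit count a native binding expects.
  def toInt32RowCount(numRows: Long): Int = {
    require(numRows >= 0L && numRows <= Int.MaxValue.toLong,
      s"Dataset row count $numRows does not fit in int32")
    numRows.toInt // safe: guarded above
  }
}
```

With regenerated 64-bit bindings such a guard becomes unnecessary for the widened calls, but a check like this is a common pattern wherever a `Long` count still crosses into a 32-bit native API.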
Yes the dependency change is https://github.com/microsoft/LightGBM/pull/5540.
Hi @svotaw, thank you very much.
As you said, the "streaming" mode will be used in the future; do we still need to move forward with this PR?
/azp run
Azure Pipelines successfully started running 1 pipeline(s).
Compilation failed, please confirm this PR compiles locally
[error] /home/vsts/work/1/s/lightgbm/src/main/scala/com/microsoft/azure/synapse/ml/lightgbm/dataset/DatasetAggregator.scala:532:35: type mismatch;
[error]  found : com.microsoft.ml.lightgbm.SWIGTYPE_p_void
[error]  required: com.microsoft.ml.lightgbm.SWIGTYPE_p_int
[error] lightgbmlib.int_to_voidp_ptr(indexes.array),
[error] ^
[error] /home/vsts/work/1/s/lightgbm/src/main/scala/com/microsoft/azure/synapse/ml/lightgbm/dataset/LightGBMDataset.scala:29:61: type mismatch;
[error]  found : com.microsoft.ml.lightgbm.SWIGTYPE_p_long_long
[error]  required: com.microsoft.ml.lightgbm.SWIGTYPE_p_int
[error] lightgbmlib.LGBM_DatasetGetField(datasetPtr, fieldName, tmpOutLenPtr, outArray, outTypePtr)
[error] ^
[error] /home/vsts/work/1/s/lightgbm/src/main/scala/com/microsoft/azure/synapse/ml/lightgbm/dataset/LightGBMDataset.scala:57:75: type mismatch;
[error]  found : com.microsoft.ml.lightgbm.SWIGTYPE_p_long_long
[error]  required: com.microsoft.ml.lightgbm.SWIGTYPE_p_int
[error] LightGBMUtils.validate(lightgbmlib.LGBM_DatasetGetNumData(datasetPtr, numDataPtr), "DatasetGetNumData")
[error] ^
[error] /home/vsts/work/1/s/lightgbm/src/main/scala/com/microsoft/azure/synapse/ml/lightgbm/dataset/LightGBMDataset.scala:118:77: type mismatch;
[error]  found : Long
[error]  required: Int
[error] lightgbmlib.LGBM_DatasetSetField(datasetPtr, fieldName, colAsVoidPtr, numRows, data32bitType),
[error] ^
[error] /home/vsts/work/1/s/lightgbm/src/main/scala/com/microsoft/azure/synapse/ml/lightgbm/dataset/LightGBMDataset.scala:152:77: type mismatch;
[error]  found : Long
[error]  required: Int
[error] lightgbmlib.LGBM_DatasetSetField(datasetPtr, fieldName, colAsVoidPtr, numRows, data64bitType),
[error] ^
[error] /home/vsts/work/1/s/lightgbm/src/main/scala/com/microsoft/azure/synapse/ml/lightgbm/dataset/LightGBMDataset.scala:166:79: type mismatch;
[error]  found : Long
[error]  required: Int
[error] lightgbmlib.LGBM_DatasetSetField(datasetPtr, fieldName, colAsVoidPtr, numRows, data32bitType),
[error] ^
[error] 7 errors found
[error] (lightgbm / Compile / compileIncremental) Compilation failed
[error] Total time: 17 s, completed Oct 26, 2022 4:08:19 PM
##[error]Script failed with exit code: 1
> Hi @svotaw, thank you very much. As you said, the "streaming" mode will be used in the future; do we still need to move forward with this PR?
There will be 2 modes of execution: the older "bulk" mode, which uses DatasetAggregator, and the newer "streaming" mode, which we hope to make the default at some point within the next semester. If you plan to use the older "bulk" mode (e.g. in the next few months) and you want this feature, let me know and I'll review it. Otherwise, you can wait until we check in the newer streaming version and then see what changes need to be made for it to support int64. This is blocked at the moment until we get a bug fix into LightGBM: https://github.com/microsoft/LightGBM/pull/5551. That PR will hopefully be approved relatively soon, since it's only a few lines and touches only APIs we added ourselves.
Also, what is the plan for your LightGBM PR? For SynapseML, we build our own lightgbm Maven package from the top of the microsoft/LightGBM repo. This is an involved process and we don't do it often. I assume your PR would have to be checked in, and a new SynapseML lightgbm package built, before we could even build this?
Hi @mhamilton723, thank you for running the pipeline. This PR depends on https://github.com/microsoft/LightGBM/pull/5540, so the error is known.
> Hi @svotaw, thank you very much. As you said, the "streaming" mode will be used in the future; do we still need to move forward with this PR?

> There will be 2 modes of execution: the older "bulk" mode, which uses DatasetAggregator, and the newer "streaming" mode, which we hope to make the default at some point within the next semester. If you plan to use the older "bulk" mode (e.g. in the next few months) and you want this feature, let me know and I'll review it. Otherwise, you can wait until we check in the newer streaming version and then see what changes need to be made for it to support int64. This is blocked at the moment until we get a bug fix into LightGBM microsoft/LightGBM#5551. That PR will hopefully be approved relatively soon, since it's only a few lines and touches only APIs we added ourselves.
> Also, what is the plan for your LightGBM PR? For SynapseML, we build our own lightgbm Maven package from the top of the microsoft/LightGBM repo. This is an involved process and we don't do it often. I assume your PR would have to be checked in, and a new SynapseML lightgbm package built, before we could even build this?
I'm very glad that you can take a look at this PR, since the "bulk" mode will remain in the system for users to choose.
I can't give a clear plan yet; there is also an issue blocking CI jobs in the LightGBM repo.
Hi @svotaw, we will push the PR forward once CI passes and with @shiyu1994's review.