
feat: [LightGBM] support dataset rows more than max(int32_t)

Open junpeng0715 opened this pull request 3 years ago • 18 comments

Related Issues/PRs

microsoft/LightGBM#5540

What changes are proposed in this pull request?

Allow datasets with more than max(int32_t) rows to use LightGBM via SynapseML.
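To illustrate the motivation (hypothetical values, not SynapseML code): once a dataset's row count exceeds max(int32_t) = 2,147,483,647, forcing it into a 32-bit integer silently wraps around, so the count has to be carried as a 64-bit value end to end.

```java
// Sketch (hypothetical values, not SynapseML code): what happens when a
// row count above max(int32_t) is narrowed to a 32-bit integer.
public class RowCountOverflow {
    public static void main(String[] args) {
        long actualRows = 3_000_000_000L;  // > Integer.MAX_VALUE (2_147_483_647)
        int narrowed = (int) actualRows;   // silent two's-complement wrap-around
        System.out.println(narrowed);      // prints -1294967296, a garbage count
        System.out.println(actualRows);    // the 64-bit value keeps the true count
    }
}
```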

How is this patch tested?

  • [X] I tested in our Databricks environments using datasets with more than max(int32) rows.

Does this PR change any dependencies?

  • [X] No. You can skip this section.
  • [ ] Yes. Make sure the dependencies are resolved correctly, and list changes here.

Does this PR add a new feature? If so, have you added samples on website?

  • [X] No. You can skip this section.
  • [ ] Yes. Make sure you have added samples following below steps.

junpeng0715 avatar Oct 20 '22 05:10 junpeng0715

Hey @junpeng0715 :wave:! Thank you so much for contributing to our repository :raised_hands:. Someone from SynapseML Team will be reviewing this pull request soon.

We use semantic commit messages to streamline the release process. Before your pull request can be merged, you should make sure your first commit and PR title start with a semantic prefix. This helps us to create release messages and credit you for your hard work!

Examples of commit messages with semantic prefixes:

  • fix: Fix LightGBM crashes with empty partitions
  • feat: Make HTTP on Spark back-offs configurable
  • docs: Update Spark Serving usage
  • build: Add codecov support
  • perf: improve LightGBM memory usage
  • refactor: make python code generation rely on classes
  • style: Remove nulls from CNTKModel
  • test: Add test coverage for CNTKModel

To test your commit locally, please follow our guide on building from source. Check out the developer guide for additional guidance on testing your change.

github-actions[bot] avatar Oct 20 '22 05:10 github-actions[bot]

/azp run

mhamilton723 avatar Oct 20 '22 13:10 mhamilton723

Azure Pipelines successfully started running 1 pipeline(s).

azure-pipelines[bot] avatar Oct 20 '22 13:10 azure-pipelines[bot]

Codecov Report

:white_check_mark: All modified and coverable lines are covered by tests. :white_check_mark: Project coverage is 20.65%. Comparing base (448f6b7) to head (45e427b). :warning: Report is 460 commits behind head on master.

:exclamation: There is a different number of reports uploaded between BASE (448f6b7) and HEAD (45e427b).

HEAD has 28 fewer uploads than BASE:

  BASE (448f6b7): 39 uploads
  HEAD (45e427b): 11 uploads
Additional details and impacted files
@@             Coverage Diff             @@
##           master    #1684       +/-   ##
===========================================
- Coverage   80.63%   20.65%   -59.99%     
===========================================
  Files         272      244       -28     
  Lines       14224    11971     -2253     
  Branches      732      570      -162     
===========================================
- Hits        11470     2473     -8997     
- Misses       2754     9498     +6744     

:umbrella: View full report in Codecov by Sentry.

codecov-commenter avatar Oct 20 '22 13:10 codecov-commenter

hello. thanks for this. I will review when I get a chance, but a couple of points:

  1. We are moving to a totally different method of loading Datasets that hopefully will be checked in here in a few weeks (it is in the last stages of testing), and DatasetAggregator will basically become a deprecated file. We are moving to a streaming way of loading Datasets that requires very little memory (DatasetAggregator ends up using a ton of memory, and the new method uses more than an order of magnitude less). I'm not sure what these changes would look like for the new code.
  2. Note that LightGBM itself does not fully support Datasets that have more than max(int_32) rows (I'm on a PR there for someone trying to improve that to max(int_64), but currently they are planning to make it a build option).

svotaw avatar Oct 20 '22 16:10 svotaw
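Until int64 support lands throughout LightGBM, a caller that tracks rows as a 64-bit value but must hand them to a 32-bit API needs a checked narrowing step rather than a silent cast. A minimal sketch of that guard pattern (hypothetical helper, not the actual SynapseML or LightGBM code):

```java
// Sketch: checked narrowing for APIs that still take a 32-bit row count.
// Hypothetical helper, not actual SynapseML/LightGBM code.
public final class CheckedNarrow {
    static int rowCountAsInt(long numRows) {
        if (numRows < 0 || numRows > Integer.MAX_VALUE) {
            throw new IllegalArgumentException(
                "row count " + numRows + " does not fit in int32; "
                    + "an int64-enabled LightGBM build would be required");
        }
        return (int) numRows; // safe: value proven to fit
    }

    public static void main(String[] args) {
        System.out.println(rowCountAsInt(1_000_000L)); // fits, prints 1000000
    }
}
```

Failing loudly at the boundary is preferable to the silent wrap-around, which would otherwise surface later as a corrupted dataset size.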

Hi @svotaw, thank you!

hello. thanks for this. I will review when I get a chance, but a couple of points:

  1. We are moving to a totally different method of loading Datasets that hopefully will be checked in here in a few weeks (it is in the last stages of testing), and DatasetAggregator will basically become a deprecated file. We are moving to a streaming way of loading Datasets that requires very little memory (DatasetAggregator ends up using a ton of memory, and the new method uses more than an order of magnitude less). I'm not sure what these changes would look like for the new code.

Is there an issue ticket I can refer to?

  2. Note that LightGBM itself does not fully support Datasets that have more than max(int_32) rows (I'm on a PR there for someone trying to improve that to max(int_64), but currently they are planning to make it a build option).

Yes, that is the PR I made in the LightGBM repo: https://github.com/microsoft/LightGBM/pull/5540

junpeng0715 avatar Oct 21 '22 12:10 junpeng0715

Hi @svotaw

  1. We are moving to a totally different method of loading Datasets that hopefully will be checked in here in a few weeks (it is in the last stages of testing), and DatasetAggregator will basically become a deprecated file. We are moving to a streaming way of loading Datasets that requires very little memory (DatasetAggregator ends up using a ton of memory, and the new method uses more than an order of magnitude less). I'm not sure what these changes would look like for the new code.

Will that support dataset rows over max(int32)?

junpeng0715 avatar Oct 23 '22 12:10 junpeng0715

@junpeng0715 https://github.com/microsoft/SynapseML/pull/1580 is the PR that will enable using "streaming" mode for lightgbm. It is currently waiting for a bug fix to sparse data that is pending in this LightGBM PR: https://github.com/microsoft/LightGBM/pull/5551.

I have not specifically looked into int64 support for the new code. We try to make it so, but not all the LightGBM APIs support it as you know.

svotaw avatar Oct 23 '22 19:10 svotaw

You have errors in the pipeline:

[error]   LightGBMUtils.validate(lightgbmlib.LGBM_DatasetGetNumData(datasetPtr, numDataPtr), "DatasetGetNumData")
[error]                                                                         ^
[error] /home/vsts/work/1/s/lightgbm/src/main/scala/com/microsoft/azure/synapse/ml/lightgbm/dataset/LightGBMDataset.scala:118:77: type mismatch;
[error]  found   : Long
[error]  required: Int
[error]   lightgbmlib.LGBM_DatasetSetField(datasetPtr, fieldName, colAsVoidPtr, numRows, data32bitType),
[error]                                                                         ^
[error] /home/vsts/work/1/s/lightgbm/src/main/scala/com/microsoft/azure/synapse/ml/lightgbm/dataset/LightGBMDataset.scala:152:77: type mismatch;
[error]  found   : Long
[error]  required: Int
[error]   lightgbmlib.LGBM_DatasetSetField(datasetPtr, fieldName, colAsVoidPtr, numRows, data64bitType),
[error]                                                                         ^
[error] /home/vsts/work/1/s/lightgbm/src/main/scala/com/microsoft/azure/synapse/ml/lightgbm/dataset/LightGBMDataset.scala:166:79: type mismatch;
[error]  found   : Long
[error]  required: Int
[error]   lightgbmlib.LGBM_DatasetSetField(datasetPtr, fieldName, colAsVoidPtr, numRows, data32bitType),
[error]                                                                         ^
[error] 7 errors found

svotaw avatar Oct 23 '22 19:10 svotaw

You have errors in the pipeline:

[error]   LightGBMUtils.validate(lightgbmlib.LGBM_DatasetGetNumData(datasetPtr, numDataPtr), "DatasetGetNumData")
[error]                                                                         ^
[error] /home/vsts/work/1/s/lightgbm/src/main/scala/com/microsoft/azure/synapse/ml/lightgbm/dataset/LightGBMDataset.scala:118:77: type mismatch;
[error]  found   : Long
[error]  required: Int
[error]   lightgbmlib.LGBM_DatasetSetField(datasetPtr, fieldName, colAsVoidPtr, numRows, data32bitType),
[error]                                                                         ^
[error] /home/vsts/work/1/s/lightgbm/src/main/scala/com/microsoft/azure/synapse/ml/lightgbm/dataset/LightGBMDataset.scala:152:77: type mismatch;
[error]  found   : Long
[error]  required: Int
[error]   lightgbmlib.LGBM_DatasetSetField(datasetPtr, fieldName, colAsVoidPtr, numRows, data64bitType),
[error]                                                                         ^
[error] /home/vsts/work/1/s/lightgbm/src/main/scala/com/microsoft/azure/synapse/ml/lightgbm/dataset/LightGBMDataset.scala:166:79: type mismatch;
[error]  found   : Long
[error]  required: Int
[error]   lightgbmlib.LGBM_DatasetSetField(datasetPtr, fieldName, colAsVoidPtr, numRows, data32bitType),
[error]                                                                         ^
[error] 7 errors found

Yes the dependency change is https://github.com/microsoft/LightGBM/pull/5540.

junpeng0715 avatar Oct 24 '22 01:10 junpeng0715

Hi @svotaw Thank you very much.

As you said, the "streaming" mode will be used in the future; do we still need to move forward with this PR?

junpeng0715 avatar Oct 24 '22 02:10 junpeng0715

/azp run

mhamilton723 avatar Oct 26 '22 16:10 mhamilton723

Azure Pipelines successfully started running 1 pipeline(s).

azure-pipelines[bot] avatar Oct 26 '22 16:10 azure-pipelines[bot]

Compilation failed; please confirm this PR compiles locally:

[error] /home/vsts/work/1/s/lightgbm/src/main/scala/com/microsoft/azure/synapse/ml/lightgbm/dataset/DatasetAggregator.scala:532:35: type mismatch;
[error]  found   : com.microsoft.ml.lightgbm.SWIGTYPE_p_void
[error]  required: com.microsoft.ml.lightgbm.SWIGTYPE_p_int
[error]   lightgbmlib.int_to_voidp_ptr(indexes.array),
[error]                                ^
[error] /home/vsts/work/1/s/lightgbm/src/main/scala/com/microsoft/azure/synapse/ml/lightgbm/dataset/LightGBMDataset.scala:29:61: type mismatch;
[error]  found   : com.microsoft.ml.lightgbm.SWIGTYPE_p_long_long
[error]  required: com.microsoft.ml.lightgbm.SWIGTYPE_p_int
[error]   lightgbmlib.LGBM_DatasetGetField(datasetPtr, fieldName, tmpOutLenPtr, outArray, outTypePtr)
[error]                                                           ^
[error] /home/vsts/work/1/s/lightgbm/src/main/scala/com/microsoft/azure/synapse/ml/lightgbm/dataset/LightGBMDataset.scala:57:75: type mismatch;
[error]  found   : com.microsoft.ml.lightgbm.SWIGTYPE_p_long_long
[error]  required: com.microsoft.ml.lightgbm.SWIGTYPE_p_int
[error]   LightGBMUtils.validate(lightgbmlib.LGBM_DatasetGetNumData(datasetPtr, numDataPtr), "DatasetGetNumData")
[error]                                                                         ^
[error] /home/vsts/work/1/s/lightgbm/src/main/scala/com/microsoft/azure/synapse/ml/lightgbm/dataset/LightGBMDataset.scala:118:77: type mismatch;
[error]  found   : Long
[error]  required: Int
[error]   lightgbmlib.LGBM_DatasetSetField(datasetPtr, fieldName, colAsVoidPtr, numRows, data32bitType),
[error]                                                                         ^
[error] /home/vsts/work/1/s/lightgbm/src/main/scala/com/microsoft/azure/synapse/ml/lightgbm/dataset/LightGBMDataset.scala:152:77: type mismatch;
[error]  found   : Long
[error]  required: Int
[error]   lightgbmlib.LGBM_DatasetSetField(datasetPtr, fieldName, colAsVoidPtr, numRows, data64bitType),
[error]                                                                         ^
[error] /home/vsts/work/1/s/lightgbm/src/main/scala/com/microsoft/azure/synapse/ml/lightgbm/dataset/LightGBMDataset.scala:166:79: type mismatch;
[error]  found   : Long
[error]  required: Int
[error]   lightgbmlib.LGBM_DatasetSetField(datasetPtr, fieldName, colAsVoidPtr, numRows, data32bitType),
[error]                                                                         ^
[error] 7 errors found
[error] (lightgbm / Compile / compileIncremental) Compilation failed
[error] Total time: 17 s, completed Oct 26, 2022 4:08:19 PM
##[error]Script failed with exit code: 1

mhamilton723 avatar Oct 26 '22 16:10 mhamilton723

Hi @svotaw Thank you very much.

As you said, the "streaming" mode will be used in the future; do we still need to move forward with this PR?

There will be 2 modes of execution: the older "bulk" mode, which uses DatasetAggregator, and the newer "streaming" mode, which we hopefully will make the default at some point within the next semester. If you plan to use the older "bulk" mode (e.g. in the next few months) and you want this feature, let me know and I'll review it. Otherwise, you can wait until we check in the newer streaming version and then see what changes need to be made for it to support int64. This is blocked at the moment until we get a bug fix into LightGBM: https://github.com/microsoft/LightGBM/pull/5551. That PR will hopefully be approved relatively soon since it's only a few lines and only in APIs we added ourselves.

Also, what is the plan for your LightGBM PR? For SynapseML, we make our own lightgbm Maven package from the top of the microsoft/LightGBM repo. This is an involved process and we don't do it often. I assume your PR would have to be checked in and we'd make a new SynapseML lightgbm package before we could even build this?

svotaw avatar Oct 26 '22 17:10 svotaw

Hi @mhamilton723 Thank you for running the pipeline. This PR depends on https://github.com/microsoft/LightGBM/pull/5540, so the error is known.

junpeng0715 avatar Oct 27 '22 01:10 junpeng0715

Hi @svotaw Thank you very much. As you said, the "streaming" mode will be used in the future; do we still need to move forward with this PR?

There will be 2 modes of execution: the older "bulk" mode, which uses DatasetAggregator, and the newer "streaming" mode, which we hopefully will make the default at some point within the next semester. If you plan to use the older "bulk" mode (e.g. in the next few months) and you want this feature, let me know and I'll review it. Otherwise, you can wait until we check in the newer streaming version and then see what changes need to be made for it to support int64. This is blocked at the moment until we get a bug fix into LightGBM: microsoft/LightGBM#5551. That PR will hopefully be approved relatively soon since it's only a few lines and only in APIs we added ourselves.

Also, what is the plan for your LightGBM PR? For SynapseML, we make our own lightgbm Maven package from the top of the microsoft/LightGBM repo. This is an involved process and we don't do it often. I assume your PR would have to be checked in and we'd make a new SynapseML lightgbm package before we could even build this?

I'm very glad that you can take a look at this PR, since the "bulk" mode will always stay in the system for users to choose.

I can't lay out the plan clearly yet; there is also an issue blocking CI jobs in the LightGBM repo.

junpeng0715 avatar Oct 27 '22 10:10 junpeng0715

Hi @svotaw, we will push the PR forward as long as the CI passes, and with @shiyu1994's review.

guolinke avatar Nov 10 '22 08:11 guolinke