LightGBM
feature: Add serialization of reference dataset
Summary
This is in reference to feature request: https://github.com/microsoft/LightGBM/issues/5426
This PR adds APIs for serializing/deserializing Datasets without their data to a byte array, effectively creating a "schema" or "reference" that can be used to create other Datasets.
Implementation
The existing code for serializing Datasets to a file was refactored so that it can write to any generic BinaryWriter, whether backed by memory or by a file. The verbose serialization code is shared as much as possible by splitting the methods into Header and Data components.
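To illustrate the idea, here is a minimal sketch of a writer abstraction with separate header and data steps. It is not the code in this PR: `SketchWriter`, `MemoryWriter`, `FileWriter`, `SketchDataset`, and the `Write*`/`Serialize*` functions are hypothetical names, and the byte layout is invented for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <vector>

// Hypothetical writer interface (illustrative only, not LightGBM's class).
struct SketchWriter {
  virtual ~SketchWriter() = default;
  // Writes `bytes` bytes from `data` and returns the number written.
  virtual std::size_t Write(const void* data, std::size_t bytes) = 0;
};

// Appends everything to an in-memory byte vector.
struct MemoryWriter : SketchWriter {
  std::vector<std::uint8_t> bytes;
  std::size_t Write(const void* data, std::size_t n) override {
    const auto* p = static_cast<const std::uint8_t*>(data);
    bytes.insert(bytes.end(), p, p + n);
    return n;
  }
};

// Writes to an already opened C file handle (the caller owns the handle).
struct FileWriter : SketchWriter {
  explicit FileWriter(std::FILE* f) : file(f) { assert(f != nullptr); }
  std::size_t Write(const void* data, std::size_t n) override {
    return std::fwrite(data, 1, n, file);
  }
  std::FILE* file;
};

// Toy stand-in for a dataset: a little schema plus raw values.
struct SketchDataset {
  std::int32_t num_features;
  std::vector<double> values;
};

// Header: schema only, shared by both serialization paths.
std::size_t WriteHeader(const SketchDataset& ds, SketchWriter* w) {
  return w->Write(&ds.num_features, sizeof(ds.num_features));
}

// Data: element count followed by the raw values.
std::size_t WriteData(const SketchDataset& ds, SketchWriter* w) {
  const std::uint64_t count = ds.values.size();
  std::size_t written = w->Write(&count, sizeof(count));
  written += w->Write(ds.values.data(), ds.values.size() * sizeof(double));
  return written;
}

// A "reference" is the header alone; a full dump is header plus data.
std::size_t SerializeReference(const SketchDataset& ds, SketchWriter* w) {
  return WriteHeader(ds, w);
}

std::size_t SerializeFull(const SketchDataset& ds, SketchWriter* w) {
  return WriteHeader(ds, w) + WriteData(ds, w);
}
```

Because both paths call the same `WriteHeader`, the in-memory reference and the on-disk file cannot drift apart in how they encode the schema, which is the point of sharing the code.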
Also, a generic ByteBuffer was created so that higher-level languages (e.g. Java) do not have to manage the byte memory of the serialized buffer.
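A rough sketch of how such a buffer could be exposed through an opaque handle is shown below. The names (`SketchByteBuffer`, `BufferHandle`, `SketchBufferCreate`, `SketchBufferSize`, `SketchBufferGetAt`, `SketchBufferFree`) are assumptions made for illustration and are not the functions added by this PR; a real C API would also declare them `extern "C"`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical native-side buffer that owns the serialized bytes.
class SketchByteBuffer {
 public:
  // Appends `bytes` bytes from `data` and returns the number appended.
  std::size_t Write(const void* data, std::size_t bytes) {
    const auto* p = static_cast<const std::uint8_t*>(data);
    bytes_.insert(bytes_.end(), p, p + bytes);
    return bytes;
  }
  std::size_t Size() const { return bytes_.size(); }
  std::uint8_t At(std::size_t index) const {
    assert(index < bytes_.size());
    return bytes_[index];
  }

 private:
  std::vector<std::uint8_t> bytes_;
};

// Opaque handle handed to the caller, e.g. a Java wrapper via JNI.
using BufferHandle = void*;

// Creates a buffer holding a copy of `data`; the native side owns the memory.
BufferHandle SketchBufferCreate(const void* data, std::int32_t bytes) {
  auto* buffer = new SketchByteBuffer();
  buffer->Write(data, static_cast<std::size_t>(bytes));
  return buffer;
}

// Returns the number of bytes stored in the buffer.
std::int32_t SketchBufferSize(BufferHandle handle) {
  const auto* buffer = static_cast<const SketchByteBuffer*>(handle);
  return static_cast<std::int32_t>(buffer->Size());
}

// Copies one byte into `out`; returns 0 on success, -1 if out of range.
int SketchBufferGetAt(BufferHandle handle, std::int32_t index,
                      std::uint8_t* out) {
  const auto* buffer = static_cast<const SketchByteBuffer*>(handle);
  if (index < 0 || static_cast<std::size_t>(index) >= buffer->Size()) {
    return -1;
  }
  *out = buffer->At(static_cast<std::size_t>(index));
  return 0;
}

// Releases the buffer; the handle must not be used afterwards.
int SketchBufferFree(BufferHandle handle) {
  delete static_cast<SketchByteBuffer*>(handle);
  return 0;
}
```

With this shape, the calling language only holds the handle, reads the bytes it needs through the accessor, and calls the free function once, so it never allocates or resizes the native memory itself.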
Test
New C++ tests were created to test both the serialization/deserialization and the new ByteBuffer functionality.
/gha run r-valgrind
Workflow R valgrind tests has been triggered! 🚀 https://github.com/microsoft/LightGBM/actions/runs/2893166788
Status: success ✔️.
@shiyu1994 can you help to review?
@shiyu1994 can you take a look? ty
@shiyu1994 Just checking in
/gha run r-valgrind
Workflow R valgrind tests has been triggered! 🚀 https://github.com/microsoft/LightGBM/actions/runs/3826303140
Status: success ✔️.
@shiyu1994 @guolinke can you help with a review on this?
Sorry for the late response. Will review it within the next two days.
@shiyu1994 I made the requested changes. Can you look it over and try rerunning the failures? They don't seem related to this PR.
/gha run r-valgrind
Workflow R valgrind tests has been triggered! 🚀 https://github.com/microsoft/LightGBM/actions/runs/4169689163
Status: success ✔️.