LightGBM
LightGBM copied to clipboard
[python-package] prefix all internal object names with "_"
Short Description
I'm opening this RFC to propose the following:
all objects in the Python package which are intended to be internal should have names beginning with
_
The convention "anything starting with _
is internal" is common in Python programming. If LightGBM's Python package followed that convention consistently, it would simplify LightGBM's position on which parts of the Python package are considered "internal" and subject to breaking changes in any release.
Detailed Description
In pursuit of the v4.0.0 release (#5153), we have been accepting breaking changes which simplify the project and reduce its maintenance burden, so that limited maintainer time can be focused on providing a good experience with LightGBM's main functionality.
I think "all internal objects should begin with _
" is one such type of change. Making it clearer which parts of the Python package are considered internal reduces the surface area of the public API which users need to consider, and increases contributors' flexibility to change implementations without breaking user code.
R packages support a "namespace" concept (see "Writing R extensions"), which clearly separates exported and internal objects, and I think that mechanism has been really helpful. The closest thing that Python has, to my knowledge, is the __all__
member for modules (https://docs.python.org/3/tutorial/modules.html#importing-from-a-package). __all__
controls the behavior of "star imports".
For example, if you run the following:
from lightgbm import *
dir()
the only LightGBM objects you'll see are those in the package's __all__
.
This mechanism is a good start, but prefixing all internal objects with _
would be and even stronger signal about which things are internal. It would also improve the consistency of the Python package's code.
Specific Proposal
expand to see original proposal (click me)
I am proposing that the following be prefixed with a _
and treated as internal:
- https://github.com/microsoft/LightGBM/blob/ff947bfac5148a9b94d045f2879aee183e02e174/python-package/lightgbm/basic.py#L128
- https://github.com/microsoft/LightGBM/blob/ff947bfac5148a9b94d045f2879aee183e02e174/python-package/lightgbm/basic.py#L144
- https://github.com/microsoft/LightGBM/blob/ff947bfac5148a9b94d045f2879aee183e02e174/python-package/lightgbm/basic.py#L155
- https://github.com/microsoft/LightGBM/blob/ff947bfac5148a9b94d045f2879aee183e02e174/python-package/lightgbm/basic.py#L160
- https://github.com/microsoft/LightGBM/blob/ff947bfac5148a9b94d045f2879aee183e02e174/python-package/lightgbm/basic.py#L168
- https://github.com/microsoft/LightGBM/blob/ff947bfac5148a9b94d045f2879aee183e02e174/python-package/lightgbm/basic.py#L175
- https://github.com/microsoft/LightGBM/blob/ff947bfac5148a9b94d045f2879aee183e02e174/python-package/lightgbm/basic.py#L190
- https://github.com/microsoft/LightGBM/blob/ff947bfac5148a9b94d045f2879aee183e02e174/python-package/lightgbm/basic.py#L240
- https://github.com/microsoft/LightGBM/blob/ff947bfac5148a9b94d045f2879aee183e02e174/python-package/lightgbm/basic.py#L248
- https://github.com/microsoft/LightGBM/blob/ff947bfac5148a9b94d045f2879aee183e02e174/python-package/lightgbm/basic.py#L256
- https://github.com/microsoft/LightGBM/blob/ff947bfac5148a9b94d045f2879aee183e02e174/python-package/lightgbm/basic.py#L264
- https://github.com/microsoft/LightGBM/blob/ff947bfac5148a9b94d045f2879aee183e02e174/python-package/lightgbm/basic.py#L272
- https://github.com/microsoft/LightGBM/blob/ff947bfac5148a9b94d045f2879aee183e02e174/python-package/lightgbm/basic.py#L277
- https://github.com/microsoft/LightGBM/blob/ff947bfac5148a9b94d045f2879aee183e02e174/python-package/lightgbm/basic.py#L282
- https://github.com/microsoft/LightGBM/blob/ff947bfac5148a9b94d045f2879aee183e02e174/python-package/lightgbm/basic.py#L292
- https://github.com/microsoft/LightGBM/blob/ff947bfac5148a9b94d045f2879aee183e02e174/python-package/lightgbm/basic.py#L436-L470
- https://github.com/microsoft/LightGBM/blob/ff947bfac5148a9b94d045f2879aee183e02e174/python-package/lightgbm/basic.py#L472
- https://github.com/microsoft/LightGBM/blob/ff947bfac5148a9b94d045f2879aee183e02e174/python-package/lightgbm/basic.py#L482
- https://github.com/microsoft/LightGBM/blob/ff947bfac5148a9b94d045f2879aee183e02e174/python-package/lightgbm/basic.py#L502
- https://github.com/microsoft/LightGBM/blob/ff947bfac5148a9b94d045f2879aee183e02e174/python-package/lightgbm/libpath.py#L9
Considerations for Implementation
Check that any name being modified is only used at runtime and not part of the state for objects like Booster
or LGBMClassifier
which might be pickled (since pickle
only stores references to imports and expects to be able to reference them when running load()
).
Status: Accepted
~@StrikerRUS @jmoralez @shiyu1994 @guolinke Whenever you have time, will you please comment here on whether or not you support this proposal? Thank you for your time and consideration.~
This was originally opened as a request for comment, but it's been accepted and we're now accepting contributions to make more of the Python package's namespace private?
How to Contribute
Anyone is welcome to submit a pull request to help contribute to this effort.
Please limit pull requests to 1 function/method/module per pull request, to make them easier to review.
When you open pull requests, please put Contributes to #5313
in the description.
Add __all__
entries to modules.
- [ ]
python-package/basic.py
- [ ]
python-package/callback.py
- [ ]
python-package/dask.py
- [ ]
python-package/engine.py
- [ ]
python-package/libpath.py
- [ ]
python-package/plotting.py
- [ ]
python-package/sklearn.py
Prefix objects, functions, class attributes with _
- [ ]
basic.ZERO_THRESHOLD
- [x]
basic.NUMERIC_TYPES
(#5409) - [x]
basic.is_numeric()
(#5421) - [ ]
basic.is_numpy_1d_array()
- [ ]
basic.is_numpy_column_array()
- [ ]
basic.cast_numpy_array_to_dtype()
- [ ]
basic.is_1d_list()
- [ ]
basic.list_to_1d_numpy()
- [ ]
basic.cfloat32_array_to_numpy()
- [ ]
basic.cfloat64_array_to_numpy()
- [ ]
basic.cint32_array_to_numpy()
- [ ]
basic.cint64_array_to_numpy()
- [ ]
basic.c_str()
- [ ]
basic.c_array()
- [ ]
basic.json_default_with_numpy()
- [ ]
basic.param_dict_to_str()
- [ ]
basic.MAX_INT32
- [ ]
basic.C_API_DTYPE_FLOAT32
- [ ]
basic.C_API_DTYPE_FLOAT64
- [ ]
basic.C_API_DTYPE_INT32
- [ ]
basic.C_API_DTYPE_INT64
- [ ]
basic.C_API_IS_ROW_MAJOR
- [ ]
basic.C_API_PREDICT_NORMAL
- [ ]
basic.C_API_PREDICT_RAW_SCORE
- [ ]
basic.C_API_PREDICT_LEAF_INDEX
- [ ]
basic.C_API_PREDICT_CONTRIB
- [ ]
basic.C_API_MATRIX_TYPE_CSR
- [ ]
basic.C_API_MATRIX_TYPE_CSC
- [ ]
basic.C_API_FEATURE_IMPORTANCE_SPLIT
- [ ]
basic.C_API_FEATURE_IMPORTANCE_GAIN
- [ ]
basic.FIELD_TYPE_MAPPER
- [ ]
basic.FEATURE_IMPORTANCE_TYPE_MAPPER
- [ ]
basic.convert_from_sliced_object()
- [ ]
basic.c_float_array()
- [ ]
basic.c_int_array()
Thanks for writing this up, I support it. Are we also adding the __all__
? I think it's good having both.
Are we also adding the
__all__
? I think it's good having both.
There is already a __all__
controlling what happens when you from lightgbm import *
.
https://github.com/microsoft/LightGBM/blob/1b43214f72e2f42c262f0c79eb35cd185a8654af/python-package/lightgbm/init.py#L30-L36
Are you talking about also adding a __all__
to each module, for example one in python-package/lightgbm/sklearn.py
that would limit what gets imported by from lightgbm.sklearn import *
?
I don't know enough about this to know whether or not that would work. Like I'm not sure if __all__
as a concept is specific to __init__.py
files. But I'd be interested in trying it!
Yeah I meant at the module level, I believe it would be very useful for callback
for example.
Edit: Adding reference
The public names defined by a module are determined by checking the module’s namespace for a variable named _all_
https://docs.python.org/3/reference/simple_stmts.html#import
Edit2: On that same paragraph it says:
If _all_ is not defined, the set of public names includes all names found in the module’s namespace which do not begin with an underscore character ('_')
So we can set __all__
on the modules as a first step and that would avoid importing the current internal functions that aren't prefixed by _
.
Ok nice! I hadn't thought of that. Totally support that idea. I think that new public classes and functions are rare enough in this project that the communication benefit of doing that is worth the slight additional maintenance burden.
I'm +1 for
all objects in the Python package which are intended to be internal should have names beginning with
_
Also I'm proposing rethinking which properties of Dataset
, Booster
and other public classes should be public.
For example, I don't think that pandas_categorical
or handle
properties should be public.
https://github.com/microsoft/LightGBM/blob/1b43214f72e2f42c262f0c79eb35cd185a8654af/python-package/lightgbm/basic.py#L1200
https://github.com/microsoft/LightGBM/blob/1b43214f72e2f42c262f0c79eb35cd185a8654af/python-package/lightgbm/basic.py#L2581
I'm +1
I'm +1 for this proposal! Thanks.
Ok thanks all! I'll update the description and convert this from an RFC to a regular maintenance issue, similar to #3756 .
I've updated the description and title to describe how to contribute on this initiative. @StrikerRUS @jmoralez feel free to edit it with any changes you'd like!
For example, I didn't list out any class attributes yet (but agree with the idea in https://github.com/microsoft/LightGBM/issues/5313#issuecomment-1166323802).
I'm glad we originally opened this as a good first issue
and that some contributors made their first PRs into the project working on it!
But I'm going to take this over and finish it. It's been open for 6 months, and I'd like to get these changes in for v4.0.0 (#5153).
The last piece of this is @StrikerRUS 's proposal to also make some more of the class attributes for public classes private.
Here's my proposal for which class attributes I think we should make private in the v4.0.0 release:
- [x]
Booster.handle
-
ctypes
pointer to theBooster
object on the C++ side. Keeping this private would allow changing the implementation for how that reference is managed
-
- [x]
Booster.network
-
used to keep track of whether or not this Booster is actively participating in distributed training, users should only be modifying this via public method
Booster.free_network()
- #5723
-
used to keep track of whether or not this Booster is actively participating in distributed training, users should only be modifying this via public method
- [ ]
Dataset.handle
-
ctypes
pointer to theDataset
object on the C++ side. Keeping this private would allow changing the implementation for how that reference is managed
-
- [x]
Dataset.params_back_up
- implementation detail for parameter-updating, users shouldn't ever think about this
- #5723
- [x]
Dataset.need_slice
-
used to defer when slicing actually happens when
used_indices
is modified (including byDataset.subset()
). Users should never be modifying this directly - #5723
-
used to defer when slicing actually happens when
- no changes to
lightgbm.dask
classes - no changes to
lightgbm.sklearn
classes - no changes to
lightgbm.basic.CVBooster
@jmoralez @guolinke @StrikerRUS @shiyu1994 @btrotta what do you think?
This issue has been automatically locked since there has not been any recent activity since it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues including a reference to this.