NA's in label, weights, and init_score are replaced by 0

fabsig opened this issue 3 years ago • 1 comment

LightGBM replaces missing values in labels, weights, and init_score by 0. See, e.g., here for labels:

https://github.com/microsoft/LightGBM/blob/f1d3181ced9fd01f4b2899054abd99be6773e939/src/io/metadata.cpp#L388

and the function AvoidInf():

https://github.com/microsoft/LightGBM/blob/f1d3181ced9fd01f4b2899054abd99be6773e939/include/LightGBM/utils/common.h#L653-L655

which does the actual replacement.
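
For illustration, the replacement behaves roughly like this Python sketch (the 1e300 clamp value is an assumption for readability; the linked C++ is authoritative):

```python
import math

def avoid_inf(x: float, large: float = 1e300) -> float:
    """Rough Python mirror of the C++ AvoidInf() linked above:
    NaN becomes 0.0, +/-Inf is clamped to a large finite value.
    The 1e300 threshold here is illustrative, not necessarily the
    literal constant from LightGBM's source."""
    if math.isnan(x):
        return 0.0
    if x > large:
        return large
    if x < -large:
        return -large
    return x

print(avoid_inf(float("nan")))  # 0.0  <- the silent replacement
print(avoid_inf(float("inf")))  # 1e+300
```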

Is this being done intentionally? In my view, this is dangerous practice: missing values in labels, weights, and init_score should not be imputed with 0's (neither in LightGBM nor in any other machine learning model). The situation is similar for Inf's in labels; currently, Inf's are simply replaced by a large number. How NA's and Inf's are dealt with is something the user should decide. A lot can go wrong if missing values are silently replaced by 0's, and users might not even be aware of this (this was the case for me).

If this should be changed, I am glad to provide a fix, as I have done here for the GPBoost library.

fabsig • Nov 18 '22 10:11

Hey @fabsig, I'm very, very sorry that no one ever responded to your post here from a few years ago.

I'd like to revive that conversation now, based on this related report that just came in: #7041

My guess (not certain): it's intentional

I suspect that it is intentional to convert infinite values to large values and NaNs to 0s in label, weights, and init_score. I'm going to guess at why, but it would be good for @guolinke to confirm.

It's important to remember that before LightGBM had a Python package (#97) or R package (#168), the options for training models were:

  • use the C API
  • use the command-line interface (CLI)

And for both of those, the main supported way to provide input data was via files, not in-memory arrays. With input data in files, it could be much more difficult to edit subsets of the raw data (e.g. dropping rows with NA values, imputing custom values, etc.) than it is with in-memory array data in Python or R.
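
For example, here is the kind of cleanup that is trivial with in-memory arrays (a rough numpy sketch with made-up toy data):

```python
import numpy as np

# Toy in-memory data: features X and a label y with one missing value.
X = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
y = np.array([0.5, np.nan, 1.5])

# With arrays, dropping rows whose label is NaN/Inf is a one-liner;
# with file-based input to the CLI or C API, the same cleanup means
# rewriting the data file.
mask = np.isfinite(y)
X_clean, y_clean = X[mask], y[mask]
```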

Replacing with 0s is one simple and performant solution to that problem.

this is dangerous practice

I don't think it's "dangerous" unconditionally and equally. The three cases are listed below, from most to least impactful (in my opinion):

label = most severe

If only a few such values are found in the training data, and if the dataset is sufficiently large, this could almost act like a form of regularization and help generalizability.

But yes, if it affects a significant portion of the label, or if the values are not missing/infinite at random, then converting to 0.0 could result in a nonsensical or at least very bad model.

Converting NaN/Inf in the label to 0.0 is the hardest choice to justify, in my opinion.
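
A quick way to observe this from Python (a sketch using existing lightgbm APIs; the printed zeros assume the coercion described in this thread):

```python
import numpy as np
import lightgbm as lgb

rng = np.random.default_rng(0)
X = rng.random((100, 3))
y = rng.random(100)
y[:5] = np.nan  # labels that LightGBM will coerce

# Dataset.construct() materializes the C++ Dataset; get_label() then
# returns the stored label. If the coercion described in this thread
# applies, this prints five 0.0s with no warning or error.
dtrain = lgb.Dataset(X, label=y).construct()
print(dtrain.get_label()[:5])
```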

weights = fairly severe

A weight of 0.0 is similar in some ways to dropping an observation from the Dataset entirely. It's not identical: the observation's feature values still contribute to the distribution used for histogram construction, and it still counts as 1 towards any number-of-observations parameters like min_data_in_leaf or bin_construct_sample_cnt.

A relatively small number of these cases could again act as a sort of regularization.

But with too many of them, the model could be nonsensical or very bad (consider an extreme case: a regression problem where the weight is NaN for all samples in the top 90% of the label distribution... the model will not learn how to predict in most of that range).
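
A small sketch of that distinction (synthetic data; Dataset.num_data() is an existing lightgbm API):

```python
import numpy as np
import lightgbm as lgb

rng = np.random.default_rng(0)
X = rng.random((100, 3))
y = rng.random(100)

# Zero out some weights, as LightGBM would do for NaN weights.
w = np.ones(100)
w[:10] = 0.0

# The zero-weight rows contribute nothing to the loss, but they are
# not dropped: they still enter histogram bin construction and still
# count toward parameters like min_data_in_leaf.
dtrain = lgb.Dataset(X, label=y, weight=w).construct()
print(dtrain.num_data())  # 100, not 90
```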

init_score = annoying, but not a huge problem

If there are real relationships between the features and the label, then it should be possible to recover from bad init_score with enough boosting rounds.

Even if every sample had its init_score set to 0.0, as long as the features, label, and weights are OK then LightGBM should be able to produce a good and useful model.

But it could lead to longer training times and more complex models than would be achieved with a non-zero init_score set by some other model.
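
A sketch of that recovery on synthetic regression data (parameter values are illustrative):

```python
import numpy as np
import lightgbm as lgb

rng = np.random.default_rng(0)
X = rng.random((500, 3))
y = 2.0 * X[:, 0] + rng.normal(0.0, 0.1, 500)

# An all-zero init_score (what coerced NaNs produce) is a poor starting
# point, but with enough boosting rounds the model can still recover
# the feature/label relationship.
dtrain = lgb.Dataset(X, label=y, init_score=np.zeros(500))
booster = lgb.train({"objective": "regression", "verbose": -1},
                    dtrain, num_boost_round=200)
```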

If this should be changed, I am glad to provide a fix, as I have done here for the GPBoost library.

It looks like there, you've implemented the behavior "a single NaN or infinite value in the label, init_score, or weights results in a runtime error". That would be a fairly significant breaking change for LightGBM and not one I'd want to make lightly.

HOWEVER... I do agree that missing or infinite values in these inputs can generally be assumed to be either mistakes or cases where someone is intentionally depending on LightGBM's behavior of coercing them to 0.0.

Proposed change to LightGBM

I do think it'd be helpful to alert users to missing or infinite values in these inputs, but for backwards compatibility it should still be possible to have LightGBM convert such values to 0.0.

Proposing:

  • raise a fatal error if NaN / Inf are detected in init_score, label, or weight
  • introduce a parameter treat_nan_as_zero (or similar) allowing users to avoid the error and opt in to the existing "replace with 0.0" behavior

For example:

```
[LightGBM] [Fatal] Encountered NaN or infinite value in init_score.
To resolve this, either ensure that init_score is free of such values
or set treat_nan_as_zero=true in params.
```
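
If something like this were adopted, opting back in to the old behavior might look like the following (hypothetical: treat_nan_as_zero is only the parameter proposed above, not an existing LightGBM option):

```python
import lightgbm as lgb

# "treat_nan_as_zero" is the parameter PROPOSED above; it does not
# exist in LightGBM today. Under the proposal, omitting it (or setting
# it to False) would make NaN/Inf in label/weight/init_score fatal,
# while True would restore the legacy replace-with-0.0 behavior.
params = {
    "objective": "regression",
    "treat_nan_as_zero": True,  # hypothetical opt-in
}
# booster = lgb.train(params, dtrain)  # dtrain as constructed elsewhere
```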

I think introducing that breaking change is OK as long as we make it easy to restore the replace-with-0.0 behavior that has been there for a long time.

jameslamb • Sep 27 '25 04:09