Error in cox-regression while evaluating

Open Stochastic13 opened this issue 3 years ago • 13 comments

I am using XGBoost version 1.3.3 on Windows with Python 3.6.8. On attempting to do training with objective set to survival:cox, I repeatedly get this error.

Traceback (most recent call last):
  File "xgboost_survival_cv.py", line 94, in <module>
    evals=[(data_m, 'train'), (data_v, 'eval')])
  File "C:\Program Files\Python36\lib\site-packages\xgboost\training.py", line 235, in train
  File "C:\Program Files\Python36\lib\site-packages\xgboost\training.py", line 110, in _train_internal
    if callbacks.after_iteration(bst, i, dtrain, evals):
  File "C:\Program Files\Python36\lib\site-packages\xgboost\callback.py", line 427, in after_iteration
    self._update_history(score, epoch)
  File "C:\Program Files\Python36\lib\site-packages\xgboost\callback.py", line 393, in _update_history
    name, s = d[0], float(d[1])
ValueError: could not convert string to float: '-nan(ind)'

There are no NaNs in my data. I initially assumed this might be due to some overflow, since the error often appeared when the test loss (cox-nloglik) exceeded 15 or so (in the last successful boosting iteration). That seemed to be confirmed by the error disappearing when I used fewer boosting rounds, a smaller learning rate, smaller trees (no overfitting and hence no blowup of the test loss?), or switched off evaluation (empty evals list). But later I got the same error when the test loss was 6 (in the last successful boosting iteration). Further, on removing the evaluation (I need early_stopping_rounds, so this is not a long-term option), I still get nan (or inf) in the prediction output, though no error. The data is highly censored (90% right-censored), in case that matters.
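To illustrate the overflow hypothesis, here is a minimal sketch of a Cox negative log partial likelihood written in plain Python. It is not XGBoost's implementation: the function names, the separate `times`/`events` inputs, and the toy numbers are all assumptions made for illustration. It only shows that a single very large raw prediction is enough to push an unguarded computation to inf, and that a ratio of two overflowed exponentials becomes nan, even when the input data contains no NaN.

```python
import math


def c_style_exp(x):
    """exp() that saturates to +inf on overflow, the way C floating point does."""
    try:
        return math.exp(x)
    except OverflowError:
        return math.inf


def naive_cox_nloglik(margins, times, events):
    """Mean negative log partial likelihood with no numerical safeguards."""
    total = 0.0
    for m_i, t_i, e_i in zip(margins, times, events):
        if not e_i:
            continue  # censored rows contribute only through risk sets
        # Risk set: every subject still under observation at time t_i.
        risk_sum = sum(c_style_exp(m) for m, t in zip(margins, times) if t >= t_i)
        total += math.log(risk_sum) - m_i
    return total / sum(events)


def stable_cox_nloglik(margins, times, events):
    """Same quantity, using the log-sum-exp trick to avoid overflow."""
    total = 0.0
    for m_i, t_i, e_i in zip(margins, times, events):
        if not e_i:
            continue
        risk = [m for m, t in zip(margins, times) if t >= t_i]
        top = max(risk)
        total += top + math.log(sum(math.exp(m - top) for m in risk)) - m_i
    return total / sum(events)


margins = [800.0, 1.0, 0.5]  # hypothetical raw predictions, one of them runaway
times = [5, 3, 1]
events = [1, 1, 0]           # 1 = event observed, 0 = right-censored

print(naive_cox_nloglik(margins, times, events))   # overflows to inf
print(stable_cox_nloglik(margins, times, events))  # stays finite
# A ratio such as exp(m_i) / risk_sum becomes inf / inf, which is nan.
print(c_style_exp(800.0) / c_style_exp(800.0))
```

In single precision the exponential already overflows for arguments a little above 88, so much smaller margins than the 800 used here would be enough if the computation runs in float32.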

The run parameters were thus:

{'colsample_bytree': 0.8, 'eta': 0.1, 'max_delta_step': 0, 'max_depth': 3, 'min_child_weight': 100, 
'num_parallel_tree': 20, 'sampling_method': 'uniform', 'subsample': 0.8, 'tree_method':'gpu_hist', 
'verbosity':1,  'seed':0, 'objective':'survival:cox', 'eval_metric':'cox-nloglik'}

The same error occurs for many different parameter sets. Another example: {'colsample_bytree': 0.8, 'eta': 0.3, 'max_delta_step': 0, 'max_depth': 3, 'min_child_weight': 100, 'num_parallel_tree': 1, 'sampling_method': 'gradient_based', 'subsample': 0.2}

Stochastic13 avatar Apr 21 '21 17:04 Stochastic13

Are you able to post your data and training script? That will help us further diagnose this problem.

hcho3 avatar Apr 21 '21 17:04 hcho3

@hcho3 I can post the main part of the training script and the output. There is a large section of preprocessing and CV setup that I am skipping. Also, the parameters are chosen just to trigger the error quickly.

    data_m = xg.DMatrix(xsub_train, label=y_coded_train, weight=[imbalance if i == 1 else 1 for i in ysub_train])
    data_v = xg.DMatrix(xsub_test, label=y_coded_test, weight=[imbalance if i == 1 else 1 for i in ysub_test])
    print('Imbalance: ', imbalance)
    print('Length/NA ytrain:', len(y_coded_train), np.sum(np.isnan(y_coded_train)))
    print('Length/NA xtrain:', xsub_train.shape, np.sum(np.isnan(xsub_train.to_numpy().flatten())))
    print('Length/NA yest:', len(y_coded_test), np.sum(np.isnan(y_coded_test)))
    print('Length/NA xtest:', xsub_test.shape, np.sum(np.isnan(xsub_test.to_numpy().flatten())))
    print(np.sum(ysub_train), np.sum(ysub_test))
    m = xg.train(params=p, dtrain=data_m, num_boost_round=1000, early_stopping_rounds=25,
                 evals=[(data_m, 'train'), (data_v, 'eval')])

Output (In other runs, like I said above, the score does not have to reach this high for the error):

Imbalance:  31.582978723404256
Length/NA ytrain: 6125 0
Length/NA xtrain: (6125, 52) 0
Length/NA yest: 1532 0
Length/NA xtest: (1532, 52) 0
{'colsample_bytree': 0.8, 'eta': 0.3, 'max_delta_step': 0, 'max_depth': 3, 'min_child_weight': 100, 'num_parallel_tree': 1, 'sampling_method': 'uniform', 'subsample': 0.8, 'tree_method': 'gpu_hist', 'objective': 'survival:cox', 'eval_metric': 'cox-nloglik', 'seed': 0, 'verbosity': 1}
188 47
[0]     train-cox-nloglik:8.88308       eval-cox-nloglik:6.99363
[1]     train-cox-nloglik:9.26756       eval-cox-nloglik:7.79090
[2]     train-cox-nloglik:9.64561       eval-cox-nloglik:8.33538
[3]     train-cox-nloglik:10.17499      eval-cox-nloglik:9.04028
[4]     train-cox-nloglik:10.41129      eval-cox-nloglik:9.17929
[5]     train-cox-nloglik:10.87415      eval-cox-nloglik:10.01846
[6]     train-cox-nloglik:11.59993      eval-cox-nloglik:10.37045
[7]     train-cox-nloglik:12.22635      eval-cox-nloglik:10.77931
[8]     train-cox-nloglik:12.22635      eval-cox-nloglik:10.77931
[9]     train-cox-nloglik:12.40515      eval-cox-nloglik:11.21803
[10]    train-cox-nloglik:12.40514      eval-cox-nloglik:11.21803
[11]    train-cox-nloglik:12.40514      eval-cox-nloglik:11.21803
Traceback (most recent call last):
  File "xgboost_survival_cv.py", line 101, in <module>
    evals=[(data_m, 'train'), (data_v, 'eval')])
  File "C:\Program Files\Python36\lib\site-packages\xgboost\training.py", line 235, in train
  File "C:\Program Files\Python36\lib\site-packages\xgboost\training.py", line 110, in _train_internal
    if callbacks.after_iteration(bst, i, dtrain, evals):
  File "C:\Program Files\Python36\lib\site-packages\xgboost\callback.py", line 427, in after_iteration
    self._update_history(score, epoch)
  File "C:\Program Files\Python36\lib\site-packages\xgboost\callback.py", line 393, in _update_history
    name, s = d[0], float(d[1])
ValueError: could not convert string to float: '-nan(ind)'

Should I print anything else of importance?

Stochastic13 avatar Apr 21 '21 19:04 Stochastic13

@Stochastic13 Can you post the data after the pre-processing step? If we cannot run the program ourselves, it's hard for us developers to find the cause of the error.

hcho3 avatar Apr 21 '21 19:04 hcho3

@hcho3 I understand. The data is confidential unfortunately. Here's a reproducible example I recreated with random data:

import numpy as np
import pandas as pd
import xgboost as xg


param_cv = dict()
param_cv['eta'] = 0.3
param_cv['max_depth'] = 3
param_cv['min_child_weight'] = 100
param_cv['max_delta_step'] = 0
param_cv['subsample'] = 0.8
param_cv['sampling_method'] = 'uniform'
param_cv['colsample_bytree'] = 0.8
param_cv['num_parallel_tree'] = 1
param_cv['tree_method'] = 'gpu_hist'
param_cv['objective'] = 'survival:cox'
param_cv['eval_metric'] = 'cox-nloglik'
param_cv['seed'] = 0
param_cv['verbosity'] = 1

imbalance = 31.58

xsub_train = pd.DataFrame(np.random.normal(0, 1, (6125, 52)))
ysub_train = np.random.choice([0, 1], 6125, p=[imbalance / (imbalance + 1), 1 / (imbalance + 1)])
y_coded_train = np.random.randint(1, 1500, 6125)
y_coded_train[ysub_train == 0] = -y_coded_train[ysub_train == 0]

xsub_test = pd.DataFrame(np.random.normal(0, 1, (1532, 52)))
ysub_test = np.random.choice([0, 1], 1532, p=[imbalance / (imbalance + 1), 1 / (imbalance + 1)])
y_coded_test = np.random.randint(1, 1500, 1532)
y_coded_test[ysub_test == 0] = -y_coded_test[ysub_test == 0]
data_m = xg.DMatrix(xsub_train, label=y_coded_train, weight=[imbalance if i == 1 else 1 for i in ysub_train])
data_v = xg.DMatrix(xsub_test, label=y_coded_test, weight=[imbalance if i == 1 else 1 for i in ysub_test])
m = xg.train(params=param_cv, dtrain=data_m, num_boost_round=1000, early_stopping_rounds=25,
             evals=[(data_m, 'train'), (data_v, 'eval')])

And the Output:

[0]     train-cox-nloglik:7.98620       eval-cox-nloglik:6.65554
[1]     train-cox-nloglik:8.27961       eval-cox-nloglik:7.17497
[2]     train-cox-nloglik:8.66985       eval-cox-nloglik:7.64955
[3]     train-cox-nloglik:9.10624       eval-cox-nloglik:8.32161
[4]     train-cox-nloglik:9.56293       eval-cox-nloglik:8.84506
[5]     train-cox-nloglik:10.12860      eval-cox-nloglik:9.36206
[6]     train-cox-nloglik:10.66752      eval-cox-nloglik:10.30291
[7]     train-cox-nloglik:11.37288      eval-cox-nloglik:10.78025
[8]     train-cox-nloglik:12.12385      eval-cox-nloglik:12.21345
[9]     train-cox-nloglik:12.66841      eval-cox-nloglik:13.19342
[10]    train-cox-nloglik:12.66842      eval-cox-nloglik:13.19342
[11]    train-cox-nloglik:12.66842      eval-cox-nloglik:13.19342
Traceback (most recent call last):
  File "xgboost_error.py", line 39, in <module>
    evals=[(data_m, 'train'), (data_v, 'eval')])
  File "C:\Program Files\Python36\lib\site-packages\xgboost\training.py", line 235, in train
  File "C:\Program Files\Python36\lib\site-packages\xgboost\training.py", line 110, in _train_internal
    if callbacks.after_iteration(bst, i, dtrain, evals):
  File "C:\Program Files\Python36\lib\site-packages\xgboost\callback.py", line 427, in after_iteration
    self._update_history(score, epoch)
  File "C:\Program Files\Python36\lib\site-packages\xgboost\callback.py", line 393, in _update_history
    name, s = d[0], float(d[1])
ValueError: could not convert string to float: '-nan(ind)'

Stochastic13 avatar Apr 21 '21 19:04 Stochastic13

@hcho3 I can also try with similar random datasets with different censoring extents, if that sounds like something that helps narrow down the problem. I was just worried if I had made some mistake in running the training.

Stochastic13 avatar Apr 22 '21 04:04 Stochastic13

Weird that the nloglik increases.

trivialfis avatar Apr 22 '21 19:04 trivialfis

@trivialfis The increase doesn't have to be large for the error to occur, in case it helps. The following gives the same error after 85 iterations, but neither the train nor the eval score changes at all up to the first 5 decimal places.

import numpy as np
import pandas as pd
import xgboost as xg


param_cv = dict()
param_cv['eta'] = 0.1
param_cv['max_depth'] = 3
param_cv['min_child_weight'] = 200
param_cv['max_delta_step'] = 0
param_cv['subsample'] = 0.5
param_cv['sampling_method'] = 'uniform'
param_cv['colsample_bytree'] = 0.5
param_cv['num_parallel_tree'] = 1
param_cv['tree_method'] = 'gpu_hist'
param_cv['objective'] = 'survival:cox'
param_cv['eval_metric'] = 'cox-nloglik'
param_cv['seed'] = 0
param_cv['verbosity'] = 1

imbalance = 31.58

xsub_train = pd.DataFrame(np.random.normal(0, 1, (6125, 52)))
ysub_train = np.random.choice([0, 1], 6125, p=[imbalance / (imbalance + 1), 1 / (imbalance + 1)])
y_coded_train = np.random.randint(1, 1500, 6125)
y_coded_train[ysub_train == 0] = -y_coded_train[ysub_train == 0]

xsub_test = pd.DataFrame(np.random.normal(0, 1, (1532, 52)))
ysub_test = np.random.choice([0, 1], 1532, p=[imbalance / (imbalance + 1), 1 / (imbalance + 1)])
y_coded_test = np.random.randint(1, 1500, 1532)
y_coded_test[ysub_test == 0] = -y_coded_test[ysub_test == 0]
data_m = xg.DMatrix(xsub_train, label=y_coded_train, weight=[imbalance if i == 1 else 1 for i in ysub_train])
data_v = xg.DMatrix(xsub_test, label=y_coded_test, weight=[imbalance if i == 1 else 1 for i in ysub_test])
m = xg.train(params=param_cv, dtrain=data_m, num_boost_round=1000, early_stopping_rounds=600,
             evals=[(data_m, 'train'), (data_v, 'eval')])

Stochastic13 avatar Apr 23 '21 04:04 Stochastic13

I played a bit with the code. Without passing weights, I could not reproduce the problem. However, even with quite low weights (imbalance 2 or 3), the problem remained.

mayer79 avatar May 06 '21 19:05 mayer79

same problem, have you solved this? Thx!

XiangBu avatar Oct 03 '21 01:10 XiangBu

@SandyBy Not directly, unfortunately. Had to change the data processing to get different results. Not much success with other implementations either. See if scikit-survival helps you since it has gradient-boosted learning with cox-ph loss. Is much slower than XGBoost though.

Stochastic13 avatar Oct 03 '21 08:10 Stochastic13

If you don't use cv, it goes well. Also, I tried scikit-survival already, Thx a lot for kind reply!

XiangBu avatar Oct 06 '21 06:10 XiangBu

Could you please check out the new survival training module in XGBoost: https://xgboost.readthedocs.io/en/latest/tutorials/aft_survival_analysis.html ?
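For anyone trying that route with data already coded for survival:cox (negative label = right-censored, as in the scripts above), the labels need to be turned into lower/upper bounds. The following is a minimal sketch under that assumption; the helper name `cox_labels_to_aft_bounds` is hypothetical, and the XGBoost wiring in the trailing comments follows the linked tutorial and is not executed here.

```python
import math


def cox_labels_to_aft_bounds(labels):
    """Map signed Cox-style labels to (lower, upper) survival-time bounds.

    A positive label is an observed event at that time, so both bounds equal it.
    A negative label is right-censored at abs(label), so the upper bound is +inf.
    """
    lower, upper = [], []
    for y in labels:
        t = float(abs(y))
        lower.append(t)
        upper.append(t if y > 0 else math.inf)
    return lower, upper


lower, upper = cox_labels_to_aft_bounds([5, -3, 10])
print(lower)
print(upper)

# Assumed wiring into XGBoost, per the AFT tutorial (not run in this sketch):
#   dtrain = xgboost.DMatrix(features)
#   dtrain.set_float_info("label_lower_bound", lower)
#   dtrain.set_float_info("label_upper_bound", upper)
#   params = {"objective": "survival:aft", "eval_metric": "aft-nloglik"}
```

Whether the AFT objective avoids the nan seen with survival:cox on a given dataset is not established in this thread.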

trivialfis avatar Oct 07 '21 09:10 trivialfis

Hi everyone, I found an easy way to fix this bug. The cause is that nloglik goes to infinity or nan, which then can't be converted into a float.

in the xgboost\callback.py file

change the line

    cvmap[(metric_idx, k)].append(float(v))

to

    try:
        cvmap[(metric_idx, k)].append(float(v))
    except:
        cvmap[(metric_idx, k)].append(numpy.nan)
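The same idea can be sketched as a standalone helper that avoids a bare `except`. The function name `parse_metric_value` is hypothetical and not part of XGBoost. It treats only NaN-like spellings such as the `-nan(ind)` from the traceback as nan and still raises on anything else. Note that this only hides the symptom: the model is still producing a non-finite loss.

```python
import math


def parse_metric_value(text):
    """Convert a metric string to float, tolerating platform-specific NaN spellings."""
    try:
        return float(text)
    except ValueError:
        # Strings such as '-nan(ind)' are rejected by float(); treat them as nan.
        cleaned = text.strip().lower().lstrip("+-")
        if cleaned.startswith("nan"):
            return math.nan
        raise  # anything else is a genuine parsing problem


print(parse_metric_value("12.40514"))
print(parse_metric_value("-nan(ind)"))
```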


Ruihaoh avatar Apr 19 '22 03:04 Ruihaoh

Hello. I tried your method, but I encountered a new Bug.

File "D:\ProgramData\Anaconda3\envs\py38\lib\site-packages\xgboost\core.py", line 617, in inner_f
    return func(**kwargs)
  File "D:\ProgramData\Anaconda3\envs\py38\lib\site-packages\xgboost\training.py", line 196, in train
    if cb_container.after_iteration(bst, i, dtrain, evals):
  File "D:\ProgramData\Anaconda3\envs\py38\lib\site-packages\xgboost\callback.py", line 259, in after_iteration
    metric_score = [(n, float(s)) for n, s in metric_score_str]
  File "D:\ProgramData\Anaconda3\envs\py38\lib\site-packages\xgboost\callback.py", line 259, in <listcomp>
    metric_score = [(n, float(s)) for n, s in metric_score_str]
ValueError: could not convert string to float: '-nan(ind)'

The error simply moves on to the next step.

skyee1 avatar Nov 26 '22 09:11 skyee1

Hello, I also tried the approach above but encountered the same bug as @skyee1

Ediebah avatar Jan 17 '23 07:01 Ediebah

Please try XGBoost version 1.6. Good luck.

Ruihaoh avatar Jan 17 '23 14:01 Ruihaoh