lleaves
Does this cause a core dump?
Recently, I found that one of my models causes a core dump when I use lleaves for prediction.
I am confused about the two functions below.
In codegen.py, a function parameter's type can be int* if the corresponding feature is categorical:
```python
def make_tree(tree):
    # declare the function for this tree
    func_dtypes = (INT_CAT if f.is_categorical else DOUBLE for f in tree.features)
    scalar_func_t = ir.FunctionType(DOUBLE, func_dtypes)
    tree_func = ir.Function(module, scalar_func_t, name=str(tree))
    tree_func.linkage = "private"
    # populate function with IR
    gen_tree(tree, tree_func)
    return LTree(llvm_function=tree_func, class_id=tree.class_id)
```
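For reference, here is a small standalone sketch (not lleaves' actual code) of what such a mixed signature looks like in llvmlite, assuming `INT_CAT` is a 32-bit integer type and a tree with one categorical and one numerical feature:

```python
from llvmlite import ir

DOUBLE = ir.DoubleType()
INT_CAT = ir.IntType(32)  # assumption: categorical features are passed as 32-bit ints
module = ir.Module(name="example")

# One categorical feature followed by one numerical feature.
scalar_func_t = ir.FunctionType(DOUBLE, (INT_CAT, DOUBLE))
tree_func = ir.Function(module, scalar_func_t, name="tree_0")
tree_func.linkage = "private"

print(module)  # the emitted declaration mixes i32 and double arguments
```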
But in data_processing.py, which is used by predict, all feature parameters are converted to double*:
```python
def ndarray_to_ptr(data: np.ndarray):
    """
    Takes a 2D numpy array, converts to float64 if necessary and returns a pointer
    :param data: 2D numpy array. Copying is avoided if possible.
    :return: pointer to 1D array of dtype float64.
    """
    # ravel makes sure we get a contiguous array in memory and not some strided View
    data = data.astype(np.float64, copy=False, casting="same_kind").ravel()
    ptr = data.ctypes.data_as(POINTER(c_double))
    return ptr
```
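For context, a quick usage sketch (assuming the `ndarray_to_ptr` function above is in scope): every column, categorical or not, ends up as float64 values behind a single double pointer.

```python
import numpy as np
from ctypes import POINTER, c_double  # imports used by ndarray_to_ptr above

# Two rows, two features; imagine the second column holds categorical indices.
X = np.array([[1.0, 3.0],
              [2.0, 5.0]])

ptr = ndarray_to_ptr(X)
# The pointer addresses one contiguous, row-major float64 buffer.
print(ptr[0], ptr[1], ptr[2], ptr[3])  # 1.0 3.0 2.0 5.0
```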
Is this just like the following?
```c
int* predict(int* a, double* b);
double a = 1.1;
double b = 2.2;
predict(&a, &b);
```
Does this happen in lleaves?
TLDR: It's possible that there's a bug that causes a segfault, though it's unlikely that this is happening in the parts of the code you're pointing to.
For diagnosing the segfault: could you run a minimally reproducing example under `gdb` to see which instruction triggers the segfault? There used to be an issue with overflows for very large datasets, but I fixed that a few months ago. If there's any way you can put together a self-contained, minimally reproducible sample and send it to me (email is fine), I'd love to help you out.
Regarding the categorical data: The relevant function is actually this one: https://github.com/siboehm/lleaves/blob/9784625d8503c02e2679fafefb41c469b345566d/lleaves/compiler/codegen/codegen.py#L42
This is the function in the binary that lleaves calls from Python (using two double pointers). The categorical features are then cast to ints in the core loop here: https://github.com/siboehm/lleaves/blob/9784625d8503c02e2679fafefb41c469b345566d/lleaves/compiler/codegen/codegen.py#L205
Most of the processing of the Pandas dataframes follows LightGBM very closely. This double-to-int casting is a bit strange, but I wanted to follow LightGBM as closely as possible. It works because LightGBM doesn't allow categorical values larger than 2^31 - 1 (the maximum int32), while a double can represent any integer up to 2^53 without loss of precision.
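A quick illustrative check of that precision argument (not lleaves code, just NumPy arithmetic):

```python
import numpy as np

max_cat = np.int32(2**31 - 1)       # largest categorical index LightGBM allows
as_double = np.float64(max_cat)     # the value as it travels through the double* buffer
back_to_int = np.int32(as_double)   # the double-to-int cast inside the compiled tree

assert back_to_int == max_cat                         # exact, since 2**31 - 1 < 2**53
assert np.float64(2**53) + 1.0 == np.float64(2**53)   # precision only runs out past 2**53
```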
I found that if the categorical features are numerical values, we can drop the line `df[categorical_feature] = df[categorical_feature].astype('category')` when preparing the training data, and instead call LightGBM's train function with the parameter `categorical_feature=categorical_feature`. In a model file trained this way, `pandas_categorical` is null. Could this issue be related to that? When I retrained a model whose `pandas_categorical` is not null, the core dump disappeared.
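To illustrate the two training setups described above, here is a hedged sketch (toy data; the column names, parameters, and file name are made up for the example):

```python
import lightgbm as lgb
import pandas as pd

df = pd.DataFrame({"num_feat": [0.1, 0.2, 0.3, 0.4],
                   "cat_feat": [0, 1, 0, 1],   # categorical, but stored as plain ints
                   "label":    [0, 1, 0, 1]})
categorical_feature = ["cat_feat"]
params = {"objective": "binary", "min_data_in_leaf": 1, "min_data_in_bin": 1, "verbose": -1}

# Variant A: convert to the pandas category dtype first.
df_a = df.copy()
df_a[categorical_feature] = df_a[categorical_feature].astype("category")
train_a = lgb.Dataset(df_a[["num_feat", "cat_feat"]], label=df_a["label"])
model_a = lgb.train(params, train_a)

# Variant B: keep plain ints and only pass categorical_feature.
train_b = lgb.Dataset(df[["num_feat", "cat_feat"]], label=df["label"],
                      categorical_feature=categorical_feature)
model_b = lgb.train(params, train_b)

# Per the report above, models trained like variant B can end up with
# pandas_categorical:null in the saved model file.
model_b.save_model("model.txt")
```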
PR: return empty list if pandas_categorical is null in model file
BTW, I think we should keep `pandas_categorical = None` when `pandas_categorical: null` appears in the model file.
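To make the distinction concrete, here is a minimal sketch (a hypothetical helper, not lleaves' actual parser) of reading the `pandas_categorical` entry from a model.txt and choosing between `None` and an empty list:

```python
import json

def read_pandas_categorical(model_file: str):
    # LightGBM appends a line such as 'pandas_categorical:null' or
    # 'pandas_categorical:[["a", "b"]]' near the end of model.txt.
    with open(model_file) as f:
        for line in f:
            if line.startswith("pandas_categorical:"):
                value = json.loads(line[len("pandas_categorical:"):])
                # The PR maps null to an empty list so callers can iterate safely;
                # the suggestion above is to keep None when the file says null.
                return [] if value is None else value
    return []
```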
I'm having trouble understanding this issue. Could you write up a minimally reproducible example of the core dump, or send me the model.txt that causes it?