xgboost_ray
`predict` on distributed `RayDMatrix` that was already used for training fails
When `predict` is called on a distributed `RayDMatrix` that was already used for training, it fails: `combine_data` requires `sharding` to be one of `RayShardingMode.INTERLEAVED` or `RayShardingMode.BATCH`, but after training the `RayDMatrix` will have `RayShardingMode.FIXED`.
```
E   ValueError: Invalid value for `sharding` parameter: RayShardingMode.FIXED
E   FIX THIS by passing any item of the `RayShardingMode` enum, for instance `RayShardingMode.BATCH`.
```
A distributed `RayDMatrix` is essentially "single use". This should be rectified either by restoring the original state of the `RayDMatrix` after training, or by raising an exception if it is reused for prediction.
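The "restore the original state" option could look like the following dependency-free sketch. `FakeDMatrix`, `preserve_sharding`, and the stand-in `RayShardingMode` enum are illustrative, not xgboost_ray API; the real fix would live inside `train`:

```python
from contextlib import contextmanager
from enum import Enum

class RayShardingMode(Enum):
    # Stand-in for xgboost_ray's enum; the real values may differ.
    INTERLEAVED = 1
    BATCH = 2
    FIXED = 3

class FakeDMatrix:
    """Illustrative stand-in for RayDMatrix with a mutable sharding mode."""
    def __init__(self, sharding=RayShardingMode.INTERLEAVED):
        self.sharding = sharding

@contextmanager
def preserve_sharding(dmatrix):
    """Save the sharding mode and restore it on exit, so a training step
    that temporarily switches to FIXED does not break later predict calls."""
    original = dmatrix.sharding
    try:
        yield dmatrix
    finally:
        dmatrix.sharding = original

dtrain = FakeDMatrix()
with preserve_sharding(dtrain):
    dtrain.sharding = RayShardingMode.FIXED  # what training effectively does
assert dtrain.sharding == RayShardingMode.INTERLEAVED  # usable again
```

With a guard like this around the training loop, the same matrix would come out of `train` in the state it went in.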
Repro (imports added, `self.x` fixed to `x`, and the failing `predict` call included to trigger the error):

```python
import numpy as np
import ray
from pandas import DataFrame as PdDataFrame
from modin.pandas import read_csv
from xgboost_ray import RayDMatrix, RayParams, predict, train

ray.init(num_cpus=3, num_gpus=0)

repeat = 8  # Repeat data a couple of times for stability
x = np.array([
    [1, 0, 0, 0],  # Feature 0 -> Label 0
    [0, 1, 0, 0],  # Feature 1 -> Label 1
    [0, 0, 1, 1],  # Feature 2+3 -> Label 2
    [0, 0, 1, 0],  # Feature 2+!3 -> Label 3
] * repeat)
y = np.array([0, 1, 2, 3] * repeat)

PdDataFrame(x).to_csv("frame.csv")
data = read_csv("frame.csv")
data["label"] = y

dtrain = RayDMatrix(data, "label")

params = {
    "booster": "gbtree",
    "nthread": 1,
    "max_depth": 2,
    "objective": "multi:softmax",
    "num_class": 4,
}

bst = train(
    params,
    dtrain,
    num_boost_round=2,
    ray_params=RayParams(num_actors=2))

# Reusing the training matrix for prediction raises the ValueError:
predict(bst, dtrain, ray_params=RayParams(num_actors=2))
```
We have the same issue:
```python
import pandas as pd
import ray
import ray.data as ray_dataset
from sklearn.datasets import make_regression
from xgboost_ray import RayDMatrix, RayParams, RayXGBRegressor, RayShardingMode

X, y = make_regression(n_samples=10_000, n_features=10)
pd_df = pd.concat([pd.DataFrame(X), pd.DataFrame({"target": y})], axis=1)
ds = ray_dataset.from_pandas_refs([ray.put(pd_df), ray.put(pd_df)])

ray_params = RayParams(num_actors=2, cpus_per_actor=1)
reg = RayXGBRegressor(ray_params=ray_params, random_state=42)

# train
train_set = RayDMatrix(ds, "target")
reg.fit(train_set, y=None, ray_params=ray_params)
print(f"Regressor {reg}")

# predict on the same RayDMatrix -> ValueError
pred = reg.predict(train_set)
print("predicted values: ", pred)
```
```
sharding = <RayShardingMode.FIXED: 3>
data = [array([-180.90732,  310.31183,  -98.30024, ...,  178.35243,  199.3186 ,
       -110.26601], dtype=float32)]

    @DeveloperAPI
    def combine_data(sharding: RayShardingMode, data: Iterable) -> np.ndarray:
        if sharding not in (RayShardingMode.BATCH, RayShardingMode.INTERLEAVED):
>           raise ValueError(f"Invalid value for `sharding` parameter: "
                             f"{sharding}"
                             f"\nFIX THIS by passing any item of the "
                             f"`RayShardingMode` enum, for instance "
                             f"`RayShardingMode.BATCH`.")
E           ValueError: Invalid value for `sharding` parameter: RayShardingMode.FIXED
E           FIX THIS by passing any item of the `RayShardingMode` enum, for instance `RayShardingMode.BATCH`.
```
Raising this issue again, as it blocks our workflow.
As a workaround, recreate the `RayDMatrix` before passing it to `predict` again.
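The workaround boils down to never reusing the training matrix: build a fresh `RayDMatrix` from the same source data for each `predict` call. A dependency-free sketch of the pattern (the stub classes below are illustrative stand-ins mirroring the behavior reported here, not xgboost_ray API):

```python
from enum import Enum

class RayShardingMode(Enum):
    # Stand-in for xgboost_ray's enum; the real values may differ.
    INTERLEAVED = 1
    BATCH = 2
    FIXED = 3

class StubDMatrix:
    """Illustrative stand-in for RayDMatrix."""
    def __init__(self, data, label):
        self.data, self.label = data, label
        self.sharding = RayShardingMode.INTERLEAVED

def stub_train(dmatrix):
    # Mirrors the reported behavior: training flips the matrix to FIXED.
    dmatrix.sharding = RayShardingMode.FIXED

def make_matrix(data, label="label"):
    """Factory: build a fresh matrix per call instead of reusing one."""
    return StubDMatrix(data, label)

dtrain = make_matrix([[1, 0], [0, 1]])
stub_train(dtrain)
assert dtrain.sharding is RayShardingMode.FIXED        # reuse would fail
dpredict = make_matrix([[1, 0], [0, 1]])               # fresh matrix
assert dpredict.sharding is not RayShardingMode.FIXED  # safe for predict
```

Keeping a factory around the source data makes it harder to accidentally hand a trained-on (FIXED) matrix to `predict` until the library restores or rejects it itself.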