
`predict` on distributed `RayDMatrix` that was already used for training fails

Open · Yard1 opened this issue · 3 comments

When `predict` is called on a distributed `RayDMatrix` that was already used for training, it fails: `combine_data` requires `sharding` to be one of `RayShardingMode.INTERLEAVED` or `RayShardingMode.BATCH`, but after training the `RayDMatrix` is left with `RayShardingMode.FIXED`.

E           ValueError: Invalid value for `sharding` parameter: RayShardingMode.FIXED
E           FIX THIS by passing any item of the `RayShardingMode` enum, for instance `RayShardingMode.BATCH`.

A distributed `RayDMatrix` is therefore effectively single-use. This should be rectified either by restoring the original state of the `RayDMatrix` after training, or by raising a clear exception if it is reused for prediction.
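A minimal sketch of the second option (the guard below is hypothetical, not xgboost_ray API; it assumes the matrix exposes its current sharding mode via the `sharding` attribute):

    from xgboost_ray import RayDMatrix, RayShardingMode

    # Hypothetical guard, not part of xgboost_ray: fail fast when a RayDMatrix
    # that training re-sharded to FIXED is reused for prediction.
    def check_predictable(dmatrix: RayDMatrix) -> None:
        if dmatrix.sharding == RayShardingMode.FIXED:
            raise RuntimeError(
                "This RayDMatrix was already consumed by training (sharding "
                "is now FIXED). Recreate it before calling predict.")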

Repro:

    import numpy as np
    import ray
    from pandas import DataFrame as PdDataFrame
    from modin.pandas import read_csv
    from xgboost_ray import RayDMatrix, RayParams, train

    ray.init(num_cpus=3, num_gpus=0)

    repeat = 8  # Repeat data a couple of times for stability
    x = np.array([
        [1, 0, 0, 0],  # Feature 0 -> Label 0
        [0, 1, 0, 0],  # Feature 1 -> Label 1
        [0, 0, 1, 1],  # Feature 2+3 -> Label 2
        [0, 0, 1, 0],  # Feature 2+!3 -> Label 3
    ] * repeat)
    y = np.array([0, 1, 2, 3] * repeat)

    # Was `self.x` (copied from test code); write without the index column
    # so it does not get read back as an extra feature.
    PdDataFrame(x).to_csv("frame.csv", index=False)
    data = read_csv("frame.csv")
    data["label"] = y
    dtrain = RayDMatrix(data, "label")

    params = {
        "booster": "gbtree",
        "nthread": 1,
        "max_depth": 2,
        "objective": "multi:softmax",
        "num_class": 4
    }

    bst = train(
        params,
        dtrain,
        num_boost_round=2,
        ray_params=RayParams(num_actors=2))
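
Calling `predict` on the same matrix then reproduces the failure (a sketch reusing the repro's names; `predict` is xgboost_ray's distributed inference entry point):

    from xgboost_ray import predict

    # Raises: ValueError: Invalid value for `sharding` parameter: RayShardingMode.FIXED
    pred = predict(bst, dtrain, ray_params=RayParams(num_actors=2))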

Yard1 · Jul 10 '21

We have the same issue too:

    import pandas as pd
    import ray
    from ray import data as ray_dataset
    from xgboost_ray import RayDMatrix, RayParams, RayXGBRegressor, RayShardingMode
    from sklearn.datasets import make_regression

    X, y = make_regression(n_samples=10_000, n_features=10)
    pd_df = pd.concat([pd.DataFrame(X), pd.DataFrame({"target": y})], axis=1)
    ds = ray_dataset.from_pandas_refs([ray.put(pd_df), ray.put(pd_df)])
    ray_params = RayParams(num_actors=2, cpus_per_actor=1)
    reg = RayXGBRegressor(
        ray_params=ray_params, random_state=42
    )
    # train
    train_set = RayDMatrix(ds, "target")
    reg.fit(train_set, y=None, ray_params=ray_params)
    print(f"Regressor {reg}")
    pred = reg.predict(train_set)  # fails: the matrix was re-sharded to FIXED during fit
    print("predicted values: ", pred)
sharding = <RayShardingMode.FIXED: 3>, data = [array([-180.90732,  310.31183,  -98.30024, ...,  178.35243,  199.3186 ,
       -110.26601], dtype=float32)]

    @DeveloperAPI
    def combine_data(sharding: RayShardingMode, data: Iterable) -> np.ndarray:
        if sharding not in (RayShardingMode.BATCH, RayShardingMode.INTERLEAVED):
>           raise ValueError(f"Invalid value for `sharding` parameter: "
                             f"{sharding}"
                             f"\nFIX THIS by passing any item of the "
                             f"`RayShardingMode` enum, for instance "
                             f"`RayShardingMode.BATCH`.")
E           ValueError: Invalid value for `sharding` parameter: RayShardingMode.FIXED
E           FIX THIS by passing any item of the `RayShardingMode` enum, for instance `RayShardingMode.BATCH`.


chaokunyang · Nov 17 '21

Raising this issue again, as it blocks our workflow.

tsaoyu · Jul 04 '22

As a workaround, recreate the `RayDMatrix` before passing it to `predict` again.
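
A minimal sketch of that workaround, reusing the names from the first repro (and the `predict` call above):

    dpredict = RayDMatrix(data, "label")  # fresh matrix over the same data
    pred = predict(bst, dpredict, ray_params=RayParams(num_actors=2))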

Yard1 · Jul 04 '22