xgboost_ray
`predict` on distributed `RayDMatrix` that was already used for training fails
When `predict` is called on a distributed `RayDMatrix` that was already used for training, it fails: `combine_data` requires `sharding` to be one of `RayShardingMode.INTERLEAVED` or `RayShardingMode.BATCH`, but after training the `RayDMatrix` will have `RayShardingMode.FIXED`.
```
E   ValueError: Invalid value for `sharding` parameter: RayShardingMode.FIXED
E   FIX THIS by passing any item of the `RayShardingMode` enum, for instance `RayShardingMode.BATCH`.
```
A distributed `RayDMatrix` is essentially "single use". This should be rectified either by restoring the original state of the `RayDMatrix` after training, or by raising an exception if it is reused for prediction.
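The "restore the original state" option could look like the following dependency-free sketch. `FakeDMatrix`, `preserve_sharding`, and the stand-in `RayShardingMode` enum are illustrative, not xgboost_ray API; the real fix would live inside `train`:

```python
from contextlib import contextmanager
from enum import Enum

class RayShardingMode(Enum):
    # Stand-in for xgboost_ray's enum; the real values may differ.
    INTERLEAVED = 1
    BATCH = 2
    FIXED = 3

class FakeDMatrix:
    """Illustrative stand-in for RayDMatrix with a mutable sharding mode."""
    def __init__(self, sharding=RayShardingMode.INTERLEAVED):
        self.sharding = sharding

@contextmanager
def preserve_sharding(dmatrix):
    """Save the sharding mode and restore it on exit, so a training step
    that temporarily switches to FIXED does not break later predict calls."""
    original = dmatrix.sharding
    try:
        yield dmatrix
    finally:
        dmatrix.sharding = original

dtrain = FakeDMatrix()
with preserve_sharding(dtrain):
    dtrain.sharding = RayShardingMode.FIXED  # what training effectively does
assert dtrain.sharding == RayShardingMode.INTERLEAVED  # usable again
```

With a guard like this around the training loop, the same matrix would come out of `train` in the state it went in.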
Repro (imports added, `self.x` fixed to `x`, and the failing `predict` call included to trigger the error):

```python
import numpy as np
import ray
from pandas import DataFrame as PdDataFrame
from modin.pandas import read_csv
from xgboost_ray import RayDMatrix, RayParams, predict, train

ray.init(num_cpus=3, num_gpus=0)

repeat = 8  # Repeat data a couple of times for stability
x = np.array([
    [1, 0, 0, 0],  # Feature 0 -> Label 0
    [0, 1, 0, 0],  # Feature 1 -> Label 1
    [0, 0, 1, 1],  # Feature 2+3 -> Label 2
    [0, 0, 1, 0],  # Feature 2+!3 -> Label 3
] * repeat)
y = np.array([0, 1, 2, 3] * repeat)

PdDataFrame(x).to_csv("frame.csv")
data = read_csv("frame.csv")
data["label"] = y

dtrain = RayDMatrix(data, "label")

params = {
    "booster": "gbtree",
    "nthread": 1,
    "max_depth": 2,
    "objective": "multi:softmax",
    "num_class": 4,
}

bst = train(
    params,
    dtrain,
    num_boost_round=2,
    ray_params=RayParams(num_actors=2))

# Reusing the training matrix for prediction raises the ValueError:
predict(bst, dtrain, ray_params=RayParams(num_actors=2))
```
We have the same issue:
```python
import pandas as pd
import ray
import ray.data as ray_dataset
from sklearn.datasets import make_regression
from xgboost_ray import RayDMatrix, RayParams, RayXGBRegressor, RayShardingMode

X, y = make_regression(n_samples=10_000, n_features=10)
pd_df = pd.concat([pd.DataFrame(X), pd.DataFrame({"target": y})], axis=1)
ds = ray_dataset.from_pandas_refs([ray.put(pd_df), ray.put(pd_df)])

ray_params = RayParams(num_actors=2, cpus_per_actor=1)
reg = RayXGBRegressor(ray_params=ray_params, random_state=42)

# train
train_set = RayDMatrix(ds, "target")
reg.fit(train_set, y=None, ray_params=ray_params)
print(f"Regressor {reg}")

# predict on the same RayDMatrix -> ValueError
pred = reg.predict(train_set)
print("predicted values: ", pred)
```
```
sharding = <RayShardingMode.FIXED: 3>
data = [array([-180.90732,  310.31183,  -98.30024, ...,  178.35243,  199.3186 ,
       -110.26601], dtype=float32)]

    @DeveloperAPI
    def combine_data(sharding: RayShardingMode, data: Iterable) -> np.ndarray:
        if sharding not in (RayShardingMode.BATCH, RayShardingMode.INTERLEAVED):
>           raise ValueError(f"Invalid value for `sharding` parameter: "
                             f"{sharding}"
                             f"\nFIX THIS by passing any item of the "
                             f"`RayShardingMode` enum, for instance "
                             f"`RayShardingMode.BATCH`.")
E           ValueError: Invalid value for `sharding` parameter: RayShardingMode.FIXED
E           FIX THIS by passing any item of the `RayShardingMode` enum, for instance `RayShardingMode.BATCH`.
```
Raising this issue again, as it blocks our workflow.
As a workaround, recreate the `RayDMatrix` before passing it to `predict` again.
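The workaround boils down to never reusing the training matrix: build a fresh `RayDMatrix` from the same source data for each `predict` call. A dependency-free sketch of the pattern (the stub classes below are illustrative stand-ins mirroring the behavior reported here, not xgboost_ray API):

```python
from enum import Enum

class RayShardingMode(Enum):
    # Stand-in for xgboost_ray's enum; the real values may differ.
    INTERLEAVED = 1
    BATCH = 2
    FIXED = 3

class StubDMatrix:
    """Illustrative stand-in for RayDMatrix."""
    def __init__(self, data, label):
        self.data, self.label = data, label
        self.sharding = RayShardingMode.INTERLEAVED

def stub_train(dmatrix):
    # Mirrors the reported behavior: training flips the matrix to FIXED.
    dmatrix.sharding = RayShardingMode.FIXED

def make_matrix(data, label="label"):
    """Factory: build a fresh matrix per call instead of reusing one."""
    return StubDMatrix(data, label)

dtrain = make_matrix([[1, 0], [0, 1]])
stub_train(dtrain)
assert dtrain.sharding is RayShardingMode.FIXED        # reuse would fail
dpredict = make_matrix([[1, 0], [0, 1]])               # fresh matrix
assert dpredict.sharding is not RayShardingMode.FIXED  # safe for predict
```

Keeping a factory around the source data makes it harder to accidentally hand a trained-on (FIXED) matrix to `predict` until the library restores or rejects it itself.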