MLServer icon indicating copy to clipboard operation
MLServer copied to clipboard

Fix Pandas codec decoding from numpy arrays

Open lhnwrk opened this issue 1 year ago • 2 comments

Pandas codec currently breaks mlflow runtime's schema enforcement (#1625). This is because numeric mlflow datatypes are assigned NumpyCodec content type by default, which decodes PandasCodec's explicit batch size into a column vector:

input = RequestInput(
	name="foo",
	shape=[3, 1],
	data=[1,2,3],
	datatype="INT64",
)
NumpyCodec.decode_input(input)  # np.array([[1], [2], [3]])

However, pd.Series expects one-dimensional data, and when the _decoded_payload is converted to a pd.Series here returns unexpected results:

# Expect pd.Series([1,2,3]), got pd.Series([(1,), (2,), (3,)])
payload = np.array([[1], [2], [3]])
payload = list(payload)
pd.Series(payload, dtype="int64")

This seems like a bug from pandas side, dtype="int64" should have thrown an error (and in fact it does for higher dimension, but for some reason is okay with [np.array([1]), np.array([2]), np.array([3])] and weirdly casts them into tuples), but either way mlserver should have dropped the trailing axis as pd.Series already implies column vector.

From this test, it seems like the conversion to list in the codec was to accommodate the use case of pd.Series of tensors:

# Expect decoded pd.Series([np.array([1,2,3])]) in test
RequestInput(
	name="a",
	data=[1, 2, 3],
	datatype="FP32",
	shape=[1, 3],
	parameters=Parameters(_decoded_payload=np.array([[1, 2, 3]])),
)

I'd argue in the context of Pandas codec this should be made explicit with shape [1, 1] and datatype BYTES for array elements. In fact, a test above does so, specifying the intended shape is a [2,1] column vector with datatype BYTES for list elements.

# Expect encoded ResponseOutput(name="bar", shape=[2, 1], data=[[1, 2, 3], [4, 5, 6]], datatype="BYTES") in test
pd.Series(data=[[1, 2, 3], [4, 5, 6]], name="bar")

This PR explicitly reshapes np.array payload to expected shapes before conversion to pd.Series.

lhnwrk avatar May 09 '24 16:05 lhnwrk

We've gotten that fix in. @lhnwrk: could you please rebase this branch on top of master?

jesse-c avatar May 15 '24 15:05 jesse-c

@jesse-c Rebased on master!

lhnwrk avatar May 15 '24 22:05 lhnwrk

@lhnwrk: Thank you! I've had to re-approve it and run the checks, so they're going now. As soon as it passes, I'll merge it in.

jesse-c avatar May 16 '24 09:05 jesse-c

Merged in! Thank you again, @lhnwrk. 🙏🏻

jesse-c avatar May 16 '24 10:05 jesse-c