TypeError: Expected input to be a struct type, received: Python
Describe the bug
We have a scenario where text data is processed to produce tensor data with bfloat16 precision. Since bfloat16 is not supported in Daft, the UDF is defined with return_dtype=object. However, when retrieving a field from the returned value, an error is raised: TypeError: Expected input to be a struct type, received: Python.
Error message
Traceback (most recent call last):
File "/data00/code/tmp2/Daft/temp/test_tensort.py", line 86, in <module>
df = df.with_column("bfloat16", df["bfloat16_values"].struct.get("bfloat16"))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/data00/code/tmp2/Daft/daft/dataframe/dataframe.py", line 2249, in with_column
return self.with_columns({column_name: expr})
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/data00/code/tmp2/Daft/daft/dataframe/dataframe.py", line 2285, in with_columns
builder = self._builder.with_columns(new_columns)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/data00/code/tmp2/Daft/daft/logical/builder.py", line 163, in with_columns
builder = self._builder.with_columns(column_pyexprs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
daft.exceptions.DaftCoreException: DaftError::External Unable to create logical plan node.
Due to: DaftError::TypeError Expected input to be a struct type, received: Python
Code Example:
import daft
import torch

data = {"text": ["hello", "world"], "value": [1, 2]}
df = daft.from_pydict(data)

@daft.udf(return_dtype=object)
def to_bfloat16(str_s) -> torch.Tensor:
    res = []
    for s in str_s:
        res.append({"len": len(s), "bfloat16": torch.tensor([[1.5003, 2.2523, 3.7542], [1.2, 2.3, 4.2]], dtype=torch.bfloat16)})
    return res

df = df.with_column("bfloat16_values", to_bfloat16(df["text"]))
df = df.with_column("bfloat16", df["bfloat16_values"].struct.get("bfloat16"))
df.show()
To Reproduce
No response
Expected behavior
No response
Component(s)
Expressions
Additional context
No response
@huleilei — two questions
- (1) Which version/commit of daft are you using?
- (2) Does your issue persist even after checking out #5201?
Thank you!
@rchowell I'm using the latest main branch from GitHub. I tried again with today's latest code and the problem still persists. Thank you for helping me take a look.
@huleilei I think the issue is that you are specifying return_dtype=object, which is treated as a Python datatype, not a struct datatype.
For a struct return type, you need to use either a TypedDict or the DataType API.
DataType API
import daft
from daft import DataType as dt

return_dtype = dt.struct({"len": dt.int64(), "bfloat16": dt.python()})

@daft.udf(return_dtype=return_dtype)
def to_bfloat16(str_s): ...
TypedDict
from typing import TypedDict

import torch

class MyTypedDict(TypedDict):
    len: int
    bfloat16: torch.Tensor

@daft.udf(return_dtype=MyTypedDict)
def to_bfloat16(str_s): ...
Our full Python-to-Daft casting table can be found here:
https://docs.daft.ai/en/stable/api/datatypes/type_conversions/#python-to-daft
@universalmind303 Thanks. I have two more questions, could you please help me answer them:
- Will bfloat16 precision be supported in daft.DataType.tensor types in the future?
- When representing torch.tensor(dtype=torch.bfloat16) data as Daft's daft.DataType.Python type, is there any performance impact? For example, serialization.
Thank you again and best wishes.
(1) I think adding bfloat16 would be a pretty large endeavor. We'd be open to PRs for it, but I don't believe it's a high priority at the moment.
(2) We don't have any specific benchmarks comparing the two, but using daft.DataType.Python will generally be somewhat less performant than native types such as tensor[...]. It does, however, offer much greater flexibility for unsupported types such as bfloat16.
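To give a feel for the serialization point: Python-object values generally have to be pickled when they cross a process boundary, unlike native columns that can move as raw buffers. A tiny illustration of the per-tensor round-trip (assuming torch is installed; this is not a daft benchmark):

```python
import pickle

import torch

# A bfloat16 tensor stored as a Python object must be pickled whenever the
# column crosses a process boundary.
t = torch.tensor([[1.5, 2.25, 3.75], [1.2, 2.3, 4.2]], dtype=torch.bfloat16)
blob = pickle.dumps(t)
restored = pickle.loads(blob)

# The dtype survives the round-trip, but each element pays pickle overhead on
# top of its 2 raw bytes.
print(f"{len(blob)} bytes pickled for {t.numel()} bfloat16 elements")
```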
Thank you for your explanation. The daft.DataType.Python type indeed offers great convenience. However, there are still some differences between Daft's daft.DataType.tensor and PyTorch's torch.Tensor. In the AI field, algorithm practitioners commonly use torch.Tensor, so I think we could provide some methods to handle it, such as:
- Implementing mutual conversion between daft.DataType.tensor and torch.Tensor.
- Providing some tensor operation methods, for example: torch.abs(df["tensors"].tensor).
It looks like we should be able to support bfloat16 with some modifications, since the Rust numpy library supports conversions for it: https://docs.rs/numpy/latest/numpy/trait.Element.html
Is this a blocking issue for you? If so, I can take a look next week.
As for torch tensor to Daft, @huleilei we added support for native conversions in Daft v0.6.6, did you get a chance to try it? We also support specifying the dtype and shape of the tensor via jaxtyping (docs here)
@kevinzwang Yes, this issue is currently a blocker for our business progress as it results in low resource utilization. I'm very keen to participate in the resolution and improvement of this problem.
Regarding the bfloat16 support, thank you for the information and the link! I can confirm that this is a blocking issue for us as it impacts our operational efficiency. I'm happy to collaborate on the implementation and help with testing the feature and providing feedback. Thanks
@huleilei I do not have capacity to work on bfloat16 support in the near future, but I would be happy to provide guidance and reviews. Will you be working on a PR for this feature?
@kevinzwang I'm thrilled to implement bfloat16 support! Currently, I'm working on the dynamic resource scaling feature, and I'll dive into this right after completing it. Thank you so much for your guidance and review—it will be invaluable!