
TypeError Expected input to be a struct type, received: Python

Open huleilei opened this issue 2 months ago • 11 comments

Describe the bug

I have a scenario where text data is processed to produce tensor data with bfloat16 precision. However, since bfloat16 is not supported in Daft, the UDF is defined with return_dtype=object. When I then try to retrieve a field from the returned values, an error is raised: TypeError Expected input to be a struct type, received: Python.

Error message

Traceback (most recent call last):
  File "/data00/code/tmp2/Daft/temp/test_tensort.py", line 86, in <module>
    df = df.with_column("bfloat16", df["bfloat16_values"].struct.get("bfloat16"))
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data00/code/tmp2/Daft/daft/dataframe/dataframe.py", line 2249, in with_column
    return self.with_columns({column_name: expr})
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data00/code/tmp2/Daft/daft/dataframe/dataframe.py", line 2285, in with_columns
    builder = self._builder.with_columns(new_columns)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data00/code/tmp2/Daft/daft/logical/builder.py", line 163, in with_columns
    builder = self._builder.with_columns(column_pyexprs)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
daft.exceptions.DaftCoreException: DaftError::External Unable to create logical plan node.
Due to: DaftError::TypeError Expected input to be a struct type, received: Python

Code Example:

import daft
import torch

data = {"text": ["hello", "world"], "value": [1, 2]}
df = daft.from_pydict(data)

@daft.udf(return_dtype=object)
def to_bfloat16(str_s) -> torch.Tensor:
    res = []
    for s in str_s:
        res.append({"len": len(s), "bfloat16": torch.tensor([[1.5003, 2.2523, 3.7542],[1.2,2.3,4.2]], dtype=torch.bfloat16)})
    return res

df = df.with_column("bfloat16_values", to_bfloat16(df["text"]))
df = df.with_column("bfloat16", df["bfloat16_values"].struct.get("bfloat16"))

df.show()


Component(s)

Expressions


huleilei avatar Sep 30 '25 08:09 huleilei

@huleilei — two questions

  • (1) Which version/commit of daft are you using?
  • (2) Does your issue persist even after checking out #5201?

Thank you!

rchowell avatar Sep 30 '25 16:09 rchowell

@rchowell I'm using the latest main branch from GitHub. I tried the latest code again today and the problem still persists. Thank you for taking a look.

huleilei avatar Oct 04 '25 06:10 huleilei

@huleilei I think the issue is that you are specifying return_dtype=object. This is treated as a Python datatype, not a struct datatype.

To return a struct, you need to either use a TypedDict or the DataType API:

DataType API

import daft
from daft import DataType as dt

return_dtype = dt.struct({"len": dt.int64(), "bfloat16": dt.python()})

@daft.udf(return_dtype=return_dtype)
def to_bfloat16(str_s): ...

TypedDict

import daft
import torch
from typing import TypedDict


class MyTypedDict(TypedDict):
    len: int
    bfloat16: torch.Tensor


@daft.udf(return_dtype=MyTypedDict)
def to_bfloat16(str_s): ...
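
Putting that together with your reproducer, a rough, untested sketch using the DataType API would look like the following; with an explicit struct return type, the struct.get call should resolve instead of raising the type error (the sample values are placeholders):

import daft
import torch
from daft import DataType as dt

# Declare the struct schema up front; the bfloat16 tensor stays an opaque Python object.
bf16_struct = dt.struct({"len": dt.int64(), "bfloat16": dt.python()})

@daft.udf(return_dtype=bf16_struct)
def to_bfloat16(str_s):
    return [
        {"len": len(s), "bfloat16": torch.tensor([1.5, 2.25], dtype=torch.bfloat16)}
        for s in str_s.to_pylist()
    ]

df = daft.from_pydict({"text": ["hello", "world"], "value": [1, 2]})
df = df.with_column("bfloat16_values", to_bfloat16(df["text"]))
df = df.with_column("bfloat16", df["bfloat16_values"].struct.get("bfloat16"))
df.show()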

universalmind303 avatar Oct 06 '25 16:10 universalmind303

Our full Python -> Daft casting table can be found here:

https://docs.daft.ai/en/stable/api/datatypes/type_conversions/#python-to-daft

universalmind303 avatar Oct 06 '25 16:10 universalmind303

@universalmind303 Thanks. I have two more questions I'd appreciate your help with:

  1. Will bfloat16 precision be supported in daft.DataType.tensor types in the future?
  2. If torch.tensor(..., dtype=torch.bfloat16) data is represented as Daft's daft.DataType.Python type, is there any performance impact, for example from serialization?

Thank you again and best wishes.

huleilei avatar Oct 09 '25 10:10 huleilei

(1). I think adding bfloat16 would be a pretty large endeavor. We'd be open to PRs for it, but I don't believe it's a high priority at the moment.

(2). We don't have any specific benchmarks comparing the two, but using daft.DataType.Python will generally be a little less performant than native types such as tensor[...]. It does, however, offer much greater flexibility for unsupported types such as bf16.
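
If the bf16 precision itself is not strictly required downstream, one untested workaround sketch (the to_float32_tensor name is just illustrative, and this assumes numpy arrays are accepted for tensor-typed UDF returns) is to upcast to float32 inside the UDF so the column can use the native tensor dtype:

import daft
import torch
from daft import DataType as dt

@daft.udf(return_dtype=dt.tensor(dt.float32()))
def to_float32_tensor(str_s):
    # Upcast bf16 -> float32 so the values fit Daft's native tensor dtype.
    return [
        torch.tensor([len(s), len(s) + 0.5], dtype=torch.bfloat16).to(torch.float32).numpy()
        for s in str_s.to_pylist()
    ]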

universalmind303 avatar Oct 17 '25 19:10 universalmind303

Thank you for the explanation. daft.DataType.Python indeed offers great convenience. However, there are still some differences between Daft's daft.DataType.tensor and PyTorch's torch.tensor, and in the AI field practitioners commonly work with torch.tensor. It would therefore be useful to provide some methods for handling torch.tensor, such as:

  1. Implementing mutual conversion between daft.DataType.tensor and torch.tensor (a rough manual round-trip is sketched after this list).
  2. Providing some tensor operation methods, for example: torch.abs(df["tensors"].tensor).
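
For item 1, here is a rough, untested sketch of the manual round-trip that is possible today (torch_abs is a hypothetical helper, not an existing Daft API, and this assumes tensor columns come back as numpy arrays from Series.to_pylist()):

import daft
import numpy as np
import torch
from daft import DataType as dt

@daft.udf(return_dtype=dt.tensor(dt.float32()))
def torch_abs(tensors):
    out = []
    for arr in tensors.to_pylist():
        t = torch.from_numpy(np.asarray(arr, dtype=np.float32))  # Daft -> torch
        out.append(torch.abs(t).numpy())                         # torch -> Daft
    return out

df = daft.from_pydict({"tensors": [np.array([-1.0, 2.0], dtype=np.float32)]})
df = df.with_column("abs", torch_abs(df["tensors"]))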

huleilei avatar Oct 18 '25 14:10 huleilei

It looks like we should be able to support bfloat16 with some modifications, since the Rust numpy crate supports conversions for it (https://docs.rs/numpy/latest/numpy/trait.Element.html). Is this a blocking issue for you? If so, I can take a look next week.

As for torch.Tensor to Daft conversion, @huleilei, we added support for native conversions in Daft v0.6.6. Did you get a chance to try it? We also support specifying the dtype and shape of the tensor via jaxtyping (docs here).

kevinzwang avatar Oct 29 '25 03:10 kevinzwang

@kevinzwang Yes, this is currently a blocker for our business: it results in low resource utilization and impacts our operational efficiency. Thank you for the information and the link regarding bfloat16 support! I'm keen to participate in the resolution, collaborate on the implementation, and help with testing and feedback. Thanks

huleilei avatar Oct 29 '25 16:10 huleilei

@huleilei I do not have capacity to work on bfloat16 support in the near future, but I would be happy to provide guidance and reviews. Will you be working on a PR for this feature?

kevinzwang avatar Nov 07 '25 20:11 kevinzwang

@kevinzwang I'm thrilled to implement bfloat16 support! Currently, I'm working on the dynamic resource scaling feature, and I'll dive into this right after completing it. Thank you so much for your guidance and review—it will be invaluable!

huleilei avatar Nov 08 '25 12:11 huleilei