[datafusion-spark] Implement `size` function

Open iajoiner opened this issue 2 years ago • 9 comments

Is your feature request related to a problem or challenge?
A user on Discord wants size, akin to the PySpark function of the same name, to be implemented.

Describe the solution you'd like
Implement size as a BuiltinScalarFunction for ListArray, LargeListArray and MapArray.

iajoiner avatar Feb 19 '23 16:02 iajoiner

btw, there is no size function in PG. There are the cardinality and array_length functions instead. It's again the question: should we support both Spark and PG syntax?

comphead avatar Feb 20 '23 01:02 comphead

It's again the question: should we support both Spark and PG syntax?

I would personally suggest supporting Postgres syntax unless Postgres doesn't support the feature, in which case we could follow Spark.

alamb avatar Feb 20 '23 12:02 alamb

@alamb Sure! So I guess array_length should be the correct name here.

iajoiner avatar Feb 20 '23 15:02 iajoiner

@alamb Wait. In Postgres (https://www.postgresql.org/docs/15/functions-array.html) nested arrays must have matching dimensions, so it does make sense to have array_length(array, dimension) there. For us this makes less sense, since we can have ragged arrays that look like [[2, 3], [4, 5, 6]] and even lists of maps. cardinality makes more sense, but it doesn't suit our current use case. size from Spark probably makes the most sense here.

iajoiner avatar Feb 20 '23 16:02 iajoiner

I'll rebrand this as a Spark function issue.

Reference: https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.functions.size.html

We want to support the size() function on array and map types. Although we already have array_length, it is meant for arrays, so extending it to support maps could be confusing.
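To make the intended behaviour concrete, here is a minimal sketch in plain Rust. The `Collection` enum and this `size` function are hypothetical stand-ins for illustration only, not DataFusion or Arrow types, and null handling is left out entirely.

```rust
use std::collections::BTreeMap;

/// Hypothetical stand-in for a single array or map value.
/// This is not a DataFusion or Arrow type.
enum Collection {
    Array(Vec<i64>),
    Map(BTreeMap<String, i64>),
}

/// One function for both kinds of input: the number of elements
/// for an array, the number of key/value entries for a map.
fn size(value: &Collection) -> i32 {
    match value {
        Collection::Array(items) => items.len() as i32,
        Collection::Map(entries) => entries.len() as i32,
    }
}
```

Calling `size` on `Collection::Array(vec![2, 3, 4])` gives 3, and on a map with two entries gives 2. That is why a separate name reads more naturally than stretching array_length over maps.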

Jefffrey avatar Oct 20 '25 03:10 Jefffrey

I'll mark this as a good first issue. Some pointers:

  • See existing Spark function PRs for what is needed to add a Spark function, e.g. #18018
  • See the existing array_length implementation in DataFusion for reference, as I mentioned above; you can ignore all the code that handles dimensions, though, since Spark size doesn't do that
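
As a rough illustration of the second pointer (a sketch, not the actual DataFusion code): Arrow list and map arrays store an offsets buffer, so the top-level size of row i is offsets[i + 1] - offsets[i], with no dimension argument involved. The function name `sizes_from_offsets` and the plain-slice validity mask are assumptions made for this example. It uses 32-bit offsets only (LargeListArray uses 64-bit offsets), and returning None for null rows is also an assumption, since Spark's result for null input depends on configuration.

```rust
/// Per-row sizes for a list-like column stored as an offsets buffer plus
/// an optional validity mask. `offsets` has one more entry than there are
/// rows. Hypothetical helper for illustration, not a DataFusion API.
fn sizes_from_offsets(offsets: &[i32], validity: Option<&[bool]>) -> Vec<Option<i32>> {
    offsets
        .windows(2)
        .enumerate()
        .map(|(row, pair)| {
            // Rows without a validity mask are treated as non-null.
            let is_valid = validity.map_or(true, |mask| mask[row]);
            // Assumption: a null row yields None rather than a sentinel value.
            is_valid.then(|| pair[1] - pair[0])
        })
        .collect()
}
```

For the ragged column [[2, 3], [4, 5, 6], []] the offsets are [0, 2, 5, 5], and the sketch yields sizes 2, 3 and 0. Nested lists are counted only at the top level.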

Jefffrey avatar Oct 20 '25 03:10 Jefffrey

take

CuteChuanChuan avatar Oct 27 '25 15:10 CuteChuanChuan

@Jefffrey is this issue resolved yet, or can I work on it?

Yuvraj-cyborg avatar Dec 11 '25 12:12 Yuvraj-cyborg

Hi @Yuvraj-cyborg, I am currently working on this. Will submit a PR soon.

CuteChuanChuan avatar Dec 11 '25 12:12 CuteChuanChuan