
[Bug]: `SparkLikeLazyFrame.rename` cannot handle periods in names

Open lucas-nelson-uiuc opened this issue 3 months ago • 5 comments

Describe the bug

hey everyone,

not sure if this has been discussed elsewhere, but I ran into an issue renaming a Spark DataFrame that has periods in its column names (a somewhat common situation).

historically I've resolved this with .withColumnsRenamed(), but noticed that narwhals uses .select() - either approach works, with some minor differences.

wondering if we could do one of two things (rough sketches of both follow the permalink below):

  • replace select with withColumnsRenamed - rename_mapping remains unchanged
  • surround the keys in rename_mapping with backticks - self.native.select remains unchanged

https://github.com/narwhals-dev/narwhals/blob/ec5f4967bf7ea3513a9d0fa1cc837985b612e0f7/narwhals/_spark_like/dataframe.py#L358-L366
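
for illustration, here is a rough sketch of what each option could look like at the PySpark level - the helper names are hypothetical, not narwhals' actual code:

from pyspark.sql import DataFrame, functions as F

def rename_via_select(native_df: DataFrame, rename_mapping: dict[str, str]) -> DataFrame:
    # option 2: keep the select-based rename, but backtick-escape each source
    # name so a dotted name like "c.d" resolves as a single column rather than
    # as a struct field access (names containing literal backticks would need
    # extra escaping)
    return native_df.select(
        *[
            F.col(f"`{name}`").alias(rename_mapping.get(name, name))
            for name in native_df.columns
        ]
    )

def rename_via_with_columns_renamed(native_df: DataFrame, rename_mapping: dict[str, str]) -> DataFrame:
    # option 1: withColumnsRenamed (pyspark >= 3.4) takes the mapping as-is
    return native_df.withColumnsRenamed(rename_mapping)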

Steps or code to reproduce the bug

create a PySpark DataFrame with a column that has a dot in its name

import pandas as pd
from sqlframe.spark import SparkSession

spark = SparkSession.builder.getOrCreate()

temp = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6], "c": [7, 8, 9]})
spark_dataframe = spark.createDataFrame(temp).withColumnRenamed("c", "c.d")
spark_dataframe.columns
#>: ["a", "b", "c.d"]

try to rename the column using narwhals

import narwhals as nw

narwhals_dataframe = nw.from_native(spark_dataframe)
mapping = {column: column.replace(".", "_").upper() for column in narwhals_dataframe.columns}
renamed = narwhals_dataframe.rename(mapping)
renamed.to_native().show()
#>: AnalysisException

Expected results

renamed.columns
#>: ['A', 'B', 'C_D']

Actual results

raises an analysis exception

AnalysisException: [UNRESOLVED_COLUMN.WITH_SUGGESTION] A column or function parameter with name `c`.`d` cannot be resolved

Please run narwhals.show_version() and enter the output below.

System:
    python: 3.12.11 (main, Jun  4 2025, 08:56:18) [GCC 11.4.0]
executable: /usr/bin/python3
   machine: Linux-6.1.123+-x86_64-with-glibc2.35

Python dependencies:
     narwhals: 2.3.0
        numpy: 2.0.2
       pandas: 2.2.2
        modin: 
         cudf: 25.6.0
      pyarrow: 18.1.0
      pyspark: 3.5.1
       polars: 1.25.2
         dask: 2025.5.0
       duckdb: 1.3.2
         ibis: 9.5.0
     sqlframe: 3.40.2

Relevant log output

---------------------------------------------------------------------------

AnalysisException                         Traceback (most recent call last)

/tmp/ipython-input-3169437464.py in <cell line: 0>()
     11 narwhals_dataframe = nw.from_native(spark_dataframe)
     12 mapping = {column: column.upper() for column in narwhals_dataframe.columns}
---> 13 narwhals_dataframe.rename(mapping).to_native().show()

7 frames

/usr/local/lib/python3.12/dist-packages/pyspark/errors/exceptions/captured.py in deco(*a, **kw)
    183                 # Hide where the exception came from that shows a non-Pythonic
    184                 # JVM exception message.
--> 185                 raise converted from None
    186             else:
    187                 raise

AnalysisException: [UNRESOLVED_COLUMN.WITH_SUGGESTION] A column or function parameter with name `c`.`d` cannot be resolved. Did you mean one of the following? [`t32279404`.`a`, `t32279404`.`b`, `t32279404`.`c.d`].; line 1 pos 293;
'WithCTE
:- CTERelationDef 9, false
:  +- SubqueryAlias t28915597
:     +- Project [cast(a#49 as bigint) AS a#42L, cast(b#50 as bigint) AS b#43L, cast(c#51 as bigint) AS c#44L]
:        +- SubqueryAlias a11
:           +- LocalRelation [a#49, b#50, c#51]
:- CTERelationDef 10, false
:  +- SubqueryAlias t32279404
:     +- Project [a#42L, b#43L, c#44L AS c.d#45L]
:        +- SubqueryAlias t28915597
:           +- CTERelationRef 9, true, [a#42L, b#43L, c#44L], false
:- 'CTERelationDef 11, false
:  +- 'SubqueryAlias t12673183
:     +- 'Project [a#42L AS a#46L, b#43L AS b#47L, 'c.d AS c.d#48]
:        +- SubqueryAlias t32279404
:           +- CTERelationRef 10, true, [a#42L, b#43L, c.d#45L], false
+- 'GlobalLimit 20
   +- 'LocalLimit 20
      +- 'Project ['a AS A#39, 'b AS B#40, '`c.d` AS C.D#41]
         +- 'SubqueryAlias t12673183
            +- 'CTERelationRef 11, false, false

lucas-nelson-uiuc commented on Sep 11 '25

thanks @lucas-nelson-uiuc for the report! our minimum pyspark version is 3.4.0 so happy to use withColumnsRenamed

MarcoGorelli commented on Sep 12 '25

@MarcoGorelli sorry, I might have re-opened this too soon. I thought it was a mistake since the PR only adds a test, but the test doesn't fail without editing the codebase?

Edit: never mind, the issue is renaming from a dotted name to anything else

FBruzzesi commented on Sep 12 '25

thanks @FBruzzesi

yup - @lucas-nelson-uiuc as noted in the linked PR, we can't use withColumnsRenamed here unfortunately

I'd suggest renaming outside of Narwhals for now
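
for example, a rough sketch of that workaround reusing the reproduction above (assuming a plain PySpark DataFrame - withColumnsRenamed needs pyspark >= 3.4; on older versions, chain withColumnRenamed calls):

# rename the dotted columns on the native DataFrame before wrapping it
mapping = {column: column.replace(".", "_").upper() for column in spark_dataframe.columns}
renamed_native = spark_dataframe.withColumnsRenamed(mapping)

narwhals_dataframe = nw.from_native(renamed_native)
narwhals_dataframe.columns
#>: ['A', 'B', 'C_D']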

MarcoGorelli commented on Sep 12 '25

The thing is that this is not a rename-only issue - select and with_columns break as well:

import narwhals as nw
import pandas as pd
from sqlframe.spark import SparkSession

spark = SparkSession.builder.getOrCreate()

temp = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6], "c": [7, 8, 9]})
spark_dataframe = spark.createDataFrame(temp).withColumnRenamed("c", "c.d")
nw.from_native(spark_dataframe).select("c.d").collect("pandas")

AnalysisException: [UNRESOLVED_COLUMN.WITH_SUGGESTION] A column, variable, or function parameter with name c.d cannot be resolved. Did you mean one of the following? [t33636136.a, t33636136.b, t33636136.c.d]. SQLSTATE: 42703; line 1 pos 251;
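
for what it's worth, backtick-quoting the name does resolve it at the plain PySpark level, which is what the escaping option above relies on (sketch - behaviour with the sqlframe backend may differ):

# an unquoted dot is parsed as struct field access, while backticks make
# Spark treat it as a literal column name (plain PySpark semantics)
spark_dataframe.select("c.d")     # fails: looks for field `d` on a struct column `c`
spark_dataframe.select("`c.d`")   # resolves the single column named "c.d"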

FBruzzesi commented on Sep 12 '25

@MarcoGorelli one option to "fail early" is to do a check when we wrap a dataframe, in the same way we check for duplicate column names in pandas. That way we have a bit more control and can suggest what to do, rather than failing later on.
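
a minimal sketch of what such a check could look like (hypothetical helper, not narwhals' actual validation; the exception type and message are just for illustration):

def check_column_names_are_supported(native_df) -> None:
    # hypothetical "fail early" validation, analogous to the duplicate-column
    # check for pandas: reject dotted names at wrap time with an actionable
    # message instead of failing later inside select/rename
    offending = [name for name in native_df.columns if "." in name]
    if offending:
        msg = (
            f"Column names containing '.' are not supported for Spark-like backends: {offending}. "
            "Please rename these columns (e.g. with `withColumnsRenamed`) before calling `nw.from_native`."
        )
        raise ValueError(msg)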

FBruzzesi commented on Sep 12 '25