fastparquet TypeError: assign() keywords must be strings

I am running into issues converting a dataframe to parquet on Heroku (Ubuntu 20.04). The code works perfectly on my local Windows machine. The dataframe has a multi-index whose levels have dtypes datetime and str. I receive the error below.

File "/app/.heroku/python/lib/python3.8/site-packages/fastparquet/writer.py", line 935, in write
2021-12-31T14:17:00.091844+00:00 app[web.1]: data = data.assign(**{name: pd.Categorical.from_codes(codes, cats)})
2021-12-31T14:17:00.091845+00:00 app[web.1]: TypeError: assign() keywords must be strings

df.to_parquet()

"""
                 close
2007-01-01 SPY  140.54
2007-01-08 SPY  143.24
2007-01-15 SPY  142.82
2007-01-22 SPY  142.13
2007-01-29 SPY  144.81
...        ...
2021-11-29 SPY  453.42
2021-12-06 SPY  470.74
"""

Environment:

fastparquet: 0.7.2
Python version: 1.3.5
Operating System: Ubuntu 20.04
Install method (conda, pip, source): pip, pypi

Dec 31 '21 14:12 pieroliviermarquis

If it works locally but not remotely, could you please check the versions installed in both environments, particularly pandas? I don't know if it's possible for you, but it would be useful to know the value of name that caused the exception.

Dec 31 '21 14:12 martindurant

Thanks for the quick reply. name comes from the FrozenList of names of the DataFrame's multi-index. The names were None by default, but that somehow worked on my local machine. Setting them explicitly solves the issue.
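For anyone hitting the same error, here is a minimal sketch of the workaround described above. The level names "date" and "symbol" and the two sample rows are only illustrative assumptions; the to_parquet call is left commented out so that the snippet runs with pandas alone.

```python
import pandas as pd

# Hypothetical two-row frame shaped like the one in the report:
# a (datetime, str) multi-index whose levels have no names.
idx = pd.MultiIndex.from_tuples(
    [(pd.Timestamp("2007-01-01"), "SPY"), (pd.Timestamp("2007-01-08"), "SPY")]
)
df = pd.DataFrame({"close": [140.54, 143.24]}, index=idx)
names_before = list(df.index.names)  # both level names are None here

# Expanding a non-string key as a keyword argument fails in Python itself,
# which is the same kind of failure shown in the traceback.
try:
    df.assign(**{None: 0})
    keyword_error = False
except TypeError:
    keyword_error = True

# Workaround: give every index level a string name before writing.
df.index = df.index.set_names(["date", "symbol"])
names_after = list(df.index.names)

# df.to_parquet("prices.parquet", engine="fastparquet")  # hypothetical path
```

Naming the levels once, right after the frame is built, keeps the fix independent of whichever fastparquet version is installed.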

Dec 31 '21 16:12 pieroliviermarquis

I think @yohplala fixed the issue with None names of multi-index levels recently.

Dec 31 '21 16:12 martindurant

Hi, what I solved for column multi-indexes in the recently merged PR is the handling of an empty string '' as a column name.

But I understand that the trouble here is with None being used for level names. I confirm that in a dummy branch, not merged, I also 'proposed a fix' for it, but I did not port the fix to PR #729: it would have been 'yet another item', I am not sure about possible side effects, and I did not want to spend too much time on it (the workaround is simple: it is enough to give the levels a name).

So basically, the trouble arises when you define your column multi-index without names for the levels.

import pandas as pd

# Notice that no 'names' parameter is provided.
# This is fully acceptable to pandas, but not to fastparquet.
cmidx = pd.MultiIndex.from_tuples([('a', '1'), ('b', '1')])

In fastparquet, an exception is then raised in util.get_column_metadata(), line 335.

    if isinstance(name, tuple):
        name = str(name)
    elif not isinstance(name, str):
        raise TypeError(
            'Column name must be a string. Got column {} of type {}'.format(
                name, type(name).__name__
            )
        )

I 'solved' it this way, with an additional branch to handle the None case.

    if isinstance(name, tuple):
        name = str(name)
    elif name is None:
        name = ''
    elif not isinstance(name, str):
        raise TypeError(
            'Column name must be a string. Got column {} of type {}'.format(
                name, type(name).__name__
            )
        )

Here, if I get it right, name is the name of the column index, not of a specific column. When it is a column multi-index, its name is then a tuple. Each value of the tuple is actually a level name, and it can be None.
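To make the behaviour of the patched branch easier to check, here it is as a stand-alone sketch. The function name normalize_column_name is hypothetical; it only copies the logic quoted above and is not the actual fastparquet code.

```python
def normalize_column_name(name):
    """Hypothetical stand-alone copy of the branch quoted above."""
    if isinstance(name, tuple):
        # A tuple is stringified as a whole, including any None inside it.
        name = str(name)
    elif name is None:
        # The proposed extra case: a missing name becomes an empty string.
        name = ''
    elif not isinstance(name, str):
        raise TypeError(
            'Column name must be a string. Got column {} of type {}'.format(
                name, type(name).__name__
            )
        )
    return name


from_none = normalize_column_name(None)          # ''
from_tuple = normalize_column_name(('a', None))  # "('a', None)"
from_str = normalize_column_name('close')        # 'close'
```

Note that a tuple containing None goes through the first branch untouched by the new case, so the extra branch only matters when name itself is None.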

With the fix above, the exception was no longer raised, and I could read the dataframe back, so that part was OK. What I am not comfortable with is whether name is used elsewhere (does setting it to '' have a side effect?).

At least, I should have:

created a test case with an assert and checking df_init.equals(df_recorded)
run all the other test cases
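A rough sketch of what such a test case could look like follows. The helper names, the sample values, and the path argument are assumptions for illustration; whether the round trip actually compares equal depends on the fastparquet version and on the fix being in place, so nothing is asserted here about its result.

```python
import pandas as pd


def make_unnamed_frame():
    # Column multi-index built without level names, as in the example above.
    cmidx = pd.MultiIndex.from_tuples([("a", "1"), ("b", "1")])
    return pd.DataFrame([[1, 2], [3, 4]], columns=cmidx)


def roundtrip_equals(df_init, path, engine="fastparquet"):
    # Write the frame, read it back, and compare with the original.
    df_init.to_parquet(path, engine=engine)
    df_recorded = pd.read_parquet(path, engine=engine)
    return df_init.equals(df_recorded)


df_init = make_unnamed_frame()
# In a test: assert roundtrip_equals(df_init, tmp_path / "test.parquet")
```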

But I have not delved into that yet, one thing at a time :) (I want to move on with PR #712 first :))

I propose to keep the ticket open till we come back to this and solve it (I think we just need to confirm the fix is ok). Bests, and soon, happy new year!!!! :)

Dec 31 '21 17:12 yohplala

OK, leaving this open, but I have no plans to work on it in the near term. As you say, the workaround is simple.

Dec 31 '21 17:12 martindurant
