fastparquet
TypeError: assign() keywords must be strings
I am running into issues converting a DataFrame to parquet on Heroku (Ubuntu 20.04). The code works perfectly on my local Windows machine. The DataFrame has a multi-index with dtypes datetime and str. I receive the error below.
File "/app/.heroku/python/lib/python3.8/site-packages/fastparquet/writer.py", line 935, in write
2021-12-31T14:17:00.091844+00:00 app[web.1]:     data = data.assign(**{name: pd.Categorical.from_codes(codes, cats)})
2021-12-31T14:17:00.091845+00:00 app[web.1]: TypeError: assign() keywords must be strings
df.to_parquet()
"""
                 close
2007-01-01 SPY  140.54
2007-01-08 SPY  143.24
2007-01-15 SPY  142.82
2007-01-22 SPY  142.13
2007-01-29 SPY  144.81
...                ...
2021-11-29 SPY  453.42
2021-12-06 SPY  470.74
"""
Environment:
- fastparquet: 0.7.2
- Python version: 3.8 (1.3.5 is presumably the pandas version)
- Operating System: Ubuntu 20.04
- Install method (conda, pip, source): pip, pypi
If it works locally but not remotely, could you please check the versions installed in both, particularly pandas? I don't know if it's possible for you, but it would be useful to know what name was when it caused the exception.
Thanks for the quick reply. name is the FrozenList of names of the DataFrame multi-index. It was None by default, but somehow that worked on my local machine. Setting the names explicitly solves the issue.
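For readers hitting the same error, the workaround can be sketched roughly as follows. The helper name_index_levels and the "level_" prefix are hypothetical, introduced only for illustration; the idea is simply to give every row-index level a string name before writing.

```python
import pandas as pd


def name_index_levels(df, prefix="level_"):
    """Return a copy of df whose row-index levels all have string names.

    Hypothetical helper: levels whose name is not a string (for example
    None) get a placeholder name built from `prefix` and their position.
    """
    names = [
        name if isinstance(name, str) else f"{prefix}{i}"
        for i, name in enumerate(df.index.names)
    ]
    out = df.copy()
    out.index = out.index.set_names(names)
    return out


# A small frame shaped like the one in the report: (datetime, str) index
# with no level names.
index = pd.MultiIndex.from_tuples(
    [(pd.Timestamp("2007-01-01"), "SPY"), (pd.Timestamp("2007-01-08"), "SPY")]
)
df = pd.DataFrame({"close": [140.54, 143.24]}, index=index)

named = name_index_levels(df)
print(list(df.index.names))
print(list(named.index.names))
```

The named frame can then be passed to named.to_parquet(path, engine="fastparquet"). In practice, meaningful names such as "date" and "symbol" are preferable to generated placeholders.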
I think @yohplala fixed the issue with None names of multi-index levels recently.
Hi,
What I solved for column multi-indexes in the recently merged PR is handling an empty string '' as a column name.
But I understand that here the trouble is with None being used for level names.
I confirm that I also proposed a fix for it in a dummy branch that was never merged. I did not port that fix to PR #729 because it would have been yet another item, I am not sure about possible side effects, and I did not want to spend too much time on it (the workaround is simple: it is enough to give the levels a name).
So basically, the trouble arises when you define your column multi-index without names for the levels.
# Notice that no 'names' parameter is provided.
# This is fully acceptable to pandas, but not to fastparquet.
import pandas as pd
cmidx = pd.MultiIndex.from_tuples([('a', '1'), ('b', '1')])
In fastparquet, an exception is then raised in util.get_column_metadata(), line 335.
if isinstance(name, tuple):
    name = str(name)
elif not isinstance(name, str):
    raise TypeError(
        'Column name must be a string. Got column {} of type {}'.format(
            name, type(name).__name__
        )
    )
I 'solved' it this way, with an additional branch to handle the None case.
if isinstance(name, tuple):
    name = str(name)
elif name is None:
    # Unnamed index: fall back to an empty string.
    name = ''
elif not isinstance(name, str):
    raise TypeError(
        'Column name must be a string. Got column {} of type {}'.format(
            name, type(name).__name__
        )
    )
Here, if I get it right, name is the name of the column index, not of a specific column.
When it is a column multi-index, its name is a tuple. Each value of the tuple is actually a level name, and it can be None.
With the fix above, the exception was no longer raised, and I could read the DataFrame back without problems.
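To make the branching easier to try in isolation, here is a minimal standalone sketch of the same logic. The function normalize_index_name is hypothetical and is not part of fastparquet; it only mirrors the three cases discussed above (tuple, None, anything that is not a string).

```python
def normalize_index_name(name):
    """Hypothetical standalone version of the name handling shown above."""
    if isinstance(name, tuple):
        # Multi-index case: the tuple of level names is stringified.
        return str(name)
    if name is None:
        # Proposed extra branch: an unnamed index becomes an empty string.
        return ''
    if not isinstance(name, str):
        raise TypeError(
            'Column name must be a string. Got column {} of type {}'.format(
                name, type(name).__name__
            )
        )
    return name


for candidate in (None, 'close', ('a', 'b')):
    print(repr(candidate), '->', repr(normalize_index_name(candidate)))
```

Any other type, such as an int, still raises the TypeError, so the extra branch only changes the behaviour for None.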
What I am uneasy about is whether name is used elsewhere (does setting it to '' have a side effect?).
At least, I should have:
- created a test case with an assert checking df_init.equals(df_recorded)
- run all the other test cases
But I have not delved into that yet, one thing at a time :) (I want to move on with PR #712 first :))
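A round-trip check along those lines could look roughly like the sketch below. The helper names make_unnamed_multiindex_frame and check_roundtrip are hypothetical, and whether the assertion passes depends on the fastparquet version and on the None handling discussed in this thread, so it should be read as a starting point rather than a finished test.

```python
import pandas as pd


def make_unnamed_multiindex_frame():
    """Build a small frame whose column multi-index has no level names."""
    columns = pd.MultiIndex.from_tuples([("a", "1"), ("b", "1")])
    return pd.DataFrame([[1, 2], [3, 4]], columns=columns)


def check_roundtrip(df_init, path):
    """Write df_init with fastparquet, read it back, and compare values."""
    # Imported lazily so the frame-building helper works without fastparquet.
    import fastparquet

    fastparquet.write(path, df_init)
    df_recorded = fastparquet.ParquetFile(path).to_pandas()
    assert df_init.equals(df_recorded)


df_init = make_unnamed_multiindex_frame()
print(list(df_init.columns.names))
```

Inside a pytest test, check_roundtrip(df_init, str(tmp_path / "unnamed.parquet")) would exercise the write and read paths, with tmp_path being pytest's temporary-directory fixture.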
I propose to keep the ticket open till we come back to this and solve it (I think we just need to confirm the fix is ok). Bests, and soon, happy new year!!!! :)
OK, leaving this open, but I have no plans to work on it in the near term. As you say, the workaround is simple.