frame_from_hyper_query Invalid dtype: "��" - Issue with Duplicate Column Names
Describe the bug
Querying more than one SUM() fails.
macOS Monterey 12.3.1 (21E258)
Is this expected? Thanks!
Thanks for the report. Can you share what version of pantab and the tableauhyperapi you are using?
Screenshots from the original report:
![Untitled - Jupyter Notebook](https://user-images.githubusercontent.com/1904314/169910043-c679bec4-2cc7-4e20-9949-d1c8cbc12b12.png)
![hyper_py — crate](https://user-images.githubusercontent.com/1904314/169910286-1510568d-6dfd-438f-a299-27e14613bdff.png)
Oops, sorry, here it goes:
pantab 2.1.1
tableauhyperapi 0.0.14401
pandas 1.4.2
Python 3.9.9
FYI it looks like this works if you alias your column names, i.e. if you change your query to:
SELECT sum(new_appt) AS a, sum(new_appt) AS b FROM sales
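To see why aliasing helps, here is a minimal sketch using the stdlib `sqlite3` module as a stand-in for Hyper (the `sales` table and `new_appt` column mirror the query above; the exact derived names differ between engines, but the collision and its fix are the same):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (new_appt INTEGER)")
conn.execute("INSERT INTO sales VALUES (1), (2)")

# Unaliased aggregates: both result columns get the same derived name.
cur = conn.execute("SELECT sum(new_appt), sum(new_appt) FROM sales")
dup_names = [d[0] for d in cur.description]

# Aliased aggregates: every result column has a unique, explicit name.
cur = conn.execute("SELECT sum(new_appt) AS a, sum(new_appt) AS b FROM sales")
uniq_names = [d[0] for d in cur.description]

print(dup_names)   # both entries share one name -> a name collision
print(uniq_names)  # ['a', 'b']
```

Any consumer that keys results by column name will see only one of the two unaliased columns.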
I don't think Hyper has native support for duplicated column names, which is likely the root cause of issues here. To illustrate, if you do something like
>>> pantab.frame_to_hyper(pd.DataFrame([[1, 1]], columns=list("aa")), "sales.hyper", table="sales")
File ~/miniconda3/envs/pantab-dev/lib/python3.9/site-packages/tableauhyperapi/impl/dllutil.py:110, in Error.check(p)
107 if p != ffi.NULL:
108 # this will free the error when it goes out of scope
109 errp = Error(p)
--> 110 raise errp.to_exception()
HyperException: column 'a' specified more than once
It might be a bug or corner case in the tableauhyperapi that you are allowed to create duplicate column names via an ad hoc query. @vogelsgesang might have some insights
Oh, I see. When I use an aggregation function like SUM or COUNT, the column name comes back as the function name, which is why summing two different columns produced that error.
Isn't there a way to name it like <AGGR_FN>_<COLUMN_NAME> or similar to give them unique names?
Anyway, I'm closing this issue because the problem is solved.
Unfortunately, this is somewhat expected behavior from the Hyper side.
The same happens if you run the following query, e.g., in Postgres:
SELECT sum(a), sum(b)
FROM (VALUES (1,2)) v(a,b)
Postgres also gives you two columns that have the same name, i.e. `sum`.
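Note that pandas itself is perfectly happy to hold duplicate column labels, so the result frame could in principle represent such a query; a minimal sketch:

```python
import pandas as pd

# pandas has no problem representing duplicate column labels
df = pd.DataFrame([[1, 2]], columns=["sum", "sum"])
print(df.columns.tolist())  # ['sum', 'sum']

# selecting by the duplicated label returns both columns
print(df["sum"].shape)  # (1, 2)
```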
Could it be that maybe pantab somehow gets confused by that, and expects column names to be unique?
Yea I think that's the current theory.
Yea there is still a bug here. It likely stems from this:
https://github.com/innobi/pantab/blob/f6a9fbb3f73cd51ada79e5d6d6aa114aea77dc98/pantab/_reader.py#L29
i.e. it is a mistake to be using a dict as a container. This should be replaceable with a list in the Python space to make this work. @rafbgarcia would you be interested in trying to contribute that change?
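To illustrate the suspected failure mode (a hypothetical sketch, not pantab's actual code): a dict keyed on column name silently drops duplicates, while a list of `(name, dtype)` pairs preserves them:

```python
# Two SUM() result columns coming back from the query, both named "sum"
columns = [("sum", "int64"), ("sum", "int64")]

as_dict = dict(columns)  # later duplicates overwrite earlier ones
as_list = list(columns)  # order and multiplicity are preserved

print(len(as_dict))  # 1 -- one column silently lost
print(len(as_list))  # 2 -- both columns survive
```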
@WillAyd I'd love to but I started coding in Python this week for a particular project and I'm struggling to understand what's the issue with that code 😄
If you would be willing to tell me overall what needs to be done with code examples I can try to do it.
The first step is to create a test that reproduces your failure. This should go in pantab/tests/test_reader.py. The test should be a function named test_reads_duplicate_column_names and the function body should do what you are doing now. You can copy an existing test in that module to take advantage of an already existing hyper file and tweak your query a bit.
After the test is in place try changing the code linked above to use a list as a container instead of a dict.
With pantab 4.0 we are now relying on pyarrow tables to convert to/from Hyper. Those also do not allow duplicate column names, so we would have to develop some kind of conflict-resolution strategy for duplicates like this to make reading possible.
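One possible conflict-resolution strategy (a sketch of the idea, not pantab's actual behavior) is to mangle repeated names with a numeric suffix, similar in spirit to what pandas' CSV reader does for duplicate headers:

```python
from collections import Counter

def dedupe_names(names):
    """Append a numeric suffix to repeated column names: sum, sum -> sum, sum_1.

    Naive sketch: does not guard against a pre-existing 'sum_1' colliding
    with a generated name.
    """
    seen = Counter()
    out = []
    for name in names:
        if seen[name]:
            out.append(f"{name}_{seen[name]}")
        else:
            out.append(name)
        seen[name] += 1
    return out

print(dedupe_names(["sum", "sum", "sum"]))  # ['sum', 'sum_1', 'sum_2']
```

Renaming up front would let the rest of the read path keep assuming unique names.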