pantab icon indicating copy to clipboard operation
pantab copied to clipboard

frame_from_hyper_query Invalid dtype: "��" - Issue with Duplicate Column Names

Open rafbgarcia opened this issue 3 years ago • 10 comments

Describe the bug

Querying more than one SUM() fails.

Untitled_-_Jupyter_Notebook hyper_py_—_crate

MacOS Monterey 12.3.1 (21E258)


Is this expected? Thanks!

rafbgarcia avatar May 23 '22 21:05 rafbgarcia

Thanks for the report. Can you share what version of pantab and the tableauhyperapi you are using?

Get Outlook for iOShttps://aka.ms/o0ukef


From: Rafael Garcia @.> Sent: Monday, May 23, 2022 2:42:23 PM To: innobi/pantab @.> Cc: Subscribed @.***> Subject: [innobi/pantab] frame_from_hyper_query Invalid dtype: "��" (Issue #161)

Describe the bug

Querying more than one SUM() fails.

[Untitled_-_Jupyter_Notebook]https://user-images.githubusercontent.com/1904314/169910043-c679bec4-2cc7-4e20-9949-d1c8cbc12b12.png

[hyper_py_—_crate]https://user-images.githubusercontent.com/1904314/169910286-1510568d-6dfd-438f-a299-27e14613bdff.png

MacOS Monterey 12.3.1 (21E258)

— Reply to this email directly, view it on GitHubhttps://github.com/innobi/pantab/issues/161, or unsubscribehttps://github.com/notifications/unsubscribe-auth/AAEU4UO47CT5U6RJWTTCGZLVLP3T7ANCNFSM5WXEVQAA. You are receiving this because you are subscribed to this thread.Message ID: @.***>

WillAyd avatar May 23 '22 23:05 WillAyd

Ops sorry, here it goes:

pantab 2.1.1
tableauhyperapi 0.0.14401
pandas 1.4.2
Python 3.9.9

rafbgarcia avatar May 23 '22 23:05 rafbgarcia

FYI it looks like this works if you alias your column names, i.e. if you change your query to:

SELECT sum(new_appt) AS a, sum(new_appt) AS b FROM sales

I don't think Hyper has native support for duplicated column names, which is likely the root cause of issues here. To illustrate, if you do something like

>>> pantab.frame_to_hyper(pd.DataFrame([[1, 1]], columns=list("aa")), "sales.hyper", table="sales")
File ~/miniconda3/envs/pantab-dev/lib/python3.9/site-packages/tableauhyperapi/impl/dllutil.py:110, in Error.check(p)
    107 if p != ffi.NULL:
    108     # this will free the error when it goes out of scope
    109     errp = Error(p)
--> 110     raise errp.to_exception()

HyperException: column 'a' specified more than once

It might be a bug or corner case in the tableauhyperapi that you are allowed to create duplicate column names via an ad hoc query. @vogelsgesang might have some insights

WillAyd avatar May 24 '22 03:05 WillAyd

Oh, I see. When I use an aggregation function like SUM or COUNT, the column name comes as the function name, that's why when I summed two different columns I got that error.

Isn't there a way to name it like <AGGR_FN>_<COLUMN_NAME> or similar to give them unique names?

rafbgarcia avatar May 24 '22 12:05 rafbgarcia

Anyway, I'm closing this issue because the problem is solved.

rafbgarcia avatar May 24 '22 12:05 rafbgarcia

unfortunately, this is somewhat expected behavior from the Hyper side.

The same happens if you run the following query, e.g., in Postgres:

SELECT sum(a), sum(b)
FROM (VALUES (1,2)) v(a,b)

Postgres also gives you two columns which have the same name, i.e. sum

Could it be that maybe pantab somehow gets confused by that, and expects column names to be unique?

vogelsgesang avatar May 24 '22 21:05 vogelsgesang

Yea I think that's the current theory.

rafbgarcia avatar May 24 '22 21:05 rafbgarcia

Yea there is still a bug here. It likely stems from this:

https://github.com/innobi/pantab/blob/f6a9fbb3f73cd51ada79e5d6d6aa114aea77dc98/pantab/_reader.py#L29

i.e. it is a mistake to be using a dict as a container. This should be replaceable with a list in the Python space to make this work. @rafbgarcia would you be interesting in trying to contribute that change?

WillAyd avatar May 25 '22 01:05 WillAyd

@WillAyd I'd love to but I started coding in Python this week for a particular project and I'm struggling to understand what's the issue with that code 😄

If you would be willing to tell me overall what needs to be done with code examples I can try to do it.

rafbgarcia avatar May 25 '22 10:05 rafbgarcia

The first step is to create a test that reproduces your failure. This should go in pantab/tests/test_reader.py. The test should be a function named test_reads_duplicate_column_names and the function body should do what you are doing now. You can copy an existing test in that module to take advantage of an already existing hyper file and tweak your query a bit.

After the test is in place try changing the code linked above to use a list as a container instead of a dict.

WillAyd avatar May 25 '22 16:05 WillAyd

With pantab 4.0 we are really starting to use pyarrow tables to convert to/from Hyper. Those also do not allow duplicate column names, so we would have to develop some kind of conflict resolution strategy for duplicates like this to make reading possible

WillAyd avatar Jan 19 '24 15:01 WillAyd