
Force Not Null Columns Option for Frames to Hyper

Open jharris126 opened this issue 4 years ago • 8 comments

Is your feature request related to a problem? Please describe. The new Tableau REST API feature for publishing multi-table hyper files as a data source requires primary keys to be set on fields that are not nullable. The current pandas-to-hyper type mappings don't allow user intervention in the nullability of columns in the generated hyper tables.

Describe the solution you'd like I think an optional parameter on frame_to_hyper that allows a list of columns to be made not null is the cleanest way to integrate with current functionality.

```python
frame_to_hyper(your_data_frame, "path/to/file", forced_notnull_columns=["key", "key_2"])
```

Describe alternatives you've considered Profiling the dataframe for NaN values and setting nullability accordingly is technically possible, but it would add a large performance overhead, and it could cause problems with future appended data sets if the data in the initial frame doesn't reflect the possible future nullability of those fields.
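That profiling alternative could be sketched roughly like this (a hypothetical helper, assuming plain pandas; the full-column scans are what make it expensive on large frames):

```python
import pandas as pd

def infer_not_null_columns(df: pd.DataFrame) -> list[str]:
    # Hypothetical helper: treat a column as NOT NULL only if the
    # current frame happens to contain no missing values in it.
    # This is exactly the fragile inference described above -- a
    # future append containing NULLs would violate the constraint.
    return [col for col in df.columns if df[col].notna().all()]

df = pd.DataFrame({"key": ["a", "b"], "value": [1.0, None]})
print(infer_not_null_columns(df))  # ['key']
```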

Additional context REST API feature for publishing multi-table extracts when hyper foreign key relationships are defined is coming in 2021.4

jharris126 avatar Nov 09 '21 18:11 jharris126

We might need to scope this a bit further. For numeric types, the nullable variants are IntXX, whereas the non-null variants are intXX.

I'm guessing this is in reference to more ambiguous types like object / string or float types?

WillAyd avatar Nov 09 '21 18:11 WillAyd

@WillAyd Yes, string was the specific use case I ran into. Looping in @vogelsgesang because he and I spoke about this a bit over Slack. My initial workaround was to simply add an `ALTER TABLE {table} ALTER COLUMN {Name('id')} SET NOT NULL;` hyper command, but this syntax doesn't currently exist in the Hyper SQL dialect.

jharris126 avatar Nov 09 '21 18:11 jharris126

Yea this is pretty tricky. I'm not sure of the cleanest way to handle this from an API perspective.

For one, most of the target types are inferred from the pandas dtype. Specifying a keyword to override that is certainly an option, but can be a bit of a mess when dealing with very wide data frames.

PyArrow has a cleaner mapping of types to hyper. I've mostly got a writer done in hyperarrow, so a second option would be for users to go pandas -> pyarrow -> hyper in some fashion, though I'm not sure how much better or worse this would be than overriding the type map.

Specifying a schema is also an option, but same problem - it's a bit of a pain with a wide data frame.

One last consideration is the new string[pyarrow] dtype introduced in pandas 1.3. I've been a little out of the development loop over the past year, but I think this might allow us to use a natively baked-in pandas string dtype that has its own knowledge of whether or not it contains NULL values. If so, pushing users towards that might be the best approach.

https://pandas.pydata.org/pandas-docs/stable/whatsnew/v1.3.0.html#pyarrow-backed-string-data-type

Any chance you've used that type yet?

WillAyd avatar Nov 09 '21 18:11 WillAyd

I haven't used the string[pyarrow] type. That looks really cool, but it's not something I'm familiar with yet. Converting to arrow and setting nullability on the pyarrow schema doesn't seem like too much to ask. It isn't that different from my initial workaround attempt of altering the hyper table after creation.

jharris126 avatar Nov 09 '21 18:11 jharris126

I'll see what I can find in the next few days. I would hope that something like df["pyarrow_str_col"].nullable might be inspectable, which could help.

That said, maybe a schema_overrides={...} keyword in the to_hyper call would always be useful. If that's something you want to submit, I would definitely take a PR for it.

WillAyd avatar Nov 09 '21 19:11 WillAyd

It seems like both could have their place. I'll dive a little deeper and see what I come up with.

jharris126 avatar Nov 09 '21 20:11 jharris126

As much as I'd like to, I'm unable to use pantab for exactly this reason: my tables have str columns as primary keys, but as things currently stand I cannot set them to NOT NULL on creation. Also, the hyper flavour of SQL does not support changing column constraints via ALTER.

HeiziND avatar Apr 19 '22 01:04 HeiziND

@HeiziND I would love any kind of community PR to move this forward. Alternatively, you might be interested in the hyperarrow project, which is similar in scope to this but uses Arrow as a backend.

Arrow has a much cleaner type system with nullability control than pandas / numpy.

http://hyperarrow.readthedocs.io/

WillAyd avatar Apr 19 '22 02:04 WillAyd