Force Not Null Columns Option for Frames to Hyper
Is your feature request related to a problem? Please describe. The new Tableau REST API feature for publishing multi-table hyper files as a data source requires primary keys to be set on fields that are not nullable. The current pandas to hyper type mappings doesn't allow user intervention in nullability of columns in the generated hyper tables.
Describe the solution you'd like
I think an optional parameter on frame_to_hyper that allows a list of columns to be made not null is the cleanest way to integrate with current functionality.
frame_to_hyper(your_data_frame, "path/to/file", forced_notnull_columns=["key", "key_2"])
Describe alternatives you've considered
Profiling the dataframe for NaN values and setting nullability accordingly is technically possible but would be a large performance overhead and could cause problems with future appended data sets if the data in the initial frame doesn't reflect the possible future nullability of those fields.
Additional context REST API feature for publishing multi-table extracts when hyper foreign key relationships are defined is coming in 2021.4
We might need to scope this a bit further. So for numeric types the nullable variants are IntXX whereas the non null variants are intXX.
I'm guessing this is in reference to more ambiguous types like object / string or float types?
@WillAyd Yes, string was the specific use case I ran into. Looping in @vogelsgesang because he and I spoke about this a bit over Slack. My initial workaround was to try to simply add a ALTER TABLE {table} ALTER COLUMN {Name('id')} SET NOT NULL; hyper command, but this syntax doesn't currently exist in Hyper SQL dialect.
Yea this is pretty tricky. I'm not sure of the cleanest way to handle this from an API perspective.
For one, most of the target types are inferred from the pandas dtype. Specifying a keyword to override that is certainly an option, but can be a bit of a mess when dealing with very wide data frames.
PyArrow has a cleaner mapping of types to hyper. I've mostly got a writer done in hyperarrow, so a second option would be for users to go pandas -> pyarrow -> hyper in some fashion, though I'm not sure how much better or worse this would be than overriding the type map.
Specifying a schema is also an option, but same problem - it's a bit of a pain with a wide data frame.
One last consideration is the new pyarrow[string] type introduced in pandas 1.3. I've been a little out of the development loop over the past year, but I think this might allow us to use a natively baked-in pandas string dtype that has its own knowledge of whether or not it contains NULL values. If so, pushing users towards that might be the best approach
https://pandas.pydata.org/pandas-docs/stable/whatsnew/v1.3.0.html#pyarrow-backed-string-data-type
Any chance you've used that type yet?
I haven't used pyarrow[string] type. That looks really cool but not something I'm familiar with yet. Converting to arrow and setting nullability on the pyarrow schema doesn't seem like too much to ask. It isn't that different from my initial attempt at a work around of altering the hyper table after creation.
I'll see what I can find in the next few days. I would hope that something like df["pyarrow_str_col"].nullable might be inspectable, which could help.
That said, maybe a schema_overrides={...} keyword in the to_hyper call would always be useful. If that's something you want to submit would definitely take a PR for it
It seems like both could have their place. I'll dive a little deeper and see what I come up with.
As much as I'd like to, I'm unable to use pantab for exactly this reason: my tables have str columns as primary keys, but at things currently stand I cannot set them to be NOT NULL when on creation. Also, the hyper flavour of SQL does not support changing column constraints via ALTER.
@HeiziND I would love any kind of community PR to move this forward. Alternately you might be interested in the hyperarrow project, which is similar in scope to this but uses Arrow as a backend.
Arrow has a much cleaner type system with nullability control than pandas / numpy
http://hyperarrow.readthedocs.io/