
Change _assert_columns_equal exception message, add a function to return dtypes of existing hyper

Open SpencerRP opened this issue 4 years ago • 3 comments

I get frustrated when I run into the "TypeError: Mismatched column definitions" exception because it's hard to tell which columns did not match. The message built from lines 75 to 82 in _writer.py returns all of the column definitions without pointing out the problematic pairs.

While I could avoid this by standardizing my input, I don't have time to put together a dtype mapping for the 40 or so columns across my 15+ recurring extracts. It would be convenient if I could retrieve the dtypes of the Hyper file I'll be appending to directly, rather than reading the whole file into a dataframe and taking the dtypes from that.

For these issues could we do two things:

  1. Change _assert_columns_equal in _writer.py to append mismatched column pairs to a list instead of raising on the first mismatch, and raise the TypeError only if the list is non-empty. The error message should include the mismatched pairs and could optionally still include all of the column definitions.
  2. Copy or break out the dtype-retrieving section that makes up most of _read_table into a public function called dtypes_from_hyper or the like.
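The first change could be sketched roughly as below. The function name and the `Column` stand-in are hypothetical, not pantab's actual internals; the point is collecting every mismatched pair before raising once:

```python
from collections import namedtuple
from itertools import zip_longest

# Stand-in for the Hyper API's column objects.
Column = namedtuple("Column", ["name", "type"])

def assert_columns_equal(df_cols, hyper_cols):
    # Collect every mismatched (dataframe, hyper) pair instead of
    # raising on the first one; zip_longest pads the shorter side
    # with None so unequal column counts also surface as mismatches.
    mismatches = [
        (a, b)
        for a, b in zip_longest(df_cols, hyper_cols)
        if a is None or b is None or a != b
    ]
    if mismatches:
        pairs = "; ".join(f"{a} != {b}" for a, b in mismatches)
        raise TypeError(f"Mismatched column definitions: {pairs}")
```

With this shape, the error message names exactly the problematic pairs rather than dumping every column definition.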

Unrelated to these issues, there are three code changes I'd like to suggest:

  1. Could the sections that turn a string into a connection be factored out of all four frame_x_hyper functions? The logic looks identical within each to/from pair and similar across the pairs. Doing so would also avoid repeating the "for schema..." chunk that makes up lines 123-127 and 134-140 of frames_from_hyper in _reader.py.
  2. Since string inputs for table are handled silently in both frame_from_hyper and frame_to_hyper (inside _read_table and _insert_frame respectively), could the suggested type for table be changed to table: Union[str, TableType]?
  3. Is DummyColumn in _assert_columns_equal supposed to match with any column configuration, allowing additional columns to be added to the table, or is it implicitly supposed to break if the column lengths don't match? If the latter, could we make it an explicit comparison and provide a relevant error message?

I can do a pull request for all of this. Let me know which changes you're alright with and any caveats I should be aware of.

SpencerRP avatar Oct 28 '21 19:10 SpencerRP

Cool, thanks for all of the feedback. I think making a public function that gives information about an existing table is good. I'm not sure whether that's called dtypes_from_hyper or whether we make a class that manages it; the latter would be more useful long term, but the former is probably fine for now. I also don't think this precludes what you suggest in point 1.
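For the simpler function-shaped version, a minimal sketch could look like the following. It assumes the Hyper API's `connection.catalog.get_table_definition(table)`, which returns a table definition whose columns expose `.name` and `.type`; everything else here is illustrative:

```python
def dtypes_from_hyper(connection, table):
    # Ask the Hyper catalog for the table's definition and map each
    # column name to its declared type, without reading any rows.
    definition = connection.catalog.get_table_definition(table)
    return {str(col.name): str(col.type) for col in definition.columns}
```

Returning plain strings keeps the result easy to compare against a dataframe's dtypes, though the real implementation might prefer returning the Hyper type objects themselves.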

For your remaining points:

  1. Sure - I'd definitely take any refactor of this. Maybe a wrapper function?
  2. TableType is already a Union type which contains a string (see top of the module), so there wouldn't be much benefit to wrapping that with another Union
  3. That class is just supposed to prevent AttributeErrors when the two sides have unequal numbers of columns, since the comparison inside the loop accesses attributes on both. Ideally the Hyper API would implement __eq__ on the Column class instead of us doing this here; that code was written a long time ago, so maybe that exists now.
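One shape the wrapper from point 1 could take is a small context manager that accepts either a path or an existing connection. This is only a sketch: `make_connection` is an injected placeholder for the real HyperProcess/Connection setup in tableauhyperapi, so the sketch stays library-independent:

```python
from contextlib import contextmanager

@contextmanager
def ensure_connection(source, make_connection):
    # A string is treated as a path: open a fresh connection and make
    # sure it is closed afterwards. Anything else is assumed to be an
    # already-open connection and is passed through untouched.
    if isinstance(source, str):
        conn = make_connection(source)
        try:
            yield conn
        finally:
            conn.close()
    else:
        yield source
```

Each frame_x_hyper function could then wrap its body in `with ensure_connection(...)` instead of duplicating the string-handling branch four times.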

WillAyd avatar Oct 28 '21 20:10 WillAyd

To give you an update, I have made a class called HyperFile in _hyper_utils. It has a context-manager method wrapping the connection and a static dtypes method. There are some instance methods and attributes I'm still figuring out.

Also, I'd like to suggest three additional features:

  • Column filtering functionality like usecols in pd.read_csv.
  • Dataframe generator functionality like chunksize/iterator in pd.read_csv.
  • Including TableName in pantab (adding it to all in init) and/or handling tuples like ("Schema", "Extract") as a table argument, to be converted to TableName. This is entirely me being lazy. I do a lot of one-off scripts & having to separately import the api when I want to specify the schema is somewhat irritating.
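The tuple-handling part of the third suggestion could be a tiny normalization helper. This is hypothetical; `table_name_cls` is injected here so the sketch doesn't depend on tableauhyperapi, but in pantab it would be the real TableName class:

```python
def normalize_table(table, table_name_cls):
    # ("Schema", "Extract") -> TableName("Schema", "Extract");
    # a bare string -> TableName(string); anything else is assumed
    # to already be a Name/TableName and is returned unchanged.
    if isinstance(table, tuple):
        return table_name_cls(*table)
    if isinstance(table, str):
        return table_name_cls(table)
    return table
```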

The usecols feature is obviously something that could be done with a query, but it's both useful and familiar to pandas users. It could be implemented by adding a usecols argument (a list of column names for frame_from_hyper, a dict of such lists for frames_from_hyper) plus a relatively small change to _read_table:

    if usecols is None:
        query = f"SELECT * FROM {table}"
    else:
        cols = ", ".join(str(tab_api.Name(col)) for col in usecols)
        query = f"SELECT {cols} FROM {table}"

The generator/chunksize feature would be more involved, from what I understand of the C code. On the one hand, the code in reader.c already handles the data in chunks, processing the rows in each chunk and appending the result to a list. On the other hand, the examples of C iterators on Stack Exchange look verbose and annoying to implement, and I'm not sure how to consume a C iterator from the Python side. Is it alright if I open a ticket for this so someone better at C can work on it?
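As an interim, pure-Python alternative to a C-side iterator, chunking could be approximated by paging through the table with LIMIT/OFFSET queries. This is only a sketch: `read_query` stands in for whatever call turns a query into a dataframe, and without an ORDER BY clause the row order across pages is not guaranteed by SQL semantics:

```python
def frame_chunks(read_query, table, chunksize):
    # Yield successive chunks of up to `chunksize` rows until a
    # query comes back empty.
    offset = 0
    while True:
        chunk = read_query(
            f"SELECT * FROM {table} LIMIT {chunksize} OFFSET {offset}"
        )
        if not len(chunk):
            return
        yield chunk
        offset += chunksize
```

This mirrors the chunksize/iterator ergonomics of pd.read_csv without touching reader.c, at the cost of issuing one query per chunk.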

Let me know which of these features are alright & if I should make a separate issue for the generator. I'll probably push the changes by next Friday.

SpencerRP avatar Dec 03 '21 23:12 SpencerRP

Thanks for the feedback. Yea I would just open dedicated issues for each suggestion - usually easier to discuss and track that way

WillAyd avatar Dec 04 '21 18:12 WillAyd

pantab 4.0 should solve this - type mismatches report the index of the problem

WillAyd avatar Jan 28 '24 01:01 WillAyd