ibis
ibis copied to clipboard
fix(decimal): add decimal type inference
closes #4319
This PR addresses the type evolution for decimal types as specified here. As I am relatively new to the ibis development environment, I was not sure where/how to add a testcase for this, so I just used the following code:
import ibis
table = ibis.table(
schema=[("a", "decimal(4, 2)"), ("b", "decimal(5, 3)")], name="t"
)
res = table.projection((table.a + table.b).name('c'))
print(res.c.type())
res = table.projection((table.a * table.b).name('c'))
print(res.c.type())
Currently there are two main shortcomings of this PR that I see, both are hard for me to fix as I do not know ibis in-depth.
- It only handles cases where there are two args given. I cannot see a case where this is not a given as it only works on sub, mul and add, but feel free to point out if I am missing something.
- The maximum precision and maximum scale are currently not handled at all. This is because both seem to be specifiable in some way, something I don't know where it should/could happen in the ibis infrastructure and how to implement it properly. I am very open for suggestions.
Test Results
35 files 35 suites 1h 11m 59s :stopwatch: 8 889 tests 7 098 :heavy_check_mark: 1 791 :zzz: 0 :x: 32 556 runs 25 736 :heavy_check_mark: 6 820 :zzz: 0 :x:
Results for commit cc8519a4.
:recycle: This comment has been updated with latest results.
@webmiche Regarding maximum precision/scale, I think the best place for that is https://github.com/ibis-project/ibis/blob/master/ibis/expr/datatypes/core.py#L443 (the largest
property).
The existing logic is designed to support systems that allow arbitrary precision (I believe that may just be postgres at the moment).
What happens is that if a decimal type has no specified precision or scale then that is left alone, otherwise the defaults from earlier versions of ibis were left intact.
I said this elsewhere, but we're not tied to these particular choices (except that we'd ideally preserve support for arbitrary precision) since decimals haven't gotten any love for some time now.
@webmiche Regarding maximum precision/scale, I think the best place for that is https://github.com/ibis-project/ibis/blob/master/ibis/expr/datatypes/core.py#L443 (the
largest
property).The existing logic is designed to support systems that allow arbitrary precision (I believe that may just be postgres at the moment).
What happens is that if a decimal type has no specified precision or scale then that is left alone, otherwise the defaults from earlier versions of ibis were left intact.
I said this elsewhere, but we're not tied to these particular choices (except that we'd ideally preserve support for arbitrary precision) since decimals haven't gotten any love for some time now.
I have now implemented a version, where the max_prec and max_scale cannot be set explicitely, but they are computed within the function. I feel that having them part of each individual datatype class is not quite what IBM suggests in their docu. I understand it as that max_precision and max_scale are fields that you can set in a global sense, so all decimals must adhere to those maxima.
What is also confusing is that IBM only describes the maxima for the case where you have to operands (i.e. in terms of p and p') and not when you only have one single operand.
I feel that we can't actually get the necessary test coverage just yet as the two lines that are not covered are a problem for later, i.e. when the operations can be variadic, which AFAIK is not yet the case. I feel it still makes sense to keep these lines in, so I suggest to merge with this one CI job failing (obviously only when you guys have expected the code fully :grin:).
I looked into max precision a bit. As always with SQL, things are quite chaotic (even within different version of the same system):
- DB2 on i has a maximum precision of 63 digits: https://www.ibm.com/docs/en/i/7.5?topic=types-numbers
- DB2 on ZOS has a maximum precision of 31 digits: https://www.ibm.com/docs/en/db2-for-zos/13?topic=numbers-decimal-decimal-numeric
- DB2 further allows to configure modes for decimal arithmetic: https://www.ibm.com/docs/en/db2/11.5?topic=parameters-dec-arithmetic-decimal-arithmetic-mode
- (At least) SQL Server 2016 and later have a maximum precision of 38 digits: https://docs.microsoft.com/en-us/sql/t-sql/data-types/precision-scale-and-length-transact-sql?view=sql-server-2016 (but it says that "earlier versions" only supported up to 28 digits).
Looks quite involved to solve comprehensively...
I am not sure how ibis exactly interacts with the backends, but it seems to be the case that the exact type of a decimal operation does not matter, in that this part seems to have been broken, but all the backends still worked fine.
I think it might make sense to keep this part of type inference relatively simple as it is a bit of a corner case IMO and doesn't seem ot matter much. Or, implement all the different decimal flavors for each backend and choose the right one on the fly. Or, if you can interact with the chosen backend in this way, ask the backend what the type should be...
We can ignore the failing coverage check in this case.
Looks quite involved to solve comprehensively...
Indeed. This PR is a huge improvement over what was there before so I think we'll take this PR as is.
Or, implement all the different decimal flavors for each backend and choose the right one on the fly.
We currently don't have a canonical way of altering expression construction based on backend knowledge. The goal is to be able to construct valid expression cheaply and abstractly (without reference to a backend), which as y'all are pointing out is not truly possible with decimals. I don't think it's worth the effort to what effectively amounts to reimplementing type derivation logic for each backend that ibis supports.
Thinking a bit more broadly, I wonder if we shouldn't get rid of precision and scale entirely and just have a generic decimal
type. That's something we can tackle later, after this PR is merged.
Judging by the documentation of IBM and that I feel that even though we have a decimal(12, 5) or something, it probably still is encoded into a 32bit integer, I feel that having the exact precision might be more of an optimization opportunity than really a semantic difference (discover that something cannot overflow, so we can save a check). I feel that scale however is more important to have in ibis, as it defines how your data really looks. @ingomueller-net probably has better intuition/more context on that though.
So I feel like precision could probably be dropped, but scale really feels important to me.
@webmiche Thanks for the thoughts. I agree that keeping scale around will likely lead to a better ux than dropping it entirely.