
Slow import times due to always importing optional packages

Open mattwthompson opened this issue 3 years ago • 3 comments

Hi, thanks to the developers and maintainers for creating and supporting this software.

I've noticed slow (> 1 second) import times and used Tuna to isolate the cause: the Pandas import.

$ python
Python 3.9.10 | packaged by conda-forge | (main, Feb  1 2022, 21:28:27)
[Clang 11.1.0 ] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys
>>> from pint import UnitRegistry
>>> 'pandas' in sys.modules
True
>>>
$ conda list | grep pandas
pandas                    1.4.0            py39h4d6be9b_0    conda-forge

Here is Tuna's output of a script containing only from pint import UnitRegistry:

[Tuna profile: nearly all of the import time is spent importing pandas]
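
For anyone who wants to reproduce this without Tuna, here is a minimal sketch that just times the import directly (not part of the original report):

import sys
import time

start = time.perf_counter()
from pint import UnitRegistry  # the import being measured (intentionally unused)
elapsed = time.perf_counter() - start

print(f"pint import took {elapsed:.2f} s")
print("pandas imported as a side effect:", "pandas" in sys.modules)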

The culprit seems to be the result of https://github.com/hgrecco/pint/pull/959, in which pandas is imported (if available) alongside some other optional packages, whether or not it's actually used. Note that I don't have pint-pandas installed, so I'm not trying to take advantage of that functionality.

I assume the thinking there is "if it's not needed, the user shouldn't have it installed," but Pandas is used by so many packages that it's hard to avoid it creeping into an environment. Most of my use cases, e.g. tracking small amounts of data or something as simple as providing a custom unit registry, don't use Pandas, so the ~1 second of import time is an unwanted side effect.

mattwthompson commented on Feb 08 '22

I assume the thinking there is "if it's not needed, the user shouldn't have it installed," [...]

That's not the intention with these imports. Instead, we need a list of types that Pint is not allowed to wrap; it's not about enabling optional features so much as ensuring that erroneous behavior cannot arise with whatever packages a user happens to have in their environment (e.g., we need to make sure that attempting to stick a pandas Series inside a pint Quantity always fails). Having a "deny list" rather than an "allow list" was a decision made early on, given Pint's relatively high place in the generally agreed-upon type casting hierarchy, although it will likely be revisited once consensus is reached on https://github.com/pydata/duck-array-discussion/issues/3.
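
Roughly, the current approach looks something like this (a simplified sketch for illustration, not pint's exact code in compat.py):

upcast_types = []

try:
    # Imported eagerly at "import pint" time, even if the user never touches pandas
    from pandas import Series
    upcast_types.append(Series)
except ImportError:
    pass

def is_upcast_type(obj):
    # Quantity creation/wrapping bails out for these types so that they can
    # wrap Pint objects instead of the other way around.
    return isinstance(obj, tuple(upcast_types))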

That being said, type checks against the actual array types still expose Pint to those other libraries' slow import times. This style of check has also caused circular import issues in the past (see https://github.com/pydata/xarray/pull/4751#discussion_r552100016), so regardless of whether we keep an "allow list" or a "deny list," I think we should move away from isinstance checks and see whether fully qualified class name strings would work. @keewis, do you have suggestions for this?

jthielen commented on Feb 08 '22

Not really, besides pushing pydata/duck-array-discussion#3 forward.

For now, the code for fully_qualified_name could be something like:

def fully_qualified_name(obj):
    """Return the dotted module path plus qualified name of obj's type."""
    t = type(obj)
    module = t.__module__
    name = t.__qualname__

    # built-in types (int, float, ...) don't need a module prefix
    if module is None or module == "builtins":
        return name

    return f"{module}.{name}"

and we'd define a tuple / set of strings to compare against. upcast_types = ("xarray.core.dataarray.DataArray", ...), for example?
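
For illustration, a minimal sketch of how the name-based check could look (the entries here are just examples):

upcast_type_names = {
    "pandas.core.series.Series",
    "xarray.core.dataarray.DataArray",
}

def is_upcast_type(obj):
    # Pure string comparison: no pandas/xarray import is ever triggered here.
    return fully_qualified_name(obj) in upcast_type_names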

keewis commented on Feb 08 '22

I like @keewis's proposal. One improvement would be to "delay" the import until the fully qualified name matches. The rationale would be as follows:

  1. we compare by fully qualified name,
  2. if it does not match, do nothing,
  3. if there is a match, we import the actual package (which should already be imported by the user, since we are actually handed an object of that type) and we double-check with isinstance.

Something like this (disregard the variable names):

from importlib import import_module

# value is the resolved type, a callable importer, or None until first match
upcast_types = {"fully.qualified.name": None}

def is_upcast_type_by_name(other):
    fqn = fully_qualified_name(other)
    try:
        importer = upcast_types[fqn]
    except KeyError:
        return False

    if isinstance(importer, type):
        # already resolved by a previous call
        cls = importer
    elif callable(importer):
        # custom importer for more complex cases
        cls = importer()
    else:
        # the package must already be importable: we are holding one of its objects
        module_name, class_name = fqn.rsplit(".", 1)
        cls = getattr(import_module(module_name), class_name)

    upcast_types[fqn] = cls
    # This is to check that we resolved the same thing we matched by name
    # and avoid weird problems. Maybe instead of returning False
    # we should raise an error here.
    return isinstance(other, cls)

def is_upcast_type(other):
    # fast path: the type was already resolved by a previous match
    if type(other) in upcast_types.values():
        return True
    return is_upcast_type_by_name(other)

and in compat.py

# Meaning: "we know of its existence and we can handle it"
# Alternatively, we can also allow a callable value that does a more complex import.
upcast_types["xarray.core.dataarray.DataArray"] = None

hgrecco commented on Mar 01 '22