satpy
satpy copied to clipboard
Importing satpy getting slower
As satpy becomes more complex, it is importing more and more libraries. This is becoming problematic for us as we're often running tens of thousands of very small python jobs that don't need many of satpy's features.
For example, according to python -X importtime -c "import satpy" it takes ~2.3 seconds to import the most recent version of satpy, of which ~0.3 seconds is importing rasterio. However, many of our scripts make no use of rasterio, so this import is unneeded.
Is there any way to alter the code structure so that imports are only made when needed? Could, for example, the rasterio imports be moved into the writer so they're only imported when save_dataset is called?
For reference, I ran some stats of how long importing satpy takes across three versions (the older two chosen at random, tbh): v0.18: 1.95 seconds v0.25: 2.18 seconds v0.36: 2.41 seconds
Thanks for bringing this up. Any idea what version(s) of xarray are used when you ran these tests?
Also, how are you timing/profiling this? If you can show me your code and method then I can make sure that any changes I make will show speed ups in your method of timing as well.
The command I used for profiling is listed above, python - x importtime
This is using the latest versions of all non satpy libraries, tested on a fresh conda env. Can find specific versions tonorrow.
Side note: maybe something we should add to the benchmarks?
Do the benchmarks import satpy to start the benchmarks?
The benchmarks setup (including import) is not counted in benchmark times. Does that answer your question?
My worry would be that it has to import Satpy to create/setup the benchmark and then when it actually runs the benchmark and does import satpy it is already cached in python's dict of modules.
@simonrp84 Can you help me understand this importtime output. It looks like xarray is importing trollimage. I assume this is a top-down list where a package is listed and then any imports in that package/module are indented and beneath it.
import time: 193 | 289 | xarray.util.print_versions
import time: 1515 | 341199 | xarray
import time: 218 | 218 | trollimage.version
import time: 18391 | 18609 | trollimage
import time: 180 | 180 | PIL._version
import time: 253 | 433 | PIL
...
import time: 1803 | 6288 | PIL.Image
import time: 176 | 176 | numexpr
import time: 319 | 495 | trollimage.image
import time: 188 | 188 | rasterio._loading
import time: 103 | 103 | rasterio._show_versions
import time: 23076 | 23076 | rasterio._version
...
import time: 1453 | 24156 | rasterio.crs
import time: 159 | 159 | rasterio.drivers
...
import time: 495 | 54820 | rasterio
import time: 3243 | 83885 | trollimage.xrimage
import time: 702 | 702 | trollsift.parser
import time: 198 | 198 | trollsift.version
import time: 26943 | 27842 | trollsift
import time: 659 | 659 | ftplib
To be honest my understanding of the output is limited. I assumed the same as you, so am not sure why it looks like xarray is importing trollimage.
Ah it's backwards. If I go to the bottom of the output satpy is there and everything it imports goes up. That's not confusing at all. On my machine is says 1.059 seconds.
Ok so some red flags I see out of the total 1059032 microseconds it takes to import Satpy.
Aux download and Pooch
import time: 366 | 20732 | requests
import time: 140 | 140 | xxhash
import time: 151 | 290 | pooch.hashes
import time: 155 | 155 | pooch.utils
import time: 188 | 188 | tqdm._monitor
import time: 91 | 91 | tqdm._tqdm_pandas
import time: 342 | 342 | tqdm.utils
import time: 537 | 879 | tqdm.std
import time: 75 | 75 | tqdm._dist_ver
import time: 97 | 171 | tqdm.version
import time: 1789 | 2838 | tqdm.cli
import time: 201 | 201 | tqdm.gui
import time: 195 | 3511 | tqdm
import time: 151 | 151 | paramiko
import time: 193 | 3854 | pooch.downloaders
import time: 420 | 26107 | pooch.core
import time: 29971 | 29971 | tarfile
import time: 219 | 30190 | pooch.processors
import time: 137 | 137 | pooch._version
import time: 261 | 56694 | pooch
import time: 419 | 57112 | satpy.aux_download
Auxiliary downloading loads pooch which is a majority of that import time. The major of pooch's import seems to be tarfile and requests. I feel like tarfile doesn't need to be imported every time.
Edit
I just ran this importime stuff on pooch by itself and it shows most of the time is on requests and that importing tarfile takes no time at all. I'm not sure why this is so inconsistent.
trollsift
import time: 702 | 702 | trollsift.parser
import time: 198 | 198 | trollsift.version
import time: 26943 | 27842 | trollsift
This is a small amount but it shows that trollsift itself and with no other sub-imports takes 27ms to import. That seems surprisingly slow.
trollimage
import time: 23076 | 23076 | rasterio._version
import time: 1175 | 1175 | rasterio._err
import time: 150 | 150 | backports_abc
import time: 136 | 136 | rasterio.control
import time: 153 | 153 | attr._compat
import time: 90 | 90 | attr._config
import time: 175 | 175 | attr.exceptions
import time: 104 | 278 | attr.setters
import time: 1935 | 2302 | attr._make
import time: 178 | 2633 | attr.converters
import time: 101 | 101 | attr.filters
import time: 3226 | 3226 | attr.validators
import time: 108 | 108 | attr._cmp
import time: 120 | 120 | attr._funcs
import time: 486 | 486 | attr._version_info
import time: 120 | 120 | attr._next_gen
import time: 242 | 7032 | attr
import time: 1094 | 8126 | rasterio.rpc
import time: 1197 | 1197 | rasterio._filepath
import time: 168 | 168 | backports_abc
import time: 148 | 148 | certifi.core
import time: 156 | 303 | certifi
import time: 1790 | 3456 | rasterio._env
import time: 386 | 386 | rasterio.errors
import time: 1043 | 1043 | rasterio._path
import time: 144 | 144 | boto3
import time: 270 | 1456 | rasterio.session
import time: 923 | 6220 | rasterio.env
import time: 468 | 6687 | rasterio.dtypes
import time: 190 | 190 | rasterio.coords
import time: 1127 | 1127 | rasterio.enums
import time: 178 | 178 | rasterio.profiles
import time: 475 | 475 | affine
import time: 1194 | 1194 | rasterio._transform
import time: 311 | 1979 | rasterio.transform
import time: 1247 | 1247 | rasterio.windows
import time: 1554 | 22543 | rasterio._base
import time: 162 | 162 | backports_abc
import time: 1453 | 24156 | rasterio.crs
import time: 159 | 159 | rasterio.drivers
import time: 248 | 248 | backports_abc
import time: 157 | 157 | rasterio.sample
import time: 186 | 186 | backports_abc
import time: 1343 | 1528 | rasterio._features
import time: 125 | 125 | backports_abc
import time: 1376 | 3028 | rasterio._warp
import time: 186 | 3214 | rasterio.vrt
import time: 2153 | 5770 | rasterio._io
import time: 877 | 6646 | rasterio.io
import time: 495 | 54820 | rasterio
import time: 3243 | 83885 | trollimage.xrimage
As you stated @simonrp84, the rasterio import is not needed here and is a majority of the importing of trollimage...which also probably doesn't need to be imported on satpy import.
xarray
import time: 1515 | 341199 | xarray
This is obviously pretty bad. I see a lot of red flags in xarray's imports alone. It is likely importing rasterio, cfgrib, and other backend drivers on import. I don't think this is necessary. It also imports dask and zarr and dask.distributed which all take a decent amount of time.
satpy.config
import time: 83491 | 143395 | pkg_resources
...
import time: 325 | 11642 | yaml
import time: 417 | 12900 | donfig.config_obj
import time: 8122 | 21249 | donfig
import time: 591 | 172658 | satpy._config
This shows me that I need to look into avoiding the pkg_resources import. This might be misleading, but I think the import of satpy._config triggers the reading and parsing of all discovered Satpy YAML config files. This is an unfortunate complication of the config system. I'm not sure if there is an easy way to do this without implementing some kind of caching which is risky when it comes to a dynamic configuration system.