satpy icon indicating copy to clipboard operation
satpy copied to clipboard

Importing satpy getting slower

Open simonrp84 opened this issue 3 years ago • 11 comments

As satpy becomes more complex, it is importing more and more libraries. This is becoming problematic for us as we're often running tens of thousands of very small python jobs that don't need many of satpy's features. For example, according to python -X importtime -c "import satpy" it takes ~2.3 seconds to import the most recent version of satpy, of which ~0.3 seconds is importing rasterio. However, many of our scripts make no use of rasterio, so this import is unneeded.

Is there any way to alter the code structure so that imports are only made when needed? Could, for example, the rasterio imports be moved into the writer so they're only imported when save_dataset is called?

For reference, I ran some stats of how long importing satpy takes across three versions (the older two chosen at random, tbh): v0.18: 1.95 seconds v0.25: 2.18 seconds v0.36: 2.41 seconds

simonrp84 avatar Aug 02 '22 09:08 simonrp84

Thanks for bringing this up. Any idea what version(s) of xarray are used when you ran these tests?

djhoese avatar Aug 02 '22 15:08 djhoese

Also, how are you timing/profiling this? If you can show me your code and method then I can make sure that any changes I make will show speed ups in your method of timing as well.

djhoese avatar Aug 02 '22 15:08 djhoese

The command I used for profiling is listed above, python - x importtime

This is using the latest versions of all non satpy libraries, tested on a fresh conda env. Can find specific versions tonorrow.

simonrp84 avatar Aug 03 '22 07:08 simonrp84

Side note: maybe something we should add to the benchmarks?

mraspaud avatar Aug 04 '22 08:08 mraspaud

Do the benchmarks import satpy to start the benchmarks?

djhoese avatar Aug 04 '22 13:08 djhoese

The benchmarks setup (including import) is not counted in benchmark times. Does that answer your question?

mraspaud avatar Aug 04 '22 13:08 mraspaud

My worry would be that it has to import Satpy to create/setup the benchmark and then when it actually runs the benchmark and does import satpy it is already cached in python's dict of modules.

djhoese avatar Aug 04 '22 14:08 djhoese

@simonrp84 Can you help me understand this importtime output. It looks like xarray is importing trollimage. I assume this is a top-down list where a package is listed and then any imports in that package/module are indented and beneath it.

import time:       193 |        289 |             xarray.util.print_versions
import time:      1515 |     341199 |           xarray
import time:       218 |        218 |               trollimage.version
import time:     18391 |      18609 |             trollimage
import time:       180 |        180 |               PIL._version
import time:       253 |        433 |             PIL
...
import time:      1803 |       6288 |             PIL.Image
import time:       176 |        176 |               numexpr
import time:       319 |        495 |             trollimage.image
import time:       188 |        188 |               rasterio._loading
import time:       103 |        103 |               rasterio._show_versions
import time:     23076 |      23076 |               rasterio._version
...
import time:      1453 |      24156 |               rasterio.crs
import time:       159 |        159 |               rasterio.drivers
...
import time:       495 |      54820 |             rasterio
import time:      3243 |      83885 |           trollimage.xrimage
import time:       702 |        702 |             trollsift.parser
import time:       198 |        198 |             trollsift.version
import time:     26943 |      27842 |           trollsift
import time:       659 |        659 |                 ftplib

djhoese avatar Aug 04 '22 14:08 djhoese

To be honest my understanding of the output is limited. I assumed the same as you, so am not sure why it looks like xarray is importing trollimage.

simonrp84 avatar Aug 04 '22 14:08 simonrp84

Ah it's backwards. If I go to the bottom of the output satpy is there and everything it imports goes up. That's not confusing at all. On my machine is says 1.059 seconds.

djhoese avatar Aug 04 '22 14:08 djhoese

Ok so some red flags I see out of the total 1059032 microseconds it takes to import Satpy.

Aux download and Pooch

import time:       366 |      20732 |                 requests
import time:       140 |        140 |                   xxhash
import time:       151 |        290 |                 pooch.hashes
import time:       155 |        155 |                 pooch.utils
import time:       188 |        188 |                     tqdm._monitor
import time:        91 |         91 |                     tqdm._tqdm_pandas
import time:       342 |        342 |                         tqdm.utils
import time:       537 |        879 |                       tqdm.std
import time:        75 |         75 |                         tqdm._dist_ver
import time:        97 |        171 |                       tqdm.version
import time:      1789 |       2838 |                     tqdm.cli
import time:       201 |        201 |                     tqdm.gui
import time:       195 |       3511 |                   tqdm
import time:       151 |        151 |                   paramiko
import time:       193 |       3854 |                 pooch.downloaders
import time:       420 |      26107 |               pooch.core
import time:     29971 |      29971 |                 tarfile
import time:       219 |      30190 |               pooch.processors
import time:       137 |        137 |               pooch._version
import time:       261 |      56694 |             pooch
import time:       419 |      57112 |           satpy.aux_download

Auxiliary downloading loads pooch which is a majority of that import time. The major of pooch's import seems to be tarfile and requests. I feel like tarfile doesn't need to be imported every time.

Edit

I just ran this importime stuff on pooch by itself and it shows most of the time is on requests and that importing tarfile takes no time at all. I'm not sure why this is so inconsistent.

trollsift

import time:       702 |        702 |             trollsift.parser
import time:       198 |        198 |             trollsift.version
import time:     26943 |      27842 |           trollsift

This is a small amount but it shows that trollsift itself and with no other sub-imports takes 27ms to import. That seems surprisingly slow.

trollimage

import time:     23076 |      23076 |               rasterio._version
import time:      1175 |       1175 |                   rasterio._err
import time:       150 |        150 |                   backports_abc
import time:       136 |        136 |                   rasterio.control
import time:       153 |        153 |                         attr._compat
import time:        90 |         90 |                           attr._config
import time:       175 |        175 |                             attr.exceptions
import time:       104 |        278 |                           attr.setters
import time:      1935 |       2302 |                         attr._make
import time:       178 |       2633 |                       attr.converters
import time:       101 |        101 |                       attr.filters
import time:      3226 |       3226 |                       attr.validators
import time:       108 |        108 |                       attr._cmp
import time:       120 |        120 |                       attr._funcs
import time:       486 |        486 |                       attr._version_info
import time:       120 |        120 |                       attr._next_gen
import time:       242 |       7032 |                     attr
import time:      1094 |       8126 |                   rasterio.rpc
import time:      1197 |       1197 |                         rasterio._filepath
import time:       168 |        168 |                         backports_abc
import time:       148 |        148 |                           certifi.core
import time:       156 |        303 |                         certifi
import time:      1790 |       3456 |                       rasterio._env
import time:       386 |        386 |                       rasterio.errors
import time:      1043 |       1043 |                         rasterio._path
import time:       144 |        144 |                         boto3
import time:       270 |       1456 |                       rasterio.session
import time:       923 |       6220 |                     rasterio.env
import time:       468 |       6687 |                   rasterio.dtypes
import time:       190 |        190 |                   rasterio.coords
import time:      1127 |       1127 |                   rasterio.enums
import time:       178 |        178 |                   rasterio.profiles
import time:       475 |        475 |                     affine
import time:      1194 |       1194 |                     rasterio._transform
import time:       311 |       1979 |                   rasterio.transform
import time:      1247 |       1247 |                   rasterio.windows
import time:      1554 |      22543 |                 rasterio._base
import time:       162 |        162 |                 backports_abc
import time:      1453 |      24156 |               rasterio.crs
import time:       159 |        159 |               rasterio.drivers
import time:       248 |        248 |                   backports_abc
import time:       157 |        157 |                   rasterio.sample
import time:       186 |        186 |                         backports_abc
import time:      1343 |       1528 |                       rasterio._features
import time:       125 |        125 |                       backports_abc
import time:      1376 |       3028 |                     rasterio._warp
import time:       186 |       3214 |                   rasterio.vrt
import time:      2153 |       5770 |                 rasterio._io
import time:       877 |       6646 |               rasterio.io
import time:       495 |      54820 |             rasterio
import time:      3243 |      83885 |           trollimage.xrimage

As you stated @simonrp84, the rasterio import is not needed here and is a majority of the importing of trollimage...which also probably doesn't need to be imported on satpy import.

xarray

import time:      1515 |     341199 |           xarray

This is obviously pretty bad. I see a lot of red flags in xarray's imports alone. It is likely importing rasterio, cfgrib, and other backend drivers on import. I don't think this is necessary. It also imports dask and zarr and dask.distributed which all take a decent amount of time.

satpy.config

import time:     83491 |     143395 |     pkg_resources
...
import time:       325 |      11642 |         yaml
import time:       417 |      12900 |       donfig.config_obj
import time:      8122 |      21249 |     donfig
import time:       591 |     172658 |   satpy._config

This shows me that I need to look into avoiding the pkg_resources import. This might be misleading, but I think the import of satpy._config triggers the reading and parsing of all discovered Satpy YAML config files. This is an unfortunate complication of the config system. I'm not sure if there is an easy way to do this without implementing some kind of caching which is risky when it comes to a dynamic configuration system.

djhoese avatar Aug 04 '22 14:08 djhoese