Error when using delayed writing with one area but several writer formats

Open adybbroe opened this issue 2 months ago • 46 comments

Describe the bug

I get an error when trying to make AVHRR images using the delayed writing functionality (as used in trollflow2). See the example code below. The original error was discovered when running trollflow2 with a minimal product list; see the extract below:

  reader: avhrr_l1b_aapp
  mask_area: true
  delay_composites: true
  use_tmp_file: true
  radius_of_influence: 10000
  metadata_aliases:
    variant:
      GDS: global
      EARS: regional
      DR: direct_readout
  min_coverage: 25
  areas:
    euron1:
      areaname: euron1
      products:
        overview:
          min_sunlight_coverage: 10
          productname: overview
          formats:
            - format: tif
              writer: geotiff
              blockxsize: 512
              blockysize: 512
              tiled: true
            - format: png
              writer: simple_image
            - format: jpg
              writer: simple_image
              fill_value: 0
...

If I comment out the png and jpg and leave only the tiff, it works fine. If I comment out the tiff and leave the png/jpg, it also works fine!

To Reproduce

I managed to reproduce it with this minimal code below:

from satpy import Scene
from satpy.writers import compute_writer_results  # also lives in satpy.writers.core.compute in recent versions

# filenames: list of AVHRR AAPP level-1b input files
scene = Scene(filenames=filenames, reader='avhrr_l1b_aapp')
parameters = ['overview']
scene.load(parameters)

AREAID = 'euron1'
local_scn = scene.resample(AREAID)

start_time_txt = local_scn.start_time.strftime('%Y-%m-%d %H:%M')
timestamp = local_scn.start_time.strftime('%Y%m%d_%H%M')

obj1 = local_scn.save_dataset('overview', writer='geotiff',
                              blockxsize=512,
                              blockysize=512,
                              tiled=True,
                              compute=False,
                              filename='avhrr_ovw_%s_%s.tif' % (timestamp, AREAID))
obj2 = local_scn.save_dataset('overview', writer='simple_image', fill_value=0,
                              compute=False,
                              filename='avhrr_ovw_%s_%s.jpg' % (timestamp, AREAID))
obj3 = local_scn.save_dataset('overview', writer='simple_image',
                              compute=False,
                              filename='avhrr_ovw_%s_%s.png' % (timestamp, AREAID))

objs = [obj1, obj2, obj3]
compute_writer_results(objs)


Expected behavior

I would expect three images to be written; instead I get only two, the png and the tiff.

Actual results

Traceback (most recent call last):
  File "/home/myuser/test_satpy_save_images.py", line 73, in <module>
    compute_writer_results(objs)
    ~~~~~~~~~~~~~~~~~~~~~~^^^^^^
  File "/path/to/conda/env/lib/python3.13/site-packages/satpy/writers/core/compute.py", line 139, in compute_writer_results
    da.compute(delayeds)
    ~~~~~~~~~~^^^^^^^^^^
  File "/path/to/conda/env/lib/python3.13/site-packages/dask/base.py", line 681, in compute
    results = schedule(expr, keys, **kwargs)
  File "/path/to/conda/env/lib/python3.13/site-packages/pyresample/future/resamplers/nearest.py", line 62, in query_no_distance
    distance_array, index_array = kdtree.query(
                                  ~~~~~~~~~~~~^
        coords,
        ^^^^^^^
    ...<2 lines>...
        distance_upper_bound=radius,
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
        mask=mask)
        ^^^^^^^^^^
  File "pykdtree/kdtree.pyx", line 274, in pykdtree.kdtree.KDTree.query
ValueError: Mask must have the same size as data points

Environment Info:

  • OS: Linux (RHEL9)
  • Satpy Version: 0.57.0
  • Pyresample Version:
  • bokeh: 3.8.0
  • trollimage: 1.26.0
  • Readers and writers dependencies (when relevant): [run from satpy.utils import check_satpy; check_satpy()]

See requirements file attached!

requirements-conda.txt

adybbroe avatar Nov 09 '25 13:11 adybbroe

There is some randomness involved here! Sometimes the minimal script above actually runs without error, but most of the time it errors...

adybbroe avatar Nov 09 '25 14:11 adybbroe

Try upgrading trollimage and let me know if you get the same results.

djhoese avatar Nov 10 '25 00:11 djhoese

Same behaviour unfortunately with the latest trollimage (1.27.0) and the latest Satpy (0.59)! At the moment, though, that only happens with trollflow2 and the full product list. I am struggling to make a small test case that fails offline... so far I can't...

adybbroe avatar Nov 10 '25 08:11 adybbroe

Okay, right now I have this env, given by the requirements file attached below.

requirements-conda-avhrr-imagery-0.3.2.txt

And I can produce 6 images successfully with this (minimal) trollflow2 product list:

...
  min_coverage: 25
  areas:
    euron1:
      areaname: euron1
      products:
        overview:
          min_sunlight_coverage: 10
          productname: overview
          formats:
            - format: tif
              writer: geotiff
              blockxsize: 512
              blockysize: 512
              tiled: true
            - format: png
              writer: simple_image
            - format: jpg
              writer: simple_image
              fill_value: 0
        night_fog:
          productname: night_fog
          formats:
            - format: tif
              writer: geotiff
              blockxsize: 512
              blockysize: 512
              tiled: true
              fname_pattern: "{platform_name:s}_{start_time:%Y%m%d_%H%M}_{areaname:s}_night_fog.{format}"
            - format: png
              writer: simple_image
              fname_pattern: "{platform_name:s}_{start_time:%Y%m%d_%H%M}_{areaname:s}_nightfog.{format}"
            - format: jpg
              writer: simple_image
              fill_value: 0
              fname_pattern: "{platform_name:s}_{start_time:%Y%m%d_%H%M}_{areaname:s}_nightfog.{format}"

But it fails with the full product list, as attached below:

avhrr_product_list.yaml

adybbroe avatar Nov 10 '25 12:11 adybbroe

This product list failed at first, but running it again later it ran fine. When it succeeded, I ran the original full list again, just to confirm that that one still fails...

avhrr_product_list_fail_and_okay.yaml

adybbroe avatar Nov 10 '25 14:11 adybbroe

And it always fails with the same error (the pykdtree mask size error)?

My wild guess is that the various refactorings and removals of "dask delayed" functions have introduced a case where two separate calculations produce a dask task with the same name. When dask goes to compute and use the results of that task, it sometimes gets the expected result and sometimes gets the result for the other band's calculation.
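To illustrate the mechanism with a generic dask sketch (not taken from the reader code): map_blocks names are deterministic tokens of the function and its inputs, so two computations that happen to tokenize identically collapse into a single task in the graph, and one result gets reused for the other.

import dask.array as da
import numpy as np

x = da.ones((4, 4), chunks=2)
a = da.map_blocks(np.sin, x)
b = da.map_blocks(np.sin, x)
# the task name is a deterministic token of the function and its arguments,
# so these two "separate" computations share a single task in the graph
print(a.name == b.name)  # True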

djhoese avatar Nov 10 '25 14:11 djhoese

And it always fails with the same error (the pykdtree mask size error)?

My wild guess is that the various refactorings and removals of "dask delayed" functions have introduced a case where two separate calculations produce a dask task with the same name. When dask goes to compute and use the results of that task, it sometimes gets the expected result and sometimes gets the result for the other band's calculation.

Sounds plausible! Yes, so far I have seen only this error everytime!

adybbroe avatar Nov 10 '25 17:11 adybbroe

I see both a map_blocks and a delayed usage in https://github.com/pytroll/satpy/blob/main/satpy/readers/aapp_l1b.py. It may be worth printing out the delayed object's .name and the map_blocks array result's .name and seeing if they are ever equal to previous names. By that I mean you shouldn't compare the delayed name with the map_blocks name, but compare the delayed name for one band/geolocation result with the name for another band/geolocation array. Hopefully this makes sense...

djhoese avatar Nov 10 '25 18:11 djhoese

Ok, guessing this:

265d264
<             logger.debug(f"output_delayed name: {output_delayed.name}")
660c659
<     mapblocks_obj = da.map_blocks(_ir_calibrate_numpy,
---
>     return da.map_blocks(_ir_calibrate_numpy,
664,665c663
<     logger.debug(f"mapblocks_obj name: {mapblocks_obj.name}")
<     return mapblocks_obj
---
> 

?

adybbroe avatar Nov 11 '25 16:11 adybbroe

Yes and similarly for the delayed object before or after it is converted to a dask array (I think it has the same name before and after).

djhoese avatar Nov 11 '25 16:11 djhoese

I have:

    def _interpolate_arrays(self, *input_arrays, geolocation=False):
        lines = input_arrays[0].shape[0]
        try:
            interpolator = self._create_40km_interpolator(lines, *input_arrays, geolocation=geolocation)
        except ImportError:
            logger.warning("Could not interpolate, python-geotiepoints missing.")
            output_arrays = input_arrays
        else:
            output_delayed = delayed(interpolator.interpolate, nout=3)()
            logger.debug(f"output_delayed name: {output_delayed.name}")
            output_arrays = [da.from_delayed(out_array, (lines, 2048), in_array.dtype)
                             for in_array, out_array in zip(input_arrays, output_delayed)]
        return output_arrays

And

...
    bandcor_2 = radtempcnv[0, irchn, 1] / 1e5
    bandcor_3 = radtempcnv[0, irchn, 2] / 1e6
    mapblocks_obj = da.map_blocks(_ir_calibrate_numpy,
                                  rad, mask, cwnum, bandcor_2, bandcor_3,
                                  dtype=rad.dtype,
                                  meta=np.ndarray((), dtype=rad.dtype))
    logger.debug(f"mapblocks_obj name: {mapblocks_obj.name}")
    return mapblocks_obj

adybbroe avatar Nov 11 '25 21:11 adybbroe

Then I see this in the log: (the scene around 19:08 errored)

(base) [safusr.u@lxserv1096 satellit]$ grep "output_delayed" avhrr_imagery_processor.log
[2025-11-11 17:28:22,012 DEBUG    satpy.readers.aapp_l1b] output_delayed name: Delayed('getattr-842ca8a43ce6e8da930705a9c95d1eb9')
[2025-11-11 17:28:22,093 DEBUG    satpy.readers.aapp_l1b] output_delayed name: Delayed('getattr-ecb34356680160d7ba036aa06a01e06b')
[2025-11-11 18:24:19,554 DEBUG    satpy.readers.aapp_l1b] output_delayed name: Delayed('getattr-15c8df91e8dd7f42226de795131190c6')
[2025-11-11 18:24:19,564 DEBUG    satpy.readers.aapp_l1b] output_delayed name: Delayed('getattr-f056eae4065d95414fc243a0ee658af4')
[2025-11-11 19:08:45,932 DEBUG    satpy.readers.aapp_l1b] output_delayed name: Delayed('getattr-d0bcd2dddc829e169fdeba8727794ba0')
[2025-11-11 19:08:45,943 DEBUG    satpy.readers.aapp_l1b] output_delayed name: Delayed('getattr-443bce611f3cca78e06c0e364779bb53')
[2025-11-11 20:04:38,033 DEBUG    satpy.readers.aapp_l1b] output_delayed name: Delayed('getattr-ee96fd34dfc9e200da0baa8139f4c6d4')
[2025-11-11 20:04:38,445 DEBUG    satpy.readers.aapp_l1b] output_delayed name: Delayed('getattr-9c7fe50f72003b57c3c1358429f5970a')
[2025-11-11 20:48:55,036 DEBUG    satpy.readers.aapp_l1b] output_delayed name: Delayed('getattr-46add410c22ea14440f66def117263e7')
[2025-11-11 20:48:55,045 DEBUG    satpy.readers.aapp_l1b] output_delayed name: Delayed('getattr-b9d1b286f8069c7db11f14e020b62e9b')
[2025-11-11 21:43:44,433 DEBUG    satpy.readers.aapp_l1b] output_delayed name: Delayed('getattr-c5bb329e7183522ef131a3538c93dfec')
[2025-11-11 21:43:44,505 DEBUG    satpy.readers.aapp_l1b] output_delayed name: Delayed('getattr-51d8c76dab28a903b79383f79ca843eb')
(base) [safusr.u@lxserv1096 satellit]$ grep "mapblocks_obj name" avhrr_imagery_processor.log
[2025-11-11 17:28:22,237 DEBUG    satpy.readers.aapp_l1b] mapblocks_obj name: _ir_calibrate_numpy-1f3cc7fe62af2949f5f3c8464c1d3851
[2025-11-11 17:28:22,305 DEBUG    satpy.readers.aapp_l1b] mapblocks_obj name: _ir_calibrate_numpy-00e0c17662fd2145104c09c271a8873e
[2025-11-11 17:28:22,374 DEBUG    satpy.readers.aapp_l1b] mapblocks_obj name: _ir_calibrate_numpy-99f81273964352911616abbd2982e795
[2025-11-11 18:24:19,703 DEBUG    satpy.readers.aapp_l1b] mapblocks_obj name: _ir_calibrate_numpy-6e8611279f8e38b9e592fa34e5ff5670
[2025-11-11 18:24:19,772 DEBUG    satpy.readers.aapp_l1b] mapblocks_obj name: _ir_calibrate_numpy-85e1aed2268bc009c18a3d7d06699fe7
[2025-11-11 18:24:19,905 DEBUG    satpy.readers.aapp_l1b] mapblocks_obj name: _ir_calibrate_numpy-ad8e493f438c6ebd2577ce6dbdb8c4d2
[2025-11-11 19:08:46,028 DEBUG    satpy.readers.aapp_l1b] mapblocks_obj name: _ir_calibrate_numpy-822eae0e3f7f470fdb9f1f1ebbc7da8c
[2025-11-11 19:08:46,106 DEBUG    satpy.readers.aapp_l1b] mapblocks_obj name: _ir_calibrate_numpy-cdf026775f9886daf30531f726346545
[2025-11-11 19:08:46,264 DEBUG    satpy.readers.aapp_l1b] mapblocks_obj name: _ir_calibrate_numpy-040328b7613c07fc55a5d96f24e18065
[2025-11-11 20:04:38,113 DEBUG    satpy.readers.aapp_l1b] mapblocks_obj name: _ir_calibrate_numpy-ea4b4189603670fcc5db6357742e8bd3
[2025-11-11 20:04:38,278 DEBUG    satpy.readers.aapp_l1b] mapblocks_obj name: _ir_calibrate_numpy-1d495b4e8331190a4dd223a5f14bb9e4
[2025-11-11 20:04:38,356 DEBUG    satpy.readers.aapp_l1b] mapblocks_obj name: _ir_calibrate_numpy-06557651cb69983bffdbc323129da7ae
[2025-11-11 20:48:55,276 DEBUG    satpy.readers.aapp_l1b] mapblocks_obj name: _ir_calibrate_numpy-e287dc12881a6bb0381a88f1ddd12dca
[2025-11-11 20:48:55,352 DEBUG    satpy.readers.aapp_l1b] mapblocks_obj name: _ir_calibrate_numpy-d11141f198b03ac52ea69089f8ff0326
[2025-11-11 20:48:55,430 DEBUG    satpy.readers.aapp_l1b] mapblocks_obj name: _ir_calibrate_numpy-792622b719610e03fb3cb2abcd2b83fa
[2025-11-11 21:43:44,497 DEBUG    satpy.readers.aapp_l1b] mapblocks_obj name: _ir_calibrate_numpy-26e8e35346760bbc077c15916d4b2618
[2025-11-11 21:43:44,562 DEBUG    satpy.readers.aapp_l1b] mapblocks_obj name: _ir_calibrate_numpy-a751e8e453516979bbdd234ff911e6fa
[2025-11-11 21:43:44,737 DEBUG    satpy.readers.aapp_l1b] mapblocks_obj name: _ir_calibrate_numpy-bb59dd29e1fb980adf3697939ec859b8

I can't really see anything fishy here, right?

adybbroe avatar Nov 11 '25 21:11 adybbroe

And the execution where you got that output failed?

djhoese avatar Nov 11 '25 23:11 djhoese

Yes, the attempt to create the products using the above "full" product list failed for the scene around 19:08 UTC. All the other scenes were okay - all products created and saved. (Sorry for the slow/intermittent answers - in between meetings/travels - now on the train back home from Stockholm.)

adybbroe avatar Nov 12 '25 17:11 adybbroe

No problem. I'm going to try to finish some other stuff today but I'm hoping I can try this myself, maybe with a different reader and see what I get.

djhoese avatar Nov 12 '25 17:11 djhoese

Just for info (maybe not surprising?): I see that the "failed" files, with names like tmp1776461ob50qzai, actually look okay; they are tiff files and can be displayed nicely.

And the failure always seems to point at the "overview" RGB image:

%> grep "ERROR: " avhrr_imagery_processor.log | grep -v "Could not load dataset"
[ERROR: 2025-11-12 22:08:40 : trollflow2.launcher] Process crashed
[ERROR: 2025-11-12 22:08:40 : trollflow2.launcher] Process crashed
[ERROR: 2025-11-12 22:08:47 : trollflow2.launcher] Missing file: /san1/polar_out/direct_readout/imagery/Metop-C_20251112_2157_euron1_overview.tif
[ERROR: 2025-11-13 07:52:50 : trollflow2.launcher] Process crashed
[ERROR: 2025-11-13 07:52:50 : trollflow2.launcher] Process crashed
[ERROR: 2025-11-13 07:52:59 : trollflow2.launcher] Missing file: /san1/polar_out/direct_readout/imagery/Metop-B_20251113_0736_euron1_overview.tif
[ERROR: 2025-11-13 08:37:26 : trollflow2.launcher] Process crashed
[ERROR: 2025-11-13 08:37:26 : trollflow2.launcher] Process crashed
[ERROR: 2025-11-13 08:37:33 : trollflow2.launcher] Missing file: /san1/polar_out/direct_readout/imagery/Metop-C_20251113_0821_euron1_overview.tif
[ERROR: 2025-11-13 11:55:39 : trollflow2.launcher] Process crashed
[ERROR: 2025-11-13 11:55:39 : trollflow2.launcher] Process crashed
[ERROR: 2025-11-13 11:55:50 : trollflow2.launcher] Missing file: /san1/polar_out/direct_readout/imagery/Metop-C_20251113_1141_euron1_overview.tif
[ERROR: 2025-11-13 13:30:42 : trollflow2.launcher] Process crashed
[ERROR: 2025-11-13 13:30:42 : trollflow2.launcher] Process crashed
[ERROR: 2025-11-13 13:30:47 : trollflow2.launcher] Missing file: /san1/polar_out/direct_readout/imagery/Metop-C_20251113_1321_euron1_overview.tif

But I am pretty sure I could get it to fail without that RGB as well; I think I tested that. Will continue...

adybbroe avatar Nov 13 '25 16:11 adybbroe

Are you saying it is always overview that is the problem? This is AVHRR, so all of the bands are the same resolution, aren't they? So resampling isn't necessary to create the composites... @pnuu or @adybbroe, is reduce_data=False set in trollflow2 or does it use the default?
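(For reference, in plain satpy this can be forced explicitly when resampling; a sketch using the area from the reproduction script at the top of the issue:)

# reduce_data controls whether the source data is cropped to the target area
# before resampling; setting it explicitly removes any dependence on the
# trollflow2 defaults discussed below
local_scn = scene.resample('euron1', resampler='nearest', reduce_data=False)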

djhoese avatar Nov 13 '25 16:11 djhoese

It is errors like this that make me really wish I had put more information into the error messages (e.g. the error from pykdtree in your original post).

djhoese avatar Nov 13 '25 19:11 djhoese

@adybbroe do you have the ability and comfort to modify your pyresample installation in place to print/log more information if I can't reproduce this locally?

djhoese avatar Nov 13 '25 19:11 djhoese

@adybbroe do you have the ability and comfort to modify your pyresample installation in place to print/log more information if I can't reproduce this locally?

Yes, absolutely! It is a conda env created and deployed with Ansible, but we are on our development server, so I have full control and can change things in place and restart the process.

adybbroe avatar Nov 13 '25 19:11 adybbroe

Are you saying it is always overview that is the problem? This is AVHRR, so all of the bands are the same resolution, aren't they? So resampling isn't necessary to create the composites... @pnuu or @adybbroe, is reduce_data=False set in trollflow2 or does it use the default?

Yeah, all same resolution!

adybbroe avatar Nov 13 '25 19:11 adybbroe

OK, so I've started trying to reproduce this and look at a few possible suspects, but I'm getting very confused about how this specific error is produced for this specific reader and this specific set of products. The rest of this comment is mostly me thinking "out loud":

The error you are getting comes from the KDTree object (pykdtree) when it is queried. The query method of the KDTree is given a "mask" parameter, which is essentially the fill mask of the input data. The error, though, says this mask is not the same size as the input coordinates that were passed to the KDTree when it was created. Note that both of these arrays are filtered by "valid input coordinates", so they only include pixels whose coordinates are valid lon/lats (within -180/180 and -90/90).
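To make that concrete, a tiny standalone sketch of the size mismatch pykdtree is complaining about (the numbers here are made up; in the real failure both arrays come from the resampler):

import numpy as np
from pykdtree.kdtree import KDTree

data_pts = np.random.rand(100, 2)    # source coordinates the tree was built from
query_pts = np.random.rand(50, 2)    # target coordinates being queried

tree = KDTree(data_pts)
bad_mask = np.zeros(80, dtype=bool)  # wrong length: must be 100, one flag per data point
# raises ValueError: Mask must have the same size as data points
tree.query(query_pts, mask=bad_mask, distance_upper_bound=0.1)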

Additionally, this is AVHRR data, which should be the same resolution for all bands. If the data is split into chunks, then it could be that one chunk is getting mixed up with another, but I've never seen two dask chunks confused with each other. BUT your full product list also has delay_composites: True, which means generate=False gets passed to Scene.load, so only the pure band data should be resampled and the composites really should not be a problem here.

So, the variables that we may need to play with:

  1. Number of workers. Not something I'm particularly worried about or want to play with, but it may increase or decrease how often the error shows up.
  2. Chunk size. Since the bare bands are being resampled, I'd be curious whether this depends on the number of chunks the inputs are split into. A large swath would mean more chunks, but you've also said @adybbroe that you see the same inputs sometimes fail and sometimes pass. (The sketch after this list shows the usual knobs for playing with items 1 and 2.)
  3. Data going over the poles. Are failures only seen with data going over the poles?
  4. Is the geolocation different for 3a/3b?
  5. Your full product list has 2 areas. Maybe the number of areas increases the number of KDTree queries, increasing the chances for dask to mix some things up. However, there should still only be one KDTree for each input geolocation (SwathDefinition). It is unlikely that dask is mixing up the KDTree instances... I hope.
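For items 1 and 2, the usual knobs are generic dask settings (plus satpy's legacy chunk-size environment variable); a sketch, without having checked which of these the AAPP reader actually honors:

import dask

# item 1: limit how many workers the default threaded scheduler uses
dask.config.set(num_workers=2)

# item 2: shrink the chunk size dask picks for "auto"-chunked arrays
dask.config.set({"array.chunk-size": "16MiB"})

# older satpy readers instead take the chunk size (in pixels) from the
# PYTROLL_CHUNK_SIZE environment variable, set before importing satpy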

I need to come up with some things you can print out to give us more information.

djhoese avatar Nov 13 '25 20:11 djhoese

By default Trollflow2 has reduce_data=True.

pnuu avatar Nov 13 '25 20:11 pnuu

@adybbroe if you're comfortable doing it, could you add a log or print just above this line:

https://github.com/pytroll/pyresample/blob/1f4c7fbe3985be068b4130fac7898f71eb7b5987/pyresample/future/resamplers/nearest.py#L62

And print out valid_input_index.shape, mask.shape, and kdtree.n.
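Something along these lines, just above the kdtree.query() call (a sketch; the variable names come from that function, and if the module does not already define a logger you would need to create one):

import logging
logger = logging.getLogger(__name__)

# right before kdtree.query(...) in query_no_distance():
logger.debug("Before KDTree query in nearest: %s %s %s",
             valid_input_index.shape, mask.shape, kdtree.n)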

@pnuu Thanks. I'm realizing now that this is swath data, so reduce_data shouldn't play a part.

djhoese avatar Nov 13 '25 20:11 djhoese

And just for completeness, the default resampling values are

GLOBAL_RESAMPLER_DEFAULTS = {
    "radius_of_influence": None,
    "resampler": "nearest",
    "reduce_data": True,
    "cache_dir": None,
    "mask_area": False,
    "epsilon": 0.0,
}

pnuu avatar Nov 13 '25 20:11 pnuu

All the AVHRR AAPP L1b channels share the same navigation data.

pnuu avatar Nov 13 '25 20:11 pnuu

Yes, I will try to add this debug, @djhoese. Concerning your items above:

  1. Data from our own antenna in Norrköping only rarely covers the pole, and I think very rarely with AVHRR (it is more likely with VIIRS) - but I will keep my eyes open for that!
  2. 3a and 3b have the same geolocation.
  3. I have a feeling the chance of failing increases with the number of areas, but that is not backed by much testing...

adybbroe avatar Nov 13 '25 20:11 adybbroe

[2025-11-13 21:04:10,466 DEBUG    pyresample.future.resamplers.nearest] Before KDTree query in nearest: (4298, 2048) (8802304,) 8325120
[2025-11-13 21:04:10,641 ERROR    trollflow2.launcher] Process crashed

That is valid_input_index.shape, mask.shape, and kdtree.n

adybbroe avatar Nov 13 '25 21:11 adybbroe

But before that:

[2025-11-13 21:04:03,094 DEBUG    pyresample.future.resamplers.nearest] Before KDTree query in nearest: (4298, 2048) (8802304,) 8802304
[2025-11-13 21:04:03,302 DEBUG    pyresample.future.resamplers.nearest] Before KDTree query in nearest: (4298, 2048) (8802304,) 8802304
[2025-11-13 21:04:03,694 DEBUG    pyresample.future.resamplers.nearest] Before KDTree query in nearest: (4298, 2048) (8802304,) 8802304
[2025-11-13 21:04:05,027 DEBUG    pyresample.future.resamplers.nearest] Before KDTree query in nearest: (4298, 2048) (8802304,) 8802304
[2025-11-13 21:04:05,334 INFO     pyninjotiff.ninjotiff] Successfully created a NinJo tiff file: '/xxxxxxx/imagery/tmp1822051io0rqs_s'

adybbroe avatar Nov 13 '25 21:11 adybbroe

Awesome! Any chance you can get me these data files?

djhoese avatar Nov 13 '25 21:11 djhoese