Scene creation fails for a large number of files due to a "maximum recursion depth exceeded" error
Describe the bug
The creation of a Scene for OLCI data fails once the number of files provided via the filenames argument exceeds 3000. The error thrown is RecursionError: maximum recursion depth exceeded. This suggests that the files are read in by a recursive approach internally rather than sequentially. This seems an unnecessary limitation on the number of files that can be read, because a recursive solution could easily be avoided.
To Reproduce
from datetime import datetime

from satpy import Scene, find_files_and_readers

files = find_files_and_readers(sensor='olci',
                               start_time=datetime(2025, 11, 11, 0, 10),
                               end_time=datetime(2025, 11, 11, 20, 0),
                               base_dir='/path/to/data',
                               reader='olci_l1b')
print(len(files['olci_l1b']))
scn = Scene(filenames=files)
Environment Info:
- OS: Linux
- Satpy Version: 0.59.0
This is...interesting. That is a lot of files. So I have a couple of points and a couple of questions:
- Normally a Satpy Scene can only handle one orbit of data, or one time step in the case of geostationary data. For polar-orbiting (swath-based) data, resampling multiple orbits will not behave as expected: the resampling algorithms do not take time into account and pick pixels based on location only, which produces awkward output when multiple orbits are blended together.
- I'm surprised by this failing from recursion. Do you have the end of the traceback so we can try to track down and possibly fix this unexpected recursion?
- The default number of files that can be opened by one process on Linux is 1024 (ulimit -n), so even if there weren't a recursion issue and the Satpy resampling handled it properly, you wouldn't be able to open that many files without modifying your system's settings to allow for that many (see the sketch after this list for checking the limit from Python).
- What is your use case? What are you trying to accomplish? Maybe we can suggest a different approach?
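To illustrate the per-process limit mentioned above, here is a minimal sketch for checking it from Python; it uses only the standard-library resource module (Unix only) and is not part of Satpy.

import resource

# Soft and hard limits on open file descriptors for the current process;
# the soft limit is what "ulimit -n" reports (commonly 1024 on Linux).
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f"open-file limit: soft={soft}, hard={hard}")

# The soft limit can usually be raised up to the hard limit without root:
# resource.setrlimit(resource.RLIMIT_NOFILE, (min(4096, hard), hard))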
Thank you for the quick reply. Here are my answers to your points:
- It is good to know that Satpy blends the data from different orbits; I was already wondering what happens in the polar regions when multiple orbits overlap.
- Here is the end of the traceback:
File ~/miniforge3/lib/python3.12/site-packages/satpy/scene.py:155, in Scene.__init__(self, filenames, reader, filter_parameters, reader_kwargs)
152 if filenames:
153 filenames = convert_remote_files_to_fsspec(filenames, storage_options)
--> 155 self._readers = self._create_reader_instances(filenames=filenames,
156 reader=reader,
157 reader_kwargs=cleaned_reader_kwargs)
158 self._datasets = DatasetDict()
159 self._wishlist = set()
File ~/miniforge3/lib/python3.12/site-packages/satpy/scene.py:176, in Scene._create_reader_instances(self, filenames, reader, reader_kwargs)
171 def _create_reader_instances(self,
172 filenames=None,
173 reader=None,
174 reader_kwargs=None):
175 """Find readers and return their instances."""
--> 176 return load_readers(filenames=filenames,
177 reader=reader,
178 reader_kwargs=reader_kwargs)
File ~/miniforge3/lib/python3.12/site-packages/satpy/readers/core/loading.py:65, in load_readers(filenames, reader, reader_kwargs)
63 loadables = reader_instance.select_files_from_pathnames(readers_files)
64 if loadables:
---> 65 reader_instance.create_storage_items(
66 loadables,
67 fh_kwargs=reader_kwargs_without_filter[None if reader is None else reader[idx]])
68 reader_instances[reader_instance.name] = reader_instance
69 remaining_filenames -= set(loadables)
File ~/miniforge3/lib/python3.12/site-packages/satpy/readers/core/yaml_reader.py:618, in FileYAMLReader.create_storage_items(self, files, **kwargs)
616 def create_storage_items(self, files, **kwargs):
617 """Create the storage items."""
--> 618 return self.create_filehandlers(files, **kwargs)
File ~/miniforge3/lib/python3.12/site-packages/satpy/readers/core/yaml_reader.py:643, in FileYAMLReader.create_filehandlers(self, filenames, fh_kwargs)
636 self.file_handlers[filetype] = sorted(
637 self.file_handlers.get(filetype, []) + filehandlers,
638 key=lambda fhd: (fhd.start_time, fhd.filename))
640 # Update dataset IDs with IDs determined dynamically from the file
641 # and/or update any missing metadata that only the file knows.
642 # Check if the dataset ID is loadable from that file.
--> 643 self.update_ds_ids_from_file_handlers()
644 return created_fhs
File ~/miniforge3/lib/python3.12/site-packages/satpy/readers/core/yaml_reader.py:686, in FileYAMLReader.update_ds_ids_from_file_handlers(self)
684 avail_datasets = self._file_handlers_available_datasets()
685 new_ids = {}
--> 686 for is_avail, ds_info in avail_datasets:
687 # especially from the yaml config
688 coordinates = ds_info.get("coordinates")
689 if isinstance(coordinates, list):
690 # xarray doesn't like concatenating attributes that are
691 # lists: https://github.com/pydata/xarray/issues/2060
File ~/miniforge3/lib/python3.12/site-packages/satpy/readers/core/file_handlers.py:275, in BaseFileHandler.available_datasets(self, configured_datasets)
182 def available_datasets(self, configured_datasets=None):
183 """Get information of available datasets in this file.
184
185 This is used for dynamically specifying what datasets are available
(...) 273
274 """
--> 275 for is_avail, ds_info in (configured_datasets or []):
276 if is_avail is not None:
277 # some other file handler said it has this dataset
278 # we don't know any more information than the previous
279 # file handler so let's yield early
280 yield is_avail, ds_info
File ~/miniforge3/lib/python3.12/site-packages/satpy/readers/core/file_handlers.py:275, in BaseFileHandler.available_datasets(self, configured_datasets)
182 def available_datasets(self, configured_datasets=None):
183 """Get information of available datasets in this file.
184
185 This is used for dynamically specifying what datasets are available
(...) 273
274 """
--> 275 for is_avail, ds_info in (configured_datasets or []):
276 if is_avail is not None:
277 # some other file handler said it has this dataset
278 # we don't know any more information than the previous
279 # file handler so let's yield early
280 yield is_avail, ds_info
[... skipping similar frames: BaseFileHandler.available_datasets at line 275 (2969 times)]
File ~/miniforge3/lib/python3.12/site-packages/satpy/readers/core/file_handlers.py:275, in BaseFileHandler.available_datasets(self, configured_datasets)
182 def available_datasets(self, configured_datasets=None):
183 """Get information of available datasets in this file.
184
185 This is used for dynamically specifying what datasets are available
(...) 273
274 """
--> 275 for is_avail, ds_info in (configured_datasets or []):
276 if is_avail is not None:
277 # some other file handler said it has this dataset
278 # we don't know any more information than the previous
279 # file handler so let's yield early
280 yield is_avail, ds_info
RecursionError: maximum recursion depth exceeded
- Yes, but it should be possible to open and close the files one after another to extract data from them all, shouldn't it?
- The use case is to create a daily (downsampled) geoprojected plot of all the data collected from OLCI during that day. This would be used for internal monitoring of the instrument status, because missing data from one of the OLCI cameras is very easy to spot in such a plot. As an alternative, we are in any case considering creating one plot per orbit, or at least one per small group of orbits.
- Yes, I wouldn't recommend doing more than one orbit at a time even if the rest of this wasn't failing.
- Very interesting. That isn't technically recursion in the normal sense, but rather a generator being passed to a generator being passed to a generator, and so on. The hope with this implementation was to avoid generating and iterating over a list multiple times. A fix probably isn't too hard, but I'd have to consider the pros and cons (see the generator sketch after this list for what is going on).
- Yes, but the reading code would have to be very smart about how it does things. With dask, which Satpy is using, this gets even more complicated, as the easiest solution is to open the file and let dask hold on to it until it needs the data and eventually loads it. Having dask re-open the file in each thread that needs it is typically not great for performance and is not the most obvious "out of the box" solution. You end up having to open the file in the main thread, parse out all the attributes and variables that you want/need to use, then pass the original filename to a function that dask will call later in a separate thread. In that function, you re-open the file, pull the data out, then close the file. Depending on the file format this gets even more difficult (a rough sketch of this pattern follows after this list).
- Sounds good. If I were doing this I would process one orbit at a time, save to a geotiff, then call something like gdal_merge on the command line to merge them all together. If you know the target area/extent that you want the final image to be on, you could resample to that large grid with Satpy, and that would speed things up for gdal_merge. Satpy also has a MultiScene which could do this blending of orbits for you, but honestly I'm not sure this is the right use case for it. Plus, if you process one orbit at a time you can do it in parallel with simpler code, or in real time as the orbit data becomes available.
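To make the "generator passed to a generator" point above concrete, here is a minimal standalone sketch (not Satpy code; the available_datasets name only mirrors the real method) showing how stacking one generator per file handler eventually exhausts the interpreter's recursion limit:

import sys

def available_datasets(configured_datasets=None):
    # Each wrapping layer first re-yields everything the previous layer produced ...
    for is_avail, ds_info in (configured_datasets or []):
        yield is_avail, ds_info
    # ... and then adds its own entry.
    yield True, {"name": "dummy"}

print(sys.getrecursionlimit())  # typically 1000

configured = None
for _ in range(3000):  # one wrapping generator per "file handler"
    configured = available_datasets(configured)

list(configured)  # raises RecursionError: maximum recursion depth exceeded

Consuming each layer's output into a list before passing it on, instead of passing the generator itself, would keep the depth constant, at the cost of materializing the list multiple times.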
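And here is a rough sketch of the "open in the main thread, re-open in the dask task" pattern described above, assuming a netCDF file readable with xarray; the helper names (_load_var, lazy_variable) are made up for illustration and are not Satpy APIs:

import dask.array as da
import xarray as xr
from dask import delayed

def _load_var(filename, var_name):
    # Called later by dask in a worker thread: re-open, read, close.
    with xr.open_dataset(filename) as ds:
        return ds[var_name].values

def lazy_variable(filename, var_name):
    # Open once in the main thread only to grab shape/dtype metadata ...
    with xr.open_dataset(filename) as ds:
        shape = ds[var_name].shape
        dtype = ds[var_name].dtype
    # ... then hand the *filename* (not an open handle) to a delayed task.
    return da.from_delayed(delayed(_load_var)(filename, var_name),
                           shape=shape, dtype=dtype)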
On point 4, I would resample each orbit to a world map and then save it as a NetCDF. Then you can read them again with Satpy, and as they would all have the same area, you could read all the orbit files for a day together and use a bucket-averaging resampler or similar to get the averages. With geotiff you can read back images written with Satpy, but this is not designed to retain the original pixel values.
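A minimal sketch of the first half of this suggestion (resample one orbit to a common grid and write CF NetCDF). The band, paths, and area name are placeholders, and the exact daily-averaging workflow is not tested here:

from satpy import Scene

# One orbit's worth of OLCI files (placeholder paths), resampled to a common
# global grid and written out as CF NetCDF.
orbit_files = ['/path/to/orbit/files']          # placeholder
scn = Scene(filenames=orbit_files, reader='olci_l1b')
scn.load(['Oa08'])                              # any band of interest
global_scn = scn.resample('my_global_area')     # placeholder global area name
global_scn.save_datasets(writer='cf', filename='orbit_000.nc')

# Per the suggestion above, the per-orbit NetCDF files (all on the same area)
# would then be read back with reader='satpy_cf_nc' and combined for the day,
# e.g. with a bucket-averaging resampler (resampler='bucket_avg').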
@gerritholl I hadn't considered reading the geotiffs with Satpy, and yes, this could be troublesome (given other issues filed against the generic_image reader). I offered geotiff as a suggestion because, if the desired end result is a geotiff, then creating one is fast and easy to manipulate with GDAL (gdal_merge).
Just my 2 cents: we use a dynamic area in the right projection to resample each segment and then use a WMS to put them all together (with VIIRS granules). A WMS is not necessary in your case, I guess; a simple gdal_merge should work if you just need a daily image.