SNAPHU `Killed` error causes missing scenes in output timeseries (only 15/34 scenes showing)
EDIT: Updated the title to reflect the underlying SNAPHU issue (out of memory).
Hi,
I've managed to run an analysis using 34 Sentinel-1 images, but around 50% of the data is missing when I use a single-reference network: I put in 34 SLC images, yet the output timeseries and network contain only 15 data points. Is that normal? The network is still nicely distributed, but the temporal resolution is lower than expected because of the missing data.
I've checked the output of stackSentinel.py and the missing dates are there in the output folders such as merged/interferograms, baselines, coreg_secondarys, etc.
I've inspected the slcStack.h5 file and the "slc" key has shape (34, 1029, 5864), which I assume means all 34 Sentinel-1 acquisitions were ingested.
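(For reference, that check only takes a couple of lines of h5py. This is just a sketch; it assumes the stack also carries the date dataset mentioned in the slcStack.py analysis further down.)

```python
import h5py

# Quick sanity check that all 34 acquisitions made it into the stack
with h5py.File("slcStack.h5", "r") as f:
    print(f["slc"].shape)     # e.g. (34, 1029, 5864) -> 34 dates ingested
    print(len(f["date"][:]))  # should also be 34 if the date list is complete
```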
Input stackSentinel code
stackSentinel.py -s /mnt/e/data/insar-highways/demak \
--workflow interferogram \
--working_directory /mnt/e/data/insar-highways/demak_v5 \
-n 1 --bbox "-6.980585 -6.896600 110.435772 110.636444" \
-o /mnt/e/data/insar-highways/demak_v5/orbits \
-a /mnt/e/data/insar-highways/demak_v5/auxfiles \
-d /mnt/e/data/insar-highways/demak_v5/dem/dem.geo \
-V False \
-z 4 \
-r 20
Then I ran MiaplPy with `miaplpyApp.py demak.cfg --dir /mnt/e/data/insar-highways/demak_v5/miaplpy`, using the cfg below.
################
miaplpy.load.processor = isce #[isce,snap,gamma,roipac], auto for isceTops
miaplpy.load.updateMode = no #[yes / no], auto for yes, skip re-loading if HDF5 files are complete
miaplpy.load.compression = auto #[gzip / lzf / no], auto for no.
miaplpy.load.autoPath = no # [yes, no] auto for no
miaplpy.load.slcFile = /mnt/e/data/insar-highways/demak_v5/merged/SLC/*/*.slc.full #[path2slc_file]
##---------for ISCE only:
miaplpy.load.metaFile = /mnt/e/data/insar-highways/demak_v5/reference/IW*.xml
miaplpy.load.baselineDir = /mnt/e/data/insar-highways/demak_v5/baselines
##---------geometry datasets:
miaplpy.load.demFile = /mnt/e/data/insar-highways/demak_v5/merged/geom_reference/hgt.rdr.full
miaplpy.load.lookupYFile = /mnt/e/data/insar-highways/demak_v5/merged/geom_reference/lat.rdr.full
miaplpy.load.lookupXFile = /mnt/e/data/insar-highways/demak_v5/merged/geom_reference/lon.rdr.full
miaplpy.load.incAngleFile = /mnt/e/data/insar-highways/demak_v5/merged/geom_reference/los.rdr.full
miaplpy.load.azAngleFile = /mnt/e/data/insar-highways/demak_v5/merged/geom_reference/los.rdr.full
miaplpy.load.shadowMaskFile = /mnt/e/data/insar-highways/demak_v5/merged/geom_reference/shadowMask.rdr.full
##---------miaplpy.load.waterMaskFile = /mnt/e/data/insar-highways/demak_v4/water_mask/swbdLat_S08_S06_Lon_E110_E111.wbd
##---------interferogram datasets:
miaplpy.load.unwFile = /mnt/e/data/insar-highways/demak_v5/miaplpy/inverted/interferograms_single_reference/*/*fine*.unw
miaplpy.load.corFile = /mnt/e/data/insar-highways/demak_v5/miaplpy/inverted/interferograms_single_reference/*/*fine*.cor
miaplpy.load.connCompFile = /mnt/e/data/insar-highways/demak_v5/miaplpy/inverted/interferograms_single_reference/*/*.unw.conncomp
##---------subset (optional):
## if both yx and lalo are specified, use lalo option unless a) no lookup file AND b) dataset is in radar coord
miaplpy.subset.lalo = -6.980585:-6.896600,110.435772:110.636444
# MiaplPy options
miaplpy.multiprocessing.numProcessor = 10
miaplpy.interferograms.type = single_reference
## Mintpy options
mintpy.compute.cluster = local # if dask is not available, set this option to no
mintpy.compute.numWorker = 4
mintpy.reference.lalo = -6.9062397501293855, 110.62864532047873
mintpy.troposphericDelay.method = no
Just an update. I fixed the error.
In the output of step 5 (unwrap_ifgram) I saw `Killed` printed multiple times, which I think means my computer ran out of memory. Because so many SNAPHU runs failed, the downstream processing apparently assumed there were only 15 datasets.
Checking my run_05_miaplpy_unwrap_ifgram file, I see it launches a lot of commands at the same time: 20 commands, then a wait, then the remaining 13 commands before the last wait. I rewrote the file so there is a wait after every 4 SNAPHU commands. I'm not sure which parameter in the cfg controlled this originally, maybe miaplpy.compute.numCores; I set it to 20 because I have 20 CPU cores.
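In case it helps anyone else, the manual rewrite can also be scripted. This is only a rough sketch of the rebatching I describe above, not part of MiaplPy; the run-file name and the batch size are whatever applies to your setup.

```python
# Rebatch a run file so at most `batch_size` background jobs run between `wait` barriers.
run_file = "run_05_miaplpy_unwrap_ifgram"  # adjust to your run_files path
batch_size = 4

with open(run_file) as f:
    # keep the job lines, drop the existing `wait` barriers
    jobs = [line.rstrip("\n") for line in f if line.strip() and line.strip() != "wait"]

with open(run_file + "_batched", "w") as f:
    for i, job in enumerate(jobs, start=1):
        f.write(job + "\n")
        if i % batch_size == 0:
            f.write("wait\n")
    if len(jobs) % batch_size != 0:
        f.write("wait\n")  # final barrier for the leftover jobs
```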
After that, it ran again with no problem. I then had to delete numInvIfgram.h5, timeseries.h5, and temporalCoherence.h5 because of a dataset-size mismatch in later steps, but the resulting output is good.
My thoughts on this for the project team:
- Could there be a specific parameter for the number of SNAPHU jobs? I was fine-tuning the cfg file for the phase-linking step, since it takes so long, and wanted to use all of my CPU cores. But if I understand correctly, the same parameter was used for the number of SNAPHU jobs, which led to the unsafe process terminations.
- There should be a way to safely catch the SNAPHU out-of-memory error in unwrap_ifgram. It doesn't raise an exception, so the rest of MiaplPy ran anyway and treated the result as valid despite 50% of the dataset being missing (a minimal sketch of one way to detect this is shown after this list).
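To illustrate the second point, here is a minimal sketch, not MiaplPy's actual code, of how an out-of-memory kill could be turned into a real exception; `cmd` is a placeholder for the snaphu call. The key observation is that the kernel's OOM killer terminates the child with SIGKILL, which is exactly what shows up as "Killed" in the step-5 output.

```python
import signal
import subprocess

def run_snaphu(cmd):
    """Run one SNAPHU command and fail loudly if it was killed or errored out."""
    proc = subprocess.run(cmd, shell=True, capture_output=True, text=True)
    if proc.returncode == -signal.SIGKILL:
        # negative return codes are signal numbers; SIGKILL means the process
        # was terminated by the OS, typically the out-of-memory killer
        raise RuntimeError(f"SNAPHU was killed (likely out of memory): {cmd}")
    if proc.returncode != 0:
        raise RuntimeError(f"SNAPHU exited with code {proc.returncode}:\n{proc.stderr}")
    return proc.stdout
```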
Yes, you are right. I am planning to use a Python version of SNAPHU and fix this issue in the near future.
Potential Solution
The plan for fixing this bug is to address the memory-management issues in the SNAPHU unwrapping step, which are the likely cause of the "Killed" error (excessive memory usage). By tuning the SNAPHU configuration for the available system resources, adding memory checks, and improving error handling, the process can avoid being terminated unexpectedly and all scenes can be processed correctly.
What is Causing This Bug?
The bug is primarily caused by the SNAPHU unwrapping process consuming more memory than is available, leading to the process being killed by the operating system. This is likely due to the size of the interferograms being processed and the configuration settings not being optimized for the available system resources. Additionally, the lack of memory checks and detailed error handling in the scripts contributes to the issue.
Code
- Optimize SNAPHU Configuration: Adjust the SNAPHU configuration parameters to better match the available system resources. This may involve reducing the number of tiles or adjusting other parameters to reduce memory usage.

```python
# Example of adjusting SNAPHU configuration
snaphu_config = {
    'NLOOKSRANGE': 1,
    'NLOOKSAZ': 1,
    'TILEDIR': '/path/to/tiledir',
    'NUM_TILES': 4,          # Reduce the number of tiles if memory is limited
    'MAX_DISCONTINUITY': 10  # Adjust based on dataset characteristics
}
```
- Implement Memory Checks: Add checks to ensure that sufficient memory is available before starting the SNAPHU process.

```python
import psutil

def check_memory_availability(required_memory_gb):
    available_memory_gb = psutil.virtual_memory().available / (1024 ** 3)
    if available_memory_gb < required_memory_gb:
        raise MemoryError(f"Insufficient memory: {available_memory_gb} GB available, {required_memory_gb} GB required.")

# Example usage
check_memory_availability(8)  # Check if at least 8 GB of memory is available
```
- Enhanced Error Handling: Improve error handling to catch and report memory-related errors more gracefully.

```python
try:
    # Run SNAPHU process
    run_snaphu_process()
except MemoryError as e:
    print(f"Memory error encountered: {e}")
    # Additional logging or cleanup actions
except RuntimeError as e:
    print(f"Runtime error encountered: {e}")
    # Additional logging or cleanup actions
```
- Logging and Monitoring: Enhance logging to provide more detailed information about the execution of each step, including memory usage and any errors encountered.

```python
import logging
import psutil  # needed for the memory query below

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

def log_memory_usage():
    memory_info = psutil.virtual_memory()
    logger.info(f"Memory usage: {memory_info.percent}% used, {memory_info.available / (1024 ** 3):.2f} GB available")

# Example usage
log_memory_usage()
```
How to Replicate the Bug
- Prepare a dataset of 34 Sentinel-1 SLC images and configure the processing pipeline as described in the user input.
- Run the `stackSentinel.py` script with the specified parameters to generate the necessary input files.
- Execute the `miaplpyApp.py` script using the provided configuration file.
- Observe the output timeseries and network for missing scenes, and check for any "Killed" error messages in the logs.
By following these steps, the bug should be replicated, allowing for further testing and validation of the proposed solution.
Files used for this task:
Changes on src/miaplpy/unwrap_ifgram.py
Analysis of src/miaplpy/unwrap_ifgram.py
Overview
The file unwrap_ifgram.py is responsible for unwrapping interferograms using the SNAPHU software. The script includes several functions and classes that manage the configuration and execution of the SNAPHU unwrapping process. The main class, Snaphu, handles the setup and execution of the unwrapping, including determining whether to split the process into tiles and managing the configuration files.
Potential Causes of the Bug
- Memory Management:
  - The error message "Killed" typically indicates that the process was terminated by the operating system, often due to excessive memory usage. The script does not appear to have explicit memory management or optimization strategies, which could lead to high memory consumption, especially when processing large datasets or multiple tiles.
- Tile Management:
  - The script includes logic to determine whether the unwrapping should be split into tiles (the `unwrap_tile` method). If the number of tiles is not optimally configured, it could lead to inefficient memory usage. The calculation of `y_tile` and `x_tile` might not be optimal for the given dataset size.
- Configuration File Handling:
  - The configuration for SNAPHU is dynamically generated and written to a file. If the configuration parameters (e.g., `NLOOKSRANGE`, `NLOOKSAZ`, `TILEDIR`) are not set correctly, it could lead to inefficient processing and increased memory usage.
- Error Handling:
  - The script raises a `RuntimeError` if SNAPHU returns an error. However, it does not provide detailed logging or handling for memory-specific errors, which could help diagnose the issue.
Recommendations
- Optimize Memory Usage:
  - Implement memory profiling to identify bottlenecks and optimize memory usage. Consider using memory-efficient data structures or processing techniques.
- Tile Configuration:
  - Review and optimize the logic for determining the number of tiles (`get_nproc_tile`). Ensure that the tile size and number are appropriate for the dataset size and available system memory (see the sketch after this list).
- Configuration Parameters:
  - Verify that the configuration parameters for SNAPHU are set optimally for the dataset and system. Consider allowing user input for critical parameters that affect memory usage.
- Enhanced Logging and Error Handling:
  - Implement detailed logging to capture memory usage statistics and SNAPHU output. Enhance error handling to provide more informative messages, particularly for memory-related issues.
- Testing with Smaller Datasets:
  - Test the unwrapping process with smaller datasets to ensure that the logic and configuration are correct before scaling up to larger datasets.
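To make the tile recommendation concrete, the sketch below shows one way a tile grid could be derived from a per-process memory budget. The ~32 bytes-per-pixel working-memory figure and the function itself are illustrative assumptions, not the actual `get_nproc_tile` logic.

```python
import math

def choose_tile_grid(length, width, mem_budget_bytes, bytes_per_pixel=32):
    """Return (ntilerow, ntilecol) so each tile stays under the memory budget."""
    max_pixels_per_tile = max(1, mem_budget_bytes // bytes_per_pixel)
    n_tiles = max(1, math.ceil(length * width / max_pixels_per_tile))
    # keep the tile grid roughly square
    ntilerow = max(1, math.isqrt(n_tiles))
    ntilecol = math.ceil(n_tiles / ntilerow)
    return ntilerow, ntilecol

# e.g. a 1029 x 5864 interferogram with a 2 GB per-process budget
print(choose_tile_grid(1029, 5864, 2 * 1024**3))
```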
By addressing these areas, the script can be improved to handle large datasets more efficiently and reduce the likelihood of memory-related errors.
Changes on src/miaplpy/miaplpyApp.py
Analysis of miaplpyApp.py
The miaplpyApp.py script is a part of a larger workflow for processing InSAR data using MiaplPy. The script orchestrates various steps in the processing pipeline, including loading data, generating interferograms, unwrapping them, and performing time series analysis. The script is designed to be flexible, allowing for different configurations and processing options through a configuration file.
Key Observations
- Workflow Steps: The script defines a series of steps (`STEP_LIST`) that are executed in sequence. These steps include loading data, phase linking, generating interferograms, unwrapping interferograms, and more.
- Configuration Management: The script uses a configuration file to manage various parameters and settings. This includes paths to input data, processing options, and output directories.
- Integration with SNAPHU: The unwrapping step (`run_unwrap`) is particularly relevant to the issue at hand. This step involves calling an external script (`unwrap_ifgram.py`) to perform phase unwrapping using SNAPHU. The script constructs command-line arguments for SNAPHU based on the configuration settings.
- Resource Management: The script attempts to manage computational resources by determining the number of CPU cores available and adjusting the number of tasks accordingly. However, there is no explicit memory management or checks for available memory before executing memory-intensive tasks like unwrapping (see the sketch after this list).
- Error Handling: There is limited error handling in the script. If SNAPHU runs out of memory, it may simply terminate with a "Killed" message, which is consistent with the user's reported issue.
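As an illustration of the resource-management observation above, the following sketch caps the number of parallel unwrapping jobs by available memory as well as CPU count. The 4 GB-per-job figure is an assumed placeholder, psutil is an extra dependency, and this is not the script's current logic.

```python
import os
import psutil  # third-party; used only for the available-memory query

def safe_num_jobs(mem_per_job_gb=4.0):
    """Limit parallel SNAPHU jobs by both CPU count and available memory."""
    cpu_limit = os.cpu_count() or 1
    available_gb = psutil.virtual_memory().available / (1024 ** 3)
    mem_limit = max(1, int(available_gb // mem_per_job_gb))
    return min(cpu_limit, mem_limit)

# e.g. 20 cores but only 32 GB free with 4 GB/job -> 8 jobs instead of 20
print(safe_num_jobs())
```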
Potential Causes of the Bug
- Memory Usage: The SNAPHU process may be consuming more memory than is available, leading to the process being killed by the operating system. This could be due to the size of the interferograms or the number of tiles being processed simultaneously.
- Configuration Settings: The configuration settings for SNAPHU, such as the number of tiles (`--num_tiles`) and the maximum discontinuity (`--max_discontinuity`), may not be optimal for the available system resources.
- Lack of Memory Checks: The script does not perform any checks on available system memory before starting the SNAPHU process. Implementing such checks could prevent the process from being killed unexpectedly.
Recommendations
- Optimize SNAPHU Configuration: Review and adjust the SNAPHU configuration settings to better match the available system resources. This may involve reducing the number of tiles or adjusting other parameters to reduce memory usage.
- Implement Memory Checks: Add checks to ensure that sufficient memory is available before starting the SNAPHU process. This could involve querying the system for available memory and adjusting the processing parameters accordingly.
- Error Handling: Improve error handling to catch and report memory-related errors more gracefully. This could involve capturing the output of the SNAPHU process and checking for specific error messages.
- Logging and Monitoring: Enhance logging to provide more detailed information about the execution of each step, including memory usage and any errors encountered. This would aid in diagnosing issues more effectively.
By addressing these areas, the likelihood of encountering out-of-memory errors during the SNAPHU process can be reduced, potentially resolving the issue of missing scenes in the output timeseries.
Changes on src/miaplpy/objects/slcStack.py
Analysis of src/miaplpy/objects/slcStack.py
The slcStack.py file is responsible for handling the loading and processing of Sentinel-1 SLC (Single Look Complex) data into a format suitable for time-series analysis. The main class, slcStackDict, manages a collection of SLCs, while the slcStack class handles the reading and writing of these datasets to and from HDF5 files.
Key Functions and Potential Issues
- Data Loading and Processing:
  - The `slcStackDict` class initializes with a dictionary of SLC pairs, which are then processed to extract metadata, size, and date information.
  - The `write2hdf5` method writes the processed SLC data into an HDF5 file, ensuring that the data is correctly structured and metadata is included.
- Date and Scene Management:
  - The `get_date_list` method retrieves a sorted list of dates from the SLC pairs, which is crucial for ensuring all scenes are accounted for in the time-series.
  - The `write2hdf5` method includes a section for creating a dataset of dates, which should match the number of SLCs processed.
- Potential Causes for Missing Scenes:
  - Incomplete Data Handling: If the `pairsDict` does not contain all expected SLC pairs, some scenes might be missing from the output. This could be due to an error in how the pairs are generated or filtered.
  - HDF5 Writing Issues: The `write2hdf5` method might not correctly write all datasets if there are issues with the input data or if the method encounters an error during execution.
  - Metadata Misalignment: If the metadata does not correctly reflect the number of scenes or their dates, this could lead to discrepancies in the output.
- Memory Management:
  - The file does not explicitly handle memory management, which could be a concern given the large size of SLC datasets. This might indirectly affect the processing if the system runs out of memory, leading to incomplete data processing.
Recommendations
- Verify Input Data: Ensure that the `pairsDict` contains all expected SLC pairs and that they are correctly processed. This might involve checking the input data source and any filtering logic applied before this point (a quick completeness check is sketched after this list).
- Error Handling: Implement additional error handling in the `write2hdf5` method to catch and log any issues that occur during the writing process.
- Memory Usage: Consider optimizing memory usage, especially if the system is prone to running out of memory during processing. This might involve processing data in smaller chunks or using more efficient data structures.
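As an example of such a verification, the sketch below compares the dates stored in slcStack.h5 against the date pairs that actually produced unwrapped interferograms. The file paths, the date dataset name, and the YYYYMMDD_YYYYMMDD folder naming are assumptions based on the configuration shown earlier in this issue.

```python
import glob
import os
import h5py

# Dates ingested into the SLC stack (the `date` dataset is assumed, per the analysis above)
with h5py.File("inputs/slcStack.h5", "r") as f:
    slc_dates = {d.decode() if isinstance(d, bytes) else str(d) for d in f["date"][:]}

# Dates covered by unwrapped interferogram folders named like YYYYMMDD_YYYYMMDD
ifg_dirs = glob.glob("inverted/interferograms_single_reference/*_*")
ifg_dates = {d for path in ifg_dirs for d in os.path.basename(path).split("_")}

print("dates with no unwrapped interferogram:", sorted(slc_dates - ifg_dates))
```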
Overall, the file appears to be correctly structured for its intended purpose, but careful attention should be paid to the input data and the handling of potential errors during processing.
Thanks @mirzaees. I've updated the Issue title to reflect the actual issue.