MintPy
S3 Bucket as data_dir
Enable the usage of S3 buckets as data directories for HyP3 input files.
Is your feature request related to a problem? Please describe
Describe the solution you'd like
Usage of S3 bucket URLs as data input directories for HyP3 files.
Describe alternatives you have considered
Additional context
Depending on the temporal baseline, the total storage size can be near 100 GB. When working in a cloud environment, storing files in S3 is cheaper than keeping them on EBS / EFS storage.
Are you willing to help implement and maintain this feature?
- [ ] Yes
- [X] No
👋 Thanks for opening your first issue here! Please fill out the template with as much detail as possible. We appreciate that you took the time to contribute! Make sure you read our contributing guidelines.
Potential solution
To enable the usage of S3 buckets as data directories for HyP3 input files, we need to modify the existing codebase to recognize and handle S3 URLs. This involves integrating AWS S3 interaction capabilities using the boto3 library, which allows us to download or stream files from S3. The solution involves updating file handling logic across multiple files to support S3 URLs, ensuring that data can be accessed seamlessly whether it's stored locally or in an S3 bucket.
How to implement
- Install Boto3: Ensure that the `boto3` library is installed in your environment. This library is essential for interacting with AWS S3.
- Modify `src/mintpy/load_data.py`:
  - Add functions to identify S3 URLs and download files from S3.
  - Update the data loading logic to handle S3 URLs by downloading files to a temporary directory.

  ```python
  import boto3
  import os
  import tempfile

  def is_s3_url(path):
      return path.startswith('s3://')

  def download_from_s3(s3_url, local_dir):
      s3 = boto3.client('s3')
      bucket_name, key = s3_url.replace("s3://", "").split("/", 1)
      local_path = os.path.join(local_dir, os.path.basename(key))
      s3.download_file(bucket_name, key, local_path)
      return local_path

  def load_data_from_path(path):
      if is_s3_url(path):
          with tempfile.TemporaryDirectory() as temp_dir:
              local_path = download_from_s3(path, temp_dir)
              # Proceed with loading data from local_path
              ...
      else:
          # Proceed with loading data from the local filesystem
          ...
  ```

- Modify `src/mintpy/prep_hyp3.py`:
  - Import `boto3` and add functions to handle S3 URLs.
  - Update the `add_hyp3_metadata` function to download files from S3 if necessary.

  ```python
  import os
  import boto3
  from urllib.parse import urlparse
  from mintpy.utils import readfile, utils1 as ut, writefile

  s3_client = boto3.client('s3')

  def is_s3_url(url):
      return url.startswith('s3://')

  def download_from_s3(s3_url, local_path):
      parsed_url = urlparse(s3_url)
      bucket = parsed_url.netloc
      key = parsed_url.path.lstrip('/')
      s3_client.download_file(bucket, key, local_path)

  def add_hyp3_metadata(fname, meta, is_ifg=True):
      if is_s3_url(fname):
          local_fname = '/tmp/' + os.path.basename(fname)
          download_from_s3(fname, local_fname)
          fname = local_fname
      job_id = '_'.join(os.path.basename(fname).split('_')[:8])
      # ... rest of the function remains unchanged

  def prep_hyp3(inps):
      inps.file = ut.get_file_list(inps.file, abspath=True)
      for fname in inps.file:
          is_ifg = any([x in fname for x in ['unw_phase','corr']])
          meta = readfile.read_gdal_vrt(fname)
          meta = add_hyp3_metadata(fname, meta, is_ifg=is_ifg)
          rsc_file = fname+'.rsc'
          writefile.write_roipac_rsc(meta, out_file=rsc_file)
      return
  ```

- Modify `src/mintpy/defaults/auto_path.py`:
  - Update the `read_str2dict` function to recognize S3 URLs and handle them appropriately.

  ```python
  def read_str2dict(inString, delimiter='=', print_msg=False):
      strDict = {}
      lines = inString.split('\n')
      for line in lines:
          c = [i.strip() for i in line.strip().split(delimiter, 1)]
          if len(c) < 2 or line.startswith(('%', '#')):
              continue
          else:
              key = c[0]
              value = str.replace(c[1], '\n', '').split("#")[0].strip()
              if value != '':
                  if value.startswith('s3://'):
                      # S3 URL: keep as-is (or mark for special processing)
                      strDict[key] = value
                  else:
                      strDict[key] = value

      # set 'None' to None
      for key, value in strDict.items():
          if value.lower() == 'none':
              strDict[key] = None
      return strDict
  ```

- Testing: Ensure that the changes are tested with various configurations to verify that S3 URLs are correctly recognized and processed. This includes testing with both local and S3 paths to ensure seamless integration; a minimal test sketch follows below.
By following these steps, the codebase will be extended to support S3 bucket URLs as input paths, allowing for more flexible data storage options in cloud environments.
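As a starting point for the Testing step, a unit test along the following lines could exercise the proposed download helper without touching real AWS. This is only a sketch: it assumes the `is_s3_url` / `download_from_s3` helpers proposed above have been added to `mintpy.load_data`, and it uses `pytest` plus the third-party `moto` library (version 5+, which provides `mock_aws`) as an extra test-only dependency.

```python
import os

import boto3
from moto import mock_aws  # moto>=5; assumed extra test dependency

# assumes the helpers proposed above were added to src/mintpy/load_data.py
from mintpy.load_data import download_from_s3, is_s3_url


def test_is_s3_url():
    assert is_s3_url('s3://my-bucket/hyp3/pair_1_unw_phase.tif')
    assert not is_s3_url('/data/hyp3/pair_1_unw_phase.tif')


@mock_aws
def test_download_from_s3(tmp_path):
    # create a fake bucket and object entirely in memory
    s3 = boto3.client('s3', region_name='us-east-1')
    s3.create_bucket(Bucket='my-bucket')
    s3.put_object(Bucket='my-bucket', Key='hyp3/pair_1_unw_phase.tif', Body=b'dummy')

    local_path = download_from_s3('s3://my-bucket/hyp3/pair_1_unw_phase.tif', str(tmp_path))

    assert os.path.basename(local_path) == 'pair_1_unw_phase.tif'
    with open(local_path, 'rb') as f:
        assert f.read() == b'dummy'
```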
Files used for this task:
Changes on src/mintpy/load_data.py
To enable the usage of S3 bucket URLs as data directories in the src/mintpy/load_data.py file, you can follow these steps:
- Install Boto3: Ensure that the `boto3` library is installed in your environment. This library is essential for interacting with AWS S3.
- Modify Data Loading Logic: Update the functions responsible for loading data to handle S3 URLs. This involves checking if the input path is an S3 URL and using `boto3` to download or stream the data.
- Identify S3 URLs: You can identify S3 URLs by checking if the path starts with `s3://`.
- Download or Stream Data: Use `boto3` to access the S3 bucket. You can either download the files to a temporary directory or stream them directly if your application supports it (see the streaming sketch at the end of this section).
- Update File Handling: Ensure that all file handling logic can work with files downloaded from S3. This might involve using temporary files or in-memory file-like objects.
Here's a basic outline of how you might implement these changes:
```python
import boto3
import os
import tempfile

def is_s3_url(path):
    return path.startswith('s3://')

def download_from_s3(s3_url, local_dir):
    s3 = boto3.client('s3')
    bucket_name, key = s3_url.replace("s3://", "").split("/", 1)
    local_path = os.path.join(local_dir, os.path.basename(key))
    s3.download_file(bucket_name, key, local_path)
    return local_path

def load_data_from_path(path):
    if is_s3_url(path):
        with tempfile.TemporaryDirectory() as temp_dir:
            local_path = download_from_s3(path, temp_dir)
            # Proceed with loading data from local_path
            ...
    else:
        # Proceed with loading data from the local filesystem
        ...
```
- Integrate with Existing Logic: Fold the helpers above into the existing data loading functions, ensuring that every part of the code that handles file paths also supports S3 URLs.
By following these steps, you can extend the functionality of the load_data.py file to support S3 bucket URLs as input paths for data loading.
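For the "stream them directly" option mentioned in the list above, a minimal sketch could look like the following. This is an assumption-laden illustration: `open_s3_object` is a hypothetical helper name, it only suits readers that accept an in-memory file-like object, and GDAL-backed readers would more likely need a local copy or GDAL's `/vsis3/` virtual paths instead.

```python
import io

import boto3


def open_s3_object(s3_url):
    """Return an in-memory, file-like object for a (reasonably small) S3 object."""
    s3 = boto3.client('s3')
    bucket, key = s3_url.replace('s3://', '').split('/', 1)
    body = s3.get_object(Bucket=bucket, Key=key)['Body']
    return io.BytesIO(body.read())


# usage sketch: hand the buffer to any reader that accepts file-like objects
# buf = open_s3_object('s3://my-bucket/hyp3/pair_1_unw_phase.tif')
```

Whether this is preferable to the temporary-directory approach depends on file sizes and on whether the downstream readers can consume file-like objects at all.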
Changes on src/mintpy/prep_hyp3.py
To modify the src/mintpy/prep_hyp3.py file to support S3 bucket URLs as input paths for HyP3 metadata preparation, you can follow these steps:
- Import Necessary Libraries: You will need to use `boto3` to interact with S3. Ensure that `boto3` is imported at the beginning of the file.
- Modify File Handling: Update the file handling logic to check if the input path is an S3 URL. If it is, use `boto3` to download the file locally or read it directly from S3.
- Update `add_hyp3_metadata` Function: Modify the function to handle S3 paths. This may involve checking if the `fname` is an S3 URL and then using `boto3` to access the metadata file.
Here is a concrete proposal for the changes:
```python
import datetime as dt
import os
import boto3
from urllib.parse import urlparse

from mintpy.constants import SPEED_OF_LIGHT
from mintpy.objects import sensor
from mintpy.utils import readfile, utils1 as ut, writefile

# Initialize S3 client
s3_client = boto3.client('s3')

def is_s3_url(url):
    return url.startswith('s3://')

def download_from_s3(s3_url, local_path):
    parsed_url = urlparse(s3_url)
    bucket = parsed_url.netloc
    key = parsed_url.path.lstrip('/')
    s3_client.download_file(bucket, key, local_path)

def add_hyp3_metadata(fname, meta, is_ifg=True):
    # Check if the file is on S3
    if is_s3_url(fname):
        local_fname = '/tmp/' + os.path.basename(fname)
        download_from_s3(fname, local_fname)
        fname = local_fname

    # Existing logic follows...
    job_id = '_'.join(os.path.basename(fname).split('_')[:8])
    # ... rest of the function remains unchanged

def prep_hyp3(inps):
    inps.file = ut.get_file_list(inps.file, abspath=True)
    for fname in inps.file:
        is_ifg = any([x in fname for x in ['unw_phase','corr']])
        meta = readfile.read_gdal_vrt(fname)
        meta = add_hyp3_metadata(fname, meta, is_ifg=is_ifg)
        rsc_file = fname+'.rsc'
        writefile.write_roipac_rsc(meta, out_file=rsc_file)
    return
```
Key Points:
- S3 URL Handling: The `is_s3_url` function checks if a path is an S3 URL. If it is, the `download_from_s3` function downloads the file to a temporary local path.
- Temporary Local Storage: Files from S3 are downloaded to a temporary directory (`/tmp/`) for processing (a tempfile-based variant is sketched below).
- Boto3 Client: A `boto3` client is used to interact with S3, which requires AWS credentials to be configured in the environment where this script runs.
This approach allows the existing logic to remain mostly unchanged while adding support for S3 URLs.
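If the hardcoded `/tmp/` path is a concern (name collisions between parallel jobs, non-Linux hosts), a variant based on `tempfile` could be used instead. This is only a sketch: `fetch_to_tempfile` is a hypothetical helper that builds on the `download_from_s3` function proposed above, and the caller would be responsible for deleting the file afterwards.

```python
import os
import tempfile


def fetch_to_tempfile(s3_url):
    """Download an S3 object to a unique temporary file and return its path."""
    suffix = os.path.splitext(s3_url)[1]  # keep the original extension, e.g. '.tif'
    fd, local_path = tempfile.mkstemp(suffix=suffix)
    os.close(fd)
    # download_from_s3(s3_url, local_path), as proposed above, writes to local_path
    download_from_s3(s3_url, local_path)
    return local_path
```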
Changes on src/mintpy/defaults/auto_path.py
To extend the path configuration utilities in src/mintpy/defaults/auto_path.py to recognize S3 URLs, you can follow these steps:
- Identify S3 URL Patterns: Determine how S3 URLs will be specified in the configuration. Typically, they follow the pattern `s3://bucket-name/path/to/object`.
- Modify `read_str2dict` Function: Update the `read_str2dict` function to recognize and handle S3 URLs. You might want to add logic to check if a value starts with `s3://` and handle it accordingly.
- Update `get_auto_path` Function: Modify the `get_auto_path` function to process S3 URLs. This might involve checking if any paths in the `auto_path_dict` are S3 URLs and ensuring they are handled correctly (a path-joining sketch appears at the end of this section).
- Add S3 Handling Logic: If necessary, add logic to handle S3 paths, such as downloading files from S3 to a local directory or directly accessing them using a library like `boto3`.
- Testing: Ensure that the changes are tested with various configurations to verify that S3 URLs are correctly recognized and processed.
Here's a conceptual example of how you might start modifying the read_str2dict function:
```python
def read_str2dict(inString, delimiter='=', print_msg=False):
    '''Read multiple lines of string into dict
    Based on mintpy.utils.readfile.read_template()
    '''
    strDict = {}
    lines = inString.split('\n')
    for line in lines:
        c = [i.strip() for i in line.strip().split(delimiter, 1)]
        if len(c) < 2 or line.startswith(('%', '#')):
            continue
        else:
            key = c[0]
            value = str.replace(c[1], '\n', '').split("#")[0].strip()
            if value != '':
                # Check if the value is an S3 URL
                if value.startswith('s3://'):
                    # Handle S3 URL (e.g., store it differently or mark it for special processing)
                    strDict[key] = value
                else:
                    strDict[key] = value

    # set 'None' to None
    for key, value in strDict.items():
        if value.lower() == 'none':
            strDict[key] = None

    return strDict
```
This is a starting point, and you'll need to integrate this with the rest of the codebase, ensuring that any S3-specific logic is appropriately handled in other parts of the application.
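One concrete detail that `get_auto_path` (and any other path-building code) would need to get right is that S3 keys always use forward slashes, so `os.path.join` is unsafe on Windows. A minimal, illustrative sketch follows; the helper name is hypothetical, not existing MintPy API.

```python
import os


def join_data_path(base_dir, relative_pattern):
    """Join a base directory and a relative file pattern.

    S3 prefixes must keep forward slashes, so bypass os.path.join for them.
    """
    if base_dir.startswith('s3://'):
        return base_dir.rstrip('/') + '/' + relative_pattern.lstrip('/')
    return os.path.join(base_dir, relative_pattern)


# e.g. join_data_path('s3://my-bucket/hyp3', '*/*_unw_phase.tif')
# -> 's3://my-bucket/hyp3/*/*_unw_phase.tif'
```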
When you say you want to use "S3 buckets as data directories for HyP3 input files", do you mean
- avoid downloading the HyP3 GeoTIFFs before creating the local MintPy HDF5 files
- use S3 for all storage instead of the local MintPy HDF5 files
(or something else?)
The former is a smaller change which saves a little space; the latter would require a very large rewrite for most of MintPy to read from and write to S3 buckets directly.
The idea would be the latter, since it would reduce a lot of the storage costs involved. I took a look at the code, and indeed there isn't an easy way of adding an S3Path to MintPy right away.
Do you mean expensive for long term storage? Or for any use at all?
Based on https://aws.amazon.com/ebs/pricing/, having a 100 GB block for 24 hours of processing would be about $0.30 (100 GB of gp3 at roughly $0.08/GB-month works out to about $0.27 per day). Since most MintPy processing takes less time than that, it doesn't seem too expensive to run MintPy normally after provisioning a large disk, then save the HDF5 files to S3 afterwards.
Are you seeing different prices? Or were you picturing another more expensive use case?
Our intended use case is to deploy over multiple AOIs globally. Right now, using a temporal baseline of 37 days, 10x2 looks, and data acquired from 2017 to 2025, we are getting roughly 700 pairs per AOI (burst), resulting in approximately 80 GB of input data plus MintPy results per burst.
The biggest problem around storage costs is that we are leveraging AWS Batch / AWS Processing Jobs to execute AOIs in parallel to scale the workflow. When executing in parallel, I need to spin up multiple machines / workers with dedicated storage space.
I had a much larger number in mind than $0.30 a day, but I'll run some tests with this workflow and report back on costs here.
Thanks a lot for the great discussion @scottstanie
@mthsdiniz-usp for HyP3, we use AWS Batch as well. We leverage the SSDs included on board some EC2 instances for local storage when processing, then just upload the results to S3 at the end (we've had good luck with r6id instances), so we're not really paying for storage outside of S3. And if you're storing the products for a while, using S3 Intelligent-Tiering usually results in cost savings.
Feel free to ping me if you want to chat about how we've got things set up and share experiences.