gdalraster
gdalraster copied to clipboard
Add bindings to the GDAL VSIVirtualHandle API
This function reads a file as bytes into R.
// [[Rcpp::export]]
Rcpp::RawVector vsi_read_file(Rcpp::CharacterVector src_file,
bool show_progress = false) {
GDALProgressFunc pfnProgress = nullptr;
std::string src_file_in;
src_file_in = Rcpp::as<std::string>(_check_gdal_filename(src_file));
if (show_progress)
pfnProgress = GDALTermProgressR;
VSILFILE *fp = VSIFOpenL(src_file_in.c_str(), "rb");
//ASSERT_TRUE(fp != nullptr);
VSIFSeekL(fp, 0, SEEK_END);
size_t nSize = (size_t)VSIFTellL(fp);
VSIFSeekL(fp, 0, SEEK_SET);
//void *pRefBuf = CPLMalloc(nSize);
Rcpp::RawVector out(nSize);
void* buffer = CPLMalloc(nSize);
VSIFReadL(buffer, 1, nSize, fp);
memcpy(&(out[0]), buffer, nSize);
VSIFCloseL(fp);
return out;
}
I've experimented with it by putting this into gdal_vsi.cpp
locally. The motivation is to be able to do this, a specific file format from another source but optionally without creating a file (we create a MEM GeoTIFF with GDAL's clean virtual handling):
lcp_file <- system.file("extdata/storm_lake.lcp", package="gdalraster")
mem_file <- "/vsimem/file.tif"
gdalraster::createCopy("GTiff", mem_file, lcp_file)
bytes <- gdalraster::vsi_read_file(mem_file)
## now send those bytes somewhere (like to a COG on S3), here just a tempfile
tf <- tempfile(fileext = ".tif")
writeBin(bytes, tf )
If it's of interest, it's fairly harmless as a standalone function and would be useful to me (needs some more checks and error handling etc).
Just for more context, the motivation was to do file-less ops in lambda on aws, everything in memory from URL to S3, but I'm illustrating with local examples.
I was more ambitious when I set out to explore this, I wanted to be able to do this kind of workflow as in Python, but that's probably unlikely to be easy or useful with an actual virtual file connection (at least I will need a lot more work to learn how to work that way):
dsn = "/vsicurl/https://noaadata.apps.nsidc.org/NOAA/G02135/south/daily/geotiff/2024/05_May/S_20240508_concentration_v3.0.tif"
from osgeo import gdal
gdal.UseExceptions()
ds = gdal.Open(dsn)
temp_name = '/vsimem/current.tif'
tiff_driver = gdal.GetDriverByName('GTiff')
tiff_driver.CreateCopy(temp_name, ds)
# Read the raw data from the virtual filesystem
f = gdal.VSIFOpenL(temp_name, 'rb')
gdal.VSIFSeekL(f, 0, 2) # seek to end
size = gdal.VSIFTellL(f)
gdal.VSIFSeekL(f, 0, 0) # seek to beginning
data = gdal.VSIFReadL(1, size, f)
gdal.VSIFCloseL(f)
# Upload the raw data to s3 (or something)
#s3.put_object(Key=key, Bucket=bucket_name, Body=data, ContentLength=size)
gdal.Unlink(temp_name)
I started on a VSIFile
class a while back, but haven't had a chance to finish. Definition below. I believe this would do what you're talking about. This could be implemented sometime soon, but doesn't necessarily preclude having vsi_read_file()
either.
I believe vsi_read_file()
would be equivalent to VSIFile::ingest()
in the class interface. This would be a wrapper of VSIIngestFile()
, see:
https://gdal.org/api/cpl.html#_CPPv413VSIIngestFileP8VSILFILEPKcPP5GByteP12vsi_l_offset7GIntBig
/* class VSIFile
Bindings to the GDAL VSIVirtualHandle API. Encapsulates a VSIVirtualHandle.
Chris Toney <chris.toney at usda.gov> */
#ifndef SRC_VSIFILE_H_
#define SRC_VSIFILE_H_
#include <cstdint>
#include "Rcpp.h"
#include "RcppInt64"
#include "gdal.h"
//#include "cpl_string.h"
#include "cpl_vsi.h"
// define R_VSI_L_OFFSET_MAX = lim.integer64()[2]
class VSIFile {
private:
std::string filename_in;
std::string access_in; // access: "r", "r+", "w", "a"
VSIVirtualHandle VSILFILE;
public:
VSIFile(Rcpp::CharacterVector filename);
VSIFile(Rcpp::CharacterVector filename, std::string access);
VSIFile(Rcpp::CharacterVector filename, std::string access,
Rcpp::Nullable<Rcpp::CharacterVector> options);
SEXP open();
int close();
SEXP stat(std::string info);
int seek(int64_t offset, int origin);
int64_t tell() const;
void rewind();
Rcpp::RawVector read(std::size_t size, std::size_t count);
std::size_t write(Rcpp::RawVector& buf);
bool eof() const;
int truncate(GUIntBig offset);
int flush();
int printf(std::string fmt);
int putc(int c);
Rcpp::RawVector ingest(int64_t max_size);
};
RCPP_EXPOSED_CLASS(VSIFile)
#endif // SRC_VSIFILE_H_
Oh awesome that's excellent 👌
If you think it is worthwhile to add VSIFile
, could you take a look at the class definition above and let me know if that looks okay? This would be another exposed class, so GDALRaster
and the others are the template. A few small things would change in the definition of VSIFile
, mainly the actual data types that will be used for R <-> C++ (e.g., int64_t
, GUIntBig
would change). But what is shown above is basically the net effect (this would use {bit64} on the R side for integer64 type).
One hesitation I had, is that the R documentation for seek()
says:
Warning Use of seek on Windows is discouraged. We have found so many errors in the Windows implementation of file positioning that users are advised to use it only at their own risk, and asked not to waste the R developers' time with bug reports on Windows' deficiencies.
I was concerned about what headaches may be involved with trying to expose a VSI file interface for reasons like that. But that would be on the GDAL side anyway, and maybe it handles it fine. I haven't tested anything on Windows yet.
I remember that from years ago, no seek on windows.
I absolutely don't mind if it doesn't work there but I'll try it out for our use case. I expect seeking for a GDAL dataset size is virtualized in a way not affected by the file system but idk.
But, isn't that why we'd use the VSI open and seek etc? seek() itself won't work on vsimem I expect.
I'll pursue and PR 👌
Yes, we'd use VSI open/seek etc. I just meant that, if it's true there are "so many errors in the Windows implementation of file positioning", such that R's seek()
comes with that warning, then we might still see issues with a VSI file interface used on that OS. But maybe not, just something I noticed and wondered about.
Ah, I think I understand, your public seek() etc would use the VSI versions in their definitions this is only the header of what's exposed to R (excuse me thinking aloud I'm still pretty dim on classes in C++). 🙏 I'll check the windows seek, in my experience it's worked fine for simple cases (jump forward) but it's been ages since I did any of that.
Right, the class would encapsulate a VSIVirtualHandle
. So in R it would look something like:
vf <- new(VSIFile, "/vsimem/file.tif")
bytes <- vf$ingest(-1) # param is max_size, if no limit set to negative value
# or
# vf$seek(...)
# bytes <- vf$read(...)
# etc
vf$close()
Of course, this would work on regular file systems too. It's probably fine on Windows with GDAL VSI file. The R warning in the documentation just seemed rather strong.
Class VSIFile
is now in main, documented at:
https://usdaforestservice.github.io/gdalraster/reference/VSIFile-class.html
Testing appreciated, thanks.
Works in the examples I've tried, perfectly fine for in memory COG and VRT, for ingest() and tell(). I'll try to get my tests cases on Windows as well as Linux in coming days.