[Feature Request] Python multiprocessing with DEM
Hi all, the following currently fails, and it would be great if it worked:
from pathos.pools import ProcessPool
from osgeo import gdal

gdal.UseExceptions()
ds = gdal.Open("dem.tif")
pool = ProcessPool(nodes=4)
results = pool.map(lambda x: x, [ds])
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/pipe/.local/lib/python3.11/site-packages/pathos/multiprocessing.py", line 135, in map
return _pool.map(star(f), zip(*args)) # chunksize
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/pipe/.local/lib/python3.11/site-packages/multiprocess/pool.py", line 367, in map
return self._map_async(func, iterable, mapstar, chunksize).get()
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/pipe/.local/lib/python3.11/site-packages/multiprocess/pool.py", line 774, in get
raise self._value
File "/home/pipe/.local/lib/python3.11/site-packages/multiprocess/pool.py", line 540, in _handle_tasks
put(task)
File "/home/pipe/.local/lib/python3.11/site-packages/multiprocess/connection.py", line 214, in send
self._send_bytes(_ForkingPickler.dumps(obj))
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/pipe/.local/lib/python3.11/site-packages/multiprocess/reduction.py", line 54, in dumps
cls(buf, protocol, *args, **kwds).dump(obj)
File "/home/pipe/.local/lib/python3.11/site-packages/dill/_dill.py", line 394, in dump
StockPickler.dump(self, obj)
File "/usr/lib/python3.11/pickle.py", line 487, in dump
self.save(obj)
File "/home/pipe/.local/lib/python3.11/site-packages/dill/_dill.py", line 388, in save
StockPickler.save(self, obj, save_persistent_id)
File "/usr/lib/python3.11/pickle.py", line 560, in save
f(self, obj) # Call unbound method with explicit self
^^^^^^^^^^^^
File "/usr/lib/python3.11/pickle.py", line 902, in save_tuple
save(element)
File "/home/pipe/.local/lib/python3.11/site-packages/dill/_dill.py", line 388, in save
StockPickler.save(self, obj, save_persistent_id)
File "/usr/lib/python3.11/pickle.py", line 560, in save
f(self, obj) # Call unbound method with explicit self
^^^^^^^^^^^^
File "/usr/lib/python3.11/pickle.py", line 887, in save_tuple
save(element)
File "/home/pipe/.local/lib/python3.11/site-packages/dill/_dill.py", line 388, in save
StockPickler.save(self, obj, save_persistent_id)
File "/usr/lib/python3.11/pickle.py", line 560, in save
f(self, obj) # Call unbound method with explicit self
^^^^^^^^^^^^
File "/usr/lib/python3.11/pickle.py", line 887, in save_tuple
save(element)
File "/home/pipe/.local/lib/python3.11/site-packages/dill/_dill.py", line 388, in save
StockPickler.save(self, obj, save_persistent_id)
File "/usr/lib/python3.11/pickle.py", line 560, in save
f(self, obj) # Call unbound method with explicit self
^^^^^^^^^^^^
File "/usr/lib/python3.11/pickle.py", line 887, in save_tuple
save(element)
File "/home/pipe/.local/lib/python3.11/site-packages/dill/_dill.py", line 388, in save
StockPickler.save(self, obj, save_persistent_id)
File "/usr/lib/python3.11/pickle.py", line 560, in save
f(self, obj) # Call unbound method with explicit self
^^^^^^^^^^^^
File "/usr/lib/python3.11/pickle.py", line 887, in save_tuple
save(element)
File "/home/pipe/.local/lib/python3.11/site-packages/dill/_dill.py", line 388, in save
StockPickler.save(self, obj, save_persistent_id)
File "/usr/lib/python3.11/pickle.py", line 603, in save
self.save_reduce(obj=obj, *rv)
File "/usr/lib/python3.11/pickle.py", line 717, in save_reduce
save(state)
File "/home/pipe/.local/lib/python3.11/site-packages/dill/_dill.py", line 388, in save
StockPickler.save(self, obj, save_persistent_id)
File "/usr/lib/python3.11/pickle.py", line 560, in save
f(self, obj) # Call unbound method with explicit self
^^^^^^^^^^^^
File "/home/pipe/.local/lib/python3.11/site-packages/dill/_dill.py", line 1186, in save_module_dict
StockPickler.save_dict(pickler, obj)
File "/usr/lib/python3.11/pickle.py", line 972, in save_dict
self._batch_setitems(obj.items())
File "/usr/lib/python3.11/pickle.py", line 1003, in _batch_setitems
save(v)
File "/home/pipe/.local/lib/python3.11/site-packages/dill/_dill.py", line 388, in save
StockPickler.save(self, obj, save_persistent_id)
File "/usr/lib/python3.11/pickle.py", line 578, in save
rv = reduce(self.proto)
^^^^^^^^^^^^^^^^^^
TypeError: cannot pickle 'SwigPyObject' object
This issue can be split in two:
- Make ds picklable.
- Depending on how the first point is done, allow the copies living in other processes to read through the same ds, so that we don't end up with many different datasets possibly reading the same data.
I think this issue should focus on the first point; the second one may depend on other issues too.
Thanks!
https://stackoverflow.com/a/3604099/23076275 mentions "Things that are usually not pickable are, for example, sockets, file(handler)s, database connections, and so on". A GDALDataset is a (complicated) file, hence I don't think it is appropriate to make it picklable. Users might pass the dataset name and gdal.Open() it in the worker thread/process. A picklable GDALDataset would have to do that anyway, but it is probably deceptive to do it behind the user's back. Furthermore, in situations where the dataset was created with gdal.GetDriverByName("MEM").Create("", xsize, ysize), that wouldn't even work.
Hence I don't believe this feature request can or should be implemented.
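A minimal sketch of that suggested pattern, assuming the dem.tif from the original report and an illustrative function name read_band_stats: each worker receives only the filename and opens its own dataset, so nothing unpicklable crosses the process boundary. The standard multiprocessing module suffices here, since a plain module-level function pickles fine.

from multiprocessing import Pool
from osgeo import gdal

def read_band_stats(path):
    # Each worker opens its own handle; only the string path is pickled.
    gdal.UseExceptions()
    ds = gdal.Open(path)
    band = ds.GetRasterBand(1)
    return band.ComputeStatistics(False)  # approx_ok=False: exact min/max/mean/stddev

if __name__ == "__main__":
    with Pool(4) as pool:
        print(pool.map(read_band_stats, ["dem.tif"]))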
This seems pretty complex, but I think it could be supported to some extent. Python is dynamically typed, although mypy now lets us add static type checks to our code, and passing file paths as parameters is not the ideal shape for such workflows.
Maybe the point is not to make a dataset picklable; the question could instead become: what is the right way to use datasets with multiprocessing?
Still, opening a dataset multiple times from Python means potentially decoding, reading from disk, and uncompressing the same data more than once.
what is the right way to use datasets with multiprocessing?
Don't do that. It's forbidden by the C++ side of things. A given GDALDataset must be used from a single thread/process. If you have several threads/processes, you need one GDALDataset instance in each one.
Even in C++ we can't do multiprocessing?
You can do multiprocessing, but you need to be careful about how you do it. If you had a file with fixed-length records and wanted to split its processing among several processes, you wouldn't try to use the same FILE* instance in all those processes. You'd open one for each process and operate on distinct areas of the file. A GDALDataset is mostly a complex FILE*. Cf. https://mastodon.social/@EvenRouault/111773113881667298 and https://gdal.org/user/multithreading.html
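To make the FILE* analogy concrete, here is a hedged sketch (dem.tif and block_mean are illustrative, not part of any GDAL API) where each process opens its own handle and reads a disjoint block of rows, i.e. distinct areas of the same file:

from multiprocessing import Pool
from osgeo import gdal

def block_mean(args):
    path, y_off, n_rows = args
    gdal.UseExceptions()
    ds = gdal.Open(path)  # a private handle for this process
    data = ds.GetRasterBand(1).ReadAsArray(0, y_off, ds.RasterXSize, n_rows)
    return float(data.mean())

if __name__ == "__main__":
    gdal.UseExceptions()
    ds = gdal.Open("dem.tif")
    rows, step = ds.RasterYSize, 256
    tasks = [("dem.tif", y, min(step, rows - y)) for y in range(0, rows, step)]
    ds = None  # close the parent handle before spawning workers
    with Pool(4) as pool:
        print(pool.map(block_mean, tasks))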
Well, that is not exactly GDAL multiprocessing. The idea of dataset multiprocessing is to be able to read/edit the same object safely from multiple processes and threads. Maybe GDAL multiprocessing is not actually supported in C++, in which case it would not be possible to do it from Python either.
Maybe it is related to https://github.com/OSGeo/gdal/issues/8448?
Or, what do you think about supporting parallelism in C++? Do you think it could be worth talking about?
Maybe it is related to https://github.com/OSGeo/gdal/issues/8448?
Yes, this is totally related
What do you think about supporting parallelism in C++? Do you think it could be worth talking about?
I believe making GDALDataset and the related API thread-safe (that is, a single instance can be safely used from concurrent threads) is out of reach in practice. It would have to rely on locks, and we would have to be sure locking is in place everywhere it is required. That would require changing all GDAL drivers in many places, and the C++ language offers no compile-time guarantee that you do that correctly. It would also cause users to pay the price of locks even in scenarios where they don't need them. And it would require changes in the core of the GDAL block cache so that RasterIO() can be safely called from multiple threads without a global lock. All in all, the costs and risks of doing all of that are very high compared to the potential benefits.
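For completeness, here is a sketch of the pattern that does work today under the one-instance-per-thread rule stated above (row_mean, _dataset, and dem.tif are illustrative assumptions, not an official API): a threading.local cache gives each thread its own GDALDataset, so concurrent threads never share one.

import threading
from concurrent.futures import ThreadPoolExecutor
from osgeo import gdal

_local = threading.local()

def _dataset(path):
    # Lazily open one handle per thread and reuse it for later tasks.
    if getattr(_local, "ds", None) is None:
        gdal.UseExceptions()
        _local.ds = gdal.Open(path)
    return _local.ds

def row_mean(y):
    ds = _dataset("dem.tif")
    return float(ds.GetRasterBand(1).ReadAsArray(0, y, ds.RasterXSize, 1).mean())

with ThreadPoolExecutor(max_workers=4) as ex:
    # Assumes dem.tif has at least 256 rows.
    means = list(ex.map(row_mean, range(256)))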
Hmm, that sounds more like a job for Rust than for C++...
I still think a workaround for this would be good on the Python side. Python is a dynamic language, with all the problems that implies, but it now supports static type annotations checked with mypy, which seems to be widely used these days, with good reason.
With the current API we cannot use Dataset as a parameter type for functions and use multiprocessing at the same time. I would even consider it a good idea for pickling to simply create new datasets, even if that is not true multiprocessing (there is none in C++ either); the only requirement would be to document this behavior. A sketch of that idea follows below.
Maybe some other workaround that preserves static typing in Python would also be good :)
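A user-level sketch of the workaround floated above, with a hypothetical name (DatasetHandle): a thin wrapper that pickles only the filename and reopens the dataset on the receiving side. It gives a type you can annotate and pass through multiprocessing, but each process still opens, decodes, and reads the file independently, and the reopening is explicit in the wrapper rather than hidden inside GDAL.

from multiprocessing import Pool
from osgeo import gdal

class DatasetHandle:
    def __init__(self, path: str) -> None:
        self.path = path
        gdal.UseExceptions()
        self.ds = gdal.Open(path)

    def __getstate__(self) -> dict:
        # Drop the unpicklable SwigPyObject; keep only the path.
        return {"path": self.path}

    def __setstate__(self, state: dict) -> None:
        # Reopen the dataset in the receiving process.
        self.__init__(state["path"])

def x_size(handle: DatasetHandle) -> int:
    return handle.ds.RasterXSize

if __name__ == "__main__":
    with Pool(2) as pool:
        print(pool.map(x_size, [DatasetHandle("dem.tif")]))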