titiler icon indicating copy to clipboard operation
titiler copied to clipboard

gdal_config option doesn't work for threaded readers

Open vincentsarago opened this issue 4 years ago • 4 comments

We already tried to solve this using apiroute_factory https://developmentseed.org/titiler/concepts/APIRoute_and_environment_variables/

from fastapi import FastAPI, APIRouter
from rasterio._env import get_gdal_config
from titiler.custom.routing import apiroute_factory
from titiler.endpoints.factory import TilerFactory

app = FastAPI()
route_class = apiroute_factory({"GDAL_DISABLE_READDIR_ON_OPEN": "FALSE"})
router = APIRouter(route_class=route_class)

tiler = TilerFactory(router=router)

While this approach worked for a while, it didn't seems to work in python 3.8 and felt quite of a hack.

We then switched and added gdal_config options in the factory (ref #170). I though this worked ... but I was wrong.

In FastAPI, when you define a function with a simple def myfunction(..., fastapi will use starlette's run_in_threadpool to run the function (so the API isn't blocked). This makes the function to be run in a thread ... which isn't the MainThread, and this is important. In Rasterio, when using with rasterio.Env( block, rasterio will check if we are running it in the MainThread or not (https://github.com/mapbox/rasterio/blob/master/rasterio/_env.pyx#L158-L161) and then use different function to set the GDAL config.

The problem here seems that because we use simple def we run the endpoint function in a sub Thread (ThreadPoolExecutor-X_X) and then the env cannot be forwarded to threads within the threads... (yes threads within threads seems bad anyway).

It's kinda hard for me to explain so here is an example

from concurrent import futures
import rasterio
from rasterio._env import get_gdal_config
import threading

from starlette.concurrency import run_in_threadpool

def f(r=None):
    return get_gdal_config("GDAL_DISABLE_READDIR_ON_OPEN"), str(threading.current_thread())


print()
print("1 - simple - OK")
with rasterio.Env(GDAL_DISABLE_READDIR_ON_OPEN="FALSE"):
    print(f())

print()
print("2 - async simple - OK")
with rasterio.Env(GDAL_DISABLE_READDIR_ON_OPEN="FALSE"):
    print(await run_in_threadpool(f))

def g():
    print("Where am I: " + str(threading.current_thread())) # print what thread is used when calling rasterio.Env
    with rasterio.Env(GDAL_DISABLE_READDIR_ON_OPEN="FALSE"):
        with futures.ThreadPoolExecutor() as executor:
            return list(executor.map(f, range(1)))[0]

print()
print("3 - simple multi threads - OK")
print(g())

print()
print("4 - async multithread - NOK") 
print(await run_in_threadpool(g))
1 - simple - OK
('FALSE', '<_MainThread(MainThread, started 4487112128)>')

2 - async simple - OK
('FALSE', '<Thread(ThreadPoolExecutor-0_13, started daemon 123145455685632)>')

3 - simple multi threads - OK
Where am I: <_MainThread(MainThread, started 4487112128)>
('FALSE', '<Thread(ThreadPoolExecutor-23_0, started daemon 123145492471808)>')

4 - async multithread - NOK
Where am I: <Thread(ThreadPoolExecutor-0_4, started daemon 123145408389120)>
('EMPTY_DIR', '<Thread(ThreadPoolExecutor-24_0, started daemon 123145492471808)>')

☝️ where we use run_in_threadpool we simulate the actual setting in titiler (def), fastAPI doesn't use run_in_threadpool when using async def.

Fix ?

  1. use async def for functions definition

If you are using a third party library that communicates with something (a database, an API, the file system, etc) and doesn't have support for using await, (this is currently the case for most database libraries), then declare your path operation functions as normally, with just def, like:

FastAPI docs seems to say we shouldn't (and I think @geospatial-jeff told the same in the past)

  1. see if we can change rasterio (🙅)

Not an expert here, but I guess there is a good reason to use CPLSetConfigOption only in MainThread (cc @sgillies, sorry for the ping but it's only a FYI)

ref: https://github.com/mapbox/rasterio/issues/1012 & https://github.com/mapbox/rasterio/pull/997

  1. change rio-tiler reader to forward gdal config 🤷‍♂️

Seems the safest choice 😭

vincentsarago avatar Jan 05 '21 22:01 vincentsarago

possible solution:

do not use rasterio.Env in titiler but use something like

from contextlib import contextmanager

@contextmanager
def Env(**env):
    """Temporarily set environment variables inside the context manager and
    fully restore previous environment afterwards

    from: https://gist.github.com/igniteflow/7267431#gistcomment-2553451
    """
    env = env or {}

    original_env = {key: os.getenv(key) for key in env}
    os.environ.update(env)
    try:
        yield
    finally:
        for key, value in original_env.items():
            if value is None:
                del os.environ[key]
            else:
                os.environ[key] = value

taken from https://gist.github.com/igniteflow/7267431#gistcomment-2553451#185

when entering with Env(self.gdal_config), this will update the environment variable and restore on exist. Is this safe? 🤷‍♂️

vincentsarago avatar Jan 05 '21 22:01 vincentsarago

Just to be clear the problem is because we are setting env withing threads (unintentionally).

def f():
    return get_gdal_config("ENV")

def g():
    with rasterio.Env(ENV="FALSE"):
        with futures.ThreadPoolExecutor() as executor:
            r = executor.submit(f)
            return r.result()

print(g())
>> "FALSE"

with futures.ThreadPoolExecutor() as executor:
    r = executor.submit(g)
    print(r.result())

>> None

vincentsarago avatar Jan 05 '21 22:01 vincentsarago

(yes threads within threads seems bad anyway)

I don't think you can do this. Really what's happening is there are two thread pool executors. The first is used by FastAPI to execute the def route handlers (this is the event loop's default executor). The second is instantiated by rio-tiler in the tasks module. It's not quite thread within a thread but I think you are still right that the problem is because the thread is not the main thread which is true in this case regardless.

I think another good solution is to stop using rasterio.Env in favor of standard environment variables (I'm pretty sure this works). I know this is less koshur but I think @vincentsarago you are correct, if the problem is because rasterio.Env doesn't span threads with how the code is executed by FastAPI, we are left with either a hacky and potentially unsafe solution, hamstringing the performance of the app by switching to async def, or removing rasterio.Env entirely (unless we can change rasterio). Injecting rasterio.Env into rio-tiler seems like the best of the worst options but its quite contrived.

geospatial-jeff avatar Jan 06 '21 00:01 geospatial-jeff

I think another good solution is to stop using rasterio.Env in favor of standard environment variables (I'm pretty sure this works).

Yes this will work but won't allow us to deploy application with different supports, e.g. :

  • NAIP: we need to set {"AWS_REQUEST_PAYER": "requester"},
  • Landsat 8 (aws) we need to set {"GDAL_DISABLE_READDIR_ON_OPEN": "FALSE"} so we can use external overviews.

Ideally, yes global env variable would be enought but won't give flexibility 🤷‍♂️

vincentsarago avatar Jan 06 '21 14:01 vincentsarago