User context being optional is problematic for Python extensions.
Using Halide-generated code inside Python requires a user_context. For JIT, one is always present. For AOT, it would require specifying the user_context option for every pipeline used inside Python, which proves problematic.
One possible solution would be to have Halide generate code that reads the user context from a thread-local variable when the user_context option is not provided. This would also require propagating the user_context value to the thread-local variable in worker threads when distributing parallel work (for the case where externs call other Halide kernels).
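A minimal sketch of this idea, assuming hypothetical names (`halide_tls_user_context`, `resolve_user_context`, `run_parallel` are illustrative, not actual Halide runtime symbols): generated code prefers an explicit user_context argument and falls back to a thread local, and the parallel invoker installs the parent's value in each worker before running tasks.

```cpp
#include <cassert>
#include <thread>
#include <vector>

// Hypothetical thread-local slot that generated code would consult when no
// explicit user_context argument was compiled in.
thread_local void *halide_tls_user_context = nullptr;

// What generated code might do: prefer the explicit argument, fall back to TLS.
void *resolve_user_context(void *explicit_ucon) {
    return explicit_ucon ? explicit_ucon : halide_tls_user_context;
}

// Sketch of propagating the value to worker threads when distributing
// parallel work: the invoking thread's value is captured, and each worker
// installs it in its own TLS slot before running its task.
void run_parallel(void *parent_ucon, int num_workers,
                  void (*task)(void *ucon, int index)) {
    std::vector<std::thread> workers;
    for (int i = 0; i < num_workers; i++) {
        workers.emplace_back([=]() {
            halide_tls_user_context = parent_ucon;  // propagate to this worker
            task(resolve_user_context(nullptr), i);
        });
    }
    for (auto &w : workers) w.join();
}
```

The key property is that extern stages calling back into other Halide kernels on a worker thread would still find the right context via the thread local, without it appearing in any signature.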
It turns out LLVM supports thread-local variables without extra libraries on ELF-based platforms. Other platforms could be supported with a small OS-specific library.
This mechanism might also make Halide-generated code easier to use, as the choice of whether or not to have a user_context is confusing. Communicating through a thread local between the calling code and its runtime overrides may be more sensible.
Another possible design is to register one or more thread-local variables with our runtime and propagate all of them to worker threads. This would allow multiple runtime overrides to use different thread locals, independent of user context. It requires some care in design, as it likely involves passing a variable number of values from a parallel work invoker to the thread pool implementation.
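One way the registration design could look, as a sketch under assumed names (`TlsSlot`, `register_tls_slot`, and the snapshot/install functions are all hypothetical): each registered slot supplies a getter and setter for its thread local, the invoker snapshots every slot's value, and the thread pool installs the snapshot in each worker.

```cpp
#include <cassert>
#include <thread>
#include <vector>

// Hypothetical registry entry: knows how to read its thread local in the
// invoking thread and write the captured value in a worker thread.
struct TlsSlot {
    void *(*get)();
    void (*set)(void *);
};

static std::vector<TlsSlot> registered_slots;

void register_tls_slot(void *(*get)(), void (*set)(void *)) {
    registered_slots.push_back({get, set});
}

// The parallel work invoker snapshots every registered slot...
std::vector<void *> snapshot_slots() {
    std::vector<void *> values;
    for (const auto &s : registered_slots) values.push_back(s.get());
    return values;
}

// ...and the thread pool installs the snapshot before running tasks.
void install_slots(const std::vector<void *> &values) {
    for (size_t i = 0; i < values.size(); i++) {
        registered_slots[i].set(values[i]);
    }
}

// Example client: one runtime override with its own thread-local state,
// registered independently of any user context.
thread_local void *my_override_state = nullptr;
void *get_override_state() { return my_override_state; }
void set_override_state(void *v) { my_override_state = v; }
```

This illustrates the variable-length problem mentioned above: the snapshot vector's size is only known at registration time, so it has to travel from the invoker to the thread pool alongside the work items.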
This is an area of investigation and is written up here for refinement and tracking.
Another possibility I have considered is to quietly add additional entry points that always accept a user context (ucon), with the suffix _u on the name. This would be simple to do and have minimal overhead, but it feels clunky and awkward.
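For concreteness, a sketch of what the paired entry points might look like (the pipeline name and body are made up; only the shape of the two signatures is the point): the `_u` variant takes the user context explicitly, and the plain entry point forwards with a null context, so existing callers see no change.

```cpp
#include <cassert>

// Hypothetical AOT pipeline entry point that always accepts a user context
// as its first argument. In real generated code, ucon would be threaded
// through to all runtime calls (allocation, error reporting, etc.).
int my_pipeline_u(void *ucon, int *out) {
    (void)ucon;   // stand-in for forwarding ucon to the runtime
    *out = 42;    // stand-in for the actual pipeline body
    return 0;
}

// The plain entry point just forwards with a null context; the overhead is
// one extra call, which the compiler can typically inline away.
int my_pipeline(int *out) {
    return my_pipeline_u(nullptr, out);
}
```

The clunkiness is mostly in the API surface: every pipeline exports two names, and callers must know which one to use.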