grass.script: More carefully set up env in create_project
This changes the condition used to create the runtime environment in create_project so that the environment is set up not only when GISBASE is not set, but also when GISBASE is not in PATH, which (at least at this point) indicates that no tools can run, so a later call to g.proj (and technically the potential g.message call before that) would fail.
This helps when GISBASE is set, even correctly, for some reason, but PATH is not. This may happen when someone is trying to set GISBASE in some special way before GRASS actually runs. I suspect this is happening on Windows, but it is hard to tell without reading the setup scripts in detail (when running on Windows, you could print the environment and see). However, judging from @echoix's experiments in #5717, this may be unrelated to the issues with create_project. In any case, before this PR, the following results in a traceback, while with this PR it works.
```sh
GISBASE="/.../dist.x86_64-pc-linux-gnu/" python -c "import grass.script as gs; import tempfile; d = tempfile.TemporaryDirectory(); gs.create_project(d.name, 'project', epsg=3358); d.cleanup();"
```
```
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  ...
    raise OSError(_("Cannot find the executable {0}").format(args[0]))
OSError: Cannot find the executable g.proj
```
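For context, here is a minimal sketch of the kind of condition this PR tightens; the helper name and the exact PATH check are illustrative, not the actual code in grass.script:

```python
# Illustrative only: decide whether a runtime environment needs to be assembled
# before calling g.proj. The real check in grass.script may differ in detail.
import os


def _needs_runtime_setup(env):
    gisbase = env.get("GISBASE")
    if not gisbase:
        return True  # previous behavior: set up only when GISBASE is missing
    # New case: GISBASE is set, but its executables are not reachable via PATH,
    # so g.proj (and g.message) cannot be found.
    path_entries = env.get("PATH", "").split(os.pathsep)
    return os.path.join(gisbase, "bin") not in path_entries
```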
The env setup is done only when needed, so a simple XY project does not trigger creation of the environment, which makes the call more lightweight and simply faster (a rough sketch of this structure follows the table below).
| code variant | time | command |
|---|---|---|
| original create_project | 558 usec per loop | python -m timeit -n 1000 -- "import grass.script as gs; import tempfile; d = tempfile.TemporaryDirectory(); gs.create_project(d.name, 'project'); d.cleanup();" |
| new create_project | 365 usec per loop | python -m timeit -n 1000 -- "import grass.script as gs; import tempfile; d = tempfile.TemporaryDirectory(); gs.create_project(d.name, 'project'); d.cleanup();" |
| low-level calls | 309 usec per loop | python -m timeit -n 1000 -- "import grass.script as gs; import tempfile; d = tempfile.TemporaryDirectory(); gs.core._create_location_xy(d.name, 'project'); gs.core._set_location_description(d.name, 'project', None); d.cleanup();" |
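A rough outline of the structure behind these numbers, assuming the shape of the new create_project; the helper names are taken from the commands above, and the g.proj invocation is simplified, not the actual code:

```python
# Illustrative outline, not the actual implementation in grass.script.
import os

import grass.script as gs


def create_project_sketch(path, name, epsg=None):
    # The XY skeleton is written directly: no subprocess, no env setup.
    gs.core._create_location_xy(path, name)
    gs.core._set_location_description(path, name, None)
    if epsg:
        # Only a CRS request needs g.proj, and therefore a runtime environment.
        env = os.environ.copy()  # stands in for the reusable setup helper
        gs.run_command("g.proj", epsg=epsg, env=env)  # simplified call
```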
With the faster create_project, I replaced the old gs.core._create_location_xy calls in tests with plain gs.create_project calls.
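For illustration, the change in the tests looks roughly like this (the low-level helpers are the same ones shown in the timing table; names like tmp_path and the project names are placeholders):

```python
import tempfile

import grass.script as gs

tmp_path = tempfile.mkdtemp()

# Before: private low-level helpers, used only because create_project was slower.
gs.core._create_location_xy(tmp_path, "project_old")
gs.core._set_location_description(tmp_path, "project_old", None)

# After: the public API, now lightweight enough for test setup.
gs.create_project(tmp_path, "project_new")
```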
The code that ensures the runtime environment is moved to grass.script.setup, broken down, and made reusable. The difference for its use in create_project is that os.environ is now always copied, even if it already has GISBASE (before, we saved that copy operation when GISBASE was already present). The handling of env, however, is now consistent with the other functions in grass.script.setup.
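A minimal sketch of what such a reusable helper can look like; the function name and the way GISBASE is obtained are assumptions, not the actual grass.script.setup code:

```python
# Assumed shape only; the real helper in grass.script.setup may differ.
import os


def ensure_runtime_env(env=None):
    """Return a copy of the environment prepared for running GRASS tools."""
    env = dict(os.environ if env is None else env)  # always work on a copy
    gisbase = env.get("GISBASE", "/path/to/grass")  # placeholder fallback
    env["GISBASE"] = gisbase
    # Make sure executables such as g.proj are reachable through PATH.
    grass_paths = [os.path.join(gisbase, "bin"), os.path.join(gisbase, "scripts")]
    env["PATH"] = os.pathsep.join(grass_paths + [env.get("PATH", "")])
    return env
```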
Also, do you want me to redo some tests/exploration with test + pytest, especially on Windows, and also randomize the order to see if this helps the global situation discussed in #4480?
(I.e., whether I can get further than what was already ready in #5584.)
About redoing the tests: yes, though perhaps better done after merging this, but I would definitely appreciate some help with debugging create_project on Windows. For starters, I don't know if we have a test which would actually test the same thing you and @landam did manually when you both got to g.proj not being found.
I need more time to respond to the rest.
What's the real difference between gs.create_project() and gs.setup.init()? ...From the adjustments made in the tests, it seems one would always need to use both (aka boilerplate)... why would we need another layer above to patch the shortcomings of gs.setup.init()?
You need to have a project (data) and an active session using a project (or more precisely: env variables that make the runtime work and a connection to a project). gs.setup.init() could create a project if we add additional parameters, or it could work on a delayed-start basis and have the user set up the project later:
```python
# One-step init + create
gs.setup.init("~/myproject", crs="some PROJ-accepted string")  # crs implies create

# Delayed CRS setup
session = gs.setup.init("~/myproject")  # creates an empty XY project if dir does not exist
session.set_crs("some PROJ-accepted string")
```
The topic of performance was brought up in the PR's description. Even though currently, I think (as I'm not able to properly measure and profile due to how all of GRASS + Python currently works, or at least not able to do it easily), the biggest slowdowns/bottlenecks are the countless process calls (launching a binary)...
I measured these times way back in an experiment with small Python processes. While I don't have the code or the results anymore, I'm sure it is still an issue. Tools written in Python, even when executed directly from the command line, invoke g.parser to deal with their command line parameters. In #2923, I'm even adding another subprocess call to deal with command line parameters before the tool is executed. This is not ideal, but usually more stable and straightforward than going through ctypes. Ultimately, having direct library-level access to the tools (in C or whatever) would be best, but it is worth seeing what the best options for the subprocesses are now. From the API perspective, that's what #2923 is doing. From the performance perspective, we are mostly limited by the subprocess call cost, which is out of our hands, but from some experiment at some point, it looked like Linux is much faster than Windows at subprocess execution, so perhaps there is still room for improvement. Let's have this discussion some time later.
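To give a rough sense of that baseline cost, here is a standalone snippet (not from the PR) that times an essentially empty child process:

```python
# Illustrative only: measure the raw cost of launching a Python subprocess,
# which sets a floor for any tool invocation that goes through g.parser.
import subprocess
import sys
import timeit

runs = 20
total = timeit.timeit(
    lambda: subprocess.run([sys.executable, "-c", "pass"], check=True),
    number=runs,
)
print(f"~{total / runs * 1000:.1f} ms per subprocess launch")
```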
...What is the "cost" of the imports we do, ending up "importing the world"? It's a bit easier to measure, as there's a Python CLI flag to trace the import times...
We did some work to avoid ctypes for stability reasons, but the Python library was originally minimal and started in the time when from x import * was the favorite style, so a lot of that legacy is still there. I'm experimenting with lazy imports through module-level __getattr__ in #2923, so that's perhaps a way to reduce some of the import times. It may not make sense for the grass.script imports from core.py, but it may make sense for other things.
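For reference, a minimal sketch of the module-level __getattr__ pattern (PEP 562); the submodule names are placeholders, not the actual layout used in #2923:

```python
# In a package's __init__.py: submodules are imported only the first time
# the corresponding attribute is accessed, keeping the initial import cheap.
import importlib

_lazy_submodules = {"raster", "vector"}  # placeholder names


def __getattr__(name):
    if name in _lazy_submodules:
        module = importlib.import_module(f"{__name__}.{name}")
        globals()[name] = module  # cache, so __getattr__ is not called again
        return module
    raise AttributeError(f"module {__name__!r} has no attribute {name!r}")
```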
I already know from older exploration that when one of our modules imports one of our ctypes wrappers, the import times increase significantly...
The complexity of ctypes is one of the reasons why subprocesses, while far from perfect, still seem to be somewhat okay. :-)
Looks promising, and I'm eager to try it out.