eis_toolkit icon indicating copy to clipboard operation
eis_toolkit copied to clipboard

384 optimize distance computation

Open okolekar opened this issue 1 year ago • 13 comments

Hello, I have added the function "distance_to_anomaly_gdal_ComputeProximity" to reduce the runtime for distance computation. The new function "distance_to_anomaly_gdal_ComputeProximity" is a C++ based API function and hence is faster as compared to "distance_to_anomaly" (which was using the point based scheme) and "distance_to_anomaly_gdal" (which was using the python API for distance computation) function. All the three produce the exact same result which is verified in the python notebook distance_to_anomaly.ipynb.

Below I have attached some graphs inorder showcase the reduction in time achieved by the new function.

POINT TO NOTE: I have kept the precommits off due to errors.

Thankyou.

image

okolekar avatar Sep 05 '24 07:09 okolekar

I will do a more in-depth review later but first please add a test to check the implemented functions! See e.g. tests/raster_processing/test_distance_to_anomaly.py for inspiration on how to implement tests or other files in the tests/ directory. The tests are run by pytest. The purpose of the test(s) is to mainly check that the function works with some pre-defined input and the result is as expected (type is correct, value is sensible, etc.). I see you already added examples of use in the notebook but those are not checked automatically on Github actions so it does not replace actual tests.

nialov avatar Sep 05 '24 13:09 nialov

@nialov , do you have time to review this sometime this month? It'd be better if you can, but I can too if needed.

nmaarnio avatar Sep 20 '24 09:09 nmaarnio

I will try to review next week @nmaarnio though I can not promise it. Might get delayed to 7.10. due to work trip.

Based on a preliminary look, as this uses the GDAL API, I can foresee problems with Windows which will delay merging.

@okolekar The tests (All checks have failed) are not running successfully, see e.g.:

https://github.com/GispoCoding/eis_toolkit/actions/runs/10809648448/job/30251164407?pr=423

You can check the errors and try to see if you can get them fixed. I can help with this during review but as said before that might take time.

nialov avatar Sep 20 '24 11:09 nialov

Hello to all, I checked the error message @nialov. It appears to me that the test fails due to the ---> ModuleNotFoundError: No module named '_gdal_array' which is a problem with GDAL. This is due to the GDAL library. The test passes at my end on my computer. What I can do is :- I can add the function @pytest.mark.xfail( sys.platform == "win32", reason="GDAL utilities are not available on Windows.", raises=ModuleNotFoundError) Which essentially skips the testing.

Regards! Omkar Kolekar

okolekar avatar Sep 20 '24 12:09 okolekar

Hello to all, I checked the error message @nialov. It appears to me that the test fails due to the ---> ModuleNotFoundError: No module named '_gdal_array' which is a problem with GDAL. This is due to the GDAL library. The test passes at my end on my computer. What I can do is :- I can add the function @pytest.mark.xfail( sys.platform == "win32", reason="GDAL utilities are not available on Windows.", raises=ModuleNotFoundError) Which essentially skips the testing.

Regards! Omkar Kolekar

Do you use a Windows computer and have you installed eis_toolkit with conda, pip or poetry (or some other method)? If the problem is that Windows gdal library does not have the same bindings as linux, then the optimization is not available for Windows users. The target audience probably mostly uses Windows.

nialov avatar Sep 23 '24 05:09 nialov

Hello to all, I checked the error message @nialov. It appears to me that the test fails due to the ---> ModuleNotFoundError: No module named '_gdal_array' which is a problem with GDAL. This is due to the GDAL library. The test passes at my end on my computer. What I can do is :- I can add the function @pytest.mark.xfail( sys.platform == "win32", reason="GDAL utilities are not available on Windows.", raises=ModuleNotFoundError) Which essentially skips the testing. Regards! Omkar Kolekar

Do you use a Windows computer and have you installed eis_toolkit with conda, pip or poetry (or some other method)? If the problem is that Windows gdal library does not have the same bindings as linux, then the optimization is not available for Windows users. The target audience probably mostly uses Windows.

I do use a Windows machine with a conda environment but for development purposes, I have not installed eis_toolkit in this environment. If this still is a problem, I have an idea: we can create a C++ library/function to calculate the distance to anomaly and call this function in python program. The only issue would be to understand how gdal_ComputeProximity works. Warm Regards, Omkar

okolekar avatar Sep 24 '24 07:09 okolekar

I think the primary question here is whythe eis_toolkit environment does not have _gdal_array importable in the Linux (and probably Windows environemnt). What version of gdal do you use? To check version:

➜ gdalinfo --version
GDAL 3.9.2, released 2024/08/13

You should try to get your code to work in the environment defined in environment.yaml or in the eis_toolkit environment installed with conda. See instructions/dev_setup_without_docker_with_conda.md and if it does not work in the environment, try to pinpoint why exactly, i.e., what is the difference between your working environment and environments of eis_toolkit.

nialov avatar Sep 24 '24 10:09 nialov

Also I misspoke earlier about Windows difficulties, I was correlating tests failing with issues on Windows platform but the checks here on GitHub run on linux (Ubuntu) only currently. So I suppose the issue is the Linux environment which might be difficult for you to solve if you on Windows. I can take a look but it will sadly get delayed to 7.10. or onwards

nialov avatar Sep 24 '24 11:09 nialov

Also I misspoke earlier about Windows difficulties, I was correlating tests failing with issues on Windows platform but the checks here on GitHub run on linux (Ubuntu) only currently. So I suppose the issue is the Linux environment which might be difficult for you to solve if you on Windows. I can take a look but it will sadly get delayed to 7.10. or onwards

I checked the GDAL version I use gdalinfo --version GDAL 3.6.2, released 2023/01/02 I do work on Windows OS. I need to check the possibility to work on Linux (Ubuntu).

okolekar avatar Sep 24 '24 11:09 okolekar

This branch is based on a somewhat old version of eis_toolkit. Can you do a merge and push it. Run the following commands in the repository with your edits but make sure everything is committed first (no unstaged changes):

# Check that upstream points to GispoCoding/eis_toolkit:
git remote -v
# If it does not or it does not exist you can report back here. Otherwise, continue:
git fetch upstream
git merge upstream/master
# origin should point to okolekar/eis_toolkit
git push origin

You should not run into any conflicts but if the merge is not completed successfully and you get a dirty git status, you can do git merge --abort to abort and report back here.

nialov avatar Sep 24 '24 12:09 nialov

The issue seems to be that gdal is not installed with numpy bindings when installed with poetry, disabling some raster behavior. I believe I have ran into this before. Your tests run successfully on both Linux and Windows when using conda to install the environment.

nialov avatar Sep 24 '24 13:09 nialov

This branch is based on a somewhat old version of eis_toolkit. Can you do a merge and push it. Run the following commands in the repository with your edits but make sure everything is committed first (no unstaged changes):

# Check that upstream points to GispoCoding/eis_toolkit:
git remote -v
# If it does not or it does not exist you can report back here. Otherwise, continue:
git fetch upstream
git merge upstream/master
# origin should point to okolekar/eis_toolkit
git push origin

You should not run into any conflicts but if the merge is not completed successfully and you get a dirty git status, you can do git merge --abort to abort and report back here.

My merge is completed without any conflicts.

okolekar avatar Sep 24 '24 13:09 okolekar

I will continue this after 7.10. There might be ways to circumvent the poetry environment issue but I am going to guess that if the code is used with the current imports it cannot be run in the poetry environment but works on conda across platforms (and with pip).

nialov avatar Sep 27 '24 12:09 nialov

The code in terms of functionality seems to be working well in tests and in the notebook. My comments mostly relate to style and trying to minimize code duplication, especially in tests. If you have the interest, you could try to refactor some of the code which is identical in functions _distance_to_anomaly_gdal_ComputeProximity and _distance_to_anomaly so that same code is used only once.

To avoid repetition of code, I have created a function named '_validate_threshold_criteria' that contains the common code from both the functions ('_distance_to_anomaly_gdal_ComputeProximity' and '_distance_to_anomaly') and both the functions call '_validate_threshold_criteria' internally.

okolekar avatar Oct 16 '24 08:10 okolekar

I suppose this is ready for review again? I will take a look in a week (or two).

nialov avatar Oct 21 '24 11:10 nialov

I suppose this is ready for review again? I will take a look in a week (or two).

Yes, it is ready for review again. Thank you for the support. I have also taken care of precommits this time, as I have solved the issues with the precommits.

okolekar avatar Oct 21 '24 11:10 okolekar

This PR needs to be merged this week if we want to include it in the released planned for Friday this week. If Nikolas is busy, I might jump in and do a review tomorrow so that testers will be able to run the new version before progress meet.

nmaarnio avatar Oct 21 '24 11:10 nmaarnio

This PR needs to be merged this week if we want to include it in the released planned for Friday this week. If Nikolas is busy, I might jump in and do a review tomorrow so that testers will be able to run the new version before progress meet.

Sound okay. If there are any changes needed, I am available to modify them as soon as possible to ensure timely delivery of the code. Also, a point to note here is that the changes were mainly cosmetic.

okolekar avatar Oct 21 '24 11:10 okolekar

Hi @okolekar . I only now realized you have been optimizing distance_to_anomaly here, notdistance_computation. The issue and PR name refer to distance_computation which is the vector processing tool that the original implementation of distance_computation was using under the hood. Did you use any time to think about optimizing distance_computation, or only distance_to_anomaly so far?

Either way, it's of course good that you have been optimizing this tool since it's used a lot. However, we direly need also distance_computation optimized since it's a commonly used tool in MPM. Optimization of distance_computation would also optimize vector_density.

When it comes to actual review of this PR, you should mark the failing test functions to fail if platform is not Windows, so like this: @pytest.mark.xfail( sys.platform != "win32", reason="gdal_array available only on Windows.", raises=ModuleNotFoundError ) Also, I think we could comment out the old distance_to_anomaly_gdal tool if it is not used, the module has now a lot of tools that might confuse users.

nmaarnio avatar Oct 22 '24 08:10 nmaarnio

Hi @okolekar . I only now realized you have been optimizing distance_to_anomaly here, notdistance_computation. The issue and PR name refer to distance_computation which is the vector processing tool that the original implementation of distance_computation was using under the hood. Did you use any time to think about optimizing distance_computation, or only distance_to_anomaly so far?

Either way, it's of course good that you have been optimizing this tool since it's used a lot. However, we direly need also distance_computation optimized since it's a commonly used tool in MPM. Optimization of distance_computation would also optimize vector_density.

When it comes to actual review of this PR, you should mark the failing test functions to fail if platform is not Windows, so like this: @pytest.mark.xfail( sys.platform != "win32", reason="gdal_array available only on Windows.", raises=ModuleNotFoundError ) Also, I think we could comment out the old distance_to_anomaly_gdal tool if it is not used, the module has now a lot of tools that might confuse users.

Hi @nmaarnio, I was asked to optimize the distance_to_anomaly only. However, I can start with that tool as well. But it will take some time.

Regarding the PR, will upload the suggested changes in an hour or two. Thanks!

okolekar avatar Oct 22 '24 09:10 okolekar

Okay. I think there has been some miscommunication at some point. Anyway, I will rename this PR and relink the issue as this does not concern distance_computation

nmaarnio avatar Oct 22 '24 09:10 nmaarnio

@nmaarnio I have made the requested changes. Now I will shift my focus to optimize the Distance_computation in the vector processing tool.

okolekar avatar Oct 22 '24 11:10 okolekar

And sorry @nialov for bypassing you in the review process. There is just some pressure to get this, among some other features/improvements, merged before the next progress meeting. You are of course free to review the implementation when you have time and make improvement suggestions even after this version has been merged.

nmaarnio avatar Oct 23 '24 07:10 nmaarnio

Hey, no problem @nmaarnio . The pull request was quite fine already with tests confirming the functionality so second review would not have been too important.

Also to note that the gdal_array error does not work due to poetry, independent of platform. So it would work if the test installed with pip on linux. Of course poetry is only used on Linux so disabling it on that platform works

nialov avatar Oct 24 '24 06:10 nialov