Benchmarks

Open Robinlovelace opened this issue 2 years ago • 5 comments

The book would benefit from more benchmarks. People have asked about this, and others have already created great resources comparing different implementations in R and other languages. I'm opening this as a general place to discuss benchmarks. We could at the very least link to existing benchmarking materials.

I'm also thinking it could be good to have a dedicated benchmarking repo built from the ground up. Note: I tried this several years ago with the geobench project. Some ideas (and the name!) from that may be useful: https://github.com/atfutures/geobench

I discussed this today with @Nowosad. If anyone else has thoughts on this and how best to proceed, this could be a good place to comment. Heads-up @urschrei, @paleolimbot.

Robinlovelace avatar Jan 23 '22 11:01 Robinlovelace

Also @kadyb, your feedback is welcome here.

Nowosad avatar Jan 23 '22 11:01 Nowosad

In general, it is a very good idea to create and publish benchmarks because there are many packages ({raster}, {stars}, {terra}, {sf}) in R and users will probably be interested in which is the fastest. Also, some users will be faced with the choice of R vs Python, so I think it's worth including that too.

I created a raster benchmark and the response was noticeable. A few performance issues were also fixed by the package developers. I received some requests to include more Python packages ({PCRaster}, {EarthPy}, etc.). Some people suggested testing whole workflows rather than single tasks, but in my opinion it's better to test single tasks. Also, my supervisor suggested adding C++ and Julia as well, but it seems to me that such a comparison between packages (one-line functions) and low-level languages is not appropriate. I have some ideas written here too, but I haven't had time to implement them and they are not my priority.
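To illustrate what testing single tasks can look like, here is a minimal sketch in Python using only the standard library. It is not code from the raster benchmark: the helper names (`time_task`, `make_points`, `bounding_box`), the synthetic data, and the two tasks are all hypothetical, chosen only to show the pattern of timing each operation separately and reporting a median over several repeats.

```python
import random
import statistics
import time


def time_task(task, repeats=5):
    """Run `task` several times; return the median wall-clock time in seconds."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        task()
        timings.append(time.perf_counter() - start)
    return statistics.median(timings)


def make_points(n, seed=1):
    """Generate n synthetic (x, y) points in the unit square (hypothetical data)."""
    rng = random.Random(seed)
    return [(rng.random(), rng.random()) for _ in range(n)]


def bounding_box(points):
    """Return (xmin, ymin, xmax, ymax) for a list of (x, y) points."""
    xs, ys = zip(*points)
    return min(xs), min(ys), max(xs), max(ys)


points = make_points(10_000)

# Each entry is one isolated task, timed on its own rather than as a workflow.
tasks = {
    "bounding_box": lambda: bounding_box(points),
    "sort_by_x": lambda: sorted(points),
}

results = {name: time_task(task) for name, task in tasks.items()}
for name, seconds in results.items():
    print(f"{name}: {seconds:.6f} s")
```

In a real comparison, each lambda would call the equivalent one-line function from the package under test, and the same input file would be used for every package.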

I recently started a similar benchmark, but for vector data. It is in its early stages and I plan to publish it in February/March. The surprising result is that {terra} is most likely the fastest (compared to {sf} and {geopandas}). I recently did a {terra} vs {sf} survey for vector data and {sf} is the most used.

BTW: I think the user experience is important too, not only the benchmark results. Compared to Python, R offers more features and is generally a simpler language for non-programmers (e.g. less code to write, more automation). But of course Python also has its advantages (e.g. more data types).

kadyb avatar Jan 23 '22 21:01 kadyb

Here is a benchmark I have done of packages ({sf}, {s2}, {terra}, {geos}, {geopandas}) for vector data: https://github.com/kadyb/vector-benchmark, but there is still a lot of room for improvement. I should probably create a Docker image and use a real dataset instead of synthetic data. Based on the results, it looks like {terra} is the fastest, but as you can see I only compared 6 operations. If you have any comments, please let me know!

kadyb avatar Feb 21 '22 21:02 kadyb

Many thanks for the links @kadyb, this is very interesting stuff. I think we can help with a Docker image. Try:

docker run -d -p 8786:8787 -v $(pwd):/home/rstudio/data -e USERID=$UID -e PASSWORD=pw geocompr/geocompr:python

And you should have everything you need to reproduce that. I've taken a look and can say: interesting results! How would you feel about a geocompr/benchmarks repo linked to this project, inspired by your work or directly building on it? This will also be of interest to @michaeldorman, who has done lots of geopy stuff and has started a 'geocompy' resource: https://geocompr.github.io/py/

Robinlovelace avatar Feb 22 '22 13:02 Robinlovelace

Thanks! I've never used Docker so it will take me a while to figure it out.

> How would you feel about a geocompr/benchmarks repo linked to this project inspired by your work or directly building on it?

Great, I'm in favor. I'm also open to contributions and further development of this project. Nevertheless, I have to wait to fully disseminate the results until all package developers have had a chance to comment. I've just accepted suggestions and a patch from the {geopandas} developers and it completely changes the results, but I don't expect such changes from the R side.

Here are some more benchmarks used in the packages: pyogrio, geopandas, and shapely.

kadyb avatar Feb 23 '22 14:02 kadyb

https://github.com/kadyb/OGH2022

Nowosad avatar Oct 10 '22 07:10 Nowosad

@Robinlovelace do you plan to take any action on this issue?

Nowosad avatar Feb 09 '23 17:02 Nowosad

> @Robinlovelace do you plan to take any action on this issue?

No, out of scope, closing for now.

Robinlovelace avatar Nov 24 '23 05:11 Robinlovelace