
Speed up fill_overpass_data

Open FlxPo opened this issue 2 years ago • 2 comments

PR for issue #247 (this is my first PR ever, tell me if there's anything missing!)

FlxPo avatar Nov 02 '21 13:11 FlxPo

@FlxPo Thanks for the PR. It's definitely a good idea, but unfortunately doesn't generally speed things up at all in the full context of the fill_overpass_data() function:

library (osmdata)
#> Data (c) OpenStreetMap contributors, ODbL 1.0. https://www.openstreetmap.org/copyright
doc <- "test.osm"

if (!file.exists (doc)) {
    q <- opq ("soest germany") |> # or wherever, but that's around 130,000 lines so big enough
        add_osm_feature (key="highway")
    osmdata_xml (q, filename="test.osm", quiet = FALSE)
}

f1 <- function (obj, doc) {
    doc <- xml2::read_xml (doc)
    obj <- get_metadata (obj, doc)
    doc <- as.character (doc)
    list (obj, doc)
}
f2 <- function (obj, doc) {
    nch <- file.info (doc)$size
    doc_xml <- xml2::read_xml (doc)
    obj <- get_metadata (obj, doc_xml)
    doc <- readChar (doc, nch)
    list (obj, doc)
}

obj <- osmdata ()
bench::mark (f1 = f1 (obj, doc),
             f2 = f2 (obj, doc))
#> Warning: Some expressions had a GC in every iteration; so filtering is disabled.
#> # A tibble: 2 × 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 f1            163ms    164ms      5.95        NA     0   
#> 2 f2            252ms    426ms      2.35        NA     2.35

Created on 2021-11-04 by the reprex package (v2.0.1.9000)

The key statistic in the table is the "median" time, which is more than twice as slow via this PR. In spite of that, I do suspect you're on to something here. The reason it is so much slower is that you're effectively reading the document twice: once via read_xml, and then a second time via readChar. In your case, the read_xml is only used to extract the metadata. That suggests we should be able to re-implement extraction of metadata directly from a readChar version, and then we'd really be getting somewhere.
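To illustrate that direction (this is not osmdata's actual implementation, just a sketch): Overpass responses carry their metadata in attributes near the top of the file, so in principle a plain text scan can pull them out without a full XML parse. The snippet below uses a made-up, Overpass-style header and a simple grep; the file path and contents are hypothetical.

```shell
# Hypothetical Overpass-style header; real responses contain a similar
# <meta osm_base="..."/> line near the top of the file.
cat > /tmp/snippet.osm <<'EOF'
<?xml version="1.0" encoding="UTF-8"?>
<osm version="0.6" generator="Overpass API">
<note>The data included in this document is from www.openstreetmap.org.</note>
<meta osm_base="2021-11-04T08:00:00Z"/>
</osm>
EOF

# Extract the timestamp attribute with a plain text scan -- no XML parser needed.
grep -o 'osm_base="[^"]*"' /tmp/snippet.osm
# -> osm_base="2021-11-04T08:00:00Z"
```

The same pattern matching could be done on the string returned by readChar (for example with regmatches in R), so the metadata would come for free from the single read.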

That said, the extraction of metadata is currently done entirely via XML nodes. Hand coding a version for just the bits we need here would be possible, but could only really be done efficiently in C/C++, and even then would verge on re-inventing the wheel. Moreover, this reading step is a relatively small portion of the total processing time for large queries, so it would be a lot of work for very little gain. For those reasons, I think this PR should just be closed. Sorry about that, especially acknowledging that it's your first PR, but rest assured that it really does reflect a good idea - just not one that is ready to be implemented here.

I'll wait to hear back from you before closing. Thanks

mpadge avatar Nov 04 '21 08:11 mpadge

Thanks for testing, I should have provided a benchmark!

I'm surprised by your results; I get the opposite conclusion when I run your code (slightly modified to avoid a weird "Error: Each result must equal the first result: f1 does not equal f2" error):

library (osmdata)
#> Data (c) OpenStreetMap contributors, ODbL 1.0. https://www.openstreetmap.org/copyright
doc <- "test.osm"

if (!file.exists (doc)) {
  q <- opq ("soest germany") |> # or wherever, but that's around 130,000 lines so big enough
    add_osm_feature (key="highway")
  osmdata_xml (q, filename="test.osm", quiet = FALSE)
}

f1 <- function (obj, doc) {
  doc <- xml2::read_xml (doc)
  obj <- get_metadata (obj, doc)
  doc <- as.character (doc)
  res <- list (obj, doc)
  return(TRUE)
}
f2 <- function (obj, doc) {
  nch <- file.info (doc)$size
  doc_xml <- xml2::read_xml (doc)
  obj <- get_metadata (obj, doc_xml)
  doc <- readChar (doc, nch)
  res <- list(obj, doc)
  return(TRUE)
}

obj <- osmdata ()
bench::mark (f1 = f1 (obj, doc),
             f2 = f2 (obj, doc))

#> # A tibble: 2 × 13
#>  expression      min  median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time result memory time  gc   
#>  <bch:expr> <bch:tm> <bch:t>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm> <list> <list> <lis> <lis>
#> 1 f1            852ms   852ms      1.17  424.35KB        0     1     0      852ms <lgl ~ <Rpro~ <ben~ <tib~
#> 2 f2            135ms   143ms      7.02    4.45MB        0     4     0      570ms <lgl ~ <Rpro~ <ben~ <tib~

I'm running this with R 4.1.1 (Windows 64-bit, 6 cores, 32 GB of RAM).

I also tried f1 and f2 on a much bigger file (360 MB): f2 completes after a few seconds, but f1 runs forever. I tried to use bench::mark to get precise estimates, but eventually killed it because I need the processing power for other tasks at the moment. I will try to complete the benchmark another time.

FlxPo avatar Nov 22 '21 09:11 FlxPo