exist icon indicating copy to clipboard operation
exist copied to clipboard

Terrible performance in non-distinct values queries

Open adamretter opened this issue 4 years ago • 12 comments

As raised in the xml.com Slack channel by Daniel Zimmel, the following queries have very poor performance on eXist-db (5.3.0-SNAPSHOT) when compared to BaseX (9.4.5) and Saxon (HE 10.3).

  1. FLWOR version Contributed by Daniel Zimmel:
let $seq := (1 to 50000, 1)
let $non-distinct-values := for $v in distinct-values($seq) return $v[count(index-of($seq,$v)) > 1]
return
  $non-distinct-values

BaseX: 21 Seconds Saxon: 24 Seconds eXist-db: 9 Minutes


  1. Group by version As contributed by Gerrit Imsieke.
for $h in (1 to 50000, 1)
let $k := $h
group by $k
where count($h) gt 1
return
  $h[1]

BaseX: 10 Milliseconds eXist-db: 65 Milliseconds Saxon: 1.2 Seconds


  1. Predicate version As contributed by Adam Retter.
let $seq := (1 to 50000, 1)
return
  distinct-values($seq[count(index-of($seq, .)) gt 1])

Saxon: 20 Seconds BaseX: 24 Seconds eXist-db: 12 Minutes


  1. Map version As contributed by Mads Hansen
declare namespace map =
    "http://www.w3.org/2005/xpath-functions/map";
let $seq := (1 to 50000, 1)
let $map := map:merge(($seq ! map:entry(string(.), .)), map{"duplicates":"combine"})
return
  map:keys($map) ! tail(map:get($map, .))

BaseX: 30 Milliseconds Saxon: 2 Seconds eXist-db: Unsupported on eXist-db as map:merge#2 is not implemented!


  1. Succinct predicate version As contributed by Joel Kalvesmaki
let $seq := (1 to 50000, 1)
return
  $seq[index-of($seq, .)[2]]

Saxon: 20 Seconds BaseX: 35 Seconds eXist-db: 11 Minutes


  1. Map fold-left version 1 As contributed by Liam Quin
declare namespace map =
    "http://www.w3.org/2005/xpath-functions/map";
let $minocc := 2, $seq := (1 to 50000, 1 to 17, 12 to 30),
$counts := fold-left( $seq, map { }, function($sofar as item()*, $this as xs:integer) {
  let $seen := ($sofar($this), 0)[1]
  return
    map:merge((map { $this : $seen +1 }, $sofar))
})
return
  map:keys($counts)[$counts(.) gt $minocc]

BaseX: 140ms Saxon: 9.5 Minutes eXist-db: 25 minutes and then raises the exception:

An exception occurred during query execution: err:XPTY0004 checking function parameter 1 in call map:get(untyped-value-check[xs:anyAtomicType, .]): XPTY0004: The actual cardinality for parameter 1 does not match the cardinality declared in the function's signature: map:get($key as xs:anyAtomicType) item()*. Expected cardinality: exactly one, got 50000. [at line 10, column 32, source: String/-6775637318469898051]

  1. Map fold-left version 2 As contributed by Christian Grün
declare namespace map =
    "http://www.w3.org/2005/xpath-functions/map";
let $seq := (1, 1 to 500000, 1, 2)
let $counts := fold-left($seq, map { }, function($sofar, $this) {
  map:put($sofar, $this, head(($sofar($this) + 1, 1)))
}) return
  map:for-each($counts, function($key, $count) {
    $key[$count > 1]
  })

BaseX: 430 Milliseconds eXist-db: 2 Seconds Saxon: Throws an exception:

Exception in thread "main" java.lang.StackOverflowError
	at net.sf.saxon.om.Chain$ChainIterator.next(Chain.java:306)
	at net.sf.saxon.om.Chain$ChainIterator.next(Chain.java:337)

  1. Window version As contributed by Christian Grün
let $seq := (1, 1 to 500000, 1, 2)
for tumbling window $w in sort($seq)
  start when true()
  end $e next $n when $e != $n
return $w[2]

BaseX: 180 Milliseconds Saxon: 2 Seconds eXist-db: Unsupported on eXist-db as Windows are not implemented!

adamretter avatar Dec 23 '20 15:12 adamretter

@adamretter Do you need help getting the timings labeled "???"? Saxon could be added for reference.

joewiz avatar Dec 23 '20 18:12 joewiz

@joewiz Thanks, but no. I just had to wait for some other work on my laptop to finish first - so that I could get fair timings.

adamretter avatar Dec 23 '20 21:12 adamretter

Thanks for the summary, Adam. If you got some spare time, it would be interesting to see timings for the fold-left and window solutions added.

ChristianGruen avatar Dec 24 '20 09:12 ChristianGruen

@ChristianGruen Thanks, I have done that now. It was interesting that your fold-left variant works well on eXist-db, but the other variant is pathological!

adamretter avatar Dec 24 '20 14:12 adamretter

Lots of surprises indeed! I learned quite something by comparing the different approaches.

ChristianGruen avatar Dec 24 '20 14:12 ChristianGruen

Note - BaseX just added a native implementation - https://docs.basex.org/wiki/Utility_Module#util:duplicates

adamretter avatar Dec 29 '20 18:12 adamretter

I am surprised who is surprised here. As we have multiple examples that show the above (#3406 for example). See also the implementation of xbow:distinct-duplicates dating back to 2019. Calls to distinct-values are to be avoided. group-by and map:contains are the way to go. I really thought this was common knowledge in the community. But a nice summary indeed. And I would love to have a performant implementation of distinct-values.

line-o avatar Mar 01 '21 11:03 line-o

xquery version "3.1";

declare namespace map = "http://www.w3.org/2005/xpath-functions/map";
    
declare function local:disdup-reducer ($result as map(*), $next as xs:string) as map(*) {
    if (exists($result?distinct($next)))
    then (map:put($result, 'duplicates', ($result?duplicates, $next)))
    else (map:put($result, 'distinct', map:put($result?distinct, $next, true())))
};

declare function local:distinct-duplicates($s as item()*) as map(*) {
    fold-left($s, map { 'distinct': map {}, 'duplicates': () }, local:disdup-reducer#2)
};

local:distinct-duplicates((1 to 50000, 1 to 17, 12 to 30))?duplicates

finishes after ~200ms

line-o avatar Mar 01 '21 12:03 line-o

I would also like to see that we get to reproducible performance tests by stating on which hardware ran it how many times and what was the fastest/slowest/mean execution time. #3406 is a great example of how to approach this.

line-o avatar Mar 01 '21 12:03 line-o

Can someone provide a link to the original discussion?

line-o avatar Mar 01 '21 12:03 line-o

https://xmlcom.slack.com/archives/C011NLXE4DU/p1608718126076600

line-o avatar Mar 01 '21 12:03 line-o

Micheal Kay came up with his neat solution:

let $seq := (1 to 50000, 1)
let $sorted := sort($seq)
return for-each-pair($sorted, tail($sorted), 
    function($x, $y) { if ($x eq $y) then $x else () })

execution times in existdb (v5.1.1 on my MacBook) vary quite a lot between 190 and 600ms (mean should be at 300ms)

line-o avatar Mar 01 '21 22:03 line-o