exist
exist copied to clipboard
Terrible performance in non-distinct values queries
As raised in the xml.com Slack channel by Daniel Zimmel, the following queries have very poor performance on eXist-db (5.3.0-SNAPSHOT) when compared to BaseX (9.4.5) and Saxon (HE 10.3).
- FLWOR version Contributed by Daniel Zimmel:
let $seq := (1 to 50000, 1)
let $non-distinct-values := for $v in distinct-values($seq) return $v[count(index-of($seq,$v)) > 1]
return
$non-distinct-values
BaseX: 21 Seconds Saxon: 24 Seconds eXist-db: 9 Minutes
- Group by version As contributed by Gerrit Imsieke.
for $h in (1 to 50000, 1)
let $k := $h
group by $k
where count($h) gt 1
return
$h[1]
BaseX: 10 Milliseconds eXist-db: 65 Milliseconds Saxon: 1.2 Seconds
- Predicate version As contributed by Adam Retter.
let $seq := (1 to 50000, 1)
return
distinct-values($seq[count(index-of($seq, .)) gt 1])
Saxon: 20 Seconds BaseX: 24 Seconds eXist-db: 12 Minutes
- Map version As contributed by Mads Hansen
declare namespace map =
"http://www.w3.org/2005/xpath-functions/map";
let $seq := (1 to 50000, 1)
let $map := map:merge(($seq ! map:entry(string(.), .)), map{"duplicates":"combine"})
return
map:keys($map) ! tail(map:get($map, .))
BaseX: 30 Milliseconds
Saxon: 2 Seconds
eXist-db: Unsupported on eXist-db as map:merge#2 is not implemented!
- Succinct predicate version As contributed by Joel Kalvesmaki
let $seq := (1 to 50000, 1)
return
$seq[index-of($seq, .)[2]]
Saxon: 20 Seconds BaseX: 35 Seconds eXist-db: 11 Minutes
- Map fold-left version 1 As contributed by Liam Quin
declare namespace map =
"http://www.w3.org/2005/xpath-functions/map";
let $minocc := 2, $seq := (1 to 50000, 1 to 17, 12 to 30),
$counts := fold-left( $seq, map { }, function($sofar as item()*, $this as xs:integer) {
let $seen := ($sofar($this), 0)[1]
return
map:merge((map { $this : $seen +1 }, $sofar))
})
return
map:keys($counts)[$counts(.) gt $minocc]
BaseX: 140ms Saxon: 9.5 Minutes eXist-db: 25 minutes and then raises the exception:
An exception occurred during query execution: err:XPTY0004 checking function parameter 1 in call map:get(untyped-value-check[xs:anyAtomicType, .]): XPTY0004: The actual cardinality for parameter 1 does not match the cardinality declared in the function's signature: map:get($key as xs:anyAtomicType) item()*. Expected cardinality: exactly one, got 50000. [at line 10, column 32, source: String/-6775637318469898051]
- Map fold-left version 2 As contributed by Christian Grün
declare namespace map =
"http://www.w3.org/2005/xpath-functions/map";
let $seq := (1, 1 to 500000, 1, 2)
let $counts := fold-left($seq, map { }, function($sofar, $this) {
map:put($sofar, $this, head(($sofar($this) + 1, 1)))
}) return
map:for-each($counts, function($key, $count) {
$key[$count > 1]
})
BaseX: 430 Milliseconds eXist-db: 2 Seconds Saxon: Throws an exception:
Exception in thread "main" java.lang.StackOverflowError
at net.sf.saxon.om.Chain$ChainIterator.next(Chain.java:306)
at net.sf.saxon.om.Chain$ChainIterator.next(Chain.java:337)
- Window version As contributed by Christian Grün
let $seq := (1, 1 to 500000, 1, 2)
for tumbling window $w in sort($seq)
start when true()
end $e next $n when $e != $n
return $w[2]
BaseX: 180 Milliseconds Saxon: 2 Seconds eXist-db: Unsupported on eXist-db as Windows are not implemented!
@adamretter Do you need help getting the timings labeled "???"? Saxon could be added for reference.
@joewiz Thanks, but no. I just had to wait for some other work on my laptop to finish first - so that I could get fair timings.
Thanks for the summary, Adam. If you got some spare time, it would be interesting to see timings for the fold-left and window solutions added.
@ChristianGruen Thanks, I have done that now. It was interesting that your fold-left variant works well on eXist-db, but the other variant is pathological!
Lots of surprises indeed! I learned quite something by comparing the different approaches.
Note - BaseX just added a native implementation - https://docs.basex.org/wiki/Utility_Module#util:duplicates
I am surprised who is surprised here. As we have multiple examples that show the above (#3406 for example). See also the implementation of xbow:distinct-duplicates dating back to 2019.
Calls to distinct-values are to be avoided. group-by and map:contains are the way to go.
I really thought this was common knowledge in the community. But a nice summary indeed.
And I would love to have a performant implementation of distinct-values.
xquery version "3.1";
declare namespace map = "http://www.w3.org/2005/xpath-functions/map";
declare function local:disdup-reducer ($result as map(*), $next as xs:string) as map(*) {
if (exists($result?distinct($next)))
then (map:put($result, 'duplicates', ($result?duplicates, $next)))
else (map:put($result, 'distinct', map:put($result?distinct, $next, true())))
};
declare function local:distinct-duplicates($s as item()*) as map(*) {
fold-left($s, map { 'distinct': map {}, 'duplicates': () }, local:disdup-reducer#2)
};
local:distinct-duplicates((1 to 50000, 1 to 17, 12 to 30))?duplicates
finishes after ~200ms
I would also like to see that we get to reproducible performance tests by stating on which hardware ran it how many times and what was the fastest/slowest/mean execution time. #3406 is a great example of how to approach this.
Can someone provide a link to the original discussion?
https://xmlcom.slack.com/archives/C011NLXE4DU/p1608718126076600
Micheal Kay came up with his neat solution:
let $seq := (1 to 50000, 1)
let $sorted := sort($seq)
return for-each-pair($sorted, tail($sorted),
function($x, $y) { if ($x eq $y) then $x else () })
execution times in existdb (v5.1.1 on my MacBook) vary quite a lot between 190 and 600ms (mean should be at 300ms)