
Bug: hidden time cost proportional to the input size in SurrealQL

Open HackTales opened this issue 11 months ago • 5 comments

Describe the bug

Using arrays or result sets from a previously computed operation, such as a SELECT or a function call, incurs a cost proportional to the size of the array.

Steps to reproduce

Set up SurrealDB to run in memory (e.g. surreal start memory) in order to exclude any possible noise from the store. Create a table and fill it with some elements, e.g. a table with 65536 = 2^16 elements.

REMOVE TABLE IF EXISTS test_tbl;
DEFINE TABLE test_tbl;
INSERT INTO test_tbl (SELECT $this as id, $this as n FROM (function(){
        return Array.from(Array(65536).keys())
}));
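
As a quick sanity check (my addition, not part of the original repro), the row count can be confirmed with something like:

SELECT count() FROM test_tbl GROUP ALL; // should report 65536 rows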

Load $N elements from the table into an array:

LET $arr = (SELECT VALUE n FROM test_tbl LIMIT $N);

Perform some operations with different big-O behaviour:

array::first($arr); //O(1)
array::sort::desc($arr)[0]; //O(n log n)
math::sum($arr); // O(n)

Run the above operations for multiple values of $N.
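
Putting the steps together, a single timed iteration looks roughly like this (a sketch; the specific $N is only an example):

LET $N = 16384;
LET $arr = (SELECT VALUE n FROM test_tbl LIMIT $N);
array::first($arr);         // expected O(1)
array::sort::desc($arr)[0]; // expected O(n log n)
math::sum($arr);            // expected O(n)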

I ran the following tests on a modern i7, with multiple runs each, and report the min/max times.

$N=1
O(1) ~ 10-20µs
O(n log n) ~ 5-7µs
O(n) ~ 5-8µs

About 1000 times the previous $N: the time does not increase proportionally, but it does increase.

$N=1024
O(1) ~ 100-300µs
O(n log n) ~ 95-300µs
O(n) ~ 260-415µs

16 times the previous $N: a cost proportional to $N starts to emerge.

$N=16384
O(1) ~ 1.3ms-1.5ms
O(n log n) ~ 1.4ms-1.5ms
O(n) ~ 2ms-2.2ms

4 times the previous $N: the cost increases.

$N=65536
O(1) ~ 5.6ms-6.1ms
O(n log n) ~ 5.7ms-6.3ms
O(n) ~ 8ms-9.5ms

2 times the previous $N: the cost increases again.

$N=131072
O(1) ~ 11ms-13ms
O(n log n) ~ 12ms-15ms
O(n) ~ 17ms-20ms

2 times the previous $N: the pattern seems obvious.

$N=262144
O(1) ~ 23ms-28ms
O(n log n) ~ 27ms-28ms
O(n) ~ 34ms-36ms

Note that self-assignment follows the same timing pattern as the O(1) case:

LET $arr = $arr;

Expected behaviour

Functions and operations with sub-linear time complexity (especially constant time) should not be affected by a linear cost.

I expect the time spent calling functions like array::len($arr) or array::first($arr) to be very similar for an array with 1 element and for one with 10000 elements.
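
For illustration (my own example, with arbitrary sizes), I would expect these two calls to cost roughly the same:

LET $small = (SELECT VALUE n FROM test_tbl LIMIT 1);
LET $large = (SELECT VALUE n FROM test_tbl LIMIT 10000);
array::first($small); // a few microseconds
array::first($large); // should cost about the same, but currently grows with the array size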

If needed, I can help by providing more query profiling or examples, or by building a simple tool to automate this.

SurrealDB version

1.3.1 for linux on x86_64

Contact Details

No response

Is there an existing issue for this?

  • [X] I have searched the existing issues

Code of Conduct

  • [X] I agree to follow this project's Code of Conduct

HackTales avatar Mar 24 '24 16:03 HackTales

Potentially related, or possibly the root cause of the linear cost: https://github.com/surrealdb/surrealdb/issues/26

HackTales avatar Mar 24 '24 16:03 HackTales

Hey @HackTales, this one is fixed in #3657 and will be out in 1.4.0 soon!

naisofly avatar Mar 27 '24 14:03 naisofly

Hi @HackTales, yes, this is directly a result of #26. It's something we know about (obviously, being one of the first issues in the repository), and will get to in due course. We have done some analysis and benchmarking, and generally in queries the cost is only about 7-10% additional overhead, so there are other performance improvements we are focussing on first. However, this is definitely something we want to fix in due course!

tobiemh avatar Mar 29 '24 10:03 tobiemh

Hey @HackTales, this one is fixed in #3657 and will be out in 1.4.0 soon!

Hi @naisofly, unfortunately the issue is still present in 1.4.0 for linux on x86_64 (I think the PR you mentioned solves a similar problem related to aggregations).

Hi @HackTales, yes, this is directly a result of #26. It's something we know about (obviously, being one of the first issues in the repository), and will get to in due course. We have done some analysis and benchmarking, and generally in queries the cost is only about 7-10% additional overhead, so there are other performance improvements we are focussing on first. However, this is definitely something we want to fix in due course!

Hi @tobiemh, thank you for clarifying that, and for your commitment to this project and to the general performance of Surreal. I understand the business need for a general measure of performance across a set of different queries. However, SurrealQL is designed to be extremely powerful at expressing data manipulation (and I love that; it is the reason we chose it), so any query is expected to express more than a single SQL statement. But passing data around is costly: right now, returning the results of a query carries a cost proportional to their total size, which hits performance hard for big datasets/result sets (which already have problems of their own just by being big!).
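
To illustrate what I mean (a hypothetical example, not taken from a real workload), even a simple multi-statement query keeps passing an intermediate array between statements, and every use of it currently pays a cost proportional to its size:

LET $rows = (SELECT VALUE n FROM test_tbl LIMIT 100000);
// each use of $rows below currently pays a cost proportional to its size
RETURN {
    first: array::first($rows),
    total: math::sum($rows)
};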

I'd like to ask:

  1. Do you have the dataset dump and the queries you use to measure the general performance of SurrealQL? This could help the community better understand your short- and long-term vision for performance objectives.
  2. Do you have any monitoring that shows how performance changes after every Surreal release? Is it public? Otherwise, I'd like to set up a simple tool that benchmarks Surreal's query times on various queries after each release / merge / commit, to help the Surreal devs and the community monitor performance progress (or detect regressions), but only if you are comfortable with it.

Also, considering the importance Surreal gives to performance itself (in the readme, on the site, and in the contribution guide), I think that being driven by data is the only way to show that you have achieved it, or that you are working towards it!

HackTales avatar Apr 10 '24 11:04 HackTales

Hey @HackTales,

Regarding performance monitoring for each PR, we have those scripts here: https://github.com/surrealdb/surrealdb/tree/main/lib/benches. You are welcome to contribute to those if you have ideas for how they could be improved :raised_hands:

Regarding comparative benchmarking, we have been working on a range of benchmarks, from simple CRUD and standard comparisons to more advanced queries focused on specific use cases.

Hope this helps!

naisofly avatar Apr 18 '24 13:04 naisofly