fast-sqlite3-inserts icon indicating copy to clipboard operation
fast-sqlite3-inserts copied to clipboard

Use SQL for generation

Open grishy opened this issue 4 years ago • 10 comments

What about a using SQLite for test data generation?

Example:

CREATE TABLE test (
  id INTEGER PRIMARY KEY NOT NULL, 
  x REAL NOT NULL,
  y REAL NOT NULL,
  z REAL NOT NULL
);
INSERT INTO test
    WITH RECURSIVE
      cnt( id, x, y, z) AS (
      VALUES(1 , random(), random(), random()) UNION ALL 
      SELECT id+1,random(),random(), random() FROM cnt WHERE ID<1000)
    select * from cnt;

https://paulbradley.org/sqlite-test-data/ https://stackoverflow.com/questions/17931320/how-to-insert-random-data-into-a-sqlite-table-using-only-queries

grishy avatar Jul 19 '21 03:07 grishy

I was thinking about the same thing. It would be interesting to compare if everything was offloaded to a single transaction script like this and see if offloading all the math to SQLite makes sense. While there are range restrictions on the values, there aren't distribution restrictions (normal, poisson, what have you) and the post already says that any regular PRNG is fine and crypto-strength not necessary, so SQLite random() should be capable of meeting the constraints.

Having a quick scratch at the math, it would look something like:

Insert into user
with recursive usr(id, area, age, active) as (
    values (1, abs(random() % 99999), (abs(random() % 3) + 1) * 5, abs(random() % 1))
    union all select id + 1, abs(random() % 99999), (abs(random() % 3) + 1) * 5, abs(random() % 1) from usr
    where id < 100000
)
select * from usr

WorldMaker avatar Jul 20 '21 17:07 WorldMaker

@WorldMaker hey, could you send this code as a PR?

avinassh avatar Jul 25 '21 06:07 avinassh

I ran this locally, this is quite slow. Do let me know if i am missing something.

  1. First I manually created a DB, created a table.
  2. Then I ran this following script:
#!/bin/sh

sqlite3 pure_sql.db <<'END_SQL'
insert into user
with recursive usr(id, area, age, active) as (
    values (1, abs(random() % 99999), (abs(random() % 3) + 1) * 5, abs(random() % 1))
    union all select id + 1, abs(random() % 99999), (abs(random() % 3) + 1) * 5, abs(random() % 1) from usr
    where id < 100000000
)
select * from usr;
END_SQL

saved this in a script and ran it with time:

time ./sq.sh
./sq.sh  111.59s user 2.35s system 96% cpu 1:58.00 total

avinassh avatar Jul 25 '21 06:07 avinassh

@avinassh It turns out - is it at least faster than PyPy?

grishy avatar Jul 25 '21 09:07 grishy

PRAGMA journal_mode = OFF;
PRAGMA synchronous = 0;
PRAGMA cache_size = 1000000;
PRAGMA locking_mode = EXCLUSIVE;
PRAGMA temp_store = MEMORY;

Do they influence here?

grishy avatar Jul 25 '21 09:07 grishy

I had tried with them too, but it did not make much difference:

#!/bin/sh

sqlite3 pure_sql.db <<'END_SQL'
PRAGMA journal_mode = OFF;
PRAGMA synchronous = 0;
PRAGMA cache_size = 1000000;
PRAGMA locking_mode = EXCLUSIVE;
PRAGMA temp_store = MEMORY;
insert into user
with recursive usr(id, area, age, active) as (
    values (1, abs(random() % 99999), (abs(random() % 3) + 1) * 5, abs(random() % 1))
    union all select id + 1, abs(random() % 99999), (abs(random() % 3) + 1) * 5, abs(random() % 1) from usr
    where id < 100000000
)
select * from usr;
END_SQL

output:

$ time ./sq.sh

off
exclusive
./sq.sh  114.55s user 3.31s system 99% cpu 1:58.93 total

avinassh avatar Jul 25 '21 10:07 avinassh

Interesting results. We're about at the limit of my personal experience with SQLite directly and a lot of the things you might try to speed things up in other SQL environments obviously don't apply to SQLite.

Reviewing the Stack Overflow posted above, the sqlite3 shell supports an extension named generate_series which the documentation says is faster, so you could try that too (since you are already using the sqlite3 shell):

#!/bin/sh

sqlite3 pure_sql.db <<'END_SQL'
insert into user (area, age, active)
select abs(random() % 99999), (abs(random() % 3) + 1) * 5, abs(random() % 1)
from generate_series(0, 100000000);
END_SQL

WorldMaker avatar Jul 25 '21 14:07 WorldMaker

generate_series is definitely faster than a recursive query, about 2x in my test.

Using generate_series:

-- load.sql
insert into user (area, age, active)
select abs(random() % 99999), (abs(random() % 3) + 1) * 5, abs(random() % 1)
from generate_series(0, 100000000);
$ time sqlite3 t1.db '.read schema.sql' '.read load.sql'

real	0m55.747s
user	0m45.945s
sys	0m5.894s

Using recursive:

-- load-rec.sql
insert into user
with recursive usr(id, area, age, active) as (
    values (1, abs(random() % 99999), (abs(random() % 3) + 1) * 5, abs(random() % 1))
    union all select id + 1, abs(random() % 99999), (abs(random() % 3) + 1) * 5, abs(random() % 1) from usr
    where id < 100000000
)
select * from usr;
$ time sqlite3 t2.db '.read schema.sql' '.read load-rec.sql'

real	1m42.659s
user	1m35.587s
sys	0m4.820s

Adding in the pragma statements didnt make a difference in my test, in fact is slowed it down a few seconds.

Interestingly, I tried without random() calls and hardcoded some numbers, and it was even faster:

INSERT INTO user (area, age, active) 
  SELECT 
    1000000, 
    5, 
    1 
  FROM generate_series(1, 100000000, 1);
real	0m27.924s
user	0m17.080s
sys	0m6.985s

So generate_series comes out at 55 seconds, still slower than the best rust method, but seemingly faster than the other ones. And there's about 25 seconds room for improvement if there's a faster random generator!

Unsure if you're still accepting PRs, but lmk if you want to see one using generate_series!

asg017 avatar Nov 04 '21 18:11 asg017

Interestingly, I tried without random() calls and hardcoded some numbers, and it was even faster:

Someone else did a flamegraph, I think some good time is being spent in random (in rust solution)

Unsure if you're still accepting PRs, but lmk if you want to see one using generate_series!

I don't have a pure SQL solution, so I will be happy to accept a PR on this since is close to python one :)

avinassh avatar Nov 05 '21 09:11 avinassh

Created a PR for this in #23! Based on of my machine, it appears to be faster than even the fastest rust solution, but would love to see a more official benchmark run

asg017 avatar Dec 13 '21 01:12 asg017