grass icon indicating copy to clipboard operation
grass copied to clipboard

v.db.join: speed up processing by using fewer db.execute commands

Open griembauer opened this issue 1 year ago • 1 comments

This PR speeds up v.db.join for large tables. Previously, individual v.db.addcolumn and db.execute commands were run for each column to join. In this PR, one single v.db.addcolumn and db.execute command is executed for a chunk of max 100 columns at once.

Previous approach (deprecated): [Here, all individual v.db.addcolumn commands are grouped into one single command. Additionally, all db.execute UPDATE table... commands to add data to the new columns are grouped into a single command. As the SQL command may become very long this way (and incomplete input in sqlite3_prepare() errors may be encountered, see also https://github.com/OSGeo/grass/pull/3273), commands with a total string length of > ~10.000 are again split into individual SQL UPDATE commands.]

griembauer avatar Dec 04 '23 11:12 griembauer

The problem with db.execute input="-" is the statically defined character buffer size if the SQL string is read from stdin input="-".

https://github.com/OSGeo/grass/blob/ea8f3eae3299d70be3e46f2d1beeb9a72635e525/db/db.execute/main.c#L189-L191

https://github.com/OSGeo/grass/blob/ea8f3eae3299d70be3e46f2d1beeb9a72635e525/include/grass/dbmi.h#L142

This line truncates the SQL string to 8206 chars.

https://github.com/OSGeo/grass/blob/ea8f3eae3299d70be3e46f2d1beeb9a72635e525/db/db.execute/main.c#L196

Example of long SQL string:

  1. Generate long SQL string with 30145 chars (create table) with Python
sql = "CREATE TABLE soils_test (cat integer, soiltype varchar(10)"
for i in range(1200):
    sql += f", soiltype{i} varchar(10)"
sql += ")"
  1. Try to execute this SQL string with db.execute input="-" command
  2. db.execute command fails because SQL string is truncated and not valid

For this long SQL string, it is better to use read SQL string from the file db.execute input=/tmp/create_table.sql

tmszi avatar Dec 20 '23 08:12 tmszi

The problem with db.execute input="-" is the statically defined character buffer size if the SQL string is read from stdin input="-".

https://github.com/OSGeo/grass/blob/ea8f3eae3299d70be3e46f2d1beeb9a72635e525/db/db.execute/main.c#L189-L191

https://github.com/OSGeo/grass/blob/ea8f3eae3299d70be3e46f2d1beeb9a72635e525/include/grass/dbmi.h#L142

This line truncates the SQL string to 8206 chars.

https://github.com/OSGeo/grass/blob/ea8f3eae3299d70be3e46f2d1beeb9a72635e525/db/db.execute/main.c#L196

Example of long SQL string:

1. Generate long SQL string with 30145 chars (create table) with Python
sql = "CREATE TABLE soils_test (cat integer, soiltype varchar(10)"
for i in range(1200):
    sql += f", soiltype{i} varchar(10)"
sql += ")"
2. Try to execute this SQL string with `db.execute input="-"` command

3. `db.execute` command fails because SQL string is truncated and not valid

For this long SQL string, it is better to use read SQL string from the file db.execute input=/tmp/create_table.sql

Thanks! I now updated both v.db.join and v.db.update so that they both use db.executewith a temporary sql file as input instead of stdin. For joining 500 INT columns to a test vector with ~750 objects this brings calculation time down from 110s to 9s on my local machine. However, in the sql_files there are still separate sql statements:

[...]
UPDATE builtup_vectorized_base SET test_column_491 = (SELECT test_column_491 FROM builtup_vectorized_attributes WHERE builtup_vectorized_attributes.cat=builtup_vectorized_base.cat);
UPDATE builtup_vectorized_base SET test_column_492 = (SELECT test_column_492 FROM builtup_vectorized_attributes WHERE builtup_vectorized_attributes.cat=builtup_vectorized_base.cat);
UPDATE builtup_vectorized_base SET test_column_493 = (SELECT test_column_493 FROM builtup_vectorized_attributes WHERE builtup_vectorized_attributes.cat=builtup_vectorized_base.cat);
UPDATE builtup_vectorized_base SET test_column_494 = (SELECT test_column_494 FROM builtup_vectorized_attributes WHERE builtup_vectorized_attributes.cat=builtup_vectorized_base.cat);
[...]

instead of having just one UPDATE statement. I am not sure if this would significantly improve the processing time again, but I couldn't manage to put together a single working UPDATE statement - input is welcome!

griembauer avatar Feb 26 '24 10:02 griembauer

Probably a set of SQL commands should be wrapped into a TRANSACTION?

    sql = ["BEGIN TRANSACTION"]
   ...
    sql.append("END TRANSACTION")

Random example:

https://github.com/OSGeo/grass/blob/24365e1d47a5684be65648e3b99e3021270c80e3/scripts/v.dissolve/v.dissolve.py#L200

neteler avatar Feb 26 '24 11:02 neteler

Probably a set of SQL commands should be wrapped into a TRANSACTION?

    sql = ["BEGIN TRANSACTION"]
   ...
    sql.append("END TRANSACTION")

Random example:

https://github.com/OSGeo/grass/blob/24365e1d47a5684be65648e3b99e3021270c80e3/scripts/v.dissolve/v.dissolve.py#L200

Thanks! Adding this to both v.db.addcolumn and v.db.join seems to double the processing speed

griembauer avatar Feb 26 '24 12:02 griembauer