grass
grass copied to clipboard
v.db.join: speed up processing by using fewer db.execute commands
This PR speeds up v.db.join for large tables. Previously, individual v.db.addcolumn and db.execute commands were run for each column to join.
In this PR, one single v.db.addcolumn and db.execute command is executed for a chunk of max 100 columns at once.
Previous approach (deprecated):
[Here, all individual v.db.addcolumn commands are grouped into one single command. Additionally, all db.execute UPDATE table... commands to add data to the new columns are grouped into a single command. As the SQL command may become very long this way (and incomplete input in sqlite3_prepare() errors may be encountered, see also https://github.com/OSGeo/grass/pull/3273), commands with a total string length of > ~10.000 are again split into individual SQL UPDATE commands.]
The problem with db.execute input="-" is the statically defined character buffer size if the SQL string is read from stdin input="-".
https://github.com/OSGeo/grass/blob/ea8f3eae3299d70be3e46f2d1beeb9a72635e525/db/db.execute/main.c#L189-L191
https://github.com/OSGeo/grass/blob/ea8f3eae3299d70be3e46f2d1beeb9a72635e525/include/grass/dbmi.h#L142
This line truncates the SQL string to 8206 chars.
https://github.com/OSGeo/grass/blob/ea8f3eae3299d70be3e46f2d1beeb9a72635e525/db/db.execute/main.c#L196
Example of long SQL string:
- Generate long SQL string with 30145 chars (create table) with Python
sql = "CREATE TABLE soils_test (cat integer, soiltype varchar(10)"
for i in range(1200):
sql += f", soiltype{i} varchar(10)"
sql += ")"
- Try to execute this SQL string with
db.execute input="-"command db.executecommand fails because SQL string is truncated and not valid
For this long SQL string, it is better to use read SQL string from the file db.execute input=/tmp/create_table.sql
The problem with
db.execute input="-"is the statically defined character buffer size if the SQL string is read from stdininput="-".https://github.com/OSGeo/grass/blob/ea8f3eae3299d70be3e46f2d1beeb9a72635e525/db/db.execute/main.c#L189-L191
https://github.com/OSGeo/grass/blob/ea8f3eae3299d70be3e46f2d1beeb9a72635e525/include/grass/dbmi.h#L142
This line truncates the SQL string to 8206 chars.
https://github.com/OSGeo/grass/blob/ea8f3eae3299d70be3e46f2d1beeb9a72635e525/db/db.execute/main.c#L196
Example of long SQL string:
1. Generate long SQL string with 30145 chars (create table) with Pythonsql = "CREATE TABLE soils_test (cat integer, soiltype varchar(10)" for i in range(1200): sql += f", soiltype{i} varchar(10)" sql += ")"2. Try to execute this SQL string with `db.execute input="-"` command 3. `db.execute` command fails because SQL string is truncated and not validFor this long SQL string, it is better to use read SQL string from the file
db.execute input=/tmp/create_table.sql
Thanks! I now updated both v.db.join and v.db.update so that they both use db.executewith a temporary sql file as input instead of stdin. For joining 500 INT columns to a test vector with ~750 objects this brings calculation time down from 110s to 9s on my local machine. However, in the sql_files there are still separate sql statements:
[...]
UPDATE builtup_vectorized_base SET test_column_491 = (SELECT test_column_491 FROM builtup_vectorized_attributes WHERE builtup_vectorized_attributes.cat=builtup_vectorized_base.cat);
UPDATE builtup_vectorized_base SET test_column_492 = (SELECT test_column_492 FROM builtup_vectorized_attributes WHERE builtup_vectorized_attributes.cat=builtup_vectorized_base.cat);
UPDATE builtup_vectorized_base SET test_column_493 = (SELECT test_column_493 FROM builtup_vectorized_attributes WHERE builtup_vectorized_attributes.cat=builtup_vectorized_base.cat);
UPDATE builtup_vectorized_base SET test_column_494 = (SELECT test_column_494 FROM builtup_vectorized_attributes WHERE builtup_vectorized_attributes.cat=builtup_vectorized_base.cat);
[...]
instead of having just one UPDATE statement. I am not sure if this would significantly improve the processing time again, but I couldn't manage to put together a single working UPDATE statement - input is welcome!
Probably a set of SQL commands should be wrapped into a TRANSACTION?
sql = ["BEGIN TRANSACTION"]
...
sql.append("END TRANSACTION")
Random example:
https://github.com/OSGeo/grass/blob/24365e1d47a5684be65648e3b99e3021270c80e3/scripts/v.dissolve/v.dissolve.py#L200
Probably a set of SQL commands should be wrapped into a TRANSACTION?
sql = ["BEGIN TRANSACTION"] ... sql.append("END TRANSACTION")Random example:
https://github.com/OSGeo/grass/blob/24365e1d47a5684be65648e3b99e3021270c80e3/scripts/v.dissolve/v.dissolve.py#L200
Thanks! Adding this to both v.db.addcolumn and v.db.join seems to double the processing speed