
insertBatch to generate multiple INSERT statements

Open samdark opened this issue 1 year ago • 8 comments

Right now, if the batch is too big, we run into DBMS limits like the following:

[screenshot: PostgreSQL error caused by exceeding the parameter limit]

The screenshot is from PostgreSQL.

I suggest doing three things:

  1. Introduce a configurable parameter for the max number of query parameters.
  2. Default it per DBMS. For PostgreSQL it could be 65535.
  3. If the max is reached, move the rest of the data to the next INSERT statement (see the sketch below).
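
A minimal sketch of point 3, assuming a per-driver limit of 65535 bound parameters (the function name and shape are hypothetical, not existing yiisoft/db API):

    // Hypothetical helper: split rows so each INSERT stays under the driver's
    // parameter limit. Each row consumes one placeholder per column.
    function splitRowsByParamsLimit(array $rows, int $columnCount, int $paramsLimit = 65535): array
    {
        $rowsPerStatement = intdiv($paramsLimit, $columnCount);

        // One INSERT per chunk; the remainder moves to the next statement.
        return array_chunk($rows, $rowsPerStatement);
    }

    // 3 columns => 21845 rows per INSERT; 50000 rows => 3 INSERT statements.
    $chunks = splitRowsByParamsLimit(array_fill(0, 50_000, ['a', 'b', 'c']), 3);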

samdark avatar Oct 17 '24 07:10 samdark

That will change the characteristic of this operation: from atomic to non-atomic.

rob006 avatar Oct 18 '24 07:10 rob006

You can't make it atomic anyway (except via transactions) when the max number of parameters is exceeded.
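
For illustration, a plain-PDO sketch of regaining atomicity by wrapping the split statements in one transaction, reusing $chunks from the sketch above (executeInsertChunk() is a hypothetical helper):

    $pdo->beginTransaction();

    try {
        foreach ($chunks as $chunk) {
            executeInsertChunk($pdo, $chunk); // hypothetical: binds and runs one INSERT
        }
        $pdo->commit(); // all chunks become visible together
    } catch (Throwable $e) {
        $pdo->rollBack(); // none of the chunks persist
        throw $e;
    }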

samdark avatar Oct 18 '24 09:10 samdark

Yes, but implicitly changing an atomic operation into a non-atomic one is just a massive footgun. This should be explicitly enabled by the programmer who makes the call, so the developer knows that this batch insert may result in multiple queries.

Also, splitting one massive insert into multiple smaller ones is useful not only for handling the parameters limit. I generally avoid really big inserts, as they may block the table for other queries, so the system may be more responsive if you do 10 small inserts instead of one big one. Having a method that would do this for you would be useful even if you don't exceed the parameters limit in your queries.

rob006 avatar Oct 18 '24 10:10 rob006

Makes sense. The default value of such an option must be null then, to disable splitting.

samdark avatar Oct 18 '24 19:10 samdark

Good point, but

  1. It is better if it works by default. Currently it does not work.
  2. The documentation should mention this.

By default it is better to specify a driver-specific max value and allow the developer to change it.

Tigrov avatar Oct 19 '24 02:10 Tigrov

Either way is fine with me, as long as we put a warning in the docs about whether it is atomic or not.

samdark avatar Oct 19 '24 11:10 samdark

By default it is better to specify a driver-specific max value and allow the developer to change it.

To be clear: I was talking about a limit based on the number of inserted rows, not the number of params used by the query. If I want to insert 50k records in 50 queries (1k records per query), then the limit I need to pass to the function should be 1000.
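
In other words, a sketch with array_chunk (insertChunk() is a hypothetical stand-in for running one INSERT query):

    // 50000 rows with a row limit of 1000 => ceil(50000 / 1000) = 50 queries.
    foreach (array_chunk($rows, 1000) as $chunk) {
        insertChunk($chunk); // hypothetical: one INSERT with up to 1000 rows
    }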

rob006 avatar Oct 19 '24 13:10 rob006

Yes. That sounds useful as well.

samdark avatar Oct 20 '24 11:10 samdark

If it is driver-specific, shouldn't this be done at the DBMS level? In the following packages:

yiisoft/db-mssql
yiisoft/db-mysql
yiisoft/db-oracle
yiisoft/db-pgsql
yiisoft/db-sqlite

evil1 avatar Dec 24 '24 15:12 evil1

By default it is better to specify a driver-specific max value and allow the developer to change it.

To be clear: I was talking about a limit based on the number of inserted rows, not the number of params used by the query. If I want to insert 50k records in 50 queries (1k records per query), then the limit I need to pass to the function should be 1000.

My vision is to introduce a parameter, let's say batchInsertChunkSize, and default it to 0 (which means no limit); if you set it to any positive value, insert queries are split into chunks of that size.
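
A hypothetical call shape for that proposal (batchInsertChunkSize does not exist in yiisoft/db today):

    // Default 0 keeps the current single-statement behavior.
    $db->createCommand()
        ->insertBatch('user', $rows, ['name', 'age'], batchInsertChunkSize: 1000)
        ->execute(); // runs ceil(count($rows) / 1000) INSERT statements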

evil1 avatar Dec 24 '24 15:12 evil1

shouldn't this be done at the DBMS level?

Implementation itself — yes. The interface should be common, though.

samdark avatar Dec 24 '24 20:12 samdark

shouldn't this be done at the DBMS level?

Implementation itself — yes. The interface should be common, though.

Basically we are discussing two similar questions here at once. As for the original issue described in the very first message: I think we can encapsulate this logic inside the DBMS package itself. For PostgreSQL we could add a property int $paramsLimit = 65535;. Same for the other PDO drivers.

In AbstractDMLQueryBuilder, prepareBatchInsertValues(string $table, iterable $rows, array $columns, array &$params, int $chunkSize = 0) is where we can make use of this property. We can also introduce an int $chunkSize = 0 parameter: insertBatch(string $table, iterable $rows, array $columns = [], array &$params = [], int $chunkSize = 0) will accept it and pass it through. The method will return an array of chunks (by default, a single chunk).
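
A rough sketch of how the two limits could interact (simplified, assuming the $paramsLimit property above; not the actual implementation):

    private function resolveChunkSize(int $columnCount, int $chunkSize): int
    {
        // Hard cap imposed by the driver, e.g. intdiv(65535, $columnCount) for PostgreSQL.
        $driverMax = intdiv($this->paramsLimit, $columnCount);

        // $chunkSize = 0 means "no explicit limit": fall back to the driver cap.
        return $chunkSize > 0 ? min($chunkSize, $driverMax) : $driverMax;
    }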

evil1 avatar Dec 25 '24 05:12 evil1

This will require insertBatch() to return an array of SQL queries. Unfortunately, it will break some BC.

evil1 avatar Dec 26 '24 06:12 evil1

It seems like there is no easy way to implement these features. I've reproduced the bug for pgsql: [screenshot, 2024-12-27]

One way is to create a method addRawSql() and something like addBindValues(), store $sql as an array, and iterate over these queries on the execute() call:

    public function batchInsert(string $table, array $columns, iterable $rows): static
    {
        // Quote the table name and every column name for the target DBMS.
        $table = $this->getQueryBuilder()->quoter()->quoteSql($table);

        /** @psalm-var string[] $columns */
        foreach ($columns as &$column) {
            $column = $this->getQueryBuilder()->quoter()->quoteSql($column);
        }

        unset($column);

        // Build a single INSERT covering all rows; placeholders go into $params.
        $params = [];
        $sql = $this->getQueryBuilder()->batchInsert($table, $columns, $rows, $params);

        // Currently only one SQL string can be stored per command.
        $this->setRawSql($sql);
        $this->bindValues($params);

        return $this;
    }
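
Under that idea, the method could look roughly like this (addRawSql() and addBindValues() are the proposed methods, and the query builder is assumed to return one [$sql, $params] pair per chunk):

    public function batchInsert(string $table, array $columns, iterable $rows): static
    {
        // ... quoting of $table and $columns as above ...

        // Assumed return shape: one [$sql, $params] pair per chunk.
        foreach ($this->getQueryBuilder()->batchInsert($table, $columns, $rows) as [$sql, $params]) {
            $this->addRawSql($sql);        // proposed: append to an internal list of SQL strings
            $this->addBindValues($params); // proposed: keep params per statement
        }

        return $this; // execute() would then run each stored statement in turn
    }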

Or maybe think of introducing a DTO, something like BatchInsertChunk, which will store $sql and $params in it. Add an addBatchInsertChunk() method, store the chunks as an array inside the Command, and iterate over them on the execute() call.
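
A minimal sketch of that DTO (BatchInsertChunk is the name proposed above, not an existing class):

    final class BatchInsertChunk
    {
        public function __construct(
            public readonly string $sql,
            public readonly array $params,
        ) {
        }
    }

    // The Command would collect chunks via the proposed addBatchInsertChunk()
    // and run each chunk's SQL with its params when execute() is called.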

evil1 avatar Dec 27 '24 11:12 evil1