exporter icon indicating copy to clipboard operation
exporter copied to clipboard

Optimizing export

Open kirya-dev opened this issue 3 years ago • 23 comments

For export large data need call flush() in StreamedResponse. its allow clear buffer and not use extra memory. Todo as feature of framework?

My solution:

# config/services.yml
services:
    app.export.writer.csv:
        class: App\Writer\FlushWriterDecorator
        decorates: sonata.exporter.writer.csv

    app.export.writer.json:
        class: App\Writer\FlushWriterDecorator
        decorates: sonata.exporter.writer.json

    app.export.writer.xml:
        class: App\Writer\FlushWriterDecorator
        decorates: sonata.exporter.writer.xml

    app.export.writer.xls:
        class: App\Writer\FlushWriterDecorator
        decorates: sonata.exporter.writer.xls
<?php
declare(strict_types=1);

namespace App\Writer;

use Sonata\Exporter\Writer\TypedWriterInterface;

class FlushWriterDecorator implements TypedWriterInterface
{
    private const FLUSH_EVERY = 1000;

    private int $position;
    private TypedWriterInterface $writer;

    public function __construct(TypedWriterInterface $writer)
    {
        $this->writer = $writer;
    }

    public function open(): void
    {
        $this->writer->open();

        $this->position = 0;
    }

    public function write(array $data): void
    {
        $this->writer->write($data);

        $this->position++;
        if (0 === ($this->position % self::FLUSH_EVERY)) {
            flush();
        }
    }

    public function close(): void
    {
        $this->writer->close();
    }

    public function getDefaultMimeType(): string
    {
        return $this->writer->getDefaultMimeType();
    }

    public function getFormat(): string
    {
        return $this->writer->getFormat();
    }
}

kirya-dev avatar Aug 05 '20 11:08 kirya-dev

Hi, this feature could be great. Wanna start a Pr ? :)

VincentLanglet avatar Aug 05 '20 12:08 VincentLanglet

Yes! I have next questions:

  1. Is optional decorator?
  2. Value of FLUSH_EVERY pass by parameters for more flexability? If null (on default) no need use decorator

kirya-dev avatar Aug 05 '20 12:08 kirya-dev

Yes!

I have next questions:

  1. Is optional decorator?

  2. Value of FLUSH_EVERY pass by parameters for more flexability?

If null (on default) no need use decorator

  1. dont know 😅

  2. this value could come from a sonata exporter config parameter indeed. I'm not sure if we should keep null as default value or if we can abritrary chose a number.

VincentLanglet avatar Aug 05 '20 12:08 VincentLanglet

There have other problem with get large data from doctrine. My solve this problem by using many paginated queries. Need solve this also in this bundle?

Im my test combine both optimizing dont eat memory more than 60MB.. Exported file size it was 650MB

kirya-dev avatar Aug 05 '20 14:08 kirya-dev

This improved version of DoctrineORMQuerySourceIterator

<?php
declare(strict_types=1);

namespace App\Export\Source;

use Doctrine\ORM\Query;
use Sonata\Exporter\Source\DoctrineORMQuerySourceIterator;

final class PaginatedDoctrineORMQuerySourceIterator extends DoctrineORMQuerySourceIterator
{
    private const PAGE_SIZE = 1000;

    private int $page = 0;

    public function __construct(Query $query, array $fields, string $dateTimeFormat = 'r')
    {
        parent::__construct($query, $fields, $dateTimeFormat);

        $this->query->setMaxResults(self::PAGE_SIZE);
        $this->query->setFirstResult(0);
    }

    public function next(): void
    {
        $this->iterator->next();

        if (!$this->iterator->valid()) {
            $this->page++;
            $this->query->setFirstResult($this->page * self::PAGE_SIZE);
            $this->query->getEntityManager()->clear();

            $this->iterator = null;
            $this->rewind();
        }
    }
}

kirya-dev avatar Aug 05 '20 15:08 kirya-dev

That screenshot indicates low memory usage when exporting big data. Im apply both optimization. Downloaded file with size is 637.4 MB

image

kirya-dev avatar Aug 05 '20 15:08 kirya-dev

This improved version of DoctrineORMQuerySourceIterator


<?php

declare(strict_types=1);



namespace App\Export\Source;



use Doctrine\ORM\Query;

use Sonata\Exporter\Source\DoctrineORMQuerySourceIterator;



final class PaginatedDoctrineORMQuerySourceIterator extends DoctrineORMQuerySourceIterator

{

    private const PAGE_SIZE = 1000;



    private int $page = 0;



    public function __construct(Query $query, array $fields, string $dateTimeFormat = 'r')

    {

        parent::__construct($query, $fields, $dateTimeFormat);



        $this->query->setMaxResults(self::PAGE_SIZE);

        $this->query->setFirstResult(0);

    }



    public function next(): void

    {

        $this->iterator->next();



        if (!$this->iterator->valid()) {

            $this->page++;

            $this->query->setFirstResult($this->page * self::PAGE_SIZE);

            $this->query->getEntityManager()->clear();



            $this->iterator = null;

            $this->rewind();

        }

    }

}

You need to take in account the fact that the query may already have a first result and a max result.

For example: If I want all the result from 3 to 2500, you'll have to do 3 to 1000 then 1000 to 2000 then 2000 to 2500.

VincentLanglet avatar Aug 05 '20 16:08 VincentLanglet

Doesnt work if there is 3000 result in my example, because you'll export all of them even if I would like to stop at 2500.

In construct OriginalFirstResult = query->getfirstresult(); OriginalMaxResult = query->getMaxResult()

Query->setMaxResult(min(originalMaxResult, originalFirstResult + page size))

Then, the check would be If iterator not valid and originalMaxResult > currentMaxResult (where currentMaxResult = query->getMaxResult())

SetMaxResult((min(originalMaxResult, currentMaxResult + page size)) SetFirstResult(currentFirstResult + page size)

With this formula you dont need the page property, neither the originalFirstResult property ; just the originalMaxResult property.

I'm on phone, if it's not clear enough, I'll improve my message in two days.

Edit: did you remove your message or I'm crazy ? 😂

VincentLanglet avatar Aug 05 '20 17:08 VincentLanglet

If we use pagination, then we may not be sure that the next query will return next data. For example, if you sort in descending order by creation date (from new to old), then at one point it may turn out that there will be several duplicate records in the upload. Also, the database does not guarantee that the data will be selected in the same sequence using limits and offsets. This can be solved by resetting the ascending sorting ID. And in the next request, take records strictly older than this value, or vice versa.

kirya-dev avatar Aug 05 '20 18:08 kirya-dev

I think we just left note in README that for using paginate export need not changed and ordered data.

kirya-dev avatar Aug 05 '20 19:08 kirya-dev

If we use pagination, then we may not be sure that the next query will return next data. For example, if you sort in descending order by creation date (from new to old), then at one point it may turn out that there will be several duplicate records in the upload. Also, the database does not guarantee that the data will be selected in the same sequence using limits and offsets. This can be solved by resetting the ascending sorting ID. And in the next request, take records strictly older than this value, or vice versa.

Oh yes indeed. Maybe @greg0ire can help about this kind of problem. But anyway, currently the export doesn't work when there is too much data, so any improvement is good to take.

If you start a PR, the reviews can lead us to the right way ;)

VincentLanglet avatar Aug 06 '20 19:08 VincentLanglet

You can fix such issues with cursor based pagination: https://use-the-index-luke.com/no-offset

greg0ire avatar Aug 06 '20 19:08 greg0ire

So:

  1. Validate: Initial query must havnt offset & limit.
  2. Validate: must have order by in initial query.
  3. Add limit (PAGE_SIZE) for query.
  4. Extract seek field from order by.
  5. Fetch first page.
  6. Do export.

If count(results) = PAGE_SIZE:

  1. Add seek clause. If query contains where clause wee need wrap it into parenthesis and add and seek clause. (Avoid repeat this step)
  2. Extract last seek value for previous results. (From last results)
  3. Fetch next results.
  4. Go to 6

kirya-dev avatar Aug 09 '20 12:08 kirya-dev

Why 1 ? Cant you use the max result and the offset provided ?

Why 2 ? Cant you add one order if needed ?

VincentLanglet avatar Aug 09 '20 13:08 VincentLanglet

Why 1 ? Cant you use the max result and the offset provided ?

Extra where in 7 broke offset & limit.

Why 2 ? Cant you add one order if needed ?

We must order results for correct work this method. Yes, we can add order by if needed. But problem: What field? For ex: id can not exists (Or provide this field name in constructor?)

kirya-dev avatar Aug 09 '20 14:08 kirya-dev

@kirya-dev do you want to start a Pr ?

I think it would be easier to discuss about this with code and some tests.

VincentLanglet avatar Aug 09 '20 14:08 VincentLanglet

@kirya-dev Hi, I tried to use your solution with latest version of exporter, but export is still crashing on memory (even with 500MB memory and only 15k records). Could you please point me at right direction? Thanks in advance.

nowaja avatar Feb 02 '21 22:02 nowaja

@kirya-dev Hi, I tried to use your solution with latest version of exporter, but export is still crashing on memory (even with 500MB memory and only 15k records). Could you please point me at right direction? Thanks in advance.

Hi! Please ensure that decorators are enabled and you are using custom source interator.

Also can help you https://www.php.net/manual/en/function.flush.php

kirya-dev avatar Feb 03 '21 06:02 kirya-dev

Do you still plan to make some PR in order to optimize the export @kirya-dev ?

I'm getting some Error: Maximum execution time of ... seconds exceeded error when exporting via excel. So I'm looking for way to fix this.

VincentLanglet avatar May 17 '21 13:05 VincentLanglet

Hello! Decide this problem is no simple task. Good idea was suggest to using database cursor. Doctrine dosent supports this functionality officcialy. But im found package for resolve this missing https://github.com/paysera/lib-pagination What do you think about it?

kirya-dev avatar May 17 '21 21:05 kirya-dev

https://github.com/paysera/lib-pagination What do you think about it?

We won't add a dependency to a low maintenance library, but

You can fix such issues with cursor based pagination: https://use-the-index-luke.com/no-offset

Cursor based pagination was the advice of @greg0ire. So we could try something similar.

VincentLanglet avatar May 17 '21 21:05 VincentLanglet

Implementations for many platforms can be different. We must implements this feature for every popular database platforms.. Its a big work

kirya-dev avatar May 18 '21 11:05 kirya-dev

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

github-actions[bot] avatar Nov 14 '21 12:11 github-actions[bot]