spout icon indicating copy to clipboard operation
spout copied to clipboard

Implement \Countable interface

Open sandrokeil opened this issue 9 years ago • 18 comments

It would be great to implement a \Countable interface. Simply loop over rows/lines without extracting data.

What do you think?

sandrokeil avatar Jul 19 '16 09:07 sandrokeil

The main problem implementing Countable is that you have to go through the entire XML file storing the sheet's structure to get the count. And if you want the values, you have to go through it one more time. For large spreadsheets, this is really time/resources consuming and I don't think the benefits are big enough to add this. But I may be wrong...

What's your use case?

adrilo avatar Jul 20 '16 06:07 adrilo

I have an import on CLI and I want to show a progress bar / number of entries (Symfony console is used). I know that this is time consuming, but without converting it's faster as my current workaround (iterate over all entries and increase a count variable, after that, the real import is started).

Have you another idea how to show a progress bar?

sandrokeil avatar Jul 20 '16 15:07 sandrokeil

I did a quick performance test to see how long it would take to get a count. For an ODS file containing 1 million rows (and 3 million cells), it takes between 12 seconds (using DOMDocument) and 23 seconds (using XMLReader). I expect similar results for XLSX files.

The results actually better than I expected :) So I may consider adding it to the library.

adrilo avatar Jul 20 '16 17:07 adrilo

For reference, here is my code:

function usingDOMDocument($filePath) {
    $dom = new DOMDocument();
    $dom->load($filePath);
    $nodes = $dom->getElementsByTagNameNS('urn:oasis:names:tc:opendocument:xmlns:table:1.0', 'table-row');
    return $nodes->length;
}

function usingXMLReader($filePath) {
    $xmlReader = new XMLReader();
    $xmlReader->open($filePath, null, LIBXML_NONET);

    $count = 0;
    while ($xmlReader->read()) {
        if ($xmlReader->name === 'table:table-row' && $xmlReader->nodeType === XMLReader::ELEMENT) {
            $count++;
        }
    }
    return $count;
}

adrilo avatar Jul 20 '16 17:07 adrilo

BTW there is no other way to show a progress bar, you need to get the count first :)

adrilo avatar Jul 20 '16 18:07 adrilo

@adrilo That sounds awesome. :+1: I have only 5000 rows at the moment.

sandrokeil avatar Jul 20 '16 20:07 sandrokeil

If this is implemented in 3.0 would it also be possible to get the max columns for a given sheet?

beausif avatar Jun 15 '18 17:06 beausif

I need the exactly same feature for XLSX (also for the same reason; creating reports and showing the progress in the CLI).

andreas-gruenwald avatar Jan 18 '19 08:01 andreas-gruenwald

I just tried the XLSX approach and it worked like a charm: This is what I essentially added in the XLSX Row Iterator class:

namespace Box\Spout\Reader\XLSX;
class RowIterator implements IteratorInterface, \Countable {
   //... 
   protected $numTotalRows = 0;
   //...
   public function rewind() {
     //...
     $filePath = $this->xmlReader->getRealPathURIForFileInZip($this->filePath, $this >sheetDataXMLFilePath);
     $dom = new \DOMDocument();
     $dom->load($filePath);
     $nodes = $dom->getElementsByTagName('row');
     $this->numTotalRows = $nodes->length;
    //...
   }
   //...
    public function count() {
        return $this->numTotalRows;
    }
   //...
}

Can we add that to the code base?

andreas-gruenwald avatar Jan 18 '19 09:01 andreas-gruenwald

Having such option is a must.

mmihalev avatar Jan 23 '19 09:01 mmihalev

The implementation itself is not difficult. What's more difficult is telling people that getting the row count can be slow. It creates confusion as people usually thinks that's it's a simple value to read. With big spreadsheets, going through all "rows" may take a while... @andreas-gruenwald Have you tried running your code on a spreadsheet with 1 million rows? I'm curious about the perf

adrilo avatar Jan 25 '19 12:01 adrilo

@adrilo I tested with a spreadsheet consisting of 7 sheets, consisting of 100.000s of lines (some of them have between 200.000 and 500.000 lines).

Without my add-on the initial setup of the iterator took more than 50 minutes, then i stopped the script. With the add-on it takes just a couple of seconds and is super fast.

andreas-gruenwald avatar Jan 25 '19 12:01 andreas-gruenwald

Any updates on this one?

andreas-gruenwald avatar Apr 12 '19 19:04 andreas-gruenwald

Any updates or ETA for this?

ratkaiga avatar Aug 22 '20 08:08 ratkaiga

I needed a way of getting total rows of file and I've added a function in my Symfony service that works.

filename is a property in the service, you can pass it to the function instead.

It can be improved with constants, but this works for me.

    public function getTotalRows($reader)
    {
        $fileSystem = new Filesystem();
        $finder = new Finder();
        $zip = new \ZipArchive();
        $dom = new \DOMDocument();
        $baseName = pathinfo($this->fileName)['basename'];
        $path = sys_get_temp_dir() . '/' . $baseName;
        $nodesCount = 0;

        if ($reader instanceof XLSXReader) {
            $xmlPath = $path . '/xl/worksheets';

            if ($zip->open($this->fileName) === TRUE) {
                $zip->extractTo($path);

                $finder->files()->in($xmlPath);

                foreach ($finder as $xmlFile) {
                    $dom->load($xmlFile->getRealPath());

                    $nodes = $dom->getElementsByTagName('row');
                    $nodesCount += $nodes->length - 1; // to skip header
                }

                $fileSystem->remove($path);
            }

            return $nodesCount;
        }

        if ($reader instanceof ODSReader) {
            $xmlPath = $path . '/content.xml';

            if ($zip->open($this->fileName) === TRUE) {
                $zip->extractTo($path);

                $dom->load($xmlPath);

                $nodes = $dom->getElementsByTagNameNS('urn:oasis:names:tc:opendocument:xmlns:table:1.0', 'table-row');
                $nodesCount += $nodes->length - 1; // to skip header

                $fileSystem->remove($path);
            }

            return $nodesCount;
        }

        /**
         * Ref: https://stackoverflow.com/a/43075929
         */
        if ($reader instanceof CSVReader) {
            $file = new \SplFileObject($this->fileName, 'r');
            $file->setFlags(\SplFileObject::READ_AHEAD | \SplFileObject::SKIP_EMPTY | \SplFileObject::DROP_NEW_LINE);
            $file->seek(PHP_INT_MAX);

            return $file->key() + 1;
        }

        return null;
    }

Tersoal avatar Oct 03 '20 19:10 Tersoal

Note that @Tersoal's solution for CSV does not take into account values that contain line breaks. For a proper count, there's no other way than parsing every row.

adrilo avatar May 15 '21 09:05 adrilo

I see. The method must be improved. Also it's coupled to symfony. Sorry for the inconvenience.

Tersoal avatar May 15 '21 10:05 Tersoal

@Tersoal No worries, the comment was more a note for myself that it was not fully compliant. The other implementations look good though :)

adrilo avatar May 15 '21 14:05 adrilo