DataCleaner icon indicating copy to clipboard operation
DataCleaner copied to clipboard

Using Excelsheet for output severly degrades performance

Open LosD opened this issue 8 years ago • 4 comments

Originally submitted by @jacques-hi: When a DC job is configured to write to an excel sheet, data cleaner becomes so slow that editing the job becomes nearly impossible. This only happens if the excel sheet actually contains (a reasonable amount of) data. I looks as if datacleaner loads the complete output file (with results from the previous time the job has run) in memory.

LosD avatar May 31 '16 06:05 LosD

I tested this issue by taking the customers.csv which comes with DataCleaner and copying/pasting all records in it until it was 10 times its original size (from 5115 records to 51150 records). Then I built a job using the larger customers.csv as a datastore and just adding a Create Excel spreadsheet writer to it which just write all values of the customers datastore to the excelsheet. When you run the job you see it slowly slows down. On my computer at first it takes about 4 seconds to process 2000 records. That keeps taking more time and at the end it takes over 10 seconds to process 2000 records.

By adding some logging I saw that the org.datacleaner.job.runner.ConsumeRowHandlerDelegate#consume kept taking more and more time, which can be taken as a starting point for further investigating this issue.

arjansh avatar Jun 02 '16 09:06 arjansh

I configured the Create Excel spreadsheet writer in this manner:

        <analyzer>
            <descriptor ref="Create Excel spreadsheet"/>
            <metadata-properties>
                <property name="CoordinatesX">306</property>
                <property name="CoordinatesY">160</property>
            </metadata-properties>
            <properties>
                <property name="Fields" value="[id,given_name,family_name,company,address_line,post_code,city,country,email,birthdate,gender,job_title,income_amount,income_currency]"/>
                <property name="File" value="C:/Temp/DataCleaner-staging.xlsx"/>
                <property name="Overwrite sheet if exists" value="true"/>
                <property name="Sheet name" value="test"/>
            </properties>
            <input ref="col_id" name="Columns"/>
            <input ref="col_given_name" name="Columns"/>
            <input ref="col_family_name" name="Columns"/>
            <input ref="col_company" name="Columns"/>
            <input ref="col_address_line" name="Columns"/>
            <input ref="col_post_code" name="Columns"/>
            <input ref="col_city" name="Columns"/>
            <input ref="col_country" name="Columns"/>
            <input ref="col_email" name="Columns"/>
            <input ref="col_birthdate" name="Columns"/>
            <input ref="col_gender" name="Columns"/>
            <input ref="col_job_title" name="Columns"/>
            <input ref="col_income_amount" name="Columns"/>
            <input ref="col_income_currency" name="Columns"/>
        </analyzer>

arjansh avatar Jun 02 '16 09:06 arjansh

I pinpointed it a bit more. It seems like most time is spent in the org.apache.metamodel.excel.ExcelDataContext#executeUpdate(UpdateScript) method, which first spends a significant portion of time on invoking UpdateScript#run(UpdateCallback) and then another significant portion of time on invoking UpdateCallback#close().

arjansh avatar Jun 02 '16 11:06 arjansh

I believe that Apache POI does not release the memory after usage, therefore causing DC slowness.

2016-06-09 13_48_57-java visualvm

ClaudiaPHI avatar Jun 20 '16 13:06 ClaudiaPHI