DataCleaner
DataCleaner copied to clipboard
Using Excelsheet for output severly degrades performance
Originally submitted by @jacques-hi: When a DC job is configured to write to an excel sheet, data cleaner becomes so slow that editing the job becomes nearly impossible. This only happens if the excel sheet actually contains (a reasonable amount of) data. I looks as if datacleaner loads the complete output file (with results from the previous time the job has run) in memory.
I tested this issue by taking the customers.csv which comes with DataCleaner and copying/pasting all records in it until it was 10 times its original size (from 5115 records to 51150 records). Then I built a job using the larger customers.csv as a datastore and just adding a Create Excel spreadsheet writer to it which just write all values of the customers datastore to the excelsheet. When you run the job you see it slowly slows down. On my computer at first it takes about 4 seconds to process 2000 records. That keeps taking more time and at the end it takes over 10 seconds to process 2000 records.
By adding some logging I saw that the org.datacleaner.job.runner.ConsumeRowHandlerDelegate#consume
kept taking more and more time, which can be taken as a starting point for further investigating this issue.
I configured the Create Excel spreadsheet writer in this manner:
<analyzer>
<descriptor ref="Create Excel spreadsheet"/>
<metadata-properties>
<property name="CoordinatesX">306</property>
<property name="CoordinatesY">160</property>
</metadata-properties>
<properties>
<property name="Fields" value="[id,given_name,family_name,company,address_line,post_code,city,country,email,birthdate,gender,job_title,income_amount,income_currency]"/>
<property name="File" value="C:/Temp/DataCleaner-staging.xlsx"/>
<property name="Overwrite sheet if exists" value="true"/>
<property name="Sheet name" value="test"/>
</properties>
<input ref="col_id" name="Columns"/>
<input ref="col_given_name" name="Columns"/>
<input ref="col_family_name" name="Columns"/>
<input ref="col_company" name="Columns"/>
<input ref="col_address_line" name="Columns"/>
<input ref="col_post_code" name="Columns"/>
<input ref="col_city" name="Columns"/>
<input ref="col_country" name="Columns"/>
<input ref="col_email" name="Columns"/>
<input ref="col_birthdate" name="Columns"/>
<input ref="col_gender" name="Columns"/>
<input ref="col_job_title" name="Columns"/>
<input ref="col_income_amount" name="Columns"/>
<input ref="col_income_currency" name="Columns"/>
</analyzer>
I pinpointed it a bit more. It seems like most time is spent in the org.apache.metamodel.excel.ExcelDataContext#executeUpdate(UpdateScript)
method, which first spends a significant portion of time on invoking UpdateScript#run(UpdateCallback)
and then another significant portion of time on invoking UpdateCallback#close()
.
I believe that Apache POI does not release the memory after usage, therefore causing DC slowness.