
Fill the table with a few threads

Open apb12 opened this issue 1 year ago • 9 comments

This works:

    public void writeTable(Table table, List<String> requiredColumns, List<DTO> indexList) {
        for (int i = 0; i < requiredColumns.size(); i++) {
            writeValue(table, indexList, requiredColumns.get(i), i);
        }
    }

This does not work:

public void writeTable(Table table, List<String> requiredColumns, List<DTO> indexList) {
        ExecutorService executorService = Executors.newFixedThreadPool(requiredColumns.size());
        for (int i = 0; i < requiredColumns.size(); i++) {
            int finalI = i;
            executorService.submit(() -> writeValue(table, indexList, requiredColumns.get(finalI), finalI));
        }
        executorService.shutdown();
        try {
            executorService.awaitTermination(20, TimeUnit.MINUTES);
        } catch (InterruptedException e) {
            // restore the interrupt status instead of swallowing it
            Thread.currentThread().interrupt();
        }
    }

Common method:

    private void writeValue(Table table, List<DTO> indexList, String columnName, int columnNumber) {
        table.getCellByPosition(0, columnNumber).setStringValue(columnName);
        int rowNumber = 1;
        for (DTO index : indexList) {
            if (columnName.equals("id")) {
                table.getCellByPosition(rowNumber++, columnNumber).setStringValue(index.getId());
            }
            if (columnName.equals("name")) {
                table.getCellByPosition(rowNumber++, columnNumber).setStringValue(index.getName());
            }
            // ... one branch per remaining column
        }
    }

If I do it in a single thread it works, but it takes too long: with 14 columns and 590k rows it takes more than 30 minutes.

I've tried to do it with the code above in multithreading (each column is written in its own thread) and it runs really fast, but the file I get is not readable.

Is there any chance to do it faster? My app is based on your lib. Any idea would be appreciated.

apb12 avatar Sep 11 '24 07:09 apb12

  1. ODFDOM isn't thread-safe, so the client needs to make sure it's only accessed by one thread at a time.

  2. The ODFDOM architecture isn't really a good fit for processing "huge" spreadsheets; it's much better to use LibreOffice Calc for big spreadsheets, with its highly optimized C++ core... you can use it from Java via the UNO API.

  3. If you want to speed up ODFDOM, start by profiling where it's spending all the time - I bet it can be made substantially faster than it is currently, but it will never be performance competitive with Calc.

mistmist avatar Sep 11 '24 08:09 mistmist

Allow me to put some further meat on Michael's answer... and forgive me if I get a bit off-topic when talking about history and design... ;-)

1a. Michael is totally correct: ODFDOM was not implemented with thread safety in mind. Still, if the threads are not editing the document simultaneously (writing in the same place) but work on specific parts, "just filling a distinct part of the table", then you "just need to make certain that they do not write simultaneously on the same data".

1b. I will not dive into the problem too deeply, but if you can manage to divide the spreadsheet into parts (ranges of rows), let the parts be created by different threads, and afterwards glue these parts together at the end, you should be safe - though I am uncertain whether some post-processing is needed (references?). For instance, 300k rows could be split into three chunks of 100k. The interesting question is how you mark these parts. Perhaps this is very use-case specific: you know it is a large table and you know that the action to be performed could run in multiple threads.

2+3. Some history... We started work on the ODF Toolkit, and first on ODFDOM, decades ago. From the beginning we generated the ODFDOM Java sources from the RelaxNG XML grammar, but we could not use the existing Java XML binding (JAXB), as JAXB works only with XSD grammars, not RNG. Therefore we generated the Java sources ourselves, first with XSLT; later we split the complexity within the XSLT into

a) reading the RNG into the Multi Schema Validator (MSV) and wrapping the MSV model with our Generator (Schema2Template) module https://odftoolkit.org/generator/index.html

b) filling templates with Apache Velocity (a template engine) https://velocity.apache.org/engine/2.0/user-guide.html, all glued together by the same Generator (Schema2Template) module

The focus of our approach was to make the code developer-friendly: we created a DOM model for the ODF XML tree(s) of the ODF document. It is easy to debug, and even third-party XSLT can be applied directly to such an ODFDOM document - see the XSLT Runner module https://odftoolkit.org/xsltrunner/ODFXSLTRunner.html. But a spreadsheet is an anomaly in any document format, as it is a two-dimensional array within a tree structure. A spreadsheet can become arbitrarily big, and some (hopefully past) naive ODF usages - for example, applications emulating a table background colour by repeating thousands of rows and setting the background colour in each cell, where every row and cell is created as a DOM object - would certainly kill performance and consume far too much memory for that kind of feature; it is simply inefficient and unnecessary.

To avoid creating the larger DOM, some program designs sought to express the loaded XML data with as little memory as possible - the Xerces XSD validator, for instance, uses bits for boolean attributes (not cells, but the same problem) instead of full Java objects. This is just one of the optimizations Michael is likely talking about. "Never as fast as Calc", as Michael stated, is a strong word, but "never with a DOM implementation" is surely correct. Still, representing the ODF XML at runtime as a DOM is in the end just an implementation detail.

We can go even a step further: if you load a document in LibreOffice, the user often does not know what syntax the document was loaded from. It could have been ODT, DocBook, DOC, DOCX, RTF, etc.; as long as the feature set fits into the document syntax - and there is an importer/exporter - there is no difference. Much like a liquid poured into a glass, the features can be saved and reopened in various file-format syntaxes. Therefore, nowadays I would even "layer away"/hide the specific XML dialect (syntax) behind a semantic user layer. Such a semantic layer would focus on concepts like paragraphs, tables, characters and styles, which are all used across many file formats, just with varying levels of feature support per syntax (for instance, plain-text files are very poor in formatting & styles, but Markdown can rescue the basic features). Wrapping all these syntaxes with a semantic layer would have been a good idea from the beginning. To embrace multithreading from the start, I assume even our file format would require some refactoring, and honestly, to me it does not matter whether a format uses XML, JSON or binary variations for its syntax, as long as there are open-source transformations, available to everyone at standardisation level, to change from one variant into the other...

Back to your question: the most important tip and next step for you was already given by Michael. Take a Java profiler... :-)

Good luck! Svante


svanteschubert avatar Sep 11 '24 11:09 svanteschubert

svanteschubert thanks for this deep answer. But I'm not sure about 1a. This code does not write data in the same place. All it does is fill each column completely from top to bottom, so if you have 14 columns, 14 threads will fill all of them, and every thread writes only to its own column. That means threads will never write to the same cell; it's impossible, because every thread gets its own column number and works with it only. Thread 1 gets the cell at position (0, 0), writes data to it, then goes down to the cell at position (1, 0), then (2, 0). Thread 2 gets the cell at position (0, 1), writes data to it, then goes down to (1, 1), then (2, 1).

And it does not work - the file is not readable.

If I do it in a single thread there are no issues and the file is readable.

So I think ODF isn't thread-safe even when threads write to different coordinates of the table. Maybe it works for rows: the first thread fills the first 100k rows, the next one fills the next 100k, and finally we get a table with 200k rows.

apb12 avatar Sep 11 '24 16:09 apb12

@apb12 When you unzip the ODF document file (rename .ods to .zip and unzip), you will see that ODF in its content.xml - like other document formats - puts the cells and their content into the rows. A column has no cells or cell content, only attributes like the column colour. If you think of a two-dimensional array like a chess board, a chess piece stands equally in a certain row and a certain column, but to avoid data duplication the cell is stored only in the row. Therefore, you might consider splitting and filling the table by rows and not by columns, as otherwise writes to shared state are more likely, with all threads appending cells to the same rows.
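For illustration, a simplified sketch of what such a table looks like in content.xml (a hypothetical 2x2 example, namespace declarations omitted) - note that the cells live inside the rows, while columns carry only formatting attributes:

```xml
<table:table table:name="Sheet1">
  <!-- columns hold no cell content, only attributes such as styles -->
  <table:table-column table:number-columns-repeated="2"/>
  <table:table-row>
    <table:table-cell office:value-type="string"><text:p>id</text:p></table:table-cell>
    <table:table-cell office:value-type="string"><text:p>name</text:p></table:table-cell>
  </table:table-row>
  <table:table-row>
    <table:table-cell office:value-type="string"><text:p>1</text:p></table:table-cell>
    <table:table-cell office:value-type="string"><text:p>Alice</text:p></table:table-cell>
  </table:table-row>
</table:table>
```

So filling "by column" means every thread repeatedly touches every row element, whereas filling "by row" keeps each thread inside its own subtree.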

You need to debug with a Java IDE - I usually use NetBeans and IntelliJ IDEA - as you need to see what is happening and what corrupts the model in the end. Do you already know what exactly is corrupted? Have you taken a look at the ODF XML? I guess that there is concurrent write access to some static member or shared objects, either directly in ODFDOM or indirectly in the underlying Xerces DOM implementation.

I think you should change your scenario to threads writing into cells of different rows (e.g. 3 threads sharing access, with the first thread getting the first third of the rows, the second thread the second third, etc.).
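To illustrate the row-wise partitioning idea, here is a minimal sketch. Everything in it is hypothetical: a plain String[][] stands in for the ODFDOM table (whose safety under concurrent row writes is exactly what is in question here), and writeByRowChunks/fillChunk are made-up helper names, not ODF Toolkit API.

```java
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class RowPartitionDemo {

    // Each thread fills its own contiguous, disjoint range of rows [from, to).
    static void fillChunk(String[][] target, List<String[]> rows, int from, int to) {
        for (int r = from; r < to; r++) {
            String[] source = rows.get(r);
            for (int c = 0; c < source.length; c++) {
                // stand-in for getCellByPosition(...).setStringValue(...)
                target[r][c] = source[c];
            }
        }
    }

    public static String[][] writeByRowChunks(List<String[]> rows, int cols, int threads) {
        String[][] target = new String[rows.size()][cols];
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        int chunk = (rows.size() + threads - 1) / threads; // ceiling division
        for (int t = 0; t < threads; t++) {
            int from = t * chunk;
            int to = Math.min(from + chunk, rows.size());
            pool.submit(() -> fillChunk(target, rows, from, to));
        }
        pool.shutdown();
        try {
            pool.awaitTermination(1, TimeUnit.MINUTES);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return target;
    }
}
```

Even with disjoint row ranges, corruption is still possible if the library mutates shared internal state (for example when splitting repeated rows), so any real attempt should be verified by inspecting the resulting content.xml.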

It requires some investigation. You might want to give it a try! ;-)

Good luck! Svante


svanteschubert avatar Sep 11 '24 17:09 svanteschubert

svanteschubert I agree with you; it seems we can only split the work across threads by rows, not by columns. I will try it out. Thanks for answering.

apb12 avatar Sep 11 '24 18:09 apb12

svanteschubert I also want to say that there is no error or exception, so debugging does not show anything helpful in this case. Everything goes fine until you try to open the file xD

apb12 avatar Sep 11 '24 18:09 apb12

@apb12 Can you unzip the result and take a look at the content.xml? Just curious! BTW, if you want to debug Xerces-J - the underlying DOM implementation - I have incidentally pushed a testing branch to GitHub that builds with NetBeans & IntelliJ IDEA: https://github.com/svanteschubert/xerces-j/tree/smoke-testing. The original Xerces uses an ANT file, which did not work for me...


svanteschubert avatar Sep 11 '24 20:09 svanteschubert

@apb12 To get a debuggable Xerces-J (Java) you should rather download the latest Xerces-J as a ZIP https://xerces.apache.org/mirrors.cgi#binary, as all the required JARs are at the root level there. For instance, for the NetBeans IDE https://netbeans.apache.org/front/main/index.html you do the following:

  1. Create a new project with existing Java sources
  2. Choose JDK 8
  3. Add the root src and the sample directory as Java sources
  4. Add all JARs from the root (aside from XercesImpl.jar, which you are going to build yourself) to your class path
  5. The JUnit and Hamcrest JAR dependencies are added by NetBeans when you fix the project (context menu on the project).

Happy debugging, Svante


svanteschubert avatar Sep 12 '24 16:09 svanteschubert

@apb12 Did you run it with a profiler? If so, is the greatest part of the 30 minutes spent inside the ODF Toolkit, or is it because generating the values takes time and the next write can only occur when the current one is finished? If it is the latter, you should use another approach:

  • use different threads to create the data to be inserted into the table
  • use a single thread to make the updates
  • coordinate it either by writing your own code or by using the Flow API introduced in Java 9
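A minimal sketch of that split, hand-rolled with a BlockingQueue rather than the Flow API. All names and data shapes are hypothetical, and a plain String[][] stands in for the document table: the point is that only the single consumer thread ever touches it, which is compatible with ODFDOM not being thread-safe.

```java
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;

public class SingleWriterDemo {

    // A computed cell value waiting to be applied by the single writer thread.
    record Cell(int row, int col, String value) {}

    static final Cell POISON = new Cell(-1, -1, null); // end-of-stream marker

    public static String[][] fill(List<String[]> rows, int cols, int producers) {
        String[][] target = new String[rows.size()][cols]; // stand-in for the ODF table
        BlockingQueue<Cell> queue = new LinkedBlockingQueue<>();
        ExecutorService pool = Executors.newFixedThreadPool(producers + 1);
        CountDownLatch produced = new CountDownLatch(producers);

        // Producer threads: compute cell values in parallel, one row chunk each.
        int chunk = (rows.size() + producers - 1) / producers;
        for (int t = 0; t < producers; t++) {
            int from = t * chunk, to = Math.min(from + chunk, rows.size());
            pool.submit(() -> {
                for (int r = from; r < to; r++)
                    for (int c = 0; c < cols; c++)
                        queue.add(new Cell(r, c, rows.get(r)[c]));
                produced.countDown();
            });
        }
        // One extra task enqueues the poison pill once every producer is finished.
        pool.submit(() -> {
            try {
                produced.await();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            queue.add(POISON);
        });

        // Consumer: the ONLY thread touching the (non-thread-safe) document model.
        try {
            for (Cell cell = queue.take(); cell != POISON; cell = queue.take()) {
                target[cell.row()][cell.col()] = cell.value(); // e.g. setStringValue(...)
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        pool.shutdown();
        return target;
    }
}
```

This only pays off when computing the values is the expensive part; if the DOM writes themselves dominate, a single queue consumer is just the single-threaded case again. In a real run with 590k rows, a bounded queue would also be advisable to cap memory use.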

xzel23 avatar May 18 '25 04:05 xzel23