excel-templates icon indicating copy to clipboard operation
excel-templates copied to clipboard

Rewrite to manipulate XML directly without POI

Open tomfaulhaber opened this issue 8 years ago • 5 comments

The POI library was a great way to start working on this library, but at this point it might be causing more complexity than it's saving us from.

Some of the issues with POI are:

  • POI relies on mutation of the underlying spreadsheet object which is not really in line with the Clojure philosophy.
  • POI will modify files and can cause corruption when to many changes happen at once. To circumvent this I create multiple copies of the spreadsheet as we go along.
  • POI can not clone sheets with charts on them. This leads to bug #7 for which the workaround is quite complex.
  • POI's formula parser resolves worksheet references to the numerical index of the sheet. This means that the worksheet referenced must exist when a formula is being parsed or unparsed. This causes a fair amount of complexity because we can't delete a sheet until all the formulas that reference that sheet have been remapped to a new sheet.
  • POI does not support 100% of excel, which leads to some holes. For example, the workaround to #7 can't handle charts with absolute anchors because there's no function to recreate them when they're added back to the sheet.
  • POI's use of OOXML to create class wrappers for the XML structure of the spreadsheet can lead to real complexity. In the case of charts, I already work around this by extracting the XML and manipulating formulas directly. Similar issues come up in the manipulation of the parts of the ZIP container.

POI still does provide some powerful benefits:

  1. The formula parser decomposes formulas in a way that makes them very easy to work with (except for the issue that all sheets need to exist, mentioned above)
  2. POI supports recomputation of most formulas that makes the spreadsheets load correctly after transformation.
  3. POI is widely used and can generally be trusted to represent spreadsheets correctly.

For 1, above, we would need to write a new formula parser that would be more clojure-esque. I would love to have this for use in the current library even before switching away from POI as it would allow me to simplify the sheet management.

For 2, above, I can see two options. We might be able to get Excel to do it for us at load time (though that may mean that the delay is incurred repeatedly in large, widely distributed spreadsheets. Alternatively, we could simply load the output spreadsheet with POI once it's created and force a full recomputation.

Now that I've learned a lot more about the internal workings of the excel format, I believe that it would be possible to recast the library as a set of transformations on the XML entities that create the spreadsheet. This would be a cleaner, more Clojure-y way to do things.

Spreadsheets are built as a zip file that contains a bunch of XML files and these XML files can have relationships with one another. For example, a worksheet XML can reference a drawing XML and the drawing XML will in turn reference one or more chart XMLs. The library already manipulates these XML objects directly for handling charts.

tomfaulhaber avatar Mar 12 '16 22:03 tomfaulhaber

Hey Tom,

Man,... that sounds like a powerful enhancement to do it that way. +1 if it wrote direct to xml. Especially if you can keep the fancy stuff like charts inside the template and it just needs to do data.

sundbry avatar Mar 23 '16 06:03 sundbry

I wrote a long comment on #30 (here) that describes a lot about Office XML.

Of particular interest was my discovery of the formula grammar on page 2043 of ECMA-376, 4th Edition, Office Open XML File Formats — Fundamentals and Markup Language Reference.

tomfaulhaber avatar Apr 04 '16 01:04 tomfaulhaber

I've started working on this (and issue #30) a bit. You can follow any progress (or lack of progress) at https://github.com/smahood/excel-templates if you are interested. The Office XML reference is a lot more approachable than I expected! It's also stretching some of the areas of Clojure that I have little experience with, so it's a pretty awesome learning experience so far.

I'm approaching things assuming that we want to completely remove the dependency on POI at some point. Is that correct, or are there some of the POI features that you want to keep?

smahood avatar Apr 22 '16 21:04 smahood

@smahood: Great! At ClojureWest, Ghadi Shayban introduced a new parser called PEX (based on the the LUA PEG implementation). This might be interesting here for both clojure-y-ness (the grammar is actual clojure data structures) and performance. In any case, his talk was excellent, so I recommend it in any case.

The talk: https://www.youtube.com/watch?v=9Q--oX5muxw&list=PLZdCLR02grLq4e8-1P2JNHBKUOLFTX3kb&index=25 The library: https://github.com/ghadishayban/pex

More abstractly, I can think of 3 choices for the parser:

  1. Instaparse: easy and widely used, but maybe too slow for larger spreadsheets, plus grammars as strings has always kind of bugged me.
  2. Antlr: fast and super-mature but not so elegant to use from Clojure
  3. PEX: immature but elegant and has the potential to be very fast

I think that my inclination would be to go with PEX even though it's immature because it seems pretty cool and I'm sure that Ghadi will help us if we need it. If the immaturity becomes a problem, we can always switch horses.

One think to think about when writing the parser is that we need to both parse and generate formulas. I'd like the formulas to round-trip as seamlessly as possible. In particular, it would be nice if we didn't change the parenthesization the user used - keep parens that the used (necessary or not) and don't add them where they don't exist in the original formula.

Hope that helps!

tomfaulhaber avatar Apr 22 '16 21:04 tomfaulhaber

I've been slowly making my way through the talks videos, that's one that I've been saving for a day when I feel smarter than usual.

I think putting a strong emphasis on clojure-y-ness should be a definite goal, since otherwise we could just stick with POI. I'm guessing I'll be stuck on the easier problems for a while, so maybe PEX will be out of it's "super-alpha" stage by the time we get to formula parsing.

I also totally think that the formulas should be as close to the original, including unnecessary parentheses and any other unnecessary parts.

smahood avatar Apr 22 '16 22:04 smahood