xlsx-populate icon indicating copy to clipboard operation
xlsx-populate copied to clipboard

Exported workbook file size increases drastically with styles

Open JuanIrache opened this issue 5 years ago • 26 comments

Hi,

Thanks for this fantastic module.

I'm using it for exporting a database to xlsx in a node app. The data I work with is about 900 rows and 70 columns. The exported file size is about 400KB if I leave the plain cell style, but about 1,250KB if I apply colors (fill) and borders to them. Is that to be expected? If I open the plain version and apply similar styles directly in LibreOffice Calc, the file size is even reduced to about 350KB.

I'm using v1.19

Thanks

JuanIrache avatar Feb 04 '19 12:02 JuanIrache

I looked into the code, applying style to m*n cells will create m*n styles, that is why the file size is large. This module is not yet optimized for setting a range of styles.

LesterLyu avatar Feb 04 '19 21:02 LesterLyu

Hi, thanks for your answer, but I'm not sure I understand if I can do something about the problem. In order to set the styles, I'm doing things like

sheet.row(1).style("topBorderColor", 'ffe900');
sheet.column("AA").style("fill", 'ccffd8');

Is there a way I can optimise this? Or will there be a fix?

Thanks

JuanIrache avatar Feb 05 '19 00:02 JuanIrache

Update: I thought the problem were the colours of the cells, but removing them didn't provide much of an improvement. Removing the border did. The issue is pretty much gone once there is no border.

JuanIrache avatar Feb 23 '19 15:02 JuanIrache

Unfortunately, the current style implementation isn't very efficient. It creates separate style objects per cell. Excel and OO will combine styles that are the same for multiple cells into single instances.

dtjohnson avatar Mar 01 '19 01:03 dtjohnson

@dtjohnson Could you provide some links in regards to the specification or a high-level overview I could follow to begin building this optimization? I believe, with large workbooks, this is preventing excel from opening my files after I have applied styles. /shrug

mdgozza avatar Jun 20 '19 00:06 mdgozza

@LesterLyu If I created a Style object manually and provided it to cells usingcell.style() call, would they use / reference the same style object?

In Cell.js

.case('Style', style => {
                this._style = style;
                this._styleId = style.id();

                return this;
            })

mdgozza avatar Jun 20 '19 02:06 mdgozza

I believe I've built an MVP solution for the individual styles per cell problem. I'll explain below and if you'd like me to continue with this approach, or similar, I'd be happy to commit back :)

In my use case, I have alot of large range copies / row copies ( values and styles, not formulas)

I built a file that will take in a given StyleSheet, which should belong to the destination WorkBook and it will take in a styleProperties object ( retrieved from a cell with cell.styles([...]) ) and return an instance of a Style object.

If I have seen the request for this style properties object before ( by checking for deep equality ) I will simply return the earlier created style instance ( which has it's associated id etc.. )

Afterwards when I preform the set operation with cell.style(Style) the application just uses that ID as opposed to creating a new Style instance. ( Since I noticed that every cell has a style object, this should just change that reference? )

A potential major downside to this approach is that once cells styles have been pointed to the same style object reference, I expect that programmatically changing a cells style at the individual level may inadvertently affect all other cells who share that reference. ( This is a similar approach to exceljs seen here: https://github.com/exceljs/exceljs#styles )

Thinking to the future, these are some high level suggestions

  • When initially parsing the xml, we can create linked style object references ( this will cut down file size bloat we see off the top )
  • From the Cell, Range, and Row api perspective, we can allow a new style object created fresh at the cell level, to override an existing link as opposed to editing one

Here is my example work around solution that only created 12 additional style nodes instead of 950+ which I consider a slight MVP of these possible approaches:

import { isEqual } from "lodash";
const styleNodesByWorkbook = new WeakMap();

export function getStyleNode(styleProperties: any, StyleSheet: any) {
  if (styleNodesByWorkbook.get(StyleSheet) === undefined) {
    styleNodesByWorkbook.set(StyleSheet, []);
  }

  const createdStyleNodes = styleNodesByWorkbook.get(StyleSheet);

  const matchedNode = createdStyleNodes.find((node: any) => {
    return isEqual(styleProperties, node.styleProperties);
  });
  if (matchedNode) {
    return matchedNode.styleNode;
  }
  // we need to create a new style object to match these
  // style properties
  const newStyleNode = StyleSheet.createStyle();
  // set the properties on this style node;
  Object.keys(styleProperties).forEach(property => {
    const value = styleProperties[property];
    newStyleNode.style(property, value);
  });
  createdStyleNodes.push({ styleProperties, styleNode: newStyleNode });
  return newStyleNode;
}

And an example of how I use it

export function copyRanges(
  sourceRange: any,
  destRange: any,
  destinationWorkbook: any
) {
  const StyleSheet = destinationWorkbook._styleSheet;

  sourceRange.forEach((cell: any, rowIndex: number, columnIndex: number) => {
    const destCell = destRange.cell(rowIndex, columnIndex);

    // get style object from current cell
    const style = cell.style(["numberFormat", "fill", "fontColor"]);

    const styleNode = getStyleNode(style, StyleSheet);

    // set the values of the destination cell
    destCell.value(cell.value());

    // set the styles of the cell
    destCell.style(styleNode);
  });

EDIT:

To add to my initial analysis, when I took my file and doubled the number of rows & column from ( 1013 x 38 ) to ( 2026 x 38 ) I was once again not able to open the generated .xlsx file in excel even though with my new enchantment I only created 12 unique new style nodes.

I'm only guessing, but I suppose I can attribute that to individual style objects being created at the initial xml parse? But I can't be sure...

mdgozza avatar Jun 20 '19 04:06 mdgozza

I believe your approach is not bad but I would keep the style per cell as it is with a little twist: when generating the xml, I would create an upper level of unique styles where every cell style will point to. This way you can change individual cells and only at building the uniqueness is checked.

dresende avatar Feb 26 '20 15:02 dresende

Hi @mdgozza and @dresende, I also would really like to see this issue fixed, is there any way I can help? I don't have knowledge of how xlsx works internally (except that it is a bunch of xmls)...

Also, do you know any other library that can handle this? (it does not have to be in Node, could be any other language)

papb avatar Jun 15 '20 02:06 papb

This is a pandora box. What happens behind the scenes is that every time you style a range/cell you create a style that is stored, even it if exists already (with the same styling properties). I tried an approach on saving that would try to find duplicated styles and relink to only just one copy of each but ran out of time to work on this.

dresende avatar Jun 15 '20 08:06 dresende

@dresende Thanks for the reply!

[...] every time you style a range/cell you create a style that is stored [...]

I am curious - What I did is just a few calls to sheet.usedRange().style() (12 calls in fact, I can test with less later), with about 150 thousand cells in the range. From what you said, only 12 styles should be created, is 12 styles too much already?

papb avatar Jun 15 '20 12:06 papb

It’s been a long while since I’ve looked at this.

@dresende your approach is definitely better, maybe one day I will look to implement it.

@papb I didn’t think you could apply styles to a range, rather only a cell? I’m only basing this assumption off my previous comment though.

12 styles should be fine, unless of course the code applies the style to each cell in the range. Which would give you 150,000 styles objects. You would have to review that code to be sure, or the output xml.

The best approach is to deduplicate style objects at build time of the xml. Walk the tree and find any matching styles and replace their reference to one style node, making sure to delete the original. I believe this is what OO does.

mdgozza avatar Jun 15 '20 12:06 mdgozza

I am curious - What I did is just a few calls to sheet.usedRange().style() (12 calls in fact, I can test with less later), with about 150 thousand cells in the range. From what you said, only 12 styles should be created, is 12 styles too much already?

It's probably each cell individually that gets a style, I don't remember now.

I tried another approach but then I realised you could change any style in the mean time so to really make it scale I couldn't find a match when you apply a style but only when saving/creating the file. It shouldn't be too hard but since I took a substantial time understanding and reverse engineering the output, I had little time then. I since needed to move to another project and this was kept on hold. I really want to help but am out of time right now.

dresende avatar Jun 15 '20 14:06 dresende

I have read through the thread. I don't have much knowledge of Excel formats, my info is from README.MD of this project. Bunch of XML...

If I missed the problem, sorry for the loss of your time.


As I read through I got an idea that might be able to solve this.

{ border: 'thin' } =! { border: 'thin'}, but JSON.stringify({ border: 'thin'}) == JSON.stringify({ border: 'thin'})!

Note that I use JSON.stringify, but here it is just a method to transform an object to a string.

This will however break if keys were defined in a different order! That could be mitigated if custom parsing would be used. One that orders keys of styles alphabetically (, or any other defined way).

With a way to determine if 2 styles are equal, we can then check if the style is present and only change the styleID of the cell to the one that already exists.

  • Create new Map<stringifiedStyle, StyleID>() for stringified styles ('<style border="thin" fill="fff"' /> will do just fine!)
  • For every cell create style string (XML as string is fine)
    • Check if this stringified style is in set of already added styles
    • If yes, change styleID of cell to the new one. (cell.styleID = map.get(stringifiedStyles))
    • If no, add it to the map (map.set(stringifiedStyles, nextStyleID++))
  • Repeat for every cell

Akxe avatar Feb 02 '21 20:02 Akxe

Hi @Akxe, I think your idea makes sense! I suggest using hash-obj instead of JSON.stringify to make key order irrelevant. Would you be willing to submit a PR?

papb avatar Mar 10 '21 14:03 papb

@papb I would be against it because hash-obj is node only library I would suggest using sort-keys and actually the JSON... :/

I could do the PR, but I am out of time. I will need it in a month, so if that is a reasonable horizon for you, I will do it about then... (hopefully :D)

Akxe avatar Mar 10 '21 14:03 Akxe

@papb I would be against it because hash-obj is node only library I would suggest using sort-keys and actually the JSON... :/

Makes sense!!

I could do the PR, but I am out of time. I will need it in a month, so if that is a reasonable horizon for you, I will do it about then... (hopefully :D)

Awesome!

papb avatar Mar 10 '21 14:03 papb

To anyone who's still struggling with this... If you only need a per-column style (which also caused a corrupt XLSX with around 1000 rows and 20 columns of data), there's a fairly simple solution. I created a "default" style which I'm using for each and every column in my XLSX, and extend the individual column styles from it. The code should be something like this:

    const workbook = await XlsxPopulate.fromBlankAsync();

    // Create a default template style to extend
    const defaultStyle = (workbook as any).styleSheet().createStyle();
    defaultStyle.style('wrapText', true);
    defaultStyle.style('verticalAlignment', 'top');

    // Iterate over columns
    for(let j=0;j<20;j++ ) {
        const column = workbook.sheet(0).column(j + 1);
        // Create a new style for the given column
        const columnStyle = (workbook as any).styleSheet().createStyle(defaultStyle);

        // Adding styles from styleObj which is an object with props from the Style Guide
        Object.keys(styleObj.style).forEach((s) => columnStyle.style(s, styleObj.style[s]));

        // Set column style using a Style object, which will be used by reference -> no new style created
        column.style(columnStyle);
    }

This way, the internal styles.xml definition keeps being a modest 12K file. Also, the final size of the assembled XLSX is reduced more than 50%.

Sleeper9 avatar Jan 25 '22 10:01 Sleeper9

For anyone still trying to work around this, I haven't been able to dedicate time to it, but I found a solution that works for me (20k rows table). It's a particular case but perhaps it works for you.

My tables have styles per column (actually a range, not full excel column), not per cell. But when defining the same style per cell (or range, it's just a shortcut), it will create a new style, even though it's the same as the cell above (same column). So what I do is the following:

Create the initial table headers, and as I'm navigating and creating headers, I create the style for the column and apply in the range of the rows. The difference here is that I create a style and pass it by reference. No need to update module.

let column_style = workbook.styleSheet().createStyle();

column_style.style("bold", true);
// whatever styles you want

sheet.range(row_start, col, row_end, col).style(column_style);

As I mentioned above, this works fine for a spreadsheet with a table with 10-20 columns and 20k rows.

dresende avatar Feb 16 '23 11:02 dresende

Note that you can't change a cell style individually after this or you will affect all cells in that column. For that you need to make some kind of style reset to that cell and start over.

dresende avatar Feb 16 '23 12:02 dresende

I analyzed code of @mdgozza and @Sleeper9, and I found more simple way.

    const styleCachedMap = new Map();

    function getStyleCached(styles) {
        const cacheKey = JSON.stringify(styles);
        let styleCached = styleCachedMap.get(cacheKey);
        if (styleCached) {
            return styleCached;
        }

        styleCached = book.styleSheet().createStyle();
        for (const [key, value] of Object.entries(styles)) {
            styleCached.style(key, value);
        }
        styleCachedMap.set(cacheKey, styleCached);
        return styleCached;
    }

    const styleCached = getStyleCached(styles);
    const cell = sheet.cell(addr);
    // Use this code instead "cell.style(styleCached);" because it raise 'Unknown type: Style' error
    cell._style = styleCached;

doctorgu avatar Jun 20 '23 08:06 doctorgu

@dresende

let column_style = workbook.styleSheet().createStyle();

column_style.style("bold", true);
// whatever styles you want

sheet.range(row_start, col, row_end, col).style(column_style);

I tried the above solution and it is working on debug mode but I am getting "Unknown type: Style" error in production build. I am using xlsx-populate with next js. Any idea what could be the problem?

zainalihashmi avatar Dec 21 '23 08:12 zainalihashmi

@zainalihashmi I use exactly that. I apply both to cells and cell ranges. I'm using version 1.21.0

dresende avatar Dec 21 '23 09:12 dresende

@dresende

let column_style = workbook.styleSheet().createStyle();

column_style.style("bold", true);
// whatever styles you want

sheet.range(row_start, col, row_end, col).style(column_style);

I tried the above solution and it is working on debug mode but I am getting "Unknown type: Style" error in production build. I am using xlsx-populate with next js. Any idea what could be the problem?

Were you able to resolve this issue 'Unknown type: Style' ?

manjunathr92 avatar Jan 17 '24 07:01 manjunathr92

@manjunathr92 I could not resolve the issue with xlsx-populate. I moved to the exceljs and tested it with upto 5 million styled cells and it seems to be working fine.

zainalihashmi avatar Jan 17 '24 09:01 zainalihashmi

@dresende

let column_style = workbook.styleSheet().createStyle();

column_style.style("bold", true);
// whatever styles you want

sheet.range(row_start, col, row_end, col).style(column_style);

I tried the above solution and it is working on debug mode but I am getting "Unknown type: Style" error in production build. I am using xlsx-populate with next js. Any idea what could be the problem?

Use _style instead of style. Refer my edited code: https://github.com/dtjohnson/xlsx-populate/issues/188#issuecomment-1598311616

doctorgu avatar Jan 31 '24 05:01 doctorgu