calamine icon indicating copy to clipboard operation
calamine copied to clipboard

Extract more information at the cell level

Open de-sh opened this issue 6 months ago • 11 comments

Hi there, I am new to excel/spreadsheets in general. As part of some work relating to the extraction of information from excel files I stumbled upon calamine and wanted to know if work on a unified struct to hold both formulas and values will be appreciated, or if it is even possible. The idea is to allow the user of this interface with the ability to iterate over cells instead of Data/String and thus be able to figure out if the cell is derived from elsewhere(by use of a formula, even the cells that are spilled into).

Thanks for the wonderful library, it has made life a whole lot easier!

de-sh avatar Jun 10 '25 13:06 de-sh

Sorry I am not sure I understand what you want.

tafia avatar Jun 19 '25 04:06 tafia

Sorry I am not sure I understand what you want.

Hi @tafia, I am planning on reading more information from the spreadsheet, not just the formula in string form, but possibly also the coordinates of the affected range, this will help me figure out the cells whose values are affected by the said cell through a spill and hence make a better choice on whether to use the stored value or compute it again.

e.g. a cell A1 with a dynamic array will spill into other cells

<sheetData>
  <row r="1" spans="1:1">
    <c r="A1" cm="1">
      <f t="array" ref="A1:A3">_xlfn.SEQUENCE(3, 1)</f>
      <v>1</v>
    </c>
  </row>
  <row r="2" spans="1:1">
    <c r="A2">
      <v>2</v>
    </c>
  </row>
  <row r="3" spans="1:1">
    <c r="A3">
      <v>3</v>
    </c>
  </row>
</sheetData>

de-sh avatar Jun 19 '25 08:06 de-sh

I see. This is not a trivial change. I am of course open for anyone to make an attempt but I will probably not have time for it myself.

tafia avatar Jun 30 '25 02:06 tafia

I am of course open for anyone to make an attempt but I will probably not have time for it myself.

I may attempt this at some point. We could have a new struct type similar to Cell that had more/all metadata from the cell such as the ref range (as requested above) but also the style/format index that could be used (opaquely) to get the number format and/or the entire format for thing like background/foreground color. That might resolve requests like #474 where users want the number formatting. It would still require them to apply the formatting themselves, if that is what they needed, but it would probably resolves most people's needs.

@tafia Had you considered something like that previously? Are there any hidden problems with that approach?

jmcnamara avatar Jun 30 '25 11:06 jmcnamara

I haven't looked at this at all. I think it is very niche and as a result I think it should not affect more regular use of the library. So I see 3 possibilities:

  • reimplement a whole new set of functions with new structs for extended cells (while keeping the old one for other people)
  • add a few functions which take cells as input, it may be slow because you need to parse it twice
  • add some custom features (off by default) which would add these extra properties for the cells

tafia avatar Jul 02 '25 03:07 tafia

@tafia @jmcnamara I'm currently working on a style implementation in a fork of this crate. We're currently using Calamine for values and formulas and need style reading support without adding the overhead of a new crate (and not a lot of great options out there anyhow).

Are you accepting PRs for this work? If so, do you have anything you'd like me to consider? Would you want this behind a feature?

My plan is to only implement for xlsx, but can consider other formats.

Basic api for getting styles

fn test_worksheet_style_iter() {
    let mut xlsx: Xlsx<_> = wb("styles.xlsx");
    let styles = xlsx.worksheet_style("Sheet 1").unwrap();

    for (row, styles) in styles.rows().enumerate() {
        for (col, style) in styles.iter().enumerate() {
            println!("row: {}, col: {}, style: {:?}", row, col, style);
        }
    }
}

Here is the relevant structs so far:

struct Style {
    font: Option<Font>,
    fill: Option<Fill>,
    borders: Option<Borders>,
    alignment: Option<Alignment>,
    number_format: Option<NumberFormat>,
    protection: Option<Protection>,
    style_id: Option<u32>,
}

struct Color {
    alpha: u8,
    red: u8,
    green: u8,
    blue: u8,
}

struct Borders {
    left: Border,
    right: Border,
    top: Border,
    bottom: Border,
    diagonal_down: Border,
    diagonal_up: Border,
}

struct Font {
    name: Option<String>,
    size: Option<f64>,
    weight: FontWeight,
    style: FontStyle,
    underline: UnderlineStyle,
    strikethrough: bool,
    color: Option<Color>,
    family: Option<String>,
}

struct Alignment {
    horizontal: HorizontalAlignment,
    vertical: VerticalAlignment,
    text_rotation: TextRotation,
    wrap_text: bool,
    indent: Option<u8>,
    shrink_to_fit: bool,
}

struct Fill {
    pattern: FillPattern,
    foreground_color: Option<Color>,
    background_color: Option<Color>,
}

struct NumberFormat {
    format_code: String,
    format_id: Option<u32>,
}

struct Protection {
    locked: bool,
    hidden: bool,
}

ddimaria avatar Jul 24 '25 14:07 ddimaria

Are you accepting PRs for this work?

Yes I would. I haven't worked out yet what would be the best way to return richer cell data within the current framework but that shouldn't block a styles parser.

Feel free to reference and/or borrow code/docs the Format and Color in rust_xlsxwriter.

Also, could you explain a little bit about your use case.

jmcnamara avatar Jul 24 '25 20:07 jmcnamara

WIP PR: https://github.com/tafia/calamine/pull/538

At my job, we're building an online spreadsheet that allows importing/exporting from/to Excel. We use Calmine for imports and rust_xlsxwriter for exports ❤️.

For imports, we currently just import values and formulas, but we want style/formatting support, as we utilize this feature in rust_xlsxwriter for exports.

I've been borrowing from Umya, which has this mostly worked out, though that crate has too many unwraps to use in production.

ddimaria avatar Jul 24 '25 21:07 ddimaria

We could have a new struct type similar to Cell that had more/all metadata from the cell such as the ref range (as requested above) but also the style/format index that could be used (opaquely) to get the number format and/or the entire format for thing like background/foreground color.

This is an example of the richer Cell information that could be returned:

/// Represents the data of a single cell in a spreadsheet.
pub struct CellData {
    // The position of the cell (row, column).
    position: (u32, u32),

    // The style index of the cell.
    style: usize,

    // The raw, uninterpreted, type of the cell.
    raw_cell_type: RawCellType,

    // The raw value stored in the cell as a string.
    raw_value: String,

    // The cell formula, if any.
    formula: Option<Formula>,

    // The cell hyperlink if any.
    hyperlink: Option<Hyperlink>,

    // The parent merged range, if any.
    merged_range: Option<Dimensions>,
}

/// Represents a formula in a spreadsheet cell.
pub struct Formula {
    // The formula string.
    formula: String,

    // The result of the formula evaluation.
    result: String,

    // The cell range where the formula is applied.
    range: Dimensions,
}

/// Represents a hyperlink in a spreadsheet cell.
pub struct Hyperlink {
    // The hyperlink URL.
    url: String,

    // The display text for the hyperlink.
    display_text: String,
}

/// Cell Dimensions.
pub struct Dimensions {
    // start: (row, col).
    start: (u32, u32),

    // end: (row, col).
    end: (u32, u32),
}

/// Represents the raw, uninterpreted, type of data stored in a cell.
pub enum RawCellType {
    // Cell containing a blank/empty cell with formatting only and no data.
    Blank,

    // Cell containing a number.
    Number(f64),

    // Cell containing a shared string.
    SharedString(usize),

    // Cell containing a boolean.
    Boolean(bool),

    // Cell contains a date in the ISO 8601 format.
    IsoDate(String),

    // Cell containing an error.
    Error(String),

    // Cell containing an (inline) rich string, not in the shared string table.
    InlineString(String),

    // Cell containing a formula.
    FormulaResultString(String),
}

This would have associated getters for the relevant information.

This is a much bigger object in terms of memory footprint and would be better suited to the Reader interface rather than creating a model of all the cells in memory.

Also, in XLSX at least, not all of this information is available at the cell level. In the XLSX format information about merged ranges, urls, conditional formats and other metadata is stored a the end of a xl/worksheets/sheetN.xml file after the <sheetData> section. In the current calamine API model the worksheet file is parsed a number of times to get this information. This has the advantage of leaving the user choose the level of information that they need and only pay the additional parse/performance overhead as needed. For the sake of this discussion let's cal that the N-times model.

Instead I would like to introduce a 1.5-times parse model where the data at the end of the file and after the <sheetData> section is parsed by reading the file from the end until the appropriate offset is found and then parsing the end data. The is the 0.5 parse. Then when reading the file from the start the additional information would be available at the cell level.

There are a number of technical details that I have skimmed over but it doable in principle and I will try to prototype it in the next few months.

jmcnamara avatar Aug 14 '25 17:08 jmcnamara

My team needs at least the format string for cells to be able to use calamine at all. @jmcnamara Have you looked into this more?

noctuid avatar Dec 03 '25 20:12 noctuid

@noctuid there is a WIP implementation at #538

jmcnamara avatar Dec 03 '25 22:12 jmcnamara