feat: reading PivotTable (PivotCache)
Feature to read all data available to a pivot table.
The data supporting a pivot table is referred to as the pivotCache, and I like to consider this feature "Calamine for your Cache".
An example use case would be auditing filtered content in an externally sourced pivot table.
Pivot tables require both the xl/pivotCache/PivotCacheDefinitions and xl/pivotCache/PivotCacheRecords files. The definitions file holds the relevant metadata as well as the shared items, while the records file holds the values. Rows are delimited with the `<r>` tag, and an `<x>` entry means only the value's position in the definitions file is given (sample below from tests/pivots.xlsx):
xl/pivotCache/PivotCacheRecords1.xml

```xml
<r>
  <n v="10"/>
  <s v="j"/>
  <!-- Use data from PivotCacheDefinitions1.xml's 3rd cacheField tag. -->
  <x v="1"/>
  <x v="3"/>
  <n v="1.20452"/>
  <x v="5"/>
  <n v="4.1510311161292464"/>
  <b v="0"/>
  <m/>
  <s v="blue"/>
</r>
```
xl/pivotCache/PivotCacheDefinitions1.xml

```xml
<cacheFields count="10">
  <cacheField name="Id" numFmtId="0">
    <sharedItems containsSemiMixedTypes="0" containsString="0" containsNumber="1" containsInteger="1" minValue="1" maxValue="10"/>
  </cacheField>
  <cacheField name="Name" numFmtId="0">
    <sharedItems/>
  </cacheField>
  <!-- The lookup value <x v="1"/> in the record above indexes into this list. -->
  <cacheField name="Category" numFmtId="0">
    <sharedItems count="2">
      <s v="blue"/>
      <s v="yellow"/>
    </sharedItems>
  </cacheField>
  <cacheField name="Value" numFmtId="0">
    <sharedItems containsSemiMixedTypes="0" containsString="0" containsNumber="1" containsInteger="1" minValue="5" maxValue="20" count="4">
      <n v="10"/>
      <n v="20"/>
      <n v="15"/>
      <n v="5"/>
    </sharedItems>
  </cacheField>
  <!-- Remaining cacheFields elided. -->
</cacheFields>
```
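The `<x>` indirection above can be sketched as follows. The types and names here are illustrative, not calamine's actual API: each cache field may carry a list of shared items, a record entry `<x v="n"/>` is a zero-based index into that field's list, and `<n>`, `<s>`, `<b>` and `<m>` entries carry the value inline.

```rust
// Illustrative sketch of resolving pivot cache record values against
// shared items; not calamine's actual types.
#[derive(Debug, Clone, PartialEq)]
enum CacheValue {
    Number(f64),
    Text(String),
    Bool(bool),
    Empty,
}

#[derive(Debug, Clone)]
enum RecordEntry {
    // <n>, <s>, <b>, <m>: the value is stored inline in the record.
    Inline(CacheValue),
    // <x v="n"/>: the value is the nth shared item of this field.
    Index(usize),
}

// `shared` has one entry per cacheField; Some(items) when the field
// has a <sharedItems> list with values.
fn resolve(record: &[RecordEntry], shared: &[Option<Vec<CacheValue>>]) -> Vec<CacheValue> {
    record
        .iter()
        .zip(shared)
        .map(|(entry, items)| match entry {
            RecordEntry::Inline(v) => v.clone(),
            RecordEntry::Index(i) => items
                .as_ref()
                .and_then(|s| s.get(*i).cloned())
                .unwrap_or(CacheValue::Empty),
        })
        .collect()
}
```

With the Category field's shared items `["blue", "yellow"]`, `RecordEntry::Index(0)` resolves to `"blue"`.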
Example:

```rust
let mut wb: Xlsx<_> = wb("pivots.xlsx");
for result in wb.get_pivot_data_by_name_ref("PivotTable1").unwrap() {
    println!("{:?}", result);
}
```

This prints the following:

```text
[String("Id"), String("Name"), String("Category"), String("Value"), String("Size"), String("Date"), String("Value / Size"), String("IsBlue"), String("Null"), String("Misc")]
[Int(1), String("a"), String("blue"), Int(10), Float(1.78), DateTimeIso("2024-11-01T00:00:00"), Float(5.617977528089887), Bool(true), Empty, Empty]
[Int(2), String("b"), String("blue"), Int(20), Float(2.012), DateTimeIso("2024-01-04T00:00:00"), Float(9.940357852882704), Bool(true), Empty, Float(2.012)]
...
```
This may be determined to be outside the scope of calamine, but if it fits, it will need to be applied to the other workbook formats (currently only .xlsx) and worked on in a few places such as error handling.
Overall it looks okay. However there are a number of issues to fix before review:
- Rebase to master on `tafia/calamine`. There are some fixes on master relative to your clone.
- Probably best to move the changes to a branch on your repo for easier PRs.
- Turn on the CI on your branch and fix any issues before resubmitting the PR. This PR fails in the calamine CI.
- Run `cargo fmt` on the code.
- Fix any `cargo clippy` issues.
- Fix the warnings from `cargo build -F pivot-cache`.
- Comment style should be proper sentence case with a period at the end.
- Don't use `///` comments for non-public comments. Use `//`.
- Explain why `fn xml_reader()` is being made public. Use `pub(crate)` if necessary, or don't make it public if it isn't necessary.
- Probably best to upgrade to Rust v1.89.0 or 1.90.0, if not already using them. This will give you the latest clippy at least.
- Avoid making whitespace changes like removing blank lines unless there is a valid reason.
- Rebase your local changes, to fix the above issues, into a single commit.
@jmcnamara I may have accidentally resubmitted the PR. I took your advice on moving to a branch on my own repo but didn't realize it would automatically resubmit once I rebased it to my master. If that was the case I still have a little cleanup left removing the unnecessary comments.
Also, do you plan to have this released for just .xlsx workbooks and then handle the remaining formats separately?
> but didn't realize it would automatically resubmit once I rebased it to my master. If that was the case I still have a little cleanup left removing the unnecessary comments.
That is fine, and normal. People often submit the PR as "draft" (it is an option in the initial GitHub dialog or you can do it in git) while they are iterating and then move it to full once it is ready to merge.
> Also, do you plan to have this released for just .xlsx workbooks and then handle the remaining formats separately?
I would think it would be too hard to do this for xls. It may be possible to do it for xlsb and I don't know about ods. So I think it is probably ok to work it out for just xlsx in this PR.
Also, I will run the AI code review on this later. Use the usual amount of judgement in relation to the suggestions.
@jmcnamara thank you for the quick feedback and shared git knowledge. Since this branch is now a draft I'll keep the same approach. Also, the copilot suggestions were actually decent for identifying places I intended to come back to (unwraps and such).
@sftse thanks as well for feedback. I took your approach in the many of the suggestions (those with the thumbs up) and addressed comments / questions for the others.
@jmcnamara The failing check for stable appears to be due to the addition of clippy in 458b8ca7061b653e7be995976662a20003509986. The image installs the toolchain with `--profile minimal`, and that profile does not include clippy by default. See line 48 in the screenshot, which would conflict.
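If the workflow keeps the minimal profile, the clippy component can be added explicitly. A sketch of such a step (the step name and workflow layout are assumptions, not calamine's actual CI config):

```yaml
- name: Install Rust toolchain
  run: |
    rustup toolchain install stable --profile minimal
    rustup component add clippy --toolchain stable
```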
@siqpush Thanks. I'll fix that.
I wonder why it was working previously.
@siqpush Thanks it is fixed on master. You will need to do a fetch and rebase to pick up the changes.
Using GitHub's "Sync", your code was merged, not rebased, onto my branch. I saw your note too late; my local branch had been squash merged by then. I tried the rebase follow-up but it was useless. @jmcnamara
> Tried the rebase follow up but it was useless.
Don't worry about it we can sort it out at merge.
@siqpush Could you "Resolve" the comments that have been already addressed to make the review cleaner. Thanks.
I fixed the failing typo check on master. If you rebase to that it will fix the failing CI check.
@siqpush I would like to merge this for the next release but there is a conflict due to recent changes. Could you fix this?
@jmcnamara I don't think we should merge this in its current state. The commit history is very confusing (empty rebase commits, merge commits inside this PR) and I think the code could still use some improvements. I can make another review.
> The commit history is very confusing (empty rebase commits, merge commits inside this PR)
@sftse I could probably deal with that but if it still needs reviewing I will pause the merge.
@siqpush Could you try rebase this down into a single commit that is up to date with main. Or start a clean branch and a new PR.
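One way to do that squash, sketched with `git reset --soft` (this assumes an `upstream` remote pointing at tafia/calamine; the commit message and branch reference are placeholders):

```shell
git fetch upstream
git rebase upstream/master               # replay the branch on current master
git reset --soft upstream/master         # keep the combined diff staged
git commit -m "feat: reading PivotTable (PivotCache)"
git push --force-with-lease origin HEAD  # update the PR branch
```

`--force-with-lease` refuses to overwrite the remote branch if someone else pushed to it in the meantime.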
@sftse @jmcnamara does commit https://github.com/tafia/calamine/pull/559/commits/0b7a88473fcd9541a3ee45058556a6802bac944e better present the fix for you? On my side it shows I want to merge both my commits and others into master (very confusing unless I drill into the commit link).
Feels like a new PR may be the way...
> does commit 0b7a884 better present the fix for you
Yes. That looks clean from the point of view of a review, but it seems to be detached from a branch.
It may be easier to visualize what is going on with this PR; `git log --all --graph` may help with orientation for what to merge into what and what a clear history should look like.
Here a more compact rendering.
As a rule of thumb, do not merge master into other branches, as this makes for a confusing history. GitHub is confused by it as well, as you can see from how it chooses to render the commits belonging to this PR: it treats the commits that were merged from master into your branch (labeled `pr-559` in the rendered graph) as belonging to this PR, even though they do not contribute to the final diff or to understanding what has changed. At that point the usefulness of the individual commits is lower than it could have been, and it may be clearer to just squash all of them into a single one.
Of note as well: it seems you rebased your commits beginning at 175f and then decided to merge the rebased branch into the old one, which duplicates a large number of commits; see 175f and aff6.
I've cleaned up the commit history in my branch `pr-559-clean`; here is what the history would look like if it were merged into master. If you can force push to this branch, it would clean things up a lot.
@sftse thank you! That's solid advice and in hindsight makes sense. Still a little traumatized by rebasing; it's hard to remember the order of operations that landed me in that pile...
While the API this PR proposes to add is consistent with other optional features like pictures, I think the pattern of adding `Option` fields to the main `Xlsx` struct is an API dead end we should be moving away from.
To illustrate, if a function needs the PivotTables to be initialized when it's called, it is much better to pass it in as an argument.
```rust
struct Xlsx {
    #[cfg(feature = "pivot-cache")]
    pivot_table: Option<PivotTable>,
    // Same problem with pictures.
    #[cfg(feature = "pictures")]
    pictures: Option<Pictures>,
}

impl Xlsx {
    pub fn new() -> Self {
        todo!()
    }

    #[cfg(feature = "pivot-cache")]
    fn read_pivot_table_metadata(&mut self) -> Result<(), ()> {
        self.pivot_table = Some(todo!());
        Ok(())
    }

    pub fn needs_initialized_pivot(&self) -> Foo {
        // Panics unless read_pivot_table_metadata was called first.
        let _x = self.pivot_table.as_ref().unwrap();
        // Use _x somehow.
        todo!()
    }
}
```
```rust
// Alternative API.
impl Xlsx {
    pub fn read_pivot_table_metadata(&mut self) -> Result<PivotTable, ()> {
        todo!()
    }

    // A caller who wants to use this function must call
    // read_pivot_table_metadata first to obtain a PivotTable.
    pub fn needs_initialized_pivot(&self, pivot_table: &PivotTable) -> Foo {
        let _x = pivot_table;
        todo!()
    }
}
```
This eliminates the problem of having to add a feature flag to initialize the pivot tables deep within `Xlsx::new`, and the additional costs for every caller, even the ones that don't need the feature.
What we should be looking out for is to build dataflows from types that are self-documenting: users who need a feature should be able to reason backwards and discover in the documentation which types they need to build to call the functions.
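A compilable miniature of the second pattern (dummy types and names, not calamine's actual API) shows how the dependency becomes visible in the signature:

```rust
// Illustrative types only; not calamine's actual API.
struct PivotTable {
    names: Vec<String>,
}

struct Xlsx;

impl Xlsx {
    // Parsing returns the data instead of stashing it in an Option field.
    fn read_pivot_table_metadata(&mut self) -> Result<PivotTable, String> {
        Ok(PivotTable {
            names: vec!["PivotTable1".to_string()],
        })
    }

    // The dependency is explicit: this cannot be called without a PivotTable.
    fn pivot_names<'a>(&self, pivot_table: &'a PivotTable) -> &'a [String] {
        &pivot_table.names
    }
}
```

The caller's code then reads as a visible dataflow: first `read_pivot_table_metadata`, then any function that takes the `PivotTable`.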
@sftse I wasn't sold but I think in hindsight it actually does feel more correct than coercing to a feature. there was really only one awkward call to the get_pivot_tables_by_name_and_sheet helper function in lib.rs .
> This eliminates the problem of having to add a feature flag to initialize the pivot tables deep within `Xlsx::new` and the additional costs for every caller, even the ones that don't need the feature.
I think we should only use feature flags for features that require an additional dependency like `chrono`. I think all other features like "pictures" should be initialised when the user calls them. (This isn't 100% simple in some cases, since it could require a second parsing of parts of a file, but in general it should be achievable.)
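A sketch of that call-time initialisation idea (illustrative names and types, not the real API): the part is parsed on the first call rather than in `new()`, so no feature flag is needed and callers who never use the feature pay nothing.

```rust
// Illustrative sketch of parse-on-call instead of feature-gated eager parsing.
struct Xlsx {
    // Parsed lazily on the first pictures() call, not in new().
    pictures: Option<Vec<String>>,
}

impl Xlsx {
    fn new() -> Self {
        Xlsx { pictures: None }
    }

    // Parses the pictures part on first call and caches the result, so the
    // second parse of the file happens at most once even for repeat callers.
    fn pictures(&mut self) -> &[String] {
        if self.pictures.is_none() {
            // A real implementation would re-open and parse the relevant
            // part of the workbook here.
            self.pictures = Some(vec!["xl/media/image1.png".to_string()]);
        }
        self.pictures.as_deref().unwrap()
    }
}
```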
Haven't forgotten about this PR; sorry about the delays. I'd have to find time to use the current API to endorse it, but the internal details are getting close to being right.
Don't be. I was happy you got a healthy break from reviewing my mess.
Let me know what you think once you get a moment!