chore: xlsx improvements
updates Apr 3 2025
added the following kwargs to the Excel-like converters:

```python
na_rep: Any = ""
remove_header_placeholders: bool = True
drop_empty_cols: bool = False
drop_empty_rows: bool = False
```

- only NaN-like values (`None`, `NaT`) are scrubbed out by default, to `na_rep=""`
- pandas header placeholders, `Unnamed: {i}`, are scrubbed out by default, to `na_rep=""`

sparsity options:
- fully empty columns can optionally be removed
- fully empty rows can optionally be removed
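As a rough illustration, a cleaning pass with these kwargs could look like the sketch below (the helper name `clean_sheet` and its logic are illustrative, not the actual converter implementation):

```python
import pandas as pd


def clean_sheet(
    df: pd.DataFrame,
    na_rep="",
    remove_header_placeholders: bool = True,
    drop_empty_cols: bool = False,
    drop_empty_rows: bool = False,
) -> pd.DataFrame:
    """Illustrative cleaning pass mirroring the kwargs above."""
    if drop_empty_rows:
        # fully empty rows can optionally be removed
        df = df.dropna(axis=0, how="all")
    if drop_empty_cols:
        # fully empty columns can optionally be removed
        df = df.dropna(axis=1, how="all")
    if remove_header_placeholders:
        # pandas auto-names blank headers "Unnamed: {i}"; replace them with na_rep
        df = df.rename(
            columns=lambda c: na_rep if str(c).startswith("Unnamed:") else c
        )
    # NaN-like values (None, NaT, float nan) become the placeholder
    return df.fillna(na_rep)
```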
also refactored the Xlsx and Xls parsers to be more DRY with an `ExcelParser` base class. this will make it easier to
- introduce more Excel-like types
- modify the underlying parsing logic, to say use `calamine` instead; see #259
usage

```python
from markitdown import MarkItDown

md = MarkItDown()

# todo: cli support for parser opts?
res = md.convert(
    "packages/markitdown/tests/test_files/test.xlsx",
    na_rep="🤗",
    remove_header_placeholders=True,
    drop_empty_cols=False,
    drop_empty_rows=False,
)
print(res.markdown)
```
## Sheet1
| Alpha | 🤗 | Beta | Gamma | Delta | 🤗 |
| --- | --- | --- | --- | --- | --- |
| 89.0 | 🤗 | 82 | 100.0 | 12.0 | 13.0 |
| 76.0 | 🤗 | 89 | 33.0 | 42.0 | 43.0 |
| 🤗 | 🤗 | 🤗 | 🤗 | 🤗 | 🤗 |
| 60.0 | 🤗 | 84 | 19.0 | 19.0 | 53.0 |
| 7.0 | 🤗 | 69 | 10.0 | 17.0 | 63.0 |
| 87.0 | 🤗 | 89 | 86.0 | 54.0 | 312.0 |
| 23.0 | 🤗 | 4 | 89.0 | 25.0 | 23.0 |
| 70.0 | 🤗 | 84 | 62.0 | 59.0 | 34.0 |
| 83.0 | 🤗 | 37 | 43.0 | 21.0 | 321.0 |
| 71.0 | 🤗 | 15 | 88.0 | 32.0 | 431.0 |
| 20.0 | 🤗 | 62 | 20.0 | 67.0 | 647.0 |
| 67.0 | 🤗 | 18 | 15.0 | 48.0 | 878.0 |
| 42.0 | 🤗 | 5 | 15.0 | 67.0 | 🤗 |
| 58.0 | 🤗 | 6ff4173b-42a5-4784-9b19-f49caff4d93d | 22.0 | 9.0 | 346.0 |
| 49.0 | 🤗 | 93 | 6.0 | 38.0 | 317.0 |
| 82.0 | 🤗 | 28 | 1.0 | 39.0 | 341.0 |
| 95.0 | 🤗 | 55 | 18.0 | 82.0 | 135.0 |
| 50.0 | 🤗 | 46 | 98.0 | 86.0 | 135.0 |
| 31.0 | 🤗 | 46 | 47.0 | 82.0 | 3634.0 |
| 40.0 | 🤗 | 65 | 19.0 | 31.0 | 357.0 |
| 95.0 | 🤗 | 65 | 29.0 | 62.0 | 34.0 |
| 68.0 | 🤗 | 57 | 34.0 | 54.0 | 901.0 |
| 96.0 | 🤗 | 66 | 63.0 | 14.0 | 299.0 |
| 87.0 | 🤗 | 93 | 95.0 | 80.0 | 19.0 |
## 09060124-b5e7-4717-9d07-3c046eb
| ColA | ColB | ColC | ColD |
| --- | --- | --- | --- |
| 1 | 2 | 3 | 4 |
| 5 | 6 | 7 | 8 |
| 9 | 10 | 11 | 12 |
| 13 | 14 | 15 | affc7dad-52dc-4b98-9b5d-51e65d8a8ad0 |
Old
a couple of simple heuristics that should be safe to apply in any general case:
- if any row or column is fully null, we can save tokens by deleting that row/column
- dataframe column names that are null get auto-filled with `Unnamed: {i}`, and I think we can scrub this out
- NaN's in the dataframe materialize as "NaN" in the markdown and can probably be scrubbed out
Examples
here is what I have expanded the test.xlsx to look like (changes annotated in red):
which would produce:
Sheet1
| Alpha | Unnamed: 1 | Beta | Gamma | Delta | Unnamed: 5 |
|---|---|---|---|---|---|
| 89 | NaN | 82 | 100 | 12 | 13.0 |
| 76 | NaN | 89 | 33 | 42 | 43.0 |
| 60 | NaN | 84 | 19 | 19 | 53.0 |
| 7 | NaN | 69 | 10 | 17 | 63.0 |
| 87 | NaN | 89 | 86 | 54 | 312.0 |
| 23 | NaN | 4 | 89 | 25 | 23.0 |
| 70 | NaN | 84 | 62 | 59 | 34.0 |
| 83 | NaN | 37 | 43 | 21 | 321.0 |
| 71 | NaN | 15 | 88 | 32 | 431.0 |
| 20 | NaN | 62 | 20 | 67 | 647.0 |
| 67 | NaN | 18 | 15 | 48 | 878.0 |
| 42 | NaN | 5 | 15 | 67 | NaN |
| 58 | NaN | 6ff4173b-42a5-4784-9b19-f49caff4d93d | 22 | 9 | 346.0 |
| 49 | NaN | 93 | 6 | 38 | 317.0 |
| 82 | NaN | 28 | 1 | 39 | 341.0 |
| 95 | NaN | 55 | 18 | 82 | 135.0 |
| 50 | NaN | 46 | 98 | 86 | 135.0 |
| 31 | NaN | 46 | 47 | 82 | 3634.0 |
| 40 | NaN | 65 | 19 | 31 | 357.0 |
| 95 | NaN | 65 | 29 | 62 | 34.0 |
| 68 | NaN | 57 | 34 | 54 | 901.0 |
| 96 | NaN | 66 | 63 | 14 | 299.0 |
| 87 | NaN | 93 | 95 | 80 | 19.0 |
09060124-b5e7-4717-9d07-3c046eb
| ColA | ColB | ColC | ColD |
|---|---|---|---|
| 1 | 2 | 3 | 4 |
| 5 | 6 | 7 | 8 |
| 9 | 10 | 11 | 12 |
| 13 | 14 | 15 | affc7dad-52dc-4b98-9b5d-51e65d8a8ad0 |
with the changes:
Sheet1
| Alpha | Beta | Gamma | Delta | |
|---|---|---|---|---|
| 89 | 82 | 100 | 12 | 13.0 |
| 76 | 89 | 33 | 42 | 43.0 |
| 60 | 84 | 19 | 19 | 53.0 |
| 7 | 69 | 10 | 17 | 63.0 |
| 87 | 89 | 86 | 54 | 312.0 |
| 23 | 4 | 89 | 25 | 23.0 |
| 70 | 84 | 62 | 59 | 34.0 |
| 83 | 37 | 43 | 21 | 321.0 |
| 71 | 15 | 88 | 32 | 431.0 |
| 20 | 62 | 20 | 67 | 647.0 |
| 67 | 18 | 15 | 48 | 878.0 |
| 42 | 5 | 15 | 67 | |
| 58 | 6ff4173b-42a5-4784-9b19-f49caff4d93d | 22 | 9 | 346.0 |
| 49 | 93 | 6 | 38 | 317.0 |
| 82 | 28 | 1 | 39 | 341.0 |
| 95 | 55 | 18 | 82 | 135.0 |
| 50 | 46 | 98 | 86 | 135.0 |
| 31 | 46 | 47 | 82 | 3634.0 |
| 40 | 65 | 19 | 31 | 357.0 |
| 95 | 65 | 29 | 62 | 34.0 |
| 68 | 57 | 34 | 54 | 901.0 |
| 96 | 66 | 63 | 14 | 299.0 |
| 87 | 93 | 95 | 80 | 19.0 |
09060124-b5e7-4717-9d07-3c046eb
| ColA | ColB | ColC | ColD |
|---|---|---|---|
| 1 | 2 | 3 | 4 |
| 5 | 6 | 7 | 8 |
| 9 | 10 | 11 | 12 |
| 13 | 14 | 15 | affc7dad-52dc-4b98-9b5d-51e65d8a8ad0 |
@hewliyang please read the following Contributor License Agreement (CLA). If you agree with the CLA, please reply with the following information:

@microsoft-github-policy-service agree [company="{your company}"]

Options:
- (default, no company specified) I have sole ownership of intellectual property rights to my Submissions and I am not making Submissions in the course of work for my employer.

  @microsoft-github-policy-service agree

- (when company given) I am making Submissions in the course of work for my employer (or my employer has intellectual property rights in my Submissions by contract or applicable law). I have permission from my employer to make Submissions and enter into this Agreement on behalf of my employer. By signing below, the defined term “You” includes me and my employer.

  @microsoft-github-policy-service agree company="Microsoft"
@hewliyang thanks for the PR. Our typical use case is including this data as context in an LLM. Do you think your suggested changes, NaN -> "", `Unnamed: {i}` -> "", improve generation?
cc @afourney
If the objective is beautification, what if we added a kwarg called `beautify=True` so that both options are available to users?
thanks for reviewing @gagb 🤗
well, the right answer is that it's non-trivial, because we would have to curate a QA test set and run experiments on it.
but what I do know for sure is that NaN, `Unnamed: {i}`, and fully NaN rows/cols are non-informative, so they don't give the LLM any more context than if they were scrubbed out in the first place: increasing the signal while decreasing the noise. plus, you save a bunch of input tokens. the assumption here is that LLMs understand that the empty string "" implies a null/empty cell.
> If the objective is beautification, what if we added a kwarg called `beautify=True` so that both options are available to users?
sounds good!
For empty values, "" is definitely better than NaN. Although I'm not sure if we want to remove empty columns; that would be transforming the structure of the data. Maybe this can be a flag, such as `--remove-empty-columns` / `--remove-empty-rows`. What do you think @hewliyang?
sounds good. one case where we'd want to keep fully null rows/cols is when there are subtables in a sheet and we want to maintain some spatial separation (or not). added two flags, `drop_empty_{rows|cols}`, for this.
also now forwarding `na_rep=""` so the user can provide a custom placeholder for NaNs as well.
Arguably, if the xlsx cell is empty, the conversion should be empty.
If the excel value is "#N/A" then I believe it should be interpreted as NaN (though it's perhaps not exactly the same).
In any case, I agree with dropping completely empty columns (including empty headers), and with encoding empty cells with "". I'm not sure what to do about column headers yet. Let me think on this more, and test with #N/A values.
I've pushed a change to:
- only drop columns iff its header is also nullish
- side note: I've seen a case at work where a financial model for some reason had a rogue `-` at row 1,000,000, and pandas blew up after consuming > 15GB of memory trying to convert the dataframe into HTML. so cleaning is kinda important, especially for unsanitized inputs, i.e. user data
Additionally:
- rows should be safe because we don't care about indexes
- I've checked that `=NA()` or `#N/A` values in Excel are indeed also parsed to NaN by pandas
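A sketch of that column-drop condition (helper names are illustrative; pandas surfaces blank Excel headers as `Unnamed: {i}`, and `#N/A` / `=NA()` cells arrive as NaN):

```python
import pandas as pd


def _is_nullish_header(name: object) -> bool:
    # blank Excel headers surface in pandas as "Unnamed: {i}"
    return pd.isna(name) or str(name).startswith("Unnamed:")


def drop_null_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Drop a column iff its header is nullish AND all of its values are null."""
    keep = [
        col
        for col in df.columns
        if not (_is_nullish_header(col) and df[col].isna().all())
    ]
    return df[keep]
```

Note that a fully null column with a real header (e.g. `Beta`) survives this pass, which is the behavior change described above.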
Regarding `XLSXConverter` kwargs, currently:
- `na_rep` defaults to `""` (covered in tests)
- `drop_null_{rows|columns}` defaults to `False` (not covered in tests)
Not sure how verbose you would want the tests to be, because I think for the latter we will need to re-parse the string representation.
Also, I see a related PR #169. If we intend to merge this, then another todo would be to wrap all the `.xl{s|sx|sm|...}` converters in another base class and move the cleaning logic to a separate internal method. Lmk what you think.
Thanks for adding the --drop flags!
> i've seen a case at work where a financial model for some reason had a rogue `-` at row 1,000,000, and pandas blew up after consuming > 15GB of memory trying to convert the dataframe into HTML. so cleaning is kinda important especially for unsanitized inputs i.e. user data
Excel supports ~1,048,576 rows max, but in my experience it starts to get sluggish with a sheet over 100k rows; that's another story, though.
On the topic of pandas using too much memory: I'm generally advocating for all projects to replace pandas with polars or duckdb wherever possible, especially when working with large datasets.
I'm actually planning to open a PR myself in January to replace all pandas functions with polars wherever possible. As I'm expecting this library will be used to convert vast amounts of data, we should make it as efficient as possible.
Is there any progress on this? The patch for replacing "NaN" with "" in the MD would be really important.
Not yet, but let me prioritize it for this week. I'm not sold on removing columns or headers... but swapping NaN for "" makes a lot of sense to me.
Totally agree. Removing columns or headers should probably be optional, enabled by a command-line / API parameter option, but the NaNs are surprising if a cell is simply empty. Currently I am working around this by doing my own API call and then manually replacing " NaN " in the generated string with " ".
Personally, I very much like it when a utility like this one allows the user to choose such things via command-line parameters or API parametrization, especially since this tool can be used for many purposes.
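That manual workaround could be sketched roughly as follows (the regex is illustrative and assumes pipe-delimited markdown table cells that contain only `NaN`):

```python
import re

# sample markitdown-style table output containing NaN cells
md_text = "| Alpha | NaN | Beta |\n| 89.0 | NaN | 82 |"

# replace a NaN that occupies an entire table cell with an empty cell
cleaned = re.sub(r"\| *NaN *(?=\|)", "| ", md_text)
```

Unlike a blanket string replace, the lookahead only touches cells whose whole content is `NaN`, so prose containing the word survives.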
Sounds good. This PR has drifted from when it was submitted. Let me try to migrate the "NaN" -> "" bits to the new main branch, and we can add further refinements in another PR.
> Not yet, but let me prioritize it for this week. I'm not sold on removing columns or headers... but swapping NaN for "" makes a lot of sense to me.

> Totally agree. Removing columns or headers should probably be optional, enabled by a command-line / API parameter option, but the NaNs are surprising if a cell is simply empty. Currently I am working around this by doing my own API call and then manually replacing " NaN " in the generated string with " ".
> Personally, I very much like it when a utility like this one allows the user to choose such things via command-line parameters or API parametrization, especially since this tool can be used for many purposes.
I believe the current implementation already satisfies your requirements:
- removing columns / headers is optional via kwargs (defaults to `False`)
- `na_rep` is the empty string
> Sounds good. This PR has drifted from when it was submitted. Let me try to migrate the "NaN" -> "" bits to the new main branch, and we can add further refinements in another PR.
I've updated it to match the latest HEAD @afourney. Please have a look at the updated PR description. The changes apply what you wanted to cherry-pick by default, and do not apply what I've proposed by default, but the option is there for the user.
The only thing is, I don't see a clear way to communicate these options as-is. The CLI also isn't compatible with arbitrary kwargs right now.