
chore: xlsx improvements

Open hewliyang opened this issue 1 year ago • 15 comments

updates Apr 3 2025

added the following kwargs to the Excel-like converters:

na_rep: Any = ""
remove_header_placeholders: bool = True
drop_empty_cols: bool = False
drop_empty_rows: bool = False

  • only NaN-like values (None, NaT) are scrubbed out by default, to na_rep=""
  • pandas header placeholders (Unnamed: {i}) are scrubbed out by default, to na_rep=""

sparsity options:

  • fully empty columns can optionally be removed
  • fully empty rows can optionally be removed
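For intuition, the combined behavior of these options can be sketched in pure Python over a plain list-of-lists table (a hypothetical illustration - the actual converters operate on pandas DataFrames, and all names here are made up):

```python
import math
import re

def is_na(value):
    """NaN-like check: None, or a float NaN (what None/NaT/NaN collapse to here)."""
    return value is None or (isinstance(value, float) and math.isnan(value))

def clean_table(header, rows, na_rep="", remove_header_placeholders=True,
                drop_empty_cols=False, drop_empty_rows=False):
    """Apply the four options to a header plus a list-of-lists table."""
    if remove_header_placeholders:
        # pandas names missing headers "Unnamed: {i}"
        header = [na_rep if re.fullmatch(r"Unnamed: \d+", str(h)) else h
                  for h in header]
    # replace NaN-like cells with na_rep
    rows = [[na_rep if is_na(v) else v for v in row] for row in rows]
    if drop_empty_rows:
        rows = [row for row in rows if any(v != na_rep for v in row)]
    if drop_empty_cols:
        # keep a column if its header or any of its cells is non-empty
        keep = [i for i in range(len(header))
                if header[i] != na_rep or any(row[i] != na_rep for row in rows)]
        header = [header[i] for i in keep]
        rows = [[row[i] for i in keep] for row in rows]
    return header, rows
```
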

also refactored the Xlsx and Xls parsers to be more DRY with an ExcelParser base class. this will make it easier to

  • introduce more Excel-like types
  • modify the underlying parsing logic, e.g. to use calamine instead; see #259
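a rough sketch of what such a base class could look like (illustrative only - the real refactor lives in the PR, and these names are hypothetical):

```python
from abc import ABC, abstractmethod
from typing import Dict, List

class ExcelParser(ABC):
    """Shared conversion logic; subclasses only supply raw sheet data
    from whatever backend they use (openpyxl, xlrd, calamine, ...)."""

    @abstractmethod
    def load_sheets(self, path: str) -> Dict[str, List[List[object]]]:
        """Return {sheet_name: rows}, where the first row is the header."""

    def to_markdown(self, path: str) -> str:
        out = []
        for name, rows in self.load_sheets(path).items():
            header, *body = rows
            out.append(f"## {name}")
            out.append("| " + " | ".join(map(str, header)) + " |")
            out.append("| " + " | ".join("---" for _ in header) + " |")
            for row in body:
                out.append("| " + " | ".join(map(str, row)) + " |")
        return "\n".join(out)
```

an XlsxParser / XlsParser subclass would then only override load_sheets, so swapping in a different backend touches a single method.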

usage

from markitdown import MarkItDown

md = MarkItDown()

# todo: cli support for parser opts?
res = md.convert(
    "packages/markitdown/tests/test_files/test.xlsx",
    na_rep="🤗",
    remove_header_placeholders=True,
    drop_empty_cols=False,
    drop_empty_rows=False,
)

print(res.markdown)
## Sheet1
| Alpha | 🤗 | Beta | Gamma | Delta | 🤗 |
| --- | --- | --- | --- | --- | --- |
| 89.0 | 🤗 | 82 | 100.0 | 12.0 | 13.0 |
| 76.0 | 🤗 | 89 | 33.0 | 42.0 | 43.0 |
| 🤗 | 🤗 | 🤗 | 🤗 | 🤗 | 🤗 |
| 60.0 | 🤗 | 84 | 19.0 | 19.0 | 53.0 |
| 7.0 | 🤗 | 69 | 10.0 | 17.0 | 63.0 |
| 87.0 | 🤗 | 89 | 86.0 | 54.0 | 312.0 |
| 23.0 | 🤗 | 4 | 89.0 | 25.0 | 23.0 |
| 70.0 | 🤗 | 84 | 62.0 | 59.0 | 34.0 |
| 83.0 | 🤗 | 37 | 43.0 | 21.0 | 321.0 |
| 71.0 | 🤗 | 15 | 88.0 | 32.0 | 431.0 |
| 20.0 | 🤗 | 62 | 20.0 | 67.0 | 647.0 |
| 67.0 | 🤗 | 18 | 15.0 | 48.0 | 878.0 |
| 42.0 | 🤗 | 5 | 15.0 | 67.0 | 🤗 |
| 58.0 | 🤗 | 6ff4173b-42a5-4784-9b19-f49caff4d93d | 22.0 | 9.0 | 346.0 |
| 49.0 | 🤗 | 93 | 6.0 | 38.0 | 317.0 |
| 82.0 | 🤗 | 28 | 1.0 | 39.0 | 341.0 |
| 95.0 | 🤗 | 55 | 18.0 | 82.0 | 135.0 |
| 50.0 | 🤗 | 46 | 98.0 | 86.0 | 135.0 |
| 31.0 | 🤗 | 46 | 47.0 | 82.0 | 3634.0 |
| 40.0 | 🤗 | 65 | 19.0 | 31.0 | 357.0 |
| 95.0 | 🤗 | 65 | 29.0 | 62.0 | 34.0 |
| 68.0 | 🤗 | 57 | 34.0 | 54.0 | 901.0 |
| 96.0 | 🤗 | 66 | 63.0 | 14.0 | 299.0 |
| 87.0 | 🤗 | 93 | 95.0 | 80.0 | 19.0 |

## 09060124-b5e7-4717-9d07-3c046eb
| ColA | ColB | ColC | ColD |
| --- | --- | --- | --- |
| 1 | 2 | 3 | 4 |
| 5 | 6 | 7 | 8 |
| 9 | 10 | 11 | 12 |
| 13 | 14 | 15 | affc7dad-52dc-4b98-9b5d-51e65d8a8ad0 |

Old

a couple simple heuristics that should be safe to apply in any general case:

  • if any row or column is fully null, we can save tokens by deleting it
  • dataframe column names that are null get auto-filled as Unnamed: {i}, and I think we can scrub this out
  • NaNs in the dataframe materialize as "NaN" in the markdown and can probably be scrubbed out

Examples


here is what I have expanded the test.xlsx to look like (changes annotated in red):

[image: the expanded test.xlsx, with changes annotated in red]

which would produce:


## Sheet1

| Alpha | Unnamed: 1 | Beta | Gamma | Delta | Unnamed: 5 |
| --- | --- | --- | --- | --- | --- |
| 89 | NaN | 82 | 100 | 12 | 13.0 |
| 76 | NaN | 89 | 33 | 42 | 43.0 |
| 60 | NaN | 84 | 19 | 19 | 53.0 |
| 7 | NaN | 69 | 10 | 17 | 63.0 |
| 87 | NaN | 89 | 86 | 54 | 312.0 |
| 23 | NaN | 4 | 89 | 25 | 23.0 |
| 70 | NaN | 84 | 62 | 59 | 34.0 |
| 83 | NaN | 37 | 43 | 21 | 321.0 |
| 71 | NaN | 15 | 88 | 32 | 431.0 |
| 20 | NaN | 62 | 20 | 67 | 647.0 |
| 67 | NaN | 18 | 15 | 48 | 878.0 |
| 42 | NaN | 5 | 15 | 67 | NaN |
| 58 | NaN | 6ff4173b-42a5-4784-9b19-f49caff4d93d | 22 | 9 | 346.0 |
| 49 | NaN | 93 | 6 | 38 | 317.0 |
| 82 | NaN | 28 | 1 | 39 | 341.0 |
| 95 | NaN | 55 | 18 | 82 | 135.0 |
| 50 | NaN | 46 | 98 | 86 | 135.0 |
| 31 | NaN | 46 | 47 | 82 | 3634.0 |
| 40 | NaN | 65 | 19 | 31 | 357.0 |
| 95 | NaN | 65 | 29 | 62 | 34.0 |
| 68 | NaN | 57 | 34 | 54 | 901.0 |
| 96 | NaN | 66 | 63 | 14 | 299.0 |
| 87 | NaN | 93 | 95 | 80 | 19.0 |

## 09060124-b5e7-4717-9d07-3c046eb

| ColA | ColB | ColC | ColD |
| --- | --- | --- | --- |
| 1 | 2 | 3 | 4 |
| 5 | 6 | 7 | 8 |
| 9 | 10 | 11 | 12 |
| 13 | 14 | 15 | affc7dad-52dc-4b98-9b5d-51e65d8a8ad0 |

with the changes:


## Sheet1

| Alpha | Beta | Gamma | Delta |  |
| --- | --- | --- | --- | --- |
| 89 | 82 | 100 | 12 | 13.0 |
| 76 | 89 | 33 | 42 | 43.0 |
| 60 | 84 | 19 | 19 | 53.0 |
| 7 | 69 | 10 | 17 | 63.0 |
| 87 | 89 | 86 | 54 | 312.0 |
| 23 | 4 | 89 | 25 | 23.0 |
| 70 | 84 | 62 | 59 | 34.0 |
| 83 | 37 | 43 | 21 | 321.0 |
| 71 | 15 | 88 | 32 | 431.0 |
| 20 | 62 | 20 | 67 | 647.0 |
| 67 | 18 | 15 | 48 | 878.0 |
| 42 | 5 | 15 | 67 |  |
| 58 | 6ff4173b-42a5-4784-9b19-f49caff4d93d | 22 | 9 | 346.0 |
| 49 | 93 | 6 | 38 | 317.0 |
| 82 | 28 | 1 | 39 | 341.0 |
| 95 | 55 | 18 | 82 | 135.0 |
| 50 | 46 | 98 | 86 | 135.0 |
| 31 | 46 | 47 | 82 | 3634.0 |
| 40 | 65 | 19 | 31 | 357.0 |
| 95 | 65 | 29 | 62 | 34.0 |
| 68 | 57 | 34 | 54 | 901.0 |
| 96 | 66 | 63 | 14 | 299.0 |
| 87 | 93 | 95 | 80 | 19.0 |

## 09060124-b5e7-4717-9d07-3c046eb

| ColA | ColB | ColC | ColD |
| --- | --- | --- | --- |
| 1 | 2 | 3 | 4 |
| 5 | 6 | 7 | 8 |
| 9 | 10 | 11 | 12 |
| 13 | 14 | 15 | affc7dad-52dc-4b98-9b5d-51e65d8a8ad0 |

hewliyang avatar Dec 16 '24 07:12 hewliyang

@hewliyang please read the following Contributor License Agreement(CLA). If you agree with the CLA, please reply with the following information.

@microsoft-github-policy-service agree [company="{your company}"]

Options:

  • (default - no company specified) I have sole ownership of intellectual property rights to my Submissions and I am not making Submissions in the course of work for my employer.
@microsoft-github-policy-service agree
  • (when company given) I am making Submissions in the course of work for my employer (or my employer has intellectual property rights in my Submissions by contract or applicable law). I have permission from my employer to make Submissions and enter into this Agreement on behalf of my employer. By signing below, the defined term “You” includes me and my employer.
@microsoft-github-policy-service agree company="Microsoft"


hewliyang avatar Dec 16 '24 07:12 hewliyang

@microsoft-github-policy-service agree

hewliyang avatar Dec 16 '24 08:12 hewliyang

@hewliyang thanks for the PR. Our typical use case is including this data as context for an LLM. Do you think your suggested changes, NaN -> "" and Unnamed: {i} -> "", improve generation?

cc @afourney

gagb avatar Dec 17 '24 00:12 gagb

If the objective is beautification, what if we added a kwarg called beautify=True so that both options are available to users?

gagb avatar Dec 17 '24 00:12 gagb

@hewliyang thanks for the PR. Our typical use case is including this data as context for an LLM. Do you think your suggested changes, NaN -> "" and Unnamed: {i} -> "", improve generation?

cc @afourney

thanks for reviewing @gagb 🤗

well, the honest answer is that it's non-trivial to say, because we would have to curate a QA test set and run experiments on it.

but what I do know for sure is that NaN, Unnamed: {i}, and fully-NaN rows/cols are non-informative, so they don't give the LLM any more context than if they were scrubbed out in the first place - increasing the signal while decreasing the noise. plus, you save a bunch of input tokens. the assumption here is that LLMs understand that the empty string "" implies a null/empty cell.
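a rough illustration of the savings on a single row (character counts as a crude proxy for tokens; the row is taken from the example output above):

```python
# one data row before and after scrubbing the two NaN cells
row_before = "| 42.0 | NaN | 5 | 15.0 | 67.0 | NaN |"
row_after = "| 42.0 |  | 5 | 15.0 | 67.0 |  |"

# each scrubbed "NaN" saves 3 characters; two cells here, so 6 total
saved = len(row_before) - len(row_after)
assert saved == 6
```
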

If the objective is beautification, what if we added a kwarg called beautify=True so that both options are available to users?

sounds good!

hewliyang avatar Dec 17 '24 06:12 hewliyang

For empty values, "" is definitely better than NaN. Although I'm not sure we want to remove empty columns - that would be transforming the structure of the data. Maybe this could be a flag, such as --remove-empty-columns / --remove-empty-rows. What do you think @hewliyang?

vmatt avatar Dec 17 '24 10:12 vmatt

For empty values, "" is definitely better than NaN. Although I'm not sure we want to remove empty columns - that would be transforming the structure of the data. Maybe this could be a flag, such as --remove-empty-columns / --remove-empty-rows. What do you think @hewliyang?

sounds good. my thinking on when we would want to keep fully null rows/cols: e.g. when there are subtables in a sheet and we want to maintain some spatial separation (or not). added two flags, drop_empty_{rows|cols}, for this.

also now forwarding na_rep="" so the user can provide a custom placeholder for NaNs as well

hewliyang avatar Dec 17 '24 13:12 hewliyang

Arguably, if the xlsx cell is empty, the conversion should be empty.

If the excel value is "#N/A" then I believe it should be interpreted as NaN (though it's perhaps not exactly the same).

In any case, I agree with dropping completely empty columns (including empty headers), and with encoding empty cells with "". I'm not sure what to do about column headers yet. Let me think on this more, and test with #N/A values.

afourney avatar Dec 17 '24 23:12 afourney

Arguably, if the xlsx cell is empty, the conversion should be empty.

If the excel value is "#N/A" then I believe it should be interpreted as NaN (though it's perhaps not exactly the same).

In any case, I agree with dropping completely empty columns (including empty headers), and with encoding empty cells with "". I'm not sure what to do about column headers yet. Let me think on this more, and test with #N/A values.

I've pushed a change to:

  • only drop a column iff its header is also nullish
  • side note: I've seen a case at work where a financial model for some reason had a rogue - at row 1,000,000, and pandas blew up after consuming > 15 GB of memory trying to convert the dataframe to HTML. so cleaning is kinda important, especially for unsanitized inputs, i.e. user data

Additionally:

  • dropping rows should be safe because we don't care about indexes
  • I've checked that =NA() or #N/A values in Excel are indeed also parsed to NaN by pandas

Regarding XLSXConverter kwargs, currently:

  • na_rep defaults to "" - covered in tests
  • drop_null_{rows|columns} defaults to False - not covered in tests

Not sure how verbose you would want the tests to be, because I think for the latter we will need to reparse the string representation.
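For instance, a minimal helper for reparsing the markdown back into rows might look like this (a hypothetical test utility, not part of markitdown):

```python
def parse_markdown_table(md: str):
    """Split a pipe table back into (header, rows) so a test can assert
    that dropped rows/columns are really gone from the output."""
    lines = [line for line in md.strip().splitlines() if line.startswith("|")]
    cells = [[c.strip() for c in line.strip("|").split("|")] for line in lines]
    return cells[0], cells[2:]  # cells[1] is the "---" separator row
```
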

Also, I see a related PR, #169. If we intend to merge it, another todo would be to wrap all the .xl{s|sx|sm|...} converters in another base class and move the cleaning logic to a separate internal method. Lmk what you think

hewliyang avatar Dec 22 '24 13:12 hewliyang

Thanks for adding the --drop flags!

I've seen a case at work where a financial model for some reason had a rogue - at row 1,000,000, and pandas blew up after consuming > 15 GB of memory trying to convert the dataframe to HTML. so cleaning is kinda important, especially for unsanitized inputs, i.e. user data

Excel supports ~1.048M rows max, but in my experience it starts to get sluggish with sheets over 100k rows - that's another story, though.

On the topic of pandas using too much memory - I generally advocate replacing pandas with polars or duckdb wherever possible, especially when working with large datasets.

I'm actually planning to open a PR myself in January to replace the pandas functions with polars wherever possible. Since I expect this library will be used to convert vast amounts of data, we should make it as efficient as possible.

vmatt avatar Dec 23 '24 11:12 vmatt

Is there any progress on this? The patch for replacing "NaN" with "" in the MD would be really important.

johann-petrak avatar Mar 24 '25 18:03 johann-petrak

Not yet, but let me prioritize it for this week. I'm not sold on removing columns or headers... but swapping NaN for "" makes a lot of sense to me.

afourney avatar Mar 24 '25 19:03 afourney

Not yet, but let me prioritize it for this week. I'm not sold on removing columns or headers... but swapping NaN for "" makes a lot of sense to me.

Totally agree... removing columns or headers should probably be optional, enabled via a command line / API parameter, but the NaNs are surprising if a cell is simply empty. Currently I am working around this by making my own API call and then manually replacing " NaN " in the generated string with " ".

Personally I very much like it when a utility like this one lets the user choose such things via command line parameters or API parametrization, especially since this tool can be used for many purposes.
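The post-processing workaround described above can be sketched like so (a hypothetical snippet; note it is fragile, since a genuine string cell containing exactly "NaN" would also be blanked - which is why na_rep support in the converter is preferable):

```python
import re

def blank_nan_cells(markdown: str) -> str:
    """Replace literal NaN cells in a pipe table with empty cells."""
    return re.sub(r"(?<=\| )NaN(?= \|)", "", markdown)
```
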

johann-petrak avatar Mar 25 '25 07:03 johann-petrak

Sounds good. This PR has drifted from when it was submitted. Let me try to migrate the "NaN" -> "" bits to the new main branch, and we can add further refinements in another PR.

afourney avatar Mar 25 '25 18:03 afourney

Not yet, but let me prioritize it for this week. I'm not sold on removing columns or headers... but swapping NaN for "" makes a lot of sense to me.

Totally agree... removing columns or headers should probably be optional, enabled via a command line / API parameter, but the NaNs are surprising if a cell is simply empty. Currently I am working around this by making my own API call and then manually replacing " NaN " in the generated string with " ".

Personally I very much like it when a utility like this one lets the user choose such things via command line parameters or API parametrization, especially since this tool can be used for many purposes.

I believe the current implementation already satisfies your requirements:

  • removing columns / headers is optional via kwargs (defaults to False)
  • na_rep defaults to the empty string

Sounds good. This PR has drifted from when it was submitted. Let me try to migrate the "NaN" -> "" bits to the new main branch, and we can add further refinements in another PR.

I've updated it to match the latest HEAD, @afourney. Please have a look at the updated PR description. The changes apply what you wanted to cherry-pick by default - and do not apply what I've proposed by default - but the option is there for the user.

The only thing is I don't see a clear way to surface these options as it stands; the CLI also isn't compatible with arbitrary kwargs right now.

hewliyang avatar Apr 03 '25 02:04 hewliyang