
chore: xlsx improvements

Open hewliyang opened this issue 1 year ago • 15 comments

updates Apr 3 2025

added the following kwargs to the Excel-like converters:

na_rep: Any = ""
remove_header_placeholders: bool = True
drop_empty_cols: bool = False
drop_empty_rows: bool = False

  • only NaN-like values (None, NaT) are scrubbed out by default, to na_rep=""
  • pandas header placeholders (Unnamed: {i}) are scrubbed out by default, to na_rep=""

sparsity options:

  • fully empty columns can optionally be removed
  • fully empty rows can optionally be removed
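For intuition, the combined behavior of these options can be sketched in pure Python over a plain list-of-lists table (a hypothetical illustration - the actual converters operate on pandas DataFrames, and all names here are made up):

```python
import math
import re

def is_na(value):
    """NaN-like check: None, or a float NaN (what None/NaT/NaN collapse to here)."""
    return value is None or (isinstance(value, float) and math.isnan(value))

def clean_table(header, rows, na_rep="", remove_header_placeholders=True,
                drop_empty_cols=False, drop_empty_rows=False):
    """Apply the four options to a header plus a list-of-lists table."""
    if remove_header_placeholders:
        # pandas names missing headers "Unnamed: {i}"
        header = [na_rep if re.fullmatch(r"Unnamed: \d+", str(h)) else h
                  for h in header]
    # replace NaN-like cells with na_rep
    rows = [[na_rep if is_na(v) else v for v in row] for row in rows]
    if drop_empty_rows:
        rows = [row for row in rows if any(v != na_rep for v in row)]
    if drop_empty_cols:
        # keep a column if its header or any of its cells is non-empty
        keep = [i for i in range(len(header))
                if header[i] != na_rep or any(row[i] != na_rep for row in rows)]
        header = [header[i] for i in keep]
        rows = [[row[i] for i in keep] for row in rows]
    return header, rows
```
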

also refactored the Xlsx and Xls parsers to be more DRY with an ExcelParser base class. this will make it easier to

  • introduce more Excel-like types
  • modify the underlying parsing logic, e.g. to use calamine instead; see #259
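a rough sketch of what such a base class could look like (illustrative only - the real refactor lives in the PR, and these names are hypothetical):

```python
from abc import ABC, abstractmethod
from typing import Dict, List

class ExcelParser(ABC):
    """Shared conversion logic; subclasses only supply raw sheet data
    from whatever backend they use (openpyxl, xlrd, calamine, ...)."""

    @abstractmethod
    def load_sheets(self, path: str) -> Dict[str, List[List[object]]]:
        """Return {sheet_name: rows}, where the first row is the header."""

    def to_markdown(self, path: str) -> str:
        out = []
        for name, rows in self.load_sheets(path).items():
            header, *body = rows
            out.append(f"## {name}")
            out.append("| " + " | ".join(map(str, header)) + " |")
            out.append("| " + " | ".join("---" for _ in header) + " |")
            for row in body:
                out.append("| " + " | ".join(map(str, row)) + " |")
        return "\n".join(out)
```

an XlsxParser / XlsParser subclass would then only override load_sheets, so swapping in a different backend touches a single method.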

usage

from markitdown import MarkItDown

md = MarkItDown()

# todo: cli support for parser opts?
res = md.convert(
    "packages/markitdown/tests/test_files/test.xlsx",
    na_rep="🤗",
    remove_header_placeholders=True,
    drop_empty_cols=False,
    drop_empty_rows=False,
)

print(res.markdown)
## Sheet1
| Alpha | 🤗 | Beta | Gamma | Delta | 🤗 |
| --- | --- | --- | --- | --- | --- |
| 89.0 | 🤗 | 82 | 100.0 | 12.0 | 13.0 |
| 76.0 | 🤗 | 89 | 33.0 | 42.0 | 43.0 |
| 🤗 | 🤗 | 🤗 | 🤗 | 🤗 | 🤗 |
| 60.0 | 🤗 | 84 | 19.0 | 19.0 | 53.0 |
| 7.0 | 🤗 | 69 | 10.0 | 17.0 | 63.0 |
| 87.0 | 🤗 | 89 | 86.0 | 54.0 | 312.0 |
| 23.0 | 🤗 | 4 | 89.0 | 25.0 | 23.0 |
| 70.0 | 🤗 | 84 | 62.0 | 59.0 | 34.0 |
| 83.0 | 🤗 | 37 | 43.0 | 21.0 | 321.0 |
| 71.0 | 🤗 | 15 | 88.0 | 32.0 | 431.0 |
| 20.0 | 🤗 | 62 | 20.0 | 67.0 | 647.0 |
| 67.0 | 🤗 | 18 | 15.0 | 48.0 | 878.0 |
| 42.0 | 🤗 | 5 | 15.0 | 67.0 | 🤗 |
| 58.0 | 🤗 | 6ff4173b-42a5-4784-9b19-f49caff4d93d | 22.0 | 9.0 | 346.0 |
| 49.0 | 🤗 | 93 | 6.0 | 38.0 | 317.0 |
| 82.0 | 🤗 | 28 | 1.0 | 39.0 | 341.0 |
| 95.0 | 🤗 | 55 | 18.0 | 82.0 | 135.0 |
| 50.0 | 🤗 | 46 | 98.0 | 86.0 | 135.0 |
| 31.0 | 🤗 | 46 | 47.0 | 82.0 | 3634.0 |
| 40.0 | 🤗 | 65 | 19.0 | 31.0 | 357.0 |
| 95.0 | 🤗 | 65 | 29.0 | 62.0 | 34.0 |
| 68.0 | 🤗 | 57 | 34.0 | 54.0 | 901.0 |
| 96.0 | 🤗 | 66 | 63.0 | 14.0 | 299.0 |
| 87.0 | 🤗 | 93 | 95.0 | 80.0 | 19.0 |

## 09060124-b5e7-4717-9d07-3c046eb
| ColA | ColB | ColC | ColD |
| --- | --- | --- | --- |
| 1 | 2 | 3 | 4 |
| 5 | 6 | 7 | 8 |
| 9 | 10 | 11 | 12 |
| 13 | 14 | 15 | affc7dad-52dc-4b98-9b5d-51e65d8a8ad0 |

Old

a couple simple heuristics that should be safe to apply in any general case:

  • if any row or column is fully null, we can save tokens by deleting it
  • dataframe column names that are null get auto-filled as Unnamed: {i}, and I think we can scrub this out
  • NaNs in the dataframe materialize as "NaN" in the markdown and can probably be scrubbed out

Examples


here is what I have expanded the test.xlsx to look like (changes annotated in red):

[image: the expanded test.xlsx, with changes annotated in red]

which would produce:


## Sheet1

| Alpha | Unnamed: 1 | Beta | Gamma | Delta | Unnamed: 5 |
| --- | --- | --- | --- | --- | --- |
| 89 | NaN | 82 | 100 | 12 | 13.0 |
| 76 | NaN | 89 | 33 | 42 | 43.0 |
| 60 | NaN | 84 | 19 | 19 | 53.0 |
| 7 | NaN | 69 | 10 | 17 | 63.0 |
| 87 | NaN | 89 | 86 | 54 | 312.0 |
| 23 | NaN | 4 | 89 | 25 | 23.0 |
| 70 | NaN | 84 | 62 | 59 | 34.0 |
| 83 | NaN | 37 | 43 | 21 | 321.0 |
| 71 | NaN | 15 | 88 | 32 | 431.0 |
| 20 | NaN | 62 | 20 | 67 | 647.0 |
| 67 | NaN | 18 | 15 | 48 | 878.0 |
| 42 | NaN | 5 | 15 | 67 | NaN |
| 58 | NaN | 6ff4173b-42a5-4784-9b19-f49caff4d93d | 22 | 9 | 346.0 |
| 49 | NaN | 93 | 6 | 38 | 317.0 |
| 82 | NaN | 28 | 1 | 39 | 341.0 |
| 95 | NaN | 55 | 18 | 82 | 135.0 |
| 50 | NaN | 46 | 98 | 86 | 135.0 |
| 31 | NaN | 46 | 47 | 82 | 3634.0 |
| 40 | NaN | 65 | 19 | 31 | 357.0 |
| 95 | NaN | 65 | 29 | 62 | 34.0 |
| 68 | NaN | 57 | 34 | 54 | 901.0 |
| 96 | NaN | 66 | 63 | 14 | 299.0 |
| 87 | NaN | 93 | 95 | 80 | 19.0 |

## 09060124-b5e7-4717-9d07-3c046eb

| ColA | ColB | ColC | ColD |
| --- | --- | --- | --- |
| 1 | 2 | 3 | 4 |
| 5 | 6 | 7 | 8 |
| 9 | 10 | 11 | 12 |
| 13 | 14 | 15 | affc7dad-52dc-4b98-9b5d-51e65d8a8ad0 |

with the changes:


## Sheet1

| Alpha | Beta | Gamma | Delta |  |
| --- | --- | --- | --- | --- |
| 89 | 82 | 100 | 12 | 13.0 |
| 76 | 89 | 33 | 42 | 43.0 |
| 60 | 84 | 19 | 19 | 53.0 |
| 7 | 69 | 10 | 17 | 63.0 |
| 87 | 89 | 86 | 54 | 312.0 |
| 23 | 4 | 89 | 25 | 23.0 |
| 70 | 84 | 62 | 59 | 34.0 |
| 83 | 37 | 43 | 21 | 321.0 |
| 71 | 15 | 88 | 32 | 431.0 |
| 20 | 62 | 20 | 67 | 647.0 |
| 67 | 18 | 15 | 48 | 878.0 |
| 42 | 5 | 15 | 67 |  |
| 58 | 6ff4173b-42a5-4784-9b19-f49caff4d93d | 22 | 9 | 346.0 |
| 49 | 93 | 6 | 38 | 317.0 |
| 82 | 28 | 1 | 39 | 341.0 |
| 95 | 55 | 18 | 82 | 135.0 |
| 50 | 46 | 98 | 86 | 135.0 |
| 31 | 46 | 47 | 82 | 3634.0 |
| 40 | 65 | 19 | 31 | 357.0 |
| 95 | 65 | 29 | 62 | 34.0 |
| 68 | 57 | 34 | 54 | 901.0 |
| 96 | 66 | 63 | 14 | 299.0 |
| 87 | 93 | 95 | 80 | 19.0 |

## 09060124-b5e7-4717-9d07-3c046eb

| ColA | ColB | ColC | ColD |
| --- | --- | --- | --- |
| 1 | 2 | 3 | 4 |
| 5 | 6 | 7 | 8 |
| 9 | 10 | 11 | 12 |
| 13 | 14 | 15 | affc7dad-52dc-4b98-9b5d-51e65d8a8ad0 |

hewliyang avatar Dec 16 '24 07:12 hewliyang

@hewliyang please read the following Contributor License Agreement(CLA). If you agree with the CLA, please reply with the following information.

@microsoft-github-policy-service agree [company="{your company}"]

Options:

  • (default - no company specified) I have sole ownership of intellectual property rights to my Submissions and I am not making Submissions in the course of work for my employer.
@microsoft-github-policy-service agree
  • (when company given) I am making Submissions in the course of work for my employer (or my employer has intellectual property rights in my Submissions by contract or applicable law). I have permission from my employer to make Submissions and enter into this Agreement on behalf of my employer. By signing below, the defined term “You” includes me and my employer.
@microsoft-github-policy-service agree company="Microsoft"


hewliyang avatar Dec 16 '24 07:12 hewliyang

@microsoft-github-policy-service agree

hewliyang avatar Dec 16 '24 08:12 hewliyang

@hewliyang thanks for the PR. Our typical use case is including this data as context for an LLM. Do you think your suggested changes, NaN -> "" and Unnamed: {i} -> "", improve generation?

cc @afourney

gagb avatar Dec 17 '24 00:12 gagb

If the objective is beautification, what if we added a kwarg called beautify=True so that both options are available to users?

gagb avatar Dec 17 '24 00:12 gagb

@hewliyang thanks for the PR. Our typical use case is including this data as context for an LLM. Do you think your suggested changes, NaN -> "" and Unnamed: {i} -> "", improve generation?

cc @afourney

thanks for reviewing @gagb 🤗

well, the honest answer is that it's non-trivial to say, because we would have to curate a QA test set and run experiments on it.

but what I do know for sure is that NaN, Unnamed: {i}, and fully-NaN rows/cols are non-informative, so they don't give the LLM any more context than if they were scrubbed out in the first place - increasing the signal while decreasing the noise. plus, you save a bunch of input tokens. the assumption here is that LLMs understand that the empty string "" implies a null/empty cell.
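a rough illustration of the savings on a single row (character counts as a crude proxy for tokens; the row is taken from the example output above):

```python
# one data row before and after scrubbing the two NaN cells
row_before = "| 42.0 | NaN | 5 | 15.0 | 67.0 | NaN |"
row_after = "| 42.0 |  | 5 | 15.0 | 67.0 |  |"

# each scrubbed "NaN" saves 3 characters; two cells here, so 6 total
saved = len(row_before) - len(row_after)
assert saved == 6
```
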

If the objective is beautification, what if we added a kwarg called beautify=True so that both options are available to users?

sounds good!

hewliyang avatar Dec 17 '24 06:12 hewliyang

For empty values, "" is definitely better than NaN. Although I'm not sure we want to remove empty columns - that would be transforming the structure of the data. Maybe this could be a flag, such as --remove-empty-columns / --remove-empty-rows. What do you think @hewliyang?

vmatt avatar Dec 17 '24 10:12 vmatt

For empty values, "" is definitely better than NaN. Although I'm not sure we want to remove empty columns - that would be transforming the structure of the data. Maybe this could be a flag, such as --remove-empty-columns / --remove-empty-rows. What do you think @hewliyang?

sounds good. my thinking on when we would want to keep fully null rows/cols: e.g. when there are subtables in a sheet and we want to maintain some spatial separation (or not). added two flags, drop_empty_{rows|cols}, for this.

also now forwarding na_rep="" so the user can provide a custom placeholder for NaNs as well

hewliyang avatar Dec 17 '24 13:12 hewliyang

Arguably, if the xlsx cell is empty, the conversion should be empty.

If the excel value is "#N/A" then I believe it should be interpreted as NaN (though it's perhaps not exactly the same).

In any case, I agree with dropping completely empty columns (including empty headers), and with encoding empty cells with "". I'm not sure what to do about column headers yet. Let me think on this more, and test with #N/A values.

afourney avatar Dec 17 '24 23:12 afourney

Arguably, if the xlsx cell is empty, the conversion should be empty.

If the excel value is "#N/A" then I believe it should be interpreted as NaN (though it's perhaps not exactly the same).

In any case, I agree with dropping completely empty columns (including empty headers), and with encoding empty cells with "". I'm not sure what to do about column headers yet. Let me think on this more, and test with #N/A values.

I've pushed a change to:

  • only drop a column iff its header is also nullish
  • side note: I've seen a case at work where a financial model for some reason had a rogue - at row 1,000,000, and pandas blew up after consuming > 15 GB of memory trying to convert the dataframe to HTML. so cleaning is kinda important, especially for unsanitized inputs, i.e. user data

Additionally:

  • dropping rows should be safe because we don't care about indexes
  • I've checked that =NA() or #N/A values in Excel are indeed also parsed to NaN by pandas

Regarding XLSXConverter kwargs, currently:

  • na_rep defaults to "" - covered in tests
  • drop_null_{rows|columns} defaults to False - not covered in tests

Not sure how verbose you would want the tests to be, because I think for the latter we will need to reparse the string representation.
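For instance, a minimal helper for reparsing the markdown back into rows might look like this (a hypothetical test utility, not part of markitdown):

```python
def parse_markdown_table(md: str):
    """Split a pipe table back into (header, rows) so a test can assert
    that dropped rows/columns are really gone from the output."""
    lines = [line for line in md.strip().splitlines() if line.startswith("|")]
    cells = [[c.strip() for c in line.strip("|").split("|")] for line in lines]
    return cells[0], cells[2:]  # cells[1] is the "---" separator row
```
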

Also, I see a related PR, #169. If we intend to merge it, another todo would be to wrap all the .xl{s|sx|sm|...} converters in another base class and move the cleaning logic to a separate internal method. Lmk what you think

hewliyang avatar Dec 22 '24 13:12 hewliyang

Thanks for adding the --drop flags!

I've seen a case at work where a financial model for some reason had a rogue - at row 1,000,000, and pandas blew up after consuming > 15 GB of memory trying to convert the dataframe to HTML. so cleaning is kinda important, especially for unsanitized inputs, i.e. user data

Excel supports ~1.048M rows max, but in my experience it starts to get sluggish with sheets over 100k rows - that's another story, though.

On the topic of pandas using too much memory - I generally advocate replacing pandas with polars or duckdb wherever possible, especially when working with large datasets.

I'm actually planning to open a PR myself in January to replace the pandas functions with polars wherever possible. Since I expect this library will be used to convert vast amounts of data, we should make it as efficient as possible.

vmatt avatar Dec 23 '24 11:12 vmatt

Is there any progress on this? The patch for replacing "NaN" with "" in the MD would be really important.

johann-petrak avatar Mar 24 '25 18:03 johann-petrak

Not yet, but let me prioritize it for this week. I'm not sold on removing columns or headers... but swapping NaN for "" makes a lot of sense to me.

afourney avatar Mar 24 '25 19:03 afourney

Not yet, but let me prioritize it for this week. I'm not sold on removing columns or headers... but swapping NaN for "" makes a lot of sense to me.

Totally agree... removing columns or headers should probably be optional, enabled via a command line / API parameter, but the NaNs are surprising if a cell is simply empty. Currently I am working around this by making my own API call and then manually replacing " NaN " in the generated string with " ".

Personally I very much like it when a utility like this one lets the user choose such things via command line parameters or API parametrization, especially since this tool can be used for many purposes.
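The post-processing workaround described above can be sketched like so (a hypothetical snippet; note it is fragile, since a genuine string cell containing exactly "NaN" would also be blanked - which is why na_rep support in the converter is preferable):

```python
import re

def blank_nan_cells(markdown: str) -> str:
    """Replace literal NaN cells in a pipe table with empty cells."""
    return re.sub(r"(?<=\| )NaN(?= \|)", "", markdown)
```
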

johann-petrak avatar Mar 25 '25 07:03 johann-petrak

Sounds good. This PR has drifted from when it was submitted. Let me try to migrate the "NaN" -> "" bits to the new main branch, and we can add further refinements in another PR.

afourney avatar Mar 25 '25 18:03 afourney

Not yet, but let me prioritize it for this week. I'm not sold on removing columns or headers... but swapping NaN for "" makes a lot of sense to me.

Totally agree... removing columns or headers should probably be optional, enabled via a command line / API parameter, but the NaNs are surprising if a cell is simply empty. Currently I am working around this by making my own API call and then manually replacing " NaN " in the generated string with " ".

Personally I very much like it when a utility like this one lets the user choose such things via command line parameters or API parametrization, especially since this tool can be used for many purposes.

I believe the current implementation already satisfies your requirements:

  • removing columns / headers is optional via kwargs (defaults to False)
  • na_rep defaults to the empty string

Sounds good. This PR has drifted from when it was submitted. Let me try to migrate the "NaN" -> "" bits to the new main branch, and we can add further refinements in another PR.

I've updated it to match the latest HEAD, @afourney. Please have a look at the updated PR description. The changes apply what you wanted to cherry-pick by default - and do not apply what I've proposed by default - but the option is there for the user.

The only thing is I don't see a clear way to surface these options as it stands; the CLI also isn't compatible with arbitrary kwargs right now.

hewliyang avatar Apr 03 '25 02:04 hewliyang