
pull sheet names as part of text from xlsx files?

Open chazzmoney opened this issue 1 year ago • 5 comments

Description

When pulling text from a spreadsheet, the current extractor does not return the sheet names in the text. It would be great if there were an option to preface each sheet's text with its sheet name.

Why

Often, important contextual information is included in sheet names.

It would be easy to implement: the office-text-extractor code already has the sheet names, since the sheet data is accessed via the sheet name. One solution could be a simple boolean flag that controls whether the sheet name is written into the === separator that marks the start of a new sheet's text. It could be set to false by default for backward compatibility.
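A minimal sketch of what such a flag might look like. Every name here (`SheetTextOptions`, `includeSheetNames`, `sheetsToText`) is hypothetical and not taken from office-text-extractor, and the `header: value` row rendering is a simplified stand-in for the library's actual output:

```typescript
// Hypothetical sketch: none of these names come from office-text-extractor.
type Row = Record<string, string | number>;

interface SheetTextOptions {
  // Assumed flag name; defaults to false so existing output is unchanged.
  includeSheetNames?: boolean;
}

// Render one row as "header: value" lines (simplified stand-in for YAML).
function rowToText(row: Row): string {
  return Object.entries(row)
    .map(([header, value]) => `${header}: ${value}`)
    .join("\n");
}

// Join all sheets into one string. Rows are separated by "---".
// Without the flag, sheets are separated by a bare "===" line.
// With the flag, every sheet is introduced by "=== <sheet name>".
function sheetsToText(
  sheets: Record<string, Row[]>,
  options: SheetTextOptions = {},
): string {
  const includeSheetNames = options.includeSheetNames ?? false;
  const parts: string[] = [];
  Object.entries(sheets).forEach(([name, rows], index) => {
    if (includeSheetNames) {
      parts.push(`=== ${name}`);
    } else if (index > 0) {
      parts.push("===");
    }
    parts.push(rows.map(rowToText).join("\n---\n"));
  });
  return parts.join("\n");
}
```

Calling `sheetsToText(sheets)` keeps the old shape, while `sheetsToText(sheets, { includeSheetNames: true })` opts in. Putting the name on the same line as `===` is only one choice; it could equally go on the following line.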

Alternatives

I mean, I love the all-in-one nature of office-text-extractor, but I could process the files myself instead.

chazzmoney avatar Mar 18 '24 22:03 chazzmoney

I'd be happy to create a pull request for this, but I'm not sure where you would prefer to place such a boolean. If you let me know, I'll put one together.

chazzmoney avatar Mar 18 '24 23:03 chazzmoney

Hi,

Thanks for opening this issue!

I would definitely like this to be the default behaviour of the library, not sure why I hadn't done this in the first place. A PR that appends the sheet name near the === separator (on the same line? or the next line? let me know what would be better) sounds good.

I suppose we could add a boolean option to configure this, in the constructor of the ExcelExtractor class. But I don't think it is needed.

Regards, Vedant

gamemaker1 avatar Mar 19 '24 07:03 gamemaker1

I know this could be a breaking change, which was the intent of the boolean. Not sure how many users you have that need no format changes.

Speaking of formats, what format is this? I see the row by row conversion to YAML and the row / sheet separators. I know '---' is the document header syntax, but I'm not familiar with '==='. Also, is there a reason you picked YAML instead of, say CSV?

I'm not being critical here - I'm just curious what you had in mind and the use cases. Want to make sure that whatever I put in aligns with the plans.

chazzmoney avatar Mar 19 '24 20:03 chazzmoney

> Not sure how many users you have that need no format changes.

I have no idea either, but you're right - it is a breaking change, and to be safe we should hide it behind a boolean flag that is false by default.

> Speaking of formats, what format is this? I see the row by row conversion to YAML and the row / sheet separators. I know '---' is the document header syntax, but I'm not familiar with '==='. Also, is there a reason you picked YAML instead of, say CSV?

I did not follow a format; I made my own 😅 Using --- to separate rows and === to separate sheets is completely arbitrary.

I chose YAML because it maintained a text-sense of structure instead of a grid-sense of structure, i.e., col-header:value instead of value,value,value.
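To make the contrast concrete, here is a small sketch of the two row shapes. The helper names (`rowAsKeyValue`, `rowAsCsv`) are made up for illustration and are not part of office-text-extractor; the CSV helper does no quoting or escaping:

```typescript
// Hypothetical helpers for illustration; not part of office-text-extractor.
type Cell = string | number;

// "Text-sense" of structure: each value travels with its column header.
function rowAsKeyValue(headers: string[], row: Cell[]): string {
  return headers
    .map((header, i) => `${header}: ${row[i] ?? ""}`)
    .join("\n");
}

// "Grid-sense" of structure: values only; meaning depends on column position.
// Simplified: no quoting or escaping of commas inside values.
function rowAsCsv(row: Cell[]): string {
  return row.join(",");
}
```

For headers `name` and `city` and the row `Ada, London`, the first helper yields two lines, `name: Ada` and `city: London`, whereas the second yields the single line `Ada,London`. In the first form each value stays next to its label, so the label words are part of the extracted text itself.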

The 'text-based structure' was actually useful to me in the project I wrote this package for, where I was extracting text from files to identify 'topics', primarily based on the position and frequency of words.

gamemaker1 avatar Mar 20 '24 14:03 gamemaker1

That said, I am open to adding more options that configure the format of the output.

gamemaker1 avatar Mar 20 '24 14:03 gamemaker1