office-text-extractor
pull sheet names as part of text from xlsx files?
Description
When pulling text from a spreadsheet, the current extractor does not return the sheet names in the text. It would be GREAT if there was an option to preface the sheet text with the sheet name.
Why
Often, important contextual information is included in sheet names.
It should be easy to implement: the office-text-extractor code already has the sheet names, since the sheet data is accessed via the sheet name. One solution could be a simple boolean flag that controls whether the sheet name is written into the === separator that denotes new sheet text. It could be set to false by default for backward compatibility.
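To make the idea concrete, here is a minimal sketch of what I have in mind. It is not the library's actual code: the `SheetText` shape, the `joinSheets` helper and the `includeSheetNames` flag are all names I made up for illustration, and I am assuming the separator is written before each sheet's text.

```typescript
// Hypothetical input shape: one entry per sheet, in workbook order.
interface SheetText {
  name: string;
  text: string; // text already extracted from the sheet's rows
}

// `includeSheetNames` is an assumed option name.
// Leaving it false keeps the bare === separator, so existing output is unchanged.
function joinSheets(sheets: SheetText[], includeSheetNames = false): string {
  return sheets
    .map(({ name, text }) => {
      const separator = includeSheetNames ? `=== ${name}` : "===";
      return `${separator}\n${text}`;
    })
    .join("\n");
}

const exampleSheets: SheetText[] = [
  { name: "Q1 Sales", text: "region: north\ntotal: 10" },
  { name: "Q2 Sales", text: "region: south\ntotal: 12" },
];

// With the flag on, each separator line carries the sheet name.
console.log(joinSheets(exampleSheets, true));
```

Whether the name goes on the separator line or on the line after it is an open choice; the sketch puts it on the same line only to keep the example short.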
Alternatives
I mean, I love the all-in-one nature of office-text-extractor, but I could process the files myself instead.
(I'd be happy to create a pull request for this, but I'm not sure where you would prefer to place such a boolean. If you let me know, I'd be happy to create one.)
Hi,
Thanks for opening this issue!
I would definitely like this to be the default behaviour of the library; I'm not sure why I didn't do this in the first place. A PR that appends the sheet name near the === separator (on the same line? or the next line? let me know what would be better) sounds good.
I suppose we could add a boolean option to configure this, in the constructor of the ExcelExtractor class. But I don't think it is needed.
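If we did go the constructor route, a rough sketch could look like the following. This is only an illustration, not the current code: `ExcelExtractorSketch` is a stand-in for the real class, and the `ExcelExtractorOptions` interface, the `includeSheetNames` option and the `separatorFor` method are all hypothetical names.

```typescript
// Hypothetical options object; `includeSheetNames` is an assumed name.
interface ExcelExtractorOptions {
  includeSheetNames?: boolean;
}

// Stand-in for the real ExcelExtractor class, showing only the option handling.
class ExcelExtractorSketch {
  private readonly includeSheetNames: boolean;

  constructor(options: ExcelExtractorOptions = {}) {
    // The option is off unless explicitly enabled.
    this.includeSheetNames = options.includeSheetNames ?? false;
  }

  // Builds the separator line written for a sheet.
  separatorFor(sheetName: string): string {
    return this.includeSheetNames ? `=== ${sheetName}` : "===";
  }
}

const plain = new ExcelExtractorSketch();
const named = new ExcelExtractorSketch({ includeSheetNames: true });

console.log(plain.separatorFor("Summary"));
console.log(named.separatorFor("Summary"));
```

Making the sheet name the default, as I would prefer, would just mean flipping the fallback value in the constructor.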
Regards, Vedant
I know this could be a breaking change, which is why I suggested the boolean. Not sure how many users you have that need no format changes.
Speaking of formats, what format is this? I see the row by row conversion to YAML and the row / sheet separators. I know '---' is the document header syntax, but I'm not familiar with '==='. Also, is there a reason you picked YAML instead of, say CSV?
I'm not being critical here - I'm just curious what you had in mind and the use cases. Want to make sure that whatever I put in aligns with the plans.
> Not sure how many users you have that need no format changes.
I have no idea either, but you're right - it is a breaking change, and to be safe we should hide it behind a boolean flag that is false by default.
> Speaking of formats, what format is this? I see the row by row conversion to YAML and the row / sheet separators. I know '---' is the document header syntax, but I'm not familiar with '==='. Also, is there a reason you picked YAML instead of, say CSV?
I did not follow a format, I made my own 😅 The --- to separate rows and === to separate sheets is completely arbitrary.
I chose YAML because it maintained a text-sense of structure instead of a grid-sense of structure, i.e., col-header:value instead of value,value,value.
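To illustrate the difference, here is a small sketch of the two shapes for the same rows. It is not the library's implementation: the `Row` type and the three helper functions are hypothetical, and the key/value output is plain string joining rather than a real YAML serializer (no quoting or escaping).

```typescript
// Hypothetical row shape: ordered [column header, cell value] pairs.
type Row = Array<[string, string | number]>;

// Text-sense of structure: one "col-header: value" line per cell.
function rowToKeyValue(row: Row): string {
  return row.map(([header, value]) => `${header}: ${value}`).join("\n");
}

// Grid-sense of structure: values only, the headers are lost per row.
function rowToCsv(row: Row): string {
  return row.map(([, value]) => String(value)).join(",");
}

// Rows within a sheet are separated by a --- line.
function sheetToText(rows: Row[]): string {
  return rows.map(rowToKeyValue).join("\n---\n");
}

const exampleRows: Row[] = [
  [["name", "Ada"], ["age", 36]],
  [["name", "Alan"], ["age", 41]],
];

console.log(sheetToText(exampleRows));
console.log(exampleRows.map(rowToCsv).join("\n"));
```

In the key/value form every value sits next to its column header, which is what made it useful for the word-based analysis described in this thread.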
The 'text-based structure' was actually useful to me in the project I wrote this package for, where I was extracting text from files to identify 'topics', primarily based on the position and frequency of words.
That said, I am open to adding more options that configure the format of the output.