unstructured icon indicating copy to clipboard operation
unstructured copied to clipboard

Feat: Add-rc-locator-to-partition-excel

Open marctorsoc opened this issue 1 year ago • 5 comments

As described in https://github.com/Unstructured-IO/unstructured/issues/3236, I wanted to add the coordinate RC to the metadata of the returned elements. I'm adding both rc and excel_rc. The former being 0-based, the latter 1-based (as in Excel).

marctorsoc avatar Jun 20 '24 11:06 marctorsoc

can I get some feedback here? e.g. @potter-potter @rbiseck3 (sorry for the spam, feel free to share with whoever that could help)

marctorsoc avatar Jul 06 '24 09:07 marctorsoc

Hi @potter-potter, I fixed the conflicts. Could you enable CI at least?

marctorsoc avatar Jul 31 '24 19:07 marctorsoc

Hi @potter-potter, I fixed the conflicts. Could you enable CI at least?

hey @marctorsoc I alerted @scanny to this pr earlier this month. you'll have to go through him.

potter-potter avatar Jul 31 '24 21:07 potter-potter

Hi @marctorsoc I'm not sure this metadata is of general interest for users of the library. Let me ask around and see what folks have to say.

For sure we wouldn't want two fields with the same value offset by 1. A user could easily adjust for themselves if they want cardinal vs. ordinal numbering.

And starting_page_number has a different meaning having to do with adjusting page-numbering, so repurposing it for selective partitioning would change its behavior. That parameter is used across multiple partitioners and it needs to have consistent semantics across all those.

I think your best bet for the moment would be to use your fork if you want these new behaviors.

scanny avatar Aug 01 '24 17:08 scanny

Hi @marctorsoc I'm not sure this metadata is of general interest for users of the library. Let me ask around and see what folks have to say.

For sure we wouldn't want two fields with the same value offset by 1. A user could easily adjust for themselves if they want cardinal vs. ordinal numbering.

And starting_page_number has a different meaning having to do with adjusting page-numbering, so repurposing it for selective partitioning would change its behavior. That parameter is used across multiple partitioners and it needs to have consistent semantics across all those.

I think your best bet for the moment would be to use your fork if you want these new behaviors.

Hi @scanny, I had removed the second bit, but did not update the description. I'm fine with removing that.

Regarding the two fields, I think they are useful because one is how dataframes and in general programming works (0-based) vs how Excel works. So when checking what the library gives vs the excel, it's convenient to have the excel formatting. It's not only an offset, but also converting the column to a letter. But yeah, I'm ok with removing that too.

IMHO, providing the coordinates of each partitioned element should be part of the metadata. For any type of document. This is the case for pdfs. But for docx and txt it's the same story as xlsx. It's used internally but not surfaced to the user, who could also be interested.

Well, lmk what the wise council decide 😉

marctorsoc avatar Aug 01 '24 21:08 marctorsoc