framework icon indicating copy to clipboard operation
framework copied to clipboard

[Feature Request] -- how to detect tables even fuzzily in an Excel file/sheet?

Open simkimsia opened this issue 3 years ago • 1 comments

I was recommended to post this as a possible feature request by @aivuk at https://frictionlessdata.slack.com/archives/C0362US1U3G/p1658213511287809?thread_ts=1658148616.050559&cid=C0362US1U3G

My original question was

I have an Excel file and is there an easy way to detect a table and its headers? I don’t mean Excel table as defined here in Overview of Excel tables (microsoft.com) There will be a table but there’s no a priori knowledge of where that table starts (assume a specific sheet). and where the table ends. is there a way to detect or extract it given such broad paramters using frictionless?

Then @aivuk response was

  1. Load the file with pandas
  2. Detect the number of lines that are empty from top to bottom (num_blank_lines)
  3. Detect the number of columns that are empty from left to right (num_blank_cols)
  4. Detect the last column number N that is not blank
  5. use pandas.read_excel('file.xlsx; , skiprows=num_blank_lines, usecols=[num_blank_cols, num_blank_cols + 1, ...., num_blank_cols + N) where N is the last column that is not blank calculated in the step 4

The problem with this approach though I guess, can still tweak to some extent

is if we go by detect number of empty lines and columns and look for the first non empty row or column algo might have issue, if there are extra standalone cell info that appear outside the table and closer to the top left hand corner.

@aivuk also gave an example of csv file with similar issue https://github.com/aivuk/frictionless-notebooks/blob/main/describing-and-reading-d[…]rtal/Example%20of%20parsing%20a%20csv%20with%20a%20schema.ipynb

but i see it as csv file with multiple line headers.

My issue is a little bit more challenging as the "table" in excel sheet can appear anywhere plus you can have multiple "tables".

I am writing this issue down, though I am open to exploring it.

simkimsia avatar Jul 20 '22 08:07 simkimsia

Thanks @simkimsia,

It's implemented on the basic level (if you need to detect the header row after some comments - see an example in Slack) but as a more general solution it's def a feature request (I would say similar to detecting tables in HTML or PDF)

roll avatar Jul 25 '22 06:07 roll