amazon-textract-response-parser

Table.rows_without_header adds duplicate non-header rows: the append is nested one level too deep

Open cleung11 opened this issue 3 years ago • 0 comments

The property checks `if not header` and appends the row inside the `for cell` loop, so a non-header row is appended once for every cell it contains. The check and the append should be moved one level out, into the `for row` loop, so that each row is added at most once: https://github.com/aws-samples/amazon-textract-response-parser/blob/master/src-python/trp/__init__.py#L431

Original:

    @property
    def rows_without_header(self) -> List[Row]:
        non_header_rows: List[Row] = list()
        for row in self.rows:
            header = False
            for cell in row.cells:
                for entity_type in cell.entityTypes:
                    if entity_type == ENTITY_TYPE_COLUMN_HEADER:
                        header = True
                if not header:
                    non_header_rows.append(row)
        return non_header_rows

New:

    @property
    def rows_without_header(self) -> List[Row]:
        non_header_rows: List[Row] = list()
        for row in self.rows:
            header = False
            for cell in row.cells:
                for entity_type in cell.entityTypes:
                    if entity_type == ENTITY_TYPE_COLUMN_HEADER:
                        header = True
            if not header: # moved this left one tab
                non_header_rows.append(row) # moved this left one tab
        return non_header_rows
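
A minimal sketch to reproduce the difference outside the library. The `Cell` and `Row` classes below are hypothetical stand-ins for the trp ones, the two functions are free-function copies of the property bodies above, and the value of `ENTITY_TYPE_COLUMN_HEADER` is assumed:

```python
from dataclasses import dataclass, field
from typing import List

# Assumed value; the real constant is defined in trp.
ENTITY_TYPE_COLUMN_HEADER = "COLUMN_HEADER"


@dataclass
class Cell:
    """Hypothetical stand-in for a trp cell."""
    text: str
    entityTypes: List[str] = field(default_factory=list)


@dataclass
class Row:
    """Hypothetical stand-in for a trp row."""
    cells: List[Cell]


def rows_without_header_original(rows: List[Row]) -> List[Row]:
    non_header_rows: List[Row] = list()
    for row in rows:
        header = False
        for cell in row.cells:
            for entity_type in cell.entityTypes:
                if entity_type == ENTITY_TYPE_COLUMN_HEADER:
                    header = True
            # Inside the cell loop: runs once per cell.
            if not header:
                non_header_rows.append(row)
    return non_header_rows


def rows_without_header_fixed(rows: List[Row]) -> List[Row]:
    non_header_rows: List[Row] = list()
    for row in rows:
        header = False
        for cell in row.cells:
            for entity_type in cell.entityTypes:
                if entity_type == ENTITY_TYPE_COLUMN_HEADER:
                    header = True
        # Inside the row loop: runs once per row.
        if not header:
            non_header_rows.append(row)
    return non_header_rows


header_row = Row([
    Cell("Name", [ENTITY_TYPE_COLUMN_HEADER]),
    Cell("Qty", [ENTITY_TYPE_COLUMN_HEADER]),
])
data_row = Row([Cell("Widget"), Cell("3")])
rows = [header_row, data_row]

buggy = rows_without_header_original(rows)
fixed = rows_without_header_fixed(rows)

print(len(buggy))  # 2: the two-cell data row is appended once per cell
print(len(fixed))  # 1: the data row is appended once
```

With a table of one header row and one two-cell data row, the original version returns the data row twice and the moved version returns it once.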

cleung11 • Feb 25 '22 21:02