Trouble replicating markdown output

Open bvbg1 opened this issue 7 months ago • 8 comments

I tried out the code from this example: https://aws-samples.github.io/amazon-textract-textractor/notebooks/document_linearization_to_markdown_or_html.html#All-entities-can-be-linearized

The markdown output I'm getting differs from the output shown in that example and is incorrect. Several rows are split across multiple lines, which breaks the tables:

| CO.
     | FILE
                              | DEPT.
   | CLOCK
   | NUMBER
   |
|-----|------------------------------|---|---|---|
| ABC | 126543 123456 12345 00000000 |   |   |   |

|                |           |
|----------------|-----------|
| Period ending: | 7/18/2008 |
| Pay date:      | 7/25/2008 |

|          |                       |
|----------|-----------------------|
| Federal: | 3. $25 Additional Tax |
| State:   | 2                     |
| Local:   | 2                     |

| Earnings
          | rate
           | hours
       | this period
          | year to date
           |
|----------|-----------|-------|----------|-----------|
| Regular  | 10.00     | 32.00 | 320.00   | 16,640.00 |
| Overtime | 15.00     | 1.00  | 15.00    | 780.00    |
| Holiday  | 10.00     | 8.00  | 80.00    | 4,160.00  |
| Tuition  |           |       | 37.43    | 1,946.80  |
|          | Gross Pay |       | $ 452.43 | 23,526.80 |

|                 |             |               |
|-----------------|-------------|---------------|
| Other Benefits and

Information                 | this period | total to date |
| Group Term Life | 0.51        | 27.00         |
| Loan Amt Paid   |             | 840.00        |
| Vac Hrs         |             | 40.00         |
| Sick Hrs        |             | 16.00         |
| Title           | Operator    |               |

|            |                     |         |          |
|------------|---------------------|---------|----------|
| Deductions | Statutory

Federal Income Tax                     | -40.60  | 2,111.20 |
|            | Social Security Tax | -28.05  | 1,458.60 |
|            | Medicare Tax        | -6.56   | 341.12   |
|            | NY State Income Tax | -8.43   | 438.36   |
|            | NYC Income Tax      | -5.94   | 308.88   |
|            | NY SUI/SDI Tax      | -0.60   | 31.20    |
|            | Other
 Bond                     | -5.00   | 100.00   |
|            | 401(k)              | -28.85  | 1,500.20 |
|            | Stock Plan          | -15.00  | 150.00   |
|            | Life Insurance      | -5.00   | 50.00    |
|            | Loan                | -30.00  | 150.00   |
|            | Adjustment

Life Insurance                     | + 13.50 |          |
|            | Net Pay             | $291.90 |          |

|                       |             |
|-----------------------|-------------|
| Payroll check number: | 0000000000  |
| Pay date:             | 7/25/2008   |
| Social Security No.   | 987-65-4321 |

|              |                                           |         |
|--------------|-------------------------------------------|---------|
| Pay to the

order of:              | JOHN STILES                               |         |
| This amount: | TWO HUNDRED NINETY-ONE AND 90/100 DOLLARS | $291.90 |

This is my code:

from PIL import Image
from textractor import Textractor
from textractor.data.constants import TextractFeatures

# Load the input image and convert it to RGB
image = Image.open("stub1.jpg").convert("RGB")

extractor = Textractor(region_name="us-west-2")

# Analyze the image with layout, table, form, and signature features enabled
document = extractor.analyze_document(
    file_source=image,
    features=[
        TextractFeatures.LAYOUT,
        TextractFeatures.TABLES,
        TextractFeatures.FORMS,
        TextractFeatures.SIGNATURES,
    ],
    save_image=True,
)

# Render all detected tables as markdown
print(document.tables.to_markdown())

I'm using amazon-textract-textractor version 1.8.2 (latest)
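
As a stopgap, a post-processing step along these lines might rejoin the split rows. This is only a sketch and not part of textractor: `repair_markdown_tables` is a hypothetical helper, and it assumes that every table row starts with `|` and that only a complete row ends with `|`, which matches the output above but may not hold in general.

```python
def repair_markdown_tables(markdown: str) -> str:
    """Rejoin markdown table rows that were split by line breaks inside cells.

    Heuristic (an assumption, not textractor behavior): a row starts with "|"
    and is complete once the accumulated text ends with "|".
    """
    repaired = []
    pending = []  # fragments of a row that is not complete yet

    for line in markdown.splitlines():
        stripped = line.strip()

        # Outside a row: pass through anything that does not start a row.
        if not pending and not stripped.startswith("|"):
            repaired.append(line)
            continue

        # Inside a row: blank lines come from line breaks in a cell, skip them.
        if stripped:
            pending.append(stripped)

        joined = " ".join(pending)
        if len(joined) > 1 and joined.endswith("|"):
            if len(pending) == 1:
                # The row was never split, keep the original formatting.
                repaired.append(line)
            else:
                # Collapse the whitespace left over from the split fragments.
                repaired.append(" ".join(joined.split()))
            pending = []

    # Keep any trailing incomplete row rather than dropping it.
    if pending:
        repaired.append(" ".join(" ".join(pending).split()))

    return "\n".join(repaired)
```

With the code above, this would be called as `repair_markdown_tables(document.tables.to_markdown())`. Rejoined rows lose their column padding, but the tables should still render.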

bvbg1, Jul 14 '24 18:07