Unstructured Data

Open supperherodev opened this issue 4 years ago • 2 comments
Hi team, Camelot and Excalibur are great tools for extracting data from PDFs, but sometimes I get unstructured data. Please suggest a way to handle this type of problem. In the attachments below (pdftable, xlsxfile), the instrument type is Nestle India and the industry type is Consumer Non Durables, but the extraction puts "Durables" in an extra cell. I would appreciate any solution to overcome this problem.

Thank you so much for making this library and tool.

supperherodev avatar Feb 11 '21 07:02 supperherodev

@vinayak-mehta Guess ML tools would help with such unstructured data. Thoughts?

https://djajafer.medium.com/pdf-table-extraction-with-keras-retinanet-173a13371e89

arky avatar Apr 03 '21 17:04 arky

Yeah, right now Camelot can't group rows together when there are no lines present. Adding support for horizontal line separators on the frontend or trying out ML might be some solutions, but it might take some time before I can run those experiments. @rajshah1997 If you want to give those solutions a try, please go ahead :)

vinayak-mehta avatar Apr 04 '21 20:04 vinayak-mehta
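
Until row grouping is supported, one possible workaround for the wrapped-cell case described above is to post-process the extracted rows. The sketch below is not part of Camelot or Excalibur. It assumes that a continuation row can be recognized by an empty cell in a key column, and that all rows have the same number of cells. The function name `merge_continuation_rows`, its parameters, and the sample rows are hypothetical, made up to mirror the example in the issue. With Camelot, the rows could be taken from a table's DataFrame, for example with `table.df.values.tolist()`.

```python
def merge_continuation_rows(rows, key_col=0, sep=" "):
    """Fold rows whose key column is empty into the previous row.

    Assumption (not a Camelot feature): a wrapped cell shows up as an
    extra row whose key column is blank. All rows must have equal length.
    """
    merged = []
    for row in rows:
        cells = [str(c).strip() for c in row]
        if merged and not cells[key_col] and any(cells):
            # Continuation row: append each non-empty cell to the row above.
            prev = merged[-1]
            for i, cell in enumerate(cells):
                if cell:
                    prev[i] = (prev[i] + sep + cell).strip()
        elif any(cells):
            # Regular row; rows that are entirely blank are dropped.
            merged.append(cells)
    return merged


# Hypothetical rows shaped like the problem in the issue:
# "Durables" was extracted as a separate row below "Consumer Non".
rows = [
    ["Instrument", "Industry"],
    ["Nestle India", "Consumer Non"],
    ["", "Durables"],
]
cleaned = merge_continuation_rows(rows)
print(cleaned)
```

This heuristic only works when the key column is never legitimately empty; for tables where it can be, a different rule (for example, comparing row spacing) would be needed.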