invoice2data

Is there a way to parse multiple tables using "lines"

Open suprateekn opened this issue 5 years ago • 4 comments

I have 2 tables on a single page of the PDF and want to extract data from both of them. I can't use the "tables" plugin because it only gives me the first line of data.

As far as I understand, the "lines" block is treated as a dictionary, so parsing multiple tables with "lines" is not possible. Is there an alternative?

suprateekn avatar Jul 16 '20 17:07 suprateekn

I can implement this, but I need the invoice2data maintainers to help me decide on the YAML syntax for it. I have two suggestions. @m3nu: can you comment on the ideas below, please?

Use extended fields syntax

In #307 I suggested making each fields entry an associative array. That would allow very clean support for the requested feature (without breaking backward compatibility). Consider:

fields:
  foo:
    static: 'Lorem ipsum'
  items:
    plugin: lines
    settings:
      start: ...
      end: ...
      line: ...
  rates:
    plugin: lines
    settings:
      start: ...
      end: ...
      line: ...

That would require rewriting the plugins API a bit, which I can easily handle. The only problem I see is that the tables plugin wouldn't match that design, because it parses (returns) multiple fields. It means we may need two APIs for plugins (which is not a problem for me, I'm just making it clear).
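To make the per-field idea concrete, here is a minimal sketch of how such a `fields` mapping could be processed once the YAML is loaded. It is not invoice2data's actual plugin API: `parse_lines`, `extract_fields`, the sample text, and the regexes are all hypothetical, and the `start`/`end`/`line` values are filled in only for illustration.

```python
import re


def parse_lines(text, settings):
    """Hypothetical lines parser: one dict per row between start and end."""
    start = re.search(settings["start"], text)
    if not start:
        return []
    end = re.search(settings["end"], text[start.end():])
    if not end:
        return []
    body = text[start.end(): start.end() + end.start()]
    rows = []
    for raw in body.splitlines():
        match = re.search(settings["line"], raw)
        if match:
            rows.append(match.groupdict())
    return rows


def extract_fields(text, fields):
    """Hypothetical dispatcher: every field carries its own parser settings."""
    result = {}
    for name, spec in fields.items():
        if "static" in spec:
            result[name] = spec["static"]
        elif spec.get("plugin") == "lines":
            result[name] = parse_lines(text, spec["settings"])
    return result


# Made-up invoice text with two tables on one page.
SAMPLE_TEXT = """Items
Widget 2
Gadget 5
Subtotal
Rates
VAT 20
Eco 1
Total
"""

# Equivalent of the YAML above after loading, with illustrative regexes.
FIELDS = {
    "foo": {"static": "Lorem ipsum"},
    "items": {
        "plugin": "lines",
        "settings": {
            "start": r"Items",
            "end": r"Subtotal",
            "line": r"(?P<name>\w+)\s+(?P<qty>\d+)",
        },
    },
    "rates": {
        "plugin": "lines",
        "settings": {
            "start": r"Rates",
            "end": r"Total",
            "line": r"(?P<name>\w+)\s+(?P<percent>\d+)",
        },
    },
}

parsed = extract_fields(SAMPLE_TEXT, FIELDS)
```

Because each `lines` block is keyed by its own field name, the two tables end up in `parsed["items"]` and `parsed["rates"]` without colliding.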

Extend existing lines syntax

The current syntax for lines looks like this:

lines:
  start: ...
  end: ...
  line: ...

We could extend it to support the following (without breaking backward compatibility):

lines:
  - items:
      start: ...
      end: ...
      line: ...
  - rates:
      start: ...
      end: ...
      line: ...
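A rough sketch of how both shapes could be accepted side by side, assuming the YAML has already been loaded into Python objects. `normalize_lines` is a hypothetical helper, and using `lines` as the output field name for the old single-block form is an assumption made for this example.

```python
def normalize_lines(lines_config, default_field="lines"):
    """Return a list of (field_name, settings) pairs for either syntax.

    A plain mapping is the existing single-block form; a list of
    single-key mappings is the proposed named multi-block form.
    """
    if isinstance(lines_config, dict):
        # Existing syntax: one block, stored under an assumed default name.
        return [(default_field, lines_config)]
    blocks = []
    for entry in lines_config:
        if len(entry) != 1:
            raise ValueError("each lines entry must have exactly one name")
        (name, settings), = entry.items()
        blocks.append((name, settings))
    return blocks


old_style = {"start": "...", "end": "...", "line": "..."}
new_style = [
    {"items": {"start": "...", "end": "...", "line": "..."}},
    {"rates": {"start": "...", "end": "...", "line": "..."}},
]

old_blocks = normalize_lines(old_style)
new_blocks = normalize_lines(new_style)
```

The rest of the parser would then loop over the returned pairs, so existing templates keep working unchanged.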

rmilecki avatar Oct 28 '20 17:10 rmilecki

Instead of this, I was wondering whether the lines section could work like the tables section, where we can give multiple entries. That would be really helpful.

suprateekn avatar Oct 28 '20 17:10 suprateekn

> Instead of this, I was wondering whether the lines section could work like the tables section, where we can give multiple entries. That would be really helpful.

The difference between the "Extend existing lines syntax" proposal above and the tables plugin syntax is that the former names every array entry (items and rates). We can't use a pure tables-like syntax instead, because:

  1. The lines plugin returns an array that has to be assigned to some field.
  2. The tables plugin assigns to several fields, depending on the body used.

Unless I misunderstood you. If so, please provide some syntax example, so it's clear what you mean.

rmilecki avatar Oct 28 '20 22:10 rmilecki

> Use extended fields syntax

Defining this per field makes more sense to me personally and seems more scalable. So we would have options for each field: regex, static, lines plugin, etc. Maybe we treat everything as a plugin, including regex and static, so those can be improved independently.
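One way to picture "everything is a plugin" is a small registry keyed by plugin name. This is only a sketch: `FIELD_PLUGINS`, `field_plugin`, `extract`, and the `value`/`pattern` setting names are hypothetical, not part of invoice2data.

```python
import re

# Hypothetical registry mapping a plugin name to its parser function.
FIELD_PLUGINS = {}


def field_plugin(name):
    """Decorator that registers a parser under the given plugin name."""
    def register(func):
        FIELD_PLUGINS[name] = func
        return func
    return register


@field_plugin("static")
def parse_static(text, settings):
    # Ignores the document text and returns the configured value.
    return settings["value"]


@field_plugin("regex")
def parse_regex(text, settings):
    # Returns the first capture group, or None when nothing matches.
    match = re.search(settings["pattern"], text)
    return match.group(1) if match else None


def extract(text, fields):
    """Look up each field's plugin by name and let it produce the value."""
    return {
        name: FIELD_PLUGINS[spec["plugin"]](text, spec.get("settings", {}))
        for name, spec in fields.items()
    }


fields = {
    "issuer": {"plugin": "static", "settings": {"value": "ACME"}},
    "invoice_number": {
        "plugin": "regex",
        "settings": {"pattern": r"Invoice\s+#(\d+)"},
    },
}
result = extract("Invoice #1042\nTotal 10.00", fields)
```

A lines parser would register the same way, which is what lets each parser be improved independently.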

m3nu avatar Oct 30 '20 08:10 m3nu