pydifact icon indicating copy to clipboard operation
pydifact copied to clipboard

Reference to line/byte from each segment

Open nerdoc opened this issue 3 years ago • 4 comments

Btw: It would be nice if each segment had a reference to the exact input line (byte position start/end) to ease debugging. That might also help handling malformed files. (I just swapped our self-written edifact parser against pydifact and it was a pretty pleasant experience. However exact byte positions was the one feature we lost in that process.)

Originally posted by @FelixSchwarz in https://github.com/nerdocs/pydifact/issues/24#issuecomment-679190196

nerdoc avatar Aug 24 '20 17:08 nerdoc

This is not completely trivial. pydifact stores its data in lists, and does not create Token objects. This is mainly for performance reasons. Creating an own object for each Token would cause much more memory usage. But OTOH, it's not possible to store information additionally to the Token into the list, without breaking compatibility to all structures using Tokens and Segments. One approach would be, creating a separate list in parallel, where line information is stored with the matching index like in the token list. This seems a bit clumsy to me, but maybe is the most straightforward way to go.

Refactoring pydifact to use Token objects (with a linenumber/byte attribute is a no-go IMHO, as it uses to much overhead. Even the "parallel list" approach could be made "switch-off"-able - to use less memory when parsing bigger EDI files, at the cost of one additional if clause in each iter. I'll test that a bit.

nerdoc avatar Aug 24 '20 17:08 nerdoc

OMG. Big mistake. I should read my own code. Tokens ARE in fact objects, so we can store line information in them. ;-)

nerdoc avatar Aug 24 '20 17:08 nerdoc

I also like using namedtuple. Previously our own parser used something like this

Segment = namedtuple('Segment', ('data', 'line_nr', 'offset_start', 'offset_end', 'raw_data'))

You could easily add some extra methods there for extra convenience like you do now for Token.

FelixSchwarz avatar Aug 24 '20 17:08 FelixSchwarz

Maybe it would be an option to skip the step of creating Segments as nested lists, and instead directly create the Segment objects using the factory.

nerdoc avatar Aug 24 '20 19:08 nerdoc