edifact icon indicating copy to clipboard operation
edifact copied to clipboard

Issue parsing linebreaks in segments

Open johanib opened this issue 1 year ago • 1 comments

Hi,

I'm not an edifact expert, so I'm currently struggling with some issues related to parsing linebreaks.

Afaik, edifact uses the ' character as a line ending (segment end). So whether or not the edi file uses linebreaks should not matter to parsing the file.

However, this change, specifically the linked line is giving me trouble:

https://github.com/php-edifact/edifact/commit/7cea68c6#diff-1e623c61c2badd421334ac19660878a0a400c822feebe3c2984769b8f0fda401R483

$string = (string) \preg_replace(
            '/(([^' . $this->symbRel . ']' . $this->symbRel . '{2})+|[^' . $this->symbRel . '])' . $this->symbEnd . '|[\r\n]+/',
            '$1' . $this->stringSafe,
            $string
        );

That change causes the second line to not be parsed as part of the FTX segment:

FTX+AAI+++THE SHIPPER SHALL NOT BE RESPONSIBLE FOR ANY COSTS/DELAYS OCCUR
:DUE TO INTERVENTION OF CUSTOMS.'

Reverting the change by changing:

$terminatorRegex = '/(([^'.$this->symbRel.']'.$this->symbRel.'{2})+|[^'.$this->symbRel.'])'.$this->symbEnd.'|[\r\n]+/';

to

$terminatorRegex = '/(([^'.$this->symbRel.']'.$this->symbRel.'{2})+|[^'.$this->symbRel.'])'.$this->symbEnd.'/';

Fixes my testReadsLinebreakedSegments test, but of course breaks \EDITest\AnalyserTest::testProcess.


Input edi:

UNB+UNOB:2+CARRIER+RECEIVER-ID+999818:999+251'
UNH+0001+IFTMBC:D:00B:UN'
BGM+770+AAA99970929+9'
TSR+30+2:::2'
FTX+AAI+++THE SHIPPER SHALL NOT BE RESPONSIBLE FOR ANY COSTS/DELAYS OCCUR
:DUE TO INTERVENTION OF CUSTOMS.'
UNT+10+0001'
UNZ+1+251'

My test:

    public function testReadsLinebreakedSegments()
    {
        $p = new Parser();
        $p->load(__DIR__ . '/../files/example_linebreak.edi');
        $r = new Reader($p);

        $line = $r->readEdiDataValue(['FTX', [1 => 'AAI']], 4);

        self::assertSame(
            [
                'THE SHIPPER SHALL NOT BE RESPONSIBLE FOR ANY COSTS/DELAYS OCCUR',
                'DUE TO INTERVENTION OF CUSTOMS.',
            ],
            $line
        );
    }

Currently fails with:

4) EDITest\ReaderTest::testReadsLinebreakedSegments
Failed asserting that 'THE SHIPPER SHALL NOT BE RESPONSIBLE FOR ANY COSTS/DELAYS OCCUR' is identical to Array &0 (
    0 => 'THE SHIPPER SHALL NOT BE RESPONSIBLE FOR ANY COSTS/DELAYS OCCUR'
    1 => 'DUE TO INTERVENTION OF CUSTOMS.'
).

Can anyone help me with this? Is my example input within parse-able spec? If so, what would be needed to handle this scenario without introducing regression?

For reference: https://github.com/johanib/edifact/pull/2/files

johanib avatar May 13 '24 07:05 johanib

unwrap function in Reader removes CR+LF+0x00, but unwrap in Parser doesn't:

$string = \preg_replace('#[\x00\r\n]#', '', $string);

Is that how it should be?

https://github.com/php-edifact/edifact/blob/master/src/EDI/Reader.php#L557

https://github.com/php-edifact/edifact/blob/master/src/EDI/Parser.php#L467

iGrog avatar May 13 '24 19:05 iGrog

Hi, your input EDI has a line-break inside the segment, it shouldn't be valid as the linebreak character isn't in the UNOB charset... I modified the Parser to add a feature to detect an error where there's no segment terminator in files that are formatted with one segment per line. If you pass ->setStrict(true) to the Parser instance before running it on the edi it reverts to the previous behaviour... @johanib @iGrog I believe the unwrap function in Reader is a version of more than 5 years ago of the same function in Parser :-D Reader and Analyser need some refactoring probably....

sabas avatar May 14 '24 08:05 sabas

@sabas Thanks! I'm not sure how I missed ->setStrict(true), but it solves my linebreak issue :sweat_smile:

One more thing I don't understand, given these EDI segments:

FTX+AAI+++PLS ENSURE TO TAKE OUR APPROVAL PRIOR STUFFING ANY NON HAZ CHEMICA:LS'
FTX+ABV+++THIS BOOKING CONFIRMATION IS SUBJECT TO SEALING'
FTX+AAI+++THE SHIPPER SHALL NOT BE RESPONSIBLE FOR ANY COSTS/DELAYS OCCUR

I really can't figure out how to extract the FTX+AAI segments... Do you know if this library is capable of reading these segments? It seems it always returns with the error 'Segment is ambiguous'.

I wrote some hacky solution to accomplish this and a test to illustrate what I expect as output: https://github.com/johanib/edifact/pull/1/files

johanib avatar May 16 '24 12:05 johanib

What about returning you an array of results instead of null? Would it make sense? Are you using Reader to extract specific information?

sabas avatar May 16 '24 12:05 sabas

For my purposes I do not need to extract the individual FTX+AAI segments. But afaik, the readEdiDataValue function is build on the idea that the caller specifies what it wants to read, so I added the $offset parameter.

Returning an array would work for my case.

There are also some checks at the bottom of the readEdiDataValue function. They would need to be looped if the function would start to return arrays. So I'm not sure if that will make it better.

I can probably improve my initial suggestion to make it more nice if you think it's a good addition.

johanib avatar May 16 '24 13:05 johanib

Of course, feel free to send a PR :)

sabas avatar May 16 '24 13:05 sabas

@sabas I created this pr: https://github.com/php-edifact/edifact/pull/135

I did some small refactoring to reduce the amount of nested if/else constructions. Please let me know if you have any comments.

johanib avatar May 21 '24 07:05 johanib

:confetti_ball: Just a question: What is the release process like? I see the latest tag / composer package is from October last year. Can we release this version? :)

johanib avatar May 21 '24 08:05 johanib

The processs doesn't exist actually, in my projects I use directly the master branch 🤣 I usually do it after someone reminds me to do it... I'll tag 1.1 in a few hours!

sabas avatar May 21 '24 08:05 sabas