Make encode-marc21 more forgiving?
Came up in https://github.com/metafacture/metafacture-core/issues/527 :
If we parse (presumably) malformed binary MARC, the encoding fails.
The first broken record seems to be 02589nas a2200601 c 4500 in https://raw.githubusercontent.com/gbv/Catmandu-Tutorial/master/data/marc.mrc , but this should be double-checked with a MARC validator other than MF.
The encoding breaks because the binary directory entry for field 787 points to Iso646Constants.INFORMATION_SEPARATOR_2 = 0x1e.
If encoding breaks, it is not just the field or the record that is dropped but the whole stream. Dropping the record and, more importantly, the whole stream can be avoided by piping decode-marc21 to catch-stream-exception before piping to encode-marc21.
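To illustrate the idea of isolating a failure to one record instead of losing the whole stream, here is a minimal Python sketch. It is not Metafacture code and not how catch-stream-exception is implemented; encode_all and strict_encode are hypothetical names, and the "strict encoder" is only a stand-in that rejects records without a record terminator.

```python
from typing import Callable, Iterable, Iterator, List, Tuple

RECORD_TERMINATOR = b"\x1d"


def strict_encode(record: bytes) -> bytes:
    """Hypothetical stand-in for a strict encoder: reject unterminated records."""
    if not record.endswith(RECORD_TERMINATOR):
        raise ValueError("record does not end with the record terminator 0x1d")
    return record


def encode_all(records: Iterable[bytes],
               encode: Callable[[bytes], bytes],
               errors: List[Tuple[int, Exception]]) -> Iterator[bytes]:
    """Yield encoded records; collect failures instead of aborting the stream."""
    for index, record in enumerate(records):
        try:
            yield encode(record)
        except ValueError as exc:
            # Only this record is lost; the following records are still processed.
            errors.append((index, exc))


if __name__ == "__main__":
    failures: List[Tuple[int, Exception]] = []
    stream = [b"first\x1d", b"broken", b"third\x1d"]
    encoded = list(encode_all(stream, strict_encode, failures))
    print(len(encoded), "records encoded,", len(failures), "skipped")
```

The caller decides what to do with the collected failures (log them, write the raw records to a side file, and so on).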
a) if the record is indeed invalid:
aa) shall we make encode-marc21 more forgiving?
ab) or is it enough to bail out (as it does at the moment) and expect the user to use catch-stream-exception or to fix the invalid MARC?
b) if the record is valid: fix encode-marc21
If I am not mistaken, MF in general has a "make or break" approach to transforming things; in particular, the encode-marc21 module has an integrated validator that is quite strict. I would assume that this is okay. But it would be good if the error message were more explanatory and hinted at the actual error.
I separated the broken records from the valid ones.
e.g.
"6500\x1e":
"7": ""
"0": "(DE-588)4057379-5"
"0": "(DE-101)040573796"
a: "Steroide"
"2": "gnd"
"650d\x1e":
"7": ""
"0": "(DE-588)4039983-7"
"0": "(DE-101)040399834"
a: "Molekularbiologie"
"2": "gnd"
"650d\x1e":
"7": ""
"0": "(DE-588)4067488-5"
"0": "(DE-101)040674886"
a: "Zeitschrift"
"2": "gnd"
"650d\x1e":
"7": ""
"0": "(DE-588)4057379-5"
"0": "(DE-101)040573796"
a: "Steroide"
"2": "gnd"
"650d\x1e":
"7": ""
"0": "(DE-588)4006777-4"
"0": "(DE-101)040067777"
a: "Biochemie"
"2": "gnd"
See the playground example. You can spot the broken indicators in the YAML result.
I also checked the broken records with yaz-marcdump:
$ yaz-marcdump -np '/home/tobias/Downloads/broken.mrc'
<!-- Record 1 offset 0 (0x0) -->
No separator at end of field length=75
No separator at end of field length=19
No separator at end of field length=24
No separator at end of field length=88
<!-- Skipping bad byte 10 (0x0A) at offset 882 (0x372) -->
<!-- Record 2 offset 883 (0x373) -->
No separator at end of field length=124
No separator at end of field length=89
No separator at end of field length=21
No separator at end of field length=30
No separator at end of field length=88
Separator but not at end of field length=22
<!-- Skipping bad byte 10 (0x0A) at offset 1805 (0x70d) -->
<!-- Record 3 offset 1806 (0x70e) -->
No separator at end of field length=121
No separator at end of field length=17
No separator at end of field length=40
No separator at end of field length=14
No separator at end of field length=119
Separator but not at end of field length=91
<!-- Skipping bad byte 10 (0x0A) at offset 2733 (0xaad) -->
<!-- Record 4 offset 2734 (0xaae) -->
No separator at end of field length=117
No separator at end of field length=110
<!-- Skipping bad byte 10 (0x0A) at offset 3848 (0xf08) -->
<!-- Record 5 offset 3849 (0xf09) -->
No separator at end of field length=104
No separator at end of field length=80
<!-- Skipping bad byte 10 (0x0A) at offset 4975 (0x136f) -->
<!-- Record 6 offset 4976 (0x1370) -->
No separator at end of field length=176
No separator at end of field length=17
No separator at end of field length=16
No separator at end of field length=21
No separator at end of field length=27
No separator at end of field length=115
Separator but not at end of field length=45
Separator but not at end of field length=64
Separator but not at end of field length=45
<!-- Skipping bad byte 10 (0x0A) at offset 6185 (0x1829) -->
<!-- Record 7 offset 6186 (0x182a) -->
No separator at end of field length=132
No separator at end of field length=40
No separator at end of field length=12
No separator at end of field length=28
No separator at end of field length=245
No separator at end of field length=57
No separator at end of field length=59
No separator at end of field length=55
No separator at end of field length=57
No separator at end of field length=19
No separator at end of field length=97
No separator at end of field length=91
Separator but not at end of field length=96
<!-- Skipping bad byte 10 (0x0A) at offset 7805 (0x1e7d) -->
<!-- Record 8 offset 7806 (0x1e7e) -->
No separator at end of field length=65
No separator at end of field length=32
No separator at end of field length=19
No separator at end of field length=24
No separator at end of field length=43
No separator at end of field length=14
No separator at end of field length=57
No separator at end of field length=59
No separator at end of field length=57
No separator at end of field length=59
No separator at end of field length=55
No separator at end of field length=57
No separator at end of field length=19
No separator at end of field length=55
No separator at end of field length=57
No separator at end of field length=19
No separator at end of field length=109
No separator at end of field length=95
<!-- Skipping bad byte 10 (0x0A) at offset 9536 (0x2540) -->
<!-- Record 9 offset 9537 (0x2541) -->
No separator at end of field length=66
No separator at end of field length=31
No separator at end of field length=58
No separator at end of field length=23
No separator at end of field length=16
No separator at end of field length=56
No separator at end of field length=65
No separator at end of field length=59
No separator at end of field length=56
No separator at end of field length=57
No separator at end of field length=59
No separator at end of field length=54
No separator at end of field length=63
No separator at end of field length=57
No separator at end of field length=19
No separator at end of field length=54
No separator at end of field length=55
No separator at end of field length=57
No separator at end of field length=19
No separator at end of field length=118
Separator but not at end of field length=88
Separator but not at end of field length=206
Longer report with $ yaz-marcdump -npv '/home/tobias/Downloads/broken.mrc' here: https://gist.github.com/TobiasNx/9711cc680acdeb55ebb1b69700cb2477
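The kind of check behind messages like "No separator at end of field" can be sketched in a few lines of Python. This is only an illustration under assumptions: it is not yaz-marcdump's actual logic, the function names are hypothetical, and the message wording just loosely mirrors the output above. It relies on the ISO 2709 layout: a 24-byte leader with the base address at positions 12-16, 12-byte directory entries (tag 3, field length 4, start position 5), and 0x1e as field terminator.

```python
FIELD_TERMINATOR = 0x1E
LEADER_LENGTH = 24
ENTRY_LENGTH = 12  # tag (3) + field length (4) + start position (5)


def check_field_terminators(record: bytes):
    """Return (tag, declared_length, problem) for fields the directory gets wrong."""
    base = int(record[12:17].decode("ascii"))
    problems = []
    pos = LEADER_LENGTH
    while pos + ENTRY_LENGTH <= base and record[pos] != FIELD_TERMINATOR:
        tag = record[pos:pos + 3].decode("ascii", errors="replace")
        length = int(record[pos + 3:pos + 7].decode("ascii"))
        start = int(record[pos + 7:pos + 12].decode("ascii"))
        field = record[base + start:base + start + length]
        if not field or field[-1] != FIELD_TERMINATOR:
            problems.append((tag, length, "no separator at end of field"))
        elif FIELD_TERMINATOR in field[:-1]:
            problems.append((tag, length, "separator but not at end of field"))
        pos += ENTRY_LENGTH
    return problems


def build_record(fields, declared_lengths=None):
    """Assemble a minimal record for testing; declared_lengths can falsify the directory."""
    directory = b""
    body = b""
    for i, (tag, data) in enumerate(fields):
        length = declared_lengths[i] if declared_lengths else len(data)
        directory += f"{tag}{length:04d}{len(body):05d}".encode("ascii")
        body += data
    directory += b"\x1e"
    base = LEADER_LENGTH + len(directory)
    total = base + len(body) + 1  # +1 for the record terminator
    leader = f"{total:05d}nas a22{base:05d} c 4500".encode("ascii")
    return leader + directory + body + b"\x1d"


if __name__ == "__main__":
    fields = [("001", b"12345\x1e"), ("245", b"10\x1faTitle\x1e")]
    print(check_field_terminators(build_record(fields)))
    print(check_field_terminators(build_record(fields, declared_lengths=[6, 8])))
```

A record whose directory lengths do not match the real field boundaries produces one entry per affected field, which is roughly the picture the yaz-marcdump report gives for broken.mrc.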
The separators in these examples seem to be broken. Let me see how Catmandu is handling it.
I also tested the broken records with Catmandu. It seems that its MARC decoder, and not the encoder, handles the incoming data differently. It does not just skip the broken separators but drops the broken elements as a whole. It also replaces the broken indicators with whitespace.
MF result when transforming MARC into MARCXML; have a look at the indicators and the first subfield:
<marc:datafield tag="775" ind1="0" ind2="">
<marc:subfield code="8"></marc:subfield>
<marc:subfield code="i">Online-Ausg.</marc:subfield>
<marc:subfield code="t">�The� journal of steroid biochemistry and molecular biology</marc:subfield>
<marc:subfield code="w">(DE-600)1482780-3</marc:subfield>
<marc:subfield code="w">(DE-101)019756801</marc:subfield>
</marc:datafield>
<marc:datafield tag="780" ind1="8" ind2="0">
<marc:subfield code="">00</marc:subfield>
<marc:subfield code="i">Vorg.:</marc:subfield>
<marc:subfield code="t">�The� journal of steroid biochemistry</marc:subfield>
<marc:subfield code="w">(DE-600)80169-0</marc:subfield>
<marc:subfield code="w">(DE-101)010545514</marc:subfield>
</marc:datafield>
Catmandu result when transforming MARC into MARCXML with: $ catmandu convert MARC to MARC --type XML < '/home/tobias/Downloads/broken.mrc' > broken.xml . Here the broken first indicator and the broken first subfield do not exist.
<marc:datafield tag="775" ind1=" " ind2=" ">
<marc:subfield code="i">Online-Ausg.</marc:subfield>
<marc:subfield code="t">�The� journal of steroid biochemistry and molecular biology</marc:subfield>
<marc:subfield code="w">(DE-600)1482780-3</marc:subfield>
<marc:subfield code="w">(DE-101)019756</marc:subfield>
</marc:datafield>
<marc:datafield tag="780" ind1=" " ind2=" ">
<marc:subfield code="i">Vorg.:</marc:subfield>
<marc:subfield code="t">�The� journal of steroid biochemistry</marc:subfield>
<marc:subfield code="w">(DE-600)80169-0</marc:subfield>
<marc:subfield code="w">(DE-101)0105</marc:subfield>
</marc:datafield>
I would be in favour of adjusting the behaviour of the decoder, as an option, so that it does not create broken values from a broken separator. Perhaps the Catmandu MARC decoder, even though it handles MARC very differently, could hint at a solution:
https://metacpan.org/release/HOCHSTEN/Catmandu-MARC-1.32/source/lib/Catmandu/Importer/MARC/Decoder.pm#PCatmandu::Importer::MARC::Decoder
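What such an optional lenient mode could do for a single data field is sketched below in Python. This is an assumption-laden illustration, not a port of the Catmandu decoder and not existing decode-marc21 behaviour: the function names are hypothetical, only indicators that are not a lowercase ASCII letter, digit, or blank are replaced by a blank, and subfields without a usable code or without a value are dropped.

```python
SUBFIELD_DELIMITER = "\x1f"
FIELD_TERMINATOR = "\x1e"


def clean_indicator(ch: str) -> str:
    """Keep a lowercase ASCII letter, a digit, or a blank; otherwise use a blank."""
    if ch == " " or (ch.isascii() and (ch.isdigit() or ch.islower())):
        return ch
    return " "


def decode_field_leniently(tag: str, raw: str):
    """Return (tag, ind1, ind2, [(code, value), ...]) without broken parts."""
    raw = raw.rstrip(FIELD_TERMINATOR)
    head, *chunks = raw.split(SUBFIELD_DELIMITER)
    padded = (head + "  ")[:2]  # tolerate missing indicator positions
    ind1, ind2 = clean_indicator(padded[0]), clean_indicator(padded[1])
    subfields = []
    for chunk in chunks:
        if not chunk:
            continue  # delimiter without code
        code, value = chunk[0], chunk[1:]
        if not (code.isascii() and code.isalnum()) or not value:
            continue  # control character as code, or empty value: drop it
        subfields.append((code, value))
    return tag, ind1, ind2, subfields


if __name__ == "__main__":
    raw = "0\x1e\x1f8\x1fiOnline-Ausg.\x1fw(DE-600)1482780-3\x1e"
    print(decode_field_leniently("775", raw))
```

Whether a broken indicator should become a blank or the whole field should be dropped is a design decision; the sketch only shows that the clean-up can happen in the decoder before any value reaches the encoder.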
I try to follow. But the playground example in https://github.com/metafacture/metafacture-core/issues/528#issuecomment-2326018580 results in "Request-URI Too Long".
Thanks for the hint. MF Playground does not complain anymore if the URL is too long. Should open a ticket there.
I fixed the example and added some more info to my comments: #528 (comment)
As I revised my comments, @dr0i, in short: we should not change the behaviour of encode-marc21 but of decode-marc21, so that the decoder optionally does not create broken values from invalid separators, similar to how Catmandu handles them.