metafacture-core
Allow USMARC character encoding in Marc21Decoder
This does not work since decode-marc21 needs UTF-8.
At the moment we can't solve this problem, so we would need an additional module that could transform strings from one encoding to another.
The idea would be something like:
stringEncodingSwitcher(in="ASCII" out="UTF-8")
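A rough Java sketch of what such a transcoding step might do internally. The class and method names are hypothetical and not part of Metafacture. Note that bytes that are pure ASCII come out unchanged when re-encoded as UTF-8, so for ASCII-only input a step like this would not alter the record at all.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper, only to illustrate the idea of an encoding switcher.
final class EncodingSwitchSketch {

    // Decode the bytes using the input charset, then encode them again
    // using the output charset.
    static byte[] transcode(final byte[] input, final Charset in, final Charset out) {
        return new String(input, in).getBytes(out);
    }

    public static void main(final String[] args) {
        final byte[] ascii = "00714cam".getBytes(StandardCharsets.US_ASCII);
        final byte[] utf8 = transcode(ascii, StandardCharsets.US_ASCII, StandardCharsets.UTF_8);
        // ASCII is a subset of UTF-8, so the byte sequence stays the same.
        System.out.println(Arrays.equals(ascii, utf8));
    }
}
```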
As @blackwinter noted, it would not help to convert the data into a new character encoding. The problem is rather that the 'characterCodingScheme' (Pos. 09) in the data is not set. The Marc21Decoder checks if this is set to 'a', but in usmarc this is empty. After removing this check from the Marc21Decoder I got the following output:
leader: status: "c" type: "a" bibliographicLevel: "m" typeOfControl: " " characterCodingScheme: " " [...]
So we could:
- allow an empty characterCodingScheme (Pos 09)
- add an option to mark it as usmarc and then check if the characterCodingScheme is empty
What shall we do?
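To make the two options concrete, here is a minimal Java sketch of a leader check. It is not the actual Marc21Decoder code: the class, method and parameter names are hypothetical, and the sample leader is made up. The first option corresponds to always passing allowEmpty=true, the second to exposing it as a setting.

```java
// Hypothetical illustration of the check on leader position 09.
final class LeaderCodingSchemeCheck {

    private static final int LEADER_LENGTH = 24;
    private static final int CODING_SCHEME_POS = 9;

    // Returns true if the record's characterCodingScheme is acceptable:
    // 'a' (UCS/Unicode) always, a blank only if allowEmpty is set.
    static boolean isAccepted(final String record, final boolean allowEmpty) {
        if (record.length() < LEADER_LENGTH) {
            throw new IllegalArgumentException("record is shorter than a MARC leader");
        }
        final char scheme = record.charAt(CODING_SCHEME_POS);
        if (scheme == 'a') {
            return true;
        }
        return allowEmpty && scheme == ' ';
    }

    public static void main(final String[] args) {
        // Made-up leader with a blank at position 09.
        final String leader = "00714cam  2200205 a 4500";
        System.out.println(isAccepted(leader, false));
        System.out.println(isAccepted(leader, true));
    }
}
```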
For the MARC-8 character encoding see also https://en.wikipedia.org/wiki/MARC-8.
I would vote to first go with the simplest character encoding: USMARC. Also, no option is needed.
WDYT?
USMARC is not a character set, it's the precursor to MARC 21:
MARC 21 is a result of the combination of the United States and Canadian MARC formats (USMARC and CAN/MARC).
We should probably investigate why Marc21Decoder only supports the character coding scheme UCS/Unicode (current module initially introduced by 3b24df5, while the UTF-8 check was already present in MarcDecoder, although optional) and what needs to be done in order to add support for MARC-8.
Just a guess: as MARC-8 does not come as an out-of-the-box library AND there was (and still is, besides USMARC (which is just usascii, no?)) no demand for it, it was ignored by implementers. If we really wanted to support it fully, we may want to predate e.g. https://github.com/xbib/marc.
USMARC (which is just usascii, no?)
No, see e.g. here if you're curious ;)
[ETA: But why would we concern ourselves with USMARC anyway? We're talking about the Marc21Decoder, aren't we?]
we may want to predate
"predate"?
e.g. https://github.com/xbib/marc
I have no idea if this would be suitable (and sufficiently compatible).
there was [...] no demand for it
Which begs the question whether there's actual demand now, after (almost exactly) 6 years? Was this request based on a concrete use case or was it just for completeness' sake?
The initial issue was about a character encoding module; the example was a USMARC case. There was no demand for USMARC other than the concrete example, which I picked up from a Catmandu test. I thought it was a general encoding problem, so I suggested a general module for character encoding.
@blackwinter hinted in the chat that it is a decode-marc21 problem, and @dr0i changed the issue to USMARC support.
For me this is not urgent.
"predate"?
Uh, I meant "depredate"
I have no idea if this would be suitable (and sufficiently compatible).
That's what I mean with "depredate", copy 'n paste code, not reusing the whole thing. But you are right, it would mean some work.
But as @TobiasNx said, it's about reusing Catmandu's tests. My impression is that it could be enough to
allow an empty characterCodingScheme (Pos 09)
and we could at least decode these records.
That would not enable handling MARC-8 character sets (completely), but it would be a low-hanging fruit to start with (and maybe enough for all time, because there may be no "real" demand).
(BTW, although this issue may be of rather academic interest, I appreciate the excursion.)
It wasn't clear to me that this issue referred to being able to run Catmandu tests. That's why it's usually beneficial to state one's intention instead of assuming what the solution should be ;)
That would not enable handling MARC-8 character sets (completely) but it would be a low hanging fruit to start with
So we would accept MARC-8 without actually supporting it? What would the outcome be? (*) Would it satisfy @TobiasNx's original goal?
(*) It's easy to test: Just modify the input to pretend it was UCS/Unicode. That could also be a generic workaround in this case: match(pattern="\\A(.{9}) ", replacement="$1a")
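For readers who want to see what that pattern does, here is a small Java sketch of the same substitution using java.util.regex. The class name, method name and sample leader are made up for illustration. The pattern only matches when position 09 is blank, so records that already carry 'a' pass through unchanged.

```java
import java.util.regex.Pattern;

// Hypothetical stand-alone version of the suggested workaround.
final class LeaderWorkaround {

    // Nine arbitrary characters at the very start of the input (leader
    // positions 00-08), followed by a blank at position 09.
    private static final Pattern EMPTY_SCHEME = Pattern.compile("\\A(.{9}) ");

    // Replaces a blank characterCodingScheme with 'a' (UCS/Unicode).
    static String pretendUnicode(final String record) {
        return EMPTY_SCHEME.matcher(record).replaceFirst("$1a");
    }

    public static void main(final String[] args) {
        // Made-up leader with a blank at position 09.
        System.out.println(pretendUnicode("00714cam  2200205 a 4500"));
        // prints "00714cam a2200205 a 4500"
    }
}
```

This only changes what the record claims about its encoding. It does not convert any MARC-8 bytes, so records containing non-ASCII characters may still decode incorrectly.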
@dr0i and @blackwinter the suggested workaround seems to work. Thanks.