clarin-dspace Request for replacing the reference to the "LRT Standards" PDF with a more informative reference

Current instances of CLARIN DSpace point users at the "LRT Standards" PDF in places where a centre is expected to provide the user (= the depositor) with information about the given centre's preferences regarding deposition data formats. The two places that I was able to identify are:

the FAQ, in answer to the question "What submissions do we accept?"
the pop-up where a user selects data for upload.

(This ticket was originally meant to be a PR, but I felt a bit overwhelmed when trying to search for "LRT-v6.pdf" in the source and staring at all the results -- apologies for chickening out... :-) )

I can very well understand why the "LRT Standards" document was chosen:

it is a single PDF available from clarin.eu,
it talks about standards (and some of those standards either define data formats or correspond to known formats, so the connection was easy to make)
it used to be (at the time when the CLARIN DSpace customisation was prepared) described as a CLARIN recommendation,
there was no obvious alternative at that time.

The essential bits of the above have all changed by now. Basically, all that remains is that the document is still available from clarin.eu -- but only as an item of historical interest (it was compiled 15 years ago). Many of the standards it mentions have either evolved or been retired, and, crucially, what the centre should provide here (by CTS requirements and by the definition of the relevant CLARIN KPI) is its own, specific recommendations concerning which formats it is ready to accept with nearly no effort ("recommended"), which may require some hassle and delay ("acceptable"), and which should rather get up-translated before submission ("discouraged").

There are no general top-down guidelines in CLARIN for what each centre should accept -- that always depends on the specific research profile that each centre has and on its long-term-archiving workflows. An example can be seen in what the IDS publishes in the Standards Information System -- that is rather drastically different from the content of the "LRT Standards" PDF, with space for a general description of the centre and for comments on some of the individual suggestions. That is the format which CLARIN centres are encouraged to follow (getting more/all centres on board with the SIS is actually a running subproject of the Technical Centres Committee).

I would like to ask, in my role as representative of the CLARIN Standards and Interoperability Committee, for the reference to that PDF to be removed from the source. There are two non-mutually-exclusive ways in which this could be performed:

a quick substitution for the page https://www.clarin.eu/content/standard-recommendations -- which is also a kind of roundabout page, because it encourages the user (and, indirectly, the centres) to start using the Standards Information System. So it's effectively a kludge, but at least it avoids providing obsolete or simply false information (and provides some hopefully useful information on top of that). And it's "ready to use", independent of the particular centre's profile.
comment in the file (I'll be happy to assist in formulating it) that encourages the person who is customising their instance of CLARIN DSpace to reference either their existing page listing format recommendations (some centres have such pages) or the format recommendations that the centre maintains in the Standards Information System, where the link is of the form https://standards.clarin.eu/sis/views/view-centre.xq?id= + [centre-ID].

Thanks for considering this! :-)

Oct 06 '24 18:10 bansp

I agree with Piotr's recommendation. Would it be sensible to set the customisation / configuration of a new instance to use 2), and if it doesn't exist, use 1) as fallback?

a quick substitution for the page https://www.clarin.eu/content/standard-recommendations -- which is also a kind of roundabout page, because it encourages the user (and, indirectly, the centres) to start using the Standards Information System. So it's effectively a kludge, but at least it avoids providing obsolete or simply false information (and provides some hopefully useful information on top of that). And it's "ready to use", independent of the particular centre's profile.
comment in the file (I'll be happy to assist in formulating it) that encourages the person who is customising their instance of CLARIN DSpace to reference either their existing page listing format recommendations (some centres have such pages) or the format recommendations that the centre maintains in the Standards Information System, where the link is of the form https://standards.clarin.eu/sis/views/view-centre.xq?id= + [centre-ID].
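The fallback order discussed here could be sketched roughly as follows. This is only an illustration in Python, not the actual CLARIN DSpace configuration mechanism: the function name and both parameters are hypothetical, and only the two URLs come from the ticket itself.

```python
from typing import Optional
from urllib.parse import urlencode

# Both URLs are quoted from the ticket; everything else is an assumption.
SIS_CENTRE_VIEW = "https://standards.clarin.eu/sis/views/view-centre.xq"
GENERIC_FALLBACK = "https://www.clarin.eu/content/standard-recommendations"


def format_recommendations_url(
    centre_id: Optional[str] = None,
    own_page: Optional[str] = None,
) -> str:
    """Pick the link to show depositors instead of the old PDF.

    Order of preference (hypothetical):
    1. the centre's own page listing format recommendations,
    2. the centre's recommendations in the SIS (option 2 in the ticket),
    3. the generic CLARIN page (option 1 in the ticket) as fallback.
    """
    if own_page:
        return own_page
    if centre_id:
        return f"{SIS_CENTRE_VIEW}?{urlencode({'id': centre_id})}"
    return GENERIC_FALLBACK


if __name__ == "__main__":
    # "EXAMPLE" is a placeholder, not a real centre ID.
    print(format_recommendations_url(centre_id="EXAMPLE"))
    print(format_recommendations_url())
```

An instance with no centre-specific setting would then still point at something current rather than at the obsolete PDF.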

Oct 17 '24 13:10 stranak

Just asking for a friend: Any updates on this?

Sep 03 '25 05:09 mmatthiesencsc

I know that we are very slow, but we are working on this. As you can see, it is scheduled for the milestone to be delivered this year.

Practically speaking, this splits into 2 issues:

change to the CLARIN-DSpace, scheduled above.
get our list of supported + curated formats together for the LINDAT repo and publish it in the SIS, to be ready for 1).

We are now in the process of reviewing the formats for LINDAT, but it is surprisingly tricky. I go through the various centres' recommendations, but I am never happy when I compare them with our real review process and with the cases where we ask submitters for changes. Some examples:

plain text data: when a UTF-8 file has mixed line endings it is a nightmare for processing, but it is still valid Unicode.
CSV data (we often get corpus samples, etc. this way): sometimes no header, sometimes mixed line endings, almost always extra columns and (many, many) extra lines (is it from Excel?). It is all valid CSV, but really bad for processing (how many records are in that file?).
Office Open XML (XLSX) is accepted by only 3 centres for Textual Resource, but XML (any XML?) is better for many centres?
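To make the plain-text example concrete, here is a minimal sketch of how mixed line endings could be detected during review. It is not part of LINDAT's actual workflow; the function names are made up for illustration. Counting on raw bytes is safe for UTF-8, because the bytes 0x0A and 0x0D never occur inside a multi-byte sequence.

```python
def count_line_endings(data: bytes) -> dict:
    """Count CRLF, bare LF and bare CR line endings in raw file content."""
    crlf = data.count(b"\r\n")
    counts = {
        "CRLF": crlf,
        "LF": data.count(b"\n") - crlf,  # LF not preceded by CR
        "CR": data.count(b"\r") - crlf,  # CR not followed by LF
    }
    # Report only the conventions that actually occur.
    return {name: n for name, n in counts.items() if n}


def has_mixed_line_endings(data: bytes) -> bool:
    """True if the content uses more than one line-ending convention."""
    return len(count_line_endings(data)) > 1


if __name__ == "__main__":
    sample = b"first line\r\nsecond line\nthird line\n"
    print(count_line_endings(sample))
    print(has_mixed_line_endings(sample))
```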

I am just giving these examples to say that we are working on this, but it is surprisingly hard to give a sensible list of the formats we are happy with and when we are not happy.

If anybody is willing to help us with this and review with us the list (and contents of our repo) we would be very happy. But either way, it will be done this year, I promise!

Sep 09 '25 19:09 stranak

I would start with something. Plain text UTF-8 at least makes people with Latin-15 think (if that is still a thing). I share your feeling about "XML": it is vague, but most of the time indeed "acceptable". It is a fact of life that a resource in a "discouraged" format will still be taken if it is otherwise valuable. It might just take longer to process. And if the depositor has the data in multiple formats (e.g. mp3 and wav), the list encourages them to send wav. And if we get mp3 we will convert it, if at all possible, or make an exception.
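The way such a list nudges a depositor who holds the same data in several formats can be sketched as below. The three levels are the ones named in this thread; the table assigning wav and mp3 to levels is an assumption made up for the example, not any centre's published recommendation.

```python
from enum import IntEnum
from typing import Iterable, Mapping, Optional


class Level(IntEnum):
    DISCOURAGED = 0
    ACCEPTABLE = 1
    RECOMMENDED = 2


# Hypothetical table for illustration only.
EXAMPLE_LEVELS = {"wav": Level.RECOMMENDED, "mp3": Level.DISCOURAGED}


def preferred_format(
    available: Iterable[str],
    levels: Mapping[str, Level] = EXAMPLE_LEVELS,
) -> Optional[str]:
    """Return the best-rated format among those the depositor has.

    Formats missing from the table are ignored; None means the list
    gives no guidance for any of the available formats.
    """
    known = [fmt for fmt in available if fmt in levels]
    if not known:
        return None
    return max(known, key=lambda fmt: levels[fmt])


if __name__ == "__main__":
    print(preferred_format(["mp3", "wav"]))
```

Note that a discouraged format is still returned when it is the only one available, which matches the point that such a resource will still be taken if it is otherwise valuable.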

Sep 10 '25 06:09 mmatthiesencsc

Hi Pavel @stranak and thanks for keeping this in your sights. And apologies for coming back to you with a delay.

Above, I see three issues that are of a different nature, so I'll go over them one by one (not necessarily all at once but let's see):

a kinda moral issue that may probably be redressed as an efficiency issue: should we not stand for what we're preaching
implementation issue of granularity, with a dash of ontology: at which point do we make a cut between an object and its properties
a largely social (mixed with psychological) issue of cutting corners: let's do the easy bit formally because, after all, we have space to react and say "well, yes, but..."

Re: 1.

get our list of supported + curated formats together for the LINDAT repo and publish it in the SIS, to be ready for 1).

This is a very sensible and responsible stance. I have attempted it myself ("Shouldn't my centre be a shining example of implementation? It's embarrassing if I ask others to get their centres to do what my centre somehow keeps delaying."). Plus efficiency: "what if there are hidden hurdles in the process and wouldn't it be best to 'test the cure on myself first'?" I fully agree with this kind of approach. I also think that it has a tendency to fail in cases of man vs. inert group (substitute "traditional academic or research institution" for "inert group" just as well).

The strategy that I think has a greater chance of working in many cases is one presenting a fait accompli, maybe not so much burnt bridges as simply (positive) facts on the ground: we've made the change, so let's now deal with it ourselves as well. I think that is a way of shifting the spotlight from that lonely man and that man's personal embarrassment to the institution (group): I've done something good that was needed and is appreciated, so it's time we embraced it as well.

(Because, to be sure, the requested change is beneficial to many centres not only because the SIS is a sensible approach [and we can argue about how sensible, in issues 2 and 3 above], but also because that old PDF is at best useless, and otherwise potentially harmful. And also, at this very point (as opposed to 15+ years ago when the CLARIN world was new) presenting it to new users is an embarrassment.)

To wrap up: I suggest that the issue of the submission of LINDAT's recommendations be divorced from the issue of introducing a warning and a link to the SIS in the template / help system in v7.

(To be sure: I'd love to see LINDAT in the SIS. But the requested change is bigger than that.)

Tbc...

Oct 25 '25 20:10 bansp

Re: 2. These are great real-life cases for inclusion into the SIS documentation. Thank you! (And it's not the first bonus that the SIS gets from you, to be sure. Extended centre descriptions appeared after our conversation at one of the CACs.)

plain text data: when a UTF-8 file has mixed line endings it is a nightmare for processing, but it is still valid Unicode.

CSV data (we often get corpus samples, etc. this way): sometimes no header, sometimes mixed line endings, almost always extra columns and (many, many) extra lines (is it from Excel?). It is all valid CSV, but really bad for processing (how many records are in that file?).

This goes mainly under the label "granularity" and is not fully settled, in the sense that there is still, for example, an open-ish issue of how to treat program code (is it just code, mostly plain text, sometimes XML, and the tools that consume it are just 'flavours', or are we talking about R vs. Python, next to XSLT and PS). As for plain text, I have just closed one issue (clarin-eric/standards/issues/46), because I am not sure if it's OK to perform the positive act of creating something only in order to shake its severed head before the crowd and say it's a villain.

In the two cases above it's a tad simpler, I think. I would say we're looking at a single object here, and want to serialise that object as a single file, but make it possible to specify its properties, both positively and negatively. My instinct would be, in the first case, to either ask for unified line endings or warn against mixed line endings that can prolong the deposition process. But I can imagine that a centre bombarded with some systematically malfunctioning formats wants to state explicitly that "{plain text}(object) with {mixed line endings}(property) is discouraged". The property would be stated in the comment. Maybe the comment (or the body of the format description) could send the user to either a Tidy-like utility or to a diagnostic tool, etc.
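As an illustration of what a "Tidy-like utility" for the plain-text case might do, here is a minimal sketch that unifies line endings to LF. The function name is hypothetical, and the sketch assumes UTF-8 (or another ASCII-compatible encoding), where CR and LF bytes cannot be part of another character.

```python
def normalise_line_endings(data: bytes) -> bytes:
    """Rewrite CRLF and bare CR line endings as LF.

    CRLF must be replaced first, so that its CR is not turned
    into a second line break.
    """
    return data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")


if __name__ == "__main__":
    print(normalise_line_endings(b"one\r\ntwo\rthree\n"))
```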

The CSV case is analogous. Both seem very good candidates for adding warnings to the body of the format description. (Will those warnings work? That's a question that can be asked of any piece of documentation. For some they will, for some they won't, and it's the latter that is going to bite our donkey, guaranteed by Murphy's Laws.)
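A diagnostic tool for the CSV case might report roughly the following. This is a sketch only: the function name and the report keys are made up, and it deliberately does not try to guess whether a header row is present.

```python
import csv
import io
from collections import Counter


def csv_report(text: str) -> dict:
    """Summarise the structural oddities of a CSV document.

    A row counts as blank when it is empty or holds only empty cells
    (the trailing ",,," lines that spreadsheet exports tend to leave).
    The expected column count is the most common width among the
    remaining rows.
    """
    rows = list(csv.reader(io.StringIO(text, newline="")))
    non_blank = [row for row in rows if any(cell.strip() for cell in row)]
    widths = Counter(len(row) for row in non_blank)
    expected = widths.most_common(1)[0][0] if widths else 0
    return {
        "rows_total": len(rows),
        "rows_blank": len(rows) - len(non_blank),
        "expected_columns": expected,
        "rows_with_other_width": sum(
            1 for row in non_blank if len(row) != expected
        ),
    }


if __name__ == "__main__":
    print(csv_report("a,b\n1,2\n3,4,5\n,,\n\n"))
```

Such a report at least gives a defensible answer to "how many records are in that file?", even if it cannot decide whether the extra columns are intentional.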

Seemingly, the next case, "plain" XML, is analogous, but actually I'd say that's just a superficial similarity.

Oct 25 '25 21:10 bansp

Re: 3

Office Open XML (XLSX) is accepted by only 3 centres for Textual Resource, but XML (any XML?) is better for many centres?

The XML case is ontologically simple, in the sense that we fully agree that unqualified "XML" should not be used for recommendations. There are several warnings or statements to this effect across the SIS and its documentation. The description file for "just XML" was actually created for two reasons: (minor) to have a handy umbrella for a lot of formats and (major) to provide a warning that it doesn't make sense to use it... It is similar with unqualified "TEI" (that is not obvious to many) and, e.g., unqualified "CoNLL".

I had an idea (which I hope is recorded somewhere in the SIS) that we should have a way of marking some formats as umbrella formats, and then

perhaps indicating that in the recommendations if they appear without comments
excluding them (probably unconditionally) from the list of "most popular formats"

I think that both are worth implementing. In case (1), I could also write a bit of Schematron to at least warn the encoder. But it wouldn't really solve the matter fully, because I have already seen comments that are a bit suboptimal, and I could imagine that we may simply begin to see more 'clever' comments, to get around that. In the end, this is down to someone's attitude ("just get this over with" or "we can always reject the deposition anyway") or maybe sometimes lack of expertise (not realising that schema-less, obfuscated XML with mixed content and no information about which whitespace is significant, and which pieces of object language inside attributes count, is simply useless for preservation and/or interoperability).
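The two ideas can be sketched as follows, in Python rather than Schematron. None of this reflects the actual SIS data model: the Recommendation record, its field names and the format identifiers in the example are assumptions made for illustration.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List, Sequence, Set, Tuple


@dataclass(frozen=True)
class Recommendation:
    """Hypothetical record of one centre's stance on one format."""
    centre: str
    format_id: str
    level: str          # "recommended", "acceptable" or "discouraged"
    comment: str = ""


def uncommented_umbrella(
    recs: Sequence[Recommendation], umbrella_ids: Set[str]
) -> List[Recommendation]:
    """Idea (1): umbrella formats used without any qualifying comment."""
    return [
        rec for rec in recs
        if rec.format_id in umbrella_ids and not rec.comment.strip()
    ]


def most_popular(
    recs: Sequence[Recommendation], umbrella_ids: Set[str], top: int = 3
) -> List[Tuple[str, int]]:
    """Idea (2): popularity ranking that leaves umbrella formats out."""
    counts = Counter(
        rec.format_id for rec in recs if rec.format_id not in umbrella_ids
    )
    return counts.most_common(top)


if __name__ == "__main__":
    umbrella = {"XML", "TEI", "CoNLL"}  # the unqualified cases named above
    recs = [
        Recommendation("centre-A", "XML", "acceptable"),
        Recommendation("centre-B", "XML", "acceptable", "only with a schema"),
        Recommendation("centre-A", "format-X", "recommended"),
        Recommendation("centre-B", "format-X", "recommended"),
    ]
    for rec in uncommented_umbrella(recs, umbrella):
        print(f"warning: {rec.centre} lists {rec.format_id} without a comment")
    print(most_popular(recs, umbrella))
```

As noted above, a check like this only flags the empty comment; it cannot tell a helpful comment from a 'clever' one written to silence the warning.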

So I'd rather count on some kind of peer pressure here, whether coming from users or from other centres. I think this is, realistically, all that should be counted on, with the help of little hints such as Schematron warnings (but not everyone bothers to use schemas), existing hints in the documentation, maaaybe some gentle indication of the inappropriate use of comment-less umbrella format in the list of recommendations displayed on the page of an individual centre.

It seems to me that the fact that SIS recommendations can be misused need not prevent a centre from using the SIS. The system has become quite expressive, to the point that there are so many options to produce something completely inconsistent that it would be unrealistic to try to "save" the encoder by implementing hard constraints, which could bite back on many innocent occasions. We have the open source / open access methodology (many eyes, many brains, many mouths), and it seems to me that the way to improve things should lead through gentle pressure rather than any sort of programmatic cuffs. So I would like to encourage LINDAT (and any other centre whose representatives may read this) to go ahead and submit. I'd say (and agree with Martin) that it is better to have something in the SIS and fine-tune when needed, than not to have anything at all. Plus, in a distributed, decentralised network, it's probably always good to minimise entropy in the backbone, so let's try :-)

And again thanks for sharing, Pavel and Martin. I will now look for that issue targeting comment-less umbrella formats and prioritise it. Cheers!

[edit, 3 days later: umbrella formats are called "hub formats" by the SIS; the boolean attribute @hub is actually implemented, but not yet visualised; new issues got added so that the entire thing gets finished]

[edit, X days later: we've found a cute umbrella symbol and renamed everything to use "umbrella", after all; it's not bad, overall]

Oct 26 '25 00:10 bansp