fd-dictionaries refreshing the schemas: freeze the p5subset, add it to our vc, update the syntax in the ODD

I would like to update the existing ODD in two steps. This ticket covers the first and gentler of them: a rewrite of the current ODD in the current TEI idiom. Ideally, that would be a purely cosmetic change that does not affect the extension (i.e., the patterns/grammars defined by the RNG, XSD, and DTD). In practice, the extension is going to be affected due to the changes in the TEI that have happened over the years, so some tinkering may be in order, along with a lot of test runs across all the databases.

In doing that, I would like to add two files to our version control. They are for strictly internal purposes: so that we can trace the changes in the TEI internals without having to investigate the git history of the TEI itself each time.

Let me sketch some background:

- The TEI ODD mechanism is in essence a customization / documentation mechanism that targets the set of all the definitions encoded by the TEI Guidelines.
- That set is not present in a cloned TEI repository, but rather gets derived by the make system (via the TEI Stylesheets, a set of tools that accompanies the TEI Guidelines) and resides in a cryptically named document called p5subset. It is called an 'integrated ODD'.
- Any typical ODD document created with the appropriate TEI tools is meant to tailor the integrated ODD down to a particular purpose: manuscript description, corpus encoding, dictionary encoding, etc.
- The application of the Freedict ODD to the integrated ODD (p5subset) silently creates something that can be called the Freedict integrated ODD; it is not visible to outside eyes, because it is regenerated each time the Freedict ODD is processed by the TEI Stylesheets.
- The 'Freedict integrated ODD' is used (or rather: was used) to derive the schema documents: RNG (of primary use for us), but also XSD and DTD (which we provide more or less out of courtesy -- but I can imagine us not providing these two, to avoid having to address the potential issues if someone decides to use those instead of the RNG).
- I stress the "was used" because, simplifying the history slightly, that happened once, years ago: I ran the TEI tools on the current Freedict ODD and created the three schema documents. Note the crucial issue: they were run on the p5subset as it was defined by the TEI years ago. So while the Freedict ODD hasn't been modified since then, the result of applying it to the current p5subset is going to be extensionally different from what was used years ago. I don't think it's a major issue (because we only use a very small subset of the TEI), but it's definitely something to be aware of.
- One more relevant issue, and an argument for 'freezing' the p5subset in our version control, is that, if one doesn't have full control of the TEI environment, their ODDs may reference the current 'blessed' TEI ODD, recreated after each release in the TEI Vault, or the current snapshot of the TEI under control of their Jenkins environment, or the local p5subset on the user's hard drive. What I propose reduces this potential complexity and adds a lot of transparency.

A hopefully minor complication is that our RNG has been edited by hand since it was derived. Since it is version-controlled, I can extract the modifications and reapply them at the ODD level.

Another hopefully minor issue (but actually part of a larger issue suitable for a separate task in a separate ticket) is the way to make sure that the newly derived RNG is still valid for all the dictionary databases. ~~I seem to recall that the Freedict make system had a 'validate' target, so I imagine that, after regenerating the RNG, I would only have to run make with the specific parameter, and watch for error messages. @humenda , do you sense any trouble in this regard, please?~~ EDIT: this is now the topic of freedict/tools#28 and I have an interim solution
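For illustration only, the "regenerate, then validate everything" check could be sketched as below. This is a minimal sketch, not the interim solution mentioned above: it assumes that the dictionary sources are `*.tei` files somewhere below a root directory, that `xmllint` is installed, and that the caller supplies the path to the regenerated RNG. The function names are hypothetical.

```python
import subprocess
from pathlib import Path


def find_tei_files(root):
    """Return all *.tei files below root, sorted for stable output."""
    return sorted(Path(root).rglob("*.tei"))


def xmllint_command(schema, document):
    """Build an xmllint call that validates one document against a RELAX NG schema."""
    # --noout suppresses the serialized document; --relaxng names the schema.
    return ["xmllint", "--noout", "--relaxng", str(schema), str(document)]


def validate_all(root, schema):
    """Run xmllint on every dictionary; return the paths that failed validation."""
    failures = []
    for document in find_tei_files(root):
        result = subprocess.run(xmllint_command(schema, document),
                                capture_output=True, text=True)
        if result.returncode != 0:  # xmllint exits non-zero on validation errors
            failures.append(document)
    return failures
```

Calling `validate_all(".", path_to_rng)` from the repository root would then list the databases that no longer validate, which is the list one has to watch after each regeneration.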

I mentioned adding two files to the version control. I meant the current p5subset and the Freedict integrated ODD (call it... freedict_p5subset?). The first one freezes the current state of the TEI, so that, in the future, we can diff that. The second is to expose the Freedict integrated ODD for similar comparisons. I could probably live without the latter, since it depends on the former, but it also depends on the TEI stylesheets, and those are under constant development as well. Bottom line: it's far more convenient in case one has to investigate some schema-related issue across time, to have both these files handy, because both of them can only be recreated in the future after tinkering with two very dynamic repositories (TEI Guidelines and TEI Stylesheets).
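To make "we can diff that" a little more concrete: besides a plain textual diff, two frozen snapshots could be compared at the level of element definitions. The following is a rough sketch under the assumption that both files are integrated ODDs in the TEI namespace and that element definitions appear as `elementSpec` elements carrying an `ident` attribute; the function names are made up for this example.

```python
import xml.etree.ElementTree as ET

TEI_NS = "{http://www.tei-c.org/ns/1.0}"


def element_idents(xml_text):
    """Collect the @ident of every elementSpec in an integrated ODD."""
    root = ET.fromstring(xml_text)
    return {spec.get("ident") for spec in root.iter(TEI_NS + "elementSpec")}


def compare_snapshots(old_xml, new_xml):
    """Return (added, removed) element names between two snapshots."""
    old, new = element_idents(old_xml), element_idents(new_xml)
    return sorted(new - old), sorted(old - new)
```

This only reports elements that appeared or disappeared; changes inside a definition (content models, attributes) would still need a regular diff of the two frozen files.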

Envisioned action sequence:

1. derive the current p5subset (on my disk, against the current snapshot of the TEI and TEI Stylesheets)
2. freeze the p5subset by adding it to Freedict version control (where? under shared/ or elsewhere?)
3. derive the current freedict_p5subset by using the current Freedict ODD, with one change: its @source attribute will now point at the p5subset frozen at step (2)
4. derive the RNG and check if all the databases validate against the RNG
5. freeze the newly derived freedict_p5subset next to the p5subset; this one should be regenerated by hand after each modification of the Freedict ODD (one has to remember about that); recall: it's frozen for convenience, to shield it from any ensuing modifications in the TEI Stylesheets
6. rewrite the current Freedict ODD, just for the syntactic sugar
7. (recurring step) derive the RNG and check if all the databases validate against the RNG
8. commit the newly created freedict_p5subset just to document any modifications that could have crept in at step (6)
9. check our RNG version history for potential modifications introduced by hand, and see if they need to be handled at the ODD level (it might be that the underlying TEI has caught up with them during the years that have passed); if an ODD rewrite is necessary, repeat steps (7) and (8)
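The one change to the ODD mentioned in step (3) could look roughly like the sketch below. It assumes that the Freedict ODD declares its schema in a TEI `schemaSpec` element and that the frozen file ends up at `shared/p5subset.xml`; that location is purely hypothetical, since the question of where to put the file is still open. Writing the tree back to disk (preserving comments and namespace prefixes) is deliberately left out.

```python
import xml.etree.ElementTree as ET

TEI_NS = "{http://www.tei-c.org/ns/1.0}"


def repoint_schema_source(odd_xml, frozen_path):
    """Set @source on every schemaSpec so the ODD compiles against the frozen p5subset."""
    root = ET.fromstring(odd_xml)
    for spec in root.iter(TEI_NS + "schemaSpec"):
        spec.set("source", frozen_path)  # hypothetical path, supplied by the caller
    return root
```

In practice this is a one-line edit by hand; the sketch is only meant to show which attribute is involved.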

At this point, after all the above actions, we should still be at the status quo, except with (a) two new files, kept for reproducibility checks, and (b) a newer Freedict ODD, ready to be modified further.

Jan 04 '21 16:01 bansp

> Another hopefully minor issue (but actually part of a larger issue suitable for a separate task in a separate ticket) is the way to make sure that the newly derived RNG is still valid for all the dictionary databases. I seem to recall that the Freedict make system had a 'validate' target, so I imagine that, after regenerating the RNG, I would only have to run make with the specific parameter, and watch for error messages. @humenda , do you sense any trouble in this regard, please?

If it is about applying the RNG to the TEI file (xmllint, etc.), I would say that's fine. The only stumbling block here is that the RNGs are sometimes symlinked, and a broken symlink can cause trouble :).
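A quick pre-flight check for the symlink problem could be sketched as follows. This is only an illustration using the Python standard library; the function name is hypothetical, and the root directory to scan is whatever checkout one is validating.

```python
import os


def broken_symlinks(root):
    """Return paths below root that are symlinks whose target does not exist."""
    broken = []
    for dirpath, dirnames, filenames in os.walk(root):
        # Symlinks can show up among directories or files, so check both.
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            # islink() inspects the link itself; exists() follows it to the target.
            if os.path.islink(path) and not os.path.exists(path):
                broken.append(path)
    return sorted(broken)
```

Running this before validation would separate "the schema link is dangling" from genuine validation errors.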

BTW, if the schemas were in tools/, we wouldn't need to copy / symlink the schemas to each dictionary; they would simply be part of the tooling. Does this sound sensible? If so, I would like to make this shift at some point.

> Envisioned action sequence:
>
> derive the current p5subset (on my disk, against the current snapshot of the TEI and TEI Stylesheets)

You can always check out a new branch and commit any temporary state there. That at least gives people the chance to see progress.

> freeze the p5subset by adding it to Freedict version control (where? under shared/ or elsewhere?)

What is it? If that subset manifests itself as ODDs and schemas, then why not move it straight to tools and adapt the build system to use it from there? If it is a preformat and we decide to move the schemas to tools, this preformat should also be in tools.

> freeze the newly derived freedict_p5subset next to the p5subset; this one should be regenerated by hand after each modification of the Freedict ODD (one has to remember about that); recall: it's frozen for convenience, to shield it from any ensuing modifications in the TEI Stylesheets

I am lost here. Please go ahead if you have your plan :).

Jan 09 '21 19:01 humenda

Replying to specific points:

> BTW, if the schemas were in tools/, we wouldn't need to copy / symlink the schemas to each dictionary, but they were part of the tooling. Does this sound sensible? If so, I would like to make this shift at some point.

I don't think document grammars should be seen as part of the tooling. The ODD and the schemas are what provide the semantic and syntactic rules for the interpretation of dictionary documents. I would definitely advise keeping them within the fd-dictionaries repository and either symlinking them, the way it's done now, or making the dictionaries point to the shared/ directory to identify the schema. I have just posted #66 to outline that. [EDIT: I would be completely comfortable (or even outright happy) with scratching issue #66 and maintaining the current status quo]

> Please go ahead if you have your plan :)

Thanks :-) I understand that some of the above may be unclear (and I think I will reduce the procedure somewhat, to save some time), but indeed, I'm going to work on that in a separate branch, so nothing will be affected until I'm finished and it looks good.

Jan 11 '21 10:01 bansp

Trying to keep the off-topic to a single ticket, so I am reposting Sebastian's comment from elsewhere. I am not sure if Sebastian had read my reply above before posting that comment.

> I asked in another issue about including the schemas with the tooling. To what respect is this not optimal? A dictionary should be buildable with a certain version of the tooling. Eng-deu in 0.1 required the possibly oldest version of freedict-tools, not versioned back then. eng-deu 1.8.1 requires fd-tools 0.5.0. It looks natural to me to include the schemas in each version of the tools.

Tools operate on the semi-structured databases (as our XML dictionaries can be treated) in many cases thanks to the document grammars that flesh out the semantics of the particular components or regulate the relationships between components.

Think very early HTML with all the styling info inside. Separating the styling info into CSS leaves us with a skeleton that the styling information from the CSS attaches to. You need to put the two together in order to receive a pleasant, readable web page. By rough analogy, you need to put bare XML and its schema together to be sure how to interpret the given semi-structured database. They belong together.

If you were to take the schemas away, you would only leave part of the relevant information in fd-dictionaries. They would be half-useless as XML documents, until the schemas were located or (imperfectly) inferred from the existing structure. There is completely nothing natural in snatching schemas away from the dictionary documents. I don't think it is a good approach for an open project to say, "fd-dictionaries contain bare XML documents; in order to make them meaningful, you have to install the other repository as well". That just isn't user-friendly. fd-dictionaries in the current form (with schemas) do not require the fd-tools in order to become useful to people who do not wish to build distribution packages. They can safely exist on their own and the shared/ directory contains enough information (even if some of it is outdated) to get people started using or even fixing or extending fd-dictionaries with an XML editor. fd-tools make fd-dictionaries even more valuable, but they are not essential for fd-dictionaries to function on their own, if fd-documents are accompanied by their schema and their ODD.[1]

The TEI ODD makes the connection between the XML documents and schemas even more explicit, and it is my fault to not have maintained our ODD for a long time, and to have failed to exploit some of its features. I intend to take a step to amend that situation, and this ticket outlines the first steps towards that goal.

Looping back to the beginning of this particular comment: I believe that there exist good arguments for keeping schemas in fd-dictionaries rather than in fd-tools. I would like to suggest that we maintain the status quo in this regard, and don't try to fix something that is not broken.

[1] A minor note: it is part of the TEI compliance requirements that, in order to qualify as a TEI document, an XML document has to be (among other things) accompanied by the ODD document that defines its schema. But I believe that my argument above stands even without this further detail.

Jan 12 '21 01:01 bansp

> If you were to take the schemas away, you would only leave part of the relevant information in fd-dictionaries. They would be half-useless as XML documents, until the schemas were located or (imperfectly) inferred from the existing structure. There is completely nothing natural in snatching schemas away from the dictionary documents. I don't think it is a good approach for an open project to say, "fd-dictionaries contain bare XML documents; in order to make them meaningful, you have to install the other repository as well". That just isn't user-friendly. fd-dictionaries in the current form (with schemas) do not require the fd-tools in order to become useful to people who do not wish to build distribution packages. They can safely exist on their own and the shared/ directory contains enough information (even if some of it is outdated) to get people started using or even fixing or extending fd-dictionaries with an XML editor. fd-tools make fd-dictionaries even more valuable, but they are not essential for fd-dictionaries to function on their own, if fd-documents are accompanied by their schema and their ODD.[1]

I can only stress what is in the README. It says, among other things, that this repository only contains the dictionaries that are not auto-imported (anymore). There are more dictionaries these days that we automatically import than those we maintain by hand. The dictionaries in this repository are in that sense only half of the story. What about our auto-imported dictionaries? Why is it "user friendly" that somebody who reads about them and visits https://download.freedict.org/generated doesn't find the shared folder with the schemas? They are currently copy-pasted there, just because we have a rather sloppy way of treating schemas as something between data and tools.

I'm looking at it from the perspective of a distributor. The schemas are one thing, the data is something else. They belong in different packages. The shared folder was a great thing as long as everything was in one repository: schema updates automatically propagated to the dictionaries. Today, this is no longer the case. It could well happen that the schemas change in an incompatible way and there's no way for external users to figure out which schema version should be used for the dictionary they have at hand. I am all for strict versioning here. We could of course move the schemas into a separate repository, but IMO versioning them together with the tools is more convenient.

Is there a compromise we could find to resolve this discrepancy between dictionaries in this repo and from other sources?

Jan 12 '21 19:01 humenda
