bilara-data
Restore missing variants to Pali text
Our legacy Pali text includes variant readings, ultimately derived from the Mahasangiti edition. Unfortunately, it seems that approximately 7% of the variants were lost in the transition from the legacy texts to bilara-data.
Restore missing variant readings to `bilara-data/variant/pli/ms/` from `/legacy-suttacentral-data/text/pi`.
Ideally we might want to go all the way back to the source MS files. However, the variants are not in a very usable form there, and extracting them is difficult. So it is better to just take them from our own legacy HTML.
Using a simple count, it seems we have 20792 variants in bilara-data, while in the legacy texts we had 22430. This leaves us with 1638 cases that need restoring.
I have tried to detect some patterns that might help with restoring them, and so far I have eliminated these:
- Missing variants are in all three nikayas.
- Some are already missing in the PO files in /translation, however texts that did not go through PO also have missing cases.
- It includes simple cases, and I cannot detect any pattern in terms of position, edition, or anything else.
- A possible pattern is the presence of soft hyphens in the variants of the legacy texts. These were added to long words to enable line breaks. In some cases in dn18 (see below), the long words are missing from bilara-data. However, it doesn't seem consistent, so it may be just coincidence.
As an example, check dn18.
- Legacy: 37 variants
  - counting one variant for each `class="var"`
- Bilara-data: 28 variants
  - counting one variant for each kv pair, plus the number of `|` characters. In bilara-data, when there are multiple variants for the same segment, `|` is the separator, so this should give an accurate count.
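The two counting rules can be sketched roughly as follows. This is only an illustration: the function names and the sample strings are made up, and the legacy markup shape (a `span` with a `title` attribute) is an assumption rather than something confirmed here.

```python
import json
import re


def count_legacy_variants(html: str) -> int:
    """Count one variant for each occurrence of class="var"."""
    return len(re.findall(r'class="var"', html))


def count_bilara_variants(variant_json: str) -> int:
    """Count one variant per key-value pair, plus one per `|` separator."""
    segments = json.loads(variant_json)
    return sum(1 + value.count("|") for value in segments.values())


# Hypothetical sample data, not taken from the real files.
legacy_sample = (
    '<span class="var" title="a (si)">x</span> '
    '<span class="var" title="b (km)">y</span>'
)
bilara_sample = '{"dn18:1.2": "x → a (si)", "dn18:1.3": "y → b (km) | z → c (mr)"}'

print(count_legacy_variants(legacy_sample))
print(count_bilara_variants(bilara_sample))
```

Running both counts per sutta and comparing them would give a quick list of files where something is missing.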
If you check the git history, you'll see that I manually restored a variant to dn18:1.3. Without that there would have been 27 in bilara-data.
How to
Here is an initial sketch of a possible strategy. Let us use the same files from dn18.
Step 1: extract variants and position from legacy
- In dn18.html, extract all instances of `class="var"` together with the preceding `ms` number.
- Separate out the data, something like:
"ms": "7D_596",
"var": "nādike (si, s1-3, km, pts1)",
"id": "note302",
"lemma": "nātike"
Okay, next go to `dn18_reference` and look for `ms7D_596`. There it is, on segment ID dn18:1.1. Excellent! Unfortunately the segments are often smaller than a `ms`, so we'll have to check for the next `ms` number and create a range of segments.
The next `ms` is on 2.1, so we look for anything from dn18:1.1 up to (but not including) dn18:2.1.
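The range lookup might be sketched like this. It assumes the reference file maps segment IDs to comma-separated reference strings, and that `ms` numbers can be recognised as `ms` followed by a digit; both are assumptions, as are all function names. The sample values, including `ms7D_598`, are stand-ins for illustration only.

```python
import re


def segment_key(segment_id):
    """Turn 'dn18:1.10' into (1, 10) so that IDs compare numerically."""
    _, _, numbers = segment_id.partition(":")
    return tuple(int(part) for part in numbers.split("."))


def split_refs(value):
    return [ref.strip() for ref in value.split(",")]


def ms_range(reference, ms):
    """Return (start, end) for an ms number; end is exclusive, or None if last."""
    marker = "ms" + ms
    # Segments that carry any ms marker, in document order.
    starts = [
        seg
        for seg in sorted(reference, key=segment_key)
        if any(re.match(r"ms\d", ref) for ref in split_refs(reference[seg]))
    ]
    for index, seg in enumerate(starts):
        if marker in split_refs(reference[seg]):
            end = starts[index + 1] if index + 1 < len(starts) else None
            return seg, end
    return None


def segments_in_range(segment_ids, start, end):
    """All segment IDs with start <= id < end."""
    low = segment_key(start)
    high = segment_key(end) if end else None
    return [
        seg
        for seg in sorted(segment_ids, key=segment_key)
        if low <= segment_key(seg) and (high is None or segment_key(seg) < high)
    ]


# Stand-in reference data; real files carry more references per segment.
reference = {
    "dn18:1.1": "ms7D_596",
    "dn18:2.1": "ms7D_597",
    "dn18:3.1": "ms7D_598",
}
all_ids = ["dn18:1.1", "dn18:1.2", "dn18:1.3", "dn18:2.1", "dn18:2.7", "dn18:3.1"]

start, end = ms_range(reference, "7D_596")
print(segments_in_range(all_ids, start, end))
```

Sorting on the numeric key rather than on the raw string avoids 1.10 sorting before 1.2.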
Now check `dn18_variant-pli-ms` under `dn18:1.1`. Nothing. At 1.2, however, we have a variant.
"dn18:1.2": "nātike → nādike (si, sya-all, km, pts1ed)"
Now, see whether the lemma matches the bit before the arrow:
nātike = nātike
Yay! So this variant is actually found.
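The lemma check could look something like the sketch below. The function names are hypothetical. It also folds `ṃ` to `ṁ` before comparing, on the assumption that the legacy HTML and bilara-data spell the niggahīta differently, as the examples in this issue suggest.

```python
import unicodedata


def normalise(text):
    """Strip soft hyphens and fold the legacy ṃ (U+1E43) to ṁ (U+1E41)."""
    text = unicodedata.normalize("NFC", text)
    return text.replace("\u00ad", "").replace("\u1e43", "\u1e41")


def variant_lemmas(entry):
    """Lemmas in an entry of the form 'lemma → reading (sigla) | lemma → ...'."""
    return [part.split("→")[0].strip() for part in entry.split("|") if part.strip()]


def is_present(lemma, candidate_segments, variants):
    """True if some variant in the candidate segments has this lemma."""
    wanted = normalise(lemma)
    return any(
        wanted == normalise(found)
        for seg in candidate_segments
        for found in variant_lemmas(variants.get(seg, ""))
    )


variants = {"dn18:1.2": "nātike → nādike (si, sya-all, km, pts1ed)"}

print(is_present("nātike", ["dn18:1.1", "dn18:1.2", "dn18:1.3"], variants))
print(is_present("pañhaveyyākaraṇaṃ", ["dn18:2.1", "dn18:2.7"], variants))
```

A lemma that is reported as absent is only a candidate for restoring; it still needs to be located in the root text.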
Let's try another one. A bit further down in dn18.html we find (once parsed out):
"ms": "7D_597",
"var": "pañhāveyyākaraṇaṃ (s1-3, km, mr)",
"id": "note307",
"lemma": "pañhaveyyākaraṇaṃ"
Go to `ms7D_597`; this gives the range dn18:2.1–dn18:3.1.
In `dn18_variant-pli-ms` there are no variants in these segments. Ahaa! We have found a missing case!
Now we need to find what the actual segment number is.
Go to `dn18_root-pli-ms` and search for the lemma. We find it at `dn18:2.7`. This does indeed fall within the specified range dn18:2.1–dn18:3.1, so we have a hit. Add to `dn18_variant-pli-ms`:
"dn18:2.7": "pañhaveyyākaraṇaṁ → pañhāveyyākaraṇaṃ (sya-all, km, mr)",
And done.
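This last step could be sketched as below. All names are hypothetical, and the root text in the sample is a placeholder rather than the real segment. The sigla map only covers the two renamings visible in the examples here (`s1-3` to `sya-all`, `pts1` to `pts1ed`); a complete mapping would have to come from elsewhere.

```python
import re
import unicodedata

# Only the renamings visible in the examples above; assumed incomplete.
SIGLA_MAP = {"s1-3": "sya-all", "pts1": "pts1ed"}


def normalise(text):
    """Strip soft hyphens and fold the legacy ṃ (U+1E43) to ṁ (U+1E41)."""
    text = unicodedata.normalize("NFC", text)
    return text.replace("\u00ad", "").replace("\u1e43", "\u1e41")


def locate_lemma(lemma, candidate_segments, root):
    """Segments within the range whose root text contains the lemma."""
    wanted = normalise(lemma)
    return [seg for seg in candidate_segments if wanted in normalise(root.get(seg, ""))]


def build_entry(lemma, var):
    """Format 'lemma → reading (sigla)', renaming the known legacy sigla."""
    for old, new in SIGLA_MAP.items():
        var = re.sub(r"\b" + re.escape(old) + r"\b", new, var)
    return f"{normalise(lemma)} → {var}"


# Placeholder root text, not the real content of these segments.
root = {"dn18:2.1": "…", "dn18:2.7": "… pañhaveyyākaraṇaṁ …"}

hits = locate_lemma("pañhaveyyākaraṇaṃ", ["dn18:2.1", "dn18:2.7"], root)
if len(hits) == 1:
    print(hits[0], build_entry("pañhaveyyākaraṇaṃ", "pañhāveyyākaraṇaṃ (s1-3, km, mr)"))
```

Restoring automatically only when there is exactly one hit, and leaving zero or multiple hits for manual review, seems like the safer choice.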