
Restore missing variants to Pali text

Open sujato opened this issue 4 years ago • 0 comments

Our legacy Pali text includes variant readings, ultimately derived from the Mahasangiti edition. Unfortunately, it seems that approximately 7% of the variants were lost in the transition from the legacy texts to bilara-data.

Restore missing variant readings to bilara-data/variant/pli/ms/ from /legacy-suttacentral-data/text/pi.

Ideally we might want to go all the way back to the source MS files. However, the variants are not in a very usable form there, and extracting them is difficult, so it is better to just take them from our own legacy HTML.

Using a simple count, it seems we have 20792 variants in bilara-data, while in the legacy texts we had 22430. This leaves us with 1638 cases that need restoring.

I have tried to detect patterns that might help with restoring them. So far I can report the following, most of which rules out a simple explanation:

  • Missing variants are in all three nikayas.
  • Some are already missing in the PO files in /translation; however, texts that did not go through PO also have missing cases.
  • It includes simple cases, and I cannot detect any pattern in terms of position, edition, or anything else.
  • A possible pattern is the presence of soft hyphens in the variants of the legacy texts. These were added to long words to enable line breaks. In some cases in dn18 (see below) the variants containing such long words are missing from bilara-data. However, it doesn't seem consistent, so it may just be coincidence.

As an example, check dn18.

  • Legacy: 37 variants
    • counting one variant for each class="var"
  • Bilara-data: 28 variants
    • counting one variant for each key-value pair, plus the number of | characters. In bilara-data, when there are multiple variants for the same segment, | is the separator, so this should give an accurate count.
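
The two counts can be reproduced with a few lines of Python. This is only a minimal sketch, not the script behind the figures above: the function names are made up for illustration, and it assumes the legacy file has been read into a string and that the bilara-data variant file is a flat JSON object mapping segment IDs to variant strings.

```python
import json
import re


def count_legacy_variants(html: str) -> int:
    """Count one variant for each class="var" occurrence in the legacy HTML."""
    return len(re.findall(r'class="var"', html))


def count_bilara_variants(variant_json: str) -> int:
    """Count one variant per key-value pair, plus one for each | separator."""
    data = json.loads(variant_json)
    return sum(1 + value.count("|") for value in data.values())
```

Running both functions over every file and comparing the totals per sutta would show which texts have missing variants.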

If you check the git history, you'll see that I manually restored a variant to dn18:1.3. Without that there would have been 27 in bilara-data.

How to

Here is an initial sketch of a possible strategy. Let us use the same files from dn18.

Step 1: extract variants and position from legacy

  • In dn18.html, extract all instances of class="var" together with the preceding ms number.
  • separate out the data, something like:
"ms": "7D_596",
"var": "nādike (si, s1-3, km, pts1)",
"id": "note302",
"lemma": "nātike"

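The extraction could be sketched with the standard library HTML parser. The markup details here are assumptions and should be checked against the actual dn18.html: the ms number is assumed to sit on an element with class "ms" and an id such as "ms7D_596", and each variant is assumed to be an element with class="var" whose title attribute holds the variant, whose id holds the note ID, and whose text is the lemma. The class and function names are hypothetical.

```python
from html.parser import HTMLParser


class VariantExtractor(HTMLParser):
    """Collect class="var" elements together with the most recent ms number."""

    def __init__(self):
        super().__init__()
        self.current_ms = None  # last ms number seen, without the "ms" prefix
        self.records = []
        self._open = None       # record being filled while inside a var element
        self._tag = None        # tag name of the open var element
        self._depth = 0         # same-named tags nested inside the var element

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if self._open is not None:
            # Inside a var element: only track nesting of the same tag name.
            if tag == self._tag:
                self._depth += 1
            return
        classes = (attrs.get("class") or "").split()
        ident = attrs.get("id") or ""
        if "ms" in classes and ident.startswith("ms"):
            self.current_ms = ident[2:]
        elif "var" in classes:
            self._open = {
                "ms": self.current_ms,
                "var": attrs.get("title"),
                "id": attrs.get("id"),
                "lemma": "",
            }
            self._tag = tag

    def handle_data(self, data):
        if self._open is not None:
            self._open["lemma"] += data

    def handle_endtag(self, tag):
        if self._open is None or tag != self._tag:
            return
        if self._depth:
            self._depth -= 1
            return
        self._open["lemma"] = self._open["lemma"].strip()
        self.records.append(self._open)
        self._open = None


def extract_variants(html):
    """Return a list of dicts with the keys ms, var, id and lemma."""
    parser = VariantExtractor()
    parser.feed(html)
    parser.close()
    return parser.records
```
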
Okay, next go to dn18_reference and look for ms7D_596. There it is, on segment ID dn18:1.1. Excellent! Unfortunately the segments are often smaller than an ms, so we'll have to check for the next ms number and create a range of segments.

The next ms is on 2.1, so we look for anything from dn18:1.1 up to, but not including, dn18:2.1.
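
The range lookup could be sketched as follows. It assumes the reference file is a JSON object mapping segment IDs to comma-separated reference strings, with the ms reference written as "ms" plus the number; that layout, and all the function names, are assumptions for illustration. Segment IDs are compared numerically so that 1.10 sorts after 1.9.

```python
import re


def segment_key(segment_id):
    """Turn 'dn18:1.10' into (1, 10) so that segments compare numerically."""
    _, _, numbers = segment_id.partition(":")
    return tuple(int(n) for n in re.findall(r"\d+", numbers))


def ms_bounds(reference, ms):
    """Return (start, end): the segment carrying `ms` and the segment carrying
    the next ms reference. `end` is None if no later ms reference exists, and
    both are None if `ms` is not found."""
    start = None
    for seg in sorted(reference, key=segment_key):
        refs = [r.strip() for r in reference[seg].split(",")]
        ms_refs = [r for r in refs if r.startswith("ms")]
        if start is None:
            if f"ms{ms}" in ms_refs:
                start = seg
        elif ms_refs:
            return start, seg
    return start, None


def in_range(segment_id, start, end):
    """True if start <= segment_id < end (end is exclusive, None means open)."""
    key = segment_key(segment_id)
    if key < segment_key(start):
        return False
    return end is None or key < segment_key(end)
```

The candidate segments for one ms are then every segment ID in the root or variant file for which `in_range` is true.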

Now check dn18_variant-pli-ms under dn18:1.1. Nothing. At 1.2, however, we have a variant.

"dn18:1.2": "nātike → nādike (si, sya-all, km, pts1ed)"

Now, see whether the lemma matches the bit before the arrow:

nātike = nātike

Yay! So this variant is actually found.
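
This check could be automated roughly as below. It is a sketch under assumptions: the legacy lemma may contain soft hyphens and ṃ where bilara-data has ṁ, so both sides are normalized before comparing, and the candidate segments are passed in as a plain list. The function names are hypothetical.

```python
import unicodedata

SOFT_HYPHEN = "\u00ad"


def normalize(text):
    """Strip soft hyphens and map ṃ (U+1E43) to ṁ (U+1E41) before comparing."""
    text = unicodedata.normalize("NFC", text)
    return text.replace(SOFT_HYPHEN, "").replace("\u1e43", "\u1e41").strip()


def find_existing(variants, candidate_segments, lemma):
    """Return the segment whose variant entry already has `lemma` before the
    arrow, or None if the variant is missing from all candidate segments."""
    target = normalize(lemma)
    for seg in candidate_segments:
        # Multiple variants on one segment are separated by |.
        for reading in variants.get(seg, "").split("|"):
            left, arrow, _ = reading.partition("→")
            if arrow and normalize(left) == target:
                return seg
    return None
```

A return value of None marks a variant that needs restoring.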

Let's try another one. A bit further down in dn18.html we find (once parsed out):

"ms": "7D_597",
"var": "pañhā­vey­yāka­ra­ṇaṃ (s1-3, km, mr)",
"id": "note307",
"lemma": "pañha­vey­yāka­ra­ṇaṃ"

Go to ms7D_597, this gives the range, dn18:2.1–dn18:3.1.

In dn18_variant-pli-ms there are no variants in these segments. Ahaa! We have found a missing case!

Now we need to find what the actual segment number is.

Go to dn18_root-pli-ms and search for the lemma. We find it at dn18:2.7. This does indeed fall within the specified range dn18:2.1–dn18:3.1, so we have a hit. Add to dn18_variant-pli-ms:

"dn18:2.7": "pañhaveyyākaraṇaṁ → pañhā­vey­yāka­ra­ṇaṃ (sya-all, km, mr)",

And done.
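
The last two steps, locating the lemma in the root text and composing the new entry, might look like this. It is a sketch with hypothetical names. The sigla table contains only the two conversions visible in the examples above (s1-3 to sya-all, pts1 to pts1ed) and would need to be completed. As in the dn18:2.7 example, only the lemma is normalized; the variant reading itself is passed through unchanged.

```python
import re
import unicodedata

# Only the two conversions seen in the examples; an incomplete, assumed table.
SIGLA = {"s1-3": "sya-all", "pts1": "pts1ed"}


def normalize(text):
    """Strip soft hyphens (U+00AD) and map ṃ (U+1E43) to ṁ (U+1E41)."""
    text = unicodedata.normalize("NFC", text)
    return text.replace("\u00ad", "").replace("\u1e43", "\u1e41").strip()


def locate_lemma(root, candidate_segments, lemma):
    """Return the first candidate segment whose root text contains the lemma."""
    target = normalize(lemma).casefold()
    for seg in candidate_segments:
        if target in normalize(root.get(seg, "")).casefold():
            return seg
    return None


def convert_sigla(var):
    """Rewrite the sigla inside the trailing parentheses using SIGLA."""
    match = re.fullmatch(r"(.*?)\s*\(([^()]*)\)\s*", var)
    if not match:
        return var
    reading, sigla = match.groups()
    converted = ", ".join(SIGLA.get(s.strip(), s.strip()) for s in sigla.split(","))
    return f"{reading} ({converted})"


def build_entry(lemma, var):
    """Compose the 'lemma → variant (sigla)' string for the variant file."""
    return f"{normalize(lemma)} → {convert_sigla(var)}"
```

If the lemma occurs in more than one segment of the range, this returns only the first hit, so such cases would still need a manual check.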

sujato, Jul 24 '20 09:07