Normalize IPA strings to NFC for consistency

Open lars76 opened this issue 4 months ago • 3 comments

Hi, I noticed that the dataset contains a mixture of NFC and NFD Unicode forms for IPA strings. For example:

  • Row 277, Col 7: ãː (NFD: a + COMBINING TILDE) vs. ãː (NFC: single precomposed ã).

Out of ~5.1 million cells, ~6,200 are not in NFC. This causes issues with string matching, e.g., "ã" != "ã" even though they look identical.
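A minimal sketch of how such a count could be reproduced, assuming the data is read with `csv.reader` and Python 3.8+ is available for `unicodedata.is_normalized`. The helper name `count_non_nfc_cells` and the sample rows are hypothetical and are not taken from the dataset:

```python
import csv
import io
import unicodedata


def count_non_nfc_cells(rows):
    """Return (total_cells, non_nfc_cells) for an iterable of CSV rows."""
    total = 0
    non_nfc = 0
    for row in rows:
        for cell in row:
            total += 1
            # is_normalized avoids building a normalized copy of every cell
            if not unicodedata.is_normalized("NFC", cell):
                non_nfc += 1
    return total, non_nfc


# The two spellings render identically but compare unequal.
nfd = "a\u0303\u02d0"  # a + COMBINING TILDE + length mark
nfc = "\u00e3\u02d0"   # precomposed a-tilde + length mark
assert nfd != nfc
assert unicodedata.normalize("NFC", nfd) == nfc

# Hypothetical in-memory CSV standing in for the real file.
sample = io.StringIO("word,ipa\nfoo,a\u0303\u02d0\nbar,\u00e3\u02d0\n")
total, non_nfc = count_non_nfc_cells(csv.reader(sample))
print(f"{non_nfc} of {total} cells are not in NFC")
```

On the real file, the same function could be given `csv.reader(open("input.csv", encoding="utf-8", newline=""))`.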

To fix this, I applied NFC normalization across the CSV like this:

import csv
import unicodedata

# newline="" lets the csv module handle line endings itself
with open("input.csv", "r", encoding="utf-8", newline="") as infile, \
     open("output.csv", "w", encoding="utf-8", newline="") as outfile:
    reader = csv.reader(infile)
    writer = csv.writer(outfile)
    for row in reader:
        # Normalize every cell to NFC; cells already in NFC are unchanged
        writer.writerow([unicodedata.normalize("NFC", cell) for cell in row])

Example of a normalized cell:

Row 1747, Col 8
  Original   : o̞˞ o̞ õ̞ ɔ   [U+006F U+031E U+02DE U+0020 U+006F U+031E U+0020 U+006F U+031E U+0303 U+0020 U+0254]
  Normalized : o̞˞ o̞ õ̞ ɔ   [U+006F U+031E U+02DE U+0020 U+006F U+031E U+0020 U+00F5 U+031E U+0020 U+0254]
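
A before/after listing like the one above could be produced with a small helper along these lines. This is only a sketch: the function name `codepoints` is hypothetical, and the input string is rebuilt from the code points listed for Row 1747, Col 8 rather than read from the dataset:

```python
import unicodedata


def codepoints(text):
    """Format a string as space-separated U+XXXX code points."""
    return " ".join(f"U+{ord(ch):04X}" for ch in text)


# Cell contents rebuilt from the code points listed for Row 1747, Col 8.
original = "o\u031e\u02de o\u031e o\u031e\u0303 \u0254"
normalized = unicodedata.normalize("NFC", original)

print(f"Original   : {original}   [{codepoints(original)}]")
print(f"Normalized : {normalized}   [{codepoints(normalized)}]")
```

Only the third syllable changes: `U+006F U+031E U+0303` becomes `U+00F5 U+031E`, because the tilde composes with the base letter across the below-attached down tack.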

lars76, Aug 24 '25 10:08