Remove workaround for "font smearing"
In my investigations into the performance of `make load-data`, I came across some odd code:
```python
def is_bold(self, lang, element):
    if element.tag in {'b', 'strong'}:
        return True
    return lang in {'lzh', 'ko', 'jp', 'tw'} and element.tag in {'h1', 'h2', 'h3', 'h4', 'h5', 'h6'}
```
This is used while scanning every legacy text, and the results are written as sets of characters to the `unicode_points` collection in ArangoDB:
`unicode_points/en`:

```json
{
    "normal": "\t\n !\"#%&'()*+,-./0123456789:;=>?ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]_`abcdefghijklmnopqrstuvwxyz{|}~ §©ÑÜáâæéêíïñóôüĀāăĕĪīŇŋōŚśŪūəɛɦɪʃʌʒˌ̣̄ӕḍḥḷṁṃṅṆṇṚṛṢṣṬṭẓạấ –—―‘’“”…→√《》一三中交佛修典冊出分十卷受員四國基大委子學官年度律戒所提揵教文方日會月有權正毾氀氍法流版研究站經綖綩網翻菩華蔣藏術覴譯金院際電頁鼓fifl︰()",
    "bold": " '(),-.0123456789:?ABCDEFGHIJKLMNOPRSTUWY[]abcdefghijklmnopqrstuvwxyz§ñĀāīūḷṃṇṭ–—’“”…",
    "italic": " !&'(),-.0123456789:;?ABCDEFGHIJKLMNOPQRSTUVWXY[]abcdefghijklmnopqrstuvwxyzñĀāīōŚśū̄ḍḥḷṁṃṅṇṛṣṭ–—‘’“”…"
}
```
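For context, here is a minimal sketch of how per-style character sets like the record above might be accumulated. The function names and the `(lang, tag, text)` data shape are my own illustration, not the actual loader code:

```python
from collections import defaultdict

def is_bold(lang, tag):
    # simplified stand-in for the loader's is_bold check
    if tag in {'b', 'strong'}:
        return True
    return lang in {'lzh', 'ko', 'jp', 'tw'} and tag in {'h1', 'h2', 'h3', 'h4', 'h5', 'h6'}

def collect_unicode_points(fragments):
    """fragments: iterable of (lang, tag, text) tuples."""
    points = defaultdict(set)
    for lang, tag, text in fragments:
        style = 'bold' if is_bold(lang, tag) else 'normal'
        points[style].update(text)
    # store each set as a sorted string, as in the record above
    return {style: ''.join(sorted(chars)) for style, chars in points.items()}
```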
That's the first time I've looked at the English record, and it looks broken. I can't see anything that reads from the collection, so I contacted Bhante Sujato. His response:
> The reason is probably to avoid using faux bold and italic fonts in certain languages. When a browser smears a bold font it normally looks kind of bad, but it's especially problematic for CJK, where the smear can completely obscure the character. There's supposed to be a CSS property, `font-synthesis`, to forbid this, but when we built it, support for that was, I think, FF and Safari only. Just checking now: it was introduced to Chrome in 2022. So our workaround can be deleted, and we can rely on `font-synthesis: none` to do our work for us!
>
> Maybe, as you suggest, it was either not implemented properly or the logic is elsewhere; either way it (and any duplicating logic) can go.
The next step is to find any duplicated logic.
Here we go. This looks relevant:
`/suttacentral/client/elements/styles/sc-typography-i18n-styles.js`:

```js
/* styles for texts with scripts other than latin */

/* Firefox & Safari use the font-synthesis property to remove faux italics or bold.
   Note that vi is excluded from this list: it uses Roman font but qualifies as tall script
   due to the stacked diacriticals */
```

See also issues #1262 and #3240.
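With `font-synthesis` now widely supported, the replacement presumably reduces to something like this. This is only a sketch: the selectors and language codes here are illustrative, taken from the `is_bold` check above rather than from the actual stylesheet:

```css
/* Forbid the browser from synthesizing (smearing) bold or italic
   glyphs for scripts where the smear can obscure the character */
:lang(lzh), :lang(ko), :lang(jp), :lang(tw) {
  font-synthesis: none;
}
```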
Just for the record, here's the profiler result:

```
normal-run.prof% stats is_bold
Thu May 1 09:57:44 2025    normal-run.prof

         313360422 function calls (312947933 primitive calls) in 397.821 seconds

   Random listing order was used
   List reduced from 6553 to 1 due to restriction <'is_bold'>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
  4256224    1.233    0.000    1.233    0.000 /opt/sc/sc-flask/server/data_loader/textdata.py:26(is_bold)
```
That function gets called 4.2 million times; as it happens, it only claims 1.2 seconds.
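The session above is the interactive `pstats` browser; the same restricted report can be produced programmatically. A minimal sketch, where the profiled function is a stand-in for the real one:

```python
import cProfile
import io
import pstats

def is_bold(lang, tag):
    # stand-in for the real data_loader function
    return tag in {'b', 'strong'}

profiler = cProfile.Profile()
profiler.enable()
for _ in range(100_000):
    is_bold('en', 'p')
profiler.disable()

# restrict the report to entries matching 'is_bold', as in `stats is_bold`
stream = io.StringIO()
pstats.Stats(profiler, stream=stream).print_stats('is_bold')
report = stream.getvalue()
```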
Thanks Bhante!
> The current code does not use the data in the `unicode_points` collection. It existed before I took over the development of SC, and I did not use this data during the development process. So I think removing it will not affect any functionality at this time.
That profiler report is a bit off due to refactoring. Here are the actual stage times:
```
from sc_load_perf.data_load_stats import *

top_clock_time(stats('stage_timing/unicode_points_removed.csv'))
Out[4]:
shape: (10, 4)
┌────────┬─────────────────────────────────┬──────────────┬────────────┐
│ number ┆ message                         ┆ clock_time_s ┆ cpu_time_s │
│ ---    ┆ ---                             ┆ ---          ┆ ---        │
│ i64    ┆ str                             ┆ f64          ┆ f64        │
╞════════╪═════════════════════════════════╪══════════════╪════════════╡
│ 30     ┆ Loading html_text               ┆ 54.02914     ┆ 35.182444  │
│ 44     ┆ Update Acronym                  ┆ 29.828202    ┆ 19.238385  │
│ 31     ┆ Make yellow brick road          ┆ 19.87329     ┆ 0.002454   │
│ 26     ┆ Updating names                  ┆ 15.308209    ┆ 8.11932    │
│ 29     ┆ Generating and loading relatio… ┆ 8.766755     ┆ 1.528496   │
│ 43     ┆ Updating text_extra_info        ┆ 7.087038     ┆ 3.641788   │
│ 48     ┆ Generating and loading orderin… ┆ 6.415615     ┆ 1.68137    │
│ 20     ┆ Load names from sc_bilara_data  ┆ 5.343331     ┆ 0.310788   │
│ 35     ┆ Loading simple dictionaries     ┆ 4.459519     ┆ 0.94023    │
│ 18     ┆ Building and loading navigatio… ┆ 4.247066     ┆ 0.183822   │
└────────┴─────────────────────────────────┴──────────────┴────────────┘
```
```
top_clock_time(stats('stage_timing/unicode_points_commit_before_remove.csv'))
Out[5]:
shape: (10, 4)
┌────────┬─────────────────────────────────┬──────────────┬────────────┐
│ number ┆ message                         ┆ clock_time_s ┆ cpu_time_s │
│ ---    ┆ ---                             ┆ ---          ┆ ---        │
│ i64    ┆ str                             ┆ f64          ┆ f64        │
╞════════╪═════════════════════════════════╪══════════════╪════════════╡
│ 30     ┆ Loading html_text               ┆ 61.656196    ┆ 45.237969  │
│ 44     ┆ Update Acronym                  ┆ 29.545614    ┆ 19.103429  │
│ 31     ┆ Make yellow brick road          ┆ 20.119479    ┆ 0.002504   │
│ 26     ┆ Updating names                  ┆ 14.473155    ┆ 8.329643   │
│ 29     ┆ Generating and loading relatio… ┆ 8.996217     ┆ 1.631244   │
│ 43     ┆ Updating text_extra_info        ┆ 8.47365      ┆ 3.709043   │
│ 48     ┆ Generating and loading orderin… ┆ 6.607545     ┆ 1.815601   │
│ 35     ┆ Loading simple dictionaries     ┆ 4.483373     ┆ 0.943689   │
│ 18     ┆ Building and loading navigatio… ┆ 4.470838     ┆ 0.179664   │
│ 20     ┆ Load names from sc_bilara_data  ┆ 3.18882      ┆ 0.237275   │
└────────┴─────────────────────────────────┴──────────────┴────────────┘
```
That's about 7.5 seconds of clock time saved on the `Loading html_text` stage, and 10 seconds of CPU time. Still a worthwhile win.
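`stats` and `top_clock_time` come from our `sc_load_perf` helpers; for the record, the ranking they perform amounts to something like this stdlib-only sketch (column names as in the `stage_timing` CSVs above; the sample rows are abbreviated from the first table):

```python
import csv
from io import StringIO

def top_clock_time(rows, n=10):
    # rank pipeline stages by wall-clock time, slowest first
    return sorted(rows, key=lambda r: float(r['clock_time_s']), reverse=True)[:n]

# a few rows in the same shape as the stage_timing CSVs
sample = StringIO(
    "number,message,clock_time_s,cpu_time_s\n"
    "30,Loading html_text,54.02914,35.182444\n"
    "31,Make yellow brick road,19.87329,0.002454\n"
    "44,Update Acronym,29.828202,19.238385\n"
)
top = top_clock_time(list(csv.DictReader(sample)))
```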
This is done, though the `unicode_points` collection itself should still be removed from the database. Perhaps do that in a migration?