Test and refactor data loader
About two weeks ago the server was having difficulties:
https://discourse.suttacentral.net/t/http-500-error-when-trying-to-read-ven-sujatos-trans-of-mn-137/38285
While the urgent matter is the performance issues, I think we'd do well to refactor first. By refactoring I'm following Martin Fowler's definition:
Refactoring (noun): a change made to the internal structure of software to make it easier to understand and cheaper to modify without changing its observable behavior.
Refactoring (verb): to restructure software by applying a series of refactorings without changing its observable behavior.
Automated testing is complementary to refactoring, so as I go along I can create a suite of unit tests. Higher-level system tests might need to be in place to get started; otherwise you don't know what you're going to break.
To enable that I have a SuttaCentral instance running in VirtualBox:
https://github.com/jhanarato/sc-testing/blob/main/docs/How-To-Install-Suttacentral-In-VirtualBox.md
So the game plan:
- Write some high level system tests and/or characterisation tests to provide some safety (a rough sketch follows this list).
- Refactor in small steps, committing regularly.
- Build a suite of unit tests as the refactoring progresses.
- Examine the data load process with knowledge gained to see where performance issues might lie.
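For the first item, a characterisation test could be as plain as this. It's only a sketch: the endpoint, base URL and golden-file path are assumptions, and the real tests will depend on what the VirtualBox instance actually serves.

```python
import json
import requests

BASE_URL = "http://localhost"  # the VirtualBox instance; adjust as needed


def test_mn137_suttaplex_unchanged():
    # Characterisation test: capture today's behaviour once as a golden file,
    # then fail if refactoring changes the response in any way.
    response = requests.get(f"{BASE_URL}/api/suttaplex/mn137")
    response.raise_for_status()
    with open("tests/golden/mn137_suttaplex.json") as f:
        expected = json.load(f)
    assert response.json() == expected
```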
As it happens, I still seem to have permissions for the suttacentral repository. I can do this work on a branch, or else make a fork. Up to you!
Nice to be back,
Ajahn J.R.
Glad to have you back!
I know nothing of the work that needs to be done on this count, but I am happy to say Sadhu! and do whatever I can.
BTW, I don't think Bhante Sujato monitors the issues closely, so if you need his attention best to reach out directly.
Hey @thesunshade, pretty sure I know who you are on D&D.
The work is just a series of small improvements, avoiding changes to the behavior. The near future will involve reading the code to figure out what the behavior is and adding tests via pytest to avoid introducing bugs.
Back in the day I worked for a library systems company and learned what happens when you refactor without tests. Basically you introduce a series of silly bugs and your coworkers look at you askance. I mentioned this recently to a software developer who stayed with us at Kusala. He said, "Oh yeah, everyone goes through that." Rather comforting.
What I'm not proposing is a huge rewrite. I've owned copies of both editions of Fowler's Refactoring, about 20 years apart. It's a gradual process with a series of small steps. As I go I can merge these into production, or have them reviewed by @ihongda. If anything breaks it should be easy to revert the changes.
I don't think it's necessary to bring this up on D&D, but yes, hi Bhante @sujato!
Yep, you may know me as the multi-named busybody that tries to skillfully (?) manipulate people into collecting merit.
If you ever need a dim bird to use as a rubber duck, I'd be happy to go along for the ride.
Cool!
I think for the near term I'll create a fork and issue pull requests.
That makes sense. Thanks!
The fork is set up now:
https://github.com/jhanarato/suttacentral
Might be unrelated to this process, but look into this:
https://suttacentral.net/search?query=author:kelly
Which displays "Data Load Error"
Hey folks, been working hard here at Kusala, but I've got some ideas on how to approach this.
First up I'd like to get the data load process running in a test harness. That is to say, launching an ArangoDB container, running the data load process and investigating the results. This first step avoids modifying the production code, while allowing me to study how it works. So you would run make test-data-load and all this would happen. I'll need to work on my Docker-fu.
From there we can start testing with smaller sets of JSON. This will give faster feedback on the tests. I like to use learning tests that just tell me how things work.
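For instance, a learning test against the ArangoDB driver might be no more than this (assuming python-arango and a local test instance; the host and credentials are placeholders):

```python
from arango import ArangoClient


def test_create_and_drop_database():
    # Not testing our code at all; just confirming how the driver behaves.
    client = ArangoClient(hosts="http://localhost:8529")
    sys_db = client.db("_system", username="root", password="test")
    name = "learning_test_db"
    if sys_db.has_database(name):
        sys_db.delete_database(name)
    sys_db.create_database(name)
    assert sys_db.has_database(name)
    sys_db.delete_database(name)
    assert not sys_db.has_database(name)
```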
When I've got a bit of confidence, I'll look at refactoring. Most of the action seems to be in arangoload.py. I do see an opportunity for breaking it up into separate pieces where loading JSON data and writing it to ArangoDB are separated. More fine-grained tests can be written, and profiling will make it easier to see whether it's ArangoDB or simply deserialising JSON that's using up CPU resources. Or something that we would never have thought of. That's why we profile, right?
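To make that concrete, the split might look something like this; the function names are hypothetical, not what's in arangoload.py today:

```python
import json
from pathlib import Path


def read_documents(json_dir: Path):
    """Deserialise JSON files into plain dicts; no database involved.
    Assumes each file holds a list of documents."""
    for path in sorted(json_dir.glob("*.json")):
        with path.open(encoding="utf-8") as f:
            yield from json.load(f)


def write_documents(collection, documents, batch_size=1000):
    """Write already-built documents to an ArangoDB collection in batches."""
    batch = []
    for doc in documents:
        batch.append(doc)
        if len(batch) >= batch_size:
            collection.import_bulk(batch)
            batch.clear()
    if batch:
        collection.import_bulk(batch)
```

With the two pieces apart, the JSON side can be profiled without a database at all, and the writes can be timed on their own.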
Anyway, I'll keep you updated as I delve into the code.
@khemarato are you aware of this work?
Ven. @jhanarato, all of this is beyond my understanding, however I rejoice greatly in your work. It is a blessing to have your involvement with this.
> Ven. @jhanarato, all of this is beyond my understanding,
Indeed. I'd better get it right then!
> however I rejoice greatly in your work. It is a blessing to have your involvement with this.
Hey, no worries. I like writing code, and I don't see better kamma elsewhere. I know from my personal projects that I can do this sort of thing in small steps over a long period of time.
Hi @khemarato !
I'd also really like to have a conversation around Elasticsearch and neo-bilara. Just from GitHub and D&D there seems to be a bit of history there. Where's a good place for that?
I just sent you a DM on D&D
The data load tests should have their own ArangoDB database so as not to interfere with other unit tests. As far as I can see there are no tests that actually write data, but best to be on the safe side.
Having had a look at the docker configuration, I think we can get away with a separate database in the same container. The test database is already named suttacentral_tests or suttacentral_test depending on whether config.py or .test.env defines it; it looks like there's a conflict there, which I'll have to check. My idea is to have a database called suttacentral_data_load_test and provide test fixtures to set up and tear down as necessary.
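The fixture could be as simple as this (a sketch only; the host and credentials are placeholders and the real version will need to match the docker configuration):

```python
import pytest
from arango import ArangoClient

DATA_LOAD_TEST_DB = "suttacentral_data_load_test"


@pytest.fixture
def data_load_db():
    client = ArangoClient(hosts="http://localhost:8529")
    sys_db = client.db("_system", username="root", password="test")
    # Start from a clean slate, hand the database to the test, then drop it
    # so the other test databases are never touched.
    if sys_db.has_database(DATA_LOAD_TEST_DB):
        sys_db.delete_database(DATA_LOAD_TEST_DB)
    sys_db.create_database(DATA_LOAD_TEST_DB)
    yield client.db(DATA_LOAD_TEST_DB, username="root", password="test")
    sys_db.delete_database(DATA_LOAD_TEST_DB)
```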
It seems that the current test database is called suttacentral_tests, which is obtained in config.py:
class TestingConfig(Config):
    """Configuration for Testing"""
    TESTING = True
    DEBUG = True
    ARANGO_DB = os.getenv('ARANGO_BASE_DB_NAME') + '_tests'
and not the .test.env file:
SERVER_ADDRESS=localhost
SECRET=SomeSecretKey
ENVIRONMENT=testing
# Database variables
ARANGO_DB_NAME=suttacentral_test
I've been fairly productive over the past few days. As I now have code merged, please do let me know if anything untoward occurs. I'll be busier over the next couple of weeks, but when I have some time I want to add CPU time to the stats.
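The measurement itself is straightforward; something along these lines (a sketch, with run_stage standing in for whatever callable wraps a data-load stage):

```python
import time


def timed_stage(run_stage):
    # Record both wall-clock time and CPU time for one stage, mirroring the
    # clock_time_s / cpu_time_s columns in the CSV below.
    start_clock = time.perf_counter()
    start_cpu = time.process_time()
    run_stage()
    return time.perf_counter() - start_clock, time.process_time() - start_cpu
```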
Here's the original CSV I generated with the data-load stages and their actual running time: load-data-run.csv
I've created a new project for working with the performance stats. It's based on polars:
https://github.com/jhanarato/sc-load-perf
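To give a flavour of what it does (an illustration only, not the actual sc-load-perf code), a top-ten table like the ones below takes only a few lines of polars, assuming columns named number, message, clock_time_s and cpu_time_s:

```python
import polars as pl

stages = pl.read_csv("load-data-run.csv")

top_by_clock = (
    stages.select(["number", "message", "clock_time_s", "cpu_time_s"])
    .sort("clock_time_s", descending=True)
    .head(10)
)
print(top_by_clock)
```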
Here's the initial results:
Top 10 Stages - Clock Time
┌────────┬─────────────────────────────────┬──────────────┬────────────┐
│ number ┆ message ┆ clock_time_s ┆ cpu_time_s │
│ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ str ┆ f64 ┆ f64 │
╞════════╪═════════════════════════════════╪══════════════╪════════════╡
│ 30 ┆ Loading html_text ┆ 79.050246 ┆ 73.02274 │
│ 44 ┆ Update Acronym ┆ 58.988813 ┆ 34.236346 │
│ 31 ┆ Make yellow brick road ┆ 25.563617 ┆ 0.003657 │
│ 26 ┆ Updating names ┆ 20.259218 ┆ 11.903545 │
│ 29 ┆ Generating and loading relatio… ┆ 14.499532 ┆ 2.871383 │
│ 48 ┆ Generating and loading orderin… ┆ 10.632966 ┆ 2.59407 │
│ 43 ┆ Updating text_extra_info ┆ 8.522138 ┆ 5.024656 │
│ 35 ┆ Loading simple dictionaries ┆ 6.706958 ┆ 1.445712 │
│ 18 ┆ Building and loading navigatio… ┆ 5.888915 ┆ 0.239287 │
│ 25 ┆ Load texts from sc_bilara_data ┆ 4.836083 ┆ 2.962696 │
└────────┴─────────────────────────────────┴──────────────┴────────────┘
Top 10 Stages - CPU Time
┌────────┬─────────────────────────────────┬──────────────┬────────────┐
│ number ┆ message ┆ clock_time_s ┆ cpu_time_s │
│ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ str ┆ f64 ┆ f64 │
╞════════╪═════════════════════════════════╪══════════════╪════════════╡
│ 30 ┆ Loading html_text ┆ 79.050246 ┆ 73.02274 │
│ 44 ┆ Update Acronym ┆ 58.988813 ┆ 34.236346 │
│ 26 ┆ Updating names ┆ 20.259218 ┆ 11.903545 │
│ 43 ┆ Updating text_extra_info ┆ 8.522138 ┆ 5.024656 │
│ 25 ┆ Load texts from sc_bilara_data ┆ 4.836083 ┆ 2.962696 │
│ 4 ┆ Loading ChangeTracker ┆ 2.963014 ┆ 2.953054 │
│ 29 ┆ Generating and loading relatio… ┆ 14.499532 ┆ 2.871383 │
│ 48 ┆ Generating and loading orderin… ┆ 10.632966 ┆ 2.59407 │
│ 35 ┆ Loading simple dictionaries ┆ 6.706958 ┆ 1.445712 │
│ 47 ┆ Generating sitemap ┆ 1.295066 ┆ 0.922846 │
└────────┴─────────────────────────────────┴──────────────┴────────────┘
NB: This is the Python process only; we don't yet know what ArangoDB is doing. The discrepancy between clock time and CPU time means something else is going on: the DB itself might be using up CPU, or there is some waiting for IO.
Let's take three examples:
- Loading html_text: almost all Python
- Make yellow brick road: basically all ArangoDB
- Update Acronym: half Python, half some other activity
The first can be profiled, AQL can presumably be analysed somehow, and the third is a mystery as to where it spends its time.
I'll focus on these three for now, testing and refactoring as I go.
Thanks for all your hard work! Hope you are enjoying it
I'm working through the sc_html.py module and there are a couple of relevant comments I'd like to preserve here for later contemplation:
# We need to jump through some hoops to ensure the mixins are included
# in all Element class for every tag type. (in lxml.html, some, like input
# and select, have a custom element type, these require the mixins parameter
# to set_element_class_lookup, but there the mixins don't apply to any
# non-customized tag, so we also need to manually mix them into a new
# HtmlElement and create a CustomLookup class which returns our new
# HtmlElement class as the default)
class HtHtmlElement(HtHtmlElementMixin, _html.HtmlElement):
    pass
# lxml still has a number of issues handling utf8. We need to
# explicitly define that our parser is using utf-8 for some systems.
# http://stackoverflow.com/questions/15302125/html-encoding-and-lxml-parsing
def get_parser(encoding='utf-8'):
    parser = _html.HTMLParser(encoding=encoding)
    parser.set_element_class_lookup(CustomLookup(mixins=[('*', HtHtmlElementMixin)]))
    return parser
As mentioned on D&D dev chat, this code will probably be migrated to the most recent BeautifulSoup library.
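If that migration happens, the encoding workaround above probably disappears, since BeautifulSoup does its own encoding detection on byte input. Something like this, just a sketch of the direction rather than a settled design:

```python
from bs4 import BeautifulSoup


def parse_html(markup: bytes) -> BeautifulSoup:
    # Given raw bytes, BeautifulSoup works out the encoding itself, so the
    # explicit utf-8 parser workaround needed with lxml should go away.
    return BeautifulSoup(markup, "lxml")
```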
I've found a lot of seemingly dead code in sc_html.py that I plan to remove. To start with, it's in a mixin:
class HtHtmlElementMixin:
    """ Adds methods primarily to aid with proper handling of text/tail.
    Also adds some convenience methods.
    """
and comprises methods like this:
def prepend(self, other):
    """Inserts other at the start of self.
    Unlike .insert(0, other), handles text properly.
    """
    self.insert(0, other)
    if self.text:
        other.tail = self.text + (other.tail or '')
        self.text = None
This comment claims that only one method is overridden:
"""
The only HtmlElement method which is overridden by this module is __str__,
which now returns the html code of the element.
"""
If that comment is wrong (entirely possible) then we might find that the code reverts to lxml's default implementation of the methods. The steps below mitigate that risk.
- [x] Use IDE to search for uses in our code.
- [x] Write tests to check that lxml does not define these methods (see the sketch after this list).
- [x] Replace the method bodies with a raised error and run the full load.
- [x] Remove the methods.
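The second item can be as blunt as this (a sketch; prepend is the only mixin method quoted above, but the real tests cover each method slated for removal):

```python
from lxml import html as _html


def test_lxml_does_not_define_prepend():
    # If lxml's HtmlElement already provided this method, deleting it from
    # the mixin would silently fall back to lxml's implementation rather
    # than removing behaviour, so check before removing.
    assert not hasattr(_html.HtmlElement, "prepend")
```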
After refactoring, the profiler output is far more informative:
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.151 0.151 48.937 48.937 unsegmented.py:142(load_unsegmented_texts)
70397 0.079 0.000 44.843 0.001 unsegmented.py:163(documents)
70351 0.756 0.000 42.194 0.001 unsegmented.py:170(create_document)
70351 0.281 0.000 31.822 0.000 unsegmented.py:202(extract_html_details)
70351 0.143 0.000 11.466 0.000 unsegmented.py:217(extract_authors_long_name)
70351 0.299 0.000 9.443 0.000 unsegmented.py:188(extract_file_details)
70351 0.088 0.000 4.766 0.000 unsegmented.py:263(find_title_tag)
70351 0.052 0.000 4.395 0.000 unsegmented.py:231(extract_publication_date)
70351 0.052 0.000 3.700 0.000 unsegmented.py:125(add_document)
697 0.004 0.000 3.638 0.005 unsegmented.py:130(flush_documents)
70397 0.030 0.000 2.548 0.000 unsegmented.py:158(html_files)
70351 0.073 0.000 1.833 0.000 unsegmented.py:239(extract_title)
65958 0.054 0.000 1.145 0.000 unsegmented.py:276(normalise_title)
70351 0.084 0.000 0.545 0.000 unsegmented.py:281(extract_volpage)
4620 0.013 0.000 0.387 0.000 unsegmented.py:250(extract_side_by_side_title)
70351 0.102 0.000 0.217 0.000 unsegmented.py:60(as_dict)
70351 0.118 0.000 0.157 0.000 unsegmented.py:91(__getitem__)
70351 0.045 0.000 0.095 0.000 unsegmented.py:56(key)
140702 0.053 0.000 0.053 0.000 unsegmented.py:49(path)
70351 0.021 0.000 0.021 0.000 unsegmented.py:294(log_missing_details)
1 0.000 0.000 0.015 0.015 unsegmented.py:78(__init__)
1 0.000 0.000 0.008 0.008 unsegmented.py:1(<module>)
1 0.000 0.000 0.004 0.004 unsegmented.py:138(__exit__)
1 0.000 0.000 0.000 0.000 unsegmented.py:154(language_directories)
1 0.000 0.000 0.000 0.000 unsegmented.py:155(<listcomp>)
1 0.000 0.000 0.000 0.000 unsegmented.py:42(Document)
1 0.000 0.000 0.000 0.000 unsegmented.py:77(Authors)
1 0.000 0.000 0.000 0.000 unsegmented.py:25(TextDetails)
1 0.000 0.000 0.000 0.000 unsegmented.py:33(AuthorDetails)
1 0.000 0.000 0.000 0.000 unsegmented.py:118(HtmlTextWriter)
1 0.000 0.000 0.000 0.000 unsegmented.py:17(FileDetails)
1 0.000 0.000 0.000 0.000 unsegmented.py:121(__init__)
1 0.000 0.000 0.000 0.000 unsegmented.py:135(__enter__)
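For reference, a profile like the one above can be produced along these lines (a sketch; run_stage stands in for whatever callable wraps the html_text load):

```python
import cProfile
import pstats


def profile_stage(run_stage):
    # Run one data-load stage under cProfile and report only the lines from
    # unsegmented.py, sorted by cumulative time.
    profiler = cProfile.Profile()
    profiler.enable()
    run_stage()
    profiler.disable()
    stats = pstats.Stats(profiler).sort_stats("cumulative")
    stats.print_stats("unsegmented.py")
```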
This is all great, thanks so much.