Description

This PR adds extra regression tests for private data, which I've added to a separate repository. (You should all have access to this; let me know if you don't.) I took the liberty of stealing the HTML data @Antoinelfr shared with us on Google Drive; currently they're all PMC articles (using the PMC config, not the LEGACY_PMC one).

I came across some problems with AC while working on this (see below). I think we should merge this in any case, so that we have the infrastructure in place to check for future regressions, but if/when we fix any bugs in AC that affect the output, we will need to update the test data.

Closes #229.

Implementation

The data is housed in a git submodule in this repo. I've added a readme to the tests/data folder which explains how this all works. Only users who've been given access to the private repo will be able to use this test data, though, if you don't have access, you can still run the tests (the test for private data will be marked as skipped).

I've changed the code from hard-coding the paths to the HTML test files to dynamically reading them from the file system, with separate parametrised tests for public and private data. Files are in separate subfolders with the name of the DefaultConfig they use (e.g. LEGACY_PMC, see details in readme).

I've added pytest-xdist (an extension to run tests in parallel) as a dependency and enabled it, as there are now quite a few tests. On my desktop machine, it reduces the time taken to run the tests from 130s to 104s. (GitHub runners only have two cores, but we might as well use both!)

PATs

By default, GitHub runners don't have sufficient permissions to access external private repositories, so in order to let the runners obtain the private test data, I had to generate a PAT and add it as a secret to this repo. I've set it to expire in a week's time, at which point, GitHub Actions will stop working. I considered setting it to expire later (the maximum is 366 days), but I figured it'd cause confusion if things randomly stopped working at some distant point in the future. We don't want the RSE team to be the only ones who know how to set this up, so let's schedule a call in the next week and I can show you how to do it yourselves. (@Thomas-Rowlands I guess it makes sense for you to be the one to do this?)

Future work

Currently this setup only works for HTML data, but we should extend it so we can use it for SI files too. I didn't attempt that in this PR because a) I didn't want merge conflicts with #260 and b) the SI tests currently output data to the tests directory and we should fix this to use a temp dir instead first. Once we've done that, we can (and should!) add private test data for SI files too.

In the meantime, feel free to add more test data to the private test data repo. It would be good to have some files in there that use something other than the PMC config. See instructions in readme for how to add test data.

Problems identified

I found two kinds of problems. I'll write up proper issue descriptions for each, but I'll give a brief description of each here.

Reproducibility problems

For four of the test files, you get different results on successive runs of AC using the same version each time. It always affects the *_tables.json files, not the others. This is definitely a bug. I've marked these test cases as xfail for now (I thought that was better than excluding them from the dataset), but obviously the code should be fixed and these tests should be re-enabled at some point.

Regressions

I tried running the tests using output files generated by an older version of AC (commit 5f5e416a012bfe19b9e1ce31ecb58a850bb6d459) as a baseline, to see whether it matches the output of the current version. The reason I didn't use an even older version (e.g. the v1.1.0 release) is because we (deliberately) introduced a breaking change in #154 to fix a bug and I didn't want spurious failures due to that change. There could well have been more regressions introduced before that commit though. Might be worth checking.

Unfortunately the outputs were different for 85/96 files (with 4 excluded), HOWEVER this may not indicate a bug as long as we're happy with the new output. For the ONE failure I looked at in detail, the output from the newer version of AC didn't look obviously wrong, so it may be that the output is now more correct than it was before. This needs examining more thoroughly though to make sure we haven't inadvertantly broken anything since the v1.1.0 release.

Type of change

[ ] Documentation (non-breaking change that adds or improves the documentation)
[x] New feature (non-breaking change which adds functionality)
[ ] Optimization (non-breaking, back-end change that speeds up the code)
[ ] Bug fix (non-breaking change which fixes an issue)
[ ] Breaking change (whatever its nature)

Key checklist

[x] All tests pass (eg. pytest)
[ ] The documentation builds and looks OK (eg. mkdocs)
[x] Pre-commit hooks run successfully (eg. pre-commit run --all-files)

Further checks

[x] Code is commented, particularly in hard-to-understand areas
[x] Tests added or an issue has been opened to tackle that in the future. (Indicate issue here: # (issue))

May 23 '25 06:05 alexdewar

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

see 5 files with indirect coverage changes

:rocket: New features to boost your workflow:

:snowflake: Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

May 23 '25 06:05 codecov[bot]

This looks very good to me. I've made some suggestions. Have you opened issues for the problems you mentioned in the description?

Not yet! But I'll try to do it today and see if I can come up with reproducers for the problems too

May 29 '25 10:05 alexdewar

So I've been digging into why the results seem to have changed and it seems like a lot of the differences (though not all) are to do with missing "iao_name_*" and "iao_id_*" properties in the *_bioc.json output files.

For example:

diff --git c/html/PMC/PMC10101305_bioc.json w/html/PMC/PMC10101305_bioc.json
index 88f1a61..091320a 100644
--- c/html/PMC/PMC10101305_bioc.json
+++ w/html/PMC/PMC10101305_bioc.json
@@ -35,9 +35,7 @@
         {
           "offset": 186,
           "infons": {
-            "section_title_1": "Rationale:",
-            "iao_name_1": "document part",
-            "iao_id_1": "IAO:0000314"
+            "section_title_1": "Rationale:"
           },
           "text": "Calcinosis cutis is a rare skin disease, and idiopathic cases are rarely reported. It is characterized by the deposition of insoluble calcium salts in the skin, subcutaneous tissue, superficial muscles, and tendon sheaths. However, no abnormal changes were found in the bone. In this article, we introduce a case of idiopathic calcinosis cutis of the buttocks with a long course and large lesion area.",
           "sentences": [],
@@ -48,9 +46,7 @@
           "offset": 597,
           "infons": {
             "section_title_1": "Rationale:",
-            "section_title_2": "Patient concerns:",
-            "iao_name_1": "document part",
-            "iao_id_1": "IAO:0000314"
+            "section_title_2": "Patient concerns:"
           },
           "text": "A 51-year-old male patient was admitted to the hospital with a chief complaint of ‘Due to the discovery of hard nodules with pruritus in the buttocks for 32 years. The patient was a male who was 51 years old. He has been in good health and reported no history of surgery, trauma, infection, metabolic disease, tumor, or other diseases. There was no family history. It is worth noting that the patient has the occupation of driving trucks, which keeps him sedentary.",
           "sentences": [],

I'm assuming this change was unintentional.

I tried to figure out when the change was introduced and it seemed like it was with #189, though that PR was to do with adding these properties if they were missing rather than removing them. I guess it could have been an inadvertent change -- or I've made a mistake.

Can you comment @Antoinelfr?

May 30 '25 07:05 alexdewar

I figured out why the tables output files were changing between different runs of AC. I had a hunch we might be iterating over a set somewhere, which turned out to be the case. I've opened a PR for it (#276). Once we've merged that, I'll update this branch and we can include the files that are currently marked xfail.

PS -- the tests are failing on this branch because my PAT has expired. I'll generate a new one.

May 30 '25 07:05 alexdewar

PS -- the tests are failing on this branch because my PAT has expired. I'll generate a new one.

Done. The new one will expire in 30 days.

May 30 '25 08:05 alexdewar

LGTM and merging since it passes and the handover meeting has passed, assuming no further changes to remaining PRs outside of mine and Antoine's ownership.

We'll finish up anything that's already open and re-review anything that needs re-reviewing etc. We're not just going to cut and run! But we won't start any new development at this stage. And as @AdrianDAlessandro said, we're very happy to answer questions and give support to do with anything we've worked on.

Jun 02 '25 08:06 alexdewar

Add regression tests for private test data

Description

Implementation

PATs

Future work

Problems identified

Reproducibility problems

Regressions

Type of change

Key checklist

Further checks

Codecov Report