Fix `is_abbreviation`
Description
As discussed, there are some shortcomings/bugs with the current method used to check whether or not a string is an abbreviation. I've had a go at reimplementing is_abbreviation; let me know what you think.
I'm opening this as a draft PR because it changes the behaviour of AC. We should get @jmp111 and/or @Antoinelfr to sign off on it before we merge.
I've described the checks we're doing in the docstring:
To be classified as an abbreviation, a string must be composed exclusively of Unicode letters or digits, optionally with a following dot. This sequence must repeat between two and ten times. We exclude strings that are exclusively composed of digits or lowercase letters.
I've added unit tests. The test cases are in tests/test_abbreviation.py and should give a sense of what is/isn't considered an abbreviation. If you think there's anything that we're classifying wrongly or any checks that we're not doing but should be, let me know.
From among the test cases, there are a few differences with the old implementation:
- An empty string was counted as an abbreviation
- Trailing characters were ignored so e.g.
"ABC!!!!"would be counted as an abbreviation - Lowercase letters separated by dots (e.g.
"a.b.c.") weren't counted as abbreviations (now we only exclude strings if they're exclusively composed of lowercase letters) - Five letters separated by dots weren't counted as an abbreviation (too long -- now we count letters/numbers instead)
- Strings with punctuation etc. between letters weren't properly excluded (e.g.
"A!&=B"was counted as an abbreviation)
One consequence of the last change in that list is that one string in the Auto-CORPus paper which was previously considered an abbreviation isn't anymore. The string is "CRIS-CODE", described as "Clinical Record Interactive Search Comprehensive Data Extraction". Not sure if we should be counting that as an abbreviation or not? If so, we may want to allow hyphens as separators as well as dots.
Fixes #144.
Type of change
- [ ] Documentation (non-breaking change that adds or improves the documentation)
- [ ] New feature (non-breaking change which adds functionality)
- [ ] Optimization (non-breaking, back-end change that speeds up the code)
- [x] Bug fix (non-breaking change which fixes an issue)
- [ ] Breaking change (whatever its nature)
Key checklist
- [x] All tests pass (eg.
pytest) - [ ] The documentation builds and looks OK (eg.
mkdocs) - [x] Pre-commit hooks run successfully (eg.
pre-commit run --all-files)
Further checks
- [x] Code is commented, particularly in hard-to-understand areas
- [x] Tests added or an issue has been opened to tackle that in the future. (Indicate issue here: # (issue))
One consequence of the last change in that list is that one string in the Auto-CORPus paper which was previously considered an abbreviation isn't anymore. The string is
"CRIS-CODE", described as "Clinical Record Interactive Search Comprehensive Data Extraction". Not sure if we should be counting that as an abbreviation or not? If so, we may want to allow hyphens as separators as well as dots.
I suspect we should allow hyphens as separators. But need to hear from the others to confirm
I aggree with @AdrianDAlessandro, I think hyphens should be allowed if possible, I suspect a minority of abbreviations will have it but still
Sure. I'll do that now.
This PR might break some of the regression tests for private data. I guess maybe it would be worth merging #265 first so that we can audit this more easily?
Done. Amusingly, now "Auto-CORPus" itself appears in the list of abbreviations, when it didn't before (there are no other changes, except that "CRIS-CODE" is now included again).
Auto-CORPus is 11 chars long and so was previously excluded because it's longer than 10 chars. But now, hyphens and dots aren't counted towards the total, so now it's included, but only just. Maybe we should consider upping the limit?
Codecov Report
All modified and coverable lines are covered by tests :white_check_mark:
| Files with missing lines | Coverage Δ | |
|---|---|---|
| autocorpus/abbreviation.py | 70.83% <100.00%> (-0.02%) |
:arrow_down: |
:rocket: New features to boost your workflow:
- :snowflake: Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
For anyone who's interested, I reran AC over the private test data to see how the results changed for the abbreviations file.
Here's the result:
diff --git c/html/PMC/PMC10627243_abbreviations.json w/html/PMC/PMC10627243_abbreviations.json
index 9218e7a..077a5e8 100644
--- c/html/PMC/PMC10627243_abbreviations.json
+++ w/html/PMC/PMC10627243_abbreviations.json
@@ -71,6 +71,11 @@
"text_short": "ASCO",
"text_long_1": "American Society of Clinical Oncology",
"extraction_algorithm_1": "fulltext"
+ },
+ {
+ "text_short": "pd-l1",
+ "text_long_1": "-programmed death-ligand 1",
+ "extraction_algorithm_1": "fulltext"
}
]
}
diff --git c/html/PMC/PMC4289056_abbreviations.json w/html/PMC/PMC4289056_abbreviations.json
index 69581f3..ccd2905 100644
--- c/html/PMC/PMC4289056_abbreviations.json
+++ w/html/PMC/PMC4289056_abbreviations.json
@@ -17,11 +17,6 @@
"text_long_1": "computed tomography",
"extraction_algorithm_1": "fulltext"
},
- {
- "text_short": "H&E",
- "text_long_1": "hematoxylin and eosin",
- "extraction_algorithm_1": "fulltext"
- },
{
"text_short": "GMS",
"text_long_1": "Grocott’s methenamine silver",
diff --git c/html/PMC/PMC6099000_abbreviations.json w/html/PMC/PMC6099000_abbreviations.json
index 21a63b4..92dd0d4 100644
--- c/html/PMC/PMC6099000_abbreviations.json
+++ w/html/PMC/PMC6099000_abbreviations.json
@@ -17,11 +17,6 @@
"text_long_1": "human immunodeficiency virus",
"extraction_algorithm_1": "fulltext"
},
- {
- "text_short": "HHV‐8",
- "text_long_1": "human herpes virus 8",
- "extraction_algorithm_1": "fulltext"
- },
{
"text_short": "UCD",
"text_long_1": "Unicentric CD",
diff --git c/html/PMC/PMC7454517_abbreviations.json w/html/PMC/PMC7454517_abbreviations.json
index 0edb505..d0aa362 100644
--- c/html/PMC/PMC7454517_abbreviations.json
+++ w/html/PMC/PMC7454517_abbreviations.json
@@ -32,11 +32,6 @@
"text_long_1": "percutaneous coronary intervention",
"extraction_algorithm_1": "fulltext"
},
- {
- "text_short": "V-V ECMO",
- "text_long_1": "veno-venous extracorporeal membrane oxygenation",
- "extraction_algorithm_1": "fulltext"
- },
{
"text_short": "LDH",
"text_long_1": "lactate dehydrogensae",
diff --git c/html/PMC/PMC8723718_abbreviations.json w/html/PMC/PMC8723718_abbreviations.json
index 41ae298..f4f391e 100644
--- c/html/PMC/PMC8723718_abbreviations.json
+++ w/html/PMC/PMC8723718_abbreviations.json
@@ -22,11 +22,6 @@
"text_long_1": "Carcinoembryonic antigen",
"extraction_algorithm_1": "fulltext"
},
- {
- "text_short": "CA 19-9",
- "text_long_1": "carbohydrate antigen 19-9",
- "extraction_algorithm_1": "fulltext"
- },
{
"text_short": "IHC",
"text_long_1": "immunohistochemistry",
@@ -37,11 +32,6 @@
"text_long_1": "cytokeratin 7",
"extraction_algorithm_1": "fulltext"
},
- {
- "text_short": "CK 20",
- "text_long_1": "cytokeratin 20",
- "extraction_algorithm_1": "fulltext"
- },
{
"text_short": "TTF-1",
"text_long_1": "thyroid transcription factor 1",
diff --git c/html/PMC/PMC9613823_abbreviations.json w/html/PMC/PMC9613823_abbreviations.json
index 7b7ac6d..b7c2dcf 100644
--- c/html/PMC/PMC9613823_abbreviations.json
+++ w/html/PMC/PMC9613823_abbreviations.json
@@ -11,11 +11,6 @@
"text_short": "CT",
"text_long_1": "computed tomography",
"extraction_algorithm_1": "fulltext"
- },
- {
- "text_short": "ATLS®",
- "text_long_1": "Advanced trauma life support",
- "extraction_algorithm_1": "fulltext"
}
]
}
diff --git c/html/PMC/PMC9724164_abbreviations.json w/html/PMC/PMC9724164_abbreviations.json
index ff7d5ab..cb4e1fc 100644
--- c/html/PMC/PMC9724164_abbreviations.json
+++ w/html/PMC/PMC9724164_abbreviations.json
@@ -37,11 +37,6 @@
"text_long_1": "computed tomography",
"extraction_algorithm_1": "fulltext"
},
- {
- "text_short": "CA 19-9",
- "text_long_1": "cancer antigen 19-9",
- "extraction_algorithm_1": "fulltext"
- },
{
"text_short": "CK",
"text_long_1": "cytokeratin",
So it's found one extra abbreviation and it's excluding a bunch of others. While the others do look like abbreviations, for the most part it's obvious why they were excluded based on the rules we came up with (e.g. spaces, a "rights reserved" sign at the end etc.).
The only one that stood out as odd was "HHV-8", which looks like it should be included. On closer examination it turns out that the hyphen in the middle isn't a hyphen but a unicode dash of some description (bloody unicode).
It's a bit weird that some abbreviations with spaces in were previously included as I thought the old algorithm also excluded candidates with spaces, but I guess it wasn't doing what it was supposed to.
If you decide you actually do want some of these candidates to be counted as abbreviations, it should be easy enough to tweak the regex or do some pre-processing of the strings.