Update XML script + add regression test for XML module
Description
This PR includes:
- Clean new lines and stray formatting from title text.
- Fix the author formatting in the reference.
- Manually replace \xa0 (non-breaking space) with a regular space.
- Fix a bug where the section title was empty due to the title tag being outside the section tag.
- Fix the keywords section: exclude the title from the text, remove non-English keywords, process every keyword list rather than only the first, and exclude the abbreviation list.
- Remove tables and figures from the front and back tags.
- Add IOA IDs for all passages and fix the "document part" allocation.
- Fix a bug where chunks of text were not identified because they were not in section tags.
- Improve the filtering of text artefacts from unwanted sections.
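
The whitespace fixes in the list above (title cleaning and the \xa0 replacement) can be sketched roughly as follows. This is only an illustration, not the code in this PR: `clean_title_text` is a hypothetical name, and the real logic in `autocorpus/parse_xml.py` may be structured differently.

```python
import re


def clean_title_text(raw: str) -> str:
    """Hypothetical helper: normalise whitespace in a title string."""
    # Replace non-breaking spaces (\xa0) with regular spaces.
    text = raw.replace("\xa0", " ")
    # Collapse runs of whitespace, including new lines and tabs, into one space.
    text = re.sub(r"\s+", " ", text)
    # Drop leading and trailing whitespace left over from the markup.
    return text.strip()
```

A regression test could then pin the behaviour with an input such as `"A\xa0study\n  of   XML\tparsing "` and check that the result contains only single regular spaces.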
Fixes # (issue)
Type of change
- [ ] Documentation (non-breaking change that adds or improves the documentation)
- [ ] New feature (non-breaking change which adds functionality)
- [x] Optimization (non-breaking, back-end change that speeds up the code)
- [x] Bug fix (non-breaking change which fixes an issue)
- [ ] Breaking change (whatever its nature)
Key checklist
- [ ] All tests pass (eg. pytest)
- [ ] The documentation builds and looks OK (eg. mkdocs)
- [x] Pre-commit hooks run successfully (eg. pre-commit run --all-files)
Further checks
- [x] Code is commented, particularly in hard-to-understand areas
- [x] Tests added or an issue has been opened to tackle that in the future. (Indicate issue here: # (issue))
Codecov Report
Attention: Patch coverage is 33.33333% with 2 lines in your changes missing coverage. Please review.
| Files with missing lines | Patch % | Lines |
|---|---|---|
| autocorpus/file_processing.py | 33.33% | 2 Missing :warning: |
| Files with missing lines | Coverage Δ | |
|---|---|---|
| autocorpus/parse_xml.py | 37.10% <ø> (ø) | |
| autocorpus/file_processing.py | 50.90% <33.33%> (ø) | |
... and 1 file with indirect coverage changes
Before merging, let me look into the failing Windows test. I will also try to reuse the IOA implementation from the main AC to avoid redundancy. I also noticed odd character encoding when parsing the AC paper, which does not show up with the HTML; I will take a look at that as well.
Btw mypy is now happy with parse_xml.py, but we're still ignoring it in .pre-commit-config.yaml. Maybe we should enable mypy for it?
@AdrianDAlessandro
@Antoinelfr I've brought this up to date with main and added one or two small changes. I'd still recommend addressing all of @alexdewar's suggestions before merging.
@Thomas-Rowlands I'll leave this with you now to decide when it's ready to merge
I have made some modifications as suggested by @alexdewar. I also created a summary of the next steps here: https://github.com/omicsNLP/Auto-CORPus/issues/294