File type checking
Description
Added very basic file extension checks to differentiate between HTML (`.html`/`.htm`) and XML (`.xml`) input files.
Fixes #118
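For illustration, an extension check of this kind could look roughly like the sketch below. The names `FileType` and `guess_type_from_extension` are hypothetical and are not necessarily what this PR uses; the check only looks at the suffix, so it says nothing about the actual contents.

```python
from enum import Enum
from pathlib import Path


class FileType(Enum):
    """Hypothetical enum for the input types discussed in this PR."""

    HTML = "html"
    XML = "xml"
    UNKNOWN = "unknown"


def guess_type_from_extension(file_path: Path) -> FileType:
    """Guess the file type from the extension only (case-insensitive)."""
    suffix = file_path.suffix.lower()
    if suffix in {".html", ".htm"}:
        return FileType.HTML
    if suffix == ".xml":
        return FileType.XML
    # Anything else is left for the caller to skip or report.
    return FileType.UNKNOWN
```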
Type of change
- [ ] Documentation (non-breaking change that adds or improves the documentation)
- [x] New feature (non-breaking change which adds functionality)
- [ ] Optimization (non-breaking, back-end change that speeds up the code)
- [ ] Bug fix (non-breaking change which fixes an issue)
- [ ] Breaking change (whatever its nature)
Key checklist
- [ ] All tests pass (eg. `pytest`)
- [ ] The documentation builds and looks OK (eg. `mkdocs`)
- [ ] Pre-commit hooks run successfully (eg. `pre-commit run --all-files`)
Further checks
- [ ] Code is commented, particularly in hard-to-understand areas
- [ ] Tests added or an issue has been opened to tackle that in the future. (Indicate issue here: # (issue))
Reverted to draft, as more changes are needed to deal with the potential for XML files to actually contain HTML content (or vice versa).
Thinking about this a bit more, I'm wondering if it makes sense to check the file type here at all. Presumably each config file applies only to HTML or XML files (well, there aren't any XML ones yet, but there will be), so we'll know what kinds of files to expect depending on the config file loaded. Should we just add a file_type property to the config files instead?
> Should we just add a `file_type` property to the config files instead?
Currently, XML doesn't use a config file at all, so it's hard to say. Also, part of the reason for this check is to verify that a file is actually the type it is labelled as. But I think if this check is in the right place, we should be able to process multiple files of both XML and HTML types in one batch.
There are probably details I'm missing, but I kind of assumed that XML files wouldn't be that different from HTML files, in that different journals would have different formats, so you would need to handle them differently (e.g. the title might be in a differently named section). I'm guessing @Antoinelfr's code just handles the one type for now. But in any case I'm guessing it'll probably make sense to have a config file for XML files rather than hard-coding names of tags etc. And if we don't allow more than one config type to be used at a time that's going to be a bit of an obstacle with the current interface.
I kind of feel this relates to the discussion about the command-line interface (#140). It would be nice to be able to have multiple input files, then you could just do something like this:
auto-corpus -b my_html_config some_folder/*.html
auto-corpus -b my_xml_config some_folder/*.xml
Anyway, just a thought.
Following the sprint discussion this week, I thought I'd summarise the longer-term plan for file type processing in AC. So far, we aim to support HTML, XML, PDF, Word docs, spreadsheets, presentations and, a little further out, image OCR. Not all of these files will be main text: PDFs likely will be, while some will be supplementary material.
The config system will not work for every file type, at least not without modification, and it presumes that all files of a given type share common features (i.e. heading & body font sizes, section styles etc). The current idea is to mirror the input directory structure in the provided output directory, where top-level files are presumed to be articles and any files nested in folders are presumed to be supplementary material.
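To make the directory idea concrete, here is a rough sketch of how inputs could be split and mapped to output paths. The function names `classify_inputs` and `mirrored_output_path` are made up for this comment, and the "top level means article" rule is just the assumption described above, not existing behaviour.

```python
from pathlib import Path


def classify_inputs(input_dir: Path) -> dict[str, list[Path]]:
    """Split files into top-level articles and nested supplementary files.

    Paths are returned relative to input_dir.
    """
    groups: dict[str, list[Path]] = {"articles": [], "supplementary": []}
    for path in sorted(input_dir.rglob("*")):
        if not path.is_file():
            continue
        rel = path.relative_to(input_dir)
        # One path component means the file sits directly in input_dir.
        key = "articles" if len(rel.parts) == 1 else "supplementary"
        groups[key].append(rel)
    return groups


def mirrored_output_path(input_dir: Path, output_dir: Path, file_path: Path) -> Path:
    """Map an input file to the same relative location under output_dir."""
    return output_dir / file_path.relative_to(input_dir)
```

Changing the output file extension would be a separate step on top of `mirrored_output_path`.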
I think this is superseded by #235. Shall we close it?
The XML/HTML content check is definitely still needed here, e.g. attempting to parse each file to confirm that its contents match the file extension, but we could just add it to the PDF PR if preferred.
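As a rough sketch of what such a content check might look like using only the standard library: `sniff_markup` and `contents_match_extension` are hypothetical names, and the HTML test is a simple heuristic (a doctype or `<html` tag near the start) rather than a real HTML parse, so it would need hardening before being relied on.

```python
import xml.etree.ElementTree as ET
from pathlib import Path


def sniff_markup(text: str) -> str:
    """Return "html", "xml" or "unknown" based on the content alone."""
    head = text.lstrip()[:1024].lower()
    # Heuristic: an HTML doctype or <html> tag near the top means HTML.
    if head.startswith("<!doctype html") or "<html" in head:
        return "html"
    try:
        # Well-formed markup that is not HTML is treated as XML.
        ET.fromstring(text)
    except ET.ParseError:
        return "unknown"
    return "xml"


def contents_match_extension(file_path: Path, text: str) -> bool:
    """Check that the sniffed content type agrees with the file extension."""
    expected = {".html": "html", ".htm": "html", ".xml": "xml"}
    return expected.get(file_path.suffix.lower()) == sniff_markup(text)
```

With this, an `.xml` file that actually contains an HTML page would be flagged, which is the case that prompted moving the PR back to draft.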
Ah ok. Nvm then.
@Thomas-Rowlands I have changed the base branch of this PR to target my current branch where I am refactoring the autocorpus.py file. I will update this branch and merge it into my one as I can see where it can be used now.
Codecov Report
Attention: Patch coverage is 76.47059% with 8 lines in your changes missing coverage. Please review.
| Files with missing lines | Patch % | Lines |
|---|---|---|
| autocorpus/autocorpus.py | 37.50% | 3 Missing and 2 partials :warning: |
| autocorpus/file_type.py | 88.46% | 3 Missing :warning: |
| Files with missing lines | Coverage Δ | |
|---|---|---|
| autocorpus/file_type.py | 88.46% <88.46%> (ø) | |
| autocorpus/autocorpus.py | 52.17% <37.50%> (+0.23%) | :arrow_up: |