File type checking

Open Thomas-Rowlands opened this issue 9 months ago • 8 comments

Description

Added very basic file extension checks to differentiate between HTML/HTM and XML input file extensions.

Fixes #118

Type of change

  • [ ] Documentation (non-breaking change that adds or improves the documentation)
  • [x] New feature (non-breaking change which adds functionality)
  • [ ] Optimization (non-breaking, back-end change that speeds up the code)
  • [ ] Bug fix (non-breaking change which fixes an issue)
  • [ ] Breaking change (whatever its nature)

Key checklist

  • [ ] All tests pass (eg. pytest)
  • [ ] The documentation builds and looks OK (eg. mkdocs)
  • [ ] Pre-commit hooks run successfully (eg. pre-commit run --all-files)

Further checks

  • [ ] Code is commented, particularly in hard-to-understand areas
  • [ ] Tests added or an issue has been opened to tackle that in the future. (Indicate issue here: # (issue))

Thomas-Rowlands avatar Mar 25 '25 13:03 Thomas-Rowlands

Reverted to draft, as more changes are needed to handle XML files that actually contain HTML content (or vice versa).

Thomas-Rowlands avatar Mar 25 '25 13:03 Thomas-Rowlands

Thinking about this a bit more, I'm wondering if it makes sense to check the file type here at all. Presumably each config file applies only to HTML or XML files (well, there aren't any XML ones yet, but there will be), so we'll know what kinds of files to expect depending on the config file loaded. Should we just add a file_type property to the config files instead?

alexdewar avatar Mar 28 '25 14:03 alexdewar

Should we just add a file_type property to the config files instead?

Currently, XML doesn't use a config file at all, so it's hard to say. Also, part of the reason for this check is to confirm that a file is actually the type it is labelled as. But I think if this check is in the right place, we would be able to process multiple files of both XML and HTML types in one batch.

AdrianDAlessandro avatar Mar 28 '25 16:03 AdrianDAlessandro

There are probably details I'm missing, but I kind of assumed that XML files wouldn't be that different from HTML files, in that different journals would have different formats, so you would need to handle them differently (e.g. the title might be in a differently named section). I'm guessing @Antoinelfr's code just handles the one type for now. But in any case I'm guessing it'll probably make sense to have a config file for XML files rather than hard-coding names of tags etc. And if we don't allow more than one config type to be used at a time that's going to be a bit of an obstacle with the current interface.

I kind of feel this relates to the discussion about the command-line interface (#140). It would be nice to be able to have multiple input files, then you could just do something like this:

auto-corpus -b my_html_config some_folder/*.html
auto-corpus -b my_xml_config some_folder/*.xml

Anyway, just a thought.
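
For illustration, a minimal argparse sketch of the multiple-input-file idea above. Only the `-b` flag and the `auto-corpus` program name come from the example commands; the `config` and `files` attribute names and the `build_parser` helper are hypothetical and not taken from the actual Auto-CORPus CLI.

```python
import argparse
from pathlib import Path


def build_parser() -> argparse.ArgumentParser:
    """Build a parser that takes one config and one or more input files.

    Hypothetical sketch: not the real Auto-CORPus command-line interface.
    """
    parser = argparse.ArgumentParser(prog="auto-corpus")
    # One config per invocation, as in the example commands.
    parser.add_argument("-b", dest="config", required=True, help="config to apply")
    # nargs="+" accepts the list of paths a shell glob expands to.
    parser.add_argument("files", nargs="+", type=Path, help="input files")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    for path in args.files:
        print(f"{args.config}: {path}")
```

Because the shell expands `some_folder/*.html` before the program starts, the parser only needs to accept a list of paths; each invocation still applies a single config to every file it is given.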

alexdewar avatar Mar 28 '25 17:03 alexdewar

Following this week's sprint discussion, I thought I'd summarise the longer-term plan for file type processing in AC. So far, we aim to support HTML, XML, PDF, Word docs, spreadsheets, presentations and, a little further out, image OCR. Not all of these files will be main text: PDFs likely will be, while some of the others will be supplementary material.

The config system will not work for every file type, at least not without modification, and it presumes that all files of a given type share a common structure (i.e. heading & body font sizes, section styles etc). The current idea is to mirror the input directory structure in the provided output directory, where top-level files are presumed to be articles and any files nested in folders are presumed to be supplementary material.
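
A rough sketch of that directory convention, assuming paths are handled relative to the input directory. The function names and the `.json` output suffix are hypothetical, chosen only to illustrate the idea.

```python
from pathlib import Path


def classify(relative: Path) -> str:
    """Top-level files are treated as articles, nested ones as supplementary."""
    return "article" if len(relative.parts) == 1 else "supplementary"


def mirrored_output(relative: Path, output_dir: Path, suffix: str = ".json") -> Path:
    """Keep the input's folder layout under output_dir.

    The suffix is appended (not substituted) so that e.g. paper.html and
    paper.pdf do not collide in the output directory.
    """
    return output_dir / relative.with_name(relative.name + suffix)


def plan_outputs(input_dir: Path, output_dir: Path) -> list[tuple[Path, str, Path]]:
    """Walk input_dir and list (input file, kind, output file) for every file."""
    plan = []
    for path in sorted(input_dir.rglob("*")):
        if path.is_file():
            relative = path.relative_to(input_dir)
            plan.append((path, classify(relative), mirrored_output(relative, output_dir)))
    return plan
```

Under these assumptions, `paper.html` at the top level would be classed as an article, while `supp/table.xlsx` would be classed as supplementary and written under a matching `supp/` folder in the output directory.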

Thomas-Rowlands avatar Apr 16 '25 09:04 Thomas-Rowlands

I think this is superseded by #235. Shall we close it?

alexdewar avatar May 16 '25 08:05 alexdewar

The XML/HTML version testing is definitely needed here, like attempting to parse them just to confirm the contents match the file extension, but we could just add it to the PDF PR if preferred
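
As a sketch of what "attempting to parse them" could look like, using only the Python standard library. This is not the code in `autocorpus/file_type.py`; the `FileType` enum and function names are hypothetical, and treating any well-formed document with an `<html>` root as HTML is an assumption.

```python
import xml.etree.ElementTree as ET
from enum import Enum
from html.parser import HTMLParser
from pathlib import Path


class FileType(Enum):
    HTML = "html"
    XML = "xml"
    UNKNOWN = "unknown"


class _TagCollector(HTMLParser):
    """Records the (lowercased) name of every start tag encountered."""

    def __init__(self) -> None:
        super().__init__()
        self.tags: set[str] = set()

    def handle_starttag(self, tag, attrs):
        self.tags.add(tag)


def sniff_content(text: str) -> FileType:
    """Guess the file type from its content, ignoring the file name."""
    try:
        root = ET.fromstring(text)
    except ET.ParseError:
        root = None
    if root is not None:
        # Well-formed XML. XHTML has an <html> root, possibly namespaced
        # as "{namespace}html", so strip any namespace before comparing.
        local_name = root.tag.rsplit("}", 1)[-1].lower()
        return FileType.HTML if local_name == "html" else FileType.XML
    # Not well-formed XML: fall back to the lenient HTML parser.
    collector = _TagCollector()
    collector.feed(text)
    collector.close()
    return FileType.HTML if {"html", "body"} & collector.tags else FileType.UNKNOWN


def extension_matches_content(path: Path) -> bool:
    """Return True if the extension and the sniffed content agree."""
    by_suffix = {".html": FileType.HTML, ".htm": FileType.HTML, ".xml": FileType.XML}
    expected = by_suffix.get(path.suffix.lower())
    if expected is None:
        return False
    text = path.read_text(encoding="utf-8", errors="replace")
    return sniff_content(text) is expected
```

Trying the strict XML parser first and falling back to the lenient HTML parser means an `.xml` file that actually holds typical HTML (unclosed tags, `&nbsp;` entities) is reported as HTML rather than rejected outright.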

Thomas-Rowlands avatar May 16 '25 08:05 Thomas-Rowlands

Ah ok. Nvm then.

alexdewar avatar May 19 '25 07:05 alexdewar

@Thomas-Rowlands I have changed the base branch of this PR to target my current branch, where I am refactoring the autocorpus.py file. I will update this branch and merge it into mine, as I can now see where it can be used.

AdrianDAlessandro avatar May 29 '25 12:05 AdrianDAlessandro

Codecov Report

Attention: Patch coverage is 76.47059% with 8 lines in your changes missing coverage. Please review.

Files with missing lines | Patch % | Lines
autocorpus/autocorpus.py | 37.50% | 3 Missing and 2 partials :warning:
autocorpus/file_type.py | 88.46% | 3 Missing :warning:

Files with missing lines | Coverage Δ
autocorpus/file_type.py | 88.46% <88.46%> (ø)
autocorpus/autocorpus.py | 52.17% <37.50%> (+0.23%) :arrow_up:

codecov[bot] avatar May 29 '25 16:05 codecov[bot]