Image analysis job
This pull request creates a first draft of an image analysis job (#181), which will need to be updated further before merging.
I chose a slightly broader approach to this job, making an `AnalyzeMediaJob` which includes an `AnalyzeImage` task. This is intended to be more extensible, as additional stages of media analysis will probably be wanted later on.
The job isn't fully integrated. Although it is written to have four variants (Individual, Library, Series, and MediaGroup), I've only implemented a client-side way to access the Individual analysis command. I did this by adding an "Analyze Media" button to the "Manage" page for individual media items, which seemed like the right place to put it. Implementing the job fully would involve adding the same button to other menus and linking them up on the axum side. The axum side also needs authentication boilerplate added.
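For reference, here's a rough sketch of the shape I have in mind; the names are illustrative rather than the exact types in this PR:

```rust
// Rough sketch only; names are illustrative, not the exact types in this PR.
/// Which scope of media the job should analyze.
pub enum AnalyzeMediaJobVariant {
    AnalyzeSingleItem(String), // media ID
    AnalyzeLibrary(String),    // library ID
    AnalyzeSeries(String),     // series ID
    AnalyzeMediaGroup(String), // media group ID
}

/// The individual tasks a job fans out into; more can be added later.
pub enum AnalyzeMediaTask {
    /// Count the pages of a single image-based media item.
    AnalyzeImage(String), // media ID
}

pub struct AnalyzeMediaJob {
    pub variant: AnalyzeMediaJobVariant,
}
```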
Finally, the job currently always updates the number of pages on a media item, as opposed to just validating it. From the issue, it sounds like this isn't the goal, so I can change that alongside other changes.
In the meantime, I think this is a good place to discuss how this could be improved and made ready for integration into `develop`.
Okay, these last several commits get things into a generally workable state. There are probably still a few things to do. A few that come to mind:
- Add more doc comments and logging to the job.
- Any other media analysis tasks we need?
- Should the media's `page_count` always be updated, or do we want to only do that if it doesn't already have one?
- Need to enforce permissions properly in the server when the start analyze media task endpoint is hit (see the sketch after this list).
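For the permissions item, something along these lines is what I have in mind. To be clear, `RequestContext`, `UserPermission`, `ApiError`, and the job-dispatch call are placeholders for whatever the server actually exposes, not its real API:

```rust
use axum::{extract::Path, Extension};

// Hedged sketch: `RequestContext`, `UserPermission`, `ApiError`, and
// `dispatch_job` stand in for the server's real session, permission, and
// job-dispatch machinery.
async fn start_media_analysis(
    Extension(ctx): Extension<RequestContext>,
    Path(id): Path<String>,
) -> Result<(), ApiError> {
    // Reject the request unless the session user holds the right permission
    ctx.enforce_permissions(&[UserPermission::ManageLibrary])?;
    ctx.dispatch_job(AnalyzeMediaJob {
        variant: AnalyzeMediaJobVariant::AnalyzeSingleItem(id),
    })?;
    Ok(())
}
```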
> Any other media analysis tasks we need?
I think what you have is good for now with respect to what analysis tasks are performed, and you've built it already to be able to add more tasks as needed, which is 🔥
> Should the media's `page_count` always be updated or do we want to only do that if it doesn't already have one?
I'm not 100% sure yet. I think if it flat out doesn't exist, it's fine to set it to the value you read. The exception might be EPUB files.
Otherwise, I think some sort of reconciliation would be neat, e.g. the analysis generates a list of actions to be taken on the frontend. In the case of a mismatch, it would tell you what the current value is and what the analyzed value is, and you could decide the action.
However, that in itself is a huge feature. I'll err on the side of caution and say just persist a `JobExecuteLog` with a `WARN` level for now. I think folks are generally anxious about automatic operations on their carefully curated metadata library (even if it doesn't change the underlying metadata files).
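Something like this is the level of caution I'm picturing; the `JobExecuteLog` call here is illustrative rather than the exact API, and `ctx`, `path`, etc. are assumed to be in scope:

```rust
// Illustrative sketch: on a mismatch, record a warning on the job's
// execution log instead of silently rewriting anything.
if let Some(existing) = metadata_page_count {
    if existing != counted_pages {
        ctx.push_log(JobExecuteLog::warn(format!(
            "Page count mismatch for {path}: metadata says {existing}, analysis counted {counted_pages}"
        )));
    }
}
```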
So I was working on this pull request today and I realized that a media item's `page` is non-null.
Investigating further, to see whether an invalid value might be assigned under some conditions and ought to be overwritten (versus a properly assigned page count where no update is necessary, per your prior comment), I found something I hadn't considered: the pages for a file are always counted at initial creation during a scan (see the `process` function for each `FileProcessor` implementation).
That initially made the job here seem a bit unnecessary, since it would never produce a new value. However, thinking further, it might actually make the most sense to always update it. This way, should the page-counting methodology be updated in the future to correct an error, people can easily correct their databases by running the analyze job. In the interest of this, I am thinking of unifying the `get_page_count` and `process` logic so that both use the same code to count pages.
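Concretely, the unification I'm imagining is a single shared helper that both paths call, roughly like this (names illustrative):

```rust
// Sketch of the unification: one shared counter so the scanner's `process`
// and the analysis job's `get_page_count` can't drift apart.
// `FileEntry` and `is_image` are illustrative stand-ins.
fn count_image_pages(entries: &[FileEntry]) -> i32 {
    entries.iter().filter(|entry| entry.is_image()).count() as i32
}
```

Both the scan path and the analysis job would then call the same helper, so a future fix to the counting rules lands in both places at once.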
Let me know your thoughts here; if you agree, then I just need to get authentication working before this is finalized on the server side. If the random button I threw on each manage page is good enough for you, then it'll be ready to merge at that point as well.
> So I was working on this pull request today and I realized that a media item's `id` is non-null.
I'm not sure I understand; did you mean to refer to the `page` being non-nullable?
> the pages for a file are always counted at initial creation during a scan (see the `process` function for each `FileProcessor` implementation).
Yeah, good catch. There is a `pages` field directly embedded into a `media` file (IIRC it is -1 for EPUBs). This is mostly because there wasn't proper metadata support until a little later on, and that representation has just stuck around.
I think the big goal for the analysis is more about ensuring the metadata matches the actual count, not necessarily the `pages` field in the media directly (since, like you said, that was generated by Stump already and should just be correct). The metadata, which is read from misc files within e.g. an archive, is more subject to error. However, if someone were to manually edit the file, e.g. by adding a page, obviously that would lead to a mismatch in both.
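To spell out the distinction (field names approximate, not the exact schema):

```rust
// Approximate shape of the two page counts being discussed.
struct Media {
    /// Counted by Stump itself at scan time (IIRC -1 for EPUBs);
    /// should already be correct.
    pages: i32,
    metadata: Option<MediaMetadata>,
}

struct MediaMetadata {
    /// Read from metadata files inside e.g. an archive; more subject
    /// to error, so this is what analysis should validate.
    page_count: Option<i32>,
}
```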
> However, thinking further, it might actually make the most sense to always update it. This way, should the page-counting methodology be updated in the future to correct an error, people can easily correct their databases by running the analyze job.
I'm good with this 👍 I do like my idea for some sort of reconciliation flow (I'm starting to overuse this word now lol), but I don't think it is necessarily needed for this small of an operation.
> In the interest of this, I am thinking of unifying the `get_page_count` and `process` logic so that both use the same code to count pages.
> Let me know your thoughts here; if you agree, then I just need to get authentication working before this is finalized on the server side. If the random button I threw on each manage page is good enough for you, then it'll be ready to merge at that point as well.
Yeah I think that makes sense, I agree 👍
I also think the random buttons are good for now, especially since this is based on `experimental`.
> I'm not sure I understand; did you mean to refer to the `page` being non-nullable?
Yep, typo. Will edit the original comment.
This last commit does two things. First, it fixes an error in the way I had the job counting pages for RARs (it needed to check whether each entry was an image). Second, it addresses a TODO in `process` that suggested using the same validation as in `get_page` for counting pages (basically checking if the entry is an image).
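The image check is roughly the following; this is illustrative only, since the real code goes through the processor's existing content-type detection rather than a hardcoded extension list:

```rust
use std::path::Path;

// Illustrative stand-in for the image check applied to each archive entry
// before it is counted as a page.
fn is_image_entry(name: &Path) -> bool {
    name.extension()
        .and_then(|ext| ext.to_str())
        .map(|ext| {
            matches!(
                ext.to_ascii_lowercase().as_str(),
                "jpg" | "jpeg" | "png" | "gif" | "webp" | "avif"
            )
        })
        .unwrap_or(false)
}
```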
Once you've confirmed that this is what you want, this should be good to merge.