
Image analysis job

JMicheli opened this issue 10 months ago · 6 comments

This pull request creates a first draft of an image analysis job (#181), which will need to be updated further before merging.

I chose a slightly broader approach to this job, making an AnalyzeMediaJob which includes an AnalyzeImage task. This is intended to be more extensible, as additional stages to media analysis will probably be wanted later on.

The job isn't fully integrated. Although it is written with four variants (Individual, Library, Series, and MediaGroup), I've only implemented a client-side way to access the Individual analysis command. I did this by adding an "Analyze Media" button to the "Manage" page for individual media items, which seemed like the right place for it. Implementing the job fully would involve adding the same button to the other menus and linking them up on the axum side. The axum side also needs authentication boilerplate added.
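To make the discussion concrete, here's a rough sketch of how the variant dispatch might look. The type name `AnalyzeMediaJobVariant`, the ID payloads, and the stubbed resolution are assumptions for illustration, not the actual Stump types:

```rust
/// Hypothetical variant enum for the analysis job; the real Stump
/// type names may differ.
#[derive(Debug, Clone)]
pub enum AnalyzeMediaJobVariant {
    /// Analyze a single media item by its ID
    AnalyzeSingleItem(String),
    /// Analyze every media item in a library
    AnalyzeLibrary(String),
    /// Analyze every media item in a series
    AnalyzeSeries(String),
    /// Analyze an arbitrary group of media IDs
    AnalyzeMediaGroup(Vec<String>),
}

/// Resolve a variant into the list of media IDs to analyze.
/// Library/series resolution would query the database in the real
/// job; here it is stubbed to show only the dispatch shape.
pub fn media_ids_for(variant: &AnalyzeMediaJobVariant) -> Vec<String> {
    match variant {
        AnalyzeMediaJobVariant::AnalyzeSingleItem(id) => vec![id.clone()],
        AnalyzeMediaJobVariant::AnalyzeMediaGroup(ids) => ids.clone(),
        // Placeholder: a real implementation would look up all media
        // in the library or series instead of returning nothing.
        AnalyzeMediaJobVariant::AnalyzeLibrary(_)
        | AnalyzeMediaJobVariant::AnalyzeSeries(_) => Vec::new(),
    }
}
```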

Finally, the job currently always updates the number of pages on a media item rather than just validating it. From the issue, it sounds like that isn't the goal, so I can change it alongside the other changes.

In the meantime, I think this is a good place to discuss how this could be improved and made ready for integration into develop.

JMicheli avatar Apr 15 '24 23:04 JMicheli

Okay, these last several commits get things into a generally workable state. There are probably a few things to do still. A few that come to mind:

  • Add more doc comments and logging to job.
  • Any other media analysis tasks we need?
  • Should the media's page_count always be updated or do we want to only do that if it doesn't already have one?
  • Need to enforce permissions properly in server when the start analyze media task endpoint is hit.
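For the permissions item, the guard the endpoint runs before enqueueing the job could look something like this. The permission string `"media:analyze"` and the helper are placeholders, not Stump's actual permission API:

```rust
/// Hypothetical permission guard for the analyze endpoint. Stump's
/// real user/permission types differ; this only shows the shape of
/// the check the axum handler would run before enqueueing the job.
pub fn can_start_analysis(user_permissions: &[&str], is_server_owner: bool) -> bool {
    // Server owners bypass granular checks; everyone else needs an
    // explicit (hypothetical) "media:analyze" permission.
    is_server_owner || user_permissions.contains(&"media:analyze")
}
```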

JMicheli avatar Apr 22 '24 06:04 JMicheli

Any other media analysis tasks we need?

I think what you have is good for now with respect to which analysis tasks are performed, and you've already built it to support adding more tasks as needed, which is 🔥

Should the media's page_count always be updated or do we want to only do that if it doesn't already have one?

I'm not 100% sure yet. I think if it flat out doesn't exist, it's fine to set it to the value you read. The exception might be EPUB files.

Otherwise, I think some sort of reconciliation would be neat, e.g. the analysis generates a list of actions to be taken on the frontend. In the case of a mismatch, it would tell you what the current value is and what the analyzed value is, and you could decide the action.

However, that in itself is a huge feature. I'll err on the side of caution and say to just persist a JobExecuteLog with a WARN level for now. I think folks are generally anxious about automatic operations on their carefully curated metadata library (even if it doesn't change the underlying metadata files).
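A sketch of that suggested behavior: produce a WARN-level message on mismatch instead of mutating curated metadata. The message format and the `Option<String>` return are assumptions; the real `JobExecuteLog` type lives in Stump's job system:

```rust
/// Compare the stored page count against the freshly analyzed one.
/// On mismatch, return a WARN-style message to persist as a job log
/// entry rather than silently overwriting curated metadata.
pub fn reconcile_page_count(
    media_id: &str,
    stored: Option<u32>,
    analyzed: u32,
) -> Option<String> {
    match stored {
        // No stored value yet: safe to just set it, nothing to warn about.
        None => None,
        Some(s) if s == analyzed => None,
        Some(s) => Some(format!(
            "WARN: media {media_id} has page_count {s} but analysis counted {analyzed}"
        )),
    }
}
```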

aaronleopold avatar Apr 24 '24 01:04 aaronleopold

So I was working on this pull request today and I realized that a media item's page is non-null.

Investigating further, to see whether an invalid value might be assigned under some conditions and need to be overwritten (versus a properly assigned page count needing no update, per your prior comment), I found something I hadn't considered: the pages for a file are always counted during its initial creation in a scan (see the process function for each FileProcessor implementation).

That initially made the job here seem a bit unnecessary, since it would never produce a new value. However, thinking further, it might actually make the most sense to always update it. This way, should the page counting methodology be updated in the future to correct an error, people can easily correct their databases by running the analyze job. In the interest of this, I am thinking of unifying the get_page_count and process logic so that both use the same code to count pages.
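One way that unification could look: a single helper that both get_page_count and process call, parameterized over the image check so the per-format logic lives in exactly one place. All names here are illustrative, not the actual Stump signatures:

```rust
/// Count pages by filtering archive entry names through a shared
/// image predicate, so that `get_page_count` and `process` cannot
/// drift apart in how they count.
pub fn count_pages<'a, I, F>(entry_names: I, is_image: F) -> u32
where
    I: IntoIterator<Item = &'a str>,
    F: Fn(&str) -> bool,
{
    entry_names
        .into_iter()
        .filter(|&name| is_image(name))
        .count() as u32
}
```

Each `FileProcessor` would then only supply its iterator of entry names and its image predicate.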

Let me know your thoughts here. If you agree, then I just need to get authentication working before this is finalized on the server side. If the random button I threw on each manage page is good enough for you, then it'll be ready to merge at that point as well.

JMicheli avatar Apr 27 '24 22:04 JMicheli

So I was working on this pull request today and I realized that a media item's id is non-null.

I'm not sure I understand, did you mean to refer to the page being non-nullable?

the pages for a file are always counted during initial creation during a scan (see the process function for each FileProcessor implementation).

Yeah good catch, there is a pages field directly embedded into a media file (IIRC it is -1 for EPUBs). This is mostly because there wasn't proper metadata support until a little later on, and that representation has just stuck around.

I think the big goal for the analysis is more about ensuring the metadata matches the actual count, not necessarily the pages field in the media directly (since, like you said, that was generated by Stump already and should just be correct). The metadata read from misc files within e.g. an archive is more subject to error. However, if someone were to manually edit the file, e.g. by adding a page, that would obviously lead to a mismatch in both.

However, thinking further, it might actually make the most sense to always update it. This way, should the page counting methodology be updated in the future to correct an error, people can easily correct their databases by running the analyze job.

I'm good with this 👍 I do like my idea for some sort of reconciliation flow (I'm starting to overuse this word now lol) but I don't think it is necessarily needed for this small of an operation.

In the interest of this, I am thinking of unifying the get_page_count and process logic so that both use the same code to count pages.

Let me know your thoughts here. If you agree, then I just need to get authentication working before this is finalized on the server side. If the random button I threw on each manage page is good enough for you, then it'll be ready to merge at that point as well.

Yeah I think that makes sense, I agree 👍

I also think the random buttons are good for now, especially since this is based on experimental

aaronleopold avatar Apr 27 '24 23:04 aaronleopold

I'm not sure I understand, did you mean to refer to the page being non-nullable?

Yep, typo. Will edit the original comment.

JMicheli avatar Apr 27 '24 23:04 JMicheli

This last commit does two things. First, it fixes an error in how the job counted pages for RARs (it needed to check whether each entry was an image). Second, it addresses a TODO in process suggesting that page counting use the same validation as get_page (basically, checking whether the entry is an image).
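The image check described above boils down to something like the following. This is a simplified stand-in with a hard-coded extension list; the actual code would presumably reuse Stump's shared content-type detection across `FileProcessor` implementations:

```rust
/// Simplified image check for an archive entry name, matching by
/// file extension. A real implementation would reuse Stump's shared
/// content-type detection rather than a hard-coded list.
pub fn entry_is_image(name: &str) -> bool {
    let lower = name.to_ascii_lowercase();
    ["jpg", "jpeg", "png", "gif", "webp", "avif"]
        .iter()
        .any(|ext| lower.ends_with(&format!(".{ext}")))
}
```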

Once you've confirmed that this is what you want, this should be good to merge.

JMicheli avatar Apr 29 '24 03:04 JMicheli