
[Question] Change the number of 'thumbnails' multiple times - I have to re-generate the thumbnails each time. It seems not logical

Open jeffward01 opened this issue 1 year ago • 4 comments

Issue

Each time a 'scan' occurs and the number of thumbnails changes, the thumbnails are not 'stored' or 'remembered'. This results in a very long file scan time.

Please consider the following scenario

Scenario

  • Directory: `/root/library`
  • Contents: 10 video files

Action 1

  • I set the configuration to generate (2) thumbnails
  • I scan `/root/library`
  • The thumbnails are generated to produce (2) thumbnails per video file

Action 2

  • I set the configuration to generate (5) thumbnails
  • I scan `/root/library`
  • The thumbnails are re-generated to produce (5) thumbnails per video file

Action 3 (this is the important step)

  • I set the configuration to generate (2) thumbnails
  • I scan `/root/library`
  • The thumbnails are re-generated to produce (2) thumbnails per video file

Action 3 (this is the important step), alternate version

  • I set the configuration to generate (4) thumbnails
  • I scan `/root/library`
  • The thumbnails are re-generated to produce (4) thumbnails per video file

Expected behavior

  • Action 3

    • The thumbnails from 'Action 1' are used and re-generation is not required
  • Action 3, alternate version

    • Each video already has (2) thumbnails, which exist from 'Action 1'.
    • Only (2) new thumbnails need to be generated per video, which results in a grand total of (4) thumbnails per video.
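The expected reuse described above can be sketched as a cache keyed by file and thumbnail slot. This is a hypothetical illustration of the *requested* behavior, not VDF's actual design (all names are made up), and it assumes that a thumbnail's sample position stays fixed when the count changes:

```python
# Hypothetical sketch of the requested caching behavior. Assumes thumbnail
# sample positions are stable across thumbnail counts (not how VDF picks
# positions today); names are illustrative only.

def ensure_thumbnails(cache, video, count):
    """Generate only the thumbnails not already cached; return how many
    new ones had to be made."""
    generated = 0
    for slot in range(count):  # fixed slot indices 0..count-1
        key = (video, slot)
        if key not in cache:
            cache[key] = f"thumb:{video}:{slot}"  # stand-in for image data
            generated += 1
    return generated

cache = {}
print(ensure_thumbnails(cache, "a.mkv", 2))  # 2 -> Action 1: both generated
print(ensure_thumbnails(cache, "a.mkv", 2))  # 0 -> Action 3: full reuse
print(ensure_thumbnails(cache, "a.mkv", 4))  # 2 -> alternate: only 2 new
```

Under this assumption, lowering the count costs nothing and raising it only generates the missing slots, matching the expected behavior above.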

Question

Does this functionality exist?

Context --> Why I suggest this feature

  • I have a large 120TB library that is not very organized; I am constantly adding new files to it, so I do not have the opportunity to create a very 'clean' structure
  • For a 120TB library, a scan can easily take 3-5 days
  • Sometimes I miss duplicates at (2) thumbnails, so I increase to (4) thumbnails, which results in a longer scan
  • I perform a scan with (4) thumbnails
  • I add new files to the directory
  • Now I decrease the thumbnails to (2) per video file to increase the scan speed
  • The entire library needs to be 're-scanned' which still results in a scan of 3-5 days

jeffward01 avatar Nov 27 '22 06:11 jeffward01

Graybyte values are saved. The thumbnails of the found duplicates are not saved between multiple scans. This is by design, as these thumbnails can take a lot of space. But these thumbnails shouldn't affect scan speed, as they're generated after the scan is done.

0x90d avatar Nov 27 '22 07:11 0x90d

> Graybyte values are saved. The thumbnails of the found duplicates are not saved between multiple scans. This is by design as these thumbnails can take a lot of space. But these thumbnails shouldn't affect scan speed as they're generated after scan is done.

Let me do some testing to verify, because in my experience, if I repeat the above steps it takes days to complete a scan.

Perhaps I am mixing up some settings and 're-scanning' the database, so that the database is dumped and then rescanned.

I will test and verify this, then report back to you on my findings either way 🙌


Question

> The thumbnails of the found duplicates are not saved between multiple scans.

  1. Any interest in making these files optionally persist to a target location, and perhaps add a flush feature?

For example, I have at least 159,142 video files in my library haha. So this means each time I run a scan, it will need to generate 159,142 * thumbnailCount thumbnails.

  2. Just for estimation reasons, do you happen to know the very roughly approximate size of a thumbnail in kb? I'm just trying to estimate how large the cache could grow.

Honestly tho, it can't be worse than JetBrains cache size 🙃

jeffward01 avatar Nov 29 '22 09:11 jeffward01

I have not tried the latest versions of VDF ..., but in principle it has worked like this so far:

  • Based on the set thumbnail count, a corresponding number of mini grayscale thumbnails are created from all videos, which are used for duplicate search
  • The timestamps at which these graybyte values are created result (if I remember right) from the respective video length divided by the number of thumbnails + 1. So with one thumbnail, at half the duration; with two, at 1/3 and 2/3; and so on. Therefore, in most cases, for each thumbnail count not yet scanned, a corresponding number of graybyte values must be created, except for those whose divisors happen to match (such as the 1/2 for thumbnail counts 1 and 3). The positions to be compared with each other must match, so you can't just add some more to the existing ones. (Otherwise the result of the duplicate search would no longer depend only on the selected number, but on what was selected sometime earlier.)
  • If you don't delete the database and then return to a previously scanned thumbnail count, the corresponding graybyte values should still exist, so they don't have to be recreated.
  • The displayed thumbnails of the duplicates, however, do need to be recreated. But their number is not directly dependent on the number of videos in the library, and in most cases it will be "relatively" low.
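The position scheme described above can be sketched as follows; it shows why sample points from different counts rarely line up (a sketch based on the i / (count + 1) rule recalled above, not VDF's actual source):

```python
# Sketch of the position rule described above: for a thumbnail count n,
# samples are taken at i / (n + 1) of the video duration, i = 1..n.
from fractions import Fraction

def sample_points(count):
    """Sample positions as fractions of the video duration."""
    return {Fraction(i, count + 1) for i in range(1, count + 1)}

shared_13 = sample_points(1) & sample_points(3)
shared_24 = sample_points(2) & sample_points(4)
print(shared_13)  # {Fraction(1, 2)} -> count 1's value is reusable at count 3
print(shared_24)  # set() -> nothing from count 2 carries over to count 4
```

This makes the "matching divisors" exception concrete: counts 1 and 3 both sample at 1/2, but counts 2 (1/3, 2/3) and 4 (1/5 .. 4/5) share no positions at all, so switching between them requires a full re-scan.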

Maltragor avatar Nov 29 '22 20:11 Maltragor

Thank you @Maltragor for explaining that, that makes a lot of sense how it is “averaged out” to an even ratio like in the example you gave.

If I made a pull-request, do you think it would be helpful if the algorithm had a “memory” and would make some sort of adjustments to not “re-scan” entries?

The adjustment would be to refactor how it selects where the samples are taken, essentially by pre-setting slots.

Such as, for example, if you have (3) thumbnails, let’s say the thumbnails are at these positions:

  • 5% mark
  • 50% mark
  • 75% mark

4 thumbnails:

  • 5% mark
  • 33% mark
  • 50% mark
  • 75% mark

2 thumbnails:

  • 5% mark
  • 50% mark

1 thumbnail:

  • 5% mark

As an example ^^.

I don’t see why it is necessary to re-pick the marks “by ratio” in an “even way” for each number of thumbnails. If it has pre-determined slots like in my example it could be a lot faster.
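The fixed-slot idea can be sketched directly from the percentages given above (the slot table is purely illustrative, taken from the example, not an actual VDF proposal):

```python
# Sketch of the proposed fixed-slot scheme. Slot percentages are taken
# from the example above and are purely illustrative.
SLOTS = {
    1: [5],
    2: [5, 50],
    3: [5, 50, 75],
    4: [5, 33, 50, 75],
}

def extra_work(old_count, new_count):
    """Positions that must be newly sampled when changing the count."""
    return sorted(set(SLOTS[new_count]) - set(SLOTS[old_count]))

print(extra_work(2, 4))  # [33, 75] -> only two new samples per video
print(extra_work(4, 2))  # []       -> lowering the count costs nothing
```

Because each count's slots are a subset of the next larger count's, raising the count only samples the new positions and lowering it reuses everything, which is exactly the reuse the original scenario asks for.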

Questions:

1.) If I made a PR for this, would it be something that would be accepted, or does it logically break something?

2.) Given the example of (2) 60-minute-long movies that are identical, but where each movie has a DIFFERENT, exactly 10-second intro - in the current algorithm, would a duplicate be detected? My assumption is no, because it would not see matching gray pixels in the first 10 seconds. Is this correct?

jeffward01 avatar Dec 10 '22 05:12 jeffward01