apps-android-commons icon indicating copy to clipboard operation
apps-android-commons copied to clipboard

Speed up the search process for available file names in findUniqueFilename

Open whym opened this issue 6 years ago • 17 comments

If the user-chosen file name is occupied by an existing file, the app sequentially tries to find an available file name by trying suffixes _1, _2, etc. When there are hundreds of these numbered files, it will be unbearably slow. It already does for Test.jpg on the Beta Commons. I believe the same thing happens when the user uploads a lot of files under the same name.

Proposed solution:

Change the file name pattern from <user-given title>_<number>.jpg to <user-given title>_<short hash>.jpg when the user-chosen file name is unavailable. It should extend to a longer hash when collisions repeat. When that's not enough (let's say 3 consecutive collisions on random generation), we could add another 4, and repeat.

Steps to reproduce:

Build betaDebug and try uploading a file under the name Test. It will take a longer time (> 1 min in my WiFi environment) than it should.

Commons app version:

latest master

(Note: the task description has been rewritten based on feedback below.)

whym avatar Feb 22 '19 12:02 whym

@whym I would like to solve this issue

madhurgupta10 avatar Feb 22 '19 12:02 madhurgupta10

@madhurgupta10 thanks, that would be great! One thing I forgot to mention - currently the Beta Commons is read-only for maintenance and it might be a bit difficult to reproduce the slowness. It's been fixed since then.

whym avatar Feb 22 '19 13:02 whym

@whym The "binary search" I know is https://en.wikipedia.org/wiki/Binary_search_algorithm , is that what you mean? Or is it a different Mediawiki API URL?

nicolas-raoul avatar Feb 22 '19 13:02 nicolas-raoul

@nicolas-raoul yes, I was thinking about the for-loop or recursive version of the algorithm. I don't think MediaWiki provides such a search function.

I assumed 'beginner' here was Android beginner, not CS beginner, seeing that many new comers say they are CS students.

whym avatar Feb 22 '19 14:02 whym

A binary search is usually performed on a finite number of elements, but here there is a quasi-infinite number of elements. We could restrict the binary search to only 1000 elements, but images Test_1.jpg to Test_1000.jpg might all exist already.

How about just using something random like a UUID (or maybe a bit shorter)? As Commons switches to using captions, having pretty filenames is becoming a bit less important than before. :-)

nicolas-raoul avatar Feb 22 '19 14:02 nicolas-raoul

Well spotted! Pinging @maskaravivek for input as he implemented the most recent version of our upload backend.

misaochan avatar Feb 22 '19 17:02 misaochan

We could restrict the binary search to only 1000 elements, but images Test_1.jpg to Test_1000.jpg might all exist already.

You are right, a simple binary search cannot be used. Perhaps we can use the 2^n sequence instead - 1, 2, 4, 8 instead of 1, 2, 3, 4? To make sure no gap is created because of this, you would have to go backward to find the least available position, though. (So if Test_1.jpg - Test_10.jpg exist and you find Test_16.jpg to be available after failing _1, _2, _4, _8, you will have to go backward to find 11 is the least one.) ... not sure how much improvement this would be, actually, but that's the best I can think of at the moment.

How about just using something random like a UUID (or maybe a bit shorter)?

Commons users (mainly admins who have to deal with a lot of files in a short time) might not be ready for entirely meaningless file names, but the scheme of <user-given title>_<short hash (maybe 4 hexadecimals)>.jpg might be viable for avoiding file name collisions. When that's not enough (let's say 3 consecutive collisions on random generation), we could add another 4.

whym avatar Feb 23 '19 01:02 whym

@whym shall I start implementing the Binary Search technique you mentioned under the previous comment?

madhurgupta10 avatar Feb 23 '19 15:02 madhurgupta10

For this scenario where the first part of the string is the same among multiple search results, a Trie is one of the common structures used (https://en.wikipedia.org/wiki/Trie). I don't think our storage is in the right format for this but if we are loading values into the app we can consider using this structure. This could be too complex for the amount of improvement we're looking for.

An entirely different solution would be to create a suffix which is a hash of the current date time to ensure uniqueness and completely skip the whole searching for a unique filename thing.

albendz avatar Feb 23 '19 19:02 albendz

@madhurgupta10: It seems like this might be harder than I first imagined, sorry about that. The end result might be simple enough, but there are a couple of approaches and I don't know which one is the best at the moment. I'd wait for a while to see what people think.

@albendz brought up a good point. Not exactly trie, it looks like we can combine search operators to find what we want: e.g. Parmadan Forest <number>.jpg (It's slightly different, though, since it returns files with zero-padded numbers'01', '02' that we don't try creating.)

  1. Use binary search to find the first available sequence number with an arbitrarily max (let's say 1000 would be enough for most cases). Some more work will need to be done for the rare cases where you have to go beyond the chosen max.
  2. Change the file name pattern to <user-given title>_<short hash>.jpg when the user-chosen file name is unavailable. It should extend to longer hash when collisions repeat. (See my comment above too.)
  3. Use Search API to get the list of all existing files under the given pattern, and find the next available name. (See above within this comment.)

Number 1 was my first idea, but it now seems less effective than the other options. I'm now inclined to support 3, although 2 wouldn't be too bad, too.

whym avatar Feb 24 '19 02:02 whym

Thanks for analyzing the different strategies! I vote for (2) because it is the fastest and easiest.

And also because small numbers can make a filename ambiguous. For instance imagine I take a picture of an interesting building and name it with its address: 南青山 2 - 3 - 1. Appending #1 would make the name very confusing, whereas adding something like #6s2p would be less confusing as it is clearly a hash. By the way, I would suggest using 5 characters instead of 4, because with 4 a large proportion would looks like years: #1789. (I suggest using # instead of - to make it clear that it is a hash)

nicolas-raoul avatar Feb 25 '19 03:02 nicolas-raoul

I like the second approach but would like to suggest some modification in the upload flow.

Background context about current flow:

If the user chooses a title that already exists, we show a warning dialog informing the user that the title already exists and asking whether he wants to proceed or not.

If the user clicks on No > he stays on the same page and he can manually edit the title.

If the user clicks on Yes > he proceeds further and once the upload is queued, he would discover the actual title that was assigned to his upload.

It might surprise the user to see 1, 2, 3 etc (or based on the new discussion some hash) appended to the title. Will it be more fruitful to show the final title to the user(ie. on the upload screen itself) before he proceeds further. Seeing an unaesthetic title might encourage the user to actually update the title to something more meaningful.

maskaravivek avatar Feb 25 '19 03:02 maskaravivek

Thanks for the input, everyone. It sounds like the short hash approach is the way to go. I've changed the task description accordingly.

@maskaravivek: do you think the UI change you described (which is a great idea) can be a separate task? I believe it can be, although hashes might make it a bit more surprising than numbers.

whym avatar Mar 01 '19 13:03 whym

Yup for sure it should be a separate task. I just wanted to point it out :)

maskaravivek avatar Mar 01 '19 14:03 maskaravivek

@madhurgupta10: do you still want to work on the modified task? If yes, that's great. If no, that would be reasonable too, given that it has been significantly changed. Thanks!

whym avatar Mar 05 '19 13:03 whym

@whym I can still work on the issue :)

madhurgupta10 avatar Mar 05 '19 18:03 madhurgupta10

@nicolas-raoul Is this issue still open and would I be able to work on it?

AlexG-2003 avatar Oct 08 '24 23:10 AlexG-2003