image-match
Signature Compression
Currently the entire signature is stored as an array, and the words are stored as 32-bit(?) integers. Investigate compression schemes.
I'm having a look at the paper. Would be great if I could draw some insights from it. Will get back to you! 👍
One idea: at the moment, signatures are stored as big int arrays, e.g.:
[-1, -2, 0, 0, 1, ... , 4, -1]
And inserted directly into the database. This takes a lot of space on disk. A better way might be to convert them to strings. For instance, if the signature (of length n) values run from -2..+2, you could make them all non-negative by adding 2:
[1, 0, 2, 2, 3, ... , 4, 1]
You can convert this into a single number by treating the entries as digits in base 5:
T = 1*5^0 + 0*5^1 + 2*5^2 + 2*5^3 + 3*5^4 + ... + 4*5^(n-2) + 1*5^(n-1)
By this construction, T is unique. So now you will have a very large integer. You can then compress it by factoring it in a larger base. Say you take all the digits 0-9 and letters a..z. That's 10 digits + 26 letters = base 36. How many characters will you need?
The maximum value of T is 5^n - 1 (when every coefficient is 4, check my math here), so guaranteeing that an m-character base-36 encoding can represent any signature requires 36^m - 1 >= 5^n - 1.
Solving for m (again, check my math): m >= log_36(5^n) = n * log_36(5). As you can see, the formula is general (any pair of bases works the same way), and m is how many characters your encoded signature will need. With the default n = 648:
import math
math.log(5**648, 36)
Gives 291.03118615207245 so we need a 292-character string. If you use upper and lowercase, you can get it down to 253. Still long, but probably less space than a big array of ints (should check this first though!).
To get it back to an array, you have to do this whole process in reverse.
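The whole round trip can be sketched like this (a minimal sketch, not project code; `encode_signature`/`decode_signature` are hypothetical names, and the alphabet is digits plus lowercase letters, i.e. base 36):

```python
import string

ALPHABET = string.digits + string.ascii_lowercase  # base 36

def encode_signature(sig, offset=2, base=5):
    """Pack a signature with values in [-offset, base-1-offset] into a base-36 string."""
    # Shift values to be non-negative and read them as digits of one big base-5 integer.
    t = 0
    for i, v in enumerate(sig):
        t += (v + offset) * base ** i
    # Re-express that integer in base 36.
    if t == 0:
        return ALPHABET[0]
    chars = []
    while t:
        t, r = divmod(t, len(ALPHABET))
        chars.append(ALPHABET[r])
    return ''.join(chars)

def decode_signature(s, length, offset=2, base=5):
    """Reverse the process: base-36 string back to the original signature."""
    t = 0
    for i, c in enumerate(s):
        t += ALPHABET.index(c) * len(ALPHABET) ** i
    sig = []
    for _ in range(length):
        t, r = divmod(t, base)
        sig.append(r - offset)
    return sig

sig = [-1, -2, 0, 0, 1, 2, -2, 1]
assert decode_signature(encode_signature(sig), len(sig)) == sig
```

For a 648-entry signature of all 4s (after shifting), the encoded string comes out at 292 characters, matching the bound above.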
Is doing so much computation to conserve a few bytes worth it? (Maybe I'm wrong.) In my opinion, if you are looking at reducing the signature size, that process has to take place before you add the values to the big int array, not after generating the values. In your approach, we assume we have access to a signature array. That shouldn't be the case, right? Because then we'd have both the array and the string in memory. Please correct me if I'm wrong.
Maybe developing a method where values are encoded/compressed as they are being generated would help?
The space savings are not insignificant, especially if you are storing billions of records. For context, I did a little experiment with 100 images and made three different indexes -- tester, which is normal; tester_str, where I replaced the signature with a length-253 string; and tester_none, where I removed the signature completely:
http://localhost:9200/tester,tester_str,tester_none/_stats/store
"indices": {
  "tester_str": {
    "primaries": {
      "store": {
        "size_in_bytes": 702295,
        "throttle_time_in_millis": 0
      }
    },
    "total": {
      "store": {
        "size_in_bytes": 702295,
        "throttle_time_in_millis": 0
      }
    }
  },
  "tester": {
    "primaries": {
      "store": {
        "size_in_bytes": 905047,
        "throttle_time_in_millis": 0
      }
    },
    "total": {
      "store": {
        "size_in_bytes": 905047,
        "throttle_time_in_millis": 0
      }
    }
  },
  "tester_none": {
    "primaries": {
      "store": {
        "size_in_bytes": 647291,
        "throttle_time_in_millis": 0
      }
    },
    "total": {
      "store": {
        "size_in_bytes": 647291,
        "throttle_time_in_millis": 0
      }
    }
  }
}
Another (complementary) route is to remove the extraneous zeros from the signature -- any grid point on an edge with missing neighbors gets a zero for the difference. These are completely useless and waste space... which is my lazy engineering =)
Any plans on addressing this concern? It feels like with the current approach, an index with a billion images may weigh hundreds of gigabytes.
I got an index with 200k images, which is 710.4 MB (Elasticsearch index size). So for 2M records, I imagine the size will be roughly 7 GB. Speaking of a scale of billions, 2B records would be around... uhm... 7 TB? That sounds really scary.
Any thoughts?
Hi @ecdeveloper,
Yes, the signatures are pretty heavy and your ballpark figures are correct. Disk space was not a concern for us when we developed this, so we didn't put much effort into making it more efficient.
I don't have time to work on new image-match features, but if someone wants to work on a PR improving signature compression, I'll happily support them. Probably the easiest way is just to modify the code to not store the signature at all -- and just return images that match on the most words.
Thank you for your reply. I'd totally be happy to contribute to this amazing project.
I already started looking into the source code, and I was curious: is the signature stored in the DB used at all? From what I saw, you currently match images by words, so I wonder what the purpose of storing those signatures was.
Thanks.
That's exactly right. With the Elasticsearch implementation, the actual image distances are only used to sort the final results. This could easily be done with the Elasticsearch match score. So it's really not necessary to store the signatures.
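For instance, a words-only query could be built roughly like this (a hedged sketch, not the driver's actual code; `build_word_query` is a hypothetical helper, and the field names just follow the `simple_word_*` convention the driver uses):

```python
def build_word_query(words, size=10):
    """Build an Elasticsearch query body that matches on signature words only.

    Ranking then comes from the Elasticsearch _score instead of a stored
    signature distance: the more word fields a document matches, the higher
    its score.
    """
    should = [{'term': {'simple_word_%d' % i: w}} for i, w in enumerate(words)]
    return {'query': {'bool': {'should': should}}, 'size': size}

# The result would be passed as the body of es.search(index=..., body=...)
```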
Want to give it a shot?
Yes, for sure. I suppose I'll need to tweak this query object, right?
https://github.com/ascribe/image-match/blob/master/image_match/elasticsearch_driver.py#L57
@rhsimplex I created a quick prototype locally, which doesn't use signatures and distances, and returns only those results where score is greater than 0. But there is one problem with that approach. Currently, you can pass distance_cutoff as an argument to SignatureES(). If we rely on the score instead of distance - distance_cutoff won't have any effect.
Do you have any thoughts on how to deal with that issue?
You shouldn't need to alter the database query. SignatureES overrides the __init__ method. You can have it explicitly take a dist argument, with a default of None. Then pass the base constructor a dummy value (or nothing; it will just take the default) and make sure to conditionally change every line in SignatureES that uses self.distance_cutoff, i.e. if dist is None, skip this statement and sort by score instead of dist here. You might need another parameter like score_cutoff in the constructor.
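Roughly, the change could look like this (a sketch only; `filter_hits` and `score_cutoff` are hypothetical names, not the driver's actual API, and the base-class plumbing is omitted):

```python
class SignatureES:
    """Sketch of the constructor change: distance_cutoff becomes optional."""

    def __init__(self, es, index='images', distance_cutoff=None, score_cutoff=None):
        self.es = es
        self.index = index
        self.distance_cutoff = distance_cutoff  # None means "rank by ES score"
        self.score_cutoff = score_cutoff

    def filter_hits(self, hits):
        """hits: list of dicts with 'dist' and 'score' keys."""
        if self.distance_cutoff is None:
            # No distance available: rank by ES score, highest first,
            # optionally dropping anything below score_cutoff.
            ranked = sorted(hits, key=lambda h: h['score'], reverse=True)
            if self.score_cutoff is not None:
                ranked = [h for h in ranked if h['score'] >= self.score_cutoff]
            return ranked
        # Old behavior: filter by distance and sort ascending.
        return sorted((h for h in hits if h['dist'] < self.distance_cutoff),
                      key=lambda h: h['dist'])
```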
Makes sense.
Running more tests, though, makes me feel that relying on the score doesn't sound like a great idea.
I indexed one image, then I searched for a bunch of similar images, and here are the results I got (providing the score and dist):
[{'score': 1.0, 'dist': 0.73803269432014051, 'id': u'AVvNXJCpv9pvUKrtAH8F'}]
--
[{'score': 10.0, 'dist': 0.3448189311714156, 'id': u'AVvNXJCpv9pvUKrtAH8F'}]
--
[{'score': 15.0, 'dist': 0.38898318088094741, 'id': u'AVvNXJCpv9pvUKrtAH8F'}]
--
[{'score': 12.0, 'dist': 0.47407832345730783, 'id': u'AVvNXJCpv9pvUKrtAH8F'}]
According to the docs, higher scores are better. However, here we see that in the 2nd case the score is 10 and the dist is 0.3, whereas in the 4th case the score is 12 (higher than the 2nd) while the dist is 0.47 (which is not a match).
Actually this looks good to me. As long as we choose the default threshold to still get most of the positive results (high recall) it's ok to have worse specificity. Especially since the tradeoff involves such a huge disk space savings. It was unlikely distance would exactly track score anyway.
0.45 was chosen rather arbitrarily, based on the kinds of images I happened to develop this project for. You'll see in the original paper they suggest different thresholds.
In fact, in the reference implementation they often don't compute the distance at all, which lends some credibility to our approach. If you are willing to look into their code, there may be some more tricks too.
Ah, you are right. After comparing those images by looking at them, the one with distance 0.47 and score of 12.0 - is very similar to the indexed one. So in this case score is even showing a better match than dist. Do you have any suggestions on the default (recommended) threshold for score?
Something between 1 and 10? :rofl: Seriously though, just pick something in that range. If you really want to fix narrower bounds on value, you can look at the tests and see which images should and shouldn't match. By the way, Elasticsearch uses TF/IDF scoring, if you're curious what the score actually means.
Now I'm slightly confused 😕 I ran lots of tests with different images and noticed that I always get Elasticsearch scores ranging from 1 to 63. 63 is usually a 1-to-1 match (normally when I search for the exact image I indexed), and 1.0 means no match. After running all those tests, I figured that a score of 20.0 to 60.0 means an almost identical image match, 10.0 to 20.0 means very similar, and anything under 9.0 is either a weak match or no match at all. Might this be related to a different Elasticsearch version? I'm using Elasticsearch 5.3.0 with Lucene 6.4.1.
There are 63 words available for search. Hitting the top TF/IDF score means all those words are exact matches (the TF part) and every word is unique in the corpus (the IDF part). I believe 63 is the highest possible score then, and as you add more images and repeat the search, the number will likely go down.
So you are justified in being confused, because my explanation was bad! So how about this -- ignore the cutoff completely, and just return the top size results?
I could definitely do that, but I see one issue with that approach. Let's say there is no match in the index for the image I'm searching by, but there are slightly similar images. The top result would then have a score of, let's say, 7. That score suggests a partial match, but in fact the images are completely different.
That's ok. It's a typical feature of "search engine"-like applications...some of the results are basically meaningless. But if you were doing a search, wouldn't you rather have a look at the marginal results as well?
Having a hard cutoff like we do now also has disadvantages, as you're seeing in #64
Ok, I guess this discussion is going a little outside of the scope of that particular issue :)
I can work on filing a PR, which will remove signatures from the index, and will use the ES scores when bringing back the results. I could also impose an optional score cutoff (10, let's say), which could be overridden by a param.
Do you think that's reasonable?
Yes go for it!
@rhsimplex I filed a PR. I kept dist for the non-elasticsearch implementation. I guess we'd also need to update the docs at some point.
docs.count | store.size
   7488880 |     31.2gb
   8316821 |     11.9gb
These are the two indexes I have now; the second one is missing signatures. Pretty impressive results. However, I will keep looking into optimizing the space; it's still too much. With this improvement we're at 1.2 TB for 800M images. I'll try to reduce it more.
Even better than expected! You can try reducing the number of words (N) and see how it affects your results.
Could that affect the accuracy of the results?
I also tried renaming simple_word_* to w_*; this reduced the index size, but only insignificantly, so I reverted it.
Yes, it should strictly decrease accuracy, but I don't know by how much. The original paper used 100 words of length 10 -- more, shorter words. I don't find their probabilistic argument that convincing, so it might be worth a shot.
Another thing you can try is getting rid of the extraneous zeros. Signatures are generated by differences of nearest neighbors on a grid. All off-grid differences are just set to zero.
This is done here:
https://github.com/ascribe/image-match/blob/master/image_match/goldberg.py#L418
This function isn't terribly clear since it's written to be fast, but if you look through it you'll see that some parts of the returned array will always be zero. These always get baked into the signature but carry no information. If you're interested in trying this I can provide support, but it will be more difficult since the code is really tightly coupled here.
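To get a feel for how many entries are wasted, you can count the always-zero positions directly (a standalone sketch, not the actual goldberg.py code; it assumes the default 9x9 grid with 8 neighbor differences per point, which is what produces the 648-entry signature):

```python
def informative_positions(grid=9):
    """Count signature entries that can ever be nonzero.

    Each grid point stores differences to its 8 neighbors; points on the
    grid boundary are missing some neighbors, and those slots are always zero.
    """
    offsets = [(-1, -1), (-1, 0), (-1, 1),
               (0, -1),           (0, 1),
               (1, -1),  (1, 0),  (1, 1)]
    informative = 0
    for i in range(grid):
        for j in range(grid):
            for di, dj in offsets:
                # A neighbor difference is informative only if the
                # neighbor actually lies on the grid.
                if 0 <= i + di < grid and 0 <= j + dj < grid:
                    informative += 1
    total = grid * grid * len(offsets)
    return informative, total

informative, total = informative_positions()
# 544 of the 648 entries can carry information; the other 104 are always zero
```

So under these assumptions roughly 16% of the signature is padding that could be dropped before storage.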
Oh, we're about to remove signatures from the index, remember? ;)
i need a vacation :rofl: