open-metric-learning
Decrease memory footprint of computing FMR metric
Initial discussion is in https://github.com/OML-Team/open-metric-learning/pull/250
from @dapladoc :
I thought about algorithms that can find quantiles online, in a batch mode. I found that there is the P^2 algorithm (link), and here is a discussion about it (link). Then I found more articles about quantile estimation: see the references of this article (link). It looks like all these algorithms work only element-by-element, so there is no way to parallelize them. I'm not a specialist in this, though. Maybe it could be done efficiently in Cython. I'll try to implement P^2 (or some other algorithm) in Cython.
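For reference, here is a minimal pure-Python sketch of the P^2 estimator (following Jain & Chlamtac, 1985). The `P2Quantile` class and its API are made up for illustration and are not part of OML:

```python
class P2Quantile:
    """Streaming p-quantile estimate via the P^2 algorithm (Jain & Chlamtac, 1985).

    Keeps only 5 markers instead of the full sample, so memory is O(1),
    but observations must be fed in one by one.
    """

    def __init__(self, p: float):
        self.p = p
        self.q = []                                   # marker heights
        self.n = [0, 1, 2, 3, 4]                      # actual marker positions
        self.n_des = [0, 2 * p, 4 * p, 2 + 2 * p, 4]  # desired marker positions
        self.dn = [0, p / 2, p, (1 + p) / 2, 1]       # desired position increments

    def add(self, x: float) -> None:
        if len(self.q) < 5:                           # bootstrap on the first 5 points
            self.q.append(x)
            self.q.sort()
            return
        if x < self.q[0]:                             # new minimum
            self.q[0], k = x, 0
        elif x >= self.q[4]:                          # new maximum
            self.q[4], k = x, 3
        else:                                         # index of the cell containing x
            k = next(i for i in range(4) if self.q[i] <= x < self.q[i + 1])
        for i in range(k + 1, 5):                     # shift markers above the cell
            self.n[i] += 1
        for i in range(5):
            self.n_des[i] += self.dn[i]
        for i in (1, 2, 3):                           # nudge middle markers if they drifted
            d = self.n_des[i] - self.n[i]
            if (d >= 1 and self.n[i + 1] - self.n[i] > 1) or (d <= -1 and self.n[i - 1] - self.n[i] < -1):
                d = 1 if d > 0 else -1
                q_new = self._parabolic(i, d)
                if not self.q[i - 1] < q_new < self.q[i + 1]:
                    q_new = self._linear(i, d)        # fall back to linear interpolation
                self.q[i] = q_new
                self.n[i] += d

    def _parabolic(self, i: int, d: int) -> float:
        n, q = self.n, self.q
        return q[i] + d / (n[i + 1] - n[i - 1]) * (
            (n[i] - n[i - 1] + d) * (q[i + 1] - q[i]) / (n[i + 1] - n[i])
            + (n[i + 1] - n[i] - d) * (q[i] - q[i - 1]) / (n[i] - n[i - 1])
        )

    def _linear(self, i: int, d: int) -> float:
        n, q = self.n, self.q
        return q[i] + d * (q[i + d] - q[i]) / (n[i + d] - n[i])

    @property
    def value(self) -> float:
        # the middle marker tracks the p-quantile (median of seen points until warmed up)
        return self.q[2] if len(self.q) == 5 else sorted(self.q)[len(self.q) // 2]
```

Since `add` consumes one element at a time, this has exactly the element-by-element constraint mentioned above.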
Another solution is to approximate the quantile (and the fnmr@fmr) by means of a histogram, which could be computed incrementally. I can try to estimate the accuracy of this approach on some synthetic data.
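To illustrate the histogram idea: accumulate a fixed-bin histogram batch by batch and read the quantile off its CDF. The `StreamingHistogramQuantile` class below is made up for illustration, and it assumes the distance range `[lo, hi]` is known up front (e.g., `[0, 2]` for cosine distances of L2-normalized embeddings):

```python
import torch

class StreamingHistogramQuantile:
    """Approximate quantiles from a fixed-bin histogram filled batch by batch."""

    def __init__(self, lo: float, hi: float, n_bins: int = 10_000):
        self.lo, self.hi, self.n_bins = lo, hi, n_bins
        self.counts = torch.zeros(n_bins)

    def update(self, batch: torch.Tensor) -> None:
        # torch.histc buckets values into n_bins equal-width bins over [lo, hi]
        self.counts += torch.histc(batch.float(), bins=self.n_bins, min=self.lo, max=self.hi)

    def quantile(self, q: float) -> float:
        cdf = torch.cumsum(self.counts, dim=0) / self.counts.sum()
        idx = int(torch.searchsorted(cdf, torch.tensor(q)))
        # the approximation error is at most one bin width: (hi - lo) / n_bins
        return self.lo + (idx + 1) * (self.hi - self.lo) / self.n_bins
```

Unlike random subsampling, the error here is bounded by the bin width, so it is easy to control.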
At the moment I don't have other ideas, but I will keep thinking about it, and maybe something new will come up.
from me:
@dapladoc thank you for your thoughts and links! Please don't start working on a Cython implementation (unless you really want to). I am afraid that the possible profit is way less than the potential effort needed to implement this algorithm. Let's just discuss possibilities for now.
I like the idea of an approximation. I thought about an even simpler solution: we can consider only part of the pairwise distances if the matrix is too big, for example, randomly pick 10% of them.
It's still unclear what the actual problem is: the footprint of copying the distances, or the memory overhead inside the quantile algorithm.
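For context, `fnmr@fmr` boils down to one percentile of the negative distances plus one comparison over the positive ones. Roughly (this is only a sketch; the actual implementation is linked at the end of this thread):

```python
import numpy as np
import torch

def calc_fnmr_at_fmr_sketch(pos_dist: torch.Tensor, neg_dist: torch.Tensor, fmr_vals=(0.1, 1)):
    # threshold_i = the fmr_vals[i]-th percentile of the negative (impostor) distances
    thresholds = torch.from_numpy(np.percentile(neg_dist.numpy(), fmr_vals))
    # fnmr@fmr_i = % of positive (genuine) pairs whose distance exceeds that threshold
    return (pos_dist[None, :] >= thresholds[:, None]).float().mean(dim=1) * 100
```

So the percentile over the (potentially huge) vector of negative distances is where any memory optimization would enter the picture.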
> I like the idea of an approximation. I thought about an even simpler solution: we can consider only part of the pairwise distances if the matrix is too big, for example, randomly pick 10% of them.
I don't like this idea: it is a very uncontrollable approach. I think it could have very high variance.
Here is a small simulation. I drew `pos_dist` and `neg_dist` from the gamma distribution. The size of `pos_dist` is 1000, and the size of `neg_dist` is 1,000,000. Then, in a loop, I randomly take 10% of the negative distances, evaluate `fnmr@fmr`, and store it. Afterwards, I plot histograms of the `fnmr@fmr` approximations and the true `fnmr@fmr` for different `fmr_vals`.
The simulation of `fnmr@fmr` approximation on random subsets:
```python
import matplotlib.pyplot as plt
import numpy as np
import torch
from tqdm import tqdm

from oml.functional.metrics import calc_fnmr_at_fmr

pos_dist = torch.from_numpy(np.random.gamma(shape=1, scale=1, size=1000))
neg_dist = torch.from_numpy(np.random.gamma(shape=6, scale=1, size=1_000_000))
fmr_vals = (0.1, 0.5, 1, 5)

fnmr_true = calc_fnmr_at_fmr(pos_dist, neg_dist, fmr_vals).numpy()

fnmr_approx = []
for i in tqdm(range(10000)):
    # keep a random ~10% subset of the negative distances
    _neg_dist = neg_dist[torch.rand(neg_dist.shape) > 0.9]
    fnmr_approx.append(calc_fnmr_at_fmr(pos_dist, _neg_dist, fmr_vals).tolist())
fnmr_approx = np.array(fnmr_approx)

fig, ax = plt.subplots(len(fnmr_true))
for i in range(len(fnmr_true)):
    ax[i].hist(fnmr_approx[:, i], bins=100)
    ax[i].axvline(fnmr_true[i], color='r')
    ax[i].set_title(f'fmr_val = {fmr_vals[i]}')
    print(f'Std(fnmr@fmr({fmr_vals[i]})) = {fnmr_approx[:, i].std():.4e}')
    print(f'Bias(fnmr@fmr({fmr_vals[i]})) = {np.mean(np.abs(fnmr_approx[:, i] - fnmr_true[i])):.4e}')
fig.set_size_inches(6.4, 4.8 * 3)
```
And here is its output:

```
Std(fnmr@fmr(0.1)) = 5.5389e-01
Bias(fnmr@fmr(0.1)) = 4.0927e-01
Std(fnmr@fmr(0.5)) = 3.2004e-01
Bias(fnmr@fmr(0.5)) = 2.2448e-01
Std(fnmr@fmr(1)) = 1.5432e-01
Bias(fnmr@fmr(1)) = 6.6530e-02
Std(fnmr@fmr(5)) = 9.9451e-02
Bias(fnmr@fmr(5)) = 7.1290e-02
```
Here are histograms of the `fnmr@fmr` approximations. The red lines are the "ground truth". Usually we are interested in `fnmr@fmr` for small values of `fmr`, and that is exactly where the bias is the biggest, since the corresponding quantile sits in the tail of the negative distribution, where a random subsample has the fewest points.
The graphs above seem okay to me, since I consider fnmr@fmr only as an additional (auxiliary) metric. So, it's not super important whether 30% or 32% of positive distances are smaller than 10% of the negative ones. Of course, if I submitted these results somewhere, it would not be accurate enough.
If we want to work on this problem systematically, we first have to measure the following (for example, on the SOP dataset):
1. Memory needed to compute the 3 basic metrics (CMC, precision, AP).
2. Memory needed to do p.1 + copy the positive and negative distances.
3. Memory needed to do p.2 + compute the fnmr@fmr metric.

Then it would be clear what should be optimized and how (see the measurement sketch right after this list).
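One possible way to take these measurements on CPU (a sketch; `peak_rss_mb` is a hypothetical helper, and note that the units of `ru_maxrss` differ across platforms):

```python
import resource

def peak_rss_mb() -> float:
    # peak resident set size of the current process;
    # ru_maxrss is reported in KB on Linux (and in bytes on macOS)
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024

# Run each variant in a separate process so the peaks don't mix:
#   p.1: compute CMC / precision / AP only,       then print(peak_rss_mb())
#   p.2: p.1 + copy positive/negative distances,  then print(peak_rss_mb())
#   p.3: p.2 + compute fnmr@fmr,                  then print(peak_rss_mb())
```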
PS. IMHO, we can postpone the problem for now (https://github.com/OML-Team/open-metric-learning/pull/250 is enough) and focus on the tasks with higher priorities. In the end, it's probably not optimal to optimize the FMR metric without (at least) having examples from the verification domain.
> The graphs above seem okay to me, since I consider fnmr@fmr only as an additional (auxiliary) metric. So, it's not super important whether 30% or 32% of positive distances are smaller than 10% of the negative ones. Of course, if I submitted these results somewhere, it would not be accurate enough.
It really depends on the domain. For security-related tasks (e.g., biometrics), 99.9% and 99.99% are noticeably different results.
> PS. IMHO, we can postpone the problem for now (#250 is enough) and focus on the tasks with higher priorities. In the end, it's probably not optimal to optimize the FMR metric without (at least) having examples from the verification domain.
I agree with you. More thorough memory measurements should be done in order to find possible ways to solve this issue.
Optimized it a bit along with the OML 3.0 release (removed unneeded conversions): https://github.com/OML-Team/open-metric-learning/blob/main/oml/functional/metrics.py#L375