chefboost

findDecision incorrect?

Open rjgarciar opened this issue 2 years ago • 5 comments

I have a CSV with precomputed cosine distances between the face embeddings of the images in my dataset, like this:

       Person1     Person2  Idx1  Idx2  Distance Decision
0   Aaron Paul  Aaron Paul     0     1    0.3245      Yes
1   Aaron Paul  Aaron Paul     0     2    0.2281      Yes
2   Aaron Paul  Aaron Paul     0     3    0.4737      Yes
3   Aaron Paul  Aaron Paul     0     4    0.4103      Yes
4   Aaron Paul  Aaron Paul     0     5    0.3236      Yes
5   Aaron Paul  Aaron Paul     0     6    0.3270      Yes
6   Aaron Paul  Aaron Paul     0     7    0.4873      Yes
7   Aaron Paul  Aaron Paul     0     8    0.3988      Yes
8   Aaron Paul  Aaron Paul     1     2    0.2357      Yes
9   Aaron Paul  Aaron Paul     1     3    0.2613      Yes
10  Aaron Paul  Aaron Paul     1     4    0.3827      Yes
11  Aaron Paul  Aaron Paul     1     5    0.2221      Yes
12  Aaron Paul  Aaron Paul     1     6    0.2183      Yes
13  Aaron Paul  Aaron Paul     1     7    0.4568      Yes
14  Aaron Paul  Aaron Paul     1     8    0.2391      Yes
15  Aaron Paul  Aaron Paul     2     3    0.4439      Yes
16  Aaron Paul  Aaron Paul     2     4    0.4086      Yes
17  Aaron Paul  Aaron Paul     2     5    0.2592      Yes
18  Aaron Paul  Aaron Paul     2     6    0.2863      Yes
19  Aaron Paul  Aaron Paul     2     7    0.4588      Yes

I use this script to build the decision tree (findDecision):

import pandas as pd
import matplotlib.pyplot as plt
from tqdm import tqdm
tqdm.pandas()

if __name__ == '__main__':
	##############################################################################
	# Read the CSV to determine the best threshold...
	df = pd.read_csv(R"\\10.15.20.109\e$\MODELS\ProtecFR\Model\faces2.csv", encoding='UTF8')
	print(df.head(20))

	df1 = df[df['Decision'] == "Yes"]['Distance'].copy()
	df2 = df[df['Decision'] == "No"]['Distance'].copy()
	print(f"Count Yes: {df1.count()}")
	print(f"Average Yes: {round(df1.mean(), 4)}")
	print(f"Std. deviation Yes: {round(df1.std(), 4)}")
	print(f"Min Yes: {round(df1.min(), 4)}")
	print(f"Max Yes: {round(df1.max(), 4)}")
	print(f"Mode Yes: {round(df1.mode()[0], 4)}")

	print(f"Count No: {df2.count()}")
	print(f"Average No: {round(df2.mean(), 4)}")
	print(f"Std. deviation No: {round(df2.std(), 4)}")
	print(f"Min No: {round(df2.min(), 4)}")
	print(f"Max No: {round(df2.max(), 4)}")
	print(f"Mode No: {round(df2.mode()[0], 4)}")

	df1.plot.kde()
	df2.plot.kde()
	plt.legend(["Yes", "No"])
	plt.grid()
	plt.axhline(0,color='red')
	plt.axvline(0,color='red')
	plt.show()

	from chefboost import Chefboost as chef
	config = {'algorithm': 'C4.5'}

	tmp_df = df[['Distance', 'Decision']].copy()
	model = chef.fit(tmp_df, config)
	print(model)

The results I get are:

Count Yes: 108285
Average Yes: 0.4496
Std. deviation Yes: 0.1557
Min Yes: 0.0
Max Yes: 1.0644
Mode Yes: 0.3465

Count No: 59793700
Average No: 0.7976
Std. deviation No: 0.1112
Min No: 0.0
Max No: 1.2973
Mode No: 0.8114

[INFO]:  8 CPU cores will be allocated in parallel running
C4.5  tree is going to be built...
-------------------------
finished in  135.35767483711243  seconds
-------------------------
Evaluate  train set
-------------------------
Accuracy:  99.81929981118321 % on  59901985  instances
Labels:  ['Yes' 'No']
Confusion matrix:  [[43, 1], [108242, 59793699]]
Precision:  97.7273 %, Recall:  0.0397 %, F1:  0.0794 %
{'trees': [<module 'outputs/rules/rules' from 'c:\\DESARROLLOS\\Python\\VID\\outputs/rules/rules.py'>], 'alphas': [], 'config': {'algorithm': 'C4.5', 'enableRandomForest': False, 'num_of_trees': 5, 'enableMultitasking': False, 'enableGBM': False, 'epochs': 10, 'learning_rate': 1, 'max_depth': 3, 'enableAdaboost': False, 'num_of_weak_classifier': 4, 'enableParallelism': True, 'num_cores': 8}, 'nan_values': [['Distance', None]]}
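For reference, the reported metrics can be re-derived from the confusion matrix. The sketch below assumes the matrix is laid out as [[TP, FP], [FN, TN]] with 'Yes' as the positive class; that layout is inferred from the numbers above, not taken from the chefboost documentation. It reproduces the reported precision, recall and F1, and makes it visible that almost every 'Yes' pair is predicted as 'No'.

```python
# Confusion matrix from the output above, read as [[TP, FP], [FN, TN]]
# (assumed layout, with 'Yes' as the positive class).
tp, fp = 43, 1
fn, tn = 108242, 59793699

total = tp + fp + fn + tn

# All values are expressed in percent, like the chefboost output.
accuracy = 100 * (tp + tn) / total
precision = 100 * tp / (tp + fp)   # share of predicted 'Yes' that are correct
recall = 100 * tp / (tp + fn)      # share of actual 'Yes' that are found
f1 = 2 * precision * recall / (precision + recall)

print(f"Instances: {total}")
print(f"Accuracy:  {accuracy:.4f} %")
print(f"Precision: {precision:.4f} %")
print(f"Recall:    {recall:.4f} %")
print(f"F1:        {f1:.4f} %")
```

The accuracy is high only because the 'No' class is roughly 550 times larger than the 'Yes' class, so predicting 'No' for nearly everything is already about 99.8 % correct.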

The plot is:

[image: ArcFace-cosine]

and outputs/rules/rules.py:

def findDecision(obj): #obj[0]: Distance
	# {"feature": "Distance", "instances": 59901985, "metric_value": 0.0191, "depth": 1}
	if obj[0]>0.0:
		return 'No'
	elif obj[0]<=0.0:
		return 'Yes'
	else: return 'Yes'

As you can see, it gives me a 0.0 threshold when it should be around 0.68.

Am I doing something wrong?

Regards

rjgarciar avatar Apr 06 '22 10:04 rjgarciar

can you share the data set?

serengil avatar Apr 08 '22 11:04 serengil

It might be a rounding problem. The comment line says "metric_value": 0.0191

serengil avatar Apr 08 '22 11:04 serengil

Shouldn't it be around 0.5 looking at the plot?

The data set has these columns:

  • "Person1": the name of the first person
  • "Person2": the name of the second person
  • "Idx1": Annoy index of the ArcFace embedding of Person1
  • "Idx2": Annoy index of the ArcFace embedding of Person2
  • "Distance": cosine distance between the embeddings of Person1 and Person2
  • "Decision": Yes if Person1 is the same as Person2, No if it isn't

I know this is off topic and belongs in your deepface package, but it's the reason I was trying to establish a threshold. It is relatively common that:

  • two faces of different people have a very small cosine distance (< 0.1)
  • two faces of the same person have a very large cosine distance (> 1.0)

Perhaps this is the reason for "metric_value": 0.0191...

Data set is available here

rjgarciar avatar Apr 08 '22 13:04 rjgarciar

The data set is really large and I cannot download it. Could you subsample it and share it here again?

serengil avatar May 21 '22 23:05 serengil

I have uploaded faces_3.csv here, a 50% subsample of the original data.
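One possible way to take such a subsample with pandas is sketched below. This is not necessarily how faces_3.csv was produced; the tiny in-memory DataFrame and the output file name are made up for the example. Sampling within each Decision group keeps the Yes/No ratio of the original data.

```python
import pandas as pd

# Tiny stand-in for the real CSV (values are illustrative only).
df = pd.DataFrame({
    "Distance": [0.32, 0.22, 0.47, 0.41, 0.81, 0.79, 0.85, 0.76],
    "Decision": ["Yes", "Yes", "Yes", "Yes", "No", "No", "No", "No"],
})

# Draw 50% of the rows from each Decision group so that the class
# proportions are preserved; random_state makes the draw repeatable.
half = df.groupby("Decision").sample(frac=0.5, random_state=0)

print(half["Decision"].value_counts())

# Hypothetical output file name:
# half.to_csv("faces_subsample.csv", index=False)
```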

rjgarciar avatar May 23 '22 06:05 rjgarciar

For large datasets the code doesn't evaluate every possible partition (presumably for performance reasons). Instead it only tries the mean and the mean +/- 1 to 3 standard deviations as split points. This subsampling is implemented in processContinuousFeatures.
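To illustrate the difference, here is a minimal sketch that compares a coarse candidate set (mean and mean +/- 1, 2, 3 standard deviations) with an exhaustive search over the midpoints between sorted unique values. It is not chefboost's actual code: the function names and the synthetic data are made up for the example, and plain information gain stands in for the gain ratio that C4.5 uses.

```python
import numpy as np

def entropy(labels):
    """Shannon entropy in bits of a boolean label array."""
    if labels.size == 0:
        return 0.0
    p = labels.mean()
    if p == 0.0 or p == 1.0:
        return 0.0
    return float(-(p * np.log2(p) + (1 - p) * np.log2(1 - p)))

def information_gain(x, y, threshold):
    """Entropy reduction from splitting on x <= threshold."""
    left, right = y[x <= threshold], y[x > threshold]
    n = y.size
    return entropy(y) - left.size / n * entropy(left) - right.size / n * entropy(right)

def best_threshold(x, y, candidates):
    """Return (threshold, gain) of the candidate with the highest gain."""
    gains = [information_gain(x, y, t) for t in candidates]
    i = int(np.argmax(gains))
    return float(candidates[i]), float(gains[i])

def coarse_candidates(x):
    """Seven candidates: mean and mean +/- 1, 2, 3 standard deviations."""
    m, s = x.mean(), x.std()
    return np.array([m + k * s for k in range(-3, 4)])

def exhaustive_candidates(x):
    """Midpoints between consecutive sorted unique values."""
    u = np.unique(x)
    return (u[:-1] + u[1:]) / 2

# Synthetic, imbalanced data loosely shaped like the statistics in this issue.
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(0.45, 0.15, 200), rng.normal(0.80, 0.11, 2000)])
y = np.concatenate([np.ones(200, dtype=bool), np.zeros(2000, dtype=bool)])

coarse = best_threshold(x, y, coarse_candidates(x))
fine = best_threshold(x, y, exhaustive_candidates(x))
print("coarse     (threshold, gain):", coarse)
print("exhaustive (threshold, gain):", fine)
```

The exhaustive search can never do worse than the coarse one, because every threshold induces the same partition as one of the midpoints or leaves one side empty. The cost is one gain evaluation per unique value, which is what becomes expensive on tens of millions of rows.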

alwaysmpe avatar Nov 11 '23 16:11 alwaysmpe