chefboost
findDecision incorrect?
I have a CSV with pre-calculated cosine distances between the face embeddings of the people images in my dataset, like this:
    Person1     Person2     Idx1  Idx2  Distance  Decision
0   Aaron Paul  Aaron Paul  0     1     0.3245    Yes
1   Aaron Paul  Aaron Paul  0     2     0.2281    Yes
2   Aaron Paul  Aaron Paul  0     3     0.4737    Yes
3   Aaron Paul  Aaron Paul  0     4     0.4103    Yes
4   Aaron Paul  Aaron Paul  0     5     0.3236    Yes
5   Aaron Paul  Aaron Paul  0     6     0.3270    Yes
6   Aaron Paul  Aaron Paul  0     7     0.4873    Yes
7   Aaron Paul  Aaron Paul  0     8     0.3988    Yes
8   Aaron Paul  Aaron Paul  1     2     0.2357    Yes
9   Aaron Paul  Aaron Paul  1     3     0.2613    Yes
10  Aaron Paul  Aaron Paul  1     4     0.3827    Yes
11  Aaron Paul  Aaron Paul  1     5     0.2221    Yes
12  Aaron Paul  Aaron Paul  1     6     0.2183    Yes
13  Aaron Paul  Aaron Paul  1     7     0.4568    Yes
14  Aaron Paul  Aaron Paul  1     8     0.2391    Yes
15  Aaron Paul  Aaron Paul  2     3     0.4439    Yes
16  Aaron Paul  Aaron Paul  2     4     0.4086    Yes
17  Aaron Paul  Aaron Paul  2     5     0.2592    Yes
18  Aaron Paul  Aaron Paul  2     6     0.2863    Yes
19  Aaron Paul  Aaron Paul  2     7     0.4588    Yes
I use this script to build the decision tree (findDecision):
import pandas as pd
import matplotlib.pyplot as plt
from tqdm import tqdm
tqdm.pandas()
if __name__ == '__main__':
    ##############################################################################
    # Read the CSV to determine the best threshold...
    df = pd.read_csv(R"\\10.15.20.109\e$\MODELS\ProtecFR\Model\faces2.csv", encoding='UTF8')
    print(df.head(20))
    # Split the distances by label: same person ("Yes") vs. different persons ("No")
    df1 = df[df['Decision'] == "Yes"]['Distance'].copy()
    df2 = df[df['Decision'] == "No"]['Distance'].copy()
    print(f"Count Yes: {df1.count()}")
    print(f"Average Yes: {round(df1.mean(), 4)}")
    print(f"Std. deviation Yes: {round(df1.std(), 4)}")
    print(f"Min Yes: {round(df1.min(), 4)}")
    print(f"Max Yes: {round(df1.max(), 4)}")
    print(f"Mode Yes: {round(df1.mode()[0], 4)}")
    print(f"Count No: {df2.count()}")
    print(f"Average No: {round(df2.mean(), 4)}")
    print(f"Std. deviation No: {round(df2.std(), 4)}")
    print(f"Min No: {round(df2.min(), 4)}")
    print(f"Max No: {round(df2.max(), 4)}")
    print(f"Mode No: {round(df2.mode()[0], 4)}")
    # Plot the kernel density estimate of each class
    df1.plot.kde()
    df2.plot.kde()
    plt.legend(["Yes", "No"])
    plt.grid()
    plt.axhline(0, color='red')
    plt.axvline(0, color='red')
    plt.show()
    # Fit a C4.5 tree on the single feature 'Distance'
    from chefboost import Chefboost as chef
    config = {'algorithm': 'C4.5'}
    tmp_df = df[['Distance', 'Decision']].copy()
    model = chef.fit(tmp_df, config)
    print(model)
The results I get are:
Count Yes: 108285
Average Yes: 0.4496
Std. deviation Yes: 0.1557
Min Yes: 0.0
Max Yes: 1.0644
Mode Yes: 0.3465
Count No: 59793700
Average No: 0.7976
Std. deviation No: 0.1112
Min No: 0.0
Max No: 1.2973
Mode No: 0.8114
[INFO]: 8 CPU cores will be allocated in parallel running
C4.5 tree is going to be built...
-------------------------
finished in 135.35767483711243 seconds
-------------------------
Evaluate train set
-------------------------
Accuracy: 99.81929981118321 % on 59901985 instances
Labels: ['Yes' 'No']
Confusion matrix: [[43, 1], [108242, 59793699]]
Precision: 97.7273 %, Recall: 0.0397 %, F1: 0.0794 %
{'trees': [<module 'outputs/rules/rules' from 'c:\\DESARROLLOS\\Python\\VID\\outputs/rules/rules.py'>], 'alphas': [], 'config': {'algorithm': 'C4.5', 'enableRandomForest': False, 'num_of_trees': 5, 'enableMultitasking': False, 'enableGBM': False, 'epochs': 10, 'learning_rate': 1, 'max_depth': 3, 'enableAdaboost': False, 'num_of_weak_classifier': 4, 'enableParallelism': True, 'num_cores': 8}, 'nan_values': [['Distance', None]]}
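For reference, the reported metrics can be reproduced from the confusion matrix by hand. This is a minimal sketch; the row/column layout (rows are predicted labels, columns are actual labels, in the order ['Yes', 'No']) is an assumption inferred from the counts above, since it is the only layout consistent with Count Yes = 108285 and the printed precision.

```python
# Confusion matrix as printed above. Assumed layout (inferred from the counts):
# rows = predicted label, columns = actual label, both in the order ['Yes', 'No'].
confusion = [[43, 1], [108242, 59793699]]

tp, fp = confusion[0]  # predicted Yes: actually Yes / actually No
fn, tn = confusion[1]  # predicted No:  actually Yes / actually No

total = tp + fp + fn + tn
accuracy = (tp + tn) / total
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(f"Instances: {total}")
print(f"Accuracy:  {100 * accuracy:.4f} %")
print(f"Precision: {100 * precision:.4f} %")
print(f"Recall:    {100 * recall:.4f} %")
print(f"F1:        {100 * f1:.4f} %")
```

Because the "No" class is about 550 times larger than the "Yes" class, the accuracy stays near 99.8 % even though only 43 of the 108285 "Yes" pairs are recognized.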
The plot is:
and outputs/rules/rules.py:
def findDecision(obj): #obj[0]: Distance
    # {"feature": "Distance", "instances": 59901985, "metric_value": 0.0191, "depth": 1}
    if obj[0]>0.0:
        return 'No'
    elif obj[0]<=0.0:
        return 'Yes'
    else: return 'Yes'
As you can see, it gives me a 0.0 threshold when it should be around 0.68.
Am I doing something wrong?
Regards
Can you share the data set?
It might be a rounding problem. In the comment line it says "metric_value": 0.0191
Shouldn't it be around 0.5 looking at the plot?
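As a rough cross-check of where the two curves in the plot might cross, one can approximate each class by a normal distribution with the mean and standard deviation printed above and solve for the point where the two densities are equal. This is only a sketch under that Gaussian assumption (the real distributions need not be normal), and the helper name gaussian_crossings is hypothetical, not part of chefboost.

```python
import math

def gaussian_crossings(mu1, sigma1, mu2, sigma2):
    """Return the x values where two normal densities N(mu1, sigma1) and
    N(mu2, sigma2) are equal, ignoring class priors."""
    # Equating the log-densities gives a quadratic a*x^2 + b*x + c = 0.
    a = 1 / (2 * sigma1**2) - 1 / (2 * sigma2**2)
    b = mu2 / sigma2**2 - mu1 / sigma1**2
    c = (mu1**2 / (2 * sigma1**2) - mu2**2 / (2 * sigma2**2)
         + math.log(sigma1 / sigma2))
    if abs(a) < 1e-12:  # equal sigmas: the equation is linear
        return [-c / b] if b else []
    disc = b * b - 4 * a * c
    if disc < 0:
        return []
    root = math.sqrt(disc)
    return sorted([(-b - root) / (2 * a), (-b + root) / (2 * a)])

# Class statistics copied from the output above ("Yes" first, then "No").
mu_yes, sd_yes = 0.4496, 0.1557
mu_no, sd_no = 0.7976, 0.1112

crossings = gaussian_crossings(mu_yes, sd_yes, mu_no, sd_no)
# Keep only the crossing that lies between the two class means.
between = [x for x in crossings if mu_yes < x < mu_no]
print(between)
```

The crossing between the two means is the natural candidate for a distance threshold if both classes are weighted equally, which is how the per-class KDE curves in the plot are drawn.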
The data set has these columns:
- "Person1": The name of first person
- "Person2": The name of second person
- "Idx1": Annoy index of ArcFace embedding of Person1
- "Idx2": Annoy index of ArcFace embedding of Person2
- "Distance": cosine distance between embeddings of Person1 and Person2
- "Decision": Yes if Person1 is the same as Person2, No if it isn't
I know this is off topic and belongs in your deepface package, but it is the reason I was trying to establish a threshold. It is relatively common that:
- two faces of different persons have a very small cosine distance (< 0.1)
- two faces of same person have a very big cosine distance (> 1.0)
Perhaps this is the reason for "metric_value": 0.0191...
Data set is available here
The data set is really large and I cannot download it. Could you subsample it and share it here again?
I have uploaded faces_3.csv here, a 50% subsample of the original data.
For large data sets, the code does not evaluate every possible partition (presumably for performance reasons). Instead, it uses the mean and the mean +/- 1 to 3 standard deviations as candidate split points. This subsampling is implemented in processContinuousFeatures.
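To illustrate the effect of restricting the candidate split points, here is a minimal, self-contained sketch. It is not chefboost's actual code: the data is synthetic (drawn to loosely resemble the class statistics in this thread), the candidate list is a hypothetical reading of "mean and +/- 1-3 std deviations", and the score is plain information gain rather than the gain ratio that C4.5 uses.

```python
import math
import random
import statistics

def entropy(pos, neg):
    """Binary entropy in bits for a node with pos/neg label counts."""
    total = pos + neg
    result = 0.0
    for count in (pos, neg):
        if count:
            p = count / total
            result -= p * math.log2(p)
    return result

def information_gain(values, labels, threshold):
    """Gain of splitting into value <= threshold and value > threshold."""
    left = [lab for v, lab in zip(values, labels) if v <= threshold]
    right = [lab for v, lab in zip(values, labels) if v > threshold]
    n = len(labels)
    parent = entropy(sum(labels), n - sum(labels))
    children = 0.0
    for side in (left, right):
        if side:
            children += len(side) / n * entropy(sum(side), len(side) - sum(side))
    return parent - children

def best_split(values, labels, candidates):
    """Return the candidate threshold with the highest information gain."""
    return max(candidates, key=lambda t: information_gain(values, labels, t))

random.seed(0)
# Synthetic stand-in for the real data: label 1 = "Yes", label 0 = "No".
values = ([random.gauss(0.45, 0.16) for _ in range(50)]
          + [random.gauss(0.80, 0.11) for _ in range(950)])
labels = [1] * 50 + [0] * 950

mean = statistics.mean(values)
std = statistics.pstdev(values)
restricted = [mean + k * std for k in range(-3, 4)]  # 7 candidates only
exhaustive = sorted(set(values))                     # every observed value

t_restricted = best_split(values, labels, restricted)
t_exhaustive = best_split(values, labels, exhaustive)
gain_restricted = information_gain(values, labels, t_restricted)
gain_exhaustive = information_gain(values, labels, t_exhaustive)

print(f"restricted: threshold={t_restricted:.4f} gain={gain_restricted:.4f}")
print(f"exhaustive: threshold={t_exhaustive:.4f} gain={gain_exhaustive:.4f}")
```

The exhaustive search can never score worse than the restricted one, because any threshold produces the same partition as the largest observed value below it. The trade-off is cost: the exhaustive list grows with the number of distinct values, which is presumably why a fixed, small candidate set is used for large inputs.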