
Final analyses on generalization of reimbursements using deep learning with Keras

silviodc opened this issue 7 years ago · 12 comments

  1. Converting PDF files to PNG and then applying SIFT descriptors
  2. Deep learning with Keras to detect generalization in reimbursements
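
For context, a minimal sketch of step 1 under a couple of assumptions: the pdf2image and opencv-contrib-python packages are available, and the file names are placeholders rather than the repository's actual layout.

```python
# Sketch only: pdf2image and opencv-contrib-python are assumed to be installed,
# and the file names below are placeholders.
from pdf2image import convert_from_path
import cv2

# 1a. Render each page of the reimbursement PDF and save it as PNG.
pages = convert_from_path("reimbursement.pdf", dpi=200)
for i, page in enumerate(pages):
    page.save("reimbursement_page%d.png" % i, "PNG")

# 1b. Extract SIFT keypoints and descriptors from the rendered page.
gray = cv2.imread("reimbursement_page0.png", cv2.IMREAD_GRAYSCALE)
sift = cv2.xfeatures2d.SIFT_create()  # cv2.SIFT_create() in newer OpenCV builds
keypoints, descriptors = sift.detectAndCompute(gray, None)
print("%d keypoints, descriptor shape %s" % (len(keypoints), descriptors.shape))
```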

silviodc · May 18 '17

Hi @silviodc, thank you very much for the contribution! 🎉 Thanks for your patience!

@jtemporal, @cabral, and I took a closer look at it and found the best way to understand what you have done! Discussing it in the public group helped us find a way to do that! :)

You were looking for some answers, so here they are. The document is the following one:

It is clear to me that the description of the items was made by someone other than the restaurant. Is that allowed?

No, it is not allowed; the items should be described on the receipt itself.

Are the deputies or their aides altering a document? What are the implications of that?

No, it is not allowed for the receipt to be altered (as the law says), and it looks very suspicious. It is important to notice that the receipt itself was not changed; rather, extra information was added by hand outside of the receipt while scanning it, and the CEAP clearly states that this also can't happen. We have added an image below that highlights the part of the law proving that.

[image: highlighted excerpt of the law]

Taking a closer look at the place, that amount is the cost of a meal for at least two people.

Can we bring the discussion here?

anaschwendler · Jun 07 '17

Hey @silviodc,

I have some questions about the effectiveness of your model! First things first, thanks for sending a PR to serenata that brings CNNs and DNNs to this project; it's a great way to contribute to Brazilian politics! Keep it up!

Let me ask you some questions (they will help you see whether your work is really doing what it is meant to do):

  • Have you tried to create a cross-validation set to test the dataset?
  • Is the 70% accuracy rate suffering from any kind of overfitting/underfitting?
  • What would happen if a new kind of receipt were input (a new place, new items, or other data like a changed CNPJ)? Would the model still be effective?

vmesel · Jun 07 '17

Hi everyone,

So, regarding the suspicious receipt, let's bring the discussion here :+1:

1- How is the scanning of documents done? a) By the deputy's aide? b) By someone in the Chamber of Deputies in charge of it?
2- Since it is not allowed, who assumes responsibility for this mistake? a) The deputies? b) The employees? c) Both?

Moving on to the method:

  • Have you tried to create a cross-validation set to test the dataset?

I didn't try it yet, but I plan to in the future. I will use the 2,483 suspicious reimbursements I got in my run to build a proper training set and a held-out dataset.
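
As a hedged sketch of what such a cross-validation could look like (scikit-learn is assumed, and the file list and labels below are hypothetical stand-ins for the labeled corpus):

```python
# Hypothetical sketch: `files` and `labels` stand in for the labeled receipts.
import numpy as np
from sklearn.model_selection import StratifiedKFold

files = np.array(["r1.png", "r2.png", "r3.png", "r4.png", "r5.png", "r6.png"])
labels = np.array([0, 1, 0, 1, 0, 1])  # 0 = regular receipt, 1 = suspicious

skf = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)
for fold, (train_idx, test_idx) in enumerate(skf.split(files, labels)):
    # Train the Keras model on files[train_idx], evaluate on files[test_idx];
    # averaging the fold accuracies gives a less optimistic estimate.
    print("fold %d: %d train / %d test" % (fold, len(train_idx), len(test_idx)))
```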

  • Is the 70% accuracy rate suffering from any kind of overfitting/underfitting?
  • What would happen if a new kind of receipt were input (a new place, new items, or other data like a changed CNPJ)? Would the model still be effective?

Looking at the first results and the 2,483 files, it is suffering from overfitting. I thought that generating many images with ImageDataGenerator could improve the performance of the model, but that doesn't seem to be the case.
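
For reference, a sketch of the augmentation setup in question, following the usual Keras pattern; the directory layout and 150×150 image size are assumptions, not the PR's actual values:

```python
# Sketch of the ImageDataGenerator setup; "data/train" and the 150x150 target
# size are assumptions.
from keras.preprocessing.image import ImageDataGenerator

train_datagen = ImageDataGenerator(
    rescale=1. / 255,       # normalize pixel values to [0, 1]
    shear_range=0.2,        # random shearing
    zoom_range=0.2,         # random zoom
    horizontal_flip=False)  # flipping a receipt would mirror its text

train_generator = train_datagen.flow_from_directory(
    "data/train",           # expects one subfolder per class
    target_size=(150, 150),
    batch_size=32,
    class_mode="binary")
```

Augmentation only produces variations of the same 500 receipts; it cannot add genuinely new layouts or vendors, which would explain why it didn't cure the overfitting.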

The main point is: I used little data (500 reimbursements; 250 per class); I was really lazy about building the corpus, haha. In the example I followed, they use 2,000 images (1,000 per class), and in my opinion that has a huge impact. Changing the CNPJ, date, or place doesn't lead to false positives. However, changing the items (description and quantity) has a huge impact.

So far I have classified 1,400 of the 2,483 reimbursements by hand. By August I should have the training dataset prepared. I guess with this dataset we can get a better view of whether it's a good approach to include. In my opinion it's too soon to use it directly. However, I think that including the reimbursements posted on Twitter and Facebook and validated by citizens could be an approach in the future. Even with just these 2,483 files I've noticed some interesting cases.

PS: thank you guys so much for this project! I'm very proud of you :+1: and happy to participate

silviodc · Jun 08 '17

Hi @silviodc, so according to the law, here is what we have:

1- How is the scanning of documents done? a) By the deputy's aide? b) By someone in the Chamber of Deputies in charge of it?

According to the law, the congressperson has to hand in the original document, and the Chamber will scan it. Knowing that, they cannot bring an already-scanned document to the Chamber. The law is below: "Art. 3, § 2: Expenses shall be reimbursed when proven by an original document, first copy, paid and in the Deputy's name, except as provided in §§ 4 to 6 of this article; in the case of a telephone bill, only the cover page is accepted, accompanied by the corresponding proof of payment. (Paragraph as worded by Ato da Mesa nº 66, of 8/1/2013)" source

2- Since it is not allowed, who assumes responsibility for this mistake? a) The deputies? b) The employees? c) Both?

The responsibility lies with the deputy, according to this piece of the law:

" Art. 4º A solicitação de reembolso será efetuada mediante requerimento padrão, assinado pelo parlamentar, que, nesse ato, declarará assumir inteira responsabilidade pela liquidação da despesa, atestando que: I - o material foi recebido ou o serviço, prestado; II - o objeto do gasto obedece aos limites estabelecidos na legislação; III - a documentação apresentada é autêntica e legítima." source

So far I have classified 1,400 of the 2,483 reimbursements by hand. By August I should have the training dataset prepared. I guess with this dataset we can get a better view of whether it's a good approach to include.

So you are thinking about classifying all those reimbursements to create a training dataset, and you plan to use that as the approach going forward? Is there something we can do to help, besides testing it and giving our opinion? You can always call on us :)

anaschwendler · Jun 08 '17

Hi @anaschwendler

Continuing the discussion: since the responsibility lies only with the deputy, what do we do when they push it onto others?

Take a look:

"The Chamber has a department that analyzes and approves each member's expense reports; therefore, if this amount (which I consumed and paid for) violated any rule of the House, that department would have pointed out the problem and certainly would not have approved my expense, returning the receipt without reimbursement."

Deputy Marcon's note about a 130-real meal

Did you find an answer to it?

"Our next step is to find out which authority to appeal to when the deputies themselves deny the numbers, the public data, and the math."

Cuducos' post

PS: For how long do we have to play the "I'm not guilty..." game?

So, for the method....

  1. You can help by validating the reimbursements I didn't check.
  2. You can help by publishing them on the web to be validated by others...

need validation

Best, Silvio

silviodc · Jun 08 '17

Did you find an answer to it?

We've found an answer and we wrote an article in Portuguese about it.

Creating news around it and building social pressure will bring good results and education. We're aligning the next steps with lawyers.

PS: For how long do we have to play the "I'm not guilty..." game?

We are not playing; they are :)

So, for the method....

We had an amazing experience with crowdsourcing before, and we've got plans to do the same here. Soon we’ll release a spreadsheet and drop a line through FB and Twitter.

Thank you very much! :tada:

anaschwendler · Jun 13 '17

Hi @silviodc and everyone! Just an update here: we are releasing the file for the crowdsourcing today ;)

Pretty soon we'll have the files identified for testing the method 🎉

jtemporal · Jun 19 '17

Hi @jtemporal, great news!! I guess we can already test this method again this week :D

silviodc · Jun 19 '17

Why not run the model on all ~160k meal receipts and make the spreadsheet available, ordered by the probability generated by the model? It would make the process faster, and after that a better model can be built. More specifically, we could classify by hand only the receipts that have a probability between 0.35 and 0.75.
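
A sketch of that triage idea, with heavy assumptions: the model file name, the CSV layout, and the 150×150 input size below are all hypothetical.

```python
# Hypothetical triage sketch: model file, CSV columns, and input size are
# assumptions.
import numpy as np
import pandas as pd
from keras.models import load_model
from keras.preprocessing import image

model = load_model("generalization_model.h5")

def predict_image(model, path):
    """Load one receipt image and return the model's probability for it."""
    img = image.load_img(path, target_size=(150, 150))
    x = image.img_to_array(img) / 255.0
    return float(model.predict(np.expand_dims(x, axis=0))[0][0])

df = pd.read_csv("meal_receipts.csv")  # assumed: one image path per row
df["probability"] = [predict_image(model, p) for p in df["image_path"]]

# Publish everything ranked by model probability...
df.sort_values("probability", ascending=False).to_csv(
    "ranked_receipts.csv", index=False)

# ...and hand-label only the band the model is unsure about.
uncertain = df[df["probability"].between(0.35, 0.75)]
print("%d of %d receipts need manual review" % (len(uncertain), len(df)))
```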

thiagoalencar · Jun 19 '17

Hi everyone!

Finally I executed the code over the new dataset! Take a look at the results; they are amazing :D

silviodc · Jun 24 '17

Hi @thiagoalencar

Thanks so much for following this PR too. So, regarding the overfitting, I guess you missed that the 91% was on an external dataset. It was also suggested by @vmesel to verify whether we could use the model on other receipts.

During training, we achieved 86% accuracy on the training set and 94% on the validation set: acc: 0.8691 - val_loss: 0.2265 - val_acc: 0.9423
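
Those numbers are Keras training-log metrics. Roughly, they come from a run like the sketch below, where the architecture, epoch counts, and the generators (see the augmentation sketch earlier in the thread) are all assumptions:

```python
# Sketch of where those log lines come from; the architecture and generators
# are assumptions, not the PR's actual model.
from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Flatten, Dense

model = Sequential([
    Conv2D(32, (3, 3), activation="relu", input_shape=(150, 150, 3)),
    MaxPooling2D((2, 2)),
    Flatten(),
    Dense(64, activation="relu"),
    Dense(1, activation="sigmoid")])  # binary: generalized vs. regular
model.compile(optimizer="rmsprop", loss="binary_crossentropy",
              metrics=["accuracy"])

history = model.fit_generator(
    train_generator,                       # augmented training images
    steps_per_epoch=2000 // 32,            # assumed corpus / batch sizes
    epochs=50,
    validation_data=validation_generator,  # assumed held-out generator
    validation_steps=800 // 32)

# Keras prints each epoch as: acc: 0.8691 - val_loss: 0.2265 - val_acc: 0.9423
print(history.history["acc"][-1], history.history["val_acc"][-1])
```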

Therefore, the model also generalizes to other data. I suggest you take a look at the definition of overfitting in this paper: vldb ML

For instance, overfitting occurs when the model fits the training data well but does not generalize to unseen or test data.

I think the model is now good enough to be used. We have 91% accuracy on external test data.

Regarding the probability value mentioned here:

Why not run the model on all ~160k meal receipts and make the spreadsheet available, ordered by the probability generated by the model?

Well, this is the probability according to the model; it doesn't confirm anything about the reimbursement itself. In the end we will have a lot of reimbursements that must be validated by hand, which leads us to two points:

  1. Validating reimbursements by hand is a laborious task, and we want to avoid it! That is the main reason we built the ML model.
  2. Since the first model wasn't good enough, we will have tons of data with questionable probabilities.

Thanks so much for your comments and interest.

silviodc · Jun 24 '17

Thanks for the response, but I think a small test set does not capture all the variability of the dataset, and applying the model to the whole dataset will not generate too many cases. With this accuracy, only a few cases will remain in doubt. There will never be a model that removes the need for manual classification, so the questionable classifications can be iteratively corrected, like Google does with its CAPTCHA system. Can you post the confusion matrix, sensitivity, and specificity of the model?
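
For reference, those metrics can be read straight off a scikit-learn confusion matrix; a sketch with made-up labels and probabilities (substitute the real hand labels and model outputs):

```python
# Sketch with fabricated toy arrays; replace with the real data.
import numpy as np
from sklearn.metrics import confusion_matrix

y_true = np.array([0, 0, 1, 1, 1, 0])              # hand labels
y_prob = np.array([0.1, 0.8, 0.9, 0.6, 0.2, 0.3])  # model probabilities
y_pred = (y_prob >= 0.5).astype(int)               # 0.5 decision threshold

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
sensitivity = tp / float(tp + fn)  # true positive rate (recall)
specificity = tn / float(tn + fp)  # true negative rate

print("tn=%d fp=%d fn=%d tp=%d" % (tn, fp, fn, tp))
print("sensitivity=%.2f specificity=%.2f" % (sensitivity, specificity))
```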

thiagoalencar · Jun 24 '17