whistleblower Attach document image to tweets

Following @talespaiva's suggestion in a tweet and recent replies to @RosieDaSerenata.

Docs for Twitter API: https://python-twitter.readthedocs.io/en/latest/_modules/twitter/api.html?highlight=%22def%20PostUpdate%22

Jul 03 '17 08:07 Irio

A possible roadmap:

convert the receipt from PDF to PNG
crop the white paper areas (sometimes a small receipt is scanned as a full A4 page)
upload the PNG to someplace like Imgur
add the PNG URL to the tweet

Jul 03 '17 10:07 cuducos

I'll give this one a try.

Oct 05 '17 14:10 paulocezar

Hi @paulocezar

To convert the PDF to an image, I suggest you take a look at the notebooks from https://github.com/datasciencebr/serenata-de-amor/pull/238

That way you can speed up your work by focusing on cropping the image and uploading it. The necessary libraries (OpenCV, Wand...) are at the end of the Dockerfile. Try looking into segmentation techniques to crop the receipt :) Looking forward to seeing your #PR ;)

Oct 07 '17 09:10 silviodc

Hi, how is this issue progressing? Maybe the trim function http://docs.wand-py.org/en/0.3-maintenance/wand/image.html#wand.image.Image.trim might help!

Oct 19 '17 12:10 murilobsd

Thanks for sharing it with us, @murilobsd ;)

In addition, I would like to suggest that whoever implements this take a look at these libraries for uploading the images :)

https://pypi.python.org/pypi/python-tumblpy/1.1.4
https://github.com/Imgur/imgurpython

Oct 19 '17 16:10 silviodc

Why not upload the image to Twitter? That keeps Twitter as the only external service dependency.

https://developer.twitter.com/en/docs/media/upload-media/api-reference/post-media-upload
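A minimal sketch of that idea, assuming the bot posts through python-twitter (the library whose `PostUpdate` docs are linked at the top of this issue). `PostUpdate` takes a `media` argument that can be a local file path, and the library then handles the upload to Twitter itself. The function name `tweet_with_receipt` is hypothetical, and the `api` object is passed in so a stub can replace it in tests.

```python
def tweet_with_receipt(api, status, image_path=None):
    """Post `status`, attaching the image at `image_path` when one is given.

    `api` is assumed to be a `twitter.Api` instance from python-twitter,
    or any object exposing a compatible `PostUpdate` method.
    """
    if image_path is None:
        # No receipt image available: fall back to a text-only tweet.
        return api.PostUpdate(status)
    # python-twitter uploads the file to Twitter and attaches it to the tweet.
    return api.PostUpdate(status, media=image_path)
```

With real credentials this would be called with something like `tweet_with_receipt(twitter.Api(...), text, "receipt.png")`, so no third-party image host is involved.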

Nov 09 '17 15:11 CauanCabral

Why not upload the image to Twitter?

I think this is better/easier/simpler than using a third-party service for hosting images (unless @silviodc has other uses for the storage in mind). What's needed, IMHO, is an implementation such as:

1. Get the reimbursement data needed to build the receipt URL (applicant_id, year and document_id) and concatenate it into the receipt URL
2. Try to fetch the PDF
3. If that succeeds, convert it to PNG
4. Crop it
5. Add it to the tweet with the API @CauanCabral linked
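The first two items of the list above could be sketched as follows. This is only an illustration: the URL template is a placeholder (the real receipt URL pattern is not given in this thread), and the names `receipt_url` and `fetch_receipt` are hypothetical. The download function is a parameter so that it can be swapped for a stub.

```python
from urllib.request import urlretrieve

# Placeholder template: an assumption for this sketch, not the real pattern.
RECEIPT_URL = "https://example.org/receipts/{applicant_id}/{year}/{document_id}.pdf"


def receipt_url(applicant_id, year, document_id, template=RECEIPT_URL):
    """Build the receipt URL from the three reimbursement fields."""
    return template.format(
        applicant_id=applicant_id, year=year, document_id=document_id
    )


def fetch_receipt(reimbursement, target_path, download=urlretrieve):
    """Try to download the receipt PDF for a reimbursement (a dict).

    Returns `target_path` on success, or None when the download fails,
    so the caller can fall back to a tweet without an image.
    """
    url = receipt_url(
        reimbursement["applicant_id"],
        reimbursement["year"],
        reimbursement["document_id"],
    )
    try:
        download(url, target_path)
    except OSError:
        # urllib's URLError and HTTPError are both subclasses of OSError.
        return None
    return target_path
```

Returning None on failure keeps the "if it succeeded" branch of the list explicit in the caller.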

Nov 09 '17 15:11 cuducos

Hi! Does anyone need some help with this issue?

Jan 24 '18 12:01 rodolfolottin

AFAIK there's no one working on that, @rodolfolottin – make yourself at home : )

Jan 24 '18 13:01 cuducos

Ok @cuducos . I'll give it a try.

Jan 24 '18 13:01 rodolfolottin

So, I have some doubts about how to test this functionality. The first one is: how can I get some data, given that the tests use mocks? I know that I can get the PDF directly from the camara’s website, but I also want to see the data from each reimbursement tuple.

Another one is: what should my tests test? I get that I should test the tweet content that is going to be posted, but what about the fetched PDF? And the blank area that I have to crop, how can/should I test that? Should I use some example PDF as a fixture?

Many thanks!

Jan 29 '18 14:01 rodolfolottin

Hi @rodolfolottin, let me recap the roadmap drafted above:

1. Get the reimbursement data needed to build the receipt URL (applicant_id, year and document_id) and concatenate it into the receipt URL
2. Try to fetch the PDF
3. If that succeeds, convert it to PNG
4. Crop it
5. Add it to the tweet with the API @CauanCabral linked

Given these steps, this is my 2c:

In steps 1 and 5 we're responsible for generating the right calls to external services, but not for managing the calls themselves. What I mean is that:

In step 1 we must assert we're generating the proper URL to download the PDF and passing it to the download function (for example, urllib.request.urlretrieve)
In step 5 we must assert we're properly calling the Twitter API with the image attached

That said, I'd say that in step 1 we can mock the download method and:

Assert it's called with the proper URL
Use a fixture as its response, so we have a real PDF file to test steps 2, 3 and 4

Then we must mock the Twitter API call and assert that we're calling it with the image as an attachment.
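As a rough sketch of the two assertions described above, using `unittest.mock` from the standard library. Everything under test here (`post_receipt`, the URL template, the tweet text) is hypothetical and defined inline only to keep the example self-contained; the conversion step is mocked as well, whereas a fuller test would have the download mock hand back a fixture PDF so that conversion and cropping run on a real file.

```python
import unittest
from unittest.mock import Mock

# Hypothetical template and function under test, for illustration only.
URL = "https://example.org/receipts/{applicant_id}/{year}/{document_id}.pdf"


def post_receipt(reimbursement, download, api, convert):
    """Download the receipt PDF, convert it and tweet it with the image."""
    pdf_path = "receipt.pdf"
    download(URL.format(**reimbursement), pdf_path)
    image_path = convert(pdf_path)
    api.PostUpdate("Suspicious reimbursement", media=image_path)


class TestPostReceipt(unittest.TestCase):
    def setUp(self):
        self.download = Mock()  # stands in for e.g. urlretrieve
        self.api = Mock()  # stands in for the Twitter API client
        self.convert = Mock(return_value="receipt.png")
        reimbursement = {"applicant_id": 1, "year": 2018, "document_id": 7}
        post_receipt(reimbursement, self.download, self.api, self.convert)

    def test_download_is_called_with_receipt_url(self):
        self.download.assert_called_once_with(
            "https://example.org/receipts/1/2018/7.pdf", "receipt.pdf"
        )

    def test_tweet_has_image_attached(self):
        self.api.PostUpdate.assert_called_once_with(
            "Suspicious reimbursement", media="receipt.png"
        )
```

Neither test touches the network: the only things checked are the arguments our code hands to the external services.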

Does that make sense to you all?

Jan 29 '18 15:01 cuducos

Sorry for the late answer.

Yeah, @cuducos . Thanks again!

Now I'm working on cropping the image using Wand, but it hasn't been easy. My first approach was to try to crop the image based on its background color. As the image is itself a scan, the whole background of the image has the same white color. I've been looking for related problems, but the ones I've found generally involve two different colors, which makes it easier to tell the regions apart.

Edit: just thinking out loud here, but maybe, IDK, I could scan the rows and columns of the image and crop based on the presence of any color other than white.
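The row and column scan described in the edit could look roughly like this. It is a pure-Python sketch over a grayscale image given as a list of rows (values 0 to 255); the names `content_box` and `crop` and the threshold of 250 are assumptions made for illustration.

```python
def content_box(pixels, white_threshold=250):
    """Find the smallest box containing every non-white pixel.

    `pixels` is a list of rows of grayscale values (0 = black, 255 = white).
    Returns (left, top, right, bottom) with right and bottom exclusive,
    or None when the whole page is white.
    """
    # Rows that contain at least one pixel darker than the threshold.
    rows = [
        y for y, row in enumerate(pixels)
        if any(value < white_threshold for value in row)
    ]
    if not rows:
        return None
    # Columns that contain at least one pixel darker than the threshold.
    cols = [
        x for x in range(len(pixels[0]))
        if any(row[x] < white_threshold for row in pixels)
    ]
    return cols[0], rows[0], cols[-1] + 1, rows[-1] + 1


def crop(pixels, box):
    """Cut `pixels` down to the given (left, top, right, bottom) box."""
    left, top, right, bottom = box
    return [row[left:right] for row in pixels[top:bottom]]
```

Note that a single stray dark pixel left by the scanner would enlarge the box, so a real implementation would probably need some tolerance for noise.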

Feb 04 '18 23:02 rodolfolottin

Hi @rodolfolottin,

I see two non-exclusive possibilities here:

Ask people in the Telegram group if they have any experience in automatically cropping scanned images (because scanning always leaves some stray pixels here and there, and I think a simple approach based on color won't work)
Baby steps: we put this in production with the uncropped image and add cropping as a feature later ; )

Feb 05 '18 13:02 cuducos

Hi guys,

One question about the crop of images.

Doesn't the function mentioned by @murilobsd (trim) work?

trim(color=None, fuzz=0)
Parameters:
color (Color) – the border color to remove. If it’s omitted, the top left pixel is used by default
fuzz (numbers.Integral) – defines how much tolerance is acceptable to consider two colors as the same

PS: In your case you would use it without color and set fuzz empirically.
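A hedged sketch of that suggestion with Wand. It assumes, based on my reading of the Wand docs, that `fuzz` is expressed in color-channel units (from 0 up to the image's `quantum_range`) rather than as a percentage, so a small helper converts one into the other. The names `fuzz_from_percent` and `trim_scan` are hypothetical, and the 10% default is only a starting point for experiments.

```python
def fuzz_from_percent(percent, quantum_range):
    """Convert a tolerance in percent to Wand's fuzz units (an assumption)."""
    return int(quantum_range * percent / 100)


def trim_scan(source_path, target_path, percent=10):
    """Trim the border of a scanned image and save the result.

    No color is passed to trim, so the top left pixel is used as the
    border color, as described in the parameter docs above.
    """
    # Imported here so the helper above works without ImageMagick installed.
    from wand.image import Image

    with Image(filename=source_path) as image:
        image.trim(fuzz=fuzz_from_percent(percent, image.quantum_range))
        image.save(filename=target_path)
```

Calling `trim_scan("receipt.png", "cropped.png", percent=5)` and inspecting the output for a few different values is one way to pick the tolerance empirically.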

Feb 05 '18 13:02 silviodc

Hi @silviodc and @cuducos . Thanks for your help.

@cuducos, since cropping the image was the hard part for me, I decided to start there and run some tests. Because of that, I don't have the other part done yet, but I can work on finishing it.

@silviodc, in my tests with this function I was using the white color and I didn't get what I expected. For this image, the more I increase the fuzz value, the more of the image (the invoice) gets cropped. In both cases, with and without the white color, I got the same results. And since the invoice is sometimes not in the center of the scanned image, I didn't feel confident going on with this approach. As an example, I am attaching a cropped image with a fuzz value of 50%.

Here it is.

I'm taking @cuducos's advice of doing baby steps, and I will worry about the cropping function later.

Feb 05 '18 23:02 rodolfolottin

Hi @rodolfolottin Thanks for the feedback. Maybe this weekend I will try to combine some edge detection and crop... I will let you know if it works.

Feb 05 '18 23:02 silviodc

Hey, today I asked a friend who works with image processing for help, and he suggested using OpenCV for that.

The response: https://twitter.com/begnini/status/960547129264615425
Related StackOverflow link: https://pt.stackoverflow.com/a/265916

Both are in Portuguese.

Feb 06 '18 00:02 CauanCabral

Hi,

my friend @CauanCabral pointed me to this issue and I played a little with the documents. I work with digitized documents and I know some are hard to manipulate, so what I made is good, but not pixel perfect.

As a proof of concept, I downloaded 100 PDFs from the jarbas.serenata.ai home page and extracted all the images from them with pdfimages. After that, I processed these images.

You can see the result here: https://github.com/begnini/document_crop/blob/master/crop.md. The code is in the same repository (https://github.com/begnini/document_crop/blob/master/crop.py).

I'll improve the documentation later, but if you have any questions, feel free to ask me.

Feb 06 '18 01:02 begnini
