rumors-line-bot Collect image / video and upload to s3

Collect image / video and upload to s3

Open MrOrz opened this issue 7 years ago • 10 comments

No need to create an article. Still reply with "We don't support non-text messages yet," but store the content so we can analyze it later.

Apr 12 '17 13:04 MrOrz

Ask @godgunman to set up S3 and then give tokens to the hackathon participants.

Apr 26 '17 13:04 MrOrz

Images and videos with a lot of text: screenshot_20170623-213803 screenshot_20170623-213746

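I don't think we need to worry about doctored images piling up faster than we can debunk them, because producing a doctored image actually costs more than altering text, and may even cost more than replying to it. Given how widely images and videos circulate, I think writing replies for them is worth it.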

Jun 29 '17 17:06 MrOrz

For video subtitle extraction:

"End-to-End Subtitle Detection and Recognition for Videos in East Asian Languages via CNN Ensemble with Near-Human-Level Performance" https://arxiv.org/abs/1611.06159

"Text Detection, Tracking and Recognition in Video: A Comprehensive Survey" http://ieeexplore.ieee.org/document/7452620/?reload=true

Jan 10 '18 06:01 MrOrz

Someone from Hong Kong collected Taiwanese TV dramas on YouTube to train a subtitle extractor XD

https://towardsdatascience.com/automatic-speech-recognition-data-collection-with-youtube-v3-api-mask-rcnn-and-google-vision-api-2370d6776109

Nov 27 '18 06:11 MrOrz

hey 👋, I found a simpler way to do the comparison.

for example:

{ "message": {"type": "image", "id": "9165680563567" ...} }

when message type is image, do

client.getMessageContent(9165680563567).then(...)

So in the callback we can probably get response.body as a binary string. I ran a simple experiment with the same picture forwarded from different sources: the message ids changed (for example, from 9165680563567 to 9165672759911), but the response.body contents matched 100%.

IMO, pictures, videos, and recordings are (mostly) harder to alter. Since we can call getMessageContent with the message id, the good news is that we can compare them easily.
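A minimal sketch of the byte-for-byte comparison described above, assuming the content returned by getMessageContent has already been collected into a Node.js Buffer. The helper names (`contentDigest`, `isSameContent`) are hypothetical and not part of the LINE SDK, and using SHA-256 digests is just one option, picked so that only a short string has to be kept for later comparisons.

```javascript
const crypto = require('crypto');

// Hypothetical helper: hex SHA-256 digest of the raw message content.
function contentDigest(buffer) {
  return crypto.createHash('sha256').update(buffer).digest('hex');
}

// Hypothetical helper: true when two downloads are byte-for-byte identical.
function isSameContent(bufferA, bufferB) {
  return contentDigest(bufferA) === contentDigest(bufferB);
}

// Stand-ins for the bodies fetched for two different message ids.
const first = Buffer.from([0x89, 0x50, 0x4e, 0x47, 0x01]);
const forwarded = Buffer.from([0x89, 0x50, 0x4e, 0x47, 0x01]);
const edited = Buffer.from([0x89, 0x50, 0x4e, 0x47, 0x02]);

console.log(isSameContent(first, forwarded)); // true
console.log(isSameContent(first, edited)); // false
```

This only detects exact copies; a single changed byte yields a different digest.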

So the second concern is how to store this in the database. We may need to add new attributes to articles, such as article_type, and we could keep the binary data as text. Rendering the binary image on the website should be easy as well!

I can help develop this feature for the LINE bot, website, and database, but I need to set up these projects locally first. Do you think it's doable, @MrOrz?

Jan 11 '19 13:01 CarolHsu

Uh, I only just noticed this issue is from 2017, so what's the current status...? XD

Jan 11 '19 13:01 CarolHsu

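The current status is that nothing is moving XDDD

I assumed image quality would keep dropping with each forward, but it turns out it's the same binary.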

I think the file content should probably still go to S3 or somewhere similar, and the DB only needs to keep a file hash / checksum / perceptual hash.
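A checksum only matches identical files, while a perceptual hash is usually compared by how many bits differ. A minimal sketch of that comparison, assuming both hashes are equal-length hex strings; `hammingDistance`, `looksSimilar`, and the `MAX_DISTANCE` threshold are hypothetical and would need tuning against real images.

```javascript
// Hypothetical helper: number of differing bits between two equal-length hex strings.
function hammingDistance(hexA, hexB) {
  if (hexA.length !== hexB.length) {
    throw new Error('hashes must have the same length');
  }
  let distance = 0;
  for (let i = 0; i < hexA.length; i++) {
    // XOR one hex digit (4 bits) at a time and count the set bits.
    let diff = parseInt(hexA[i], 16) ^ parseInt(hexB[i], 16);
    while (diff > 0) {
      distance += diff & 1;
      diff >>= 1;
    }
  }
  return distance;
}

// Hypothetical threshold; a real value would need tuning on actual images.
const MAX_DISTANCE = 4;

function looksSimilar(hexA, hexB) {
  return hammingDistance(hexA, hexB) <= MAX_DISTANCE;
}

console.log(hammingDistance('ff00', 'ff01')); // 1
console.log(looksSimilar('ff00', '00ff')); // false
```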

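As for article.text, our mapping is designed so that Elasticsearch tokenizes article.text into bigrams and builds the index from them, so it probably can't hold binary data (it would likely complain). If we care about search, putting the "text content" of the image or video into article.text would probably give us a better chance of finding images or videos with the same or similar content.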
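To illustrate why bigram indexing suits text rather than binary data, here is a rough sketch of what bigram tokenization does to a string. This is not Elasticsearch's actual analyzer; `toBigrams` is a hypothetical helper that ignores whitespace and punctuation handling.

```javascript
// Hypothetical helper: split text into overlapping two-character tokens.
function toBigrams(text) {
  const chars = Array.from(text); // iterate by code point, not UTF-16 unit
  const tokens = [];
  for (let i = 0; i + 1 < chars.length; i++) {
    tokens.push(chars[i] + chars[i + 1]);
  }
  return tokens;
}

console.log(toBigrams('謠言圖片')); // [ '謠言', '言圖', '圖片' ]
```

Two texts that share many such tokens can be matched by search, which is only meaningful when the field holds readable text such as OCR output.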

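I think the most troublesome part of handling multimedia files is actually on the editor side. That includes ways for editors to review images easily (for example, running a preliminary OCR pass first, or even having the person who submits the image check the OCR result in the LINE bot before sending it), ways to review videos easily (fact-checking video content is really exhausting, and processing video in code is exhausting too... orz it would be great to have a transcription tool or an automatic screenshot tool), and so on.

It would be worth thinking about what supporting measures could take some of this load off the editors.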

Jan 11 '19 16:01 MrOrz

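Since this is a fairly big feature, I think it should be split into several phases so it can actually move forward. In my view, uploading to S3 as a file and then doing various conversions is less straightforward and also introduces more comparison error. If the binary matches 100%, there's no need for string similarity comparison; we can just brute-force it with ==.

Assuming forwarded images are exactly identical is actually reasonable. LINE wouldn't want its own DB to blow up because users forward images endlessly, so the string should be the same copy, and there's no reason to do extra converting work for this action (?)

My idea is to start with images and add a column that stores the binary. This is feasible: the binary I grabbed yesterday was about 7xxxx in length, roughly 78 KB. If it really has to be kept separate from the existing database, I'd still suggest storing it in a separate binary column; putting it on S3 actually makes things more complicated.

Also, I think the order should be image rumors -> video rumors -> audio rumors. Going step by step is more reasonable.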

Approach for images

phase 0 Extend the schema: add a binary column to store the image binary, and add article type = 'image'

phase 1 [line bot] Support creating images into this column (it looks like the current code stores them to S3) [website] Render the image

base64_image = Base64.encode64(binary_image)

<img src="data:image/png;base64,<BASE64_IMAGE_STRING>">

In phase 1, for image rumors, you can see there's no reason at all to upload a file to S3.
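For reference, the Base64 snippet above could look like this in Node.js, which is what the LINE bot runs on. This is a sketch only: `toDataUri` is a hypothetical helper, and the MIME type is assumed to be known by the caller.

```javascript
// Hypothetical helper: build a data URI from raw image bytes.
function toDataUri(buffer, mimeType) {
  return `data:${mimeType};base64,${buffer.toString('base64')}`;
}

const bytes = Buffer.from('hello'); // placeholder bytes, not a real image
console.log(toDataUri(bytes, 'image/png')); // data:image/png;base64,aGVsbG8=
```

Note that Base64 output is about one third larger than the input, so a 78 KB image becomes roughly 104 KB of text.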

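phase 2 Reduce the editors' load: add a cronjob that periodically (daily, or several times a day if needed) runs OCR on newly uploaded images and stores the result in the text column.

Enhancement Allow some editors (for example, higher-level ones?) to correct OCR text that may be wrong ...etc

So if what I have in mind is reasonable and the image part works, video or audio might be solvable in a similar way.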

Jan 12 '19 02:01 CarolHsu

Case study from Cofact Thailand

Chatbot side

Code changes: https://github.com/opendream/rumors-line-bot/commit/22accd49813a36bd6220ecac05e5adcce3c59359

Uses image-hash, which is a perceptual hash
Search / update flow when receiving image
1. Get image file from LINE and calculate hash
2. Use hash to query Google Drive
3. Upload to Google Drive if it does not exist; returns fileData from Google Drive
4. Construct input string as $image__<hash>__<fileData.id> and continue search flow (which will look up $image__<hash>__<fileData.id> in elasticsearch)
It stores a magic string ($image__<hash>__<fileData.id>) in the database and uses the same string to search the database
For videos, it takes a screenshot at the 6th second and uses the hash of that image as the video's hash
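The magic string format described above can be sketched as follows. The helper names and the regular expression are illustrative and not taken from the linked commit, and the sketch assumes the hash itself contains no underscore.

```javascript
// Hypothetical helpers; the names are illustrative and not taken from the commit.
// Assumes the hash part contains no underscore.
const IMAGE_TOKEN = /^\$image__([^_]+)__(.+)$/;

function buildImageToken(hash, fileId) {
  return `$image__${hash}__${fileId}`;
}

function parseImageToken(text) {
  const match = IMAGE_TOKEN.exec(text);
  return match ? { hash: match[1], fileId: match[2] } : null;
}

const token = buildImageToken('c3f0a1', 'file_123'); // made-up hash and file id
console.log(token); // $image__c3f0a1__file_123
console.log(parseImageToken(token)); // { hash: 'c3f0a1', fileId: 'file_123' }
console.log(parseImageToken('plain text message')); // null
```

Because the same string is both stored and searched, an exact hash match is required for a lookup to hit.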

Website side

Code changes: https://github.com/opendream/rumors-site/commit/229904fe0b8e93334833fe8feab7fd5ced829b1a

Uses Google Drive directly to host images & videos

Aug 02 '21 04:08 MrOrz

Design doc & alternatives in different perspective https://g0v.hackmd.io/aJqHn8f5QGuBDLSMH_EinA

Aug 25 '21 17:08 MrOrz
