vimac
Linux port using a neural net
I would be interested in a Linux port. Since the accessibility API (used to retrieve button locations) is not available on Linux, one needs something else to gather the button locations.
I'm thinking about building a small neural net performing object detection to retrieve the button locations. Recent advances have made lightweight, CPU-runnable networks possible, with acceptable detection performance and input-to-output latency. Example: https://github.com/ultralytics/ultralytics

Alternatively, since bounding boxes are not really needed (only one coordinate per button), a segmentation-style neural net trained to output a heatmap of button locations could be another approach, kind of like this paper (ignoring the segmentation part, of course): https://ieeexplore.ieee.org/abstract/document/8593678/
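To make the object-detection variant concrete, here is a minimal, untested sketch of what inference could look like, assuming a YOLO model fine-tuned on a single "button" class (the `buttons.pt` weights are hypothetical; no such model exists yet):

```python
# Hedged sketch: screenshot the desktop, run a (hypothetical) button detector,
# and reduce each bounding box to a single click coordinate.
import mss
import numpy as np
from ultralytics import YOLO

model = YOLO("buttons.pt")  # hypothetical weights fine-tuned on a "button" class

with mss.mss() as grabber:
    shot = grabber.grab(grabber.monitors[1])            # primary monitor
frame = np.ascontiguousarray(np.array(shot)[:, :, :3])  # BGRA -> BGR

results = model(frame)[0]
# Only one coordinate per button is needed, so take each box's center.
centers = [((x1 + x2) / 2, (y1 + y2) / 2)
           for x1, y1, x2, y2 in results.boxes.xyxy.tolist()]
print(centers)
```

The heatmap variant would replace the box decoding with a local-maxima pass over the predicted heatmap instead.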
The most challenging aspect would be collecting a good dataset of various GUIs with labeled buttons. Web scraping and HTML parsing could be used to find the button locations, giving a big dataset for cheap (sketched below).

However, one would then only have "web-looking" buttons and no "desktop-looking" buttons. One could use macOS GUIs + the accessibility API to further diversify the dataset.
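For the scraping idea, something like the following could generate (screenshot, button-box) training pairs cheaply. A rough sketch using Playwright; the URL, selectors, and output format are placeholders:

```python
# Hedged sketch: render a page, screenshot it, and record the bounding box of
# every clickable element as a training label.
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page(viewport={"width": 1280, "height": 800})
    page.goto("https://example.com")          # placeholder; crawl many URLs
    page.screenshot(path="sample.png")

    labels = []
    for el in page.query_selector_all("button, a, input[type=submit]"):
        box = el.bounding_box()               # None if the element is hidden
        if box:
            labels.append((box["x"], box["y"], box["width"], box["height"]))
    print(labels)                             # pair with sample.png for training
    browser.close()
```

Varying viewport sizes, themes, and sites across the crawl would presumably help with the diversity problem mentioned above.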
The advantage of such an approach is that the tool "should" be compatible with all apps out of the box. What are your thoughts on such an approach?
Not sure how well it would work. I don't know the first thing about ML. I'm exploring automating detection of buttons from screenshots in iOS apps (just the data collection part) right now for a freelance gig, so I'll have to see the results. Not sure if I would support Linux either (given the userbase size + willingness to pay for stuff).
Relevant link, haven't tested yet: https://github.com/phil294/vimium-everywhere
Last year I made this: https://github.com/garywill/vimouse
It uses OpenCV to do vision-recognition-based clicking.
The screenshot may look ugly right now; the algorithm and parameters may need changing. I haven't implemented any AI yet. Currently it just finds any "object" on screen (at least it finds almost every button, lol) (see the sketch below).
It is at a very, very early stage.
Cross-platform & lightweight. I made it in ~300 lines of Python code.
BTW, I listed many similar projects in that readme
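For the curious, the contour-finding approach described above could look roughly like this (a sketch of the general technique, not vimouse's actual code; the Canny and area thresholds are arbitrary and would need tuning):

```python
# Sketch of contour-based "object" finding: edge-detect a screenshot and
# treat each sufficiently large contour's bounding box as a click target.
import cv2
import mss
import numpy as np

with mss.mss() as grabber:
    frame = np.array(grabber.grab(grabber.monitors[1]))  # BGRA screenshot

gray = cv2.cvtColor(frame, cv2.COLOR_BGRA2GRAY)
edges = cv2.Canny(gray, 50, 150)                         # thresholds need tuning
contours, _ = cv2.findContours(edges, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)

# Drop tiny specks; the area cutoff is arbitrary.
targets = [cv2.boundingRect(c) for c in contours if cv2.contourArea(c) > 100]
print(len(targets), "candidate click targets")
```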