WebCord icon indicating copy to clipboard operation
WebCord copied to clipboard

Add keybind socket

Open epetousis opened this issue 2 years ago • 3 comments

Because Wayland forbids non-root processes from listening to system-wide keypresses, Discord and WebCord are unable to listen for keybinds when the window doesn't have focus. This pull request adds a socket server that allows for sending a payload to forward keybinds to WebCord, so that you may use a privileged process to handle shortcuts like sway or swhkd. An example of activating the push-to-talk key is:

echo -e 'PUSH_TO_TALK\x00\x01\x99\x00' | nc -U $XDG_RUNTIME_DIR/webcord-keybinds.sock

The payload begins with one of the keybind actions that Discord supports, which is terminated with a null (\x00) byte. The byte that follows is a boolean value - a true value (\x01) indicates a keyDown event, and a false value (\x00) indicates a keyUp event. Note that keybinds that aren't toggle-based, like PUSH_TO_TALK, generally only respond to keyUp events.

This feature works by pulling the user's keybinds, finding the key set for a particular command, and then simulating a keypress through setInputEvent on the main Electron window. While the implementation sounds weird, it's the best way to go about it without injecting code into the webpage to interface directly with the keybind handlers - something that I understand you want to avoid with WebCord for user safety reasons.

An extra package was added called keycode to translate the JavaScript keycodes into accelerators, that are then passed to setInputEvent. I've pinned this package to the latest version just in case - if you'd like this unpinned or removed entirely then let me know.

While the changes in this pull request allow for sending key events for any keybindings that have been set, of course there's no actual mechanism to set any of these keybinds apart from PUSH_TO_TALK, because Discord disables the keybinding interface on the web version. I was thinking of potentially adding an option to configure keybindings via the socket interface, but opted against it for this pull request. At least now, any future global keybinds interface can be made to work for both Wayland and X11.

If there's any issues or concerns with the socket payload format, or I missed something else in the implementation, feel free to suggest changes and I can sort.

epetousis avatar Sep 10 '22 16:09 epetousis

To be honest, the current implementation of this is too complex to be pushed undocumented into master, I may work on providing documentation for it and maybe the documentation for Wayland in general. I also dislike announcing yet another dependency and that privileged process sends key code to the WebCord rather than the opposite. Personally, I would also change a packet structure (feel free to discuss about it through) for this reason and the way WebCord communicates with the clients.

Below are my suggestions, they're meant to regularly change – I'll add more details to it based on what I'll decide.


A currently proposed (generic, maybe incomplete) structure of the binary packets:

Offset (byte) Type Description
0 UTF-8 encoded string: WKB. Packet header
3 Raw byte Message type
4 Raw byte Command
5 Multiple / other Data (undecided yet how to encode; might only exist in server replies).

Message types:

  • 0x00reserved, possibly for some kind of the authentication (like verifying if the process is privileged?); it doesn't have to be implemented right now, but it is better to keep a space for it.
  • 0x01PUSH_TO_TALK

Commands:

  • 0x00 – query, for push-to-talk messages: it sends back to client the keybind (either as a string or key code, depending how we would decide the returned data should be encoded).
  • 0x01 – post, for push-to-talk messages: it sends an empty message to server that will communicate back and reply with the result (0x00 means success, other codes mean error).

A proposed human-friendly communication diagram (I hope it clearly shows what I want to achieve):

sequenceDiagram
    participant C as Socket client;
    participant S as Socket server;
    participant W as Website (IPC);
    activate C;
    activate S;
    Note over C,S: Authentication (maybe?);
    deactivate S;
    C->>S: What is your keybind for N?;
    activate S;
    S->>W: What is registered keybind for N?
    activate W;
    Note over W: Query localStorage
    W->>S: The registered keybind is XYZ.;
    deactivate W;
    S->>C: The registered keybind is XYZ.;
    deactivate S;
    Note over C: Listen to keybind
    C->>S: Emit keybind for N.;
    activate S;
    Note over S: Emit key event
    S->>C: Done! Result code: X.;
    deactivate S;
    deactivate C;
ℹ️ The communication between socket server (main process) and website (renderer process) is transmitted via Electron's IPC and it won't be encoded in the same way as packets sent between socket server and client.

SpacingBat3 avatar Sep 10 '22 20:09 SpacingBat3

I'm a fan of the proposed packet structure. In regards to the communication spec, to clarify - when actually triggering a keybind, is this still done by simply sending a keybind type? If that's the case, it would make sense to use the data bytes to specify "keyUp" or "keyDown" - while it's probably good to have a way to simply toggle a keybind with one packet as well, setting keybind in programs like sway or swhkd support a key release event, and I'd want to make sure that the key press and key release events correspond to the same events in WebCord.

If your proposal is that Discord's keybinds are triggered by sending the corresponding keycodes (rather than the name of a keybind e.g PUSH_TO_TALK) to WebCord, I'm not sure if I feel great about the user being able to forward arbitrary keys to WebCord. But I'm not totally sure if this is what you meant, so forgive me if I'm off track.

Apart from that, this makes much more sense, and creating an enum based off of keycode types from Discord would be smart as well, something like:

enum DiscordKeybinds {
  TOGGLE_PRIORITY_SPEAKER = 1,
  PUSH_TO_TALK,
  PUSH_TO_TALK_PRIORITY,
  PUSH_TO_MUTE,
  TOGGLE_MUTE,
  TOGGLE_DEAFEN,
  TOGGLE_VOICE_MODE,
  TOGGLE_OVERLAY,
  TOGGLE_OVERLAY_INPUT_LOCK,
  OVERLAY_ACTIVATE_REGION_TEXT_WIDGET,
  TOGGLE_GO_LIVE_STREAMING,
  TOGGLE_STREAMER_MODE,
  NAVIGATE_BACK,
  NAVIGATE_FORWARD,
  SOUNDBOARD,
  SOUNDBOARD_HOLD
}

I also dislike announcing yet another dependency and that privileged process sends key code to the WebCord rather than the opposite.

I understand the dependency being an issue, I'm not sure if there's a better way to take JavaScript keycodes and convert them into the accelerators that sendInputEvent requires, but if we can do it library free then I'll happily go for that. In terms of the privileged process sending keycodes to WebCord, I'm not quite sure what you mean here, could you clarify? A process sends a keybind "type" (e.g PUSH_TO_TALK) to WebCord, then WebCord pulls the specific keybind and its corresponding bound keycode(s) from the user's localStorage, and then uses sendInputEvent. The external process currently doesn't even know what the corresponding keycode(s) are (a limitation you've addressed, of course) - I'm not sure if this is the privileged process you were referring to.

Sorry for the massive response, and I'll mark this pull request as a draft for the time being until we're both happy with the way we want this socket implementation to look.

epetousis avatar Sep 11 '22 02:09 epetousis

I'm a fan of the proposed packet structure. In regards to the communication spec, to clarify - when actually triggering a keybind, is this still done by simply sending a keybind type?

When "registering", WebCord knows the keybind that user specified in localStorage. When client wants to register the same keybind as in Discord, it would query WebCord for that keybind to be actually capable of doing that. However, I actually forgot to show there how socket should behave when the keybing changes. That adds yet another command, sync: WebCord would send a message back to connected client each time when the user changes the keybind in Discord.

But going back to question: I want from sent packets to the server/socket to represent information: I want you to emit a keybind for PUSH_TO_TALK, whatever it is rather than Emit CTRL+V in the renderer. The format of the keybind sent back to client is undecided, yet I think it might be dependant on the arrays I've found in localStorage.

I understand the dependency being an issue, I'm not sure if there's a better way to take JavaScript keycodes and convert them into the accelerators that sendInputEvent requires, but if we can do it library free then I'll happily go for that.

With my communication structure, that should be possible, as Discord saves keybinds as array of tuples of three values, where the second value actually seems to be a keycode (not sure about two others through, which stands for a question if we really need to care about them).

SpacingBat3 avatar Sep 11 '22 08:09 SpacingBat3

would it not be better to just use https://github.com/flatpak/xdg-desktop-portal/pull/711 ? like we gotta build of master branch but better this is quite alot

bgkillas avatar Sep 22 '22 14:09 bgkillas

@bgkillas last time I took a look at the initial proposal for this, it seemed that the Flatpak maintainers weren't interested in adding separate shortcuts for key releasing - but it looks like this may have changed at some point? If so, then it definitely renders this pull request obsolete.

epetousis avatar Sep 22 '22 14:09 epetousis

Took another look at the pull request, the author added "activated" and "deactivated" signals for shortcuts which allow us to determine when the PTT key is pressed and released. @SpacingBat3 I'll be closing this pull request, I see no reason to work on this any further since a much cleaner solution can be implemented with xdg-desktop-portal (not by us, to be clear - this is something that should be done in Electron/Chromium). Let me know if you disagree and would prefer to pursue this option.

epetousis avatar Sep 24 '22 08:09 epetousis