
RFC: Voice Receive API Design/Usage

Open imayhaveborkedit opened this issue 7 years ago • 92 comments

As I have progressed through writing and redesigning this feature a few times, Danny and I have come to a conclusion regarding the inclusion of voice receive in discord.py. Discord considers voice receive a second-class citizen as a feature and will likely never officially support or document it. With no such guarantees, all development is based on reverse engineering and is liable to be broken by Discord at any point.

The conclusion is that voice receive as a discord bot feature does not belong in the library. The alternative is to implement it in an extension module instead. See https://github.com/Rapptz/discord.py/pull/9288#issuecomment-1785942942 for more details.

This is exactly what I've been working on. https://github.com/imayhaveborkedit/discord-ext-voice-recv/

The foundational work is largely complete and the code is functional, but as stated in the readme it's not quite complete, not guaranteed stable, and subject to change. Basic documentation is done, but more comprehensive docs and examples are on the todo list. It also requires v2.4 of discord.py (currently the master branch), which has not yet been released on PyPI at the time of writing.

Old issue content: This information is technically outdated, but a large amount of the design still applies.

Note: DO NOT use this in production. The code is messy (and possibly broken) and probably filled with debug prints. Use only with the intent to experiment or give feedback, although almost everything in the code is subject to change.

Behold the voice receive RFC. This is where I ask for design suggestions and feedback. Unfortunately not many people seem to have any idea of what their ideal voice receive api would look like so it falls to me to come up with everything. Should anyone have any questions/comments/concerns/complaints/demands please post them here. I will be posting the tentative design components here for feedback and will update them occasionally. For more detailed information on my progress see the project on my fork. I will also be adding an example soonish.

Overview

The main concept behind my voice receive design is to mirror the voice send api as much as possible. However, due to receive being more complex than send, I've had to take some liberties in creating some new concepts and functionality for the more complex parts. The basic usage should be relatively familiar:

vc = await channel.connect()
vc.listen(MySink())

The voice send api calls an object that produces PCM packets a Source, whereas the receive api refers to them as a Sink. Sources have a read() function that produces PCM packets, so Sinks have a write(data) function that does something with PCM packets. Sinks can also optionally accept opus data to bypass the decoding stage if you so desire. The write(data) function currently receives just a payload blob containing the opus data, PCM data, and RTP packet, mostly for my own convenience during development. This is subject to change later on.
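For illustration, a minimal custom sink might look like the sketch below. The AudioSink base class exists in the fork, but the payload attribute name (data.pcm) and the cleanup() hook are assumptions based on the description above, not a finalized interface:

import discord

class MySink(discord.AudioSink):
    def __init__(self):
        self.frames = []

    def write(self, data):
        # data is the payload blob described above: opus bytes,
        # decoded PCM, and the raw RTP packet.
        self.frames.append(data.pcm)

    def cleanup(self):
        # Invoked when listening stops; release resources here.
        self.frames.clear()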

The new VoiceClient functions are basically the same as the send variants, with listen() being the new counterpart to play().

Note: The stop() function has been changed to stop both playing and listening. I have added stop_playing() and stop_listening() for individual control.

Built in Sinks

For simply saving voice data to a file, you can use the built in WaveSink to write them to a wav file. The way I have this currently implemented, however, is completely broken for more than one user.

Note: Here lies my biggest problem. I currently do not have any way to combine multiple voice "streams" into one stream. The way this works is that Discord sends packets for all users on the same socket, differentiated by an id (the ssrc, from the RTP spec). These packets have timestamps, but each ssrc's timestamps start from a random offset. RTP has a mechanism where the reference time is sent in a control packet, but as far as I can tell, Discord doesn't send these control packets. As such, I have no way of properly synchronizing streams without excessive guesswork based on arrival time in the socket (unreliable at best). Until I can solve this there will be a few holes in the design, for example, how to record the whole conversation in a voice channel instead of individual users.
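For the curious, the fields in question live in the fixed 12-byte RTP header; a sketch of pulling them out (plain RTP per RFC 3550, not any discord.py api):

import struct

def parse_rtp_header(packet: bytes) -> dict:
    # Fixed 12-byte RTP header: flags, payload type, then the three
    # fields discussed above.
    first, pt, sequence, timestamp, ssrc = struct.unpack_from('>BBHII', packet)
    return {
        'version': first >> 6,    # always 2 for RTP
        'sequence': sequence,     # 16-bit counter, increments per packet
        'timestamp': timestamp,   # starts at a random offset per ssrc
        'ssrc': ssrc,             # stream id, mapped to a user
        'payload': packet[12:],   # (encrypted) opus data in Discord's case
    }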

Sinks can be composed much like Sources can (PCMVolumeTransformer+FFmpegPCMAudio, etc). I will have some built in sinks for handling various control actions, such as filtering by user or predicate.

# only listen to message.author
vc.listen(UserFilter(MySink(), message.author))

# listen for 10 seconds
vc.listen(TimedFilter(MySink(), 10))

# arbitrary predicate, could check flags, permissions, etc
vc.listen(ConditionalFilter(MySink(), lambda data: ...))

and so forth. As usual, these are subject to change when I go over this part of the design again.
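For illustration, a wrapper filter along these lines could be little more than a predicate check in front of write(). This is a hypothetical sketch, assuming the payload exposes the speaking user as data.user:

class ConditionalFilter:
    def __init__(self, destination, predicate):
        self.destination = destination
        self.predicate = predicate

    def write(self, data):
        # Forward the payload only if the predicate accepts it.
        if self.predicate(data):
            self.destination.write(data)

class UserFilter(ConditionalFilter):
    def __init__(self, destination, user):
        # A user filter is then just a specialized predicate.
        super().__init__(destination, lambda data: data.user == user)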

As mentioned before, mixing is still my largest unsolved problem. Combining all voice data in a channel into one stream is surely a common use case, and I'll do my best to try and figure out a solution, but I can't promise anything yet. If it turns out that my solution is too hacky, I might have to put it in some ext package on pypi (see: ext.colors).

For volume control, I recently found that libopus has a gain setting in the decoder. This is probably faster and more accurate than altering pcm packets after they've been decoded. Unfortunately, I haven't quite figured out how to expose this setting yet, so I don't have any public api to show for it.
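For reference, the knob in question is OPUS_SET_GAIN on the decoder, which takes a Q8 fixed-point dB value. A rough ctypes sketch of driving libopus directly, assuming you already have a decoder handle from somewhere (this is not part of any discord.py api):

import ctypes
import ctypes.util

OPUS_SET_GAIN_REQUEST = 4034  # constant from opus_defines.h

_opus = ctypes.CDLL(ctypes.util.find_library('opus'))

def set_decoder_gain(decoder_state, db: float) -> int:
    # Gain is expressed in Q8 fixed point: +6 dB -> 6 * 256 = 1536.
    gain_q8 = int(round(db * 256))
    return _opus.opus_decoder_ctl(decoder_state, OPUS_SET_GAIN_REQUEST,
                                  ctypes.c_int(gain_q8))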

That should account for most of the public api part that I've designed so far. I still have a lot of miscellaneous things to do, so no ETA. Again, if you have any feedback whatsoever, please make yourself known either here or in the discord server.

imayhaveborkedit avatar Feb 20 '18 02:02 imayhaveborkedit

My thoughts. As a starting point, yes, the API should provide separate PCM chunks for each member being listened to. If no filters are set, all members are listened to, including those joining the call after listening began.

To decode PCM properly, the library needs to put packets into order and identify lost packets. The packets carry an incrementing sequence number that can be used for this. This all implies buffering and some error handling, such as filling in lost data (with opus) and simply discarding packets that are received too late.

Each audio chunk gets a number specifying its position (in milliseconds) in relation to when the listening session began. These chunks can then be fed into a mixer to produce a single stream if desired. For simplicity, all chunks should be assigned into perfect 20ms slots (e.g. 40 and 60, not 36 and 51).

No special timing information should be necessary. Record the time whenever someone starts speaking; every following chunk can be placed exactly 20ms after the previous one. There will be five silent packets to signify they've stopped speaking. In case those are not received there should also be a timeout.
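A toy sketch of that bookkeeping, under the assumptions in this comment (20ms frames, re-anchoring only on the first packet after silence):

import time

FRAME_MS = 20  # each opus frame covers 20ms of audio

class SlotTracker:
    def __init__(self):
        self.session_start = time.monotonic()
        self.next_slot = {}  # ssrc -> next expected slot, in ms

    def place(self, ssrc: int) -> int:
        if ssrc not in self.next_slot:
            # First packet after silence: anchor to arrival time,
            # snapped to a perfect 20ms boundary.
            elapsed_ms = (time.monotonic() - self.session_start) * 1000
            slot = round(elapsed_ms / FRAME_MS) * FRAME_MS
        else:
            # Every following chunk sits exactly 20ms after the last.
            slot = self.next_slot[ssrc]
        self.next_slot[ssrc] = slot + FRAME_MS
        return slot

    def end_of_speech(self, ssrc: int):
        # On the five silence packets (or a timeout), reset so the next
        # packet is re-anchored to its arrival time.
        self.next_slot.pop(ssrc, None)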

If enough latency is added, the library should be able to give a reliable output under most conditions. With minimal latency, there will be more issues. There's no free lunch :)

Ruuttu avatar Feb 28 '18 19:02 Ruuttu

I've already had a go at writing all of the decoding and processing for this. I spent a lot of time writing classes for packet types discord doesn't send. The timestamp data in RTP packets starts from a random offset, and that offset is different per ssrc. RTCP packets have the reference time for calculating when these packets were created. Relying on gateway websocket events to sync up voice socket UDP packets is racy and not reliable, especially considering the gateway code can be blocked by user code and speaking events have no id (timestamp).

imayhaveborkedit avatar Feb 28 '18 22:02 imayhaveborkedit

Seems safe to bet that Discord makes no effort to synchronize speakers. Each stream is basically played "as soon as it's received" with the shortest buffer they think they can get away with.

If there's someone in the call from New Zealand and we're always making fun of him for laughing at jokes three seconds too late, I would expect the data coming from the Voice Receive API to reflect that. I would not expect it to "fix" the latency.

So I think you can track the local packet receive times and use some of those without shame. Again, most packets can actually be placed right after the previous one, ignoring the receive time. Never the first packet after silence tho, of course.

Ruuttu avatar Mar 01 '18 04:03 Ruuttu

Going to assume that this attempt has stalled?

Any idea on if anyone will try and implement it?

Also, a good api would just pass along the raw PCM that comes out of the opus library, much like the Mumble python api (another chat server that uses opus).

This should always be the same length unless discord is changing the opus config options on the fly.
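For reference, with Discord's fixed opus configuration (48kHz stereo, 16-bit, 20ms frames) every decoded chunk works out to the same size:

SAMPLE_RATE = 48000   # Hz
CHANNELS = 2
SAMPLE_WIDTH = 2      # bytes per 16-bit sample
FRAME_MS = 20

frame_bytes = SAMPLE_RATE * FRAME_MS // 1000 * CHANNELS * SAMPLE_WIDTH
assert frame_bytes == 3840  # 960 samples per channel -> 3840 bytes per chunk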

wgaylord avatar Feb 11 '19 17:02 wgaylord

I'm still working on it. The fact that this is an "unsupported" feature, my desire to make a sufficiently high level api for ease of use, and my sporadic motivation and inspiration all make for slow progress. It also seems that few people have useful input on this issue, meaning that I am basically on my own for the most part. If you want to see what I have so far, I keep my fork updated. There's still a lot to do and I haven't written docs or added an example yet, so make of it what you will. https://github.com/imayhaveborkedit/discord.py

imayhaveborkedit avatar Feb 11 '19 22:02 imayhaveborkedit

Welp, guess I am stuck using my own server with Mumble for my HamRadio Remote station.

wgaylord avatar Feb 12 '19 00:02 wgaylord

I see this fork is still getting regular updates.

You've mentioned a use-case, where it's possible to use a WaveSink to get raw PCM/Wav data. As it happens this is exactly the kind of thing I need.

Basically, my use-case is simply getting raw PCM data and passing it along to do some basic speech recognition (fwiw, a very basic speech-to-text bot for a deaf friend). I assume no audio mixing means there is no way to actually tell which packet comes from which user? Regardless, such a situation doesn't impact me greatly, since, assuming the (in my case, 2 or 3) users speak in a somewhat orderly fashion, my output would simply be the transcript of what has been said, no matter who said it. The only other thing I'd need in this case would be a way to detect when a user has stopped speaking, so I can pass along the file/buffer without words being cut out. Ideally, this could be done by storing the audio data in a buffer, but writing to a file and using that as input would work fairly well too.

Would it be possible to share a minimal piece of code that exemplifies the use-case you described, or provide any hints for the direction I should take in my implementation of this use-case?

Apfelin avatar Mar 18 '19 18:03 Apfelin

Don't worry, that use case is probably one of the two major cases that I expect people to have. WaveSink is specifically for writing data to a wav file, the point being that the built in wave module takes care of writing all the headers and such. The data you get in the first place is already PCM, so unless whatever flow you have requires a file on the filesystem, you don't need that one.

When I mention "mixing" I'm referring to combining the various streams of user voice data into a single combined stream. This is a problem I haven't quite figured out how to do properly yet since discord doesn't seem to provide the required RTCP packets necessary to synchronize the streams. If I do come up with something and it ends up being too jank to be in the main lib I'm considering making some sort of ext.voice package on pypi. Anyways, these "streams" are per user (actually ssrc, which is just an id in the RTP spec, but are mapped to user ids), so the data you get will include a member object. The exact format of this I haven't decided on yet, so right now it's just sort of a payload blob object with the pcm, opus, and rtppacket object (mostly for my own convenience during development).

Delimiting speech segments is still something I don't quite know how to handle yet. I think this might be a problem I have to put onto the user, since I don't see a good way to do it lib-side. In the example I'm writing for this, I'm thinking about setting the pattern for doing so to use the priority speaking feature. Relying on speaking events and/or arbitrary waiting periods does not sound reliable enough to use by default. Using priority speaking to indicate the segment of speech to be recognized/processed would be very convenient, both for me, since I don't need to do anything in the lib for it, and for the user, since it being a PTT feature means that if they mess it up it's their fault.

Unfortunately, having speech recognition in an example is a bit out of scope for the lib examples, but I plan on having an additional example in a gist that demonstrates this, most likely using the SpeechRecognition module. In your case, until I design out how it would work with the various services in the aforementioned library (which might end up being the same anyways), it would probably be waiting for priority speaking from some member, collecting their voice data in your container of choice (in memory or filesystem based), and processing it once priority speaking ends.

imayhaveborkedit avatar Mar 18 '19 21:03 imayhaveborkedit

I have updated the OP. Anyone vaguely interested in this feature should read the new content.

imayhaveborkedit avatar Mar 21 '19 22:03 imayhaveborkedit

Got around to writing a short example this weekend. It was written in a rush and it's probably not how discord.py is meant to be used, but it works.

As mentioned in the OP, I don't know if the mixing works for this example. I've used a UserFilter, but I haven't yet tested with more users, to see if it actually filters. I managed to do speech-to-text by just coarsely segmenting data into 5 second chunks, since I don't think anyone speaks in long-winded sentences. Or at least, I don't. The accuracy is determined by the speech recognition service, but in general, it's pretty decent. A downside to this segmenting approach is that it only processes/posts the resulting text every 5 seconds. For speech to text, it's not ideal, but this could work reasonably well for some sort of voice commands. Another issue is that, sometimes, the 5 second segment might get only part of your sentence, but this is somewhat mitigated by stripping leading zeros in the buffer.

I faintly remember someone mentioning that silence is marked by 5 chunks of 0x00, so I've been trying to implement a way to delimit speech (or rather, words) by looking for these chunks, but I haven't found a reliable way to do it yet. I've been looking over the raw bytes output to see if this theory holds up, and it seems like it might, but I'd probably have to apply some sort of regex to make sure there really aren't any such chunks while speaking occurs.
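For what it's worth, on the opus side the marker appears to be the same three-byte silence frame discord.py itself sends when a speaker stops (five in a row); a sketch of counting those before the decode step, assuming you have access to the raw opus frames:

OPUS_SILENCE = b'\xf8\xff\xfe'  # the opus silence frame discord.py sends

def update_silence_count(opus_frame: bytes, count: int) -> int:
    # Five consecutive silence frames mark the end of a speech segment.
    return count + 1 if opus_frame == OPUS_SILENCE else 0

# count = update_silence_count(frame, count)
# if count >= 5:
#     flush the buffered segment to the recognizer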

FWIW, here's the gist: https://gist.github.com/Apfelin/c9cbb7988a9d8e55d77b06473b72dd57

Apfelin avatar Mar 25 '19 14:03 Apfelin

Looks great, but I keep getting an error on line 12

This is not yet implemented in the main library. @Apfelin was using imayhaveborkedit's fork, which does have discord.reader.AudioSink

gamescom15 avatar Mar 25 '19 21:03 gamescom15

@imayhaveborkedit Further to your proposed API, I have implemented a listener bot (that saves stuff to disk) in discord.js (I'm hoping to move it to Python ultimately). This leaves things such as mixing and figuring out when to join streams up to the user - which I think is the right thing to do as everyone will want something different. (Providing a separate lib for the common use cases makes sense, such as mixing users - I think it should be out of scope of this.)

The implementation in JS involves listening for the "speaking" event, then binding a receiver/sink (stream) to it to accept the data. When binding the receiver you can select the mode (eg: PCM, or Opus. Wav also makes sense since Python has that built in). Then every chunk is written to the receiver as it comes in (the actual implementation for out of order packets etc is unclear to me).

When the user stops talking/they release PTT, the end event is triggered, and the stream is closed.

I would argue that, for the moment, the per-user filters etc. are not required; rather, in the on_speaking event, the user can decide whether they want to save this stream or not (the event is given the member details), and they could return the stream to write to (or call a method on a passed object). If no stream is returned, no action is taken (the overhead of this would be minimal compared to everything else that has to happen to stream data). Again, some common classes could be provided to simplify the process (eg: a stream-to-file class). A sketch of this idea follows below.
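Purely hypothetical sketch of that shape; none of these names exist in discord.py today, and the event/return-value convention is just the proposal above:

import discord

class ListenerBot(discord.Client):
    async def on_speaking_start(self, member):  # hypothetical event
        if member.bot:
            return None  # no stream returned -> no action taken
        # Returning a writable object binds it as the sink for this
        # speech segment; it would be closed again on the end event.
        return open(f'{member.id}.pcm', 'ab')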

I know I'm coming late to the party here and may have missed some stuff (I have not yet used this, but trying to figure it out), but hopefully that makes some sense.

I'm sure you have seen this, but just in case, this is the VoiceReceiver API for discord.js: https://discord.js.org/#/docs/main/master/class/VoiceReceiver (not much there), and an example of the Voice API in use (slightly out of date, but it gives you a feel for it, as the discord.js docs are not great): https://gist.github.com/eslachance/fb70fc036183b7974d3b9191601846ba

sillyfrog avatar Apr 04 '19 11:04 sillyfrog

I have updated the OP again with new info about stopping sinks. (I guess I didn't...?)

@sillyfrog Sorry for not responding until now. The whole mixing thing, if I do figure it out, will still be entirely optional. It would exist as a batteries-included utility. If a user wants to handle the data differently, of course they can go about it their own way. The problem is that this is not an easy thing to do, even less so doing it correctly. I honestly don't expect many people to be able to come up with a decent solution to this that involves mixing the data live. I still believe that getting the combined audio of everyone speaking in a channel is a common and valid use case, and as such a utility for doing so should be included in the lib.

The d.js example vaguely follows the concept I had in mind for this, but I would design a somewhat higher level interface for it. Perhaps with a context manager. Or maybe I won't, and will just do it "manually" in the example to set the precedent. Or maybe just leave it as an exercise to the user.

imayhaveborkedit avatar Apr 13 '19 01:04 imayhaveborkedit

I've used d.js in the past and initially found it annoying to have to deal with individual user audiostreams, so +1 to a simple vc.listen(AudioSink()) function.

What's currently blocking your progress on this? I'm trying to bring my mental model of the current problems up to speed so that I can hopefully contribute.

0xBERNDOG avatar Apr 17 '19 20:04 0xBERNDOG

re: Sinks and filters

Source and Sink are basically IO streams, with a filter being an in-memory sink (i.e. one that doesn't write out to a file). Composing filters could be done with essentially a list or a linked list, where a call to write() at the head propagates the data down the chain recursively. Example usage:

wavSink = WavSink(filename='output_file.wav')
volumeFilter = VolumeFilter()
someOtherFilter1 = SomeFilter1()
someOtherFilter2 = SomeFilter2()

composedFilter = volumeFilter.compose([someOtherFilter1, someOtherFilter2])
# or maybe
# composedFilter = BlankFilter([volumeFilter, someOtherFilter1, someOtherFilter2])
wavSink.filter = composedFilter

vc.listen(wavSink)

Writes would propagate like so:

wavSink.write(data)
volumeFilter.write(data)
someOtherFilter1.write(data) 
someOtherFilter2.write(data)
-----------------------------------------------
> someOtherFilter2 modifies data and returns it
> someOtherFilter1 modifies data and returns it
> volumeFilter modifies data and returns it
> wavSink writes data to file

0xBERNDOG avatar Apr 28 '19 15:04 0xBERNDOG

To be honest, this is horrifying. Voice is already threaded. You don't need a thread for state. You don't need two different instances of the sink. This is not how this class is used. It's clear that I need extensive examples to try to prevent people from writing code like this.

For reference, this is the typical usage pattern:

vc.listen(discord.UserFilter(discord.WaveSink('file.wav'), some_member))
...
vc.stop_listening()

That's it. Note that the Filter objects are probably going to be changed at some point since I don't like the design very much in its current state.

imayhaveborkedit avatar Aug 30 '19 20:08 imayhaveborkedit

Is it possible to send audio from system microphone with this?

And can I send audio from vc.listen() to system speakers in real time?

brownbananas95 avatar Oct 31 '19 08:10 brownbananas95

@brownbananas95 Sending audio is already implemented. I think you can, but I'm unsure.

apple502j avatar Oct 31 '19 09:10 apple502j

@brownbananas95 sending (mike->discord) works without issue. Receiving (Discord->speakers) is a bit more complex, as you currently receive each user as a separate stream. You can receive these, but must mix them prior to sending to the computer's speakers.
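A crude sketch of that mixing step, assuming the per-user 20ms PCM chunks have already been lined up in time (the hard part discussed above):

import audioop  # stdlib at the time of writing (removed in Python 3.13)

def mix_frames(frames):
    # Sum equal-length 16-bit PCM frames; audioop clips on overflow.
    mixed = frames[0]
    for frame in frames[1:]:
        mixed = audioop.add(mixed, frame, 2)  # 2 = bytes per sample
    return mixed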

gerth2 avatar Nov 01 '19 23:11 gerth2

@gerth2 do you have a code example for sending mike->discord?

brownbananas95 avatar Nov 01 '19 23:11 brownbananas95

Sure thing - I yanked the meaningful guts from a project I have in-flight at the moment using this Voice Receive fork of discord.py:

https://gist.github.com/gerth2/8ee0c918606b4c501759a9c333393398

Let me know if you run into issues, I can zip up a latest copy of what we're working on to send to you.

gerth2 avatar Nov 02 '19 00:11 gerth2

Exactly what I needed! Thank you!

brownbananas95 avatar Nov 02 '19 00:11 brownbananas95

FWIW, yesterday I got a crude but (apparently?) functional PCM audio mixing strategy sorted out, outside of this library but using its APIs. Working on open-sourcing the code some time this week (it still has hardcoded private API keys in it, which I need to fix).

Since this is an RFC, my comment: I like the API as provided - the alignment with the existing read side feels nice and fuzzy. Any issues with internals aside, it looks nice and seems to work fine from the outside.

gerth2 avatar Nov 03 '19 13:11 gerth2

Let me know if you run into issues, I can zip up a latest copy of what we're working on to send to you.

What is your contact information?

FWIW, yesterday I got a crude but (apparently?) functional PCM audio mixing strategy sorted out, outside of this library but using its APIs. Working on open-sourcing the code some time this week (it still has hardcoded private API keys in it, which I need to fix).

Since this is an RFC, my comment: I like the API as provided - the alignment with the existing read side feels nice and fuzzy. Any issues with internals aside, it looks nice and seems to work fine from the outside.

This is interesting. Perhaps this would help complete the Voice Receive fork by imayhaveborkedit, and potentially get the PR approved in discord.py/master :-)

brownbananas95 avatar Nov 04 '19 09:11 brownbananas95

@brownbananas95 code is here

gerth2 avatar Nov 05 '19 03:11 gerth2

Is this still being considered as a feature or will it only exist in forks?

DMcP89 avatar Feb 19 '20 03:02 DMcP89

What is the status of this RFC? Any timeframe as to when this will be ready to be integrated into master, considering the fork, which seems to have this working?

JessicaTegner avatar May 20 '20 06:05 JessicaTegner

Maybe writing a function that, from the perspective of the bot, mutes all other users except the one it specifically wants to listen to would work? Since discord takes it all as one stream? There's no point listening to multiple people at the same time unless you want to record it, or create a gateway to PS party for example, surely?

ChonkyWonky avatar May 20 '20 19:05 ChonkyWonky

FWIW I believe a recent discord API change makes this particular PR out of date - changes will need to be ported forward for it to function.

gerth2 avatar Jul 25 '20 14:07 gerth2

@gerth2 is correct; in particular, it's this pretty simple change that is needed to get this variant to work again, and thankfully nothing else. I was just able to start utilizing imayhaveborkedit's excellent work on this to accomplish something I was trying to do which required the ability for the bot to receive audio from a user.

On that note, I have to thank you tremendously, gerth2, for sharing your work and providing a good example of how to use the features offered by this fork. It helped me achieve what I was working on basically 100%; I'm not sure I could have done it without it, at least not nearly as easily. Additionally, as gerth said here, this really feels almost 100% in sync with the main voice send part of the existing library, to the point that I was almost able to simply reverse the process of sending audio I already had set up and have it work for receiving.

I think you've done a lot more than you take credit for @imayhaveborkedit lol.

NormHarrison avatar Aug 12 '20 06:08 NormHarrison