matrix-doc/proposals/2746-reliable-voip.md

21 KiB

MSC2746: Improved Signalling for 1:1 VoIP

Historically, Matrix has basic support for signalling 1:1 WebRTC calls which suffer a number of shortcomings:

  • If several devices try to answer the same call, there is no way for them to determine clearly that the caller has set up the call with a different device, and no way for the caller to determine which candidate events map to which answer.
  • Hangup reasons are often incorrect.
  • There is confusion and no clear guidance on how clients should determine whether an incoming invite is stale or not.
  • There is no support for renegotiation of SDP, for changing ICE candidates / hold/resume functionality, etc.
  • There is no distinction between rejecting a call and ending it, which means that in trying to reject a call, a client can inadvertently cause a call that has been successfully set up on a different device to be hung up.

Proposal

Change the version field in all VoIP events to "1"

The version property is changed to "1" in all existing VoIP events (m.call.answer, m.call.candidates, m.call.hangup m.call.invite). Note that this changes the type of the version field from an integer to a string, as described in the Unstable Prefix section.

This will be used to determine whether devices support this new version of the protocol. For example, clients can use this field to know whether to expect an m.call.select_answer event from their opponent. If clients see events with version other than 0 or "1" (including, for example, the numeric value 1), they should treat these the same as if they had version == "1".

Note that this implies any and all future versions of VoIP events should be backwards-compatible. If it does become necessary to introduce a non backwards-compatible VoIP spec, the intention would be for it to simply use a separate set of event types.

Define the configurations of WebRTC streams and tracks

The spec does not currently define the WebRTC streams and tracks that should be sent. Under this proposal, clients are expected to send one stream with one track of kind audio (creating a voice call). They can optionally send a second track in the same stream of kind video (creating a video call).

Clients implementing this specification use the first stream and will ignore any streamless tracks. Note that in the Javascript WebRTC API, this means addTrack() must be passed two parameters: a track and a stream, not just a track, and in a video call the stream must be the same for both audio and video track.

A client may send other streams and tracks but the behaviour of the other party with respect to presenting such streams and tracks is undefined.

This follows the existing known implementations of v0 VoIP.

Add invitee field to m.call.invite

This allows for the following use cases:

  • Placing a call to a specific user in a room where other users are also present.
  • Placing a call to oneself.

The field should be added for all invites where the target is a specific user, and should be set to the Matrix user ID of that user. Invites without an invitee field are defined to be intended for any member of the room other than the sender of the event.

Clients should consider an incoming call if they see a non-expired invite event where the invitee field is either absent or equal to their user's Matrix ID, however they should evaluate whether or not to ring based on their user's trust relationship with the callers and/or where the call was placed. As a starting point, it is suggested that clients ignore call invites from users in public rooms. It is strongly recommended that when clients do not ring for an incoming call invite, they still display the call invite in the room and annotate that it was ignored.

Add party_id to all VoIP events

Whenever a client first participates in a new call, it generates a party_id for itself to use for the duration of the call. This needs to be long enough that the chance of a collision between multiple devices both generating an answer at the same time generating the same party ID is vanishingly small: 8 uppercase + lowercase alphanumeric characters is recommended. Parties in the call are identified by the tuple of (user_id, party_id).

The client adds a party_id field containing this ID to the top-level of the content of all VoIP events it sends on the call, including m.call.invite. Clients use this to identify remote echo of their own events: since a user may now call themselves, they can no longer ignore events from their own user. This field also identifies different answers sent by different clients to an invite, and matches m.call.candidates events to their respective answer/invite.

A client implementation may choose to use the device ID used in end-to-end cryptography for this purpose, or it may choose, for example, to use a different one for each call to avoid leaking information on which devices were used in a call (in an unencrypted room) or if a single device (ie. access token) were used to send signalling for more than one call party.

A grammar for party_id is defined below.

Introduce m.call.select_answer

This event is sent by the caller's client once it has decided which other client to talk to, by selecting one of multiple possible incoming m.call.answer events. Its selected_party_id field indicates the answer it's chosen. The call_id and party_id of the caller is also included. If the callee's client sees a select_answer for an answer with party ID other than the one it sent, it ends the call and informs the user the call was answered elsewhere. It does not send any events. Media can start flowing before this event is seen or even sent. Clients that implement previous versions of this specification will ignore this event and behave as they did before.

Example:

{
    "type": "m.call.select_answer",
    "content": {
        "version": "1",
        "call_id": "12345",
        "party_id": "67890",
        "selected_party_id": "111213",
    }
}

Introduce m.call.reject

  • If the m.call.invite event has version "1", a client wishing to reject the call sends an m.call.reject event. This rejects the call on all devices, but if the calling device sees an answer before the reject, it disregards the reject event and carries on. The reject has a party_id just like an answer, and the caller sends a select_answer for it just like an answer. If another client had already sent an answer and sees the caller select the reject response instead of its answer, it ends the call.
  • If the m.call.invite event has version 0, the callee sends an m.call.hangup event.

Example:

{
    "type": "m.call.reject",
    "content" : {
        "version": "1",
        "call_id": "12345",
        "party_id": "67890",
    }
}

If the calling user chooses to end the call before setup is complete, the client sends m.call.hangup as previously.

Clarify what actions a client may take in response to an invite

The client may:

  • Attempt to accept the call by sending an m.call.answer.
  • Actively reject the call everywhere: send an m.call.reject as per above, which will stop the call from ringing on all the user's devices and the caller's client will inform them that the user has rejected their call.
  • Ignore the call: send no events, but stop alerting the user about the call. The user's other devices will continue to ring, and the caller's device will continue to indicate that the call is ringing, and will time the call out in the normal way if no other device responds.

Introduce more reason codes to m.call.hangup

  • ice_timeout: The connection failed after some media was exchanged (as opposed to current ice_failed which means no media connection could be established). Note that, in the case of an ICE renegotiation, a client should be sure to send ice_timeout rather than ice_failed if media had previously been received successfully, even if the ICE renegotiation itself failed.
  • user_hangup: Clients must now send this code when the user chooses to end the call, although for backwards compatibility with version 0, a clients should treat an absence of the reason field as user_hangup.
  • user_media_failed: The client was unable to start capturing media in such a way that it is unable to continue the call.
  • user_busy: The user is busy. Note that this exists primarily for bridging to other networks such as the PSTN. A Matrix client that receives a call whilst already in a call would not generally reject the new call unless the user had specifically chosen to do so.
  • unknown_error: Some other failure occurred that meant the client was unable to continue the call rather than the user choosing to end it.

Introduce m.call.negotiate

This introduces SDP negotiation semantics for media pause, hold/resume, ICE restarts and voice/video call up/downgrading. Clients should implement & honour hold functionality as per WebRTC's recommendation: https://www.w3.org/TR/webrtc/#hold-functionality

If both the invite event and the accepted answer event have version equal to "1", either party may send m.call.negotiate with a description field to offer new SDP to the other party. This event has call_id with the ID of the call and party_id equal to the client's party ID for that call. The caller ignores any negotiate events with party_id + user_id tuple not equal to that of the answer it accepted and the callee ignores any negotiate events with party_id + user_id tuple not equal to that of the caller. Clients should use the party_id field to ignore the remote echo of their own negotiate events.

This has a lifetime field as in m.call.invite, after which the sender of the negotiate event should consider the negotiation failed (timed out) and the recipient should ignore it.

The description field is the same as the offer field in m.call.invite and answer field in m.call.answer and is an RTCSessionDescriptionInit object as per https://www.w3.org/TR/webrtc/#dom-rtcsessiondescriptioninit.

Example:

{
    "type": "m.call.negotiate",
    "content": {
        "version": "1",
        "call_id": "12345",
        "party_id": "67890",
        "lifetime": 10000,
        "description": {
            "sdp": "[some sdp]",
            "type": "offer",
	},
    }
}

Once an m.call.negotiate event is received, the client must respond with another m.call.negotiate event, with the SDP answer (with "type": "answer") in the description property.

This MSC also proposes clarifying the m.call.invite and m.call.answer events to state that the offer and answer fields respectively are objects of type RTCSessionDescriptionInit. Hence the type field, whilst redundant in these events, is included for ease of working with the WebRTC API and is mandatory. Receiving clients should not attempt to validate the type field, but simply pass the object into the WebRTC API.

Designate one party as 'polite'

In line with WebRTC perfect negotiation (https://w3c.github.io/webrtc-pc/#perfect-negotiation-example) we introduce rules to establish which party is polite in the process of renegotiation. The callee is always the polite party. In a glare situation, the politenes of a party is therefore determined by whether the inbound or outbound call is used: if a client discards its outbound call in favour of an inbound call, it becomes the polite party.

Add explicit recommendations for call event liveness.

m.call.invite contains a lifetime field that indicates how long the offer is valid for. When a client receives an invite, it should use the event's age field in the sync response plus the time since it received the event from the homeserver to determine whether the invite is still valid. The use of the age field ensures that incorrect clocks on client devices don't break calls. If the invite is still valid and will remain valid for long enough for the user to accept the call, it should signal an incoming call. The amount of time allowed for the user to accept the call may vary between clients. For example, it may be longer on a locked mobile device than on an unlocked desktop device.

The client should only signal an incoming call in a given room once it has completed processing the entire sync response and, for encrypted rooms, attempted to decrypt all encrypted events in the sync response for that room. This ensures that if the sync response contains subsequent events that indicate the call has been hung up, rejected, or answered elsewhere, the client does not signal it.

If on startup, after processing locally stored events, the client determines that there is an invite that is still valid, it should still signal it but only after it has completed a sync from the homeserver.

The minimal recommended lifetime is 90 seconds - this should give the user enough time to actually pick up the call.

Introduce recommendations for batching of ICE candidates

Clients should aim to send a small number of candidate events, with guidelines:

  • ICE candidates which can be discovered immediately or almost immediately in the invite/answer event itself (eg. host candidates). If server reflexive or relay candidates can be gathered in a sufficiently short period of time, these should be sent here too. A delay of around 200ms is suggested as a starting point.
  • The client should then allow some time for further candidates to be gathered in order to batch them, rather than sending each candidate as it arrives. A starting point of 2 seconds after sending the invite or 500ms after sending the answer is suggested as a starting point (since a delay is natural anyway after the invite whilst the client waits for the user to accept it).

Mandate the end-of-candidates candidate

Define that an ICE candidate whose value is the empty string means that no more ICE candidates will be sent, and mandate that clients must send such a candidate in an m.call.candidates message. The WebRTC spec requires browsers to generate such a candidate, however note that at time of writing, not all browsers do (Chrome does not, but does generate an icegatheringstatechange event). The client should send any remaining candidates once candidate generation finishes, ignoring timeouts above. This allows bridges to batch the candidates together when bridging to protocols that don't support trickle ICE.

Add DTMF

Add that Matrix clients can send DTMF as specified by WebRTC. The WebRTC standard as of August 2020 does not support receiving DTMF but a Matrix client can receive and interpret the DTMF sent in the RTP payload.

Specify exact grammar for VoIP IDs

call_ids and the newly introduced party_id are explicitly defined to be between 1 and 255 characters long, consisting of the characters [0-9a-zA-Z._~-].

(Note that this matches the grammar of 'opaque IDs' from
MSC1597, and that of the id property of the m.login.sso flow schema.)

Specify behaviour on room leave

If the client sees the user it is in a call with leave the room, the client should treat this as a hangup event for any calls that are in progress. No specific requirement is given for the situation where a client has sent an invite and the invitee leaves the room, but the client may wish to treat it as a rejection if there are no more users in the room who could answer the call (eg. the user is now alone or the invitee field was set on the invite).

The same behaviour applies when a client is looking at historic calls.

Clarify that supported codecs should follow the WebRTC spec

The Matrix spec does not mandate particular audio or video codecs, but instead defers to the WebRTC spec. A compliant matrix VoIP client will behave in the same way as a supported 'browser' in terms of what codecs it supports and what variants thereof. The latest WebRTC specification applies, so clients should keep up to date with new versions of the WebRTC specification whether or not there have been any changes to the Matrix spec.

Potential issues

  • The ability to call yourself makes the protocol a little more complex for clients to implement, and is somewhat of a special case. However, some of the necessary additions are also required for other features so this MSC elects to make it possible.
  • Clients must make a decision on whether to ring for any given call: defining this in the spec would be cumbersome and would limit clients' ability to use reputation-based systems for this decision in the future. However, having a call ring on one client and not the other because one had categorised it as a junk call and not the other would be confusing for the user.

Alternatives

  • This MSC does not allow for ICE negotiation before the user chooses to answer the call. This can make call setup faster by allowing connectivity to be established whilst the call is ringing. This is problematic with Matrix since any device or user could answer the call, so it is not known which device is going to answer before the user chooses to answer. It would also leak information on which of a user's devices were online.
  • We could define that the ID of a call is implicitly the event ID of the invite event rather than having a specific call_id field. This would mean that a client would be unable to know the ID of a call before the it received the response from sending the invite event, which could complicate implementations. There is probably no compelling reason to change this.
  • m.call.select_answer was chosen such that its name reflect the intention of the event. m.call.ack is more succinct and mirrors SIP, but this MSC opts for the more descriptive name.
  • This MSC elects to allow invites without an invitee field to mean a call for anyone in the room. This could be useful for hunt group style semantics where an incoming call causes many different users' phones to ring and any one of them may pick up the call. This does mean clients will need to not blindly ring for any call invites in any room, since this would make unsolicited calls easy in public rooms. We could opt to leave this out, or make it more explicit with a specific value for the invitee field.
  • party_id is one of many potential solutions: callees could add answer_ids to their events and callers could be identified by the lack of an answer_id. An explicit field on every event may be easier to comprehend, less error-prone and clearer in the backwards-compatibility scenario.
  • We could make party_ids more prescriptive, eg. the caller could always have a party_id of the empty string, the word caller or equal to the call_id, which may make debugging simpler.
  • To allow for bridging into protocols that don't support trickle ICE, this proposal requires that clients send an empty candidate to signal the end of candidates. This means it will be up to bridges to buffer the invite and edit the SDP to add the candidates once they arrive, adding complexity to bridges. The alternative would be a discovery mechanism so clients could know whether a callee supports trickle ICE before calling, and disable it if so. This would add complexity to every Matrix client as well as having to assume that all current clients did not, disabling trickle ICE everywhere until clients support the discovery mechanism. The mechanism would also have to be per-user which would make sense for bridged users, but not where some of a users devices support trickle ICE and some do not.

Security considerations

  • IP addresses remain in the room in candidates, as they did in the previous version of the spec. This is not ideal, but alternatives were either sending candidates over to-device messages (would slow down call setup because a target device would have to be established before sending candidates) or redacting them afterwards (the volume of events sent during calls can already cause rate limiting issues and this would exacerbate this).
  • Clients must take care to not ring for any call, as per the 'alternatives' section.

Unstable prefix

Since VoIP events already have a 'version' field, we would ideally use a string, namespaced version during development, but this field is defined to be an int in version 0. This MSC proposes changing the version field to a string so that this namespacing can be used for future changes. Since there is no other easy way to namespace events whilst in development and ensure interoperability, we have chosen not to use an unstable prefix for this change, on the understanding that in future we will be able to use the string version field for the unstable prefix.

For backwards compatibility, strongly typed implementations should allow for version to either be a string or the integer 0.