MSC3401: Native Group VoIP signalling
Note: previously this MSC included SFU signalling which has now been moved to MSC3898 to avoid making this MSC too large.
Problem
VoIP signalling in Matrix is currently conducted via timeline events in a 1:1 room. This has some limitations, especially if you try to broaden the approach to multiparty VoIP calls:
- VoIP signalling can generate a lot of events as candidates are incrementally
discovered, and for rapid call setup these need to be relayed as rapidly as
possible.
- Putting these into the room timeline means that if the client has a gappy sync, for VoIP to be reliable it will need to go back and fill in the gap before it can process any VoIP events, slowing things down badly.
- Timeline events are (currently) subject to harsh rate limiting, as they are assumed to be a spam vector.
- VoIP signalling leaks IP addresses. There is no reason to keep these around for posterity, and they should only be exposed to the devices which care about them.
- Candidates are ephemeral data, and there is no reason to keep them around for posterity - they're just clogging up the DAG.
Meanwhile we have no native signalling for group calls at all, forcing you to instead embed a separate system such as Jitsi, which has its own dependencies and doesn't directly leverage any of Matrix's encryption, decentralisation, access control or data model.
Proposal
This proposal provides a signalling framework using to-device messages which can be applied to native Matrix 1:1 calls, full-mesh calls and, in the future, SFU calls, cascaded SFU calls, MCU calls, and hybrid SFU/MCU approaches. It replaces the early flawed sketch at MSC2359.
This does not immediately replace the current 1:1 call signalling, but may in future provide a migration path to unified signalling for 1:1 and group calls.
Diagrammatically, this looks like:
1:1:
          A -------- B

Full mesh between clients
          A -------- B
           \        /
            \      /
             \    /
              \  /
               C

SFU (aka Focus):
          A __    __ B
              \  /
               F
               |
               |
               C
Where F is an SFU focus

Cascaded decentralised SFU:
     A1 --.           .-- B1
     A2 ---Fa ----- Fb--- B2
            \       /
             \     /
              \   /
               \ /
                Fc
               |  |
              C1  C2
Where Fa, Fb and Fc are SFU foci, one per homeserver, each with two clients.
m.call state event
The user who wants to initiate a call sends a m.call state event into the room to inform the room participants that a call is happening in the room. This effectively becomes the placeholder event in the timeline which clients would use to display the call in their scrollback (including duration and termination reason using m.terminated). Its body has the following fields:
- m.intent to describe the intended UX for handling the call. One of:
  - m.ring if the call is meant to cause the room participants' devices to ring (e.g. 1:1 call or group call)
  - m.prompt if the call should be presented as a conference call which users in the room are prompted to connect to
  - m.room if the call should be presented as a voice/video channel in which the user is immediately immersed on selecting the room.
- m.type to say whether the initial type of call is voice only (m.voice) or video (m.video). This signals the intent of the user when placing the call to the participants (i.e. "I want to have a voice call with you" or "I want to have a video call with you"), warns the receiver whether they may be expected to view video or not, and lets the client provide suitable initial UX for displaying that type of call, even if a voice call later gets upgraded to a video call.
- m.terminated if this event indicates that the call in question has finished, including the reason why. (A voice/video room will never terminate.) (Do we need a duration, or can we figure that out from the previous state event?)
- m.name as an optional human-visible label for the call (e.g. "Conference call").
- The state key is a unique ID for that call. (We can't use the event ID, given m.type and m.terminated are mutable.) If there are multiple non-terminated conf ID state events in the room, the client should display the most recently edited event.
For instance:
{
    "type": "m.call",
    "state_key": "cvsiu2893",
    "content": {
        "m.intent": "m.room",
        "m.type": "m.voice",
        "m.name": "Voice room"
    }
}
We mandate at most one call per room at any given point to avoid UX nightmares - if you want the user to participate in multiple parallel calls, you should simply create multiple rooms, each with one call.
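As a rough illustration of how a client might pick which call to display when several non-terminated m.call state events exist, here is a minimal sketch. It assumes events are plain dicts shaped like the example above, and it uses origin_server_ts as a stand-in for "most recently edited"; the function name pick_active_call is hypothetical and the proposal does not define the exact ordering criterion.

```python
from typing import Any, Dict, List, Optional


def pick_active_call(call_events: List[Dict[str, Any]]) -> Optional[Dict[str, Any]]:
    """Return the most recently edited non-terminated m.call event, or None."""
    active = [
        ev
        for ev in call_events
        # A call counts as finished once its content carries m.terminated.
        if ev.get("type") == "m.call" and "m.terminated" not in ev.get("content", {})
    ]
    # Assumption: origin_server_ts approximates "most recently edited".
    return max(active, key=lambda ev: ev.get("origin_server_ts", 0), default=None)
```

A client following this sketch would ignore every other m.call event in the room until the selected one is terminated.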
Call participation
Users who want to participate in the call declare this by publishing a m.call.member state event using their Matrix ID as the state key (thus ensuring other users cannot edit it). The event contains an array m.calls of objects describing which calls the user is participating in within that room. This array must contain one item (for now).
The fields within the item in the m.calls contents are:
- m.call_id - The ID of the conference the user is claiming to participate in. If this doesn't match an unterminated m.call event, it should be ignored.
- m.devices - The list of the member's active devices in the call. A member may join from one or more devices at a time, but they may not have two active sessions from the same device. Each device contains the following properties:
  - device_id - The device id to use for to-device messages when establishing a call.
  - session_id - A unique identifier used for resolving duplicate sessions from a given device. When the session_id field changes from an incoming m.call.member event, any existing calls from this device in this call should be terminated. session_id should be generated once per client session on application load.
  - expires_ts - A POSIX timestamp in milliseconds describing when this device data should be considered stale. When updating their own device state, clients should choose a reasonable value for expires_ts in case they go offline unexpectedly. If the user stays connected for longer than this time, the client must actively update the state event with a new expiration timestamp. A device must be ignored if the expires_ts field indicates it has expired, or if the user's m.room.member event's membership field is not join.
  - feeds - Contains an array of feeds the member is sharing and the opponent member may reference when setting up their WebRTC connection.
    - purpose - Either m.usermedia or m.screenshare; otherwise the feed should be ignored.
For instance:
{
    "type": "m.call.member",
    "state_key": "@matthew:matrix.org",
    "content": {
        "m.calls": [
            {
                "m.call_id": "cvsiu2893",
                "m.devices": [
                    {
                        "device_id": "ASDUHDGFYUW", // Used to target to-device messages
                        "session_id": "GHKJFKLJLJ", // Used to resolve duplicate calls from a device
                        "expires_ts": 1654616071686,
                        "feeds": [
                            {
                                "purpose": "m.usermedia",
                                "id": "qegwy64121wqw", // WebRTC MediaStream id
                                "tracks": [
                                    {
                                        "kind": "audio",
                                        "id": "zvhjiwqsx", // WebRTC MediaStreamTrack id
                                        "label": "Sennheiser Mic",
                                        "settings": { // WebRTC MediaTrackSettings object
                                            "channelCount": 2,
                                            "sampleRate": 48000,
                                            "m.maxbr": 32000 // Matrix-specific extension to advertise the max bitrate of this track
                                        }
                                    },
                                    {
                                        "kind": "video",
                                        "id": "zbhsbdhzs",
                                        "label": "Logitech Webcam",
                                        "settings": {
                                            "width": 1280,
                                            "height": 720,
                                            "facingMode": "user",
                                            "frameRate": 30.0,
                                            "m.maxbr": 512000
                                        }
                                    }
                                ]
                            },
                            {
                                "purpose": "m.screenshare",
                                "id": "suigv372y8378",
                                "tracks": [
                                    {
                                        "kind": "video",
                                        "id": "xbhsbdhzs",
                                        "label": "My Screenshare",
                                        "settings": {
                                            "width": 3072,
                                            "height": 1920,
                                            "cursor": "moving",
                                            "displaySurface": "monitor",
                                            "frameRate": 30.0,
                                            "m.maxbr": 768000
                                        }
                                    }
                                ]
                            }
                        ]
                    }
                ]
            }
        ]
    }
}
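The validity rules for m.call.member data (matching call ID, joined membership, unexpired expires_ts, known feed purpose) could be applied along the following lines. This is only a sketch: the function name active_devices and its parameters are hypothetical, the content is assumed to be a plain dict shaped like the example above, and treating a device as expired when expires_ts is less than or equal to the current time is an assumption about the boundary case.

```python
import time
from typing import Any, Dict, List, Optional

VALID_PURPOSES = {"m.usermedia", "m.screenshare"}


def active_devices(
    member_content: Dict[str, Any],
    membership: str,
    active_call_id: str,
    now_ms: Optional[int] = None,
) -> List[Dict[str, Any]]:
    """Return the devices of one member that should be considered in the call."""
    # Devices of users who are not joined to the room must be ignored.
    if membership != "join":
        return []
    if now_ms is None:
        now_ms = int(time.time() * 1000)

    devices = []
    for call in member_content.get("m.calls", []):
        # Entries that do not match the unterminated m.call are ignored.
        if call.get("m.call_id") != active_call_id:
            continue
        for device in call.get("m.devices", []):
            # Assumption: a device is stale once expires_ts <= now.
            if device.get("expires_ts", 0) <= now_ms:
                continue
            # Feeds with an unknown purpose are dropped, not the whole device.
            feeds = [f for f in device.get("feeds", []) if f.get("purpose") in VALID_PURPOSES]
            devices.append({**device, "feeds": feeds})
    return devices
```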
This builds on MSC3077, which describes streams in m.call.* events via a sdp_stream_metadata field.
TODO: Do we need all of this data? Why would we need it?
TODO: This doesn't follow the MSC3077 format very well - can we do something about that?
TODO: Add tracks field
TODO: Add bitrate/format fields
Clients should do their best to ensure that calls in m.call.member state are removed when the member leaves the call. However, there will be cases where the device loses network connectivity, power, the application is forced closed, or it crashes. If the m.call.member state has stale device data the call setup will fail. Clients should re-attempt invites up to 3 times before giving up on calling a member.
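The retry rule might be sketched as below. The helper invite_with_retries and the send_invite callback are hypothetical, and the sketch reads "re-attempt up to 3 times" as one initial attempt followed by at most three re-attempts, which is one possible interpretation of the text.

```python
from typing import Callable

MAX_INVITE_REATTEMPTS = 3


def invite_with_retries(
    send_invite: Callable[[], bool],
    max_reattempts: int = MAX_INVITE_REATTEMPTS,
) -> bool:
    """Try to set up a call, re-attempting a bounded number of times.

    send_invite is a hypothetical callback that returns True once the
    remote device has answered and False if the setup attempt failed.
    """
    # One initial attempt plus up to max_reattempts further attempts.
    for _ in range(1 + max_reattempts):
        if send_invite():
            return True
    # Give up: the member's device data is probably stale.
    return False
```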
Call setup
In a full mesh call, for any two participants, the one with the lexicographically lower user ID is responsible for calling the other. If two participants share the same user ID (that is, if a user has joined the call from multiple devices), then the one with the lexicographically lower device ID is responsible for calling the other.
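The caller-selection rule can be expressed as a small comparison. The sketch below assumes that "lexicographically lower" means ordinary string comparison by code point, which the proposal does not spell out; the function name is_caller is hypothetical.

```python
def is_caller(
    local_user_id: str,
    local_device_id: str,
    remote_user_id: str,
    remote_device_id: str,
) -> bool:
    """Return True if the local device is responsible for sending the invite."""
    if local_user_id != remote_user_id:
        # Different users: the lower user ID calls the other.
        return local_user_id < remote_user_id
    # Same user on several devices: the lower device ID calls the other.
    return local_device_id < remote_device_id
```

Because both sides evaluate the same rule on the same pair of IDs, exactly one of them ends up sending the m.call.invite.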
Call setup then uses the normal m.call.* events, except they are sent over to-device messages to the relevant devices (encrypted via Olm). This means:
- When initiating a 1:1 call, the m.call.invite is sent to the devices listed in the m.call.member event's m.devices array using the device_id field.
- m.call.* events sent via to-device messages should also include the following properties in their content:
  - conf_id - The group call id listed in m.call.
  - dest_session_id - The recipient's session id. Incoming messages with a dest_session_id that doesn't match your current session id should be discarded.
  - seq - The sequence number of the to-device message. This is needed because the order of to-device messages is not guaranteed. The number starts at 0 and is incremented by 1 with each new to-device message.
- In addition to the fields above, m.call.invite events sent via to-device messages should include the following properties:
  - device_id - The message sender's device id. Used by the opponent member to send response to-device signalling messages even if the m.call.member event has not been received yet.
  - sender_session_id - Like the device_id, the sender_session_id is used by the opponent member to filter out messages unrelated to the sender's session even if the m.call.member event has not been received yet.
- For 1:1 calls, we might want to let the to-device messages flow and cause the client to ring even before the m.call event propagates, to minimise latency. Therefore we'll need to include an m.intent on the m.call.invite too.
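One way a receiver could use dest_session_id and seq is sketched below. The proposal only says that seq exists because ordering is not guaranteed; buffering out-of-order messages until the gap is filled is an assumption of this sketch, as is keeping one inbox per remote sender session. The class name ToDeviceInbox is hypothetical.

```python
from typing import Any, Dict, List


class ToDeviceInbox:
    """Reorders m.call.* to-device messages from a single remote session."""

    def __init__(self, own_session_id: str) -> None:
        self.own_session_id = own_session_id
        self.next_seq = 0  # seq starts at 0 and increments by 1
        self.pending: Dict[int, Dict[str, Any]] = {}

    def receive(self, content: Dict[str, Any]) -> List[Dict[str, Any]]:
        """Accept one message and return the messages now ready, in order."""
        # Messages addressed to another (e.g. previous) session are discarded.
        if content.get("dest_session_id") != self.own_session_id:
            return []
        seq = content.get("seq")
        # Drop malformed messages and duplicates of already-delivered ones.
        if not isinstance(seq, int) or seq < self.next_seq:
            return []
        self.pending[seq] = content
        ready = []
        # Release the longest run of consecutive sequence numbers.
        while self.next_seq in self.pending:
            ready.append(self.pending.pop(self.next_seq))
            self.next_seq += 1
        return ready
```

A real client would probably also bound the buffer and time out on gaps that never fill; that is left out here to keep the sketch short.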
Example Diagrams
Legend
Arrow Style | Description |
---|---|
Solid | State Event |
Dashed | Event (sent as to-device message) |
Basic Call
sequenceDiagram
autonumber
participant Alice
participant Room
participant Bob
Alice->>Room: m.call
Alice->>Room: m.call.member
Bob->>Room: m.call.member
Alice-->>Bob: m.call.invite
Alice-->>Bob: m.call.candidates
Alice-->>Bob: m.call.candidates
Bob-->>Alice: m.call.answer
Bob-->>Alice: m.call.candidates
Alice-->>Bob: m.call.select_answer
Potential issues
To-device messages are point-to-point between servers, whereas today's m.call.* messages can transitively traverse servers via the room DAG, thus working around federation problems. In practice if you are relying on that behaviour, you're already in a bad place.
Alternatives
There are many different ways to do this. The main other alternative considered was not to use state events to track membership, but instead gossip it via either to-device or data channel (DC) messages between participants. This fell apart however due to trust: you effectively end up reinventing large parts of Matrix layered on top of to-device or DC. So you might as well publish and distribute the participation data in the DAG rather than reinvent the wheel.
An alternative to to-device messages is to use DMs. You still risk gappy sync problems though due to lots of traffic, as well as the hassle of creating DMs and requiring canonical DMs to set up the calls. It does make debugging easier though, rather than having to track encrypted ephemeral to-device messages.
Security considerations
State events are not encrypted currently, and so this leaks that a call is happening, and who is participating in it, and from which devices.
Malicious users in a room could try to sabotage a conference by overwriting the m.call state event of the current ongoing call.
Unstable prefix
stable event type | unstable event type |
---|---|
m.call | org.matrix.msc3401.call |
m.call.member | org.matrix.msc3401.call.member |