# MSC3898: Native Matrix VoIP signalling for cascaded SFUs

[MSC3401](https://github.com/matrix-org/matrix-spec-proposals/pull/3401)
specifies how full-mesh group calls work in Matrix. While that MSC works well
for small group calls, it does not work so well for large conferences due to
bandwidth (and other) issues.

Selective Forwarding Units (SFUs) are servers which forward WebRTC streams
between peers (which could be clients or SFUs or both). To make use of them
effectively, peers need to be able to tell the SFU which streams they want to
receive and at what resolutions.

To solve the issue of centralization, the SFUs are also allowed to connect to
each other ("cascade"), and therefore the peers also need a way to tell an SFU
which other SFUs to connect to.

## Proposal

- **TODO: spell out how this works with active speaker detection & associated
  signalling**

### Diagrams

The diagrams of how this all looks can be found in
[MSC3401](https://github.com/matrix-org/matrix-spec-proposals/pull/3401).

### Additions to the `m.call.member` state event

This MSC proposes adding two _optional_ fields to the `m.call.member` state event:
`m.foci.preferred` and `m.foci.active`.

Informational: These fields are meant to avoid the following situation. A
conference is ongoing with several users in, for example, New York, all
connected to the focus in New York. Alice joins from London: rather than
connecting to the focus in London, she connects directly to the one in New York,
since that's where all the other participants are connected. If more users then
join from London, however, they will all make the same decision and connect to
the New York focus, rather than reaching the optimal configuration of the London
users connected to the London focus. With active and preferred foci, the second
user that joins from London will know that although Alice's active focus is New
York, her preferred focus is London, and can therefore choose the London focus
instead.

For instance:

```json
{
    "type": "m.call.member",
    "state_key": "@matthew:matrix.org",
    "content": {
        "m.calls": [
            {
                "m.call_id": "cvsiu2893",
                "m.devices": [{
                    "device_id": "U738KDF9WJ",
                    "m.foci.active": [
                        { "user_id": "@sfu-lon:matrix.org", "device_id": "FS5F589EF" }
                    ],
                    "m.foci.preferred": [
                        { "user_id": "@sfu-bon:matrix.org", "device_id": "3FSF589EF" },
                        { "user_id": "@sfu-mon:matrix.org", "device_id": "GFSDH93EF" }
                    ]
                }]
            }
        ],
        "m.expires_ts": 1654616071686
    }
}
```

#### `m.foci.active`

This field is a list of foci the user's device is publishing to. Usually, this
list will have a length of 1, yet a client might publish to multiple foci if
they are on different networks, for instance, or to simultaneously fan-out in
different directions from the client if there is no nearby focus. If the client
is participating full-mesh, it should either omit this field from the state
event or leave the list empty.

#### `m.foci.preferred`

This field is a list of foci the client would prefer to switch to from the
current active focus, if any other client also starts using the given focus. If
the client is already using one of its preferred foci, it should either omit
this field from the state event or leave the list empty.

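
As an illustration only, the two fields could be modelled on the client like this. The type and function names (`Focus`, `CallMemberDevice`, `isFullMesh`, `sameFocus`) are hypothetical and not part of this MSC; only the JSON field names come from the proposal.

```typescript
// Hypothetical client-side types; only the JSON keys are taken from the MSC.
interface Focus {
  user_id: string;
  device_id: string;
}

interface CallMemberDevice {
  device_id: string;
  "m.foci.active"?: Focus[];
  "m.foci.preferred"?: Focus[];
}

// A device participates full-mesh when `m.foci.active` is omitted or empty.
function isFullMesh(device: CallMemberDevice): boolean {
  const active = device["m.foci.active"];
  return active === undefined || active.length === 0;
}

// Two entries refer to the same focus when both user_id and device_id match.
function sameFocus(a: Focus, b: Focus): boolean {
  return a.user_id === b.user_id && a.device_id === b.device_id;
}

// The device entry from the example event above.
const exampleDevice: CallMemberDevice = {
  device_id: "U738KDF9WJ",
  "m.foci.active": [{ user_id: "@sfu-lon:matrix.org", device_id: "FS5F589EF" }],
  "m.foci.preferred": [
    { user_id: "@sfu-bon:matrix.org", device_id: "3FSF589EF" },
    { user_id: "@sfu-mon:matrix.org", device_id: "GFSDH93EF" },
  ],
};
```

Treating an omitted field and an empty list identically mirrors the "omit or leave empty" wording of both field definitions.
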
### Choosing a focus

#### Discovering foci

- **TODO: How does a client discover foci? We could use well-known or a custom endpoint**

Foci are identified by a tuple of `user_id` and `device_id`.

#### Determining the best focus

There are many ways to determine the best focus; this MSC recommends the
following criteria. The best focus is one which:

- Is the quickest to respond to `m.call.invite` with `m.call.answer`.
- Is the quickest to reject a spurious HTTPS request to a high-numbered
  port on the SFU's IP address, if the SFU exposes its IP somewhere. This is
  similar to the [apenwarr/blip](https://github.com/apenwarr/blip) trick and
  measures media-path latency rather than signalling-path latency.
- Has the best latency of data-channel traffic flows.
- Has the best latency and bandwidth, as determined by sending a small burst of
  media down the pipe to probe.

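
Whichever probe is used, the client ends up with one measurement per candidate focus and needs to rank them. A minimal sketch of that ranking step follows, assuming a hypothetical `ProbeResult` shape in which a failed probe is recorded as `null`; how the round-trip time is actually measured is out of scope here.

```typescript
interface Focus {
  user_id: string;
  device_id: string;
}

// Hypothetical probe outcome: round-trip time in milliseconds, or null if the
// focus did not respond. Neither the name nor the shape is defined by the MSC.
interface ProbeResult {
  focus: Focus;
  rttMs: number | null;
}

// Orders foci from best (lowest latency) to worst, dropping unreachable ones.
function orderFoci(results: ProbeResult[]): Focus[] {
  return results
    .filter((r): r is ProbeResult & { rttMs: number } => r.rttMs !== null)
    .sort((a, b) => a.rttMs - b.rttMs)
    .map((r) => r.focus);
}
```

Because `filter` returns a new array, sorting does not mutate the caller's list of results.
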
#### Joining a call

The following diagram explains how a client chooses a focus when joining a call.

```mermaid
flowchart TD;
    wantsToJoin[Wants to join a call];
    hasPreferred(Has preferred focus?);
    callPreferred[Calls preferred foci without media to grab a slot];
    publishPreferred[Publishes `m.foci.preferred`];
    checkMembers(Call has more than 2 members including the client itself?);
    callFullMesh[Calls other member full-mesh];
    callMembersFoci[Tries calling foci from `m.call.member` events];
    orderFoci[Orders foci from best to worst];
    findFocusPreferredByOtherMember(Goes through ordered foci to find one which is preferred by at least one other member);
    callBestPreferred[Calls the focus];
    callBestActive[Calls the best active focus in room];
    publishActive[Publishes `m.foci.active`];

    wantsToJoin-->hasPreferred;
    hasPreferred--->|Yes|callPreferred;
    hasPreferred--->|No|checkMembers;
    callPreferred--->publishPreferred;
    publishPreferred--->checkMembers;
    checkMembers--->|Yes|callMembersFoci;
    checkMembers--->|No|callFullMesh;
    callMembersFoci--->orderFoci;
    orderFoci--->findFocusPreferredByOtherMember;
    findFocusPreferredByOtherMember--->|Found|callBestPreferred;
    callBestPreferred--->publishActive;
    findFocusPreferredByOtherMember--->|Not found|callBestActive;
    callBestActive--->publishActive;
```

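
The decision part of the flowchart (from the member-count check onwards) can be sketched as a pure function. This is one possible reading of the diagram, not a normative algorithm: the names `chooseFocus` and `Decision` are hypothetical, `orderedFoci` is assumed to be already sorted from best to worst, and the `no-focus-available` outcome covers a case the diagram does not define.

```typescript
interface Focus {
  user_id: string;
  device_id: string;
}

interface CallMemberDevice {
  device_id: string;
  "m.foci.active"?: Focus[];
  "m.foci.preferred"?: Focus[];
}

type Decision =
  | { kind: "full-mesh" }
  | { kind: "focus"; focus: Focus; reason: "preferred-by-other" | "best-active" }
  | { kind: "no-focus-available" }; // not covered by the flowchart

function chooseFocus(
  memberCount: number, // members in the call, including this client
  orderedFoci: Focus[], // foci from `m.call.member` events, best first
  otherDevices: CallMemberDevice[],
): Decision {
  // "Call has more than 2 members including the client itself?" -> No
  if (memberCount <= 2) return { kind: "full-mesh" };

  const listedBy = (focus: Focus, field: "m.foci.active" | "m.foci.preferred"): boolean =>
    otherDevices.some((d) =>
      (d[field] ?? []).some(
        (f) => f.user_id === focus.user_id && f.device_id === focus.device_id,
      ),
    );

  // Best focus that is preferred by at least one other member.
  const preferred = orderedFoci.filter((f) => listedBy(f, "m.foci.preferred"));
  if (preferred.length > 0) {
    return { kind: "focus", focus: preferred[0], reason: "preferred-by-other" };
  }

  // Otherwise, the best focus that is active for some member in the room.
  const active = orderedFoci.filter((f) => listedBy(f, "m.foci.active"));
  if (active.length > 0) {
    return { kind: "focus", focus: active[0], reason: "best-active" };
  }
  return { kind: "no-focus-available" };
}
```

Whatever focus is returned, the client would then publish it in `m.foci.active`, as in the final step of the diagram.
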
#### Mid-call changes

Once in a call, the client listens for changes to `m.call.member` state events
and if another member starts using one of the client's preferred foci, the client
switches to that focus.

**TODO: other cases?**

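
The mid-call rule can be expressed as a check that runs whenever an `m.call.member` event changes. The sketch below assumes the client keeps its own device entry and those of the other members in memory; `focusToSwitchTo` is a hypothetical helper name.

```typescript
interface Focus {
  user_id: string;
  device_id: string;
}

interface CallMemberDevice {
  device_id: string;
  "m.foci.active"?: Focus[];
  "m.foci.preferred"?: Focus[];
}

// Returns the first of our preferred foci that another member is now actively
// using (and that we are not already using), or undefined if there is none.
function focusToSwitchTo(
  own: CallMemberDevice,
  otherDevices: CallMemberDevice[],
): Focus | undefined {
  const same = (a: Focus, b: Focus): boolean =>
    a.user_id === b.user_id && a.device_id === b.device_id;
  const ownActive = own["m.foci.active"] ?? [];

  for (const candidate of own["m.foci.preferred"] ?? []) {
    if (ownActive.some((f) => same(f, candidate))) continue;
    const usedByOther = otherDevices.some((d) =>
      (d["m.foci.active"] ?? []).some((f) => same(f, candidate)),
    );
    if (usedByOther) return candidate;
  }
  return undefined;
}
```
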
### Initial offer/answer dance

During the initial offer/answer dance, the client establishes a data-channel
between itself and the SFU to use later for rapid signalling.

### Simulcast

#### RTP munging

#### vp8 munging

### RTCP re-transmission

### Data-channel messaging

The client uses the established data channel connection to the SFU to perform
low-latency signalling to rapidly (un)subscribe/(un)publish streams, send
ping messages, metadata, cascade and perform re-negotiation.

See the section about the [rationale](#the-use-of-the-data-channels-for-signaling)
behind the use of the data channels for signaling.

- **TODO: Spell out how the DC traffic interacts with application-layer
  traffic**

#### SDP Stream Metadata extension

The client will be receiving multiple streams from the SFU and will need to be
able to distinguish them. This MSC therefore builds on
[MSC3077](https://github.com/matrix-org/matrix-spec-proposals/pull/3077) and
[MSC3291](https://github.com/matrix-org/matrix-spec-proposals/pull/3291) to
provide the client with the necessary metadata. Some of the data-channel events
include an `sdp_stream_metadata` field containing a description of the streams
being sent either from the SFU to the client or from the client to the SFU.

Other than mute information and stream purpose, the metadata includes video
track resolution. The SFU may not be able to determine the resolution of the
track itself but it does need to know for simulcast; therefore, we include this
in the metadata.

```json
{
    "streamId1": {
        "purpose": "m.usermedia",
        "audio_muted": false,
        "video_muted": true,
        "tracks": {
            "trackId1": {
                "width": 1920,
                "height": 1080
            },
            "trackId2": {}
        }
    }
}
```

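
A client or focus reading this metadata has to cope with tracks that carry no resolution (such as `trackId2` above). The following sketch mirrors the example only: the type names and the `trackResolution` helper are hypothetical, and which fields are mandatory is an assumption rather than something this section states.

```typescript
// Shapes inferred from the example above; names are illustrative only.
interface TrackMetadata {
  width?: number;
  height?: number;
}

interface StreamMetadata {
  purpose: string;
  audio_muted: boolean;
  video_muted: boolean;
  tracks: Record<string, TrackMetadata>;
}

type SdpStreamMetadata = Record<string, StreamMetadata>;

// Returns the advertised resolution of a track, or undefined when the stream
// or track is unknown or the track carries no resolution.
function trackResolution(
  metadata: SdpStreamMetadata,
  streamId: string,
  trackId: string,
): { width: number; height: number } | undefined {
  const stream: StreamMetadata | undefined = metadata[streamId];
  const track: TrackMetadata | undefined = stream ? stream.tracks[trackId] : undefined;
  if (!track || track.width === undefined || track.height === undefined) {
    return undefined;
  }
  return { width: track.width, height: track.height };
}

const exampleMetadata: SdpStreamMetadata = {
  streamId1: {
    purpose: "m.usermedia",
    audio_muted: false,
    video_muted: true,
    tracks: { trackId1: { width: 1920, height: 1080 }, trackId2: {} },
  },
};
```
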
#### Event types

This MSC adds a few new `m.call.*` events and extends a few of the existing ones.

##### `m.call.track_subscription`

This event is sent to the focus to let it know about the tracks the client would
like to start/stop subscribing to.

Upon receiving this event, a focus should make the subscription changes based on
the `subscribe` and `unsubscribe` arrays and respond with an `m.call.negotiate`
event.

In the case of video tracks, the client may also request a specific resolution
for a given track in the `subscribe` array; this is the resolution the client
wishes to receive, but the SFU may send a lower one due to bandwidth
constraints etc.

If the user, for example, switches from "spotlight" (one large tile) to "grid"
(multiple small tiles) view, the client should also send this event with the
updated resolution in the `subscribe` array to let the focus know of the
resolution change.

Clients may request each track only once: foci should ignore multiple requests
of the same track.

- **TODO: how do we prove to the focus that we have the right to subscribe to
  a track?**

```json
{
    "type": "m.call.track_subscription",
    "content": {
        "subscribe": [
            {
                "stream_id": "streamId1",
                "track_id": "trackId1",
                "width": 1920,
                "height": 1080
            },
            {
                "stream_id": "streamId2",
                "track_id": "trackId2",
                "width": 256,
                "height": 144
            }
        ],
        "unsubscribe": [
            {
                "stream_id": "streamId3",
                "track_id": "trackId4"
            },
            {
                "stream_id": "streamId4",
                "track_id": "trackId4"
            }
        ]
    }
}
```

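
A client could assemble this event with a small helper that also enforces the "each track only once" rule before sending. This is a sketch under assumptions: `buildTrackSubscription` is a hypothetical name, and keeping the first request for a duplicated track is one interpretation of how duplicates should be handled.

```typescript
interface TrackRef {
  stream_id: string;
  track_id: string;
}

interface TrackRequest extends TrackRef {
  width?: number;
  height?: number;
}

interface TrackSubscriptionEvent {
  type: "m.call.track_subscription";
  content: { subscribe: TrackRequest[]; unsubscribe: TrackRef[] };
}

// Keeps only the first entry for each (stream_id, track_id) pair.
function dedupeTracks<T extends TrackRef>(tracks: T[]): T[] {
  const seen: Record<string, boolean> = {};
  return tracks.filter((t) => {
    const key = JSON.stringify([t.stream_id, t.track_id]);
    if (seen[key]) return false;
    seen[key] = true;
    return true;
  });
}

function buildTrackSubscription(
  subscribe: TrackRequest[],
  unsubscribe: TrackRef[],
): TrackSubscriptionEvent {
  return {
    type: "m.call.track_subscription",
    content: {
      subscribe: dedupeTracks(subscribe),
      unsubscribe: dedupeTracks(unsubscribe),
    },
  };
}
```

Switching from "spotlight" to "grid" view would then amount to calling `buildTrackSubscription` again with the smaller `width` and `height` for the affected tracks and an empty `unsubscribe` list.
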
##### `m.call.negotiate`

This event works exactly like the `m.call.negotiate` event in 1:1 calls.

```json
{
    "type": "m.call.negotiate",
    "content": {
        "description": {
            "type": "offer",
            "sdp": "..."
        },
        "sdp_stream_metadata": {...} // As specified in the Metadata section
    }
}
```

##### `m.call.sdp_stream_metadata_changed`

This event works very similarly to the 1:1 call `m.call.sdp_stream_metadata_changed`.

- **TODO: Spec how foci actually use this to advertise tracks**

```json
{
    "type": "m.call.sdp_stream_metadata_changed",
    "content": {
        "sdp_stream_metadata": {...} // As specified in the Metadata section
    }
}
```

##### `m.call.ping`, `m.call.pong`

A ping message must be sent by the focus to the client at an interval
no greater than 30 seconds. On receiving a ping message, a client must respond
immediately with a pong message. A client may therefore detect that the
connection has failed after an amount of time of its choosing (greater than
30 seconds) has elapsed since it last saw a ping message. A server may deem a
client unresponsive after not receiving a pong some amount of time after it
has sent a ping; again, the amount of time the server waits is up to the
implementation. Either side should hang up once it deems the other side
unresponsive.

focus -> client:

```json
{
    "type": "m.call.ping",
    "content": {}
}
```

client -> focus:

```json
{
    "type": "m.call.pong",
    "content": {}
}
```

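
On the client side, the liveness rule can be tracked with a small watchdog. The sketch below is an assumption about how a client might structure this, not part of the MSC: `PingWatchdog` is a hypothetical class, timestamps are passed in explicitly (in milliseconds) so that the logic is independent of any timer API, and the timeout is left to the caller as long as it exceeds 30 seconds.

```typescript
const MAX_PING_INTERVAL_MS = 30000; // the focus must ping at least this often

interface PongEvent {
  type: "m.call.pong";
  content: Record<string, never>;
}

class PingWatchdog {
  private readonly timeoutMs: number;
  private lastPingAt: number;

  // `timeoutMs` is implementation-defined but must be greater than 30 seconds.
  constructor(timeoutMs: number, now: number) {
    if (timeoutMs <= MAX_PING_INTERVAL_MS) {
      throw new RangeError("timeout must be greater than 30 seconds");
    }
    this.timeoutMs = timeoutMs;
    this.lastPingAt = now;
  }

  // Call when an `m.call.ping` arrives; returns the pong to send immediately.
  onPing(now: number): PongEvent {
    this.lastPingAt = now;
    return { type: "m.call.pong", content: {} };
  }

  // True once no ping has been seen for longer than the chosen timeout,
  // at which point the client should hang up.
  hasFailed(now: number): boolean {
    return now - this.lastPingAt > this.timeoutMs;
  }
}
```

A client would typically call `hasFailed(Date.now())` from a periodic timer and hang up when it returns `true`.
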
##### `m.call.connect_to_focus`

If a user is using their focus in a call, the focus will need to know how to
connect to other foci present in order to participate in the full-mesh of SFU
traffic (if any). The client is responsible for telling it, using the
`m.call.connect_to_focus` event.

```json
{
    "type": "m.call.connect_to_focus",
    "content": {
        // TODO: How should this look?
    }
}
```

### Notes

#### Hiding behind foci

We do not recommend that users utilise a focus to hide behind for privacy, but
instead use a TURN server, only providing relay candidates, rather than
consuming focus resources and unnecessarily mandating the presence of a focus.

## Potential issues

The SFUs participating in a conference end up in a full mesh. Rather than
inventing our own spanning-tree system for SFUs, however, we should fix it for
Matrix as a whole (as is happening in the LB work) and use a Pinecone tree or
similar to decide what better-than-full-mesh topology to use. In practice, full
mesh cascade between SFUs is probably not that bad (especially if SFUs only
request the streams over the trunk their clients care about), and on aggregate
will be less obnoxious than all the clients hitting a single SFU.

Too many foci will chew bandwidth due to full-mesh between them. In the worst
case, if every user is on their own HS and picks a different focus, it
degenerates to a full-mesh call (just server-side rather than client-side).
Hopefully this shouldn't happen, as clients will converge on using the single
SFU with the most clients, but we need to check how this works in practice.

SFrame currently mandates its own ratchet, which is almost the same as megolm
but not quite. Switching it out for megolm seems reasonable right now (at least
until MLS comes along).

## Alternatives

An option would be to treat 1:1 (and full mesh) entirely differently to SFU
based calling rather than trying to unify them. Also, it's debatable whether
supporting full mesh is useful at all. In the end, it feels like unifying 1:1
and SFU calling is for the best though, as it then gives you the ability to
trivially upgrade 1:1 calls to group calls and vice versa, and avoids
maintaining two separate hunks of spec. It also forces 1:1 calls to take
multi-stream calls seriously, which is useful for more exotic capture devices
(stereo cameras; 3D cameras; surround sound; audio fields etc).

### The use of the data channels for signaling

The current specification assumes that signaling works over Matrix, but
side-chains to the data channel once the peer connection is established
in order to perform low-latency signaling.

In an ideal scenario, the use of the data channels would not be required and
native Matrix signaling would be sufficient. However, because regular Matrix
signaling may need to traverse different servers, e.g.
`client <-> home server <-> home server <-> sfu`, our signaling would not be
quite as fast as we need it to be. The effect is even greater when coupled with
the fact that certain protocols like HTTP are not as efficient for real-time
communication as e.g. WebRTC data channels or WebSockets.

The problem would be solved if the clients could connect to the SFU
**directly** and communicate via Matrix for all signaling messages. This
would allow us to use a faster transport (WebSockets, QUIC etc) to transmit
signaling messages. However, this is **currently** not possible because it
would require support for P2P Matrix, which is still under development at the
time of writing this MSC.

To read more about the problem and get more context, please refer to the
[discussion](https://github.com/matrix-org/matrix-spec-proposals/pull/3898#discussion_r1019098025).

### Cascading

One option here is for SFUs to act as an AS and sniff the `m.call.member`
traffic of their associated server, and automatically call any other `m.foci`
which appear. (They don't need to make outbound calls to clients, as clients
always dial in).

## Security considerations

Malicious users could try to DoS SFUs by specifying them as their foci.

SFrame E2EE may go horribly wrong if we can't send the new megolm session fast
enough to all the participants when a participant leaves (and meanwhile if we
keep using the old session, we're technically leaking call media to the parted
participant until we manage to rotate).

We need to ensure there's no scope for media forwarding loops through SFUs.

In order to authenticate that only legitimate users are allowed to subscribe to
a given `conf_id` on an SFU, it would make sense for the SFU to act as an AS and
sniff the `m.call` events on their associated server, and only act on to-device
`m.call.*` events which come from a user who is confirmed to be in the room for
that `m.call`. (In practice, if the conf is E2EE then it's of limited use to
connect to the SFU without having the keys to decrypt the traffic, but this
feature is desirable for non-E2EE confs and to stop bandwidth DoS.)

## Unstable prefixes

We probably don't care for this for the data-channel?

While this MSC is not considered stable, implementations should use
`org.matrix.msc3898` as a namespace.

|Stable (post-FCP) |Unstable                           |
|------------------|-----------------------------------|
|`m.foci.active`   |`org.matrix.msc3898.foci.active`   |
|`m.foci.preferred`|`org.matrix.msc3898.foci.preferred`|