matrix-doc/proposals/3898-sfu.md

# MSC3898: Native Matrix VoIP signalling for cascaded SFUs

[MSC3401](https://github.com/matrix-org/matrix-spec-proposals/pull/3401)
specifies how full-mesh group calls work in Matrix. While that MSC works well
for small group calls, it does not work so well for large conferences due to
bandwidth (and other) issues.

Selective Forwarding Units (SFUs) - servers which forwarding WebRTC streams
between peers (which could be clients or SFUs or both). To make use of them
effectively, peers need to be able to tell the SFU which streams they want to
receive at what resolutions.

To solve the issue of centralization, the SFUs are also allowed to connect to
each other ("cascade") and therefore the peers also need a way to tell an SFU to
which other SFUs to connect.

## Proposal

- **TODO: spell out how this works with active speaker detection & associated
signalling**

### Diagrams

The diagrams of how this all looks can be found in
[MSC3401](https://github.com/matrix-org/matrix-spec-proposals/pull/3401).

### Additions to the `m.call.member` state event

This MSC proposes adding two _optional_ fields to the `m.call.member` state event:
`m.foci.preferred` and `m.foci.active`.

Informational: This attempts to avoid the situation where a conference is ongoing
with several users in, for example, New York. These users are all connected to the
focus in New York. Alice joins from London: rather than connecting to the focus
in London, she connects directly to the one in New York since that's where all the
other participants are connected. If more users then join from London, however, they
will all make the same decision and connect to the New York focus rather than the
optimal configuration of the London users connected to the London focus. With active
and preferred foci, the second user that joins from London will know that although
Alice's active focus is New York, her preferred is London, and can therefore choose
the London focus instead.

For instance:

```json
{
    "type": "m.call.member",
    "state_key": "@matthew:matrix.org",
    "content": {
        "m.calls": [
            {
                "m.call_id": "cvsiu2893",
                "m.devices": [{
                    "device_id": "U738KDF9WJ",
                    "m.foci.active": [
                        { "user_id": "@sfu-lon:matrix.org", "device_id": "FS5F589EF" }
                    ],
                    "m.foci.preferred": [
                        { "user_id": "@sfu-bon:matrix.org", "device_id": "3FSF589EF" },
                        { "user_id": "@sfu-mon:matrix.org", "device_id": "GFSDH93EF" },
                    ]
                }]
            }
        ],
        "m.expires_ts":  1654616071686
    }
}
```

#### `m.foci.active`

This field is a list of foci the user's device is publishing to. Usually, this
list will have a length of 1, yet a client might publish to multiple foci if
they are on different networks, for instance, or to simultaneously fan-out in
different directions from the client if there is no nearby focus. If the client
is participating full-mesh, it should either omit this field from the state
event or leave the list empty.

#### `m.foci.preferred`

This field is a list of foci the client would prefer to switch to from the
current active focus, if any other client also starts using the given focus. If
the client is already using one of its preferred foci, it should either omit
this field from the state event or leave the list empty.

### Choosing a focus

#### Discovering foci

- **TODO: How does a client discover foci? We could use well-known or a custom endpoint**

Foci are identified by a tuple of `user_id` and `device_id`.

#### Determining the best focus

There are many ways to determine the best focus; this MSC recommends the
following:

- Is the quickest to respond to `m.call.invite` with `m.call.answer`.
- Is the quickest to rapidly reject a spurious HTTPS request to a high-numbered
  port on the SFU's IP address, if the SFU exposes its IP somewhere - similar to
  the [apenwarr/blip](https://github.com/apenwarr/blip) trick, in order to
  measure media-path latency rather than signalling path latency.
- Has the best latency of data-channel traffic flows.
- Has the best latency and bandwidth determined by sending a small splurge of
  media down the pipe to probe.

#### Joining a call

The following diagram explains how a client chooses a focus when joining a call.

```mermaid
flowchart TD;
wantsToJoin[Wants to join a call];
hasPreferred(Has preferred focus?);
callPreferred[Calls preferred foci without media to grab a slot];
publishPreferred[Publishes `m.foci.preferred`];
checkMembers(Call has more than 2 members including the client itself?);
callFullMesh[Calls other member full-mesh];
callMembersFoci[Tries calling foci from `m.call.member` events];
orderFoci[Orders foci from best to worst];
findFocusPreferredByOtherMember(Goes through ordered foci to find one which is preferred by at least one other member);
callBestPreferred[Calls the focus];
callBestActive[Calls the best active focus in room];
publishActive[Publishes `m.foci.active`];

wantsToJoin-->hasPreferred;
hasPreferred--->|Yes|callPreferred;
hasPreferred--->|No|checkMembers;
callPreferred--->publishPreferred;
publishPreferred--->checkMembers;
checkMembers--->|Yes|callMembersFoci;
checkMembers--->|No|callFullMesh;
callMembersFoci--->orderFoci;
orderFoci--->findFocusPreferredByOtherMember;
findFocusPreferredByOtherMember--->|Found|callBestPreferred;
callBestPreferred--->publishActive;
findFocusPreferredByOtherMember--->|Not found|callBestActive;
callBestActive--->publishActive;
```

#### Mid-call changes

Once in a call, the client listens for changes to `m.call.member` state events
and if another member starts using one of the client's preferred foci, the client
switches to that focus.

**TODO: other cases?**

### Initial offer/answer dance

During the initial offer/answer dance, the client establishes a data-channel
between itself and the SFU to use later for rapid signalling.

### Simulcast

#### RTP munging

#### vp8 munging

### RTCP re-transmission

### Data-channel messaging

The client uses the established data channel connection to the SFU to perform
low-latency signalling to rapidly (un)subscribe/(un)publish streams, send
ping messages, metadata, cascade and perform re-negotiation.

See the section about the [rationale](#the-use-of-the-data-channels-for-signaling)
behind the use of the data channels for signaling.

- **TODO: Spell out how the DC traffic interacts with application-layer
traffic**

#### SDP Stream Metadata extension

The client will be receiving multiple streams from the SFU and it will need to
be able to distinguish them, this therefore builds on
[MSC3077](https://github.com/matrix-org/matrix-spec-proposals/pull/3077) and
[MSC3291](https://github.com/matrix-org/matrix-spec-proposals/pull/3291) to
provide the client with the necessary metadata. Some of the data-channel events
include an `sdp_stream_metadata` field including a description of the stream
being sent either from the SFU to the client or from the client to the SFU.

Other than mute information and stream purpose, the metadata includes video
track resolution. The SFU may not be able to determine the resolution of the
track itself but it does need to know for simulcast; therefore, we include this
in the metadata.

```json
{
    "streamId1": {
        "purpose": "m.usermedia",
        "audio_muted": false,
        "video_muted": true,
        "tracks": {
            "trackId1": {
                "width": 1920,
                "height": 1080
            },
            "trackId2": {}
        }
    }
}
```

#### Event types

This MSC adds a few new `m.call.*` events and extends a few of the existing ones.

##### `m.call.track_subscription`

This event is sent to the focus to let it know about the tracks the client would
like to start/stop subscribing to.

Upon receiving this event, a focus should make the subscribe changes based on
the `start` and `stop` arrays and respond with an `m.call.negotiate` event.

In the case of video tracks, in the `start` array the client may also request a
specific resolution for a given track; this resolution is a resolution the
client wishes to receive but the SFU may send a lower one due to bandwidth etc.

If the user for example switches from "spotlight" (one large tile) to "grid"
(multiple small tiles) view, it should also send this event with the updated
resolution in the `start` array to let the focus know of the resolution change.

Clients may request each track only once: foci should ignore multiple requests
of the same track.

- **TODO: how do we prove to the focus that we have the right to subscribe to
track?**

```json
{
    "type": "m.call.track_subscription",
    "content": {
        "subscribe": [
            {
                "stream_id": "streamId1",
                "track_id": "trackId1",
                "width": 1920,
                "height": 1080
            },
            {
                "stream_id": "streamId2",
                "track_id": "trackId2",
                "width": 256,
                "height": 144
            }
        ],
        "unsubscribe": [
            {
                "stream_id": "streamId3",
                "track_id": "trackId4"
            },
            {
                "stream_id": "streamId4",
                "track_id": "trackId4"
            }
        ]
    }
}
```

##### `m.call.negotiate`

This event works exactly like the `m.call.negotiate` event in 1:1 calls.

```json
{
    "type": "m.call.negotiate",
    "content": {
        "description": {
            "type": "offer",
            "sdp": "..."
        },
        "sdp_stream_metadata": {...} // As specified in the Metadata section
    }
}
```

##### `m.call.sdp_stream_metadata_changed`

This event works very similarly to the 1:1 call `m.call.sdp_stream_metadata_changed`.

- **TODO: Spec how foci actually use this to advertise tracks**

```json
{
    "type": "m.call.sdp_stream_metadata_changed",
    "content": {
        "sdp_stream_metadata": {...} // As specified in the Metadata section
    }
}
```

##### `m.call.ping`, `m.call.pong`

A ping message must be sent by the focus to the client at an interval
no greater than 30 seconds. On receiving a ping message, a client must respond
immediately with a pong message. A client may therefore detect that the
connection has failed after an amount of time of its choosing (greater than
30 seconds) has elapsed since it last saw a ping message. A server may deem a
client unresponsive after not receiving a pong some amount of time after it
has sent a ping, again the amount of time the server waits is up to the
implementation. Either send should hang up once deeming the other side
unresponsive.

focus -> client:

```json
{
    "type": "m.call.ping",
    "content": {}
}
```

client -> focus:

```json
{
    "type": "m.call.pong",
    "content": {}
}
```

##### `m.call.connect_to_focus`

If a user is using their focus in a call, it will need to know how to connect to
other foci present in order to participate in the full-mesh of SFU traffic (if
any). The client is responsible for doing this using the
`m.call.connect_to_focus` event.

```json
{
    "type": "m.call.connect_to_focus",
    "content": {
        // TODO: How should this look?
    }
}
```

### Notes

#### Hiding behind foci

We do not recommend that users utilise a focus to hide behind for privacy, but
instead use a TURN server, only providing relay candidates, rather than
consuming focus resources and unnecessarily mandating the presence of a focus.

## Potential issues

The SFUs participating in a conference end up in a full mesh. Rather than
inventing our own spanning-tree system for SFUs however, we should fix it for
Matrix as a whole (as is happening in the LB work) and use a Pinecone tree or
similar to decide what better-than-full-mesh topology to use. In practice, full
mesh cascade between SFUs is probably not that bad (especially if SFUs only
request the streams over the trunk their clients care about) - and on aggregate
will be less obnoxious than all the clients hitting a single SFU.

Too many foci will chew bandwidth due to full-mesh between them. In the worst
case, if every use is on their own HS and picks a different foci, it degenerates
to a full-mesh call (just server-side rather than client-side).  Hopefully this
shouldn't happen as you will converge on using a single SFU with the most
clients, but need to check how this works in practice.

SFrame mandates its own ratchet currently which is almost the same as megolm but
not quite.  Switching it out for megolm seems reasonable right now (at least
until MLS comes along)

## Alternatives

An option would be to treat 1:1 (and full mesh) entirely differently to SFU
based calling rather than trying to unify them. Also, it's debatable whether
supporting full mesh is useful at all. In the end, it feels like unifying 1:1
and SFU calling is for the best though, as it then gives you the ability to
trivially upgrade 1:1 calls to group calls and vice versa, and avoids
maintaining two separate hunks of spec.  It also forces 1:1 calls to take
multi-stream calls seriously, which is useful for more exotic capture devices
(stereo cameras; 3D cameras; surround sound; audio fields etc).

### The use of the data channels for signaling

The current specification assumes that signaling works over Matrix, but
side-chains to the data channel once the peer connection is established
in order to perform low-latency signaling.

In an ideal scenario the use of the data channels would not be required and
the usage of native Matrix signaling would be sufficient, however due to
the fact that regular Matrix signaling may need to traverse different
servers, e.g. `client <-> home server <-> home server <-> sfu`, our
signaling would not be quite as fast as we need it to be. The effect will
be even greater when coupled with the fact that certain protocols like
HTTP would not be as efficient for a real-time communication as e.g. WebRTC
data channels or WebSockets.

The problem would be solved if the clients could connect to the SFU
**directly** and communicate via Matrix for all signaling messages. This
would allow us to use a faster transport (WebSockets, QUIC etc) to transmit
signaling messages. However, this is **currently** not possible due to the fact
that it would require the support of the P2P Matrix that is still being under
development at the time of writing this MSC.

To read more about the problem and get more context, please refer to the
[discussion](https://github.com/matrix-org/matrix-spec-proposals/pull/3898#discussion_r1019098025).

### Cascading

One option here is for SFUs to act as an AS and sniff the `m.call.member`
traffic of their associated server, and automatically call any other `m.foci`
which appear.  (They don't need to make outbound calls to clients, as clients
always dial in).

## Security considerations

Malicious users could try to DoS SFUs by specifying them as their foci.

SFrame E2EE may go horribly wrong if we can't send the new megolm session fast
enough to all the participants when a participant leave (and meanwhile if we
keep using the old session, we're technically leaking call media to the parted
participant until we manage to rotate).

Need to ensure there's no scope for media forwarding loops through SFUs.

In order to authenticate that only legitimate users are allowed to subscribe to
a given `conf_id` on an SFU, it would make sense for the SFU to act as an AS and
sniff the `m.call` events on their associated server, and only act on to-device
`m.call.*` events which come from a user who is confirmed to be in the room for
that `m.call`. (In practice, if the conf is E2EE then it's of limited use to
connect to the SFU without having the keys to decrypt the traffic, but this
feature is desirable for non-E2EE confs and to stop bandwidth DoS)

## Unstable prefixes

We probably don't care for this for the data-channel?

While this MSC is not considered stable, implementations should use
`org.matrix.msc3898` as a namespace.

|Stable (post-FCP) |Unstable                           |
|------------------|-----------------------------------|
|`m.foci.active`   |`org.matrix.msc3898.foci.active`   |
|`m.foci.preferred`|`org.matrix.msc3898.foci.preferred`|