6.0 KiB
E2E Encrypted SFU VoIP conferencing via Matrix
Background
Matrix has experimented with many different VoIP conferencing approaches over the years:
-
Using FreeSWITCH as an MCU (multipoint conferencing unit - i.e. mixer) via matrix-appservice-verto, where Riot would place a normal Matrix 1:1 VoIP call to an endpoint on the MCU derived from the conf room ID where the conf call was being triggered, with the existence of the ongoing conf call tracked in the conf room’s room state. This predated Matrix E2EE, and suffered due to problems with tuning FreeSWITCH to handle low bandwidth connections, as well as suffered bad UX relative to an SFU, and was removed from Riot in ~2017.
-
Using Jitsi as an SFU (stream forwarding unit) via widgets augmented by native support. This provides a much better UX, but doesn’t provide E2EE. It’s fiddly to get working (particularly screensharing) on Riot/Desktop though, and the React/Native dependencies on Riot/Mobile end up being quite a pain to maintain. Jitsi occasionally adds unwanted analytics dependencies & functionalities too. It’s also a bit of a shame to rely on embedding a random “out of band” centralised focal point for conferencing via a widget, rather than leveraging Matrix as a data transport or signalling layer.
-
Using full mesh VoIP calls, where all the clients in a given room initiate 1:1 VoIP calls in DMs in order to establish a conf call. This was done as a quick hack for vrdemo and worked surprisingly well - but has not been evolved due to lack of braincycles (and because Jitsi was working well enough, with a nice UX). It provides decentralised E2EE conferencing out of the box, but consumes significant bandwidth & CPU/GPU/power to handle all the simultaneous 1:1 calls.
This proposal is a sketch of a 4th type of conferencing, providing SFU semantics but leveraging Matrix’s E2EE to stop the SFU being able to intercept call media.
Overview
- You start off with a normal E2EE matrix room
- All members start a VoIP 1:1 call in a DM with the SFU
- However, the SRTP keys for the media RTP (not RTCP) streams are
deliberately stripped from the SDP of the m.call.invite and m.call.answer
by the clients, so the SFU can’t decrypt the call media. The call
signalling negotiates typical SFU srtp streams for:
- Sending audio (if not muted)
- Sending thumbnail video (if not muted)
- Sending full-res video (if requested by the SFU and not muted)
- Receiving 1-n multiplexed audio streams
- Receiving 1-n multiplexed video streams (mix of thumbnail & full-res)
- The 1:1 rooms could/should be E2EE to protect metadata, although this isn’t strictly necessary to protect the call media.
- However, the SRTP keys for the media RTP (not RTCP) streams are
deliberately stripped from the SDP of the m.call.invite and m.call.answer
by the clients, so the SFU can’t decrypt the call media. The call
signalling negotiates typical SFU srtp streams for:
- The members exchange the SRTP keys via timeline events (ideally state events, but they’re not E2EE yet) in the main conference rooms, so the clients can decrypt the forwarded SRTP streams.
- The SFU itself:
- Looks at the bandwidth of the media streams being received from the
various clients, and uses REMB or TMMBR or whatever RTCP congestion
control mechanism to request that the sending client’s full-res bitrate is
clamped to the lowest receive bitrate determined from the clients which
are currently trying to view the full-res streams.
- (Particularly slow receiving clients could be ignored and be forced to (use the thumbnail rather than the full-res stream instead)
- Tracks which clients are trying to view the full-res streams (via
datachannel?) and forwards the full-res streams to the clients in question
(requesting them via datachannel from the client if needed).
- The SFU could also use the datachannel to determine who’s currently claiming to talk, to let users control the conference focus.
- Does the same for thumbnails too. (Could assume that everyone wants a copy of the thumbnail streams).
- Relays the audio streams to everyone.
- Looks at the bandwidth of the media streams being received from the
various clients, and uses REMB or TMMBR or whatever RTCP congestion
control mechanism to request that the sending client’s full-res bitrate is
clamped to the lowest receive bitrate determined from the clients which
are currently trying to view the full-res streams.
- We use the datachannel for the SFU control rather than Matrix to minimise latency (which is really important when rapidly switching focus based on voice detection in a call).
- This consciously leaks metadata about who was talking and when, but at least the call data isn’t leaked.
- The fact the SFU can’t decrypt the streams means that some tricks aren’t
available:
- We can’t framedrop when sending to slow clients, as we don’t know where the frames are. (Unless we provide some custom RTP headers or RTCP packets outside the SRTP payloads to identify the frame types, but WebRTC doesn’t support this afaik?)
- We also can’t downsample for slow clients, obviously. We could however negotiate multiple send streams from the clients to try to support a slower clients better.
- SVC (which is patent encumbered anyway) probably is ruled out, as exploiting spatial redundancy between the low & high res send streams is probably impossible between the separated streams.
- However, some tricks are still available?
- We can however forward keyframe requests from clients via RTCP.
- This has been written without reference to perc, so is probably missing insights from there.
TL;DR: it works like a normal SFU, except the SRTP keys for the media streams are exchanged in the megolm room where the conference was initiated, so the SFU can never decrypt the media - but can still do rate control and forward the streams around intelligently.
Details
Need to specify:
- matrix timeline events for advertising the SRTP keys for the various streams in the conf room
- matrix state events for announcing the existence of a conf call in the conf room
- DataChannel API for SFU floor control (or perhaps we could start off with Matrix to keep things a bit simpler?)
- resolution/fps of the pyramid of send streams? ability to let the SFU dynamically negotiate the send stream resolution/fps?
- TMMBR or REMB or whatever folks use for CC these days?