matrix-doc/proposals/2359-e2ee-voip-conferencing.md

112 lines
6.0 KiB
Markdown
Raw Permalink Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

# E2E Encrypted SFU VoIP conferencing via Matrix
## Background
Matrix has experimented with many different VoIP conferencing approaches over the years:
* Using FreeSWITCH as an MCU (multipoint conferencing unit - i.e. mixer) via
matrix-appservice-verto, where Riot would place a normal Matrix 1:1 VoIP call
to an endpoint on the MCU derived from the conf room ID where the conf call
was being triggered, with the existence of the ongoing conf call tracked in
the conf rooms room state. This predated Matrix E2EE, and suffered due to
problems with tuning FreeSWITCH to handle low bandwidth connections, as well
as suffered bad UX relative to an SFU, and was removed from Riot in ~2017.
* Using Jitsi as an SFU (stream forwarding unit) via widgets augmented by native
support. This provides a much better UX, but doesnt provide E2EE. Its
fiddly to get working (particularly screensharing) on Riot/Desktop though, and
the React/Native dependencies on Riot/Mobile end up being quite a pain to
maintain. Jitsi occasionally adds unwanted analytics dependencies &
functionalities too. Its also a bit of a shame to rely on embedding a random
“out of band” centralised focal point for conferencing via a widget, rather
than leveraging Matrix as a data transport or signalling layer.
* Using full mesh VoIP calls, where all the clients in a given room initiate
1:1 VoIP calls in DMs in order to establish a conf call. This was done as a
quick hack for
[vrdemo](https://github.com/matrix-org/matrix-vr-demo/blob/master/src/js/components/structures/FullMeshConference.js)
and worked surprisingly well - but has not been evolved due to lack of
braincycles (and because Jitsi was working well enough, with a nice UX). It
provides decentralised E2EE conferencing out of the box, but consumes
significant bandwidth & CPU/GPU/power to handle all the simultaneous 1:1
calls.
This proposal is a sketch of a 4th type of conferencing, providing SFU
semantics but leveraging Matrixs E2EE to stop the SFU being able to intercept
call media.
## Overview
* You start off with a normal E2EE matrix room
* All members start a VoIP 1:1 call in a DM with the SFU
* However, the SRTP keys for the media RTP (not RTCP) streams are
deliberately stripped from the SDP of the m.call.invite and m.call.answer
by the clients, so the SFU cant decrypt the call media. The call
signalling negotiates typical SFU srtp streams for:
* Sending audio (if not muted)
* Sending thumbnail video (if not muted)
* Sending full-res video (if requested by the SFU and not muted)
* Receiving 1-n multiplexed audio streams
* Receiving 1-n multiplexed video streams (mix of thumbnail & full-res)
* The 1:1 rooms could/should be E2EE to protect metadata, although this
isnt strictly necessary to protect the call media.
* The members exchange the SRTP keys via timeline events (ideally state
events, but theyre not E2EE yet) in the main conference rooms, so the
clients can decrypt the forwarded SRTP streams.
* The SFU itself:
* Looks at the bandwidth of the media streams being received from the
various clients, and uses REMB or TMMBR or whatever RTCP congestion
control mechanism to request that the sending clients full-res bitrate is
clamped to the lowest receive bitrate determined from the clients which
are currently trying to view the full-res streams.
* (Particularly slow receiving clients could be ignored and be forced to
(use the thumbnail rather than the full-res stream instead)
* Tracks which clients are trying to view the full-res streams (via
datachannel?) and forwards the full-res streams to the clients in question
(requesting them via datachannel from the client if needed).
* The SFU could also use the datachannel to determine whos currently
claiming to talk, to let users control the conference focus.
* Does the same for thumbnails too. (Could assume that everyone wants a copy
of the thumbnail streams).
* Relays the audio streams to everyone.
* We use the datachannel for the SFU control rather than Matrix to minimise
latency (which is really important when rapidly switching focus based on
voice detection in a call).
* This consciously leaks metadata about who was talking and when, but at least
the call data isnt leaked.
* The fact the SFU cant decrypt the streams means that some tricks arent
available:
* We cant framedrop when sending to slow clients, as we dont know where
the frames are. (Unless we provide some custom RTP headers or RTCP
packets outside the SRTP payloads to identify the frame types, but WebRTC
doesnt support this afaik?)
* We also cant downsample for slow clients, obviously. We could however
negotiate multiple send streams from the clients to try to support a
slower clients better.
* SVC (which is patent encumbered anyway) probably is ruled out, as
exploiting spatial redundancy between the low & high res send streams is
probably impossible between the separated streams.
* However, some tricks are still available?
* We can however forward keyframe requests from clients via RTCP.
* This has been written without reference to perc, so is probably missing insights
from there.
TL;DR: it works like a normal SFU, except the SRTP keys for the media streams
are exchanged in the megolm room where the conference was initiated, so the
SFU can never decrypt the media - but can still do rate control and forward
the streams around intelligently.
## Details
Need to specify:
* matrix timeline events for advertising the SRTP keys for the various streams in the conf room
* matrix state events for announcing the existence of a conf call in the conf room
* DataChannel API for SFU floor control (or perhaps we could start off with Matrix to keep things a bit simpler?)
* resolution/fps of the pyramid of send streams? ability to let the SFU dynamically negotiate the send stream resolution/fps?
* TMMBR or REMB or whatever folks use for CC these days?