matrix-doc/proposals/2359-e2ee-voip-conferencing.md


# E2E Encrypted SFU VoIP conferencing via Matrix


## Background

Matrix has experimented with many different VoIP conferencing approaches over the years:

* Using FreeSWITCH as an MCU (multipoint control unit - i.e. mixer) via
  matrix-appservice-verto, where Riot would place a normal Matrix 1:1 VoIP call
  to an endpoint on the MCU derived from the ID of the conf room where the conf
  call was being triggered, with the existence of the ongoing conf call tracked
  in the conf room’s room state.  This predated Matrix E2EE, suffered from
  problems with tuning FreeSWITCH to handle low bandwidth connections, and had
  bad UX relative to an SFU.  It was removed from Riot in ~2017.

* Using Jitsi as an SFU (selective forwarding unit) via widgets augmented by
  native support.  This provides a much better UX, but doesn’t provide E2EE.
  It’s fiddly to get working (particularly screensharing) on Riot/Desktop
  though, and the React Native dependencies on Riot/Mobile end up being quite a
  pain to maintain. Jitsi occasionally adds unwanted analytics dependencies &
  functionality too.  It’s also a bit of a shame to rely on embedding a random
  “out of band” centralised focal point for conferencing via a widget, rather
  than leveraging Matrix as a data transport or signalling layer.

* Using full mesh VoIP calls, where all the clients in a given room initiate
  1:1 VoIP calls in DMs in order to establish a conf call.  This was done as a
  quick hack for
  [vrdemo](https://github.com/matrix-org/matrix-vr-demo/blob/master/src/js/components/structures/FullMeshConference.js)
  and worked surprisingly well - but has not been evolved due to lack of
  braincycles (and because Jitsi was working well enough, with a nice UX). It
  provides decentralised E2EE conferencing out of the box, but consumes
  significant bandwidth & CPU/GPU/power to handle all the simultaneous 1:1
  calls.
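
To make the full mesh cost concrete, here is a rough count of media streams per client. This is a sketch only: it assumes one send stream per 1:1 call in the mesh case and a configurable number of send streams towards an SFU, and the function names are made up for illustration.

```python
from typing import Tuple


def mesh_streams_per_client(n: int) -> Tuple[int, int]:
    """(uploads, downloads) per client in a full mesh of n participants.

    Each client holds a separate 1:1 call with every other participant,
    so it sends and receives n - 1 streams.
    """
    return (n - 1, n - 1)


def sfu_streams_per_client(n: int, send_streams: int = 1) -> Tuple[int, int]:
    """(uploads, downloads) per client when an SFU forwards the media.

    The client uploads each of its send streams once; the SFU fans them
    out, so the client still receives at most n - 1 streams.
    """
    return (send_streams, n - 1)


# Example: an 8-person call.
print(mesh_streams_per_client(8))
print(sfu_streams_per_client(8))
```

The download side is the same in both cases; the saving is on the upload (and encode) side, which grows with the room size in a mesh but stays constant with an SFU.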

This proposal is a sketch of a 4th type of conferencing, providing SFU
semantics but leveraging Matrix’s E2EE to stop the SFU being able to intercept
call media.


## Overview

* You start off with a normal E2EE Matrix room
* All members start a VoIP 1:1 call in a DM with the SFU
  * However, the SRTP keys for the media RTP (not RTCP) streams are
    deliberately stripped from the SDP of the m.call.invite and m.call.answer
    by the clients, so the SFU can’t decrypt the call media. The call
    signalling negotiates typical SFU SRTP streams for:
    * Sending audio (if not muted)
    * Sending thumbnail video (if not muted)
    * Sending full-res video (if requested by the SFU and not muted)
    * Receiving 1-n multiplexed audio streams
    * Receiving 1-n multiplexed video streams (mix of thumbnail & full-res)
  * The 1:1 rooms could/should be E2EE to protect metadata, although this
    isn’t strictly necessary to protect the call media.
* The members exchange the SRTP keys via timeline events (ideally state
  events, but they’re not E2EE yet) in the main conference room, so the
  clients can decrypt the forwarded SRTP streams.
* The SFU itself:
  * Looks at the bandwidth of the media streams being received from the
    various clients, and uses REMB or TMMBR or whatever RTCP congestion
    control mechanism to request that the sending client’s full-res bitrate is
    clamped to the lowest receive bitrate determined from the clients which
    are currently trying to view the full-res streams.
    * (Particularly slow receiving clients could be ignored and forced to
      use the thumbnail rather than the full-res stream instead.)
  * Tracks which clients are trying to view the full-res streams (via
    datachannel?) and forwards the full-res streams to the clients in question
    (requesting them via datachannel from the client if needed).
    * The SFU could also use the datachannel to determine who’s currently
      claiming to talk, to let users control the conference focus.
  * Does the same for thumbnails too. (Could assume that everyone wants a copy
    of the thumbnail streams).
  * Relays the audio streams to everyone.
* We use the datachannel for the SFU control rather than Matrix to minimise
  latency (which is really important when rapidly switching focus based on
  voice detection in a call).
* This consciously leaks metadata about who was talking and when, but at least
  the call data isn’t leaked.
* The fact the SFU can’t decrypt the streams means that some tricks aren’t
  available:
  * We can’t framedrop when sending to slow clients, as we don’t know where
    the frames are.  (Unless we provide some custom RTP headers or RTCP
    packets outside the SRTP payloads to identify the frame types, but WebRTC
    doesn’t support this afaik?)
  * We also can’t downsample for slow clients, obviously.  We could however
    negotiate multiple send streams from the clients to try to support
    slower clients better.
  * SVC (which is patent encumbered anyway) probably is ruled out, as
    exploiting spatial redundancy between the low & high res send streams is
    probably impossible between the separated streams.
* However, some tricks are still available:
  * We can still forward keyframe requests from clients via RTCP.
* This has been written without reference to PERC (Privacy Enhanced RTP
  Conferencing), so is probably missing insights from there.
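
The rate control step described above (clamp the sender’s full-res bitrate to the slowest interested viewer, pushing particularly slow viewers onto the thumbnail) could be sketched roughly as follows. The function name, the `min_full_res_bps` threshold and the example numbers are all assumptions for illustration; actually signalling the result to the sender via REMB or TMMBR is out of scope here.

```python
from typing import Dict, Optional, Set, Tuple


def clamp_full_res_bitrate(
    receive_bps: Dict[str, int],
    viewers: Set[str],
    min_full_res_bps: int,
) -> Tuple[Optional[int], Set[str]]:
    """Pick a target bitrate for one sender's full-res stream.

    receive_bps: estimated receive bitrate per client, as seen by the SFU.
    viewers: clients currently trying to view this sender's full-res stream.
    min_full_res_bps: hypothetical floor below which a viewer is ignored
        and gets the thumbnail stream instead.

    Returns (target bitrate or None if nobody can take full-res,
             set of viewers demoted to the thumbnail).
    """
    demoted = {v for v in viewers if receive_bps.get(v, 0) < min_full_res_bps}
    eligible = viewers - demoted
    if not eligible:
        return None, demoted
    # Clamp to the slowest viewer that is still fast enough for full-res.
    return min(receive_bps[v] for v in eligible), demoted


estimates = {"alice": 2_000_000, "bob": 800_000, "carol": 150_000}
target, demoted = clamp_full_res_bitrate(
    estimates, {"alice", "bob", "carol"}, min_full_res_bps=300_000
)
print(target, demoted)
```

Note that this only needs the bitrate estimates and the set of interested viewers, neither of which requires the SFU to decrypt any media.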

TL;DR: it works like a normal SFU, except the SRTP keys for the media streams
are exchanged in the megolm room where the conference was initiated, so the
SFU can never decrypt the media - but can still do rate control and forward
the streams around intelligently.
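
As a rough illustration of the key exchange, a client might build the content of a timeline event like the one below and send it (megolm-encrypted by the client SDK as usual) into the conference room. Everything here is hypothetical: the event type, the field names, and the choice of 30 bytes of key material per stream (a 16-byte master key plus 14-byte salt, as used by the AES_CM_128_HMAC_SHA1_80 SRTP profile) are assumptions, not part of this proposal yet.

```python
import base64
import secrets
from typing import Dict, List

# Hypothetical event type; the real name is still to be specified.
EVENT_TYPE = "org.example.msc2359.srtp_keys"

# Assumed key material size: 16-byte master key + 14-byte master salt.
KEY_MATERIAL_BYTES = 30


def make_key_event_content(
    call_id: str, device_id: str, stream_ids: List[str]
) -> Dict[str, object]:
    """Generate fresh SRTP key material for each of our send streams."""
    keys = {
        sid: base64.b64encode(secrets.token_bytes(KEY_MATERIAL_BYTES)).decode("ascii")
        for sid in stream_ids
    }
    return {"call_id": call_id, "device_id": device_id, "keys": keys}


def read_keys(content: Dict[str, object]) -> Dict[str, bytes]:
    """Decode the per-stream key material from a received event's content."""
    return {sid: base64.b64decode(k) for sid, k in content["keys"].items()}


content = make_key_event_content("conf1", "ABCDEFGH", ["audio", "thumbnail", "full_res"])
received = read_keys(content)
print(EVENT_TYPE, sorted(received))
```

The SFU is not in the conference room, so it never sees this event; it only sees the 1:1 call signalling, from which the keys were stripped.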

## Details

Need to specify:

* Matrix timeline events for advertising the SRTP keys for the various streams in the conf room
* Matrix state events for announcing the existence of a conf call in the conf room
* DataChannel API for SFU floor control (or perhaps we could start off with Matrix to keep things a bit simpler?)
* resolution/fps of the pyramid of send streams? ability to let the SFU dynamically negotiate the send stream resolution/fps?
* TMMBR or REMB or whatever folks use for CC these days?
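
To give a feel for what the DataChannel floor control might look like, here is a minimal sketch of the SFU side tracking who wants whose full-res stream. The JSON message shape (`op`, `target`) and the operation names are invented for illustration; nothing here is a specified API.

```python
import json
from collections import defaultdict
from typing import DefaultDict, Set


class FloorControl:
    """Tracks full-res subscriptions requested over the datachannel."""

    def __init__(self) -> None:
        # sender -> set of clients currently viewing that sender's full-res stream
        self.full_res_viewers: DefaultDict[str, Set[str]] = defaultdict(set)

    def handle(self, from_client: str, raw: str) -> None:
        """Apply one control message received from from_client."""
        msg = json.loads(raw)
        if msg["op"] == "subscribe_full_res":
            self.full_res_viewers[msg["target"]].add(from_client)
        elif msg["op"] == "unsubscribe_full_res":
            self.full_res_viewers[msg["target"]].discard(from_client)

    def senders_needing_full_res(self) -> Set[str]:
        """Senders the SFU should be requesting a full-res stream from."""
        return {s for s, viewers in self.full_res_viewers.items() if viewers}


fc = FloorControl()
fc.handle("bob", json.dumps({"op": "subscribe_full_res", "target": "alice"}))
print(fc.senders_needing_full_res())
```

The same message shape could be carried over Matrix events instead if we start off without the datachannel, at the cost of higher latency when switching focus.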