MSC4143: MatrixRTC

MatrixRTC is short for Matrix real time communication. This MSC defines the modules with which the Matrix real time system is built.

MatrixRTC specifies how a real time session is described in a room and how Matrix users can connect to a session.

The MatrixRTC specification is separated into different modules:

The MatrixRTC room state that defines the state of the real time session.
It is the source of truth for:
- Who is part of a session
- Who is connected via what technology/backend
- Metadata per device used by other participants to decide whether the streams from this source are of interest / need to be subscribed.
The MatrixRTC backend.
- Allows for multiple backend implementations to be used.
- It defines how to discover the available backend(s).
- It defines how to connect the participating peers.
- Defines how to connect to a server/other peers, how to update the connection, how to subscribe to different streams...
- A proposal utilising LiveKit is the standard for this as of writing.
- Another planned backend is a full mesh implementation based on MSC3401.
The MatrixRTC application.
- Each application type can have its own spec.
- Voice and video conferencing can be done with an application of type m.call
- The application defines all the details of the RTC experience:
  - How to interpret the metadata of the member events.
  - What streams to connect to.
  - What data, in which format, to send over the RTC channels.
  - What MatrixRTC backends are supported.
End-to-end encryption of media streams

This MSC will focus on the Matrix room state, which is responsible for the high level signalling of an RTC session.

Proposal

Each RTC session is made out of a collection of m.rtc.member room state events. Each m.rtc.member event defines who (the member) is a participant of which session (the session).

The MatrixRTC room state

All data related to a MatrixRTC session (current session, sessions history, join/leave events, ...) only requires one event type.

We use a set of m.rtc.member (one for each participant) state events to represent a session.

Based on its content, an m.rtc.member state event represents either a connected or a disconnected member.

Joining a session

Sending a well-formed m.rtc.member event that describes a connected state for a state key that is not yet used or contains a disconnected m.rtc.member event represents a join action.

The fields are as follows:

member required object - describes the participant of the RTC session:
- id required string - a unique identifier for this session membership as defined above. Recommended to be a UUID. It can be reused if the user leaves and rejoins the session. It should be unique across all devices of the user. TODO: define grammar
- device_id required string - the Matrix device ID of the device that is joining the session. This is used when sending to-device messages.
- user_id required string - the Matrix user ID of the user that is joining the session. This is needed as we cannot rely on the owner of state event as it might have been modified by an admin or similar.
session required object - an object that is used to uniquely identify this session across RTC member events of the Matrix room:
- application required string - a recognised application type. e.g. m.call as linked below
- additional fields as defined by the application type
created_ts - timestamp in milliseconds since UNIX epoch.
- this should not be present the first time that the m.rtc.member event is sent.
- if the m.rtc.member event is sent again, the created_ts should be populated with the origin_server_ts that was given to the previous version of the state event.
focus_active required Focus object - specifies the algorithm that defines how to choose a Focus for this member. See below for details.
foci_preferred required array of Focus objects - specifies the input data for this algorithm contributed by this member. See below for details.

Additional fields may be added depending on the application type.

A full m.rtc.member state event for a joined member looks like this:

// event type: "m.rtc.member"
// state key: see next section for definition
{
  "session": {
    "application": "m.call"
    // further fields for the application
  },
  "member": {
    "id": "xyzABCDEF10123",
    "device_id": "DEVICEID",
    "user_id": "@user:matrix.domain"
  },
  "created_ts": Time | undefined,
  "focus_active": {...FOCUS_A},
  "foci_preferred": [
    {...FOCUS_1},
    {...FOCUS_2}
  ]
}

This tells us that user @user:matrix.domain with member ID xyzABCDEF10123 is part of a session identified by the session object, using an application of type m.call, connected over FOCUS_A. This is sufficient information for another room member to detect the running session and join it.

created_ts is an optional property that caches the time of creation. It is not required for an event that has not yet been updated; in that case the origin_server_ts is used.

[!NOTE] We introduce created_ts() as the notation for created_ts ?? origin_server_ts

Once the event gets updated the origin_server_ts needs to be copied into the created_ts field. An existing created_ts field implies that this is a state event updating the current session and a missing created_ts field implies that it is a join state event. All membership events that belong to one member session can be grouped with the index created_ts()+state_key. This is why the m.rtc.member events deliberately do NOT include something akin to a membership_id.
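The created_ts() notation and the grouping index can be sketched as follows. This is a minimal illustration, not a reference implementation: events are assumed to be plain dictionaries shaped like client-server API events, and the helper names `membership_index` and `updated_content` are hypothetical. The sketch also assumes that on a second or later update the existing created_ts is carried forward, so that the index stays stable.

```python
from typing import Any, Dict, Tuple

Event = Dict[str, Any]


def created_ts(event: Event) -> int:
    """created_ts() notation: content.created_ts ?? origin_server_ts."""
    cached = event.get("content", {}).get("created_ts")
    return cached if cached is not None else event["origin_server_ts"]


def membership_index(event: Event) -> Tuple[int, str]:
    """Index grouping all events of one member session: created_ts() + state_key."""
    return (created_ts(event), event["state_key"])


def updated_content(previous: Event, changes: Dict[str, Any]) -> Dict[str, Any]:
    """Content for an update of `previous` (hypothetical helper).

    Assumption: the creation time of the previous event is carried forward,
    which equals its origin_server_ts when it was the initial join event.
    """
    content = {**previous["content"], **changes}
    content["created_ts"] = created_ts(previous)
    return content
```

With this, a join event and all later updates of the same membership share one index value, while a rejoin after a leave starts a new one.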

Other than the membership sessions, there is no event to represent an RTC session (containing all members). Such an event would include shared information for which it is not trivial to decide who has authority over it. Instead, the session is a computed value based on m.rtc.member events. The list of events with the same session content represents one session. This list allows computing fields like participant count, start time etc.
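As a rough sketch of how a session could be computed from the current room state, the following groups connected m.rtc.member events by their session content. The function names and the shape of the result are illustrative assumptions; comparing session objects via sorted-key JSON is also an assumption, since this MSC does not define a comparison rule.

```python
import json
from collections import defaultdict
from typing import Any, Dict, List

Event = Dict[str, Any]


def session_key(content: Dict[str, Any]) -> str:
    """Stable string for the session object so that equal objects group together."""
    return json.dumps(content["session"], sort_keys=True, separators=(",", ":"))


def compute_sessions(member_events: List[Event]) -> List[Dict[str, Any]]:
    """Group connected m.rtc.member events by their session content."""
    groups: Dict[str, List[Event]] = defaultdict(list)
    for event in member_events:
        content = event.get("content", {})
        if "session" not in content:  # empty or leave content: disconnected member
            continue
        groups[session_key(content)].append(event)

    sessions = []
    for events in groups.values():
        # created_ts() notation: created_ts ?? origin_server_ts
        starts = [e["content"].get("created_ts", e["origin_server_ts"]) for e in events]
        sessions.append({
            "session": events[0]["content"]["session"],
            "participant_count": len(events),
            "start_ts": min(starts),
        })
    return sessions
```

Note that the start time here only covers members that are still connected; a full history would need the approach described under "Session history".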

Depending on the value of application, the event might include additional session parameters.

A Third Room like experience could include the information of an approximate position on the map, so that clients can omit connecting to participants that are not in their area of interest.

State key for `m.rtc.member`

The state key is generated from the member field of the m.rtc.member event.

We want to choose a state key that is compatible with whichever state protection proposal is accepted to ensure that users cannot modify one another's sessions.

For MSC3757 we generate the state key by concatenating the following strings:

the Matrix ID of the user
an _ (underscore)
the member.id field

For example with a member.id of xyzABCDEF10123 for user @user:matrix.domain the state key would be @user:matrix.domain_xyzABCDEF10123.

For a client parsing the state key we would treat anything before the first _ as the Matrix ID of the user and anything after as the member.id field.
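A minimal sketch of building and parsing the state key as described above. The function names are hypothetical. The parser follows the rule as written (split at the first underscore), so it assumes that the user ID itself contains no underscore.

```python
from typing import Tuple


def build_state_key(user_id: str, member_id: str) -> str:
    """Concatenate the Matrix user ID, an underscore and member.id."""
    return f"{user_id}_{member_id}"


def parse_state_key(state_key: str) -> Tuple[str, str]:
    """Split at the first underscore into (user_id, member_id).

    Assumption: the user ID contains no '_'; otherwise this rule as
    written would split inside the user ID.
    """
    user_id, separator, member_id = state_key.partition("_")
    if not separator:
        raise ValueError("state key contains no '_' separator")
    return user_id, member_id
```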

Leaving a session

Sending an empty m.rtc.member event represents a leave action. The state key must be the same as before.

There is an optional leave_reason field that can be used to provide a reason for leaving the session:

leave_reason optional string - one of: lost_connection

An example of leaving a session where the user explicitly disconnects:

// event type: "m.rtc.member"
// state key: "@user:matrix.domain_xyzABCDEF10123"
{
}

The client should use the prev_content field of the room state event to determine the details of the leave event.

For example:

// event type: "m.rtc.member"
// state key: "@user:matrix.domain_xyzABCDEF10123"
{
  "content": {
    "leave_reason": "lost_connection"
  },
  "prev_content": {
    "session": {
      "application": "m.call",
      "call_id": ""
    },
    "member": {
      "id": "xyzABCDEF10123",
      "device_id": "DEVICEID",
      "user_id": "@user:matrix.domain"
    },
    "created_ts": 123456,
    "focus_active": {...FOCUS_A},
    "foci_preferred": [
      {...FOCUS_1},
      {...FOCUS_2}
    ]
  }
}

Reliability requirements for the room state

Room state is a very well suited place to store the data for a MatrixRTC session. It allows:

The client to determine currently ongoing sessions without loading history for every room, or doing any work beyond the sync loop that needs to run anyway.
The client to compute/access data of past sessions without any additional redundant data.
Sessions (start/end/participant count) to be federated without redundant data storage that could result in conflicts or get out of sync. The room state events are part of the DAG, so this is solved as for any other Persistent Data Unit (PDU) in Matrix.

However, a challenging circumstance with using the room state to represent a session is the disconnection behaviour. If the client disconnects from a call because of a network issue, an application crash or a user forcefully quitting the client, the room state cannot be updated anymore. The client is required to leave by sending a new empty state which cannot happen once connection is lost.

If the state is not updated correctly we end up with a room state that is not correctly representing the current RTC session state. Historic and current MatrixRTC session data would be broken.

For an acceptable solution, the following requirements need to be taken into consideration:

Room state is set to empty if the client loses connection. (A heartbeat-like system is desired.)
The best source of truth for a call participation is a working connection to the SFU. It is desired that the disconnect of the member on the SFU gets propagated to the room state.
It should be possible to update the room state without the client being online.
All this should be compatible when Matrix uses cryptographic identities.

MSC4140 proposes a concept to delay the leave events until one of the leave conditions (heartbeat or SFU disconnect) occurs, which fulfils all of these requirements.

A MatrixRTC client has to first send/schedule the following delayed leave event:

// event type: "m.rtc.member"
// state key: "@user:matrix.domain_xyzABCDEF10123"
{
  "leave_reason": "lost_connection"
}

Only after that can the actual state event be sent, so that we guarantee that the state will eventually be empty. The leave_reason is added so clients can be more explicit about why a user disconnected from a call. It lets other participants in a session know whether the user disconnected intentionally or lost connection.
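The required ordering can be sketched as below. The two callables `schedule_delayed_state` and `send_state` are hypothetical stand-ins for whatever the client uses to talk to the homeserver; the actual delayed event API is defined by MSC4140 and is not reproduced here.

```python
from typing import Any, Callable, Dict

Content = Dict[str, Any]


def join_session(
    state_key: str,
    join_content: Content,
    schedule_delayed_state: Callable[[str, str, Content], str],
    send_state: Callable[[str, str, Content], None],
) -> str:
    """Schedule the delayed leave first, then send the join event.

    Both callables are assumptions of this sketch. Returns whatever handle
    the scheduling callable yields, so the caller can manage the delayed event.
    """
    leave_content = {"leave_reason": "lost_connection"}
    delay_handle = schedule_delayed_state("m.rtc.member", state_key, leave_content)
    # Only reached if scheduling succeeded, so the state is guaranteed to
    # become empty eventually even if the client disappears right after joining.
    send_state("m.rtc.member", state_key, join_content)
    return delay_handle
```

If scheduling raises, the join event is never sent, which preserves the guarantee described above.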

Session history

Since there is no single entry for a historic session (because of the ownership discussion), historic sessions need to be computed and most likely cached on the client.

Each state event can either mark a join or leave:

join: prev_state.session != current_state.session && current_state.session != undefined (where an empty m.rtc.member event would imply state.session == undefined)
leave: prev_state.session != current_state.session && current_state.session == undefined

Based on this, one can find the user sessions (the range between a join and a leave event) for specific times. The collection of all overlapping user sessions with the same session contents defines one MatrixRTC history event.
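The two conditions above translate fairly directly into code. In this sketch the function name `classify` is hypothetical, contents are plain dictionaries, and a missing `session` key stands for `undefined`.

```python
from typing import Any, Dict, Optional

Content = Dict[str, Any]


def classify(prev_content: Optional[Content], content: Content) -> Optional[str]:
    """Return "join", "leave", or None for an m.rtc.member state change."""
    prev_session = (prev_content or {}).get("session")
    session = content.get("session")
    if prev_session == session:
        return None  # an update within the same membership, or no change at all
    return "join" if session is not None else "leave"
```

An update that only adds created_ts or changes the foci keeps the same session content and is therefore neither a join nor a leave.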

The RTC backend

Backend infrastructure in this context can be anything that can serve as the backend for a MatrixRTC session. In most cases this is an SFU, but a full mesh implementation could also be an infrastructure. Not all kinds of infrastructure require a way of sourcing a backend resource (e.g. full-mesh). In this MSC we only refer to infrastructure where it is necessary to have access to additional data to participate in the MatrixRTC session.

The backend is referred to as a Focus or as Foci in plural.

Note that these backends are independent of the application (e.g. m.call) being used in the session.

A Focus is represented as a JSON object with one mandatory field:

type required string: The type of the Focus as defined by an RTC backend.

Additional fields will be present depending on type.

Only users with the same type can connect in one session. If a frontend does not support the used type they cannot connect.

Each Focus type will get its own MSC in which the detailed procedure to get from the foci information to working WebRTC connections to the streams of all the participants is explained.

Foci are represented in three places:

focus_active of m.rtc.member state event - specifies the algorithm that defines how to choose a Focus for this member.
foci_preferred of m.rtc.member state event - specifies the input data for this algorithm contributed by this member.
m.rtc_foci of the .well-known/matrix/client - specifies the list of available Foci for the homeserver.

The focus_active algorithm needs to be designed so that all participants converge to the same SFU/Focus.

The following Focus type values are defined:

livekit - a backend using the LiveKit SFU as described in MSC4195.
full_mesh - a backend using a full-mesh approach based on MSC3401.

Choosing the value of `foci_preferred` for the `m.rtc.member` state event

At some point session participants have to decide/propose which Focus they will use.

Depending on the Focus type and application, the method for choosing the contents of the foci_preferred field of the m.rtc.member event can differ.

There are three guidelines which should be obeyed by a client when building the foci_preferred list:

It is always desired to have as few Focus switches as possible.

If there are other participants in the session (i.e. other m.rtc.member events), the client should calculate which Focus it should connect to based on the m.rtc.member events of the existing participants. This should happen reactively on each m.rtc.member state event change. Each MatrixRTC frontend is responsible for gracefully handling Focus switches caused by changing state. This is part of the design of MatrixRTC and a requirement for an eventually consistent distributed system.

The calculated Focus should then be present at the start of the foci_preferred list.

The client should lookup the suggested foci from the homeserver .well-known/matrix/client as defined below.

MatrixRTC is designed around the same culture that makes Matrix possible: A large amount of infrastructure in the form of homeservers is provided by the users.

To achieve a stable and healthy ecosystem backend RTC infrastructure should be thought of as a part of a homeserver.

It is very similar to a TURN server: mostly traffic and little CPU load.

To avoid ending up in a world where every user relies on one central SFU, and instead have the traffic split over multiple SFUs, it is important that we leverage the SFU distribution on the homeserver federation.

These proposals from your own homeserver should come next in the foci_preferred list of the member event.

Clients should not use a hard-coded Focus.

Looking up the preferred Foci from a client is toxic to a federated system. If the majority of users decide to use the same client all of the users will use one Focus. This destroys the passive security mechanism, that each instance is not an interesting attack vector since it is only a fraction of the network. Additionally it will result in poor performance if every user on Matrix would use the same Focus.

However, there are cases where this is acceptable:

Transitioning to MatrixRTC. Here it might be beneficial to have a client that has a fallback Focus so calls also work with homeservers not supporting it.
For testing purposes where a different Focus should be tested but one does not want to touch the .well-known
For custom deployments that benefit from having the Focus configuration on a per client basis instead of per homeserver.

Therefore, if a client does use a hard-coded Focus it should come last in the foci_preferred list.
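The ordering implied by the three guidelines can be sketched as follows. How the calculated Focus is derived from other members is left to the focus_active algorithm and is simply passed in here. The function name is hypothetical, and dropping duplicate entries is an assumption of this sketch rather than something this MSC requires.

```python
from typing import Any, Dict, List, Optional

Focus = Dict[str, Any]


def build_foci_preferred(
    calculated: Optional[Focus],
    homeserver_foci: List[Focus],
    hard_coded: List[Focus],
) -> List[Focus]:
    """Order: calculated Focus first, then homeserver foci, then hard-coded ones."""
    ordered = ([calculated] if calculated else []) + homeserver_foci + hard_coded
    result: List[Focus] = []
    for focus in ordered:
        if focus not in result:  # keep the first (highest priority) occurrence
            result.append(focus)
    return result
```

For the first participant of a session there is no calculated Focus, so the list starts with the foci from the homeserver's .well-known/matrix/client.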

Discovery of Foci using `.well-known/matrix/client`

We use a m.rtc_foci key in the homeserver .well-known/matrix/client that can be used to expose a sorted (by priority) list of Focus description objects.

For example in generic form:

{
  "m.rtc_foci": [
    {
      "type": "some-focus-type",
      "additional-type-specific-field": "https://my_focus.domain",
      "another-additional-type-specific-field": ["with", "Array", "type"]
    }
  ]
}

Or a concrete example for a livekit Focus:

{
  "m.rtc_foci": [
    {
      "type":"livekit",
      "livekit_service_url":"https://livekit-jwt.call.element.io"
    }
  ]
}
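A client could read the list from an already fetched and parsed .well-known/matrix/client document roughly like this. Fetching the document is out of scope for the sketch. The function name is hypothetical, and both the fallback to the unstable key and the skipping of malformed entries are assumptions about reasonable client behaviour.

```python
import json
from typing import Any, Dict, List

STABLE_KEY = "m.rtc_foci"
UNSTABLE_KEY = "org.matrix.msc4158.rtc_foci"  # unstable prefix named in this MSC


def foci_from_well_known(well_known: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Return the Focus objects in priority order, skipping malformed entries."""
    raw = well_known.get(STABLE_KEY, well_known.get(UNSTABLE_KEY, []))
    if not isinstance(raw, list):
        return []
    # Every Focus must be an object with a string "type" field.
    return [f for f in raw if isinstance(f, dict) and isinstance(f.get("type"), str)]
```

The list order is preserved because the homeserver sorts it by priority.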

The RTC application types

Each application type might have its own specification in how the different streams are interpreted and even what Focus type to use. This makes this proposal extremely flexible. A Jitsi conference could be added by introducing a new application and a new Focus type and would be MatrixRTC compatible. It would not be compatible with applications that do not use the Jitsi Focus but clients would know that there is an ongoing session of unknown type and unknown Focus and could display/represent this in the user interface.

To make it easy for clients to support different application types, the recommended approach is to provide a Matrix widget for each application type. This way the client developers can use the widget as the first implementation if they want to support this RTC application type.

Each application should get its own MSC which explains all the additional fields and defines how the communication with the possible foci works:

m.call - voice and video conferencing described by MSC4196.

Interoperability between applications

There is a use-case in which an m.call app might want to participate in a session of type (application) custom-call-with-more-features. A native mobile Matrix client might support m.call and be at hand to join the feature rich application/session.

There could be fallback mechanisms, but the most flexible approach is to treat it per application type. If it makes sense for an application type to fully conform to m.call, a client that can connect to an m.call RTC session (application) could claim that it is also compatible with custom-call-with-more-features. It is then the job of the custom-call-with-more-features session type (application) to define some kind of feature list so that it can tell if users are joining with an m.call client or a dedicated custom-call-with-more-features client.

End-to-end encryption of media streams

We define how the key material is shared between the participants of the call to facilitate end-to-end encryption of the media streams.

The backend (e.g. LiveKit) MSC defines how the key material is actually used.

Shared password

A shared password may be used to encrypt the media streams sent via the RTC backend that has been distributed ahead of time to the participants.

For example, it could be in the query parameter of a private URL attached to a calendar invitation.

Per-participant sender key

A participant can share its chosen key with the other participants by sending them Matrix to-device messages.

The key is sent as an event of type m.rtc.encryption_keys as an encrypted to-device message.

The device ID that is being sent to is the member.device_id from the m.rtc.member events.

The event contains the following fields:

session required object: The contents of the session from the m.rtc.member event.
member required object: The contents of the member from the corresponding m.rtc.member event.
keys required array of objects: The sender keys to be distributed to the participant:
- key required string: The base64 encoded key material.
- index required int: The index of the key to distinguish it from other keys. This must be between 0 and 255 inclusive. In some implementations of MatrixRTC this may correspond to the keyID field of the WebRTC SFrame header.
- invalidates_key_index optional int: The index of the key that is invalidated by this key. If this is set, the application should invalidate the key identified by invalidates_key_index once it receives a frame with the new index. This is to protect against an exfiltrated key being used to forge frames.
- invalidates_after_ms optional int: The number of milliseconds after the key identified by invalidates_key_index is invalidated by this key even if no frames are received. Again, this is to protect against an exfiltrated key being used to forge frames.

Depending on the RTC application, additional fields may be added to this event.

An example to-device event:

// event type: "m.rtc.encryption_keys"
{
    "session": {
      "application": "m.call",
      "call_id": "",
      "scope": "m.room"
    },
    "member": {
      "id": "xyzABCDEF10123",
      "device_id": "DEVICEID",
      "user_id": "@user:matrix.domain"
    },
    "room_id": "!roomid:matrix.domain",
    "keys": [
        {
            "index": 10,
            "key": "base64encodedkey",
            "invalidates_key_index": 9,
            "invalidates_after_ms": 5000
        }
    ]
}

On receipt of the m.rtc.encryption_keys event the application can associate the received key with the RTC session by matching the session and member contents with the corresponding m.rtc.member event.

When the application joins the session it should send the key to all the existing participants.

To ensure forward secrecy and post compromise security, the key material should be rotated (i.e. a new key generated) when a participant joins or leaves the session.

Key rotation is done as follows:

the sending application generates the new key material for the participant.
the sending application sends the new key material to all the participants with a new index value and invalidates_key_index set to the current index.
the receiving application stores the new key material for the specified index.
the sending application continues to use the old/current key to encrypt media.
the sending application waits for a period of time. The default should be 3 seconds. It is possible to override this on a per application basis in case an application has specific requirements on security or wants to minimize missed stream data. Negotiation approaches can also be defined where the RTC application uses data channels to communicate whether everyone has received the next key.
the sending application starts to use the new key to encrypt media.
the receiving application invalidates the existing key with the invalidates_key_index value.
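The receiving side of the rotation steps above might be sketched as follows. The class and method names are hypothetical, time is passed in explicitly to keep the sketch deterministic, and measuring invalidates_after_ms from the moment the key is stored is an assumption.

```python
from typing import Dict, Optional, Tuple


class ReceiverKeyRing:
    """Keys received from one sending participant, indexed 0..255."""

    def __init__(self) -> None:
        self._keys: Dict[int, bytes] = {}
        # new index -> (index to invalidate, deadline in ms or None)
        self._pending: Dict[int, Tuple[int, Optional[int]]] = {}

    def store(self, index: int, key: bytes, now_ms: int,
              invalidates_key_index: Optional[int] = None,
              invalidates_after_ms: Optional[int] = None) -> None:
        """Store new key material and remember which old key it invalidates."""
        if not 0 <= index <= 255:
            raise ValueError("index must be between 0 and 255 inclusive")
        self._keys[index] = key
        if invalidates_key_index is not None and invalidates_key_index != index:
            deadline = None if invalidates_after_ms is None else now_ms + invalidates_after_ms
            self._pending[index] = (invalidates_key_index, deadline)

    def expire(self, now_ms: int) -> None:
        """Drop old keys whose invalidates_after_ms deadline has passed."""
        for new_index, (old_index, deadline) in list(self._pending.items()):
            if deadline is not None and now_ms >= deadline:
                self._keys.pop(old_index, None)
                del self._pending[new_index]

    def key_for_frame(self, index: int, now_ms: int) -> Optional[bytes]:
        """Key for a frame; the first frame with a new index drops the old key."""
        self.expire(now_ms)
        if index in self._pending and index in self._keys:
            old_index, _ = self._pending.pop(index)
            self._keys.pop(old_index, None)
        return self._keys.get(index)
```

While the sender is still in its waiting period, frames with the old index keep decrypting; the old key disappears either on the first frame with the new index or when the deadline passes.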

Discovery/negotiation of application types

Problem: If a user wants to make a call to a user or room, then which call/application options should the client present to the user?

This should also take account of non-MatrixRTC calling: legacy 1:1 VoIP, room state widget for Jitsi.

TODO: write up notes.

Potential issues

Alternatives

One state event per user

MSC3401 proposed to have one state event per user with that state event containing an array of memberships.

This introduces two problems:

potential inconsistency where one user device overwrites the state of another device during a concurrent update.
when handling client disconnects the MSC3757 proposal could not be used as you would not know what the correct state is at the time of the disconnect.

One state event per device

This would mean not using member.id in the state key anymore. Race conditions can be solved by the client which would need to manage multiple sessions at once.

A separate system not associated with Matrix accounts

This MSC proposes to combine the MatrixRTC backend infrastructure with the homeserver. Other sources where the backend could be sourced from are:

A separate system not associated with Matrix accounts. (you would need a Matrix account + a "LiveKit provider" account for example)
The client could bring its own backend link.
A centralized solution.

The centralized solution would not fit Matrix. A separate system would match the distributed nature of Matrix but would not match the user experience goals for MatrixRTC calls.

Having the client define the SFU that is used is the current solution. The issue with this is that clients are in general less distributed than homeservers. There is only a limited set of clients that a large percentage of users use. Using this as the source for the infrastructure would result in just a handful of very large infrastructure hosts. This is harder to scale and it is harder to justify who is covering the costs. (For Matrix homeservers, this is an already solved problem where there are individuals, communities and institutions that have their own individual solutions and answers for how and why they provide the infrastructure.)

`m.rtc.encryption_keys` room event

Earlier iterations of this MSC used an encrypted m.rtc.encryption_keys room event to distribute the per-participant sender keys.

Whilst reducing traffic by only needing to send one event per participant, this approach does not allow for perfect forward secrecy as the keys are stored in the room history.

The encrypted content of the m.rtc.encryption_keys event was as follows:

{
    "session": {
      "application": "m.call",
      "call_id": ""
    },
    "member": {
      "id": "xyzABCDEF10123",
      "device_id": "DEVICEID",
      "user_id": "@user:matrix.domain"
    },
    "keys": [
        {
            "index": 0,
            "key": "base64encodedkey"
        }
    ]
}

Security considerations

Discoverability of infrastructure

The .well-known/matrix/client is publicly readable, hence everyone can read and know about the infrastructure, which could lead to resource "stealing". However, each infrastructure has its own authentication mechanism defined in the infrastructure specification. Those mechanisms can, for instance, use a service to interact with the homeserver and based on that decide whether to allow users to use the infrastructure.

This is defined in the respective infrastructure MSC.

Forward secrecy for end-to-end encryption of media streams

The considerations to ensure forward secrecy are described in the End-to-end encryption of media streams section above.

End-to-end media encryption key rotation lag

The proposed key rotation semantics does mean that a participant could continue to decrypt media that was sent in the three seconds after leaving the session.

Unstable prefix

Use org.matrix.msc3401.call.member as the state event type in place of m.rtc.member.

For discovery via .well-known/matrix/client the prefix org.matrix.msc4158.rtc_foci is used in place of m.rtc_foci.

Use io.element.call.encryption_keys in place of the m.rtc.encryption_keys room event and to-device event types.

Dependencies

This proposal depends on MSC3757: Restricting who can overwrite a state event to provide access control for the decentralised management of call membership state. However, an alternative such as MSC3779: "Owned" State Events could be used instead with some adaptations.

This proposal also depends on MSC4140: Cancellable delayed events to provide a mechanism for clients to ensure that they can update the room state even if they lose connection.
