matrix-doc/proposals/4016-streaming-e2ee-file-tr...

# MSC4016: Streaming and resumable E2EE file transfer with random access

## Problem

* File transfers currently take twice as long as they could, as they must first be uploaded in their entirety to the
  sender’s server before being downloaded via the receiver’s server.
* As a result, relative to a dedicated file-copying system (e.g. scp) they feel sluggish. For instance, you can’t
  incrementally view a progressive JPEG or voice or video file as it’s being uploaded for “zero latency” file
  transfers.
* You can’t skip within them without downloading the whole thing (if they’re streamable content, such as an .opus file)
* For instance, you can’t do realtime broadcast of voice messages via Matrix, or skip within them (other than splitting
  them into a series of separate file transfers).
* You also can't resume uploads if they're interrupted.
* Another example is sharing document snapshots for real-time collaboration. If a user uploads 100MB of glTF in Third
  Room to edit a scene, you want all participants to be able to receive the data and stream-decode it with minimal
  latency.

Closes [https://github.com/matrix-org/matrix-spec/issues/432](https://github.com/matrix-org/matrix-spec/issues/432)

N.B. this MSC is *not* needed to do a streaming decryption or encryption of E2EE files (as opposed to streaming
transfer).  The current APIs let you stream a download of AES-CTR data and incrementally decrypt it without loading the
whole thing into RAM, calculating the hash as you go, and then either surfacing or deleting the decrypted result at the
end if the hash matches.

Relatedly, v2 MXC attachments can't be stream-transferred, even if combined with [MSC2246]
(https://github.com/matrix-org/matrix-spec-proposals/pull/2246), given you won't be able to send the hash in the event
contents until you've uploaded the media.

## Solution sketch

* Upload content in a single file made up of contiguous blocks of AES-GCM content.
    * Typically constant block size (e.g. 32KB)
    * Or variable block size (to allow time-based blocksize for low-latency seeking in streamable content) - e.g. one
      block per opus frame.  Otherwise a 32KB block ends up being 8s of typical opus latency.
        * This would then require a registration sequence to identify the starts of blocks boundaries when seeking
          randomly (potentially escaping the bitstream to avoid registration code collisions).
* Unlike today’s AES-CTR attachments, AES-GCM makes the content self-authenticating, in that it includes an
  authentication tag (AEAD) to hash the contents and protect against substitution attacks (i.e. where an attacker flips
  some bits in the encrypted payload to strategically corrupt the plaintext, and nobody notices as the content isn’t
  hashed).
    * (The only reason Matrix currently uses AES-CTR is that native AES-GCM primitives weren’t widespread enough on
      Android back in 2016)
* To prevent against reordering attacks, each AES-GCM block has to include an encrypted block header which includes a
  sequence number, so we can be sure that when we request block N, we’re actually getting block N back - or
  equivalent.
    * XXX: is there still a vulnerability here? Other approaches use Merkle trees to hash the AEADs rather than simple
      sequence numbers, but why?
* We use streaming HTTP upload (https://developer.chrome.com/articles/fetch-streaming-requests/) and/or
  [tus](https://tus.io/protocols/resumable-upload) resumable upload headers to incrementally send the file. This also
  gives us resumable uploads.
* We then use normal [HTTP Range](https://datatracker.ietf.org/doc/html/rfc2616#section-14.35.1) headers to seek while
  downloading.

## Advantages

* Backwards compatible with current implementations at the HTTP layer
* Fully backwards compatible for unencrypted transfers
* Relatively minor changes needed from AES-CTR to sequence-of-AES-GCM-blocks for implementations like
  [https://github.com/matrix-org/matrix-encrypt-attachment](https://github.com/matrix-org/matrix-encrypt-attachment)
* We automatically maintain a serverside E2EE store of the file as normal, while also getting 1:many streaming
  semantics
* Provides streaming transfer for any file type - not just media formats
* Minimises memory usage in Matrix clients for large file transfers. Currently all(?) client implementations store the
  whole file in RAM in order to check hashes and then decrypt, whereas this would naturally lend itself to processing
  files incrementally in blocks.
* Leverages AES-GCM’s existing primitives and hashing rather than inventing our own hashing strategy
* We've already implemented this once before (pre-Matrix) in our 'glow' codebase, and it worked excellently.
  pre-E2EE and pre-Matrix in our ‘glow’ codebase.
* Random access could enable torrent-like semantics in future (i.e. servers doing parallel downloads of different chunks
  from different servers, with appropriate coordination)
* tus looks to be under consideration by the IETF HTTP working group, so we're hopefully picking the right protocol for
  resumable uploads.

## Limitations

* Enterprisey features like content scanning and CDGs require visibility on the whole file, so would eliminate the
  advantages of streaming by having to buffering it up in order to scan it.  (Clientside scanners would benefit from
  file transfer latency halving but wouldn't be able to show mid-transfer files)
* When applied to unencrypted files, server-side content scanning (for trust & safety etc) would be unable to scan until
  it’s too late.
* For images & video, senders will still have to read (and decompress) enough of the file into RAM in order to thumbnail
  it or calculate a blurhash, so the benefits of streaming in terms of RAM use on the sender are reduced.  One could
  restrict thumbnailing to the first 500MB of the transfer (or however much available RAM the client has) though, and
  still stream the file itself, which would be hopefully be enough to thumbnail the first frame of a video, or most
  images, while still being able to transfer arbitrary length files.
* Cancelled file uploads will still leak a partial file transfer to receivers who start to stream, which could be
  awkward if the sender sent something sensitive, and then can’t tell who downloaded what before they hit the cancel
  button
* Small bandwidth overhead for the additional AEADs and block headers - ~32 bytes per block.
* Out of the box it wouldn't be able to adapt streaming to network conditions (no HLS or DASH style support for multiple
  bitstreams)
* Might not play nice with CDNs? (I haven't checked if they pass through Range headers properly)
* Recorded E2EE SFU streams (from a [MSC3898](https://github.com/matrix-org/matrix-spec-proposals/pull/3898) SFU or
  LiveKit SFU) could be made available as live-streamed file transfers through this MSC. However, these streams would
  also have their own S-Frame headers, whose keys would need to be added to the `EncryptedFile` block in addition to
  the AES-GCM layer.

## Detailed proposal

The file is uploaded asynchronously using [MSC2246](https://github.com/matrix-org/matrix-spec-proposals/pull/2246).

The proposed v3 `EncryptedFile` block looks like:

```json5
"file": {
    "v": "org.matrix.msc4016.v3",
    "key": {
        "alg": "A256GCM",
        "ext": true,
        "k": "cngOuL8OH0W7lxseExjxUyBOavJlomA7N0n1a3RxSUA",
        "key_ops": [
            "encrypt",
            "decrypt"
        ],
        "kty": "oct"
    },
    "iv": "HVTXIOuVEax4E+TB", // 96-bit base-64 encoded initialisation vector
    "url": "mxc://example.com/raAZzpGSeMjpAYfVdTrQILBI",
},
```

N.B. there is no longer a `hashes` key, as AES-GCM includes its own hashing to enforce the integrity of the file
transfer. Therefore we can authenticate the transfer by the fact we can decrypt it using its key & IV (unless an
attacker who controls the same key & IV has substituted it for another file - see Security Considerations below)

We split the file stream into blocks of AES-256-GCM, with the following simple framing:

 * File header with a magic number of: 0x4D, 0x58, 0x43, 0x03 ("MXC" 0x03) - just so `file` can recognise it.
 * 1..N blocks, each with a header of:
    * a 32-bit field: 0xFFFFFFFF (a registration code to let a parser handle random access within the file
    * a 32-bit field: block sequence number (starting at zero, used to calculate the IV of the block, and to aid random
      access)
    * a 32-bit field: the length in bytes of the encrypted data in this block.
    * a 32-bit field: a CRC32 checksum of the block, including headers. This is used when randomly seeking as a
      consistency check to confirm that the registration code really did indicate the beginning of a valid frame of
      data.  It is not used for cryptographic integrity.
    * the actual AES-GCM bitstream for that block.
        * the plaintext block size can be variable; 32KB is a good default for most purposes.
        * Audio streams may want to use a smaller block size (e.g. 1KB blocks for a CBR 32kbps Opus stream will give
          250ms of streaming latency).  Audio streams should be CBR to avoid leaking audio waveform metadata via block
          size.
        * The block is encrypted using an IV formed by concatenating the block sequence number of the `file` block with
          the IV from the `file` block (forming a 128-bit IV, which will be hashed down to 96-bit again within
          AES-GCM).  This avoids IV reuse (at least until it wraps after 2^32-1 blocks, which at 32KB per block is
          137TB (18 hours of 8k raw video), or at 1KB per block is 4TB (34 years of 32kbps audio)).
            * Implementations MUST terminate a stream if the seqnum is exhausted, to prevent IV reuse.
            * Receivers MUST terminate a stream if the seqnum does not sequentially increase (to prevent the server from
              shuffling the blocks)
            * XXX: Alternatively, we could use a 64-bit seqnum, spending 8 bytes of header on seqnums feels like a waste
              of bandwidth just to support massive transfers. And we'd have to manually hash it with the 96-bit IV
              rather than use the GCM implementation.
        * The block is encrypted including the 32-bit block sequence number as Additional Authenticated Data, thus
          stopping encrypted blocks from impersonating each other.

Or graphically, each frame is:

```
protocol "Registration Code (0xFFFFFFF):32,Block sequence number:32,Encrypted block length:32,CRC32:32,AES-GCM encrypted Data:64"

 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                 Registration Code (0xFFFFFFF)                 |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                     Block sequence number                     |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                     Encrypted block length                    |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                             CRC32                             |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                                                               |
+                     AES-GCM encrypted Data                    +
|                                                               |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

```

The actual file upload can then be streamed in the request body in the PUT (requires HTTP/2 in browsers). Similarly, the
download can be streamed in the response body.  The download should stream as rapidly as possible from the media
server, letting the receiver view it incrementally as the upload happens, providing "zero-latency" - while also storing
the stream to disk.

For resumable uploads (or to upload in blocks for HTTP clients which don't support streaming request bodies), we use
[tus](https://tus.io/protocols/resumable-upload) 1.0.0.

For resumable downloads, we then use normal
[HTTP Range](https://datatracker.ietf.org/doc/html/rfc2616#section-14.35.1) headers to seek and resume while downloading.

TODO: We need a way to mark a transfer as complete or cancelled (via a relation?).  If cancelled, the sender should
delete the partial upload (but the partial contents will have already leaked to the other side, of course).

TODO: While we're at it, let's actually let users DELETE their file transfers, at last.

N.B. Clients which implement displaying blurhashes should progressively load the thumbnail over the top of the blurhash,
to make sure the detailed thumbnail streams in and is viewed as rapidly as possible.

## Alternatives

* We could use an existing streaming encrypted framing format of some kind rather (SRTP perhaps, which would give us
  timestamps for easier random access for audio/video streams) - but this feels a bit strange for plain old file
  streams.
* Alternatively, we could descope random access entirely, given it only makes sense for AV streams, and requires
  timestamps to work nicely - and simply being able to stream encryption/decryption is a win in its own right. For
  instance, glow doesn't let you seek randomly within files which are mid transfer; only tail.
* Split files into a series of separate m.file uploads which the client then has to glue back together (as the
  [voice broadcast feature](https://github.com/vector-im/element-meta/discussions/632) does in Element today).
    * Pros:
        * Works automatically with antivirus & CDGs
        * Could be made to map onto HLS or DASH? (by generating an .m3u8 which contains a bunch of MXC urls? This could
          also potentially solve the glitching problems we’ve had, by reusing existing HLS players augmented with our
          E2EE support)
    * Cons:
        * Is always going to be high latency (e.g. Element currently splits into ~30s chunks) given rate limits on
          sending file events
        * Can be a pain to glue media uploads back together without glitching
* Transfer files via streaming P2P file transfer via WebRTC data channels
  (https://github.com/matrix-org/matrix-spec/issues/189)
    * Pros:
        * Easy to implement with Matrix’s existing WebRTC signalling
        * Could use MSC3898-inspired media control to seek in the stream
    * Cons:
        * You don’t get a serverside copy of the data
        * Hard for clients to implement relative to a simple HTTP download
        * You expose client IPs to each other if going P2P rather than via TURN
* Do streaming voice/video messages/broadcast via WebRTC media channels instead
    * Pros:
        * Lowest latency
        * Could use media control to seek
        * Supports multiple senders
        * Works with CDGs and other enterprisey scanners which know how to scan VOIP payloads
        * Could automatically support variable streams via SFU to adapt to network conditions
        * If the SFU does E2EE and archiving, you get that for free.
    * Cons:
        * Complex; you can’t just download the file via HTTP
        * Requires client to have a WebRTC stack
        * A suitable SFU still doesn’t exist yet
* Transfer files out of band using a protocol which already provides streaming transfers (e.g. IPFS?)
* Could use ChaCha20-Poly1305 rather than AES-GCM, but no native webcrypto impl yet: https://github.com/w3c/webcrypto/issues/223
  * See also https://soatok.blog/2020/05/13/why-aes-gcm-sucks/ and https://andrea.corbellini.name/2023/03/09/authenticated-encryption/
* We could use YouTube's resumable upload API via `Content-Range` headers from
  https://developers.google.com/youtube/v3/guides/using_resumable_upload_protocol, but having implemented both it and
  tus, tus feels inordinately simpler and less fiddly.  YouTube is likely to be well supported by proxies etc, but if
  tus is ordained by the HTTP IETF WG, then it should be well supported too.

## Security considerations

* AES-GCM is not key-committing, so removing hashes on the event means:
  * the key committing attacks are all about an adversary which constructs a ciphertext C with multiple ((IV1, K1), (IV2, K2), ...) so that C decrypts to P1, P2, ... at the same time
  * given that AES GCM is specifically not key committing, we introduce this attack.
  * (thanks to @dkasak for pointing this out)
* Variable size blocks could leak metadata for VBR audio.  Mitigation is to use CBR if you care about leaking voice
  traffic patterns (constant size blocks isn’t necessarily enough, as you’d still leak the traffic patterns)
* Is encrypting a sequence number in block header (with authenticated encryption) sufficient to mitigate reordering
  attacks?
  * When doing random access, the reader has to trust the server to serve the right blocks after a discontinuity
* The resulting lack of atomicity on file transfer means that accidentally uploaded files may leak partial contents to
  other users, even if they're cancelled.
* Clients may well wish to scan untrusted inbound file transfers for malware etc, which means buffering the inbound
  transfer and scanning it before presenting it to the user.
* Removing the `hashes` entry on the EncryptedFile description means that an attacker who controls the key & IV of the
  original file transfer could strategically substitute the file contents.  This could be desirable for CDGs wishing to
  switch a file for a sanitised version without breaking the Matrix event hashes.  For other scenarios it could be
  undesirable - for instance, a malicious server could serve different file contents to other users or servers to evade
  moderation.  An alternative might be for the sender to keep sending new hashes in related matrix events as the
  stream uploads (but it's unclear if this is worth it, relative to MSC3888)

## Conclusion

For the voice broadcast use case, it's a bit unclear whether this is actually an improvement over splitting files into
multiple file uploads (or [MSC3888](https://github.com/matrix-org/matrix-spec-proposals/blob/weeman1337/voice-broadcast/proposals/3888-voice-broadcast.md)).
It's also unfortunate that the benefits of the MSC are reduced with content scanners and CDGs.  It’s also a bit unclear
whether voice/video broadcast would be better served via MSC3888 style behaviour.

However, for halving the transfer time for large videos and files (and the magic "zero latency" of being able to see
file transfers instantly start to download as they upload) it still feels like a worthwhile MSC.  Switching to GCM is
desirable too in terms of providing authenticated encryption and avoiding having to calculate out-of-band hashes for
file transfer.  Finally, implementing this MSC will force implementations to stream their file encryption/decryption
and avoid the temptation to load the whole file into RAM (which doesn't scale, especially in constrained environments
such as iOS Share Extensions).

## Dependencies

This MSC depends on [MSC2246](https://github.com/matrix-org/matrix-spec-proposals/pull/2246), which has now landed in
the spec. Extends [MSC3469](https://github.com/matrix-org/matrix-spec-proposals/pull/3469).

## Unstable prefixes

| Unstable prefix       | Stable prefix       |
| --------------------- | ------------------- |
| org.matrix.msc4016.v3 | v3                  |