MSC4083: Delta-compressed E2EE file transfers

Problem

When collaborating on a large file of some kind, it's common to store that file in Matrix, and then need a way to express incremental changes to it. For instance, in Third Room, you might store a large glTF scene graph as a GLB file, and then want to express a small change to it (e.g. using the editor to transform part of the scene graph). Or you might want to store a change to a markdown or HTML file.

Currently, your only option is to save a whole new copy of the file - or invent your own delta-compression scheme at the application layer. Instead, we could make Matrix itself aware of delta-compression, letting the content repository help users efficiently collaborate around updates to binary files, regardless of what the file is.

Solution

When uploading a file, specify that it's a delta against a previous piece of content, using a given algorithm.

  • delta_base is the mxc URL of the content the delta applies to
  • delta_format is the file format of the binary diff
    • This MSC defines m.vcdiff.v1.gzip to describe gzipped, RFC 3284-compatible binary VCDIFF payloads. VCDIFF was picked for computational efficiency rather than patch size (bsdiff + bzip might produce smaller patches at a higher computational cost). Other MSCs are welcome to propose different diff formats.

Clients should upload a new snapshot of a piece of content if the combined size of the deltas relative to the last snapshot is larger than 50% of the size of the original piece of content.

For instance:

POST /_matrix/media/v3/upload?delta_base=mxc://matrix.org/b4s3v3rs10n&delta_format=m.vcdiff.v1.gzip

returning:

{
  "content_uri": "mxc://matrix.org/n3wv3rs10n"
}

(or with the same parameters for MSC2246-style POST /_matrix/media/v3/create).
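As a rough illustration of how a client might combine the 50% snapshot rule with the upload parameters above, the following Python sketch chooses between a full snapshot and a delta upload and builds the corresponding upload path. The function name `plan_upload` and its arguments are hypothetical. The sketch also assumes that the "original piece of content" means the last full snapshot and that the pending delta counts towards the total; only `delta_base`, `delta_format` and `m.vcdiff.v1.gzip` come from this proposal.

```python
from urllib.parse import urlencode

UPLOAD_PATH = "/_matrix/media/v3/upload"
DELTA_FORMAT = "m.vcdiff.v1.gzip"


def plan_upload(snapshot_size, prior_delta_sizes, new_delta_size, base_mxc):
    """Return the upload path for the next version of a file.

    Hypothetical helper: falls back to a plain snapshot upload once the
    deltas since the last snapshot total more than 50% of its size.
    """
    total = sum(prior_delta_sizes) + new_delta_size
    if total > 0.5 * snapshot_size:
        return UPLOAD_PATH  # full snapshot: no delta parameters
    query = urlencode({"delta_base": base_mxc, "delta_format": DELTA_FORMAT})
    return f"{UPLOAD_PATH}?{query}"
```

Note that `urlencode` percent-encodes the mxc URL in the query string, whereas the example request above shows it unencoded for readability.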

The server tracks the graph of which deltas apply to which files, so that it can hand only the relevant deltas to clients when they download them.

For instance, when downloading a delta-compressed piece of content, a client might ask to pull in any delta dependencies it doesn't already have stored locally, relative to the last version that it has a full copy of:

GET /_matrix/media/v3/download/matrix.org/n3wv3rs10n?delta_base=mxc://matrix.org/b4s3v3rs10n

This would return an ordered multipart download of the deltas. The client applies them in order (once decrypted, if needed) to the base version to get a copy of the new version.
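One way a server might work out which deltas to return is to walk the delta graph backwards from the requested content until it reaches the version the client already has. The sketch below assumes the graph is held as a simple in-memory mapping from each delta's mxc URL to its `delta_base`; the names `resolve_delta_chain` and `delta_bases` are hypothetical and not defined by this MSC.

```python
def resolve_delta_chain(delta_bases, target, client_base):
    """Return the deltas to apply, oldest first, to turn client_base into target.

    delta_bases maps each delta's mxc URL to the mxc URL it applies to;
    full snapshots do not appear as keys. Raises LookupError if target
    does not descend from client_base.
    """
    chain = []
    seen = set()
    current = target
    while current != client_base:
        if current in seen or current not in delta_bases:
            raise LookupError(f"{target} is not reachable from {client_base}")
        seen.add(current)
        chain.append(current)
        current = delta_bases[current]
    chain.reverse()  # walked newest-to-oldest; deltas are applied oldest-first
    return chain
```

The returned list gives the order in which the parts of the multipart response would need to appear. A server hitting the `LookupError` case would presumably have to fall back to some other behaviour, which this proposal does not define.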

Alternatives

Track deltas on events rather than media repository

Alternatively, we could store the delta info on the m.file event itself as a mixin. This would allow us to shift the task of tracking deltas purely to clients, and protect the delta info within the E2EE payload. However, this would then force the client to do many more roundtrips to spider the events (if needed) and files (if needed) one by one in order to calculate diffs, giving latency that is O(N) in the number of diffs rather than O(1) for the above API. Given that the traffic pattern of these requests would reveal the delta graph to the server anyway, it's not clear that it provides a sufficient advantage. This would look like this:

{
  "content": {
    "filename": "something-important.doc",
    "info": {
      "mimetype": "application/msword",
      "size": 46144
    },
    "msgtype": "m.file",
    "url": "mxc://example.org/n3wv3rs10n",
    "delta_base": "$1235135aksjgdkg",
    "delta_format": "m.vcdiff.v1.gzip"
  }
}
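To illustrate why the event-based approach costs a roundtrip per delta, the sketch below walks `delta_base` links from the newest event back to one the client already holds. `fetch_event` is a hypothetical callback standing in for whatever request the client would use to fetch a single event's content; each call represents one sequential roundtrip.

```python
def spider_delta_events(fetch_event, newest_event_id, known_event_ids):
    """Collect event contents from the newest event back to a known base.

    Returns (contents oldest-first, number of fetches performed).
    Assumes a well-formed chain with no cycles.
    """
    contents = []
    fetches = 0
    event_id = newest_event_id
    while event_id is not None and event_id not in known_event_ids:
        content = fetch_event(event_id)  # one roundtrip per delta
        fetches += 1
        contents.append(content)
        event_id = content.get("delta_base")  # absent on a full snapshot
    contents.reverse()
    return contents, fetches
```

Each fetch depends on the `delta_base` returned by the previous one, so the requests cannot be issued in parallel; the matching file downloads would add further requests on top.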

We could go even further down this path by defining an arbitrary CRDT for tracking these deltas, a bit like the [Saguaro CRDT-over-Matrix](https://github.com/matrix-org/collaborative-documents/blob/main/docs/saguaro.md) proposal, with files decorating each event - effectively modelling the problem as a collaborative document problem (with binary diffs attached) rather than a binary file diffing problem.

Other alternatives

We could use HTTP PATCH rather than POST when sending diffs. This feels needlessly exotic, imo.

Rather than having a delta_format field, we could use the MIME type of the upload to indicate that it's a patch to a given underlying MIME type. However, Matrix doesn't currently have to parse MIME types anywhere, so it's more matrixy to destructure this in JSON.

For unencrypted files, the server could apply the diffs serverside as a convenience to clients who don't know how to apply the diffs themselves (or who don't have CPU to apply the diffs, or want to benefit from the server caching diff results). This could be proposed as a separate MSC.

Security considerations

This exposes to the server the metadata of which file is a delta of which other file.

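A malicious client could attempt a DoS by uploading too many deltas.

A malicious client could attempt a DoS by using async uploads (MSC2246) to create a cycle in the delta graph.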
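As one possible mitigation for both concerns, a server could validate each new delta against its graph before accepting it, rejecting links that would close a cycle or exceed a chain-length limit. The sketch below is an assumption about how this might be done, not something this MSC specifies; `check_new_delta` and `MAX_CHAIN_LENGTH` are hypothetical names and the limit value is arbitrary.

```python
MAX_CHAIN_LENGTH = 64  # arbitrary illustrative limit, not from the MSC


def check_new_delta(delta_bases, new_mxc, base_mxc, max_chain=MAX_CHAIN_LENGTH):
    """Record new_mxc -> base_mxc, or raise ValueError if doing so would
    create a cycle or a chain longer than max_chain deltas.

    delta_bases maps each delta's mxc URL to its delta_base.
    """
    length = 1  # counts the new delta itself
    seen = {new_mxc}
    current = base_mxc
    while current in delta_bases or current == new_mxc:
        if current in seen:
            raise ValueError("delta would create a cycle")
        seen.add(current)
        length += 1
        if length > max_chain:
            raise ValueError("delta chain too long")
        current = delta_bases[current]
    delta_bases[new_mxc] = base_mxc
```

With async uploads, the mxc URL of a delta exists before its content does, so a later upload can name it as a base before its own base is declared. Running the check at the point where each `delta_base` is declared catches the link that would close the loop.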

Unstable prefix

| Param | Unstable prefixed param |
|---|---|
| delta_base | org.matrix.msc4083.delta_base |
| delta_format | org.matrix.msc4083.delta_format |
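While the proposal is unstable, a client could rename the two parameters just before building the request. The helper below is a minimal sketch of that mapping; `with_unstable_prefix` is a hypothetical name, and any other query parameters are passed through unchanged.

```python
UNSTABLE_PREFIX = "org.matrix.msc4083."
DELTA_PARAMS = ("delta_base", "delta_format")


def with_unstable_prefix(params):
    """Return a copy of params with this MSC's parameters renamed to their
    unstable-prefixed forms; other parameters are left untouched."""
    return {
        (UNSTABLE_PREFIX + key if key in DELTA_PARAMS else key): value
        for key, value in params.items()
    }
```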

Dependencies

None, although MSC4016 was sketched out at the same time and the two are siblings.