matrix-doc/proposals/4083-delta-compressed-file-...

113 lines
5.0 KiB
Markdown

# MSC4083: Delta-compressed E2EE file transfers
## Problem
When collaborating on a large file of some kind, it's common to store that file in Matrix, and then need a way to
express incremental changes to it. For instance, in Third Room, you might store a large glTF scene graph as a GLB
file, and then want to express a small change to it (e.g. using the editor to transform part of the scene graph). Or
you might want to store a change to a markdown or HTML file.
Currently, your only option is to save a whole new copy of the file - or invent your own delta-compression scheme at
the application layer. Instead, we could make Matrix itself aware of delta-compression, letting the content repository
help users efficiently collaborate around updates to binary files, regardless of what the file is.
## Solution
When uploading a file, specify that it's a delta against a previous piece of content, using a given algorithm.
* `delta_base` is the mxc URL of the content the delta applies to
* `delta_format` is the file format of the binary diff
* This MSC defines `m.vcdiff.v1.gzip` to describe gzipped RFC3284 compatible binary VCDIFF payloads, picked for
computation efficiency rather than patch size (whereas bsdiff + bzip might provide better patch size at worse
computation complexity; other MSCs are welcome to propose different diff formats).
Clients should upload a new snapshot of a piece of content if the sum of the deltas relative to the last snapshot
is larger than 50% of the original piece of content
For instance:
`POST /_matrix/media/v3/upload?delta_base=mxc://matrix.org/b4s3v3rs10n&delta_format=m.vcdiff.v1.gzip`
returning:
```json
{
"content_uri": "mxc://matrix.org/n3wv3rs10n"
}
```
(or with the same parameters for MSC2246-style `POST /_matrix/media/v3/create`).
The server tracks the graph of which deltas apply to which files, so it can only hand the relevant deltas to clients
when they download them.
For instance, when downloading a delta-compressed piece of content, a client might ask to pull in any delta dependencies
it doesn't already have stored locally, relative to the last version that it has a full copy of:
`GET /_matrix/media/v3/download/matrix/org/n3wv3rs10n?delta_base=mxc://matrix.org/b4s3v3rs10n`
This would return an ordered multipart download of the deltas (once unencrypted, if needed) to apply to the base-version
to get a copy of the new-version.
## Alternatives
### Track deltas on events rather than media repository
Alternatively we could store the delta info on the `m.file` event itself as a mixin. This would allow us to shift the
task of tracking deltas purely to clients, and protect the delta info within the E2EE payload. However, this would then
force the client to do many more roundtrips to spider the events (if needed) and files (if needed) one by one in order
to calculate diffs, which would be O(N) latency with the number of diffs rather than O(1) for the above API. Given the
traffic pattern of these requests would reveal the delta graph to the server anyway, it's not clear that it provides a
sufficient advantage. This would look like this:
```json
{
"content": {
"filename": "something-important.doc",
"info": {
"mimetype": "application/msword",
"size": 46144
},
"msgtype": "m.file",
"url": "mxc://example.org/n3wv3rs10n",
"delta_base": "$1235135aksjgdkg",
"delta_format": "m.vcdiff.v1.gzip"
},
}
```
We could go even further down this path by defining an arbitrary CRDT for tracking these deltas, a bit like the
(Saguaro CRDT-over-Matrix)[https://github.com/matrix-org/collaborative-documents/blob/main/docs/saguaro.md] proposal,
with files decorating each event - effectively modelling the problem as a collaborative document problem (with binary
diffs attached) rather than a binary file diffing problem.
### Other alternatives
We could use HTTP PATCH rather than POST when sending diffs. This feels needlessly exotic, imo.
Rather than having a delta_format field, we could use the MIME type of the upload to indicate that it's a patch to a
given underlying MIME type. However, Matrix doesn't currently have to parse MIME types anywhere, so it's more matrixy
to destructure this in JSON.
For unencrypted files, the server could apply the diffs serverside as a convenience to clients who don't know
how to apply the diffs themselves (or who don't have CPU to apply the diffs, or want to benefit from the server caching
diff results). This could be proposed as a separate MSC.
## Security considerations
This exposes the metadata of which file is a delta to which other file to the server.
DoS by too many deltas
DoS by using async uploads to create a cycle
## Unstable prefix
| Param | Unstable prefixed param |
| ------------ | -------------------------------- |
| delta_base | org.matrix.msc4083.delta_base |
| delta_format | org.matrix.msc4083.delta_format |
## Dependencies
None. Although [MSC4016](https://github.com/matrix-org/matrix-spec-proposals/pull/4016) was sketched out at the same
time and the two are siblings.