Grammars for identifiers in the Matrix protocol

Background

Matrix uses client- or server-generated identifiers in a number of places. Historically, the grammars for these have been underspecified, which leads to confusion about what is or is not a valid identifier and creates a risk of incompatibility between implementations.

This proposal presents tightly-specified grammars for a number of identifiers.

Common Identifiers

Spec

Proposal:

localpart may not include :. When parsing a Common Identifier, it should be split at the leftmost :.

Rationale: server names may contain multiple :s (think IPv6 literals), so the first colon is the only sane place to split them. This is a Known Thing, but I don't think we spell it out anywhere in the spec.
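The splitting rule can be sketched as follows. This is a minimal illustration rather than code from any existing implementation; the function name `split_common_identifier` is hypothetical, and no validation of the sigil, localpart or server name is attempted.

```python
from typing import Tuple


def split_common_identifier(identifier: str) -> Tuple[str, str, str]:
    """Split a Common Identifier into (sigil, localpart, server_name)."""
    if len(identifier) < 2:
        raise ValueError("identifier is too short to contain a sigil and a body")
    sigil, body = identifier[0], identifier[1:]
    # partition() splits at the first (leftmost) ':' only, so any later
    # colons (ports, IPv6 literals) stay inside the server name.
    localpart, separator, server_name = body.partition(":")
    if not separator:
        raise ValueError("identifier contains no ':' separator")
    return sigil, localpart, server_name


print(split_common_identifier("@alice:[2001:db8::1]:8448"))
```

With an IPv6 literal and port as the server name, the localpart comes out as `alice` and everything after the first colon is kept intact as the server name.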

User IDs

User IDs are well-specified; however, we should consider dropping / from the list of allowed characters, because HTTP proxies might rewrite /_matrix/client/r0/profile/@foo%2Fbar:matrix.org/displayname to /_matrix/client/r0/profile/@foo/bar:matrix.org/displayname, messing things up.

History: / was introduced with the intention of acting as a hierarchical namespacing character, particularly with consideration to the gitter protocol which uses it as a hierarchical separator. However, this was not as effective as hoped because @foo/bar:example.com looks like the ID is partitioned into @foo and bar:example.com.

Proposal:

Remove / from the list of allowed characters in User IDs.

/ will of course be maintained under the grammar of "historical user IDs". Sorting out that mess is a longer-term project.
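To illustrate the proxy problem, here is a small sketch using Python's standard `urllib.parse`. The "proxy" is simulated simply by percent-decoding the whole path, which is an assumption about how a misbehaving intermediary might act, not a description of any particular product.

```python
from urllib.parse import quote, unquote

user_id = "@foo/bar:matrix.org"

# Percent-encode the '/' inside the user ID so it occupies one path segment.
# '@' and ':' are left as-is here purely to keep the example readable.
encoded = quote(user_id, safe="@:")
path = f"/_matrix/client/r0/profile/{encoded}/displayname"

# Simulated proxy: decodes percent-escapes before forwarding (assumption).
rewritten = unquote(path)

print(path)
print(rewritten)
print(path.count("/"), "segments separators before,", rewritten.count("/"), "after")
```

After the rewrite the path has one more `/` than before, so the server can no longer tell where the user ID ends.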

Room IDs and Event IDs

Issue Spec

These currently have similar formats, though it is likely that event ids will be replaced with something else due to #1127.

Currently they are both specified as ?opaque_id:domain, without clues as to what the opaque_id should be.

Synapse uses: [A-Za-z]{18}. Dendrite uses (I think) [A-Za-z0-9]{16} via json.go. However, some server implementations/forks are known to generate event IDs (and possibly room IDs) using a wide alphabet, which means that there exist rooms that include unusual event IDs.

Proposal:

The opaque_id part must not be empty, and must consist entirely of the characters [0-9a-zA-Z.=_-].

The total length (including sigil and domain) must not exceed 255 characters.

This is only enforced for v2 rooms - servers and clients wishing to support v1 rooms should be more tolerant.
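A check along these lines might look like the sketch below. The function name `is_valid_v2_id` is hypothetical, and the domain part is only required to be non-empty; validating it as a server name is out of scope here.

```python
import re

# Allowed characters for the opaque_id part, as proposed above.
OPAQUE_ID_RE = re.compile(r"[0-9a-zA-Z.=_-]+")


def is_valid_v2_id(identifier: str, sigil: str) -> bool:
    """Check a room ID (sigil '!') or event ID (sigil '$') for v2 rooms."""
    # Total length includes the sigil and the domain.
    if len(identifier) > 255 or not identifier.startswith(sigil):
        return False
    # Split at the leftmost ':'; the domain may itself contain colons.
    opaque_id, separator, domain = identifier[1:].partition(":")
    if not separator or not domain:
        return False
    # fullmatch with '+' also rejects an empty opaque_id.
    return OPAQUE_ID_RE.fullmatch(opaque_id) is not None


print(is_valid_v2_id("!abcDEF123:example.com", "!"))
print(is_valid_v2_id("$bad+id:example.com", "$"))
```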

Key IDs (for federation, e2e, and identity servers)

These are always of the form <algorithm>:<tok>.

Valid algorithms are defined at https://matrix.org/docs/spec/client_server/unstable.html#key-algorithms, though we should define the alphabet for future algorithms.

Proposal:

Future algorithm identifiers will be assigned from the alphabet [a-z0-9_.] and will be at most 31 characters in length.

For federation keys, Synapse generates key ids as ed25519:a_[A-Za-z]{4}, though an HS admin can configure them manually to be anything without whitespace.

Key IDs end up in an Authorization header which looks like X-Matrix origin=origin.example.com,key="keyId",sig="ABCDEF...". The Synapse implementation splits on , and = without regard to quoting so this currently precludes the use of , or = in a key ID.
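The following sketch shows why quote-unaware splitting is a problem. It is a hypothetical parser written for illustration (`naive_parse_x_matrix` is not Synapse's actual code), and the header values are made up.

```python
def naive_parse_x_matrix(header: str) -> dict:
    """Parse an X-Matrix header by splitting on ',' and '=' only."""
    scheme, _, params = header.partition(" ")
    if scheme != "X-Matrix":
        raise ValueError("not an X-Matrix header")
    # No attention is paid to quoting: every ',' ends a field and every
    # '=' separates a name from a value. dict() raises ValueError if any
    # field does not split into exactly two parts.
    fields = dict(kv.split("=") for kv in params.split(","))
    return {name: value.strip('"') for name, value in fields.items()}


good = 'X-Matrix origin=origin.example.com,key="ed25519:a_AbCd",sig="ABCDEF"'
bad = 'X-Matrix origin=origin.example.com,key="ed25519:a=b",sig="ABCDEF"'

print(naive_parse_x_matrix(good)["key"])
try:
    naive_parse_x_matrix(bad)
except ValueError as exc:
    print("failed to parse:", exc)
```

A key ID containing `,` fails in the same way, because the quoted value is cut in two before the `=` split is attempted.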

For e2e, device keys have a tok corresponding to the device id, whilst one-time keys are generated by libolm, which uses a base64-encoded 32-bit int, ie [A-Za-z0-9+/]{6}.
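To see where the six characters and the + and / come from, here is a small sketch of unpadded base64 over a 32-bit integer. The big-endian byte order and the function name `encode_key_counter` are assumptions for illustration; this is not taken from libolm.

```python
import base64
import struct


def encode_key_counter(counter: int) -> str:
    """Encode a 32-bit unsigned int as unpadded standard base64."""
    raw = struct.pack(">I", counter)  # 4 bytes; byte order is an assumption
    # 4 bytes encode to 8 base64 characters, of which the last 2 are '=' padding.
    return base64.b64encode(raw).decode("ascii").rstrip("=")


for value in (0, 1, 0xFFFFFFFF):
    print(value, encode_key_counter(value))
```

Every value yields exactly six characters, and values with the right bit patterns produce `+` or `/`, which fall outside the alphabet proposed below.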

A key ID needs to be unique over the lifetime of the server (for federation) or the device (for e2e). However, they are used fairly widely, so making them long is unattractive as they could significantly increase the amount of data being transmitted. Let's limit the 'tok' part of the key to 31 characters too.

Proposal:

Key IDs use the following ABNF grammar:

key_id         = algorithm ":" tok

algorithm      = 1*31 alg_chars

tok            = 1*31 tok_chars

alg_chars      = %x61-7a / %x30-39 / "_" / "."
                    ; a-z 0-9 _ .

tok_chars      = ALPHA / DIGIT / "." / "=" / "_" / "-"
                    ; A-Z a-z 0-9 . = _ -
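As a rough illustration, the grammar above translates directly into a regular expression. The names `KEY_ID_RE` and `parse_key_id` are hypothetical.

```python
import re
from typing import Optional, Tuple

# algorithm: 1-31 of a-z 0-9 _ .   tok: 1-31 of A-Z a-z 0-9 . = _ -
KEY_ID_RE = re.compile(r"([a-z0-9_.]{1,31}):([A-Za-z0-9.=_-]{1,31})")


def parse_key_id(key_id: str) -> Optional[Tuple[str, str]]:
    """Return (algorithm, tok) if key_id matches the grammar, else None."""
    match = KEY_ID_RE.fullmatch(key_id)
    if match is None:
        return None
    return match.group(1), match.group(2)


print(parse_key_id("ed25519:a_AbCd"))
print(parse_key_id("signed_curve25519:AAAA+/"))
```

The second example is rejected because of the `+` and `/` in the tok part.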

Note that enforcing this grammar will mean:

  • Making sure that synapse handles "=" characters in key IDs (easy).

  • Making libolm not put + and / characters in key IDs (easy enough, but there will be a bunch of malformed one-time keys out there in the wild. Possibly they would just get thrown away. Servers may need to continue to tolerate + and / in e2e keys for a while.)

Opaque IDs

Issue

This is a class of identifier types where nobody is really meant to parse any part of the ID - they are just unique identifiers (with varying scopes of uniqueness). See below for discussion on what is currently in use.

I propose to specify almost the same grammar for all of these, for simplicity and consistency.

Proposal:

Opaque IDs must be strings consisting entirely of the characters [0-9a-zA-Z.=_-]. Their length must not exceed 255 characters and they must not be empty.

For almost all of the current implementations I have looked at (listed below), the grammar above is a superset of the generated identifiers, and a subset of the understood identifiers. There should therefore be no backwards-compatibility problems with its introduction.

The exception is transaction IDs generated by some clients. I think that we'll just have to fix those clients and accept that old versions may not work with future servers.
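A minimal sketch of the proposed check, applied to a few identifier shapes. The sample values are invented for illustration; in particular, the last one is only a guess at what "a room ID plus a timestamp" might look like.

```python
import re

# Non-empty, at most 255 characters, drawn from 0-9 a-z A-Z . = _ -
OPAQUE_ID_RE = re.compile(r"[0-9a-zA-Z.=_-]{1,255}")


def is_valid_opaque_id(value: str) -> bool:
    return OPAQUE_ID_RE.fullmatch(value) is not None


samples = [
    "12345",                              # stringified int
    "m1234567890123.0",                   # 'm', 13 digits, '.', counter
    "!abcdefghijklmnopqr:example.com1234567890123",  # hypothetical room ID + timestamp
]
for sample in samples:
    print(sample, is_valid_opaque_id(sample))
```

The first two shapes pass unchanged; the third fails on the `!` and `:` characters, which is the kind of client-generated transaction ID that would need fixing.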

Call IDs

Spec

These are only used within the body of m.call.* events, as far as I am aware. They should be unique within the lifetime of a room. (Some implementations currently treat them as globally unique, but that is considered an implementation bug.)

matrix-js-sdk uses c[0-9.]{32}. matrix-android-sdk uses c[0-9]{13}.

Additional proposal:

Call IDs should be long enough to make clashes unlikely.
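One way to get a feel for "long enough" is to generate random IDs from the proposed opaque ID alphabet and look at how many bits of randomness a given length carries. This is only a sketch: the default length of 24 and the function names are arbitrary choices for illustration, not part of the proposal.

```python
import math
import secrets
import string

# The 66 characters allowed in opaque IDs: 0-9 a-z A-Z . = _ -
OPAQUE_ID_ALPHABET = string.digits + string.ascii_letters + ".=_-"


def generate_opaque_id(length: int = 24) -> str:
    """Generate a random ID using a cryptographically strong source."""
    if not 1 <= length <= 255:
        raise ValueError("length must be between 1 and 255")
    return "".join(secrets.choice(OPAQUE_ID_ALPHABET) for _ in range(length))


def entropy_bits(length: int) -> float:
    """Bits of randomness in a uniformly random ID of the given length."""
    return length * math.log2(len(OPAQUE_ID_ALPHABET))


print(generate_opaque_id())
print(round(entropy_bits(24), 1), "bits")
```

Each character contributes a little over six bits, so even fairly short random IDs make accidental clashes within a room very unlikely.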

Media IDs

Spec

These are generated by the server on upload, and then embedded in mxc:// URIs and used in the C-S API and the S-S API.

They must be URI-safe to be sensibly embedded in mxc:// URIs.

Synapse uses [A-Za-z]{24}, though it also uses [0-9A-Za-z_-]{27} for URL previews.

matrix-media-repo uses [A-Za-z0-9]{32}, via random.go.

Filter IDs

Spec

These are generated by the server and then used in the CS API. They are only required to be unique for a given user. { is already forbidden by the spec.

Synapse uses a stringified int.

Auth Session IDs

Spec

These are generated by the server during auth, and then used in the CS API. They need to be unique for a given server.

Synapse uses [A-Za-z]{24}.

Transaction IDs (for federation)

Spec

These are generated by the sending server. They need to be unique for a given pair of servers.

Synapse uses a stringified int and accepts pretty much anything.

Transaction IDs (for C-S API)

Spec

These are generated by the client. They only need to be unique within the context of a single access_token/device.

Synapse doesn't appear to do any sanity-checking here currently.

matrix-js-sdk uses m[0-9]{13}.[0-9]{1,}. matrix-android-sdk uses a room ID plus a timestamp, so it could be almost anything, but will certainly include a !.

Device IDs

Spec

These are normally generated by the server on login. It's possible for clients to present their own device_ids, but we're not aware of this feature being widely used.

They are used between users and across federation for E2E and to-device messages. They need to be unique for a particular user. They also appear in key IDs and must therefore be a subset of that grammar.

Synapse generates device IDs with [A-Z]{10}. It appears to do little sanity-checking of client-generated device IDs currently.

Additional proposal:

Device IDs must not exceed 31 characters in length.

Message IDs

These are used in the server-server API for Send-to-device messaging.

Synapse uses [A-Za-z]{16}, and accepts anything that fits in a postgres TEXT field. Ref: devicemessage.py.

Room Aliases

These are a complex topic and are discussed in MSC 1608.