matrix.org/static/jira/browse/SPEC-390

---
summary: Grammar for user IDs
---
created: 2016-04-19 11:16:04.0
creator: richvdh
description: |-
The original intention for (user) MXIDs was that they would never be seen by the user, and that instead users would be identified by third-party identifiers (3pids) such as their email address. If that were the case, we could just apply the same rules as for Room IDs and Event IDs (SPEC-389).
However, for better or worse, user IDs are now exposed (not least for disambiguating user display names), and people want to be able to create user IDs with non-ASCII characters. This means we should at least consider having different rules for user IDs.
id: '12633'
key: SPEC-390
number: '390'
priority: '1'
project: '10001'
reporter: richvdh
resolution: '1'
resolutiondate: 2016-07-14 16:31:07.0
status: '5'
type: '2'
updated: 2016-07-14 16:31:07.0
votes: '0'
watches: '4'
workflowId: '12733'
---
actions:
- author: richvdh
body: We also probably want user IDs to be case-insensitive (see SPEC-289), which means we need to specify how they should be canonicalised.
created: 2016-04-19 11:20:50.0
id: '12848'
issue: '12633'
type: comment
updateauthor: richvdh
updated: 2016-04-19 11:20:50.0
- author: richvdh
body: An alternative might be to *actually* make user IDs opaque, and instead have user aliases which mirror room aliases, as per SPEC-152.
created: 2016-04-19 11:22:52.0
id: '12849'
issue: '12633'
type: comment
updateauthor: richvdh
updated: 2016-04-19 11:22:52.0
- author: richvdh
body: |-
From [~jimmycuadra]:
{quote}
In Synapse, when a user ID is generated by the server, it consists of a string beginning with "-" followed by 18 random ASCII letters. There's also a check to ensure user ID local parts supplied in the registration request don't contain characters that need to be URL encoded. It's not clear which of these constraints, if any, are required by the spec and which are just implementation details of Synapse.
{quote}
created: 2016-04-19 12:06:26.0
id: '12851'
issue: '12633'
type: comment
updateauthor: richvdh
updated: 2016-04-19 12:07:03.0
- author: richvdh
body: |-
From [~markjh]:
{quote}
... the v2 filter API makes assumptions that '*' is not a valid character in a room ID or a user ID so that it can use it as a wildcard.
{quote}
created: 2016-04-19 12:10:00.0
id: '12853'
issue: '12633'
type: comment
updateauthor: richvdh
updated: 2016-04-19 12:10:00.0
- author: richvdh
body: |-
From [~matthew]:
{quote}
My thoughts for room aliases and user ids: "utf8, with a blacklist of explicitly disallowed characters (all whitespace, *, /, :, ., any others we want to reserve). you're not allowed to mix charsets (fsvo charset), and possibly deny other homomorph attacks eg l v I".
IDs are compared case insensitively.
oh, and no zero length ids
{quote}
created: 2016-04-19 12:16:00.0
id: '12857'
issue: '12633'
type: comment
updateauthor: richvdh
updated: 2016-04-19 12:16:00.0
- author: richvdh
body: |-
From [~eternaleye]:
{quote}
The zero-width joiner, in particular, is [needed to construct the letterforms of some languages that Unicode does not support sufficiently|https://modelviewculture.com/pieces/i-can-text-you-a-pile-of-poo-but-i-cant-write-my-name].
{quote}
created: 2016-04-19 12:26:02.0
id: '12859'
issue: '12633'
type: comment
updateauthor: richvdh
updated: 2016-04-19 12:26:02.0
- author: richvdh
body: |-
Also from [~eternaleye]:
{quote}
... would personally support a hard ban on anything outside of \[0-9A-Za-z_.-], and if people want to be silly with it then they can use punycode and render in the client. The dot supports corporate-style firstname.lastname (in countries where that's used), the underscore supports IRC-like names, the dash is needed for punycode, and the rest just are baseline.
Punycode also allows representing anything other networks do, without needing state, so IRC nicks with backticks could just get punycoded by the AS.
{quote}
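Editorial sketch, not part of the quoted comment: assuming a bridge (AS) uses Python's built-in `punycode` codec, the mapping described above might look roughly like this. The helper names `to_localpart` and `from_localpart` are hypothetical, and the final character check is an assumption about how the restricted set would be enforced.

```python
import re

# Restricted character set proposed in the quoted comment.
ALLOWED = re.compile(r"[0-9A-Za-z_.-]+")


def to_localpart(foreign_id: str) -> str:
    """Punycode-encode a foreign ID (hypothetical helper)."""
    encoded = foreign_id.encode("punycode").decode("ascii")
    # Punycode copies ASCII characters through unchanged, so ASCII
    # characters outside the allowed set still have to be rejected here.
    if not ALLOWED.fullmatch(encoded):
        raise ValueError(f"cannot represent {foreign_id!r} in the restricted set")
    return encoded


def from_localpart(localpart: str) -> str:
    """Reverse the mapping without keeping any state."""
    return localpart.encode("ascii").decode("punycode")
```

Because the encoding is reversible, the bridge would not need to store a lookup table; it can recover the foreign ID from the localpart alone.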
created: 2016-04-19 12:30:01.0
id: '12861'
issue: '12633'
type: comment
updateauthor: richvdh
updated: 2016-04-19 12:30:01.0
- author: richvdh
body: |-
https://github.com/matrix-org/matrix-doc/blob/human-id-rules/drafts/human-id-rules.rst proposes:
{quote}
* MUST NOT contain a : or start with a @ or .
* MUST NOT contain one of the 107 blacklisted characters on this list: http://kb.mozillazine.org/Network.IDN.blacklist_chars
* After stripping " 0-9, +, -, \[, \], _, and the space character it MUST NOT contain characters from >1 language, defined by the exemplar characters on http://cldr.unicode.org/
When a homeserver receives an event which contains a userid which fails the above rules, it rewrites it as punycode (with an additional leading @) when sending it to clients.
Homeservers SHOULD NOT allow two user IDs that differ only by case. This SHOULD be applied based on the capitalisation rules in the CLDR dataset: http://cldr.unicode.org/. This check SHOULD be applied when the user ID is created, in order to prevent registration with the same name and different capitalisations, e.g. @foo:bar vs @Foo:bar vs @FOO:bar. Home servers MAY canonicalise the user ID to be completely lower-case if desired.
{quote}
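Editorial sketch, not part of the quoted draft: the "differ only by case" check at registration time could be approximated as below. It uses Python's `str.casefold()` as a stand-in for the CLDR capitalisation rules the draft refers to, which is an assumption; the `UserRegistry` class is hypothetical.

```python
class UserRegistry:
    """Rejects user IDs that collide with an existing one after case folding."""

    def __init__(self):
        # Maps the case-folded ID to the ID as originally registered.
        self._by_folded = {}

    def register(self, user_id: str) -> None:
        key = user_id.casefold()  # stand-in for CLDR-based case rules
        existing = self._by_folded.get(key)
        if existing is not None:
            raise ValueError(f"{user_id!r} collides with {existing!r}")
        self._by_folded[key] = user_id
```

With this in place, registering `@foo:bar` and then `@Foo:bar` would fail on the second call, while the stored ID keeps its original capitalisation.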
created: 2016-04-19 12:48:56.0
id: '12862'
issue: '12633'
type: comment
updateauthor: richvdh
updated: 2016-04-19 12:48:56.0
- author: richvdh
body: |-
So https://github.com/matrix-org/matrix-doc/pull/3 covers this, and basically suggests that userids be utf-8, with a few constraints. There is also some diversion into how we can avoid homograph attacks.
At this point I'd like to ask if anyone ([~matthew], [~Kegan]? ) can give me a good reason why we should allow non-ascii in user ids. Sure we're using them at the moment for displayname disambiguation and (sometimes) sending invites, but this is more a failure to implement 3pids and displayname disambiguation properly. Currently, we are restricting usernames to \[a-z0-9_./-] and nobody seems too sad about it. Sticking with that restriction, and supporting richer character sets in displaynames and/or aliases, seems quite attractive.
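Editorial sketch of the current restriction mentioned above, assuming a user ID has the form `@localpart:domain`. The function name `is_valid_user_id` is hypothetical, and the domain part is only checked for being non-empty.

```python
import re

# Character set quoted above for currently permitted usernames.
LOCALPART = re.compile(r"[a-z0-9_./-]+")


def is_valid_user_id(user_id: str) -> bool:
    """Check the localpart of '@localpart:domain' against the restricted set."""
    if not user_id.startswith("@"):
        return False
    # Split on the first ':' so a domain with a port stays intact.
    localpart, sep, domain = user_id[1:].partition(":")
    if not sep or not domain:
        return False
    # fullmatch also rejects a zero-length localpart.
    return LOCALPART.fullmatch(localpart) is not None
```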
created: 2016-04-19 14:41:53.0
id: '12868'
issue: '12633'
type: comment
updateauthor: richvdh
updated: 2016-04-19 14:41:53.0
- author: eternaleye
body: |-
I'll note that bridging generally wants to be able to injectively map foreign IDs to Matrix, so that it can avoid either keeping state regarding the mapping or potentially creating mxids that collide with each other. This can be resolved a number of ways:
- Allow truly arbitrary UTF-8. This brings with it various sources of pain, including normalization, case folding, locale-dependent collation, etc.
- Use the restricted charset above, and provide an informative (not normative) note that this is sufficient to support punycode, which is also an injective mapping (and nicely preserves the common case of mostly-ASCII IDs).
- Same, but instead suggest (informative, not normative) that bridges hash foreign IDs. This has downsides - either the IDs are long and strange, or they're short, strange, and at risk of colliding. In addition, the lack of uppercase makes base64 a non-starter, so these really will be *quite* long (hex). This is potentially mitigated by Display Names (for humans reading) and 3PIDs (for input).
Bridging is an especially meaningful concern as (in Synapse) it currently *bypasses* the constraints for human-created MXIDs. If the spec lays out a constraint, it's likely future server implementations will expect federation to respect it; addressing bridges early may help a great deal in avoiding pain down the road.
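Editorial sketch of the hashing option, not part of the original comment: assuming SHA-256 with hex output, which stays inside a lowercase-only character set. The helper name and the `_bridge_` prefix are hypothetical.

```python
import hashlib


def hashed_localpart(foreign_id: str, prefix: str = "_bridge_") -> str:
    """Derive a localpart from a foreign ID by hashing (hypothetical helper)."""
    # Hex output uses only [0-9a-f], so no uppercase is needed, but the
    # result is 64 characters long for SHA-256.
    digest = hashlib.sha256(foreign_id.encode("utf-8")).hexdigest()
    return prefix + digest
```

This illustrates the length trade-off noted above: truncating the digest would shorten the ID but raise the chance of two foreign IDs colliding.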
created: 2016-04-19 15:48:18.0
id: '12871'
issue: '12633'
type: comment
updateauthor: eternaleye
updated: 2016-04-19 15:48:18.0
- author: richvdh
body: |-
> Sure we're using them at the moment for displayname disambiguation and (sometimes) sending invites,
Worth noting we are also using them as the username in simple username/password authentication; again though this is probably more a failure to implement worthwhile 3pid auth.
created: 2016-06-01 09:53:23.0
id: '12934'
issue: '12633'
type: comment
updateauthor: richvdh
updated: 2016-06-01 09:53:23.0
- author: richvdh
body: |-
This came up on Matrix HQ today, in the context of how a bridge (specifically an XMPP<->Matrix bridge) should map between xmpp user ids and matrix user ids.
I've written up some thoughts at https://docs.google.com/document/d/1mQxZT8lcj7FbkXArsiGmYOwxgUAQuSC01Jlbo_hIgIg.
created: 2016-06-21 16:48:36.0
id: '13012'
issue: '12633'
type: comment
updateauthor: richvdh
updated: 2016-06-21 16:48:36.0
- author: richvdh
body: |-
This has now made it into the spec: http://matrix.org/docs/spec/intro.html#user-identifiers
... and there was much rejoicing.
created: 2016-07-14 16:31:07.0
id: '13058'
issue: '12633'
type: comment
updateauthor: richvdh
updated: 2016-07-14 16:31:07.0