pyzmq/docs/source/notes/unicode.md

8.6 KiB

(unicode)=

PyZMQ and Unicode

This describes early days of pyzmq development,
when we supported Python 2.5 and 3.1.
Much of this information is wildly outdated now.

PyZMQ is built with an eye towards an easy transition to Python 3, and part of that is dealing with unicode strings. This is an overview of some of what we found, and what it means for PyZMQ.

First, Unicode in Python 2 and 3

In Python < 3, a str object is really a C string with some sugar - a specific series of bytes with some fun methods like endswith() and split(). In 2.0, the unicode object was added, which handles different methods of encoding. In Python 3, however, the meaning of str changes. A str in Python 3 is a full unicode object, with encoding and everything. If you want a C string with some sugar, there is a new object called bytes, that behaves much like the 2.x str. The idea is that for a user, a string is a series of characters, not a series of bytes. For simple ascii, the two are interchangeable, but if you consider accents and non-Latin characters, then the character meaning of byte sequences can be ambiguous, since it depends on the encoding scheme. They decided to avoid the ambiguity by forcing users who want the actual bytes to specify the encoding every time they want to convert a string to bytes. That way, users are aware of the difference between a series of bytes and a collection of characters, and don't confuse the two, as happens in Python 2.x.

The problems (on both sides) come from the fact that regardless of the language design, users are mostly going to use str objects to represent collections of characters, and the behavior of that object is dramatically different in certain aspects between the 2.x bytes approach and the 3.x unicode approach. The unicode approach has the advantage of removing byte ambiguity - it's a list of characters, not bytes. However, if you really do want the bytes, it's very inefficient to get them. The bytes approach has the advantage of efficiency. A bytes object really is just a char* pointer with some methods to be used on it, so when interacting with, so interacting with C code, etc is highly efficient and straightforward. However, understanding a bytes object as a string with extended characters introduces ambiguity and possibly confusion.

To avoid ambiguity, hereafter we will refer to encoded C arrays as 'bytes' and abstract unicode objects as 'strings'.

Unicode Buffers

Since unicode objects have a wide range of representations, they are not stored as the bytes according to their encoding, but rather in a format called UCS (an older fixed-width Unicode format). On some platforms (macOS, Windows), the storage is UCS-2, which is 2 bytes per character. On most *ix systems, it is UCS-4, or 4 bytes per character. The contents of the buffer of a unicode object are not encoding dependent (always UCS-2 or UCS-4), but they are platform dependent. As a result of this, and the further insistence on not interpreting unicode objects as bytes without specifying encoding, str objects in Python 3 don't even provide the buffer interface. You simply cannot get the raw bytes of a unicode object without specifying the encoding for the bytes. In Python 2.x, you can get to the raw buffer, but the platform dependence and the fact that the encoding of the buffer is not the encoding of the object makes it very confusing, so this is probably a good move.

The efficiency problem here comes from the fact that simple ascii strings are 4x as big in memory as they need to be (on most Linux, 2x on other platforms). Also, to translate to/from C code that works with char*, you always have to copy data and encode/decode the bytes. This really is horribly inefficient from a memory standpoint. Essentially, Where memory efficiency matters to you, you should never ever use strings; use bytes. The problem is that users will almost always use str, and in 2.x they are efficient, but in 3.x they are not. We want to make sure that we don't help the user make this mistake, so we ensure that zmq methods don't try to hide what strings really are.

What This Means for PyZMQ

PyZMQ is a wrapper for a C library, so it really should use bytes, since a string is not a simple wrapper for char * like it used to be, but an abstract sequence of characters. The representations of bytes in Python are either the bytes object itself, or any object that provides the buffer interface (aka memoryview). In Python 2.x, unicode objects do provide the buffer interface, but as they do not in Python 3, where pyzmq requires bytes, we specifically reject unicode objects.

The relevant methods here are socket.send/recv, socket.get/setsockopt, socket.bind/connect. The important consideration for send/recv and set/getsockopt is that when you put in something, you really should get the same object back with its partner method. We can easily coerce unicode objects to bytes with send/setsockopt, but the problem is that the pair method of recv/getsockopt will always be bytes, and there should be symmetry. We certainly shouldn't try to always decode on the retrieval side, because if users just want bytes, then we are potentially using up enormous amounts of excess memory unnecessarily, due to copying and larger memory footprint of unicode strings.

Still, we recognize the fact that users will quite frequently have unicode strings that they want to send, so we have added socket.<method>_string() wrappers. These methods simply wrap their bytes counterpart by encoding to/decoding from bytes around them, and they all take an encoding keyword argument that defaults to utf-8. Since encoding and decoding are necessary to translate between unicode and bytes, it is impossible to perform non-copying actions with these wrappers.

socket.bind/connect methods are different from these, in that they are strictly setters and there is not corresponding getter method. As a result, we feel that we can safely coerce unicode objects to bytes (always to utf-8) in these methods.

For cross-language symmetry (including Python 3), the `_unicode` methods
are now `_string`. Many languages have a notion of native strings, and
the use of `_unicode` was wedded too closely to the name of such objects
in Python 2.  For the time being, anywhere you see `_string`, `_unicode`
also works, and is the only option in pyzmq ≤ 2.1.11.

The Methods

Overview of the relevant methods:

.. py:function::    socket.bind(self, addr)

        `addr` is ``bytes`` or ``unicode``. If ``unicode``,
        encoded to utf-8 ``bytes``
.. py:function::    socket.connect(self, addr)

        `addr` is ``bytes`` or ``unicode``. If ``unicode``,
        encoded to utf-8 ``bytes``
.. py:function::    socket.send(self, object obj, flags=0, copy=True)

        `obj` is ``bytes`` or provides buffer interface.

        if `obj` is ``unicode``, raise ``TypeError``
.. py:function::    socket.recv(self, flags=0, copy=True)

        returns ``bytes`` if `copy=True`

        returns ``zmq.Message`` if `copy=False`:

            `message.buffer` is a buffer view of the ``bytes``

            `str(message)` provides the ``bytes``

            `unicode(message)` decodes `message.buffer` with utf-8
.. py:function::    socket.send_string(self, unicode s, flags=0, encoding='utf-8')

        takes a ``unicode`` string `s`, and sends the ``bytes``
        after encoding without an extra copy, via:

        `socket.send(s.encode(encoding), flags, copy=False)`
.. py:function::    socket.recv_string(self, flags=0, encoding='utf-8')

        always returns ``unicode`` string

        there will be a ``UnicodeError`` if it cannot decode the buffer

        performs non-copying `recv`, and decodes the buffer with `encoding`
.. py:function::    socket.setsockopt(self, opt, optval)

        only accepts ``bytes``  for `optval` (or ``int``, depending on `opt`)

        ``TypeError`` if ``unicode`` or anything else
.. py:function::    socket.getsockopt(self, opt)

        returns ``bytes`` (or ``int``), never ``unicode``
.. py:function::    socket.setsockopt_string(self, opt, unicode optval, encoding='utf-8')

        accepts ``unicode`` string for `optval`

        encodes `optval` with `encoding` before passing the ``bytes`` to
        `setsockopt`
.. py:function::    socket.getsockopt_string(self, opt, encoding='utf-8')

        always returns ``unicode`` string, after decoding with `encoding`

        note that `zmq.IDENTITY` is the only `sockopt` with a string value
        that can be queried with `getsockopt`