pyzmq/docs/unicode.rst

.. PyZMQ Unicode doc, by Min Ragan-Kelley, 2010

.. _unicode:

PyZMQ and Unicode
=================

PyZMQ is built with an eye towards an easy transition to Python 3, and part of
that is dealing with unicode objects. This is an overview of some of what we
found, and what it means for PyZMQ.

First, Unicode in Python 2 and 3
********************************

In Python < 3, a ``str`` object is really a C string with some sugar - a
specific series of bytes with some fun methods like ``endswith()`` and
``split()``. In 2.0, the ``unicode`` object was added, which handles different
methods of encoding. In Python 3, however, the meaning of ``str`` changes. A
``str`` in Python 3 is a full unicode object, with encoding and everything. If
you want a C string with some sugar, there is a new object called ``bytes``,
that behaves much like the 2.x ``str``. The idea is that for a user, a string is
a series of *characters*, not a series of bytes. For simple ascii, the two are
interchangeable, but if you consider accents and non-Latin characters, then the
character meaning of byte sequences can be ambiguous, since it depends on the
encoding scheme. They decided to avoid the ambiguity by forcing users who want
the actual bytes to specify the encoding every time they want to convert a
string to bytes. That way, users are aware of the difference between a series of
bytes and a collection of characters, and don't confuse the two, as happens in
Python 2.x.

The problems (on both sides) come from the fact that regardless of the language
design, users are mostly going to use ``str`` objects to represent collections
of characters, and the behavior of that object is dramatically different in
certain aspects between the 2.x ``bytes`` approach and the 3.x ``unicode``
approach. The ``unicode`` approach has the advantage of removing byte ambiguity
- it's a list of characters, not bytes. However, if you really do want the
bytes, it's very inefficient to get them. The ``bytes`` approach has the
advantage of efficiency. A ``bytes`` object really is just a char* pointer with
some methods to be used on it, so when interacting with, so interacting with C
code, etc is highly efficient and straightforward. However, understanding a
bytes object as a string with extended characters introduces ambiguity and
possibly confusion.

To avoid ambiguity, hereafter we will refer to encoded C arrays as 'bytes' and
abstract unicode objects as 'strings'.

Unicode Buffers
---------------

Since unicode objects have a wide range of representations, they are not stored
as the bytes according to their encoding, but rather in a format called UCS (an
older fixed-width Unicode format). On some platforms (OSX,Windows), the storage
is UCS-2, which is 2 bytes per character. On most \*ix systems, it is UCS-4, or
4 bytes per character. The contents of the *buffer* of a ``unicode`` object are
not encoding dependent (always UCS-2 or UCS-4), but they are *platform*
dependent. As a result of this, and the further insistence on not interpreting
``unicode`` objects as bytes without specifying encoding, ``str`` objects in
Python 3 don't even provide the buffer interface. You simply cannot get the raw
bytes of a ``unicode`` object without specifying the encoding for the bytes. In
Python 2.x, you can get to the raw buffer, but the platform dependence and the
fact that the encoding of the buffer is not the encoding of the object makes it
very confusing, so this is probably a good move.

The efficiency problem here comes from the fact that simple ascii strings are 4x
as big in memory as they need to be (on most Linux, 2x on other platforms).
Also, to translate to/from C code that works with char*, you always have to copy
data and encode/decode the bytes. This really is horribly inefficient from a
memory standpoint. Essentially, Where memory efficiency matters to you, you
should never ever use strings; use bytes. The problem is that users will almost
always use ``str``, and in 2.x they are efficient, but in 3.x they are not. We
want to make sure that we don't help the user make this mistake, so we ensure
that zmq methods don't try to hide what strings really are.

What This Means for PyZMQ
*************************

PyZMQ is a wrapper for a C library, so it really should use bytes, since a
string is not a simple wrapper for ``char *`` like it used to be, but an
abstract sequence of characters. The representations of bytes in Python are
either the ``bytes`` object itself, or any object that provides the buffer
interface (aka memoryview). In Python 2.x, unicode objects do provide the buffer
interface, but as they do not in Python 3, where pyzmq requires bytes, we
specifically reject unicode objects.

The relevant methods here are **socket.send/recv**, **socket.get/setsockopt**,
**socket.bind/connect**. The important consideration for send/recv and
set/getsockopt is that when you put in something, you should really get the same
object back with its partner method. We can easily coerce unicode objects to
bytes with send/setsockopt, but the problem is that the pair method of
recv/getsockopt will always be bytes, and there should be symmetry. We certainly
shouldn't try to always decode on the retrieval side, because if users just want
bytes, then we are potentially using up enormous amounts of excess memory
unnecessarily, due to copying and larger memory footprint of unicode strings.

Still, we recognize the fact that users will quite frequently have unicode
strings that they want to send, so we have added ``socket.<method>_unicode()``
wrappers. These methods simply wrap their bytes counterpart by encoding
to/decoding from bytes around them, and they all take an `encoding` keyword
argument that defaults to utf-8. Since encoding and decoding are necessary to
translate between unicode and bytes, it is impossible to perform non-copying
actions with these wrappers.

``socket.bind/connect`` methods are different from these, in that they are
strictly setters and there is not corresponding getter method. As a result, we
feel that we can safely coerce unicode objects to bytes (always to utf-8) in
these methods.

The Methods
-----------

Overview of the relevant methods:

.. py:function::    socket.bind(self, addr)

        `addr` is ``bytes`` or ``unicode``. If ``unicode``,
        encoded to utf-8 ``bytes``

.. py:function::    socket.connect(self, addr)

        `addr` is ``bytes`` or ``unicode``. If ``unicode``,
        encoded to utf-8 ``bytes``

.. py:function::    socket.send(self, object obj, flags=0, copy=True)

        `obj` is ``bytes`` or provides buffer interface.

        if `obj` is ``unicode``, raise ``TypeError``

.. py:function::    socket.recv(self, flags=0, copy=True)

        returns ``bytes`` if `copy=True`

        returns ``zmq.Message`` if `copy=False`:

            `message.buffer` is a buffer view of the ``bytes``

            `str(message)` provides the ``bytes``

            `unicode(message)` decodes `message.buffer` with utf-8

.. py:function::    socket.send_unicode(self, unicode s, flags=0,
                                                encoding='utf-8')

        takes a ``unicode`` string `s`, and sends the ``bytes``
        after encoding without an extra copy, via:

        `socket.send(s.encode(encoding), flags, copy=False)`

.. py:function::    socket.recv_unicode(self, flags=0, encoding='utf-8')

        always returns ``unicode`` string

        there will be a ``UnicodeError`` if it cannot decode the buffer

        performs non-copying `recv`, and decodes the buffer with `encoding`

.. py:function::    socket.setsockopt(self, opt, optval)

        only accepts ``bytes``  for `optval` (or ``int``, depending on `opt`)

        ``TypeError`` if ``unicode`` or anything else

.. py:function::    socket.getsockopt(self, opt)

        returns ``bytes`` (or ``int``), never ``unicode``

.. py:function::    socket.setsockopt_unicode(self, opt, unicode optval,
                                                encoding='utf-8')

        accepts ``unicode`` string for `optval`

        encodes `optval` with `encoding` before passing the ``bytes`` to
        `setsockopt`

.. py:function::    socket.getsockopt_unicode(self, opt, encoding='utf-8')

        always returns ``unicode`` string, after decoding with `encoding`

        note that `zmq.IDENTITY` is the only `sockopt` with a string value
        that can be queried with `getsockopt`