.. PyZMQ Unicode doc, by Min Ragan-Kelley, 2010

.. _unicode:

PyZMQ and Unicode
=================

PyZMQ is built with an eye towards an easy transition to Python 3, and part of
that is dealing with unicode objects. This is an overview of some of what we
found, and what it means for PyZMQ.

First, Unicode in Python 2 and 3
********************************

In Python < 3, a ``str`` object is really a C string with some sugar: a
specific series of bytes with some fun methods like ``endswith()`` and
``split()``. In 2.0, the ``unicode`` object was added, which handles different
methods of encoding. In Python 3, however, the meaning of ``str`` changes. A
``str`` in Python 3 is a full unicode object, with encoding and everything. If
you want a C string with some sugar, there is a new object called ``bytes``,
which behaves much like the 2.x ``str``. The idea is that for a user, a string
is a series of *characters*, not a series of bytes. For simple ascii, the two
are interchangeable, but if you consider accents and non-Latin characters, then
the character meaning of byte sequences can be ambiguous, since it depends on
the encoding scheme. The Python developers decided to avoid the ambiguity by
forcing users who want the actual bytes to specify the encoding every time they
want to convert a string to bytes. That way, users are aware of the difference
between a series of bytes and a collection of characters, and don't confuse the
two, as happens in Python 2.x.

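As a small illustration of the distinction above, the following Python 3 sketch
shows one string producing different byte sequences depending on the encoding.
The sample text and variable names are ours, chosen only for illustration.

```python
text = "naïve café"  # a Python 3 str: a sequence of characters

# Converting to bytes always requires choosing an encoding.
utf8 = text.encode("utf-8")
latin1 = text.encode("latin-1")

print(len(text))    # number of characters
print(len(utf8))    # number of bytes in UTF-8 (accented letters take 2 bytes)
print(len(latin1))  # number of bytes in Latin-1 (1 byte per character)

# The same bytes mean different characters under a different encoding.
print(utf8.decode("utf-8") == text)
print(utf8.decode("latin-1") == text)
```

The character count and the byte count agree only when every character fits in
a single byte of the chosen encoding.
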
The problems (on both sides) come from the fact that regardless of the language
design, users are mostly going to use ``str`` objects to represent collections
of characters, and the behavior of that object is dramatically different in
certain aspects between the 2.x ``bytes`` approach and the 3.x ``unicode``
approach. The ``unicode`` approach has the advantage of removing byte
ambiguity: it's a list of characters, not bytes. However, if you really do want
the bytes, it's very inefficient to get them. The ``bytes`` approach has the
advantage of efficiency. A ``bytes`` object really is just a char* pointer with
some methods to be used on it, so interacting with C code, etc. is highly
efficient and straightforward. However, understanding a bytes object as a
string with extended characters introduces ambiguity and possibly confusion.

To avoid ambiguity, hereafter we will refer to encoded C arrays as 'bytes' and
abstract unicode objects as 'strings'.

Unicode Buffers
---------------

Since unicode objects have a wide range of representations, they are not stored
as the bytes according to their encoding, but rather (in CPython 2.x and in 3.x
releases before 3.3) in a format called UCS (an older fixed-width Unicode
format). On some platforms (OS X, Windows), the storage is UCS-2, which is 2
bytes per character. On most \*ix systems, it is UCS-4, or 4 bytes per
character. The contents of the *buffer* of a ``unicode`` object are not
encoding dependent (always UCS-2 or UCS-4), but they are *platform* dependent.
As a result of this, and the further insistence on not interpreting ``unicode``
objects as bytes without specifying encoding, ``str`` objects in Python 3 don't
even provide the buffer interface. You simply cannot get the raw bytes of a
``unicode`` object without specifying the encoding for the bytes. In Python
2.x, you can get to the raw buffer, but the platform dependence and the fact
that the encoding of the buffer is not the encoding of the object make it very
confusing, so this is probably a good move.

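The buffer-interface point can be checked directly in Python 3. This is a
minimal sketch using only the built-in ``memoryview``; the sample data is
ours.

```python
data = b"caf\xc3\xa9"         # UTF-8 bytes for "café"
text = data.decode("utf-8")   # the same content as a str

view = memoryview(data)       # bytes expose their raw buffer
print(view.nbytes)

try:
    memoryview(text)          # str does not provide the buffer interface
except TypeError as exc:
    print("TypeError:", exc)

# The only route from str to raw bytes names an encoding.
print(text.encode("utf-8") == data)
```
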
The efficiency problem here comes from the fact that simple ascii strings are 4x
as big in memory as they need to be (on most Linux systems; 2x on other
platforms). Also, to translate to/from C code that works with char*, you always
have to copy data and encode/decode the bytes. This really is horribly
inefficient from a memory standpoint. Essentially, where memory efficiency
matters to you, you should never use strings; use bytes. The problem is that
users will almost always use ``str``, and in 2.x they are efficient, but in 3.x
they are not. We want to make sure that we don't help the user make this
mistake, so we ensure that zmq methods don't try to hide what strings really
are.

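The 2x and 4x figures can be reproduced with a rough sketch. As an assumption
for illustration, the fixed-width codecs ``utf-16-le`` and ``utf-32-le`` stand
in for UCS-2 and UCS-4 buffers here; the code does not inspect how any
particular interpreter actually stores its strings.

```python
text = "hello world"  # pure ASCII, 11 characters

as_bytes = text.encode("ascii")     # 1 byte per character
as_ucs2 = text.encode("utf-16-le")  # 2 bytes per character, like a UCS-2 buffer
as_ucs4 = text.encode("utf-32-le")  # 4 bytes per character, like a UCS-4 buffer

print(len(as_bytes), len(as_ucs2), len(as_ucs4))
```

For this ASCII text the three lengths are 11, 22 and 44 bytes.
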
What This Means for PyZMQ
*************************

PyZMQ is a wrapper for a C library, so it really should use bytes, since a
string is not a simple wrapper for ``char *`` like it used to be, but an
abstract sequence of characters. The representations of bytes in Python are
either the ``bytes`` object itself, or any object that provides the buffer
interface (aka memoryview). In Python 2.x, unicode objects do provide the buffer
interface, but they do not in Python 3, so wherever pyzmq requires bytes, we
specifically reject unicode objects.

The relevant methods here are **socket.send/recv**, **socket.get/setsockopt**,
and **socket.bind/connect**. The important consideration for send/recv and
set/getsockopt is that when you put in something, you should really get the same
object back with its partner method. We can easily coerce unicode objects to
bytes with send/setsockopt, but the problem is that the result of the paired
recv/getsockopt method will always be bytes, and there should be symmetry. We
certainly shouldn't try to always decode on the retrieval side, because if users
just want bytes, then we are potentially using up enormous amounts of excess
memory unnecessarily, due to copying and the larger memory footprint of unicode
strings.

Still, we recognize the fact that users will quite frequently have unicode
strings that they want to send, so we have added ``socket.<method>_unicode()``
wrappers. These methods simply wrap their bytes counterpart by encoding
to/decoding from bytes around them, and they all take an `encoding` keyword
argument that defaults to utf-8. Since encoding and decoding are necessary to
translate between unicode and bytes, it is impossible to perform non-copying
actions with these wrappers.

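The wrapper pattern described above can be sketched without PyZMQ itself. The
``LoopbackSocket`` class below is a hypothetical in-memory stand-in, not part
of PyZMQ; only the method names ``send``, ``recv``, ``send_unicode`` and
``recv_unicode`` come from this document, and the simplified signatures are an
assumption.

```python
from collections import deque


class LoopbackSocket:
    """Hypothetical in-memory stand-in for a socket (not the PyZMQ class)."""

    def __init__(self):
        self._queue = deque()

    def send(self, data):
        # Bytes-only rule: refuse text instead of silently encoding it.
        if isinstance(data, str):
            raise TypeError("unicode not allowed, use send_unicode")
        self._queue.append(bytes(data))

    def recv(self):
        # Always returns bytes, mirroring what send() accepted.
        return self._queue.popleft()

    def send_unicode(self, s, encoding="utf-8"):
        # Thin wrapper: encode, then delegate to the bytes method.
        self.send(s.encode(encoding))

    def recv_unicode(self, encoding="utf-8"):
        # Thin wrapper: receive bytes, then decode them.
        return self.recv().decode(encoding)


sock = LoopbackSocket()
sock.send_unicode("héllo")
print(sock.recv())          # the encoded bytes

sock.send(b"plain bytes")
print(sock.recv_unicode())  # decoded back to a string
```

Each method pairs with a partner of the same kind, so a caller who stays with
``send``/``recv`` never pays for encoding or decoding.
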
``socket.bind/connect`` methods are different from these, in that they are
strictly setters and there is no corresponding getter method. As a result, we
feel that we can safely coerce unicode objects to bytes (always to utf-8) in
these methods.

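A minimal sketch of that coercion follows. The helper name ``coerce_address``
is hypothetical and is not a PyZMQ function; it only illustrates the rule
stated above.

```python
def coerce_address(addr):
    """Hypothetical helper: return `addr` as bytes, encoding text as UTF-8."""
    if isinstance(addr, str):
        return addr.encode("utf-8")
    if isinstance(addr, bytes):
        return addr
    raise TypeError("expected str or bytes, got %r" % type(addr))


print(coerce_address("tcp://127.0.0.1:5555"))
print(coerce_address(b"tcp://127.0.0.1:5555"))
```

Both calls yield the same ``bytes`` value, which is what a setter without a
matching getter needs.
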
The Methods
-----------

Overview of the relevant methods:

.. py:function:: socket.bind(self, addr)

    `addr` is ``bytes`` or ``unicode``. If ``unicode``,
    encoded to utf-8 ``bytes``

.. py:function:: socket.connect(self, addr)

    `addr` is ``bytes`` or ``unicode``. If ``unicode``,
    encoded to utf-8 ``bytes``

.. py:function:: socket.send(self, object obj, flags=0, copy=True)

    `obj` is ``bytes`` or provides buffer interface.

    if `obj` is ``unicode``, raise ``TypeError``

.. py:function:: socket.recv(self, flags=0, copy=True)

    returns ``bytes`` if `copy=True`

    returns ``zmq.Message`` if `copy=False`:

        `message.buffer` is a buffer view of the ``bytes``

        `str(message)` provides the ``bytes``

        `unicode(message)` decodes `message.buffer` with utf-8

.. py:function:: socket.send_unicode(self, unicode s, flags=0, encoding='utf-8')

    takes a ``unicode`` string `s`, and sends the ``bytes``
    after encoding without an extra copy, via:

    `socket.send(s.encode(encoding), flags, copy=False)`

.. py:function:: socket.recv_unicode(self, flags=0, encoding='utf-8')

    always returns ``unicode`` string

    there will be a ``UnicodeError`` if it cannot decode the buffer

    performs non-copying `recv`, and decodes the buffer with `encoding`

.. py:function:: socket.setsockopt(self, opt, optval)

    only accepts ``bytes`` for `optval` (or ``int``, depending on `opt`)

    ``TypeError`` if ``unicode`` or anything else

.. py:function:: socket.getsockopt(self, opt)

    returns ``bytes`` (or ``int``), never ``unicode``

.. py:function:: socket.setsockopt_unicode(self, opt, unicode optval, encoding='utf-8')

    accepts ``unicode`` string for `optval`

    encodes `optval` with `encoding` before passing the ``bytes`` to
    `setsockopt`

.. py:function:: socket.getsockopt_unicode(self, opt, encoding='utf-8')

    always returns ``unicode`` string, after decoding with `encoding`

    note that `zmq.IDENTITY` is the only `sockopt` with a string value
    that can be queried with `getsockopt`