Encoding and Decoding in Python 3
This article was imported from my old, old, old site.
What
This is an article for people wanting to understand the following error with Python 3:
... UnicodeEncodeError: 'charmap' codec can't encode character ...
This is part of an error usually encountered when printing to a console, but it can appear in many other text-handling situations. If you are interested in fully understanding the root issue, instead of just copy-pasting Stack Overflow answers, read on.
The Problem
Internally, Python 3 strings are always Unicode. However, the source and/or destination of a string may use another encoding.
Python 2 does not store its strings as Unicode by default, so you may encounter all sorts of bizarre encoding and decoding samples that are only applicable to Python 2. This article only concerns Python 3.
How print Works
print is a pretty basic function: it tries to convert whatever you give it to a string, adds a newline, and then passes it on to sys.stdout.write. It has other options, but we're only interested in its default use at the moment.
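As a rough sketch, ignoring print's sep, end, file and flush arguments, the default behaviour is approximately:

import sys

value = 42
# print(value) behaves roughly like this with default arguments:
sys.stdout.write(str(value) + '\n')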
The sys.stdout.write function is more interesting:
The character encoding is platform-dependent. Under Windows, if the stream is interactive (that is, if its isatty() method returns True), the console codepage is used, otherwise the ANSI code page. Under other platforms, the locale encoding is used (see locale.getpreferredencoding()).
This means that sys.stdout.write will re-encode whatever you give it to suit the context. As a result, we are required to give it text that is suitable for re-encoding; if we don't, we get the encoding error that inspired this article.
Since the documentation is horrifyingly difficult to traverse, we can investigate a bit using Python itself:
>>> import sys
>>> type(sys.stdout)
<class '_io.TextIOWrapper'>
As you can see, sys.stdout is actually a TextIOWrapper, which has some further relevant documentation.
>>> print(sys.stdout.errors)
strict
>>> print(sys.stdout.encoding)
cp850
>>> print(sys.stdout.line_buffering)
True
Now we can see that errors is set to strict (the cause of the encoding error), that Python has detected my console's encoding, and that it is using line buffering, which means the TextIOWrapper buffer is flushed on every newline character.
However, if we put all those lines into a script and then redirect the output to a file, the file contains:
<class '_io.TextIOWrapper'>
cp1252
strict
False
The encoding has changed and line buffering has been turned off. The same happens when the script output is piped to another program. So much for predictability; we must be wary.
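If you need to know which situation a script is running in, the isatty() method mentioned in the documentation quoted above can be checked:

import sys

# True when attached to an interactive console,
# False when the output is piped or redirected.
print(sys.stdout.isatty())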
Simple Error Example With print
>>> import sys
>>> sys.stdout.encoding
'cp850'
>>> print('hello')
hello
In the above example hello is a Unicode string, but the hello printed to the console has been re-encoded to cp850.
This all works fine while the string only uses characters available in cp850.
In the next example an attempt is made to print the mathematical delta increment sign, which does not exist in cp850 (the character appears as a triangle, ∆; the escape in Python is \u2206):
>>> print('\u2206')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "c:\py34\python-3.4.2.amd64\lib\encodings\cp850.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_map)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u2206' in position 0: character maps to <undefined>
Changing the Console for the Text
The only practical way to reliably display utf-8 text in a console is to change the console to utf-8 mode.
In the Windows command prompt the console's code page can be changed to utf-8 (cp65001) before starting Python using chcp (if using Git Bash, type chcp.com instead):
>chcp 65001
Active code page: 65001
>python
Python 3.4.2 ...
After doing so you can work with Unicode to your heart's content:
>>> import sys
>>> sys.stdout.encoding
'cp65001'
>>> print('\u2206')
∆
Doing the same in Linux involves setting UTF-8 as part of your locale. This is still something I need to play with myself, so I can't provide more information.
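You can at least check, from Python, which encodings have been derived from the environment. A minimal sketch using locale.getpreferredencoding(), which the documentation quoted earlier names as the source for non-interactive streams:

import sys
import locale

# Encoding Python chose for stdout in the current context (console or redirect).
print(sys.stdout.encoding)
# Encoding derived from the locale settings.
print(locale.getpreferredencoding())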
Cleaning the Text for the Console
It is not always practical to change the console for the text.
One may want to reliably print whatever the console is capable of printing in a given situation.
My solution is a very simple function that replaces characters not compatible with the current console.
Text cleaned in this manner will print without error (but incompatible characters will be replaced with ?).
import sys

def clean(text, mode='replace'):
    # Round-trip through the console's encoding, replacing any
    # characters it cannot represent, so the result is safe to print.
    encoding = sys.stdout.encoding
    if encoding != 'utf-8':
        text = text.encode(encoding, mode).decode(encoding)
    return text
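A quick usage sketch of the helper above:

# On a cp850 console this prints 'delta triangle: ?';
# on a utf-8 console the triangle survives intact.
print(clean('delta triangle: \u2206'))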
This causes the original value of the replaced characters to be lost, so if you need to accurately pipe, redirect or view the incompatible utf-8 characters, this is no good.
Also, this is inefficient, since the text will be encoded again. You could simply skip the decode and pass the bytes to sys.stdout.buffer.write instead of using print (more on this later).
Let's Grok Encode Decode
This is a series of little code samples that should clear up any misconceptions you might have about what the encode and decode functions do.
string.encode()
The string.encode('encoding') function encodes the string into a bytes object with the specified encoding.
Bytes are raw data: the bytes object itself has no encoding; the data stored in the bytes represents text in some encoding.
In Python 3 you encode from a (Unicode) string to bytes and decode from bytes back to a string (I stress this point because seeing contradictory Python 2 code can result in confusion).
>>> import sys
>>> sys.stdin.encoding
'cp850'
>>> encoded = 'hello'.encode('cp850')
>>> print(encoded)
b'hello'
>>> decoded = encoded.decode('cp850')
>>> print(decoded)
hello
>>> type(encoded)
<class 'bytes'>
>>> type(decoded)
<class 'str'>
bytes.decode()
The bytes.decode('encoding') function converts a bytes object representing encoded text into a (Unicode) string.
However, the bytes could represent text in any encoding (including utf-8), so the encoding must be specified.
If you decode bytes representing utf-8 text, the textual content does not change; you get back a string of the same characters.
The degrees-style symbol º (\u00ba, technically the masculine ordinal indicator) exists in both Unicode and cp850 (everything should exist in Unicode).
The next code example shows it working in each encoding, and the failures that occur when it is decoded with the wrong encoding.
You should be able to reproduce this example fine as long as your console is using one of these two encodings.
print(bytes)
When you ask Python to print a bytes object, it will show the corresponding ASCII character for each byte, if that is a printable character. If the corresponding ASCII character is non-printable, or the value is non-ASCII (above 127), the byte is shown as an escaped hex number; for example, ASCII 0x01 (SOH) is escaped as \x01. Most encodings are compatible with the printable characters of ASCII, so interpreting a byte as ASCII is a good guess at what is meant by it (but by no means guaranteed).
>>> print('\u00ba')
º
>>> degrees = '\u00ba'
>>> print(degrees)
º
>>> degreesCp850 = degrees.encode('cp850')
>>> print(degreesCp850)
b'\xa7'
>>> degreesUtf8 = degrees.encode('utf-8')
>>> print(degreesUtf8)
b'\xc2\xba'
>>> print(degreesCp850.decode('cp850'))
º
>>> print(degreesCp850.decode('utf-8'))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa7 in position 0: invalid start byte
>>> print(degreesUtf8.decode('utf-8'))
º
>>> print(degreesUtf8.decode('cp850'))
┬║
The next example uses a string containing only basic printable ASCII-compatible characters. It demonstrates that using the incorrect encoding may succeed when using only these basic characters. Most encodings are purposely designed to act like this; it creates a tolerance for bad encoding on simple text data.
>>> hello = 'hello'
>>> helloCp850 = hello.encode('cp850')
>>> helloUtf8 = hello.encode('utf-8')
>>> print(helloCp850)
b'hello'
>>> print(helloUtf8)
b'hello'
>>> print(helloCp850.decode('cp850'))
hello
>>> print(helloUtf8.decode('cp850'))
hello
>>> print(helloUtf8.decode('utf-8'))
hello
>>> print(helloCp850.decode('utf-8'))
hello
Direct Output To stdout
The final method of working with utf-8, or any specified encoding, is to send raw bytes directly to the stdout buffer, skipping the automatic encoding. This is the only way to guarantee that your encoding is kept intact, which is useful when the intent is to have utf-8 (or anything else) output piped or redirected to other places unmodified.
As mentioned before, sys.stdout.write(string) encodes by default; however, sys.stdout.buffer.write(bytes) delivers your bytes directly, unchanged.
In theory string.encode('utf-8') doesn't change the textual content at all; it just creates a bytes object holding the utf-8 representation so that we can make direct use of the data. This is the equivalent of "outputting the string directly".
The following simple Python script:
import sys

print('1 sys.stdout.encoding: ' + sys.stdout.encoding)
sys.stdout.buffer.write('2 hello\n'.encode())
sys.stdout.buffer.write('3 delta triangle: \u2206\n'.encode())
sys.stdout.buffer.write('4 GBP currency symbol: £\n'.encode())
sys.stdout.buffer.write('5 delta triangle: \u2206\n'.encode('cp850', 'replace'))
sys.stdout.buffer.write('6 GBP currency symbol: £\n'.encode('cp850'))
Has the output:
1 sys.stdout.encoding: cp850
2 hello
3 delta triangle: Ôêå
4 GBP currency symbol: ┬ú
5 delta triangle: ?
6 GBP currency symbol: £
In this example lines 2, 3 and 4 were encoded to the default utf-8 before being output.
This works fine for pure ASCII characters (hello) that are encoding-compatible with most code pages, but outputs garbage when anything beyond basic ASCII is encountered, as seen on output lines 3 and 4.
The delta triangle character does not exist in cp850, so it is output as a ? when encoded correctly.
Note that the 'replace' option was specified; the default 'strict' mode would have broken the script with the error that inspired this article.
Notably, the British pound symbol £ can be displayed in cp850 (print('£') works); however, the encoding for this symbol is not the same as in utf-8, so it only appears when encoded correctly.
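A quick REPL check of the byte values makes the difference concrete (0x9c is the £ position in the cp850 code page, while its utf-8 sequence is two bytes):

>>> '£'.encode('utf-8')
b'\xc2\xa3'
>>> '£'.encode('cp850')
b'\x9c'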
Pitfalls of Writing to stdout.buffer
Now let's have a look at what happens when we redirect the output of the previous example to a text file (example.py > x.txt):
2 hello
3 delta triangle: âˆ†
4 GBP currency symbol: Â£
5 delta triangle: ?
6 GBP currency symbol: œ
1 sys.stdout.encoding: cp1252
The output is in the wrong order; the text output with print appears last.
By default stdout is a TextIOWrapper, which inherits from TextIOBase. It has its own text buffer, separate from the underlying BufferedWriter (sys.stdout.buffer).
This extra middle buffer must be flushed (sys.stdout.flush()) before writing directly to the underlying buffer.
Since Python 3.3 print has a flush argument that will do the flush for you.
I'm not sure if sys.stdout.flush() triggers sys.stdout.buffer.flush(); feel free to investigate further.
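A minimal sketch of the fix, under the assumption that flushing the text layer before each raw write is enough to keep the order:

import sys

print('1 sys.stdout.encoding: ' + sys.stdout.encoding)
sys.stdout.flush()  # empty the TextIOWrapper buffer before bypassing it
sys.stdout.buffer.write('2 hello\n'.encode())

# Or, since Python 3.3, let print do the flushing itself:
print('3 another line', flush=True)
sys.stdout.buffer.write('4 more raw bytes\n'.encode())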
The Python documentation on this mentions line-buffering:
When interactive, standard streams are line-buffered. Otherwise, they are block-buffered like regular text files. You can override this value with the -u command-line option.
However this seems to have nothing to do with this issue: the Python interpreter and the console seem to always output immediately, regardless of the lack of line endings, and file output comes out of order even with line endings.
The _CHUNK_SIZE hack
Yet another sneaky method is to set the undocumented chunk size to one: sys.stdout._CHUNK_SIZE = 1.
This is the variable in TextIOWrapper that controls the frequency of text buffer flushes.
However, be careful: this requires at least 2 characters to be sent at a time to guarantee a flush; it only triggers a flush when there are more than _CHUNK_SIZE characters pending to be written. The minimum allowed value of _CHUNK_SIZE is 1, which means at least 2 characters must be pending to flush. It handles the entire passed string at once, so strings with a length of 2 or more will flush fully and instantly.
This is a handy hack if you have a lot of code lacking the necessary flushes (and none of those strings are shorter than 2 characters).
I can't say I fully understand the code, so don't use this hack in any critical code.
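A sketch of the hack in context, assuming _CHUNK_SIZE behaves as described above (it is internal and undocumented, so this may change between Python versions):

import sys

sys.stdout._CHUNK_SIZE = 1        # undocumented; forces frequent text-buffer flushes
sys.stdout.write('ab')            # two or more characters, so this should flush immediately
sys.stdout.buffer.write(b'raw\n') # raw bytes now land after the text above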
Other Methods of Changing Output
As mentioned in the documentation, the output encoding can be changed using the PYTHONIOENCODING environment variable.
This approach is not recommended because it can have side effects; it may break other scripts.
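For completeness, a sketch of setting it in the Windows command prompt (matching the earlier chcp examples; on Linux use export instead):

>set PYTHONIOENCODING=utf-8
>python example.py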
Detach For Pure Binary
It is possible to detach the TextIOWrapper from stdout; however, all this does is break print completely and change sys.stdout.write() so that it accepts bytes instead of strings. This is only useful when porting awkward code from Python 2.
import sys

# Detach returns the underlying stream, which is a _io.BufferedWriter.
# This is unlikely to be a good idea.
sys.stdout = sys.stdout.detach()
Detach and Replace With Encoder
A more interesting idea than just detaching is to replace stdout with a utf-8 encoder:
import sys
import codecs

# Wrap the detached binary stream in a utf-8 StreamWriter.
sys.stdout = codecs.getwriter('utf-8')(sys.stdout.detach())
The codecs.getwriter() function returns a writer factory. The detached stream is passed as a parameter to create a writer, which is then set as stdout.
Now all print and write calls will be encoded as utf-8 regardless of the context.
This has the same output effect as sys.stdout.buffer.write(mystring.encode()): non-ASCII text becomes garbled in the console, but piped and redirected text is utf-8, so you can fully work in utf-8.
Keep in mind that stdout.buffer does not exist anymore, since stdout is now an encodings.utf_8.StreamWriter.
This is useful when you have a script with many print statements and you want to force utf-8 output for all of them. It is especially handy when the output will be redirected to a file and you want that file to be utf-8.
A Pitfall When Piping
Be wary: if you pipe utf-8 to another script or program without changing the console code page to utf-8, the second script's stdin won't magically change to utf-8. If the second program is a Python script, it will detect the code page from the console the same as always, and the input will be decoded accordingly. You will have to force its stdin into utf-8 mode.
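A minimal sketch of forcing stdin into utf-8 mode, mirroring the getwriter() trick above (codecs.getreader() is the reading counterpart):

import sys
import codecs

# Detach the text wrapper from stdin and decode the raw bytes as utf-8.
sys.stdin = codecs.getreader('utf-8')(sys.stdin.detach())

for line in sys.stdin:
    print(line, end='')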
Last Words
It is annoying that everything doesn't work in Unicode by default, but this is legacy.
The 256-glyph, VGA-compatible text mode is still the universal fallback/startup mode for computers, meaning that the terminal will always need to be compatible with a basic old code page first.
- If you need Unicode you will have to specify it at both ends, in and out, or you risk failure.
- For visible console output, stick to ASCII.
-t3hmun