6.2. Locale Encoding

  • ASCII

  • ASCII Extended

  • ISO-8859

  • Windows Encoding

  • Unicode Encoding

6.2.1. ASCII

  • ASCII - American Standard Code for Information Interchange

  • 7-bit encoding

  • From 0b0000000 to 0b1111111 (0 to 127)

../../_images/encoding-ascii.png
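
The 7-bit limit can be checked directly in Python. The snippet below is a minimal sketch using only built-ins; the sample strings are arbitrary.

```python
# Every ASCII character has a code point in the range 0..127 (7 bits).
text = 'Hello'
codes = [ord(char) for char in text]
print(codes)

# Encoding to ASCII yields exactly one byte per character.
data = text.encode('ascii')
print(data)

# Characters outside the 7-bit range cannot be encoded as ASCII.
try:
    'cześć'.encode('ascii')
except UnicodeEncodeError as error:
    failed = True
    print('Cannot encode:', error.reason)
else:
    failed = False

# str.isascii() reports whether all characters fit in 7 bits.
print(text.isascii(), 'cześć'.isascii())
```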

6.2.2. ASCII Extended

  • 8-bit encoding

  • From 0b00000000 to 0b11111111 (0 to 255)

Extended ASCII is a repertoire of character encodings that include (most of) the original 96 ASCII character set, plus up to 128 additional characters. There is no formal definition of "extended ASCII", and even use of the term is sometimes criticized, because it can be mistakenly interpreted to mean that the American National Standards Institute (ANSI) had updated its ANSI X3.4-1986 standard to include more characters, or that the term identifies a single unambiguous encoding, neither of which is the case. [1]

Several variants of the 8-bit ASCII table exist. The table below follows Windows-1252 (CP-1252), which is a superset of ISO 8859-1 (also called ISO Latin-1) in terms of printable characters. It differs from the IANA's ISO-8859-1 by placing displayable characters, rather than control characters, in the 128 to 159 range [3].

../../_images/encoding-extended.png
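
The difference in the 128 to 159 range can be seen by decoding the same byte with both codecs. This is a small sketch; the byte 0x80 is just one example from that range.

```python
import unicodedata

# One byte from the 128..159 range.
raw = bytes([0x80])

# Windows-1252 maps it to a printable character (the euro sign).
as_cp1252 = raw.decode('cp1252')

# ISO-8859-1 maps it to a C1 control character.
as_latin1 = raw.decode('iso-8859-1')

print(repr(as_cp1252), unicodedata.category(as_cp1252))
print(repr(as_latin1), unicodedata.category(as_latin1))

# The lower half (0..127) is identical to ASCII in both encodings.
lower = bytes(range(128))
same_lower_half = lower.decode('cp1252') == lower.decode('iso-8859-1') == lower.decode('ascii')
print(same_lower_half)
```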

6.2.3. ISO-8859

  • ISO - International Organization for Standardization

  • ISO-8859 - character encoding standard

  • ISO-8859-1 - Western European (Latin-1)

  • ISO-8859-2 - Central European (Latin-2)

  • ISO-8859-3 - South European (Latin-3)

  • ISO-8859-4 - North European (Latin-4)

  • ISO-8859-5 - Latin/Cyrillic

  • ISO-8859-6 - Latin/Arabic

  • ISO-8859-7 - Latin/Greek

  • ISO-8859-8 - Latin/Hebrew

  • ISO-8859-9 - Turkish (Latin-5)

  • ISO-8859-10 - Nordic (Latin-6)

  • ISO-8859-11 - Latin/Thai

  • ISO-8859-12 - Latin/Devanagari (abandoned)

  • ISO-8859-13 - Baltic Rim (Latin-7)

  • ISO-8859-14 - Celtic (Latin-8)

  • ISO-8859-15 - Latin-9 - A revision of 8859-1 that replaces some little-used symbols with the euro sign € and the letters Š, š, Ž, ž, Œ, œ, and Ÿ

  • ISO-8859-16 - South-Eastern European (Latin-10) - Intended for Albanian, Croatian, Hungarian, Italian, Polish, Romanian and Slovene, but also Finnish, French, German and Irish Gaelic

../../_images/encoding-iso-8859.png
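
Because every part of ISO-8859 reuses the same 256 byte values, a byte means nothing without knowing which part was used. A short sketch, with the byte values 0xE6 and 0xA4 chosen only for illustration:

```python
# The same byte decodes to a different letter in each part of ISO-8859.
raw = bytes([0xE6])
parts = ['iso-8859-1', 'iso-8859-2', 'iso-8859-5', 'iso-8859-7']
decoded = {name: raw.decode(name) for name in parts}
print(decoded)

# ISO-8859-15 has the euro sign, ISO-8859-1 does not.
euro_latin9 = '€'.encode('iso-8859-15')
try:
    '€'.encode('iso-8859-1')
except UnicodeEncodeError:
    euro_in_latin1 = False
else:
    euro_in_latin1 = True
print(euro_latin9, euro_in_latin1)
```
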
>>> text = 'cześć'
>>>
>>> with open('/tmp/myfile.txt', mode='wt', encoding='iso-8859-2') as file:
...     file.write(text + '\n')
6
$ file /tmp/myfile.txt
/tmp/myfile.txt: ISO-8859 text
$ cat /tmp/myfile.txt
cze��
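
The garbled `cat` output is what a reader sees when ISO-8859-2 bytes are interpreted as UTF-8, assuming the terminal expects UTF-8. The sketch below reproduces the effect in memory, without touching the file:

```python
text = 'cześć'

# Bytes as they would be stored in the file (without the newline).
data = text.encode('iso-8859-2')
print(data)

# Strict UTF-8 decoding rejects these bytes.
try:
    data.decode('utf-8')
except UnicodeDecodeError as error:
    strict_failed = True
    print('Invalid UTF-8:', error.reason)
else:
    strict_failed = False

# With errors='replace' the invalid bytes become U+FFFD placeholders,
# similar to what the terminal displays.
replaced = data.decode('utf-8', errors='replace')
print(replaced)

# Decoding with the encoding used for writing restores the text.
restored = data.decode('iso-8859-2')
print(restored)
```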

6.2.4. Windows Encoding

  • Windows is a registered trademark of Microsoft

  • windows-1250 is also called cp1250

  • CP - Code Page

  • cp42 – Windows Symbol

  • cp874 – Windows Thai

  • cp1250 – Windows Central Europe

  • cp1251 – Windows Cyrillic

  • cp1252 – Windows Western

  • cp1253 – Windows Greek

  • cp1254 – Windows Turkish

  • cp1255 – Windows Hebrew

  • cp1256 – Windows Arabic

  • cp1257 – Windows Baltic

  • cp1258 – Windows Vietnamese

  • https://en.wikipedia.org/wiki/Windows_code_page

These code pages are used by Microsoft in its own Windows operating system. Microsoft defined a number of code pages known as the ANSI code pages (the first one, 1252, was based on an apocryphal ANSI draft of what became ISO 8859-1). Code page 1252 is built on ISO 8859-1 but uses the range 0x80-0x9F for extra printable characters rather than the C1 control codes from ISO 6429 mentioned by ISO 8859-1. Some of the others are based in part on other parts of ISO 8859, but are often rearranged to make them closer to 1252 [2].

Microsoft recommends new applications use UTF-8 or UCS-2/UTF-16 instead of these code pages [2].

>>> text = 'cześć'
>>>
>>> with open('/tmp/myfile.txt', mode='wt', encoding='cp1250') as file:
...     file.write(text + '\n')
6
$ file /tmp/myfile.txt
/tmp/myfile.txt: Non-ISO extended-ASCII text
$ cat /tmp/myfile.txt
cze��
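
cp1250 and ISO-8859-2 cover a similar Central European repertoire but are not byte-compatible. A minimal sketch comparing the two for the same text:

```python
text = 'cześć'

as_cp1250 = text.encode('cp1250')
as_iso = text.encode('iso-8859-2')
print(as_cp1250)
print(as_iso)

# 'ć' has the same byte in both, but 'ś' does not.
differing = [
    (char, text[index:index + 1].encode('cp1250'), text[index:index + 1].encode('iso-8859-2'))
    for index, char in enumerate(text)
    if char.encode('cp1250') != char.encode('iso-8859-2')
]
print(differing)

# Reading cp1250 bytes as ISO-8859-2 silently produces wrong text
# (a control character instead of 'ś'), not an error.
wrong = as_cp1250.decode('iso-8859-2')
print(repr(wrong))
```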

6.2.5. Unicode Encoding

>>> text = 'cześć'
>>> text.encode()
b'cze\xc5\x9b\xc4\x87'
>>> text = 'cześć'
>>>
>>> with open('/tmp/myfile.txt', mode='wt', encoding='utf-8') as file:
...     file.write(text + '\n')
6
$ file /tmp/myfile.txt
/tmp/myfile.txt: Unicode text, UTF-8 text
$ cat /tmp/myfile.txt
cześć
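
Unicode assigns each character a code point; UTF-8, UTF-16 and UTF-32 are different ways of turning those code points into bytes. A small sketch (the `-le` codec variants are used here so that no byte order mark is added):

```python
text = 'cześć'

# Code points are independent of any byte encoding.
code_points = [f'U+{ord(char):04X}' for char in text]
print(code_points)

# The same five code points need a different number of bytes per encoding.
sizes = {name: len(text.encode(name)) for name in ('utf-8', 'utf-16-le', 'utf-32-le')}
print(sizes)

# Each encoding round-trips back to the same string.
round_trips = all(text.encode(name).decode(name) == text for name in sizes)
print(round_trips)
```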

6.2.6. UTF-32

  • Fixed-length encoding

  • 4 bytes per character

  • Supports all Unicode characters

>>> text = 'cześć'
>>> text.encode('utf-32')
b'\xff\xfe\x00\x00c\x00\x00\x00z\x00\x00\x00e\x00\x00\x00[\x01\x00\x00\x07\x01\x00\x00'
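
The first four bytes in the output above are a byte order mark (BOM), which the plain `'utf-32'` codec adds when encoding. A short sketch showing the fixed width and the explicit byte-order variants:

```python
import codecs

text = 'cześć'

with_bom = text.encode('utf-32')      # native byte order, BOM prepended
little = text.encode('utf-32-le')     # little-endian, no BOM
big = text.encode('utf-32-be')        # big-endian, no BOM

# Exactly 4 bytes per character, plus 4 bytes for the BOM if present.
print(len(little), len(big), len(with_bom))

# The BOM tells the decoder which byte order was used.
has_bom = with_bom[:4] in (codecs.BOM_UTF32_LE, codecs.BOM_UTF32_BE)
print(has_bom)
```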

6.2.7. UTF-16

  • Variable-length encoding

  • 2 or 4 bytes per character (characters outside the Basic Multilingual Plane use a surrogate pair)

  • Supports all Unicode characters

>>> text = 'cześć'
>>> text.encode('utf-16')
b'\xff\xfec\x00z\x00e\x00[\x01\x07\x01'
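
Characters above U+FFFF do not fit in a single 16-bit unit, so UTF-16 stores them as two units (a surrogate pair). A minimal sketch; U+1F600 is used only as an example of such a character:

```python
bmp_char = 'ś'                 # U+015B, inside the Basic Multilingual Plane
astral_char = '\U0001F600'     # U+1F600, outside it

# One 16-bit unit versus two.
bmp_bytes = bmp_char.encode('utf-16-le')
astral_bytes = astral_char.encode('utf-16-le')
print(len(bmp_bytes), len(astral_bytes))

# In big-endian order the surrogate pair is easy to read: D83D DE00.
pair = astral_char.encode('utf-16-be')
print(pair.hex())
```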

6.2.8. UTF-8

  • Variable-length encoding

  • 1 to 4 bytes per character

  • Supports all Unicode characters

  • Most common encoding for web pages

  • Compatible with ASCII

>>> text = 'cześć'
>>> text.encode('utf-8')
b'cze\xc5\x9b\xc4\x87'
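
The 1 to 4 byte range and the ASCII compatibility can both be verified with a few sample characters. This is a sketch; the samples are arbitrary picks, one for each length:

```python
# One character for each possible UTF-8 length.
samples = ['a', 'ś', '€', '\U0001F600']
lengths = [len(char.encode('utf-8')) for char in samples]
print(lengths)

# Pure ASCII text produces identical bytes in ASCII and UTF-8.
ascii_compatible = 'hello'.encode('utf-8') == 'hello'.encode('ascii')
print(ascii_compatible)
```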

6.2.9. Default

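  • UTF-8 is the default for str.encode() and bytes.decode()

  • open() without encoding uses the locale's preferred encoding, which is UTF-8 on most Linux and macOS systems but may differ elsewhere (for example on Windows)

  • Pass encoding='utf-8' explicitly for portable code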

>>> text = 'cześć'
>>>
>>> with open('/tmp/myfile.txt', mode='wt') as file:
...     file.write(text + '\n')
6
$ file /tmp/myfile.txt
/tmp/myfile.txt: Unicode text, UTF-8 text
$ cat /tmp/myfile.txt
cześć
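
To see which defaults apply on a given machine, the standard library can be queried directly. The sketch below writes to a temporary directory rather than /tmp/myfile.txt so that it runs on any platform:

```python
import locale
import sys
import tempfile
from pathlib import Path

# Default for str.encode() and bytes.decode().
str_default = sys.getdefaultencoding()

# Locale-dependent encoding; the value varies between systems.
file_default = locale.getpreferredencoding(False)
print(str_default, file_default)

# Passing encoding explicitly makes the result independent of the locale.
text = 'cześć'
with tempfile.TemporaryDirectory() as directory:
    path = Path(directory) / 'myfile.txt'
    with open(path, mode='wt', encoding='utf-8') as file:
        file.write(text + '\n')
    raw = path.read_bytes()

print(raw)
```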

6.2.10. References