User:ArrowHead294/UTF-8 extensions

This table illustrates how Unicode code points are converted into UTF-8, specifically which code points correspond to one-, two-, three-, four-, five-, and six-byte sequences. The four-byte sequences are split at U+10FFFF, because UTF-8 is currently restricted to that maximum to match the constraints of UTF-16; the extensions beyond it are shown anyway to demonstrate that UTF-8 is capable of encoding up to 2^31 − 1 = 0x7FFFFFFF without using the bytes FE and FF.
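The upper limit can be checked by counting payload bits. The following is a minimal sketch, not part of the original page; the helper name `payload_bits` is made up for illustration. It assumes the usual layout in which an n-byte lead byte starts with n one-bits and a zero, and each continuation byte carries six bits.

```python
def payload_bits(n: int) -> int:
    """Code point bits carried by an n-byte sequence (1 <= n <= 6)."""
    if not 1 <= n <= 6:
        raise ValueError("sequence length must be between 1 and 6")
    if n == 1:
        return 7  # 0xxxxxxx
    # The lead byte spends n + 1 bits on its marker, leaving 7 - n for data;
    # each of the n - 1 continuation bytes (10xxxxxx) adds 6 more.
    return (7 - n) + 6 * (n - 1)


for n in range(1, 7):
    bits = payload_bits(n)
    print(f"{n} byte(s): {bits} bits, up to U+{(1 << bits) - 1:04X}")
```

For six bytes this gives 31 bits, which is where the 0x7FFFFFFF ceiling comes from.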

Code point ↔ UTF-8 conversion
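| Code point, first | Code point, last | Byte 1 | Byte 2 | Byte 3 | Byte 4 | Byte 5 | Byte 6 | UTF-8 bytes (hex), first | UTF-8 bytes (hex), last |
|---|---|---|---|---|---|---|---|---|---|
| U+0000 | U+007F | 0yyyzzzz | | | | | | 00 | 7F |
| U+0080 | U+03FF | 110xxxyy | 10yyzzzz | | | | | C2 80 | CF BF |
| U+0400 | U+07FF | 110xxxyy | 10yyzzzz | | | | | D0 80 | DF BF |
| U+0800 | U+FFFF | 1110wwww | 10xxxxyy | 10yyzzzz | | | | E0 A0 80 | EF BF BF |
| U+010000 | U+10FFFF | 11110uvv | 10vvwwww | 10xxxxyy | 10yyzzzz | | | F0 90 80 80 | F4 8F BF BF |
| U+110000 | U+1FFFFF | 11110uvv | 10vvwwww | 10xxxxyy | 10yyzzzz | | | F4 90 80 80 | F7 BF BF BF |
| U+200000 | U+3FFFFFF | 111110tt | 10uuuuvv | 10vvwwww | 10xxxxyy | 10yyzzzz | | F8 88 80 80 80 | FB BF BF BF BF |
| U+4000000 | U+7FFFFFFF | 1111110s | 10sstttt | 10uuuuvv | 10vvwwww | 10xxxxyy | 10yyzzzz | FC 84 80 80 80 80 | FD BF BF BF BF BF |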
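The bit patterns in the table can be turned into a small encoder. What follows is a rough sketch under stated assumptions, not a standard API: `encode_extended_utf8` is a hypothetical helper that applies the one- to six-byte scheme to any value up to 0x7FFFFFFF. Unlike Python's built-in `utf-8` codec, it deliberately does not reject surrogates or values above U+10FFFF.

```python
def encode_extended_utf8(cp: int) -> bytes:
    """Encode cp (0 .. 0x7FFFFFFF) with the one- to six-byte UTF-8 scheme."""
    if not 0 <= cp <= 0x7FFFFFFF:
        raise ValueError("code point out of range")
    if cp < 0x80:
        return bytes([cp])  # single byte: 0yyyzzzz
    # (largest code point, sequence length, lead-byte marker bits)
    for limit, length, marker in (
        (0x7FF, 2, 0xC0),
        (0xFFFF, 3, 0xE0),
        (0x1FFFFF, 4, 0xF0),
        (0x3FFFFFF, 5, 0xF8),
        (0x7FFFFFFF, 6, 0xFC),
    ):
        if cp <= limit:
            break
    # Peel off six bits at a time, lowest first, for the continuation bytes.
    tail = []
    for _ in range(length - 1):
        tail.append(0x80 | (cp & 0x3F))
        cp >>= 6
    # Whatever remains fits in the lead byte's data bits.
    return bytes([marker | cp] + tail[::-1])
```

For example, `encode_extended_utf8(0x10FFFF)` returns the bytes F4 8F BF BF and `encode_extended_utf8(0x110000)` returns F4 90 80 80, matching the two four-byte rows on either side of the break.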