User:ArrowHead294/UTF-8 extensions
This table illustrates how Unicode code points are converted into UTF-8, specifically which code points correspond to one-, two-, three-, four-, five-, and six-byte sequences. The four-byte sequences are split at U+10FFFF because UTF-8 is currently restricted to that limit to match the constraints of UTF-16; the extensions beyond it are shown anyway to illustrate that UTF-8 is capable of encoding code points up to 2³¹ − 1 = 0x7FFFFFFF without using the bytes FE and FF.
| First code point | Last code point | Byte 1 | Byte 2 | Byte 3 | Byte 4 | Byte 5 | Byte 6 | First encoding (hex) | Last encoding (hex) |
|---|---|---|---|---|---|---|---|---|---|
| U+0000 | U+007F | 0yyyzzzz | | | | | | 00 | 7F |
| U+0080 | U+03FF | 110xxxyy | 10yyzzzz | | | | | C2 80 | CF BF |
| U+0400 | U+07FF | 110xxxyy | 10yyzzzz | | | | | D0 80 | DF BF |
| U+0800 | U+FFFF | 1110wwww | 10xxxxyy | 10yyzzzz | | | | E0 A0 80 | EF BF BF |
| U+010000 | U+10FFFF | 11110uvv | 10vvwwww | 10xxxxyy | 10yyzzzz | | | F0 90 80 80 | F4 8F BF BF |
| U+110000 | U+1FFFFF | 11110uvv | 10vvwwww | 10xxxxyy | 10yyzzzz | | | F4 90 80 80 | F7 BF BF BF |
| U+200000 | U+3FFFFFF | 111110tt | 10uuuuvv | 10vvwwww | 10xxxxyy | 10yyzzzz | | F8 88 80 80 80 | FB BF BF BF BF |
| U+4000000 | U+7FFFFFFF | 1111110s | 10sstttt | 10uuuuvv | 10vvwwww | 10xxxxyy | 10yyzzzz | FC 84 80 80 80 80 | FD BF BF BF BF BF |
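
The bit layouts in the table can be reproduced with a short Python sketch. This is an illustration only, not a standard API: the function name `encode_extended_utf8` is hypothetical, and the code follows the extended one- to six-byte scheme shown here, so it will emit sequences for values above U+10FFFF that conforming UTF-8 decoders reject. It also does not treat surrogates (U+D800 to U+DFFF) specially.

```python
def encode_extended_utf8(cp: int) -> bytes:
    """Encode cp with the extended 1- to 6-byte UTF-8 layout (hypothetical helper)."""
    if not 0 <= cp <= 0x7FFFFFFF:
        raise ValueError("code point outside 0..0x7FFFFFFF")
    if cp <= 0x7F:
        # One byte: 0yyyzzzz
        return bytes([cp])
    # (largest code point, lead-byte prefix, number of continuation bytes)
    layouts = (
        (0x7FF, 0xC0, 1),
        (0xFFFF, 0xE0, 2),
        (0x1FFFFF, 0xF0, 3),
        (0x3FFFFFF, 0xF8, 4),
        (0x7FFFFFFF, 0xFC, 5),
    )
    for limit, lead, n_cont in layouts:
        if cp <= limit:
            # The lead byte carries the highest bits of the code point.
            out = [lead | (cp >> (6 * n_cont))]
            # Each continuation byte is 10xxxxxx with the next six bits.
            for shift in range(6 * (n_cont - 1), -1, -6):
                out.append(0x80 | ((cp >> shift) & 0x3F))
            return bytes(out)
    raise AssertionError("unreachable: range was checked above")


if __name__ == "__main__":
    for cp in (0x7F, 0x80, 0x7FF, 0x800, 0x10FFFF, 0x110000, 0x200000, 0x7FFFFFFF):
        print(f"U+{cp:04X}:", encode_extended_utf8(cp).hex(" ").upper())
```

For code points up to U+10FFFF (excluding surrogates), the result should agree with Python's built-in `chr(cp).encode("utf-8")`; beyond that limit the output is only meaningful for the extended scheme in the table.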