User:ArrowHead294/UTF-8 extensions: Difference between revisions
ArrowHead294 (talk | contribs) No edit summary |
ArrowHead294 (talk | contribs) No edit summary |
||
| (4 intermediate revisions by the same user not shown) | |||
| Line 1: | Line 1: | ||
This table illustrates how Unicode code points are converted into UTF-8, specifically which code points correspond to one-, two-, three-, four-, five-, and six-byte sequences. The four-byte range is split because UTF-8 is currently restricted to U+10FFFF to match the constraints of UTF-16; the extensions beyond that limit are shown anyway to illustrate that UTF-8 can encode values up to {{nowrap|2<sup>31</sup> − 1}} = 0x7FFFFFFF without using the bytes <code>FE</code> and <code>FF</code>. | |||
{| class="wikitable" | {| class="wikitable" | ||
|+ style="font-size: 105%;" | Code point ↔ UTF-8 conversion | |+ style="font-size: 105%;" | Code point ↔ UTF-8 conversion | ||
|- | |- | ||
! colspan="2" | Code point | ! colspan="2" style="border-right: 4px solid black;" | Code point | ||
! colspan="6" | Bytes | ! colspan="6" style="border-right: 4px solid black;" | Bytes | ||
! colspan="2" | Eight-byte UTF-8 | |||
|- | |- | ||
! First | ! First | ||
! Last | ! style="border-right: 4px solid black;" | Last | ||
! 1 | ! 1 | ||
! 2 | ! 2 | ||
| Line 12: | Line 15: | ||
! 4 | ! 4 | ||
! 5 | ! 5 | ||
! 6 | ! style="border-right: 4px solid black;" | 6 | ||
! First | |||
! Last | |||
|- | |- | ||
| style="text-align: right;" | {{plaincode|U+0000}} | | style="text-align: right;" | {{plaincode|U+0000}} | ||
| style="text-align: right;" | {{plaincode|U+007F}} | | style="border-right: 4px solid black; text-align: right;" | {{plaincode|U+007F}} | ||
| {{plaincode|0''yyyzzzz''}} | | {{plaincode|0''yyyzzzz''}} | ||
| colspan="5" style="background: darkgray;" | | | colspan="5" style="background: darkgray; border-right: 4px solid black;" | | ||
| {{plaincode|00}} | |||
| {{plaincode|7F}} | |||
|- | |- | ||
| style="text-align: right;" | {{plaincode|U+0080}} | | style="text-align: right;" | {{plaincode|U+0080}}<br />{{plaincode|U+0400}} | ||
| style="text-align: right;" | {{plaincode|U+07FF}} | | style="border-right: 4px solid black; text-align: right;" | {{plaincode|U+03FF}}<br />{{plaincode|U+07FF}} | ||
| {{plaincode|110''xxxyy''}} | | {{plaincode|110''xxxyy''}} | ||
| {{plaincode|10''yyzzzz''}} | | {{plaincode|10''yyzzzz''}} | ||
| colspan="4" style="background: darkgray;" | | | colspan="4" style="background: darkgray; border-right: 4px solid black;" | | ||
| {{plaincode|C2 80}}<br />{{plaincode|D0 80}} | |||
| {{plaincode|CF BF}}<br />{{plaincode|DF BF}} | |||
|- | |- | ||
| style="text-align: right;" | {{plaincode|U+0800}} | | style="text-align: right;" | {{plaincode|U+0800}} | ||
| style="text-align: right;" | {{plaincode|U+FFFF}} | | style="border-right: 4px solid black; text-align: right;" | {{plaincode|U+FFFF}} | ||
| {{plaincode|1110''wwww''}} | | {{plaincode|1110''wwww''}} | ||
| {{plaincode|10''xxxxyy''}} | | {{plaincode|10''xxxxyy''}} | ||
| {{plaincode|10''yyzzzz''}} | | {{plaincode|10''yyzzzz''}} | ||
| colspan="3" style="background: darkgray;" | | | colspan="3" style="background: darkgray; border-right: 4px solid black;" | | ||
| {{plaincode|E0 A0 80}} | |||
| {{plaincode|EF BF BF}} | |||
|- | |- | ||
| style="text-align: right;" | {{plaincode|U+010000}} | | style="text-align: right;" | {{plaincode|U+010000}}<br />{{plaincode|U+110000}} | ||
| style="text-align: right;" | {{plaincode|U+1FFFFF}} | | style="border-right: 4px solid black; text-align: right;" | {{plaincode|U+10FFFF}}<br />{{plaincode|U+1FFFFF}} | ||
| {{plaincode|11110''uvv''}} | | {{plaincode|11110''uvv''}} | ||
| {{plaincode|10''vvwwww''}} | | {{plaincode|10''vvwwww''}} | ||
| {{plaincode|10''xxxxyy''}} | | {{plaincode|10''xxxxyy''}} | ||
| {{plaincode|10''yyzzzz''}} | | {{plaincode|10''yyzzzz''}} | ||
| colspan="2" style="background: darkgray;" | | | colspan="2" style="background: darkgray; border-right: 4px solid black;" | | ||
| {{plaincode|F0 90 80 80}}<br />{{plaincode|F4 90 80 80}} | |||
| {{plaincode|F4 8F BF BF}}<br />{{plaincode|F7 BF BF BF}} | |||
|- | |- | ||
| style="text-align: right;" | {{plaincode|U+200000}} | | style="text-align: right;" | {{plaincode|U+200000}} | ||
| style="text-align: right;" | {{plaincode|U+3FFFFFF}} | | style="border-right: 4px solid black; text-align: right;" | {{plaincode|U+3FFFFFF}} | ||
| {{plaincode|111110''tt''}} | | {{plaincode|111110''tt''}} | ||
| {{plaincode|10''uuuuvv''}} | | {{plaincode|10''uuuuvv''}} | ||
| Line 47: | Line 60: | ||
| {{plaincode|10''xxxxyy''}} | | {{plaincode|10''xxxxyy''}} | ||
| {{plaincode|10''yyzzzz''}} | | {{plaincode|10''yyzzzz''}} | ||
| style="background: darkgray;" | | | style="background: darkgray; border-right: 4px solid black;" | | ||
| {{plaincode|F8 88 80 80 80}} | |||
| {{plaincode|FB BF BF BF BF}} | |||
|- | |- | ||
| style="text-align: right;" | {{plaincode|U+4000000}} | | style="text-align: right;" | {{plaincode|U+4000000}} | ||
| style="text-align: right;" | {{plaincode|U+7FFFFFFF}} | | style="border-right: 4px solid black; text-align: right;" | {{plaincode|U+7FFFFFFF}} | ||
| {{plaincode|1111110''s''}} | | {{plaincode|1111110''s''}} | ||
| {{plaincode|10''sstttt''}} | | {{plaincode|10''sstttt''}} | ||
| Line 56: | Line 71: | ||
| {{plaincode|10''vvwwww''}} | | {{plaincode|10''vvwwww''}} | ||
| {{plaincode|10''xxxxyy''}} | | {{plaincode|10''xxxxyy''}} | ||
| {{plaincode|10''yyzzzz''}} | | style="border-right: 4px solid black;" | {{plaincode|10''yyzzzz''}} | ||
| {{plaincode|FC 84 80 80 80 80}} | |||
| {{plaincode|FD BF BF BF BF BF}} | |||
|} | |} | ||
Latest revision as of 18:39, 23 February 2026
This table illustrates how Unicode code points are converted into UTF-8, specifically which code points correspond to one-, two-, three-, four-, five-, and six-byte sequences. The four-byte range is split because UTF-8 is currently restricted to U+10FFFF to match the constraints of UTF-16; the extensions beyond that limit are shown anyway to illustrate that UTF-8 can encode values up to 2<sup>31</sup> − 1 = 0x7FFFFFFF without using the bytes FE and FF.
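The 2<sup>31</sup> − 1 limit in the paragraph above follows from counting payload bits: a lead byte of an n-byte sequence (n ≥ 2) carries 7 − n bits and each continuation byte carries 6. The following is a minimal sketch of that arithmetic; the function names `payload_bits`, `max_code_point`, and `lead_byte_range` are made up for this illustration and are not part of any library.

```python
def payload_bits(length: int) -> int:
    """Number of code point bits an n-byte sequence can carry (1 <= n <= 6)."""
    if not 1 <= length <= 6:
        raise ValueError("sequence length must be between 1 and 6")
    if length == 1:
        return 7  # 0yyyzzzz
    # Lead byte: n one-bits, a zero bit, then 7 - n payload bits.
    # Each of the n - 1 continuation bytes (10xxxxxx) adds 6 bits.
    return (7 - length) + 6 * (length - 1)


def max_code_point(length: int) -> int:
    """Largest value representable in an n-byte sequence."""
    return (1 << payload_bits(length)) - 1


def lead_byte_range(length: int) -> tuple[int, int]:
    """Lowest and highest lead byte the bit pattern allows for n bytes.

    This is the structural range only; it does not exclude lead bytes
    that would produce overlong encodings (such as C0 and C1).
    """
    if not 1 <= length <= 6:
        raise ValueError("sequence length must be between 1 and 6")
    if length == 1:
        return (0x00, 0x7F)
    marker = (0xFF << (8 - length)) & 0xFF
    return (marker, marker | ((1 << (7 - length)) - 1))


if __name__ == "__main__":
    for n in range(1, 7):
        low, high = lead_byte_range(n)
        print(f"{n} bytes: {payload_bits(n):2d} bits, "
              f"max U+{max_code_point(n):04X}, lead {low:02X}..{high:02X}")
```

For six bytes this gives 31 payload bits and a highest lead byte of FD, which is why FE and FF are never needed.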
| First code point | Last code point | Byte 1 | Byte 2 | Byte 3 | Byte 4 | Byte 5 | Byte 6 | Eight-byte UTF-8: First | Eight-byte UTF-8: Last |
|---|---|---|---|---|---|---|---|---|---|
| U+0000 | U+007F | 0yyyzzzz | | | | | | 00 | 7F |
| U+0080 | U+03FF | 110xxxyy | 10yyzzzz | | | | | C2 80 | CF BF |
| U+0400 | U+07FF | 110xxxyy | 10yyzzzz | | | | | D0 80 | DF BF |
| U+0800 | U+FFFF | 1110wwww | 10xxxxyy | 10yyzzzz | | | | E0 A0 80 | EF BF BF |
| U+010000 | U+10FFFF | 11110uvv | 10vvwwww | 10xxxxyy | 10yyzzzz | | | F0 90 80 80 | F4 8F BF BF |
| U+110000 | U+1FFFFF | 11110uvv | 10vvwwww | 10xxxxyy | 10yyzzzz | | | F4 90 80 80 | F7 BF BF BF |
| U+200000 | U+3FFFFFF | 111110tt | 10uuuuvv | 10vvwwww | 10xxxxyy | 10yyzzzz | | F8 88 80 80 80 | FB BF BF BF BF |
| U+4000000 | U+7FFFFFFF | 1111110s | 10sstttt | 10uuuuvv | 10vvwwww | 10xxxxyy | 10yyzzzz | FC 84 80 80 80 80 | FD BF BF BF BF BF |
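The bit layouts in the table above can be checked mechanically. The sketch below is one possible encoder for the extended one- to six-byte scheme; the name `encode_extended_utf8` and the `_RANGES` table are hypothetical helpers written for this page, not an existing API. It assumes no special handling of surrogates (U+D800 to U+DFFF) and deliberately accepts values above U+10FFFF, so its output is not valid standard UTF-8 beyond that limit.

```python
# (last code point, sequence length, lead-byte marker) for 2..6 bytes,
# taken from the rows of the table.
_RANGES = [
    (0x7FF, 2, 0xC0),       # 110xxxyy
    (0xFFFF, 3, 0xE0),      # 1110wwww
    (0x1FFFFF, 4, 0xF0),    # 11110uvv
    (0x3FFFFFF, 5, 0xF8),   # 111110tt
    (0x7FFFFFFF, 6, 0xFC),  # 1111110s
]


def encode_extended_utf8(cp: int) -> bytes:
    """Encode a value in 0..0x7FFFFFFF using the 1- to 6-byte scheme."""
    if not 0 <= cp <= 0x7FFFFFFF:
        raise ValueError("value does not fit in six-byte UTF-8")
    if cp <= 0x7F:
        return bytes([cp])  # single byte, 0yyyzzzz

    # Pick the shortest sequence whose range contains cp.
    for last, length, marker in _RANGES:
        if cp <= last:
            break

    out = bytearray(length)
    # Fill continuation bytes from the end, 6 bits at a time.
    for i in range(length - 1, 0, -1):
        out[i] = 0x80 | (cp & 0x3F)
        cp >>= 6
    # Whatever bits remain go into the lead byte after the marker.
    out[0] = marker | cp
    return bytes(out)


if __name__ == "__main__":
    for value in (0x7F, 0x80, 0x7FF, 0x800, 0x10FFFF, 0x110000,
                  0x200000, 0x4000000, 0x7FFFFFFF):
        print(f"U+{value:04X} -> {encode_extended_utf8(value).hex(' ').upper()}")
```

Within the standard range (up to U+10FFFF, excluding surrogates) the result should agree with Python's built-in `str.encode("utf-8")`, which gives a quick sanity check on the lower rows of the table.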