User:ArrowHead294/UTF-8 extensions: Difference between revisions

From Xenharmonic Wiki
ArrowHead294 (talk | contribs)
No edit summary
ArrowHead294 (talk | contribs)
No edit summary
 
(4 intermediate revisions by the same user not shown)
Line 1: Line 1:
This table illustrates how Unicode code points are converted into UTF-8: specifically, which code-point ranges correspond to one-, two-, three-, four-, five-, and six-byte sequences. The four-byte row is split because UTF-8 is currently restricted to U+10FFFF to match the constraints of UTF-16; the ranges beyond that limit are included anyway to show that the same scheme can encode values up to {{nowrap|2<sup>31</sup> − 1}} =&nbsp;0x7FFFFFFF without using the bytes <code>FE</code> and <code>FF</code>.
{| class="wikitable"
{| class="wikitable"
|+ style="font-size: 105%;" | Code point ↔ UTF-8 conversion
|+ style="font-size: 105%;" | Code point ↔ UTF-8 conversion
|-
|-
! colspan="2" | Code point
! colspan="2" style="border-right: 4px solid black;" | Code point
! colspan="6" | Bytes
! colspan="6" style="border-right: 4px solid black;" | Bytes
! colspan="2" | Eight-byte UTF-8
|-
|-
! First
! First
! Last
! style="border-right: 4px solid black;" | Last
! 1
! 1
! 2
! 2
Line 12: Line 15:
! 4
! 4
! 5
! 5
! 6
! style="border-right: 4px solid black;" | 6
! First
! Last
|-
|-
| style="text-align: right;" | {{plaincode|U+0000}}
| style="text-align: right;" | {{plaincode|U+0000}}
| style="text-align: right;" | {{plaincode|U+007F}}
| style="border-right: 4px solid black; text-align: right;" | {{plaincode|U+007F}}
| {{plaincode|0''yyyzzzz''}}
| {{plaincode|0''yyyzzzz''}}
| colspan="5" style="background: darkgray;" |  
| colspan="5" style="background: darkgray; border-right: 4px solid black;" |  
| {{plaincode|00}}
| {{plaincode|7F}}
|-
|-
| style="text-align: right;" | {{plaincode|U+0080}}
| style="text-align: right;" | {{plaincode|U+0080}}<br />{{plaincode|U+0400}}
| style="text-align: right;" | {{plaincode|U+07FF}}
| style="border-right: 4px solid black; text-align: right;" | {{plaincode|U+03FF}}<br />{{plaincode|U+07FF}}
| {{plaincode|110''xxxyy''}}
| {{plaincode|110''xxxyy''}}
| {{plaincode|10''yyzzzz''}}
| {{plaincode|10''yyzzzz''}}
| colspan="4" style="background: darkgray;" |  
| colspan="4" style="background: darkgray; border-right: 4px solid black;" |  
| {{plaincode|C2&nbsp;80}}<br />{{plaincode|D0&nbsp;80}}
| {{plaincode|CF&nbsp;BF}}<br />{{plaincode|DF&nbsp;BF}}
|-
|-
| style="text-align: right;" | {{plaincode|U+0800}}
| style="text-align: right;" | {{plaincode|U+0800}}
| style="text-align: right;" | {{plaincode|U+FFFF}}
| style="border-right: 4px solid black; text-align: right;" | {{plaincode|U+FFFF}}
| {{plaincode|1110''wwww''}}
| {{plaincode|1110''wwww''}}
| {{plaincode|10''xxxxyy''}}
| {{plaincode|10''xxxxyy''}}
| {{plaincode|10''yyzzzz''}}
| {{plaincode|10''yyzzzz''}}
| colspan="3" style="background: darkgray;" |  
| colspan="3" style="background: darkgray; border-right: 4px solid black;" |  
| {{plaincode|E0&nbsp;A0&nbsp;80}}
| {{plaincode|EF&nbsp;BF&nbsp;BF}}
|-
|-
| style="text-align: right;" | {{plaincode|U+010000}}
| style="text-align: right;" | {{plaincode|U+010000}}<br />{{plaincode|U+110000}}
| style="text-align: right;" | {{plaincode|U+1FFFFF}}
| style="border-right: 4px solid black; text-align: right;" | {{plaincode|U+10FFFF}}<br />{{plaincode|U+1FFFFF}}
| {{plaincode|11110''uvv''}}
| {{plaincode|11110''uvv''}}
| {{plaincode|10''vvwwww''}}
| {{plaincode|10''vvwwww''}}
| {{plaincode|10''xxxxyy''}}
| {{plaincode|10''xxxxyy''}}
| {{plaincode|10''yyzzzz''}}
| {{plaincode|10''yyzzzz''}}
| colspan="2" style="background: darkgray;" |  
| colspan="2" style="background: darkgray; border-right: 4px solid black;" |  
| {{plaincode|F0&nbsp;90&nbsp;80&nbsp;80}}<br />{{plaincode|F4&nbsp;90&nbsp;80&nbsp;80}}
| {{plaincode|F4&nbsp;8F&nbsp;BF&nbsp;BF}}<br />{{plaincode|F7&nbsp;BF&nbsp;BF&nbsp;BF}}
|-
|-
| style="text-align: right;" | {{plaincode|U+200000}}
| style="text-align: right;" | {{plaincode|U+200000}}
| style="text-align: right;" | {{plaincode|U+3FFFFFF}}
| style="border-right: 4px solid black; text-align: right;" | {{plaincode|U+3FFFFFF}}
| {{plaincode|111110''tt''}}
| {{plaincode|111110''tt''}}
| {{plaincode|10''uuuuvv''}}
| {{plaincode|10''uuuuvv''}}
Line 47: Line 60:
| {{plaincode|10''xxxxyy''}}
| {{plaincode|10''xxxxyy''}}
| {{plaincode|10''yyzzzz''}}
| {{plaincode|10''yyzzzz''}}
| style="background: darkgray;" |  
| style="background: darkgray; border-right: 4px solid black;" |  
| {{plaincode|F8&nbsp;88&nbsp;80&nbsp;80&nbsp;80}}
| {{plaincode|FB&nbsp;BF&nbsp;BF&nbsp;BF&nbsp;BF}}
|-
|-
| style="text-align: right;" | {{plaincode|U+4000000}}
| style="text-align: right;" | {{plaincode|U+4000000}}
| style="text-align: right;" | {{plaincode|U+7FFFFFFF}}
| style="border-right: 4px solid black; text-align: right;" | {{plaincode|U+7FFFFFFF}}
| {{plaincode|1111110''s''}}
| {{plaincode|1111110''s''}}
| {{plaincode|10''sstttt''}}
| {{plaincode|10''sstttt''}}
Line 56: Line 71:
| {{plaincode|10''vvwwww''}}
| {{plaincode|10''vvwwww''}}
| {{plaincode|10''xxxxyy''}}
| {{plaincode|10''xxxxyy''}}
| {{plaincode|10''yyzzzz''}}
| style="border-right: 4px solid black;" | {{plaincode|10''yyzzzz''}}
| {{plaincode|FC&nbsp;84&nbsp;80&nbsp;80&nbsp;80&nbsp;80}}
| {{plaincode|FD&nbsp;BF&nbsp;BF&nbsp;BF&nbsp;BF&nbsp;BF}}
|}
|}

Latest revision as of 18:39, 23 February 2026

This table illustrates how Unicode code points are converted into UTF-8: specifically, which code-point ranges correspond to one-, two-, three-, four-, five-, and six-byte sequences. The four-byte row is split because UTF-8 is currently restricted to U+10FFFF to match the constraints of UTF-16; the ranges beyond that limit are included anyway to show that the same scheme can encode values up to 2³¹ − 1 = 0x7FFFFFFF without using the bytes FE and FF.
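The upper limit for each sequence length follows from counting payload bits: a one-byte sequence carries 7 bits, and for longer sequences the lead byte of an n-byte sequence carries 7 − n bits while each continuation byte carries 6. The following Python sketch reproduces that arithmetic; the function name `max_code_point` is made up for illustration and is not part of any library.

```python
def max_code_point(n: int) -> int:
    """Largest value an n-byte sequence (1 <= n <= 6) can carry."""
    if not 1 <= n <= 6:
        raise ValueError("n must be between 1 and 6")
    # One-byte sequences have 7 payload bits; otherwise the lead byte
    # contributes 7 - n bits and each of the n - 1 continuation bytes 6.
    bits = 7 if n == 1 else (7 - n) + 6 * (n - 1)
    return (1 << bits) - 1


for n in range(1, 7):
    print(n, f"U+{max_code_point(n):04X}")
```

The six-byte case gives 1 + 5 × 6 = 31 payload bits, which is where the 0x7FFFFFFF ceiling comes from.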

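{| class="wikitable"
|+ style="font-size: 105%;" | Code point ↔ UTF-8 conversion
|-
! colspan="2" style="border-right: 4px solid black;" | Code point
! colspan="6" style="border-right: 4px solid black;" | Bytes
! colspan="2" | Eight-byte UTF-8
|-
! First
! style="border-right: 4px solid black;" | Last
! 1
! 2
! 3
! 4
! 5
! style="border-right: 4px solid black;" | 6
! First
! Last
|-
| style="text-align: right;" | {{plaincode|U+0000}}
| style="border-right: 4px solid black; text-align: right;" | {{plaincode|U+007F}}
| {{plaincode|0''yyyzzzz''}}
| colspan="5" style="background: darkgray; border-right: 4px solid black;" | 
| {{plaincode|00}}
| {{plaincode|7F}}
|-
| style="text-align: right;" | {{plaincode|U+0080}}<br />{{plaincode|U+0400}}
| style="border-right: 4px solid black; text-align: right;" | {{plaincode|U+03FF}}<br />{{plaincode|U+07FF}}
| {{plaincode|110''xxxyy''}}
| {{plaincode|10''yyzzzz''}}
| colspan="4" style="background: darkgray; border-right: 4px solid black;" | 
| {{plaincode|C2&nbsp;80}}<br />{{plaincode|D0&nbsp;80}}
| {{plaincode|CF&nbsp;BF}}<br />{{plaincode|DF&nbsp;BF}}
|-
| style="text-align: right;" | {{plaincode|U+0800}}
| style="border-right: 4px solid black; text-align: right;" | {{plaincode|U+FFFF}}
| {{plaincode|1110''wwww''}}
| {{plaincode|10''xxxxyy''}}
| {{plaincode|10''yyzzzz''}}
| colspan="3" style="background: darkgray; border-right: 4px solid black;" | 
| {{plaincode|E0&nbsp;A0&nbsp;80}}
| {{plaincode|EF&nbsp;BF&nbsp;BF}}
|-
| style="text-align: right;" | {{plaincode|U+010000}}<br />{{plaincode|U+110000}}
| style="border-right: 4px solid black; text-align: right;" | {{plaincode|U+10FFFF}}<br />{{plaincode|U+1FFFFF}}
| {{plaincode|11110''uvv''}}
| {{plaincode|10''vvwwww''}}
| {{plaincode|10''xxxxyy''}}
| {{plaincode|10''yyzzzz''}}
| colspan="2" style="background: darkgray; border-right: 4px solid black;" | 
| {{plaincode|F0&nbsp;90&nbsp;80&nbsp;80}}<br />{{plaincode|F4&nbsp;90&nbsp;80&nbsp;80}}
| {{plaincode|F4&nbsp;8F&nbsp;BF&nbsp;BF}}<br />{{plaincode|F7&nbsp;BF&nbsp;BF&nbsp;BF}}
|-
| style="text-align: right;" | {{plaincode|U+200000}}
| style="border-right: 4px solid black; text-align: right;" | {{plaincode|U+3FFFFFF}}
| {{plaincode|111110''tt''}}
| {{plaincode|10''uuuuvv''}}
| {{plaincode|10''vvwwww''}}
| {{plaincode|10''xxxxyy''}}
| {{plaincode|10''yyzzzz''}}
| style="background: darkgray; border-right: 4px solid black;" | 
| {{plaincode|F8&nbsp;88&nbsp;80&nbsp;80&nbsp;80}}
| {{plaincode|FB&nbsp;BF&nbsp;BF&nbsp;BF&nbsp;BF}}
|-
| style="text-align: right;" | {{plaincode|U+4000000}}
| style="border-right: 4px solid black; text-align: right;" | {{plaincode|U+7FFFFFFF}}
| {{plaincode|1111110''s''}}
| {{plaincode|10''sstttt''}}
| {{plaincode|10''uuuuvv''}}
| {{plaincode|10''vvwwww''}}
| {{plaincode|10''xxxxyy''}}
| style="border-right: 4px solid black;" | {{plaincode|10''yyzzzz''}}
| {{plaincode|FC&nbsp;84&nbsp;80&nbsp;80&nbsp;80&nbsp;80}}
| {{plaincode|FD&nbsp;BF&nbsp;BF&nbsp;BF&nbsp;BF&nbsp;BF}}
|}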
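The bit layouts in the table can be turned into a small encoder. What follows is only a sketch under stated assumptions: `encode_extended`, `_LIMITS`, and `_LEAD` are hypothetical names, and the function deliberately accepts every value up to 0x7FFFFFFF (including surrogates and values above U+10FFFF), which a standards-conforming UTF-8 encoder would reject.

```python
# Largest code point representable at each sequence length (1 to 6 bytes).
_LIMITS = (0x7F, 0x7FF, 0xFFFF, 0x1FFFFF, 0x3FFFFFF, 0x7FFFFFFF)
# Marker bits of the lead byte for each length: 0xxxxxxx, 110xxxxx, ...
_LEAD = (0x00, 0xC0, 0xE0, 0xF0, 0xF8, 0xFC)


def encode_extended(cp: int) -> bytes:
    """Encode cp using the one- to six-byte layouts from the table."""
    if not 0 <= cp <= 0x7FFFFFFF:
        raise ValueError("code point out of range for six-byte UTF-8")
    # Pick the shortest length whose limit can hold the code point.
    length = next(i + 1 for i, limit in enumerate(_LIMITS) if cp <= limit)
    tail = []
    for _ in range(length - 1):
        # Each continuation byte is 10xxxxxx: the low six bits of cp.
        tail.append(0x80 | (cp & 0x3F))
        cp >>= 6
    # The remaining high bits go into the lead byte after its marker.
    return bytes([_LEAD[length - 1] | cp] + tail[::-1])


# The last code point of standard UTF-8 and the last six-byte value.
print(encode_extended(0x10FFFF).hex(" "))
print(encode_extended(0x7FFFFFFF).hex(" "))
```

For code points that standard UTF-8 allows, the result should agree with Python's built-in `str.encode("utf-8")`; the five- and six-byte outputs have no built-in counterpart to compare against.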