User:ArrowHead294/UTF-8 extensions

This table illustrates how Unicode code points are converted into UTF-8: specifically, which code point ranges correspond to one-, two-, three-, four-, five-, and six-byte sequences. The four-byte range is split at U+10FFFF because UTF-8 is currently restricted to that maximum to match the limits of UTF-16. The ranges beyond it are shown anyway to illustrate that the encoding scheme can represent values up to 2³¹ − 1 = 0x7FFFFFFF without ever using the bytes FE and FF.
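The 2³¹ − 1 limit follows from counting payload bits: each continuation byte (`10……`) carries 6 bits, and the lead byte of an n-byte sequence (n ≥ 2) carries 7 − n bits. The following is a minimal sketch of that arithmetic; the function names `payload_bits` and `max_code_point` are illustrative and not taken from any library.

```python
def payload_bits(n: int) -> int:
    """Number of code point bits carried by an n-byte sequence (1 <= n <= 6)."""
    if not 1 <= n <= 6:
        raise ValueError("sequence length must be between 1 and 6")
    if n == 1:
        return 7                    # single byte: 0yyyzzzz
    lead_bits = 7 - n               # e.g. 110xxxyy carries 5 bits, 1111110s carries 1
    return lead_bits + 6 * (n - 1)  # each continuation byte carries 6 bits


def max_code_point(n: int) -> int:
    """Largest value representable in an n-byte sequence."""
    return (1 << payload_bits(n)) - 1


for n in range(1, 7):
    print(n, payload_bits(n), f"U+{max_code_point(n):04X}")
```

For six bytes this gives 31 payload bits, so the largest value is 0x7FFFFFFF, and the largest lead byte needed is FD.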

Code point ↔ UTF-8 conversion
First code point	Last code point	Byte 1	Byte 2	Byte 3	Byte 4	Byte 5	Byte 6	First encoding (hex)	Last encoding (hex)
`U+0000`	`U+007F`	`0yyyzzzz`						`00`	`7F`
`U+0080`	`U+03FF`	`110xxxyy`	`10yyzzzz`					`C2 80`	`CF BF`
`U+0400`	`U+07FF`	`110xxxyy`	`10yyzzzz`					`D0 80`	`DF BF`
`U+0800`	`U+FFFF`	`1110wwww`	`10xxxxyy`	`10yyzzzz`				`E0 A0 80`	`EF BF BF`
`U+010000`	`U+10FFFF`	`11110uvv`	`10vvwwww`	`10xxxxyy`	`10yyzzzz`			`F0 90 80 80`	`F4 8F BF BF`
`U+110000`	`U+1FFFFF`	`11110uvv`	`10vvwwww`	`10xxxxyy`	`10yyzzzz`			`F4 90 80 80`	`F7 BF BF BF`
`U+200000`	`U+3FFFFFF`	`111110tt`	`10uuuuvv`	`10vvwwww`	`10xxxxyy`	`10yyzzzz`		`F8 88 80 80 80`	`FB BF BF BF BF`
`U+4000000`	`U+7FFFFFFF`	`1111110s`	`10sstttt`	`10uuuuvv`	`10vvwwww`	`10xxxxyy`	`10yyzzzz`	`FC 84 80 80 80 80`	`FD BF BF BF BF BF`
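The bit layouts in the table can be turned into an encoder fairly directly. The following is a minimal sketch, assuming the extended one- to six-byte scheme shown above; `encode_extended_utf8` is a hypothetical name, not a standard library function. It does not reject surrogates (U+D800 to U+DFFF) or values above U+10FFFF, so its output is not necessarily valid standard UTF-8.

```python
def encode_extended_utf8(cp: int) -> bytes:
    """Encode an integer in 0..0x7FFFFFFF using the 1- to 6-byte layouts."""
    if not 0 <= cp <= 0x7FFFFFFF:
        raise ValueError("value out of range for six-byte UTF-8")
    if cp <= 0x7F:
        return bytes([cp])  # 0yyyzzzz

    # Upper limit of each multi-byte sequence length, as in the table.
    limits = [(0x7FF, 2), (0xFFFF, 3), (0x1FFFFF, 4),
              (0x3FFFFFF, 5), (0x7FFFFFFF, 6)]
    n = next(length for limit, length in limits if cp <= limit)

    # Lead byte: n one-bits, a zero bit, then the topmost payload bits.
    lead_prefix = (0xFF << (8 - n)) & 0xFF
    out = [lead_prefix | (cp >> (6 * (n - 1)))]

    # Continuation bytes: 10 followed by 6 payload bits, most significant first.
    for i in range(n - 2, -1, -1):
        out.append(0x80 | ((cp >> (6 * i)) & 0x3F))
    return bytes(out)


print(encode_extended_utf8(0x10FFFF).hex(" "))   # f4 8f bf bf
print(encode_extended_utf8(0x7FFFFFFF).hex(" ")) # fd bf bf bf bf bf
```

For code points that standard UTF-8 allows, the result should match Python's built-in `str.encode("utf-8")`; the five- and six-byte branches are only reachable for values beyond U+1FFFFF.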
