2005-08-29

All the character codes in the world

This is not a proposal to change standards in any respect. It's just a thought-out (well, somewhat) approach for people who have to represent character codes as opposed to characters, and have 32 bits to play with.

The intent is to represent all the codes of all the registered character sets, present and future, as individual unsigned 31-bit integers. All further numbers in this post, except 94, 96, and 2022, are base 16.

Unicode codes are mapped onto the integers 0-10FFFF in the obvious way. The registered character sets of ISO 2022 are represented by codes from 20000000 upward.

The detailed roadmap is as follows:

  • 00000000-0010FFFF: Unicode
  • 00110000-1FFFFFFF: reserved
  • 20000000-2003FFFF: ISO 2022 94-char, 96-char, C0, and C1 character sets
  • 20040000-2093FFFF: ISO 2022 94x94/96x96-char character sets
  • 20940000-5693FFFF: ISO 2022 94x94x94/96x96x96-char character sets
  • 56940000-7FFFFFFF: reserved
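
As a rough illustration of the roadmap, here is a minimal Python sketch that reports which region a given code falls into. The region labels and the names `REGIONS` and `classify` are made up for this example and are not part of the scheme; hex values are written with Python's `0x` prefix.

```python
# Regions of the 31-bit code space, copied from the roadmap.
# The labels are illustrative names only, not part of the scheme.
REGIONS = [
    (0x00000000, 0x0010FFFF, "unicode"),
    (0x00110000, 0x1FFFFFFF, "reserved"),
    (0x20000000, 0x2003FFFF, "iso2022-1byte"),
    (0x20040000, 0x2093FFFF, "iso2022-2byte"),
    (0x20940000, 0x5693FFFF, "iso2022-3byte"),
    (0x56940000, 0x7FFFFFFF, "reserved"),
]


def classify(code: int) -> str:
    """Return the label of the roadmap region containing a 31-bit code."""
    if not 0 <= code <= 0x7FFFFFFF:
        raise ValueError("code must fit in 31 bits")
    for low, high, label in REGIONS:
        if low <= code <= high:
            return label
    # The regions are contiguous, so every 31-bit value matches one of them.
    raise AssertionError("unreachable")
```

Because the regions are contiguous and cover 0-7FFFFFFF, a simple linear scan is enough here; a lookup keyed on the high bits would also work.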

Definitions for ISO 2022 character sets:

  • Every character set has an ISO-specified value between 40 and 7E, called F.
  • Some character sets have an ISO-specified value between 21 and 2F, called I. If I is not present, it is deemed for our purposes to be 20.
  • Individual characters in one-byte character sets have a value between 20 and 7F, called H.
  • Individual characters in two-byte character sets have two values between 20 and 7F, called H and L.
  • Individual characters in three-byte character sets have three values between 20 and 7F, called H, M, and L.
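
The definitions above amount to a handful of range checks. The following Python sketch shows one way to apply them, including the rule that a missing I counts as 20. The function names `check_set_id` and `check_bytes` are hypothetical, and representing a missing I as `None` is an assumption of this example.

```python
from typing import Optional, Tuple


def check_set_id(f: int, i: Optional[int] = None) -> Tuple[int, int]:
    """Validate F and I, returning them with a missing I replaced by 20 (hex)."""
    if not 0x40 <= f <= 0x7E:
        raise ValueError("F must be between 40 and 7E (hex)")
    if i is None:
        i = 0x20  # no I value: deemed to be 20
    elif not 0x21 <= i <= 0x2F:
        raise ValueError("I, when present, must be between 21 and 2F (hex)")
    return f, i


def check_bytes(*values: int) -> Tuple[int, ...]:
    """Validate the one to three bytes (H, or H and L, or H, M, and L)."""
    if not 1 <= len(values) <= 3:
        raise ValueError("expected one, two, or three bytes")
    for v in values:
        if not 0x20 <= v <= 0x7F:
            raise ValueError("each byte must be between 20 and 7F (hex)")
    return values
```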

Values:

  • The value of a character in Unicode is its scalar value.
  • The value of a character in a 94-char character set is 20000000 + (I - 20) * 4000 + (F - 40) * 100 + H.
  • The value of a character in a 96-char character set is 20000000 + (I - 20) * 4000 + (F - 40) * 100 + H + 80.
  • The value of a character in a 94x94-char or 96x96-char character set is 20040000 + (I - 20) * 90000 + (F - 40) * 2400 + (H - 20) * 60 + (L - 20).
  • The value of a character in a 94x94x94-char or 96x96x96-char character set is 20940000 + (I - 20) * 3600000 + (F - 40) * D8000 + (H - 20) * 2400 + (M - 20) * 60 + (L - 20).
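
The ISO 2022 formulas can be sketched in Python as below. This is an illustration under a few assumptions, not a reference implementation: the function names are invented, inputs are taken to be already within the ranges given in the definitions, the terms of each formula are all added together, and the last byte of a three-byte code is offset by 20 in the same way as in the two-byte case, so that each set occupies exactly D8000 consecutive codes.

```python
def one_byte(f: int, h: int, i: int = 0x20, ninety_six: bool = False) -> int:
    """Code for a character in a 94-char set, or a 96-char set if ninety_six."""
    value = 0x20000000 + (i - 0x20) * 0x4000 + (f - 0x40) * 0x100 + h
    # 96-char sets are placed in the upper half of each 100 (hex) block.
    return value + 0x80 if ninety_six else value


def two_byte(f: int, h: int, l: int, i: int = 0x20) -> int:
    """Code for a character in a 94x94-char or 96x96-char set."""
    return (0x20040000 + (i - 0x20) * 0x90000 + (f - 0x40) * 0x2400
            + (h - 0x20) * 0x60 + (l - 0x20))


def three_byte(f: int, h: int, m: int, l: int, i: int = 0x20) -> int:
    """Code for a character in a 94x94x94-char or 96x96x96-char set.

    Assumption: L is offset by 20 (hex), mirroring the two-byte formula.
    """
    return (0x20940000 + (i - 0x20) * 0x3600000 + (f - 0x40) * 0xD8000
            + (h - 0x20) * 0x2400 + (m - 0x20) * 0x60 + (l - 0x20))
```

For instance, with F = 42, no I, and H = 41, `one_byte(0x42, 0x41)` gives 20000241; with F = 42, H = 30, and L = 21, `two_byte(0x42, 0x30, 0x21)` gives 20044E01.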

This scheme was inspired by a related scheme by Markus Kuhn.
