[8-bit encodings: ASCII, KOI-8R and CP1251] The first encoding tables created in the United States did not use the eighth bit of a byte. Text was represented as a sequence of bytes, but the eighth bit carried no character data (it was reserved for service purposes).

This table became the generally accepted standard ASCII (American Standard Code for Information Interchange). The first 32 characters of the ASCII table (00 to 1F) are non-printing control characters, designed to control a printing device and similar equipment. The rest, from 20 to 7F, are regular (printable) characters, apart from 7F (DEL).

Table 1 - ASCII encoding

Dec Hex Oct Char Description
0 0 000 null
1 1 001 start of heading
2 2 002 start of text
3 3 003 end of text
4 4 004 end of transmission
5 5 005 inquiry
6 6 006 acknowledge
7 7 007 bell
8 8 010 backspace
9 9 011 horizontal tab
10 A 012 new line
11 B 013 vertical tab
12 C 014 new page
13 D 015 carriage return
14 E 016 shift out
15 F 017 shift in
16 10 020 data link escape
17 11 021 device control 1
18 12 022 device control 2
19 13 023 device control 3
20 14 024 device control 4
21 15 025 negative acknowledge
22 16 026 synchronous idle
23 17 027 end of trans. block
24 18 030 cancel
25 19 031 end of medium
26 1A 032 substitute
27 1B 033 escape
28 1C 034 file separator
29 1D 035 group separator
30 1E 036 record separator
31 1F 037 unit separator
32 20 040 space
33 21 041 !
34 22 042 "
35 23 043 #
36 24 044 $
37 25 045 %
38 26 046 &
39 27 047 '
40 28 050 (
41 29 051 )
42 2A 052 *
43 2B 053 +
44 2C 054 ,
45 2D 055 -
46 2E 056 .
47 2F 057 /
48 30 060 0
49 31 061 1
50 32 062 2
51 33 063 3
52 34 064 4
53 35 065 5
54 36 066 6
55 37 067 7
56 38 070 8
57 39 071 9
58 3A 072 :
59 3B 073 ;
60 3C 074 <
61 3D 075 =
62 3E 076 >
63 3F 077 ?
64 40 100 @
65 41 101 A
66 42 102 B
67 43 103 C
68 44 104 D
69 45 105 E
70 46 106 F
71 47 107 G
72 48 110 H
73 49 111 I
74 4A 112 J
75 4B 113 K
76 4C 114 L
77 4D 115 M
78 4E 116 N
79 4F 117 O
80 50 120 P
81 51 121 Q
82 52 122 R
83 53 123 S
84 54 124 T
85 55 125 U
86 56 126 V
87 57 127 W
88 58 130 X
89 59 131 Y
90 5A 132 Z
91 5B 133 [
92 5C 134 \
93 5D 135 ]
94 5E 136 ^
95 5F 137 _
96 60 140 `
97 61 141 a
98 62 142 b
99 63 143 c
100 64 144 d
101 65 145 e
102 66 146 f
103 67 147 g
104 68 150 h
105 69 151 i
106 6A 152 j
107 6B 153 k
108 6C 154 l
109 6D 155 m
110 6E 156 n
111 6F 157 o
112 70 160 p
113 71 161 q
114 72 162 r
115 73 163 s
116 74 164 t
117 75 165 u
118 76 166 v
119 77 167 w
120 78 170 x
121 79 171 y
122 7A 172 z
123 7B 173 {
124 7C 174 |
125 7D 175 }
126 7E 176 ~
127 7F 177 DEL

As you can easily see, this encoding contains only the Latin letters used in English, along with arithmetic and other service symbols. It has neither Russian letters nor even the special Latin letters needed for German or French. This is easy to explain: the encoding was developed specifically as an American standard. As computers began to be used throughout the world, other characters needed to be encoded.

To do this, it was decided to use the eighth bit in each byte. This made 128 more values available (from 80 to FF) that could be used to encode characters. The first of the eight-bit tables, “extended ASCII” (Extended ASCII), included various variants of Latin characters used in some Western European languages. It also contained other additional symbols, including pseudographics.

Pseudographic characters allow you to provide some semblance of graphics by displaying only text characters on the screen. For example, the file management program FAR Manager works using pseudographics.

There were no Russian letters in the Extended ASCII table. Russia (formerly the USSR) and other countries created their own encodings that made it possible to represent specific “national” characters in 8-bit text files - Latin letters of the Polish and Czech languages, Cyrillic (including Russian letters) and other alphabets.

In all encodings that have become widespread, the first 128 characters (that is, the byte values with the eighth bit equal to 0) are the same as in ASCII. So an ASCII file is valid in any of these encodings; the letters of the English language are represented identically in all of them.
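
As a rough illustration of this compatibility, the Python sketch below decodes the same ASCII bytes with several 8-bit codecs. The codec names are those of Python's standard library; the sample string and variable names are arbitrary assumptions made for the example.

```python
# Bytes below 0x80 mean the same thing in every ASCII-compatible code page.
ascii_bytes = "Hello, world!".encode("ascii")

codecs_to_try = ["koi8_r", "cp1251", "cp866", "iso8859_5", "latin_1"]
decoded = {name: ascii_bytes.decode(name) for name in codecs_to_try}

# Every codec yields the same text because no byte has the eighth bit set.
all_same = len(set(decoded.values())) == 1
print(all_same)
```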

The International Organization for Standardization (ISO) adopted the ISO 8859 group of standards, which defines 8-bit encodings for different language groups. ISO 8859-1 is an Extended ASCII table for the USA and Western Europe, and ISO 8859-5 is a table for the Cyrillic alphabet (including Russian).

However, for historical reasons, the ISO 8859-5 encoding did not take root. In reality, the following encodings are used for the Russian language:

- Code Page 866 (CP866), also known as “DOS” or the “alternative GOST encoding”. Widely used until the mid-90s; now used to a limited extent. Practically not used for distributing texts on the Internet.
- KOI-8. Developed in the 70s and 80s. It is a generally accepted standard for transmitting email messages on the Russian Internet. Widely used in operating systems of the Unix family, including Linux. The version of KOI-8 designed for Russian is called KOI-8R; there are versions for other Cyrillic languages (for example, KOI8-U is the version for Ukrainian).
- Code Page 1251 (CP1251, Windows-1251). Developed by Microsoft to support the Russian language in Windows.

The main advantage of CP866 was that it kept the pseudographic characters in the same positions as in Extended ASCII, so foreign text-mode programs, such as the famous Norton Commander, could run without changes. CP866 is now used by Windows programs running in text windows or full-screen text mode, including FAR Manager.

Texts in CP866 have been quite rare in recent years (but it is used to encode Russian file names in Windows). Therefore, we will dwell in more detail on two other encodings - KOI-8R and CP1251.



As you can see, in the CP1251 encoding table the Russian letters are arranged in alphabetical order (with the exception, however, of the letter Ё). Thanks to this arrangement, it is very easy for computer programs to sort text alphabetically.

In KOI-8R, by contrast, the order of the Russian letters seems random. In reality this is not the case.
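
The difference in letter order can be seen by sorting words by their raw byte values, as a naive program that simply compares codes would do. This is only a sketch: the three sample words are arbitrary, and the codec names `cp1251` and `koi8_r` are the ones Python's standard library uses for these tables.

```python
words = ["дом", "цех", "бак"]

# Sort by raw byte values in each encoding, ignoring any language rules.
by_cp1251 = sorted(words, key=lambda w: w.encode("cp1251"))
by_koi8r = sorted(words, key=lambda w: w.encode("koi8_r"))

print(by_cp1251)  # byte order matches the Russian alphabet here
print(by_koi8r)   # byte order follows the KOI8-R layout instead
```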

In many older programs, the 8th bit was lost when text was processed or transmitted. (Such programs are now practically “extinct”, but in the late 80s and early 90s they were widespread.) To get the 7-bit value from an 8-bit value, just subtract 8 from the most significant hexadecimal digit; for example, E1 becomes 61.

Now compare KOI-8R with the ASCII table (Table 1). You will find that Russian letters are placed in clear correspondence with Latin ones. If the eighth bit disappears, lowercase Russian letters turn into uppercase Latin letters, and uppercase Russian letters turn into lowercase Latin letters. So, E1 in KOI-8 is the Russian “A”, while 61 in ASCII is the Latin “a”.

So KOI-8 keeps Russian text readable when the 8th bit is lost: “Привет всем” (“Hello everyone”) becomes “pRIWET WSEM”.
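
A minimal sketch of this effect, assuming Python's built-in `koi8_r` codec: the hypothetical helper `strip_eighth_bit` clears the most significant bit of every byte, which is what the old 7-bit programs effectively did to the Russian phrase «Привет всем» ("Hello everyone").

```python
def strip_eighth_bit(data: bytes) -> bytes:
    """Clear the most significant bit of every byte (0xE1 becomes 0x61)."""
    return bytes(b & 0x7F for b in data)

koi8_text = "Привет всем".encode("koi8_r")

# After the eighth bit is lost, the bytes are plain 7-bit ASCII.
seven_bit = strip_eighth_bit(koi8_text).decode("ascii")
print(seven_bit)
```

The result is still readable as a Latin transliteration with the letter case inverted, which is the property the KOI-8 layout was designed for.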

Nowadays, both the alphabetical order of characters in the encoding table and readability after the loss of the 8th bit have lost their decisive importance. The eighth bit is not lost during transmission or processing on modern computers, and alphabetical sorting takes the encoding into account rather than simply comparing codes. (By the way, the CP1251 codes are not arranged completely alphabetically: the letter Ё is out of place.)

Because there are two common encodings, when working with the Internet (mail, browsing Web sites) you can sometimes see a meaningless set of letters instead of Russian text. For example, “я СБЮФЕМХЕЛ”. This is just the phrase “С уважением” (“With respect”), but it was encoded in CP1251 and the computer decoded it using the KOI-8 table. If the same words were, on the contrary, encoded in KOI-8 and the computer decoded them using the CP1251 table, the result would be “у ХЧБЦЕОЙЕН”.
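
The mix-up described above can be reproduced with a few lines of Python. This is a sketch under the assumption that the original phrase was «С уважением» ("With respect") and that Python's `cp1251` and `koi8_r` codecs stand in for the tables used by the mail or browser program.

```python
phrase = "С уважением"  # "With respect"

# Encoded as CP1251 but decoded with the KOI8-R table.
cp1251_read_as_koi8 = phrase.encode("cp1251").decode("koi8_r")

# Encoded as KOI8-R but decoded with the CP1251 table.
koi8_read_as_cp1251 = phrase.encode("koi8_r").decode("cp1251")

# Undoing the wrong decoding step recovers the original text.
repaired = cp1251_read_as_koi8.encode("koi8_r").decode("cp1251")

print(cp1251_read_as_koi8)
print(koi8_read_as_cp1251)
print(repaired)
```

Selecting the encoding manually in a program menu amounts to the same repair: the bytes stay the same and only the table used to read them changes.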

Sometimes a computer decodes Russian-language text using a table that is not intended for Russian at all. Then, instead of Russian letters, a meaningless set of symbols appears (for example, Latin letters of Eastern European languages); these are often called “krokozyabry”.

In most cases, modern programs determine the encoding of Internet documents (emails and Web pages) on their own. But sometimes they “misfire”, and then you see strange sequences of Russian letters or “krokozyabry”. As a rule, to display the real text in such a situation, it is enough to select the encoding manually in the program menu.

Information from the page http://open-office.edusite.ru/TextProcessor/p5aa1.html was used for this article.


ASCII table

ASCII (American Standard Code for Information Interchange)

[Table: Summary table of ASCII codes]

[Table: Windows character code table (Win-1251), with the special characters Tab, LF (line feed), CR (carriage return) and SP (space)]

[Table: Extended ASCII code table]

Formatting symbols.

BS: Backspace. Indicates that the print mechanism or display cursor moves back one position.

HT: Horizontal Tabulation. Indicates that the print mechanism or display cursor moves to the next preset "tab stop".

LF: Line Feed. Indicates that the print mechanism or display cursor moves to the beginning of the next line (down one line).

VT: Vertical Tabulation. Indicates that the print mechanism or display cursor moves to the next group of lines.

FF: Form Feed. Indicates that the print mechanism or display cursor moves to the starting position of the next page, form, or screen.

CR: Carriage Return. Indicates that the print mechanism or display cursor moves to the home (leftmost) position of the current line.

Data transfer.

SOH: Start of Heading. Marks the start of a header, which may contain routing information or an address.

STX: Start of Text. Marks the beginning of the text and, at the same time, the end of the header.

ETX: End of Text. Ends text that began with the STX character.

ENQ: Inquiry. A request for identification data (such as "Who are you?") from a remote station.

ACK: Acknowledge. The receiving device sends this character to the sender to confirm successful reception of the data.

NAK: Negative Acknowledge. The receiving device sends this character to the sender when reception of the data has failed.

SYN: Synchronous Idle. Used in synchronous transmission systems. When no data is being transmitted, the system continuously sends SYN characters to maintain synchronization.

ETB: End of Transmission Block. Marks the end of a data block for communication purposes. Used to split large amounts of data into separate blocks.

Separators used when transmitting information: FS (file separator), GS (group separator), RS (record separator) and US (unit separator).

Other symbols.

NUL: Null (no character, no data). Transmitted when there is no data.

BEL: Bell. Used to control alarm devices.

SO: Shift Out. Indicates that all subsequent code combinations must be interpreted according to an external character set until the SI character arrives.

SI: Shift In. Indicates that subsequent code combinations must be interpreted according to the standard character set.

DLE: Data Link Escape. Changes the meaning of the characters that follow. Used for additional control or for transmitting an arbitrary combination of bits.

DC1, DC2, DC3, DC4: Device Controls. Characters for operating auxiliary devices (special functions).

CAN: Cancel. Indicates that the data preceding this character in a message or block should be ignored (usually because an error was detected).

EM: End of Medium. Indicates the physical end of a tape or other storage medium.

SUB: Substitute. Used in place of an erroneous or invalid character.

ESC: Escape. Used to extend the code by indicating that the character that follows has an alternative meaning.

SP: Space. A non-printing character that separates words or moves the print mechanism or display cursor forward one position.

DEL: Delete. Used to remove (erase) the previous character in a message.

The set of characters with which text is written is called an alphabet.

The number of characters in an alphabet is called its size (cardinality).

Formula for determining the amount of information: N = 2^b,

where N is the size of the alphabet (number of characters),

b is the number of bits (the information weight of one symbol).

An alphabet of 256 characters can accommodate almost all the necessary characters. Such an alphabet is called sufficient.

Since 256 = 2^8, the weight of 1 character is 8 bits.

The unit of measurement 8 bits was given the name 1 byte:

1 byte = 8 bits.

The binary code of each character in computer text takes up 1 byte of memory.
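
The relation between alphabet size and bits per character can be checked with a short sketch. The function name `bits_per_symbol` and the sample text are assumptions made for illustration; the function returns the smallest whole number of bits b for which 2^b is at least N.

```python
def bits_per_symbol(alphabet_size: int) -> int:
    """Smallest b such that 2**b >= alphabet_size."""
    if alphabet_size < 1:
        raise ValueError("alphabet must contain at least one character")
    return (alphabet_size - 1).bit_length()

bits = bits_per_symbol(256)          # a 256-character alphabet
text = "computer"
size_in_bits = len(text) * bits      # one 8-bit code per character
size_in_bytes = size_in_bits // 8

print(bits, size_in_bits, size_in_bytes)
```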

How is text information represented in computer memory?

The convenience of byte-by-byte character encoding is obvious: a byte is the smallest addressable unit of memory, so the processor can access each character separately when processing text. On the other hand, 256 characters are quite enough to represent a wide variety of symbolic information.

Now the question arises of which eight-bit binary code to assign to each character.

Clearly this is a matter of convention; many encoding methods can be devised.

All characters of the computer alphabet are numbered from 0 to 255. Each number corresponds to an eight-bit binary code from 00000000 to 11111111. This code is simply the serial number of the character in the binary number system.

A table in which all characters of the computer alphabet are assigned serial numbers is called an encoding table.
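
The link between a character's serial number and its eight-bit code can be sketched as follows. The helper names `code_of` and `char_of` are hypothetical, and the sketch relies on Python's `ord` and `chr`, whose numbering coincides with ASCII for numbers 0 to 127 (the upper half depends on the code page and is not modelled here).

```python
def code_of(ch: str) -> str:
    """Eight-bit binary code of a character numbered 0-255."""
    number = ord(ch)
    if number > 255:
        raise ValueError("character is outside the 256-entry table")
    return format(number, "08b")

def char_of(code: str) -> str:
    """Character whose serial number is given in binary."""
    return chr(int(code, 2))

print(ord("A"), code_of("A"))
print(char_of("01000001"))
```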

Different types of computers use different encoding tables.

The ASCII table (pronounced "ask-ee"), the American Standard Code for Information Interchange, has become the international standard for PCs.

The ASCII code table is divided into two parts.

Only the first half of the table is the international standard, i.e. symbols with numbers from 0 (00000000), up to 127 (01111111).

ASCII encoding table structure

Serial numbers 0 - 31 (codes 00000000 - 00011111)
Characters with numbers from 0 to 31 are usually called control characters. Their function is to control the process of displaying text on the screen or printing it, sounding a signal, marking up text, and so on.

Serial numbers 32 - 127 (codes 00100000 - 01111111)
The standard part of the table (English). It includes the lowercase and uppercase letters of the Latin alphabet, decimal digits, punctuation marks, all kinds of brackets, commercial and other symbols. Character 32 is a space, i.e. an empty position in the text. All the others are displayed as specific glyphs.

Serial numbers 128 - 255 (codes 10000000 - 11111111)
The alternative part of the table (Russian). The second half of the ASCII code table, called the code page (128 codes, from 10000000 to 11111111), can have different variants, and each variant has its own number. The code page is primarily used to accommodate national alphabets other than Latin. In Russian national encodings, the characters of the Russian alphabet are placed in this part of the table.

First half of the ASCII code table


Please note that in the encoding table, letters (uppercase and lowercase) are arranged in alphabetical order, and digits are arranged in ascending order. This observance of lexicographic order in the arrangement of symbols is called the principle of sequential coding of the alphabet.

For letters of the Russian alphabet, the principle of sequential coding is also observed.
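
Sequential coding is easy to observe in practice. The sketch below checks it for the Latin uppercase letters, uses it to turn a digit character into its value, and lists the codes of the first few Russian lowercase letters. It assumes CP1251 for the Russian part (via Python's `cp1251` codec), since the text does not say which table is meant.

```python
import string

# Codes of consecutive Latin letters differ by exactly one.
upper = [ord(c) for c in string.ascii_uppercase]
is_sequential = all(b - a == 1 for a, b in zip(upper, upper[1:]))

# A digit's numeric value follows from its distance to the character "0".
digit_value = ord("7") - ord("0")

# First five Russian lowercase letters in CP1251 (assumed table).
cyrillic_codes = list("абвгд".encode("cp1251"))

print(is_sequential, digit_value, cyrillic_codes)
```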

Second half of the ASCII code table


Unfortunately, there are currently five different Cyrillic encodings (KOI8-R, Windows, MS-DOS, Macintosh and ISO). Because of this, problems often arise when transferring Russian text from one computer to another or from one software system to another.

Chronologically, one of the first standards for encoding Russian letters on computers was KOI8 ("Information Exchange Code, 8-bit"). This encoding was used back in the 70s on computers of the ES EVM series, and from the mid-80s it began to be used in the first Russified versions of the UNIX operating system.

The CP866 encoding ("CP" stands for "Code Page") remains from the early 90s, the time when the MS DOS operating system was dominant.

Apple computers running the Mac OS operating system use their own Mac encoding.

In addition, the International Standards Organization (ISO) has approved another encoding called ISO 8859-5 as a standard for the Russian language.

The most common encoding currently in use is the Microsoft Windows encoding, abbreviated CP1251.

Since the late 90s, the problem of standardizing character encoding has been addressed by the introduction of a new international standard called Unicode. It was originally designed as a 16-bit encoding, i.e. it allocated 2 bytes of memory for each character. Of course, this doubles the amount of memory occupied compared with an 8-bit encoding, but such a code table allows up to 65536 characters. The standard has since grown beyond that limit, so some characters now need more than 2 bytes. The complete specification of the Unicode standard includes a very wide range of the world's existing, extinct and artificially created alphabets, as well as many mathematical, musical, chemical and other symbols.

Let's try using an ASCII table to imagine what words will look like in the computer's memory.

Internal representation of words in computer memory
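
As an illustration, the sketch below prints the bytes that a word occupies in memory as eight-bit binary codes. The function name `memory_image` and the sample words are assumptions; the Russian word is included only to show that its bytes depend on which code page is chosen.

```python
def memory_image(word: str, encoding: str = "ascii") -> list:
    """Return the word's bytes as a list of eight-bit binary strings."""
    return [format(b, "08b") for b in word.encode(encoding)]

latin = memory_image("file")
russian_cp1251 = memory_image("диск", "cp1251")
russian_koi8r = memory_image("диск", "koi8_r")

print(latin)
print(russian_cp1251)
print(russian_koi8r)
```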

Sometimes it happens that a text consisting of letters of the Russian alphabet received from another computer cannot be read - some kind of “abracadabra” is visible on the monitor screen. This happens because computers use different character encodings for the Russian language.

Unicode is a character encoding standard. Simply put, it is a table of correspondence between text characters (letters, punctuation marks, and so on) and binary codes. A computer only understands sequences of zeros and ones. So that it knows exactly what to display on the screen, each character must be assigned its own unique number. In the eighties, characters were encoded in one byte, that is, eight bits (each bit is a 0 or a 1). As a result, one table (also called an encoding or character set) could hold only 256 characters. This may not be enough even for one language. Therefore, many different encodings appeared, and confusing them often led to strange gibberish appearing on the screen instead of readable text. A single standard was required, and Unicode became that standard. The most widely used encoding is UTF-8 (Unicode Transformation Format), which uses 1 to 4 bytes to represent a character.
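
The variable length of UTF-8 can be seen by encoding a few characters and counting the bytes. The four sample characters (a Latin letter, a Cyrillic letter, the euro sign and an emoji) are arbitrary choices for this sketch.

```python
# Latin A, Cyrillic capital Em, euro sign, grinning-face emoji.
samples = ["A", "\u041C", "\u20AC", "\U0001F600"]

# Number of bytes each character occupies in UTF-8.
utf8_lengths = {ch: len(ch.encode("utf-8")) for ch in samples}

for ch, length in utf8_lengths.items():
    print(f"U+{ord(ch):04X} takes {length} byte(s)")
```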

Symbols

Characters in Unicode tables are numbered with hexadecimal numbers. For example, the Cyrillic capital letter М is designated U+041C. This means that it stands at the intersection of row 041 and column C.
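
The numbering scheme can be explored with Python's standard `unicodedata` module. In this sketch the row and column are derived simply by splitting off the last hexadecimal digit of the code point; the variable names are illustrative assumptions.

```python
import unicodedata

code_point = 0x041C
letter = chr(code_point)                # the character itself
name = unicodedata.name(letter)         # its official Unicode name
label = f"U+{code_point:04X}"           # conventional notation

# Last hex digit is the column, the remaining digits are the row.
row, column = code_point >> 4, code_point & 0xF

print(label, letter, name)
print(f"row {row:03X}, column {column:X}")
```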

The Unicode standard is international. It includes characters from almost all scripts of the world, including those that are no longer used: Egyptian hieroglyphs, Germanic runes, Mayan writing, cuneiform and the alphabets of ancient states. Designations of weights and measures, musical notation, and mathematical symbols are also represented.

The Unicode Consortium itself does not invent new characters. Symbols that have found use in society are added to the tables. For example, the ruble sign was actively used for six years before it was added to Unicode. Emoji pictograms (emoticons) were also widely used in Japan first, before they were included in the encoding. But trademarks and company logos are not added as a matter of principle, even such common ones as the Apple logo or the Windows flag. As of version 8.0, about 120 thousand characters are encoded.

As you know, a computer stores information in binary form, representing it as a sequence of ones and zeros. To translate information into a form convenient for human perception, each unique sequence of numbers is replaced by its corresponding symbol when displayed.

One of the systems for correlating binary codes with printed and control characters is ASCII.

At today's level of computer technology, the user is not required to know the code of each specific character. However, a general understanding of how coding is carried out is extremely useful, and for some categories of specialists it is even necessary.

Creating ASCII

The encoding was originally developed in 1963 and then updated twice over the course of 25 years.

In the original version, the ASCII character table included 128 characters; later an extended version appeared, where the first 128 characters were saved, and previously missing characters were assigned to codes with the eighth bit involved.

For many years, this encoding was the most popular in the world. In 2006, Latin 1252 took the leading position, and from the end of 2007 to the present, Unicode has firmly held the leading position.

Computer representation of ASCII

Each ASCII character has its own code, consisting of 8 binary digits, each a zero or a one. The smallest number in this representation is zero (eight zeros in binary), which is the code of the first element in the table.

Two codes in the table were reserved for switching between standard US-ASCII and its national variant.

After ASCII began to include not 128, but 256 characters, an encoding variant became widespread, in which the original version of the table was stored in the first 128 codes with the 8th bit zero. National written characters were stored in the upper half of the table (positions 128-255).
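
Whether a piece of data stays within the original 128 codes can be tested by looking at the eighth bit of each byte. This is a minimal sketch; the helper name `is_seven_bit` is hypothetical, and KOI8-R is used only as one example of a table that fills the upper half.

```python
def is_seven_bit(data: bytes) -> bool:
    """True if every byte has its eighth (most significant) bit set to 0."""
    return all((b & 0x80) == 0 for b in data)

plain = "ASCII".encode("ascii")
national = "Я".encode("koi8_r")  # a letter from positions 128-255

print(is_seven_bit(plain), is_seven_bit(national))
```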

The user does not need to know ASCII character codes directly. For a software developer, it is usually enough to know an element's number in the table in order to calculate its code in binary if necessary.

Russian language

After the development of encodings for the Scandinavian languages, Chinese, Korean, Greek, etc. in the early 70s, the Soviet Union began creating its own version. Soon, a version of the 8-bit encoding called KOI8 was developed, preserving the first 128 ASCII character codes and allocating the same number of positions for letters of the national alphabet and additional characters.

Before the introduction of Unicode, KOI8 dominated the Russian segment of the Internet. There were encoding options for both the Russian and Ukrainian alphabet.

ASCII problems

Since the number of elements even in the extended table did not exceed 256, several different scripts could not be accommodated in one encoding. In the 90s, the “krokozyabry” problem appeared on the Runet: texts typed in Russian using extended ASCII characters were displayed incorrectly.

The problem was that the codes of the various ASCII variants did not match one another. Recall that different characters could be located in positions 128-255, so when one Cyrillic encoding was swapped for another, every letter of the text was replaced with the character that has the same number in the other encoding.
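
The mismatch is visible even for a single byte. The sketch below reads the byte E1 through three Cyrillic tables, using the codec names `koi8_r`, `cp1251` and `cp866` from Python's standard library; the choice of byte is arbitrary.

```python
raw = bytes([0xE1])

# The same number stands for a different letter in each code page.
readings = {name: raw.decode(name) for name in ("koi8_r", "cp1251", "cp866")}

for name, letter in readings.items():
    print(f"{name}: {letter}")
```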

Current Status

With the advent of Unicode, the popularity of ASCII began to decline sharply.

The reason for this lies in the fact that the new encoding made it possible to accommodate characters from almost all written languages. In this case, the first 128 ASCII characters correspond to the same characters in Unicode.

In 2000, ASCII was the most popular encoding on the Internet and was used on 60% of web pages indexed by Google. By 2012, the share of such pages had dropped to 17%, and Unicode (UTF-8) took the place of the most popular encoding.

So ASCII is an important part of the history of information technology, but its future use seems unpromising.