HTML charset

The HTML charset, also known as the character encoding, is a code that defines the way in which characters are represented in an HTML or XML document. It tells the browser or other software how to interpret the bytes of a document as characters.

The most commonly used charset for HTML documents is UTF-8 (Unicode Transformation Format 8-bit). UTF-8 is an international standard that can represent all characters from the Unicode character set, which includes a wide range of characters from many different languages and scripts.

You can specify the charset of an HTML document by including a <meta> tag in the <head> section of the document, like this:

<meta charset="UTF-8">

You can also specify the charset in the HTTP headers when the page is sent from the server to the browser.
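
As a rough illustration of the header form: the charset travels as a parameter of the Content-Type header, for example `Content-Type: text/html; charset=UTF-8`. The sketch below uses Python's standard `email.message` module to read that parameter from a header value. The helper name `charset_from_content_type` is made up for this example.

```python
from email.message import Message


def charset_from_content_type(header_value):
    """Return the charset parameter of a Content-Type value, or None.

    Hypothetical helper for illustration only.
    """
    msg = Message()
    msg["Content-Type"] = header_value
    # get_content_charset() lowercases the value and returns None if absent.
    return msg.get_content_charset()


print(charset_from_content_type("text/html; charset=UTF-8"))
print(charset_from_content_type("text/html"))
```

The second call has no charset parameter, so the helper returns None and the browser would have to fall back on other hints such as the <meta> tag.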

It’s important to specify the correct charset in your HTML document to ensure that characters are displayed correctly for all users. If the charset is not specified, or if it is specified incorrectly, characters may be displayed incorrectly or not at all.
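
A minimal sketch of what "displayed incorrectly" can look like, assuming a page saved as UTF-8 but read as Windows-1252. The sample word is arbitrary.

```python
text = "café"
utf8_bytes = text.encode("utf-8")  # b'caf\xc3\xa9': "é" takes two bytes

# Correct: decode with the encoding the bytes were written in.
right = utf8_bytes.decode("utf-8")

# Wrong: read as Windows-1252, each of the two bytes becomes its own character.
wrong = utf8_bytes.decode("cp1252")

print(right)  # café
print(wrong)  # cafÃ©
```

The bytes are identical in both cases. Only the declared charset changes how they are turned into characters.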

It’s also important to make sure that the charset used in your HTML document is the same as the charset used in the editor or tool you are using to create the document.

What is Character Encoding?

ASCII was one of the earliest character encoding standards (also called character sets). It defines 128 different characters (codes 0 to 127) that could be used on the internet.

ASCII supported numbers (0-9), English letters (A-Z and a-z), and some special characters like ! $ + - ( ) @ < > .
ANSI (Windows-1252) was the original Windows character set. It supported 256 different character codes.
ISO-8859-1 was the default character set for HTML 4. It also supported 256 different character codes.
Because ANSI and ISO-8859-1 were so limited, the default character encoding was changed to UTF-8 in HTML5.
UTF-8 (Unicode) covers almost all of the characters and symbols in the world.
All HTML 4 processors also support UTF-8.
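
The limits described above can be seen by encoding one non-English character under each character set. This is a small sketch using Python's built-in codecs, and the helper name `encode_or_none` is an assumption made for the example.

```python
def encode_or_none(ch, encoding):
    """Return the encoded bytes, or None if the encoding lacks the character."""
    try:
        return ch.encode(encoding)
    except UnicodeEncodeError:
        return None


# "é" is U+00E9, which lies outside the 128 ASCII codes.
results = {
    name: encode_or_none("é", name)
    for name in ("ascii", "iso-8859-1", "cp1252", "utf-8")
}

for name, encoded in results.items():
    print(name, encoded)
```

ASCII cannot represent the character at all. ISO-8859-1 and Windows-1252 (`cp1252` in Python) both store it as the single byte 0xE9, and UTF-8 stores it as the two bytes 0xC3 0xA9.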

The HTML charset Attribute

To display an HTML page correctly, a web browser must know the character set used in the page.
This is specified in the <meta> tag:


For HTML4:

<meta http-equiv="Content-Type" content="text/html;charset=ISO-8859-1">

For HTML5:

<meta charset="UTF-8">
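
To make the two forms concrete, here is a simplified sketch of how a tool might look for either declaration near the start of a document. It is not the algorithm browsers actually use, which is considerably more involved. The regular expression, the 1024-byte window, and the name `sniff_meta_charset` are assumptions made for this example.

```python
import re

# Matches both <meta charset="..."> and the older
# <meta http-equiv="Content-Type" content="text/html;charset=..."> form.
META_CHARSET = re.compile(
    rb'<meta[^>]+charset\s*=\s*["\']?([A-Za-z0-9_-]+)',
    re.IGNORECASE,
)


def sniff_meta_charset(html_bytes):
    """Return the declared charset in lowercase, or None (hypothetical helper)."""
    # Only the start of the document is scanned (assumed window: 1024 bytes).
    match = META_CHARSET.search(html_bytes[:1024])
    if match is None:
        return None
    return match.group(1).decode("ascii").lower()


html5 = b'<head><meta charset="UTF-8"></head>'
html4 = b'<meta http-equiv="Content-Type" content="text/html;charset=ISO-8859-1">'

print(sniff_meta_charset(html5))
print(sniff_meta_charset(html4))
```

The declaration has to be found by reading raw bytes, before the encoding is known, which is why it is usually placed as early in the <head> as possible.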


If a browser detects ISO-8859-1 in a web page, it defaults to ANSI (Windows-1252), because the two are identical except that ANSI uses 32 code positions (0x80 to 0x9F), which ISO-8859-1 reserves for control codes, for extra printable characters.
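
The 32 differing positions can be checked by decoding every possible byte value under both encodings. This sketch uses Python's `iso-8859-1` and `cp1252` codecs. Python leaves a few Windows-1252 positions undefined, so they are decoded with a replacement character here, which is a detail of this example rather than of browsers.

```python
differing = []
for value in range(256):
    raw = bytes([value])
    as_latin1 = raw.decode("iso-8859-1")
    # errors="replace" covers the few positions Python's cp1252 leaves undefined.
    as_cp1252 = raw.decode("cp1252", errors="replace")
    if as_latin1 != as_cp1252:
        differing.append(value)

print(len(differing), hex(differing[0]), hex(differing[-1]))

# Byte 0x80 is the euro sign in Windows-1252 but a control code in ISO-8859-1.
euro = bytes([0x80]).decode("cp1252")
control = bytes([0x80]).decode("iso-8859-1")
print(f"U+{ord(euro):04X}", f"U+{ord(control):04X}")
```

Every byte outside the range 0x80 to 0x9F decodes to the same character in both encodings.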

Differences Between Character Sets
The following table compares the three Unicode encoding forms, UTF-8, UTF-16, and UTF-32:

Sr.No | Character Set | Description
1 | UTF-8 | A Unicode Transformation Format that uses 8-bit units, that is, bytes. A character in UTF-8 is 1 to 4 bytes long, making UTF-8 variable width.
2 | UTF-16 | A Unicode Transformation Format that uses 16-bit units, that is, shorts. A character is 1 or 2 shorts long, making UTF-16 variable width.
3 | UTF-32 | A Unicode Transformation Format that uses 32-bit units, that is, longs. It is a fixed-width format in which every character is exactly 1 long.
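
The widths in the table can be checked by counting code units for a few sample characters. This is a small sketch in which the function name `code_units` and the sample characters are chosen only for illustration.

```python
def code_units(ch):
    """Count the code units one character needs in each encoding form."""
    return {
        "utf-8": len(ch.encode("utf-8")),            # 8-bit units (bytes)
        "utf-16": len(ch.encode("utf-16-le")) // 2,  # 16-bit units
        "utf-32": len(ch.encode("utf-32-le")) // 4,  # 32-bit units
    }


# Samples: Latin letter, accented letter, euro sign, emoji.
for ch in ("A", "é", "€", "😀"):
    print(f"U+{ord(ch):04X}", code_units(ch))
```

The emoji needs 4 bytes in UTF-8 and 2 units in UTF-16, while every sample fits in a single UTF-32 unit.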


See also

Character encoding on W3C
