
Difference between ANSI and UTF-8

Last updated 2 days ago

ANSI vs UTF-8

TLDR:
For text that contains only ASCII characters (unaccented English letters, digits, and basic punctuation), ANSI and UTF-8 produce exactly the same bytes, so there is no difference. If you don't want emojis and characters from other languages to be corrupted, you should use UTF-8.


What is ANSI?

ANSI encoding is a generic term used to refer to the standard code page on a system.

On U.S. and Western European systems it is more properly called Windows-1252; on other systems "ANSI" refers to other Windows code pages. Windows-1252 is essentially an extension of the ASCII character set: it contains all 128 ASCII characters and uses the remaining 128 byte values for additional characters. The extra room exists because "ANSI" encodings are 8-bit, whereas ASCII is a 7-bit code (nowadays ASCII is almost always stored in 8-bit bytes with the most significant bit set to 0).

The name "ANSI" is a misnomer, since it doesn't correspond to any actual ANSI standard, but the name has stuck.

⚠️ ANSI encoding does not support emojis and most characters of world languages!
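The limits described above can be checked with a short Python sketch. It assumes Python's "cp1252" codec as a stand-in for "ANSI" on a Western system; the sample strings and the helper name can_encode are made up for illustration.

```python
# Sample strings (illustrative only)
text_ascii = "Hello, world!"
text_western = "caf\u00e9 \u20ac5"      # "café €5"
text_emoji = "Hi \U0001F44B"            # "Hi" followed by a waving-hand emoji


def can_encode(text, encoding):
    """Return True if every character of text exists in the given encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


# Pure ASCII text produces identical bytes in both encodings.
same_for_ascii = text_ascii.encode("cp1252") == text_ascii.encode("utf-8")

# Western European characters fit in cp1252 (one byte each),
# but the byte values differ from the UTF-8 ones.
western_cp1252 = text_western.encode("cp1252")
western_utf8 = text_western.encode("utf-8")

# Emojis have no slot in cp1252 at all.
emoji_fits_ansi = can_encode(text_emoji, "cp1252")
emoji_fits_utf8 = can_encode(text_emoji, "utf-8")

# Reading UTF-8 bytes as if they were cp1252 yields corrupted text (mojibake).
mojibake = western_utf8.decode("cp1252")

print(same_for_ascii, emoji_fits_ansi, emoji_fits_utf8)
print(len(western_cp1252), len(western_utf8))
print(mojibake)
```

The last line shows the typical corruption pattern that appears when a UTF-8 file is opened as ANSI.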


What is UTF-8?

UTF-8 is a variable-width character encoding used for electronic communication. It is defined by the Unicode Standard, and its name is derived from Unicode (or Universal Coded Character Set) Transformation Format, 8-bit.

UTF-8 is capable of encoding all 1,112,064 valid character code points in Unicode using one to four one-byte (8-bit) code units. Code points with lower numerical values, which tend to occur more frequently, are encoded using fewer bytes. It was designed for backward compatibility with ASCII: the first 128 characters of Unicode, which correspond one-to-one with ASCII, are encoded using a single byte with the same binary value as ASCII, so that valid ASCII text is valid UTF-8-encoded Unicode as well.
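A minimal Python sketch of the points above: how many bytes a few sample characters need in UTF-8, the ASCII compatibility, and where the figure 1,112,064 comes from (all code points up to U+10FFFF minus the 2,048 surrogates). The sample characters are arbitrary choices for illustration.

```python
# One sample character from each UTF-8 length class (arbitrary picks)
samples = {
    "A": "A",                    # U+0041, ASCII
    "e-acute": "\u00e9",         # U+00E9
    "euro": "\u20ac",            # U+20AC
    "grinning": "\U0001F600",    # U+1F600, an emoji
}

# Number of bytes each character occupies once encoded as UTF-8
utf8_lengths = {name: len(ch.encode("utf-8")) for name, ch in samples.items()}

# ASCII text encodes to the same bytes under ASCII and UTF-8.
ascii_text = "plain ASCII text"
ascii_compatible = ascii_text.encode("ascii") == ascii_text.encode("utf-8")

# Valid code points: U+0000..U+10FFFF, excluding surrogates U+D800..U+DFFF.
total_code_points = 0x10FFFF + 1
surrogates = 0xDFFF - 0xD800 + 1
valid_code_points = total_code_points - surrogates

print(utf8_lengths)
print(ascii_compatible)
print(valid_code_points)
```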

Since ASCII bytes do not occur when encoding non-ASCII code points into UTF-8, UTF-8 is safe to use within most programming and document languages that interpret certain ASCII characters.
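This property can be illustrated in Python. The sketch below checks that every byte of an encoded non-ASCII character has its high bit set, and then splits a UTF-8 byte string on the ASCII "/" byte without damaging any multi-byte character. The function name and the sample path are hypothetical.

```python
def non_ascii_bytes_are_high(text):
    """Check that non-ASCII characters never produce bytes below 0x80 in UTF-8."""
    for ch in text:
        if ord(ch) >= 0x80:
            if any(b < 0x80 for b in ch.encode("utf-8")):
                return False
    return True


# Hypothetical path mixing ASCII separators with non-ASCII names
path = "docs/r\u00e9sum\u00e9/\u65e5\u672c\u8a9e.txt"

high_bit_ok = non_ascii_bytes_are_high(path)

# Splitting the raw bytes on an ASCII delimiter is safe: the byte 0x2F
# can only ever mean "/", never part of another character.
parts = path.encode("utf-8").split(b"/")
decoded_parts = [p.decode("utf-8") for p in parts]

print(high_bit_ok)
print(decoded_parts)
```

The same split would not be safe in encodings where an ASCII byte value can appear inside a multi-byte character.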

✔️ UTF-8 is the dominant encoding for the World Wide Web and internet technologies. It supports emojis and almost all characters of world languages.


What is UTF-8-BOM?

The UTF-8 BOM (Byte Order Mark) is a sequence of three bytes (0xEF, 0xBB, 0xBF) at the start of a text stream that lets the reading software identify the file as UTF-8 more reliably. If present, those bytes must be skipped when extracting the text from the file or stream. The BOM, when handled correctly, is invisible to users. Using a BOM is optional.
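In Python, for example, the standard "utf-8-sig" codec handles the BOM in both directions, while the plain "utf-8" codec leaves it in the decoded text as the character U+FEFF. A small sketch with a made-up sample string:

```python
import codecs

sample = "h\u00e9llo"  # arbitrary sample text

# Bytes as a BOM-writing editor would save them
data_with_bom = codecs.BOM_UTF8 + sample.encode("utf-8")

# Plain "utf-8" keeps the BOM as an invisible first character (U+FEFF).
decoded_plain = data_with_bom.decode("utf-8")

# "utf-8-sig" strips the BOM if present...
decoded_sig = data_with_bom.decode("utf-8-sig")

# ...and also accepts input that has no BOM.
decoded_no_bom = sample.encode("utf-8").decode("utf-8-sig")

# When encoding, "utf-8-sig" prepends the three BOM bytes.
written = sample.encode("utf-8-sig")

print(repr(decoded_plain[0]))
print(decoded_sig == sample, decoded_no_bom == sample)
print(written[:3])
```

Forgetting to strip the BOM is a common cause of a stray invisible character at the start of the first line or first CSV column name.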


UTF-8 vs UTF-16 vs UTF-32

  • UTF-8: Variable-width encoding, backwards compatible with ASCII. ASCII characters (U+0000 to U+007F) take 1 byte, code points U+0080 to U+07FF take 2 bytes, code points U+0800 to U+FFFF take 3 bytes, code points U+10000 to U+10FFFF take 4 bytes.
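  • UTF-16: Variable-width encoding. Code points U+0000 to U+FFFF (excluding the surrogate range U+D800 to U+DFFF) take 2 bytes; code points U+10000 to U+10FFFF take 4 bytes, encoded as a surrogate pair. Microsoft Excel uses UTF-16 for its "Unicode Text" export format.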
  • UTF-32: Fixed-width encoding. All code points take four bytes. An enormous memory hog, but simple and fast for software to operate on. Rarely used for storage or data exchange.
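The sizes listed above can be compared directly in Python. This sketch uses the little-endian codec variants ("utf-16-le", "utf-32-le") so that no BOM is added to the byte counts; the helper name encoded_sizes and the sample strings are illustrative.

```python
ENCODINGS = ("utf-8", "utf-16-le", "utf-32-le")


def encoded_sizes(text):
    """Return the number of bytes text occupies in each encoding."""
    return {enc: len(text.encode(enc)) for enc in ENCODINGS}


sizes_ascii = encoded_sizes("A")              # U+0041
sizes_euro = encoded_sizes("\u20ac")          # U+20AC
sizes_emoji = encoded_sizes("\U0001F600")     # U+1F600
sizes_word = encoded_sizes("hello")

for label, sizes in [("A", sizes_ascii), ("euro", sizes_euro),
                     ("emoji", sizes_emoji), ("hello", sizes_word)]:
    print(label, sizes)
```

For mostly-ASCII text, UTF-8 is the most compact of the three; for code points in the U+0800 to U+FFFF range, UTF-16 needs one byte less per character than UTF-8.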


Conclusion

If you don't want emojis and characters from other languages to be corrupted, use UTF-8 to be on the safe side, unless a specific tool or file format requires a different encoding.


