TLDR:
There is no practical difference between ANSI and UTF-8 if your text contains only ASCII characters (unaccented English letters, digits, and basic punctuation); both encodings produce identical bytes for them. If you don't want emojis and characters from other languages to be corrupted, you should use UTF-8.
ANSI encoding is a generic term used to refer to the default code page on a Windows system.
On U.S. and Western European systems it is more properly called Windows-1252. (On other systems it can refer to a different Windows code page.) It is essentially an extension of the ASCII character set: it includes all 128 ASCII characters plus an additional 128 character codes. The extra codes exist because "ANSI" encoding is 8-bit, whereas ASCII is 7-bit (nowadays ASCII is almost always stored as 8-bit bytes with the MSB set to 0).
The name "ANSI" is a misnomer, since it doesn't correspond to any actual ANSI standard, but the name has stuck.
⚠️ ANSI encoding does not support emojis and most characters of world languages!
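To make the difference concrete, here is a minimal Python sketch. It assumes "ANSI" means Windows-1252 (Python codec name `cp1252`); the sample strings and variable names are made up for illustration.

```python
# Compare how the same text is encoded under Windows-1252 ("ANSI") and UTF-8.
ascii_text = "Hello, world!"
accented = "café"
emoji = "😀"

# Pure ASCII text produces identical bytes in both encodings.
ascii_same = ascii_text.encode("cp1252") == ascii_text.encode("utf-8")
print(ascii_same)      # True

# "é" is a single byte in Windows-1252 but two bytes in UTF-8.
cp1252_bytes = accented.encode("cp1252")
utf8_bytes = accented.encode("utf-8")
print(cp1252_bytes)    # b'caf\xe9'
print(utf8_bytes)      # b'caf\xc3\xa9'

# Reading UTF-8 bytes as if they were Windows-1252 gives garbled text.
mojibake = utf8_bytes.decode("cp1252")
print(mojibake)        # cafÃ©

# Windows-1252 has no code for an emoji, so encoding it fails.
try:
    emoji.encode("cp1252")
    emoji_supported = True
except UnicodeEncodeError:
    emoji_supported = False
print(emoji_supported)  # False
```

The `cafÃ©` result is the typical corruption you see when a UTF-8 file is opened as ANSI.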
UTF-8 is a variable-width character encoding used for electronic communication. It is defined by the Unicode Standard, and its name is derived from Unicode (or Universal Coded Character Set) Transformation Format 8-bit.
UTF-8 is capable of encoding all 1,112,064 valid character code points in Unicode using one to four one-byte (8-bit) code units. Code points with lower numerical values, which tend to occur more frequently, are encoded using fewer bytes. It was designed for backward compatibility with ASCII: the first 128 characters of Unicode, which correspond one-to-one with ASCII, are encoded using a single byte with the same binary value as ASCII, so that valid ASCII text is valid UTF-8-encoded Unicode as well.
Since ASCII bytes do not occur when encoding non-ASCII code points into UTF-8, UTF-8 is safe to use within most programming and document languages that interpret certain ASCII characters.
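The properties described above can be checked with a short Python sketch. The sample characters are arbitrary picks, one for each encoded length, and the variable names are illustrative only.

```python
# Number of valid code points: the full range U+0000..U+10FFFF
# minus the 2,048 surrogates U+D800..U+DFFF.
valid_code_points = 0x110000 - (0xDFFF - 0xD800 + 1)
print(valid_code_points)   # 1112064

# One sample character per UTF-8 length (1 to 4 bytes).
samples = ["A", "é", "€", "😀"]   # U+0041, U+00E9, U+20AC, U+1F600
lengths = [len(ch.encode("utf-8")) for ch in samples]
print(lengths)             # [1, 2, 3, 4]

# Valid ASCII text is byte-for-byte identical when encoded as UTF-8.
ascii_compatible = "plain ASCII".encode("ascii") == "plain ASCII".encode("utf-8")
print(ascii_compatible)    # True

# Every byte of an encoded non-ASCII code point has its high bit set,
# so it can never be mistaken for an ASCII character such as '/' or '"'.
high_bit_only = all(
    byte >= 0x80
    for ch in samples[1:]
    for byte in ch.encode("utf-8")
)
print(high_bit_only)       # True
```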
✔️ UTF-8 is the dominant encoding for the World Wide Web and internet technologies. It supports emojis and every other character in Unicode, which covers almost all of the world's writing systems.
The UTF-8 BOM (Byte Order Mark) is a sequence of three bytes at the start of a text stream (0xEF, 0xBB, 0xBF) that allows the reader (software) to recognize more reliably that a file is encoded in UTF-8. Those bytes, if present, must be ignored when extracting the string from the file/stream. The BOM, when correctly handled, is invisible to users. BOM use is optional.
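As a rough illustration of BOM handling, the Python sketch below uses the standard `utf-8-sig` codec, which skips a leading BOM when decoding and tolerates its absence. The sample text is made up.

```python
import codecs

text = "héllo"

# Build a byte stream that starts with the UTF-8 BOM (EF BB BF).
with_bom = codecs.BOM_UTF8 + text.encode("utf-8")
print(with_bom[:3])              # b'\xef\xbb\xbf'

# The plain "utf-8" codec keeps the BOM as a U+FEFF character.
naive = with_bom.decode("utf-8")
print(naive == "\ufeff" + text)  # True

# "utf-8-sig" strips a leading BOM if one is present...
clean = with_bom.decode("utf-8-sig")
print(clean == text)             # True

# ...and decodes BOM-less UTF-8 unchanged.
no_bom = text.encode("utf-8").decode("utf-8-sig")
print(no_bom == text)            # True
```

When reading files of unknown origin, `open(path, encoding="utf-8-sig")` handles both cases the same way.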
Name | UTF-8 | UTF-8-BOM | UTF-16BE | UTF-16LE | UTF-32BE | UTF-32LE |
---|---|---|---|---|---|---|
Smallest code point | 0000 | 0000 | 0000 | 0000 | 0000 | 0000 |
Largest code point | 10FFFF | 10FFFF | 10FFFF | 10FFFF | 10FFFF | 10FFFF |
Code unit size | 8 bits | 8 bits | 16 bits | 16 bits | 32 bits | 32 bits |
Byte order | N/A | N/A (BOM is a signature only) | big-endian | little-endian | big-endian | little-endian |
Fewest bytes per character | 1 | 1 | 2 | 2 | 4 | 4 |
Most bytes per character | 4 | 4 | 4 | 4 | 4 | 4 |
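
The "fewest" and "most bytes per character" rows of the table can be reproduced with a small Python sketch. The codec names are Python's own; the two sample characters are arbitrary picks, one from the ASCII range and one outside the Basic Multilingual Plane.

```python
encodings = ["utf-8", "utf-16-be", "utf-16-le", "utf-32-be", "utf-32-le"]

def encoded_sizes(ch):
    """Return the number of bytes `ch` occupies in each encoding."""
    return {enc: len(ch.encode(enc)) for enc in encodings}

# An ASCII character needs the fewest bytes in every encoding.
sizes_a = encoded_sizes("A")
print(sizes_a)
# {'utf-8': 1, 'utf-16-be': 2, 'utf-16-le': 2, 'utf-32-be': 4, 'utf-32-le': 4}

# An emoji (U+1F600) needs the most: 4 bytes in every encoding.
sizes_emoji = encoded_sizes("😀")
print(sizes_emoji)
# {'utf-8': 4, 'utf-16-be': 4, 'utf-16-le': 4, 'utf-32-be': 4, 'utf-32-le': 4}
```
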
If you don't want emojis and characters from other languages to be corrupted, use UTF-8 to be on the safe side, unless a specific requirement forces you to use another encoding.