Unicode and Non-Unicode String Data Types in SQL Server
Nov 19, 2021 by Robert Gravelle
SQL Server provides a number of data types that support all types of data that you may want to store. As you may have guessed, data type is an attribute that specifies the type of data that a column can store. It can be an integer, character string, monetary, date and time, and so on. One data type that causes some confusion among database designers and developers are those for storing character strings. A character string is a series of characters manipulated as a group. In the context of relational databases, character string data types are those which allow you to store either fixed-length (char) or variable-length data (varchar). Moreover, SQL Server splits its string types into two broad categories: Unicode and non-Unicode. These equate to nchar, nvarchar, and ntext for Unicode types and char, varchar/varchar (max) and text for non-Unicode. In today’s blog, we’ll compare the two categories to decide when to use one over the other.
Nchar is short for “NATIONAL CHARACTER”, nvarchar stands for “NATIONAL CHARACTER VARYING”, and ntext is the ISO synonym for “NATIONAL TEXT”. Originally intended for pre-Unicode multibyte encodings like JIS encoding for Asian characters. The idea was that VARCHAR would continue to be utilized for ASCII, with NVARCHAR being employed for non-ASCII characters.
This use-case was designed when the Internet was still in its infancy and before the Unicode project had taken off. In those days, Asian languages in particular, employed their own specific – and mutually incompatible – encodings, with GB for mainland Chinese, JIS/SJIS for Japanese, BIG5 in Hong Kong and Taiwan, CNS in Taiwan, etc. However, all of that changed with the emergence of the Unicode project encodings, as database vendors realized that it was easier to just allow VARCHAR itself to support multibyte character encodings, and use Character Sets and Collations to deal with specific encodings. For instance, you can use UTF-8 to encode any character you need in any language your applications need to support. Thus, the need for a whole group of character data types that were specific to “NATIONAL CHARACTER” soon faded away.
Today, in many modern DB engines, “NVARCHAR” and “NATIONAL CHARACTER VARYING” are really just aliases for VARCHAR, with the actual implementation being virtually (if not exactly) identical. Having said that, SQL Server does treat the two differently. As stated in the docs:
The key difference between varchar and nvarchar is the way they are stored, varchar is stored as regular 8-bit data(1 byte per character) and nvarchar stores data at 2 bytes per character. Due to this reason, nvarchar can hold up to 4000 characters and it takes double the space as SQL varchar.
As specified above, the biggest concern when deciding between types is the amount of storage used. For example, nvarchar uses 2 bytes per character, whereas varchar uses 1. Thus, nvarchar(4000) uses the same amount of storage space as varchar(8000). Hence, if you have requirements to store UNICODE or multilingual data, nvarchar is the best choice. Varchar stores ASCII data and should be your data type of choice for normal use. Another consideration is that joining a VARCHAR column to a NVARCHAR (and vice-versa) in queries can lead to a considerable performance hit.
In today’s blog, we compared SQL Server’s Unicode and Non-Unicode String Data Types to decide when to use one over the other.