Chapter 8: Unicode Support

8.1 What is Unicode

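From MSDN: "Unicode is a 16-bit, fixed-width character encoding standard that encompasses virtually all of the characters commonly used on computers today. This includes most of the world's written languages, plus publishing characters, mathematical and technical symbols, and punctuation marks."

Note that this description dates from early versions of the standard. Unicode now defines code values up to Hex 10FFFF, and only the first 65,536 of them (Hex 0000 to FFFF) fit in 16 bits. All examples in this chapter fall within that 16-bit range.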

From Unicode.org: "Computers ... store letters and other characters by assigning a number for each one. Before Unicode was invented, there were hundreds of different encoding systems for assigning these numbers. No single encoding could contain enough characters... Unicode provides a unique number for every character, no matter what the platform, no matter what the program, no matter what the language."

For example, the basic Latin letter "A" has the code Hex 0041 (65), the Russian letter "Ж" has the code Hex 0416 (1046), and the circled Chinese character "㊥" has the code Hex 32A5 (12965).
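
The hexadecimal and decimal values above can be checked in any language that exposes character codes. The short Python sketch below is only an illustration (Python is not part of AspUpload, and the helper name describe is made up for this example). It reports the code of each of the three characters in both notations.

```python
# Illustration only: Python is not part of AspUpload.
# The three sample characters, written as escapes so the file stays pure ASCII.
samples = ["A", "\u0416", "\u32A5"]  # Latin A, Cyrillic Zhe, circled ideograph

def describe(ch):
    """Return (hex string, decimal value) for a single character."""
    code = ord(ch)                 # the Unicode code as a plain integer
    return f"{code:04X}", code

for ch in samples:
    hex_code, dec_code = describe(ch)
    print("Hex", hex_code, "=", dec_code)
```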

For more information on Unicode, visit www.unicode.org.

8.2 What is UTF-8

UTF-8 (Unicode Transformation Format, 8-bit encoding form) is the recommended format for sending Unicode-based data across networks, in particular the Internet. UTF-8 represents a Unicode value as a sequence of 1 to 4 bytes. Values in the range Hex 0000 to FFFF, which are the ones described below, take 1, 2, or 3 bytes; values above FFFF take 4.

Unicode characters in the range Hex 0000 to 007F are encoded simply as bytes 00 to 7F. This means that files and strings which contain only 7-bit ASCII characters have the same encoding under both ASCII and UTF-8. Therefore, the Unicode 0041 ("A") in UTF-8 is Hex 41.
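
As a quick check of this property, the following Python sketch (an illustration only, not AspUpload code) encodes a 7-bit ASCII string both ways and compares the resulting bytes. The sample string is arbitrary.

```python
# Illustration only: a string made of 7-bit ASCII characters.
text = "AspUpload 3.0"

ascii_bytes = text.encode("ascii")
utf8_bytes = text.encode("utf-8")

# For pure ASCII input the two encodings produce identical bytes.
same = ascii_bytes == utf8_bytes
print(same)

# The letter "A" (Unicode 0041) is the single byte Hex 41 in UTF-8.
print("A".encode("utf-8").hex().upper())
```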

Unicode characters in the range Hex 0080 to 07FF are encoded as a sequence of two bytes: 110xxxxx 10xxxxxx. (The x bit positions are filled with the bits of the character code number in binary representation.) For example, the Unicode 0416 ("Ж"), or Binary 00000100 00010110, is encoded as 11010000 10010110, or Hex D0 96.

Unicode characters in the range Hex 0800 to FFFF are encoded as a sequence of three bytes: 1110xxxx 10xxxxxx 10xxxxxx. For example, the Unicode 32A5 ("㊥"), or Binary 00110010 10100101, is encoded as 11100011 10001010 10100101, or Hex E3 8A A5.
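
The one-, two-, and three-byte rules can be written out as a few shifts and masks. The Python sketch below is a minimal illustration under stated assumptions: the function name utf8_encode is hypothetical, it handles only the range Hex 0000 to FFFF discussed in this section, and it is not the code AspUpload uses internally.

```python
def utf8_encode(code):
    """Encode one Unicode value in the range Hex 0000..FFFF as UTF-8 bytes.

    Sketch only: four-byte sequences (values above FFFF) are not handled,
    and the reserved surrogate values D800..DFFF are not rejected.
    """
    if code < 0 or code > 0xFFFF:
        raise ValueError("this sketch only handles Hex 0000 to FFFF")
    if code <= 0x7F:
        # 0xxxxxxx: the value is stored as a single byte.
        return bytes([code])
    if code <= 0x7FF:
        # 110xxxxx 10xxxxxx: top 5 bits, then low 6 bits.
        return bytes([0xC0 | (code >> 6),
                      0x80 | (code & 0x3F)])
    # 1110xxxx 10xxxxxx 10xxxxxx: top 4 bits, middle 6 bits, low 6 bits.
    return bytes([0xE0 | (code >> 12),
                  0x80 | ((code >> 6) & 0x3F),
                  0x80 | (code & 0x3F)])

# The three examples from the text.
for code in (0x0041, 0x0416, 0x32A5):
    print(f"{code:04X}", "->", utf8_encode(code).hex(" ").upper())
```

Comparing the result with a built-in encoder, for example chr(0x0416).encode("utf-8") in Python, is a simple way to confirm the bit layout.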

8.3 Using UTF-8 with AspUpload

Starting with version 3.0, AspUpload is capable of converting UTF-8 encoded text fields and file names back into Unicode strings.

If you anticipate using Unicode characters in text data or in the names of the files you are uploading, you should instruct the browser to POST all the information in the UTF-8 format. This is done by including the following tag in the <HEAD> section of the page that contains the upload form:

<HEAD>
...
<META http-equiv="Content-Type" content="text/html; charset=utf-8">
</HEAD>

On the AspUpload side, you must enable UTF-8 translation by setting the property Upload.CodePage to 65001 (the value of the Win32 constant CP_UTF8):

<%
Set Upload = Server.CreateObject("Persits.Upload")
Upload.CodePage = 65001
...
Upload.Save "c:\upload"
%>

The Upload.CodePage property can also be set to valid code page values such as 1251 (Cyrillic), 1255 (Hebrew), 1256 (Arabic), etc. Every time the CodePage property is set, AspUpload will attempt to translate the text data and file names into Unicode using the specified code page by invoking the Win32 function MultiByteToWideChar.
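
To see why the code page must match what the browser actually sent, consider the same letter arriving under two different code pages. The Python sketch below is an illustration only: it uses Python's built-in codecs as a stand-in for the kind of conversion MultiByteToWideChar performs and does not call AspUpload or Win32.

```python
# Illustration only: Python codecs stand in for a code page conversion.
# The same letter (Cyrillic Zhe, Unicode 0416) under two code pages.
raw_1251 = b"\xC6"        # one byte in Windows code page 1251 (Cyrillic)
raw_utf8 = b"\xD0\x96"    # two bytes in UTF-8 (code page 65001)

from_1251 = raw_1251.decode("cp1251")
from_utf8 = raw_utf8.decode("utf-8")

# With the matching code page, both inputs yield the same Unicode value.
print(f"{ord(from_1251):04X}", f"{ord(from_utf8):04X}")

# With the wrong code page, the UTF-8 bytes decode without error but
# produce two unrelated characters instead of one.
wrong = raw_utf8.decode("cp1251")
print(len(wrong), [f"{ord(ch):04X}" for ch in wrong])
```

A mismatch of this kind usually does not raise an error. It shows up later as garbled text or file names.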

The code samples unicode.asp and unicode_upload.asp demonstrate AspUpload's Unicode support. Both files are shown here:

unicode.asp

<HTML>
<HEAD>
<META http-equiv="Content-Type" content="text/html; charset=utf-8">
</HEAD>
<BODY BGCOLOR="#FFFFFF">

<h3>File and Text Items</h3>
   <FORM METHOD="POST" ENCTYPE="multipart/form-data" ACTION="unicode_upload.asp">
      File 1:<INPUT TYPE=FILE NAME="FILE1"><BR>
      Description 1:<INPUT TYPE=TEXT NAME="DESCR1"><BR>
   <INPUT TYPE=SUBMIT VALUE="Upload!">
   </FORM>
</BODY>
</HTML>

unicode_upload.asp

<HTML>
<HEAD>
<META http-equiv="Content-Type" content="text/html; charset=utf-8">
</HEAD>
<BODY>
<%
Set Upload = Server.CreateObject("Persits.Upload")

' Enable UTF-8 translation
Upload.CodePage = 65001
Upload.Save "c:\upload"
%>

Files:<BR>
<%
For Each File in Upload.Files
   Response.Write File.Name & "= " & Server.HTMLEncode(File.Path) & " (" & File.Size &" bytes)<BR>"
Next
%>

<P>
Other items:<BR>
<%
For Each Item in Upload.Form
   Response.Write Item.Name & "= " & Server.HTMLEncode(Item.Value) & "<BR>"
Next
%>
</BODY>
</HTML>

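Note that this script applies Server.HTMLEncode to file paths and text item values. This converts Unicode strings to a format that any browser can display, in which each non-ASCII character is written as a numeric character reference containing its decimal Unicode value, such as: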

&#1055;&#1077;&#1088;&#1089;&#1080;&#1094;
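
The sample output is simply the decimal Unicode value of each character wrapped in &# and ;. The Python sketch below reproduces it for illustration. The helper to_numeric_refs is hypothetical and only mimics this one aspect of the output. It is not the implementation of Server.HTMLEncode, which also escapes characters such as < and &.

```python
import html

def to_numeric_refs(text):
    """Write every non-ASCII character as a decimal reference such as &#1055;.

    Sketch only: ASCII characters are passed through unchanged.
    """
    return "".join(ch if ord(ch) < 128 else f"&#{ord(ch)};" for ch in text)

# Six Cyrillic letters, written as escapes so the source stays pure ASCII.
original = "\u041f\u0435\u0440\u0441\u0438\u0446"

encoded = to_numeric_refs(original)
print(encoded)

# A browser reverses the process when it renders the page;
# html.unescape does the same conversion here.
decoded = html.unescape(encoded)
print(decoded == original)
```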
