Internationalization


Unicode

With this issue, we will be starting a series on internationalization using Java. Internationalization refers to constructing applications such that they can operate using other languages, currencies, dates and times, and so on, with only minimal changes required.

The first aspect of this topic is Unicode, the character set used by Java. A Java char holds a Unicode character as an unsigned 16-bit value in the range 0 - 65535, so every character uses two bytes. Printable ASCII characters map easily to Unicode: the ASCII value becomes the low byte and the high byte is 0. So, for example, a space is 0x0020.

Characters can be converted to integers without any cast, while converting the other way requires a cast (and may not always make sense, because an integer is 32 bits). For example, this program:
        public class test1 {
                public static void main(String args[])
                {
                        char c = '\uffff';
                        int i = c;
                        System.out.println(i);
                }
        }

prints 65535. Note that \uNNNN is used to represent arbitrary Unicode characters in hex format.
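
The reverse conversion can be sketched as follows. This is only an illustration; the class and method names (CharConvert, toInt, toChar) are made up for the example. It shows that the cast from int to char keeps just the low 16 bits, which is why the conversion "may not always make sense":

```java
class CharConvert {
    // Widening: every char value fits in an int, so no cast is needed.
    static int toInt(char c) {
        return c;
    }

    // Narrowing: the cast keeps only the low 16 bits of the int.
    static char toChar(int i) {
        return (char) i;
    }

    public static void main(String[] args) {
        System.out.println(toInt('A'));        // prints 65
        System.out.println(toChar(0x0041));    // prints A
        // 0x10041 does not fit in 16 bits; the high bits are discarded.
        System.out.println(toChar(0x10041));   // prints A
    }
}
```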

Unicode can be used to express program identifiers, in addition to its use to express the values of specified characters. For example, this program:
        public class test2 {
                public static void main(String args[])
                {
                        int x\u0430 = 37;
                        System.out.println(x\u0430);
                }
        }

is legal. \u0430 is a letter (Cyrillic small letter A), and therefore can be part of an identifier. Unicode translation takes place early in the Java compilation process. Another aspect of this is that:
        public class test3 {
                public static void main(String args[])
                {
                        char c = '\u000D';
                }
        }

is not a valid program, even though some Java compilers accept it. Section 3.10.4 in the Java Language Specification says that a carriage return (0x000D) may not appear as part of a character literal, and the early Unicode translation means that a literal carriage return does in fact appear, that is, Unicode translation results in:
        char c = '<actual carriage return>';

\r needs to be used as a substitute for \u000D in this example.
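
A small sketch of the two kinds of escape may help. The class and method names here are hypothetical. A Unicode escape for an ordinary letter is harmless inside a literal, while the carriage return is written with the \r escape sequence, which is processed only after Unicode translation has finished:

```java
class EscapeDemo {
    // The Unicode escape is translated before the compiler parses the
    // literal, so this is exactly the string "A".
    static String letterA() {
        return "\u0041";
    }

    // The escape sequence for carriage return is handled later, when the
    // literal itself is parsed, so it is safe inside a char literal.
    static int carriageReturnCode() {
        return '\r';
    }

    public static void main(String[] args) {
        System.out.println(letterA().equals("A"));   // prints true
        System.out.println(carriageReturnCode());    // prints 13
    }
}
```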

java.lang.Character is a class for manipulating Java characters. It has methods for classifying characters, for example as to whether they are digits or letters. There is also a Web site, http://www.unicode.org, that presents a lot of detail on how Unicode works. Note also that Java support for Unicode doesn't necessarily mean that a tool using Java (like a Web browser) will have the necessary fonts to display all Unicode characters.
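
As a rough illustration of the classification methods, the sketch below sorts a few sample characters into categories. The class name Classify and the helper describe() are invented for this example; the Character methods it calls are part of java.lang.Character:

```java
class Classify {
    // Returns a coarse category name for one character.
    static String describe(char c) {
        if (Character.isDigit(c)) return "digit";
        if (Character.isLetter(c)) return "letter";
        if (Character.isWhitespace(c)) return "whitespace";
        return "other";
    }

    public static void main(String[] args) {
        char[] samples = { '7', 'x', '\u0430', ' ', '$' };
        for (char c : samples) {
            System.out.println((int) c + " -> " + describe(c));
        }
        // A letter may also appear as part of a Java identifier.
        System.out.println(Character.isJavaIdentifierPart('\u0430'));
    }
}
```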

In the next issue we will be looking at some additional aspects of Java character support.


UTF-8 Encoding

In the last issue we saw how Java represents character data with the Unicode character set. Each character requires two bytes or 16 bits.

The ability to support a wide range of different characters is desirable, but potentially wasteful when targeting the many systems that use 8-bit characters, and that have huge volumes of 8-bit textual data stored in databases. To deal with this problem, Java supports the UTF-8 encoding as part of the DataInputStream and DataOutputStream classes. With this encoding, characters are represented as follows:
        \u0000 - \u007F         1 byte          0xxxxxxx
        \u0080 - \u07FF         2 bytes         110xxxxx 10xxxxxx
        \u0800 - \uFFFF         3 bytes         1110xxxx 10xxxxxx 10xxxxxx

So the printable ASCII character set is represented as itself, that is, using only one byte per character (except for the null byte, which is encoded using a two-byte format to avoid embedded nulls in strings). This encoding is thus quite efficient when the bulk of the characters are ASCII.
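
To make the bit patterns in the table concrete, here is a hand-written encoder for a single character. It is a sketch for illustration only, not the code the Java library uses, and the names Utf8Sketch, encode and hex are invented. It assumes the null-character rule described above, so \u0000 takes the two-byte form:

```java
import java.io.ByteArrayOutputStream;

class Utf8Sketch {
    // Encodes one 16-bit char using the three ranges in the table.
    // The null character is sent through the two-byte branch.
    static byte[] encode(char c) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        if (c >= 0x0001 && c <= 0x007F) {
            out.write(c);                          // 0xxxxxxx
        } else if (c <= 0x07FF) {
            out.write(0xC0 | (c >> 6));            // 110xxxxx
            out.write(0x80 | (c & 0x3F));          // 10xxxxxx
        } else {
            out.write(0xE0 | (c >> 12));           // 1110xxxx
            out.write(0x80 | ((c >> 6) & 0x3F));   // 10xxxxxx
            out.write(0x80 | (c & 0x3F));          // 10xxxxxx
        }
        return out.toByteArray();
    }

    // Renders bytes as space-separated two-digit hex values.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X ", b & 0xFF));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        System.out.println(hex(encode('A')));        // prints 41
        System.out.println(hex(encode('\u0430')));   // prints D0 B0
        System.out.println(hex(encode('\u20AC')));   // prints E2 82 AC
    }
}
```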

An example of using UTF-8 looks like this:
        import java.io.*;

        public class utf {
                public static void main(String args[])
                {
                        String tmp = "tmpfile";
                        try {
                                FileOutputStream fos =
                                    new FileOutputStream(tmp);
                                DataOutputStream dos =
                                    new DataOutputStream(fos);
                                dos.writeUTF("testing\n");
                                dos.close();
                                FileInputStream fis =
                                    new FileInputStream(tmp);
                                DataInputStream dis =
                                    new DataInputStream(fis);
                                String instr = dis.readUTF();
                                dis.close();
                                System.out.print(instr);
                        }
                        catch (Throwable e) {
                                System.err.println(e);
                        }
                }
        }

This example writes a string to a file and then reads it back. The string length is written out as a two-byte value before the actual string, so at most 65535 bytes of UTF-8 encoding are supported per string.
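
The length prefix can be seen by writing to memory instead of a file. The following sketch (the class name UtfLength and its helpers are hypothetical) captures the bytes that writeUTF() produces and decodes the first two of them:

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

class UtfLength {
    // Returns the exact bytes that writeUTF() produces for s.
    static byte[] encoded(String s) {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try {
            DataOutputStream dos = new DataOutputStream(bos);
            dos.writeUTF(s);
            dos.close();
        } catch (IOException e) {
            // writeUTF() fails this way if the encoding exceeds 65535 bytes.
            throw new UncheckedIOException(e);
        }
        return bos.toByteArray();
    }

    // Reads the two-byte length prefix, high byte first.
    static int prefix(byte[] b) {
        return ((b[0] & 0xFF) << 8) | (b[1] & 0xFF);
    }

    public static void main(String[] args) {
        byte[] b = encoded("testing\n");
        System.out.println("total bytes   = " + b.length);    // 10
        System.out.println("length prefix = " + prefix(b));   // 8
    }
}
```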


Character Encoding

We saw earlier how Java represents characters as 16-bit unsigned values. This supports the representation of a variety of character sets.

But typically Unicode is not used to actually store characters in a disk file. For example, in the United States, text is widely stored as ASCII instead, one character per 8-bit byte. We can say that ASCII represents a "local encoding", and there needs to be some way to convert a local encoding into Unicode and back.

One way that this is done is via I/O stream readers and writers. For example, consider this application:
        import java.io.*;

        public class encode {
                public static void main(String args[])
                {
                        String infile = args[0];
                        String outfile = args[1];

                        String encin = "8859_1";  // Latin-1
                        String encout = "8859_5"; // Cyrillic

                        try {
                                InputStream istr =
                                    new FileInputStream(infile);
                                InputStreamReader ird =
                                    new InputStreamReader(istr, encin);
                                BufferedReader br =
                                    new BufferedReader(ird);

                                OutputStream ostr =
                                    new FileOutputStream(outfile);
                                OutputStreamWriter owr =
                                    new OutputStreamWriter(ostr, encout);
                                BufferedWriter bw =
                                    new BufferedWriter(owr);

                                char buf[] = new char[4096];
                                int len;
                                while ((len = br.read(buf)) != -1)
                                        bw.write(buf, 0, len);

                                br.close();
                                bw.flush();
                                bw.close();
                        }
                        catch (Throwable e) {
                                System.err.println(e);
                        }
                }
        }

This program copies its input file to its output file, applying different encodings to each. That is, input bytes are decoded using the 8859_1 encoding (ISO nomenclature for the Latin-1 character set) and output bytes are encoded using the 8859_5 encoding (Cyrillic). So input bytes will be converted to 16 bits via filling the top byte with 0, and output characters (16 bits) will be converted to whatever representation Cyrillic uses.

This approach is more expensive than doing low-level byte I/O, but is important if you're concerned about supporting character sets other than the default one for your locale.
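
The effect of the encoding name can also be seen in memory, without any files. The sketch below decodes the same byte under both encodings. The class and method names are invented for the example, it uses the canonical names ISO-8859-1 and ISO-8859-5 rather than the 8859_1 and 8859_5 forms above, and it assumes the Java runtime in use supplies the ISO-8859-5 converter:

```java
import java.nio.charset.Charset;

class SameByte {
    // Decodes a single byte under the named encoding and returns the char.
    static char decode(int b, String encoding) {
        byte[] bytes = { (byte) b };
        return new String(bytes, Charset.forName(encoding)).charAt(0);
    }

    // Encodes a single char under the named encoding and returns the byte value.
    static int encode(char c, String encoding) {
        byte[] bytes = String.valueOf(c).getBytes(Charset.forName(encoding));
        return bytes[0] & 0xFF;
    }

    public static void main(String[] args) {
        // The byte 0xD0 means different things in the two encodings.
        System.out.println(Integer.toHexString(decode(0xD0, "ISO-8859-1")));      // d0
        System.out.println(Integer.toHexString(decode(0xD0, "ISO-8859-5")));      // 430
        // Going the other way, Cyrillic small letter A maps back to 0xD0.
        System.out.println(Integer.toHexString(encode('\u0430', "ISO-8859-5")));  // d0
    }
}
```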


Locales

A "locale" is a means of encapsulating information about a particular country or geographical region or language or culture. In Java the Locale class is used to represent such information. Objects of the class do not provide internationalization behavior in and of themselves, but are used by various other classes as a means of identifying what behavior is desired. Examples might be varying date/time or currency formats.

A simple example of Locale usage is this:

        import java.util.Locale;

        public class testlocale {

                public static void main(String args[])
                {
                        Locale def = Locale.getDefault();
                        System.out.println("Default locale = " + def);

                        Locale ger = Locale.GERMAN;
                        System.out.println(def.getDisplayName(ger));
                }

        }

This program first retrieves the default locale and displays it. Then it displays the default locale, using German as the display language. Output is this:
        Default locale = en_US
        Englisch (Vereinigte Staaten)

There is a set of standard locales defined in Locale. You can also create your own locales, based on strings representing a country, language, and local variant. A variant is specific to a particular implementation and platform.
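
Creating a locale of your own might look like the sketch below. It assumes a reasonably recent Java version and uses Locale.Builder rather than the Locale constructors that take the same strings directly. The class name OwnLocale and the helper make() are invented, and fr/CA is just a sample language and country pair:

```java
import java.util.Locale;

class OwnLocale {
    // Builds a locale from a language code and a country code.
    static Locale make(String language, String country) {
        return new Locale.Builder()
                .setLanguage(language)
                .setRegion(country)
                .build();
    }

    public static void main(String[] args) {
        Locale frCa = make("fr", "CA");
        System.out.println(frCa);                 // fr_CA
        System.out.println(frCa.getLanguage());   // fr
        System.out.println(frCa.getCountry());    // CA
        // Display the name in English rather than in the default locale.
        System.out.println(frCa.getDisplayName(Locale.ENGLISH));
    }
}
```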


Local Customs

We saw in a previous issue how locales could be represented using the java.util.Locale class. A locale is a representation of a distinct culture or language or region or set of customs.

To see how locales are used in practice, consider the problem of writing a program that handles calendar dates and currency formats in a locale-independent way. That is, the program should always do the right thing, no matter where it's executed.

An example of handling local customs would be this:
        import java.util.*;
        import java.text.*;

        public class date {

                public static void main(String args[])
                {
                        Date now = new Date();

                        DateFormat df_fr =
                            DateFormat.getDateInstance(DateFormat.LONG,
                            Locale.FRANCE);
                        DateFormat df_us =
                            DateFormat.getDateInstance(DateFormat.LONG,
                            Locale.US);

                        System.out.println(df_fr.format(now));
                        System.out.println(df_us.format(now));

                        NumberFormat pr_fr =
                            NumberFormat.getCurrencyInstance(Locale.FRANCE);
                        NumberFormat pr_us =
                            NumberFormat.getCurrencyInstance(Locale.US);

                        double d = 123.45;
                        System.out.println(pr_fr.format(d));
                        System.out.println(pr_us.format(d));
                }

        }

In this program, we get the current date, and then set up a couple of DateFormat objects that encapsulate the desired format of the date (SHORT for a numeric form such as mm/dd/yy, or LONG to have the date spelled out), along with the locale the date is targeted for. We also obtain a pair of NumberFormat objects to format currency values.

These objects then have particular values applied to them, for example, the current date, or the value 123.45 in local units of currency (such as francs or dollars). The output of the program is:
        13 avril 1998
        April 13, 1998
        123,45 F
        $123.45

The above example could also be used without specifying particular locales. In such a case, the default locale is used. Note that this approach, using default locales, is not at all the same as assuming particular customs. For example, if I say:
        double d = 123.45;

        System.out.println("$" + d);

then I am assuming particular currency-formatting customs. Making such assumptions may be acceptable, as long as you realize that your application will not work correctly in some other environment that has a different set of customs.
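
One way to avoid building such assumptions into the code is to route every amount through a NumberFormat, as in this sketch. The class name Price and its two helpers are hypothetical. The first takes an explicit locale and the second relies on the default one, so its output depends on where the program runs:

```java
import java.text.NumberFormat;
import java.util.Locale;

class Price {
    // Formats an amount using the currency customs of the given locale.
    static String format(double amount, Locale locale) {
        return NumberFormat.getCurrencyInstance(locale).format(amount);
    }

    // Formats an amount using whatever locale the program is running in.
    static String formatDefault(double amount) {
        return NumberFormat.getCurrencyInstance().format(amount);
    }

    public static void main(String[] args) {
        System.out.println(format(123.45, Locale.US));   // $123.45
        System.out.println(formatDefault(123.45));       // varies by environment
    }
}
```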


Message Formatting

If you've used C much, you will be familiar with the ubiquitous printf() library function:
        printf("%s %3d %-4ld\n", a, b, c);

for doing formatted output. A related function sprintf() does formatting into a string. C++ has these facilities along with additional ones for doing formatting.

What about formatting in Java? The java.text package, added in JDK 1.1, is used for this purpose. To see how it works, consider a simple example of formatting error messages in two different styles:
        Error at line 23 of file Test.java: ...

        File Test.java, line 23: ...

An example of this type of formatting is:
        import java.text.*;

        public class format {

                static String fmt1 = "Error at line {1} of file {0}: {2}";
                static String fmt2 = "File {0}, line {1}: {2}";

                public static String format(String fmt, String file,
                    int ln, String msg)
                {
                        Object values[] = new Object[3];
                        values[0] = file;
                        values[1] = Integer.valueOf(ln);
                        values[2] = msg;

                        return MessageFormat.format(fmt, values);
                }

                public static void main(String args[])
                {
                        String err1 = format(fmt1, "Test.java", 37, "msg #1");
                        System.out.println(err1);

                        String err2 = format(fmt2, "Test.java", 47, "msg #2");
                        System.out.println(err2);
                }

        }

MessageFormat.format() takes a format string, along with an array of Objects that contain the values to be formatted. Primitive types like integers are represented via wrappers (Integer in this case). Entries in the format string like "{1}" are replaced with the corresponding value.

The output of this program is:
        Error at line 37 of file Test.java: msg #1

        File Test.java, line 47: msg #2

The format string itself might be read from a property file (see issue #025) or a resource bundle, and thus formatting can be customized on a per-locale basis. In other words, you'd read in the format string from a resource bundle, and then use it along with a set of object values to create an actual string to display in an application.
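
A minimal sketch of that idea follows. To stay self-contained it loads the property text from a string instead of a real file, and the key name error.format, the class name BundleSketch and the sample message are all invented for the example. The format is applied with a fixed locale (Locale.US) only so that the line number is rendered the same way everywhere:

```java
import java.io.IOException;
import java.io.StringReader;
import java.text.MessageFormat;
import java.util.Locale;
import java.util.Properties;

class BundleSketch {
    // Stand-ins for the contents of two different property files.
    static final String STYLE_A =
        "error.format=Error at line {1} of file {0}: {2}\n";
    static final String STYLE_B =
        "error.format=File {0}, line {1}: {2}\n";

    // Looks up the format string and fills in the three values.
    static String report(String propertyText, String file, int line, String msg) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(propertyText));
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
        String pattern = props.getProperty("error.format");
        Object[] values = { file, Integer.valueOf(line), msg };
        return new MessageFormat(pattern, Locale.US).format(values);
    }

    public static void main(String[] args) {
        System.out.println(report(STYLE_A, "Test.java", 23, "missing semicolon"));
        System.out.println(report(STYLE_B, "Test.java", 23, "missing semicolon"));
    }
}
```

The calling code is the same in both cases; only the text that was loaded differs.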