With this issue, we will be starting a series on internationalization
using Java. Internationalization refers to constructing applications
such that they can operate using other languages, currencies, dates and
times, and so on, with only minimal changes required.
The first aspect of this topic is Unicode, the character set used by
Java. A Java char holds a Unicode character as an unsigned 16-bit value
in the range 0 - 65535, so every char occupies two bytes. Printable ASCII
characters have an easy mapping to Unicode, by adding a 0 high byte to
the ASCII low byte. So, for example, a space is 0x0020.
Characters can be converted to integers without any cast, while
converting the other way requires a cast (and may not always make sense,
because an integer is 32 bits). For example, this program:

    public class test1 {
        public static void main(String args[]) {
            char c = '\uffff';
            int i = c;
            System.out.println(i);
        }
    }

prints 65535. Note that \uNNNN is used to represent arbitrary Unicode
characters in hex format.
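
To make the cast rule concrete, here is a minimal sketch of both directions. The class and method names (CharCodes, toCode, fromCode) are ours, chosen only for illustration.

```java
public class CharCodes {
    // Widening a char to an int needs no cast; the result is in 0..65535.
    static int toCode(char c) {
        return c;
    }

    // Narrowing an int to a char needs a cast and keeps only the low 16 bits.
    static char fromCode(int code) {
        return (char) code;
    }

    public static void main(String[] args) {
        System.out.println(toCode(' '));                // 32, that is, 0x0020
        System.out.println(toCode(fromCode(0x10041)));  // 65, the high bits are discarded
    }
}
```

The second line of output shows why the int-to-char direction "may not always make sense": any bits above the low 16 are silently dropped.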
Unicode can be used to express program identifiers, in addition to its
use to express the values of specified characters. For example, this
program:

    public class test2 {
        public static void main(String args[]) {
            int x\u0430 = 37;
            System.out.println(x\u0430);
        }
    }

is legal. \u0430 is a letter (Cyrillic small letter A), and therefore
can be part of an identifier. Unicode translation takes place early in
the Java compilation process. Another aspect of this is that:

    public class test3 {
        public static void main(String args[]) {
            char c = '\u000D';
        }
    }

is not a valid program, even though some Java compilers accept it.
Section 3.10.4 in the Java Language Specification says that a carriage
return (0x000D) may not appear as part of a character literal, and the
early Unicode translation means that a literal carriage return does in
fact appear, that is, Unicode translation results in:

    char c = '<actual carriage return>';

\r needs to be used as a substitute for \u000D in this example.
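
As a small sketch of the substitution, the following program (the class name CarriageReturn is ours) uses the escape sequence and confirms that the resulting character has the value 0x000D. Note that Unicode escapes are translated inside comments as well, so the comments here deliberately avoid writing one.

```java
public class CarriageReturn {
    // An escape sequence is handled when the literal is parsed, long after
    // the early Unicode translation step, so this literal is legal.
    static char carriageReturn() {
        return '\r';
    }

    public static void main(String[] args) {
        System.out.println((int) carriageReturn());   // 13, that is, 0x000D
    }
}
```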
java.lang.Character is a class for manipulating Java characters. It has
methods for classifying characters, for example as to whether they are
digits or letters. There is also a Web site http://www.unicode.org
that presents a lot of detail on how Unicode works. Note also that Java
support for Unicode doesn't necessarily mean that a tool using Java
(like a Web browser) will have the necessary fonts to display all
Unicode characters.
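
A brief sketch of the classification methods in java.lang.Character follows. The describe() helper and the class name Classify are ours; the Character methods it calls are standard.

```java
public class Classify {
    // Returns a rough category for a character using Character's
    // classification methods.
    static String describe(char c) {
        if (Character.isDigit(c)) return "digit";
        if (Character.isLetter(c)) return "letter";
        if (Character.isWhitespace(c)) return "whitespace";
        return "other";
    }

    public static void main(String[] args) {
        System.out.println(describe('7'));        // digit
        System.out.println(describe('\u0430'));   // letter
        System.out.println(describe(' '));        // whitespace
        // The same letter is acceptable inside an identifier.
        System.out.println(Character.isJavaIdentifierPart('\u0430'));   // true
    }
}
```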
In the next issue we will be looking at some additional aspects of Java
character support.
In the last issue we saw how Java represents character data with the
Unicode character set. Each character requires two bytes or 16 bits.
The ability to support a wide range of different characters is
desirable, but potentially wasteful when targeting the many systems that
use 8-bit characters, and that have huge volumes of 8-bit textual data
stored in databases. To deal with this problem, Java supports the UTF-8
encoding as part of the DataInputStream and DataOutputStream classes.
With this encoding, characters are represented as follows:

    \u0000 - \u007F    1 byte     0xxxxxxx
    \u0080 - \u07FF    2 bytes    110xxxxx 10xxxxxx
    \u0800 - \uFFFF    3 bytes    1110xxxx 10xxxxxx 10xxxxxx

So the printable ASCII character set is represented as itself, that is,
using only one byte per character (except for the null byte, which is
encoded using a two-byte format to avoid embedded nulls in strings).
This encoding is thus quite efficient when the bulk of the characters
are ASCII.
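
The bit layouts above can be sketched by hand. This is only an illustration of the table, not the code the Data streams actually use; the class Utf8Sketch and its encode() and hex() helpers are our own names.

```java
public class Utf8Sketch {
    // Encodes one 16-bit char following the table above, including the
    // two-byte form used for the null character.
    static byte[] encode(char c) {
        if (c >= 0x0001 && c <= 0x007F) {
            return new byte[] { (byte) c };
        } else if (c <= 0x07FF) {   // also covers c == 0
            return new byte[] {
                (byte) (0xC0 | (c >> 6)),
                (byte) (0x80 | (c & 0x3F))
            };
        } else {
            return new byte[] {
                (byte) (0xE0 | (c >> 12)),
                (byte) (0x80 | ((c >> 6) & 0x3F)),
                (byte) (0x80 | (c & 0x3F))
            };
        }
    }

    // Formats bytes as space-separated hex pairs.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(hex(encode('A')));        // 41
        System.out.println(hex(encode('\u0430')));   // D0 B0
        System.out.println(hex(encode('\u20AC')));   // E2 82 AC
        System.out.println(hex(encode((char) 0)));   // C0 80
    }
}
```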
An example of using UTF-8 looks like this:

    import java.io.*;

    public class utf {
        public static void main(String args[]) {
            String tmp = "tmpfile";
            try {
                FileOutputStream fos = new FileOutputStream(tmp);
                DataOutputStream dos = new DataOutputStream(fos);
                dos.writeUTF("testing\n");
                dos.close();

                FileInputStream fis = new FileInputStream(tmp);
                DataInputStream dis = new DataInputStream(fis);
                String instr = dis.readUTF();
                dis.close();

                System.out.print(instr);
            }
            catch (Throwable e) {
                System.err.println(e);
            }
        }
    }

This example writes a string to a file and then reads it back. The
string length is written out as a two-byte value before the actual
string, so at most 65535 bytes of UTF-8 encoding are supported per
string.
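
To see the two-byte length prefix, we can write into memory instead of a file and inspect the bytes. This is a small sketch; the class name UtfLength and the helper writeUtf() are ours.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

public class UtfLength {
    // Returns exactly the bytes that writeUTF() produces for a string.
    static byte[] writeUtf(String s) {
        try {
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            DataOutputStream dos = new DataOutputStream(bos);
            dos.writeUTF(s);
            dos.close();
            return bos.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] out = writeUtf("testing\n");
        // The first two bytes hold the encoded length, high byte first.
        int prefix = ((out[0] & 0xFF) << 8) | (out[1] & 0xFF);
        System.out.println(prefix);       // 8
        System.out.println(out.length);   // 10
    }
}
```

The eight ASCII characters take one byte each, so the file holds ten bytes in all: two for the length and eight for the text.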
We saw earlier how Java represents characters as 16-bit unsigned values.
This supports the representation of a variety of character sets.
But typically Unicode is not used to actually store characters in a disk
file. For example, in the United States, 8-bit ASCII is widely used
instead, both for text and binary data. We can say that ASCII
represents a "local encoding", and there needs to be some way to convert
a local encoding into Unicode and back.
One way that this is done is via I/O stream readers and writers. For
example, consider this application:

    import java.io.*;

    public class encode {
        public static void main(String args[]) {
            String infile = args[0];
            String outfile = args[1];
            String encin = "8859_1";    // Latin-1
            String encout = "8859_5";   // Cyrillic
            try {
                InputStream istr = new FileInputStream(infile);
                InputStreamReader ird = new InputStreamReader(istr, encin);
                BufferedReader br = new BufferedReader(ird);

                OutputStream ostr = new FileOutputStream(outfile);
                OutputStreamWriter owr = new OutputStreamWriter(ostr, encout);
                BufferedWriter bw = new BufferedWriter(owr);

                char buf[] = new char[4096];
                int len;
                while ((len = br.read(buf)) != -1)
                    bw.write(buf, 0, len);

                br.close();
                bw.flush();
                bw.close();
            }
            catch (Throwable e) {
                System.err.println(e);
            }
        }
    }

This program copies its input file to its output file, applying
different encodings to each. That is, input bytes are decoded using the
8859_1 encoding (ISO nomenclature for the Latin-1 character set) and
output bytes are encoded using the 8859_5 encoding (Cyrillic). So input
bytes will be converted to 16 bits via filling the top byte with 0, and
output characters (16 bits) will be converted to whatever representation
Cyrillic uses.
This approach is more expensive than doing low-level byte I/O, but is
important if you're concerned about supporting character sets other than
the default one for your locale.
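
The same decode-then-encode step can be sketched in memory with String and java.nio.charset.Charset. This assumes the runtime supplies the ISO-8859-5 charset, which is common but not guaranteed on every Java platform; the class name Transcode and its helper are ours.

```java
import java.nio.charset.Charset;

public class Transcode {
    // Decodes bytes from one encoding into 16-bit chars, then encodes
    // those chars into another encoding.
    static byte[] transcode(byte[] in, String from, String to) {
        String text = new String(in, Charset.forName(from));
        return text.getBytes(Charset.forName(to));
    }

    public static void main(String[] args) {
        // Cyrillic small letter A takes two bytes in UTF-8 and one in ISO-8859-5.
        byte[] utf8 = { (byte) 0xD0, (byte) 0xB0 };
        byte[] cyr = transcode(utf8, "UTF-8", "ISO-8859-5");
        System.out.println(cyr.length);       // 1
        System.out.println(cyr[0] & 0xFF);    // 208, that is, 0xD0

        // A Latin-1 letter with no Cyrillic equivalent cannot be represented,
        // so the encoder substitutes its replacement byte.
        byte[] latin = { (byte) 0xE9 };
        byte[] lost = transcode(latin, "ISO-8859-1", "ISO-8859-5");
        System.out.println((char) lost[0]);   // ?
    }
}
```

The second conversion is a reminder that converting between two 8-bit encodings can lose information when the target has no code for a character.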
A "locale" is a means of encapsulating information about a particular
country or geographical region or language or culture. In Java the
Locale class is used to represent such information. Objects of the
class do not provide internationalization behavior in and of themselves,
but are used by various other classes as a means of identifying what
behavior is desired. Examples might be varying date/time or currency
formats.
A simple example of Locale usage is this:

    import java.util.Locale;

    public class testlocale {
        public static void main(String args[]) {
            Locale def = Locale.getDefault();
            System.out.println("Default locale = " + def);

            Locale ger = Locale.GERMAN;
            System.out.println(def.getDisplayName(ger));
        }
    }

This program first retrieves the default locale and displays it. Then
it displays the default locale, using German as the display language.
Output is this:

    Default locale = en_US
    Englisch (Vereinigte Staaten)

There is a set of standard locales defined in Locale. You can also
create your own locales, based on strings representing a country,
language, and local variant. A variant is specific to a particular
implementation and platform.
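
Here is a minimal sketch of building a locale from language and country strings. It uses Locale.Builder, which assumes Java 7 or later; the class name CustomLocale and the make() helper are ours, and the variant is left out.

```java
import java.util.Locale;

public class CustomLocale {
    // Builds a locale from a two-letter language code and a country code.
    static Locale make(String language, String country) {
        return new Locale.Builder()
                .setLanguage(language)
                .setRegion(country)
                .build();
    }

    public static void main(String[] args) {
        Locale frCa = make("fr", "CA");
        System.out.println(frCa);                // fr_CA
        System.out.println(frCa.getLanguage());  // fr
        System.out.println(frCa.getCountry());   // CA
        // The display name depends on the locale data shipped with the runtime.
        System.out.println(frCa.getDisplayName(Locale.ENGLISH));
    }
}
```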
We saw in a previous issue how locales could be represented using the
java.util.Locale class. A locale is a representation of a distinct
culture or language or region or set of customs.
To see how locales are used in practice, consider the problem of writing
a program that handles calendar dates and currency formats in a
locale-independent way. That is, the program should always do the right
thing, no matter where it's executed.
An example of handling local customs would be this:

    import java.util.*;
    import java.text.*;

    public class date {
        public static void main(String args[]) {
            Date now = new Date();

            DateFormat df_fr =
                DateFormat.getDateInstance(DateFormat.LONG, Locale.FRANCE);
            DateFormat df_us =
                DateFormat.getDateInstance(DateFormat.LONG, Locale.US);
            System.out.println(df_fr.format(now));
            System.out.println(df_us.format(now));

            NumberFormat pr_fr =
                NumberFormat.getCurrencyInstance(Locale.FRANCE);
            NumberFormat pr_us =
                NumberFormat.getCurrencyInstance(Locale.US);
            double d = 123.45;
            System.out.println(pr_fr.format(d));
            System.out.println(pr_us.format(d));
        }
    }

In this program, we get the current date, and then set up a couple of
DateFormat objects that encapsulate the desired format of the date
(SHORT for mm/dd/yy or LONG to have the date spelled out), along with
the locale the date is targeted for. We also obtain NumberFormat
objects to format currency values.
These objects then have particular values applied to them, for example,
the current date, or the value 123.45 in local units of currency (such
as francs or dollars). The output of the program is:

    13 avril 1998
    April 13, 1998
    123,45 F
    $123.45

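
To compare the SHORT and LONG styles on a date that does not change from run to run, we can build the date explicitly. This is a sketch; the class name DateStyles and its helpers are ours, and the exact SHORT output and the French output depend on the locale data shipped with the runtime.

```java
import java.text.DateFormat;
import java.util.Calendar;
import java.util.Date;
import java.util.GregorianCalendar;
import java.util.Locale;

public class DateStyles {
    // Formats a date in the given style (DateFormat.SHORT, LONG, ...) and locale.
    static String format(Date d, int style, Locale loc) {
        return DateFormat.getDateInstance(style, loc).format(d);
    }

    // Noon on 13 April 1998 in the default time zone.
    static Date april13() {
        return new GregorianCalendar(1998, Calendar.APRIL, 13, 12, 0).getTime();
    }

    public static void main(String[] args) {
        Date d = april13();
        System.out.println(format(d, DateFormat.SHORT, Locale.US));
        System.out.println(format(d, DateFormat.LONG, Locale.US));   // April 13, 1998
        System.out.println(format(d, DateFormat.SHORT, Locale.FRANCE));
        System.out.println(format(d, DateFormat.LONG, Locale.FRANCE));
    }
}
```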
The above example could also be used without specifying particular
locales. In such a case, the default locale is used. Note that this
approach, using default locales, is not at all the same as assuming
particular customs. For example, if I say:

    double d = 123.45;
    System.out.println("$" + d);

then I am assuming particular currency-formatting customs. Making such
assumptions may be acceptable, as long as you realize that your
application will not work correctly in some other environment that has a
different set of customs.
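
By contrast, a version that leaves the customs to NumberFormat might look like the sketch below. The class name Price and its format() helper are ours. Passing Locale.getDefault() gives the same result as calling getCurrencyInstance() with no argument.

```java
import java.text.NumberFormat;
import java.util.Locale;

public class Price {
    // Formats an amount using the currency conventions of the given locale.
    static String format(double amount, Locale loc) {
        return NumberFormat.getCurrencyInstance(loc).format(amount);
    }

    public static void main(String[] args) {
        System.out.println(format(1234.5, Locale.US));   // $1,234.50
        // The output of the next two lines depends on the runtime's locale
        // data and on the default locale of the machine.
        System.out.println(format(1234.5, Locale.GERMANY));
        System.out.println(format(1234.5, Locale.getDefault()));
    }
}
```

Notice that the formatter also supplies the grouping separator and pads to two decimal places, details that the string-concatenation version gets wrong even in the United States.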
If you've used the language C very much, you will be familiar with the
ubiquitous printf() library function:

    printf("%s %3d %-4ld\n", a, b, c);

for doing formatted output. A related function sprintf() does
formatting into a string. C++ has these facilities along with
additional ones for doing formatting.
What about formatting in Java? A package called java.text, added in JDK 1.1, is used for
this purpose. To see how it works, consider a simple example of
formatting error messages in two different styles:

    Error at line 23 of file Test.java: ...
    File Test.java, line 23: ...

An example of this type of formatting is:

    import java.text.*;

    public class format {
        static String fmt1 = "Error at line {1} of file {0}: {2}";
        static String fmt2 = "File {0}, line {1}: {2}";

        public static String format(String fmt, String file,
                                    int ln, String msg) {
            Object values[] = new Object[3];
            values[0] = file;
            values[1] = new Integer(ln);
            values[2] = msg;
            return MessageFormat.format(fmt, values);
        }

        public static void main(String args[]) {
            String err1 = format(fmt1, "Test.java", 37, "msg #1");
            System.out.println(err1);
            String err2 = format(fmt2, "Test.java", 47, "msg #2");
            System.out.println(err2);
        }
    }

MessageFormat.format() takes a format string, along with an array of
Objects that contain the values to be formatted. Primitive types like
integers are represented via wrappers (Integer in this case). Entries
in the format string like "{1}" are replaced with the corresponding
value.
The output of this program is:

    Error at line 37 of file Test.java: msg #1
    File Test.java, line 47: msg #2

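
One point worth seeing in code is that MessageFormat formats numeric arguments according to a locale, so a large line number picks up a grouping separator. The sketch below passes the locale explicitly; the class name TypedFormat and the report() helper are ours.

```java
import java.text.MessageFormat;
import java.util.Locale;

public class TypedFormat {
    // Formats an error message, rendering the line number as an integer
    // in the conventions of the given locale.
    static String report(String file, int line, String msg, Locale loc) {
        MessageFormat mf =
            new MessageFormat("File {0}, line {1,number,integer}: {2}", loc);
        return mf.format(new Object[] { file, Integer.valueOf(line), msg });
    }

    public static void main(String[] args) {
        // File Test.java, line 1,234: msg
        System.out.println(report("Test.java", 1234, "msg", Locale.US));
        // File Test.java, line 1.234: msg
        System.out.println(report("Test.java", 1234, "msg", Locale.GERMANY));
    }
}
```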
The format string itself might be read from a property file (see issue
#025) or a resource bundle, and thus formatting can be customized on a
per-locale basis. In other words, you'd read in the format string from
a resource bundle, and then use it along with a set of object values to
create an actual string to display in an application.
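
As a rough sketch of that last step, the program below reads a format string from a resource bundle and applies it. To stay self-contained it builds a PropertyResourceBundle from an in-memory string; the key name error.format, the property text, and the class name BundleFormat are all hypothetical. A real application would more likely call ResourceBundle.getBundle() to load a properties file chosen for the user's locale.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.text.MessageFormat;
import java.util.Locale;
import java.util.PropertyResourceBundle;
import java.util.ResourceBundle;

public class BundleFormat {
    // Hypothetical property text standing in for a per-locale properties file.
    static final String PROPS = "error.format=File {0}, line {1}: {2}\n";

    // Parses property text into a resource bundle.
    static ResourceBundle load(String text) {
        try {
            return new PropertyResourceBundle(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Looks up the format string in the bundle and fills in the values.
    static String report(ResourceBundle rb, String file, int line, String msg) {
        MessageFormat mf =
            new MessageFormat(rb.getString("error.format"), Locale.US);
        return mf.format(new Object[] { file, Integer.valueOf(line), msg });
    }

    public static void main(String[] args) {
        ResourceBundle rb = load(PROPS);
        // File Test.java, line 23: oops
        System.out.println(report(rb, "Test.java", 23, "oops"));
    }
}
```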