Reading and Writing i18n Character Data

I never thought much about reading and writing multi-byte character data from files. I assumed it would just work. After all, strings in Java are represented as UTF-16 encoded characters, so if I simply read and write the data, all will be well, right?

Well, yes and no. (Huh?) The real answer is that it depends on the default character set encoding of the underlying platform, which Java uses whenever you do not name one explicitly. If that default is the same encoding your files use, for example UTF-8, then all is well, and you can continue reading and writing your files as if they contained only ASCII characters. Unfortunately, it is likely that your system is set up to use something other than UTF-8, for example windows-1252, a close relative of ISO-8859-1 (Latin-1) and the usual default on Western-locale MS Windows.

Reading UTF-8 Data

When reading multi-byte character data, the most important thing is not to assume that one byte represents one character (it is not called multi-byte for nothing :) ). So how do you read such data? It is simple: you need a Reader, and you need to specify UTF-8 as the character set when instantiating it. Consider a case where you want to read a short file into a StringBuilder. An example is shown in figure 1.
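Before getting to the figure, a quick aside. The two claims above (the platform default matters, and one character is not one byte) are easy to check for yourself. The sketch below is only an illustration: the class name EncodingDemo and its helper methods are made up for this example. Note also that on JDK 18 and later the default charset is UTF-8 regardless of the operating system, so the first line of output may not match what is described above.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EncodingDemo {

    // Number of bytes the string occupies once encoded as UTF-8.
    static int utf8ByteCount(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    // Encode as UTF-8, then decode with a (possibly different) charset,
    // mimicking a reader that assumes the wrong encoding.
    static String roundTrip(String s, Charset decodeWith) {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, decodeWith);
    }

    public static void main(String[] args) {
        // The charset Java falls back to when none is specified.
        System.out.println("default charset: " + Charset.defaultCharset());

        // U+00E9 (e with acute accent), written as an escape so the demo
        // does not depend on the encoding of this source file.
        String s = "\u00e9";
        System.out.println("chars: " + s.length());             // 1
        System.out.println("UTF-8 bytes: " + utf8ByteCount(s)); // 2

        // Decoding with the matching charset restores the original text.
        System.out.println(roundTrip(s, StandardCharsets.UTF_8).equals(s));     // true
        // Decoding with ISO-8859-1 yields two characters instead of one.
        System.out.println(roundTrip(s, StandardCharsets.ISO_8859_1).length()); // 2
    }
}
```

The last line is the kind of silent corruption you get when a reader falls back to a default that does not match the file: no exception, just the wrong characters.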
    import java.io.BufferedReader;
    import java.io.File;
    import java.io.FileInputStream;
    import java.io.IOException;
    import java.io.InputStreamReader;

    StringBuilder readFile(String fileName) throws IOException {
        if (fileName == null) {
            throw new IllegalArgumentException("file name is null");
        }
        File f = new File(fileName);
        StringBuilder buf = new StringBuilder();
        BufferedReader br = null;
        char[] c = new char[1024];
        int len = -1;
        try {
            // instantiate the reader with an explicit character set
            br = new BufferedReader(
                    new InputStreamReader(
                            new FileInputStream(f), "UTF-8"));
            // read characters (not bytes) until the end of the stream
            while ((len = br.read(c)) > -1) {
                buf.append(c, 0, len);
            }
        } finally {
            // always release the file handle, even if reading failed
            if (br != null) {
                br.close();
            }
        }
        return buf;
    }

figure 1 - reading a UTF-8 file

Writing UTF-8 Data

Writing UTF-8 data is as simple as reading it, as long as you use a Writer instantiated with UTF-8 as the character set. A simple example is shown in figure 2.

    import java.io.BufferedWriter;
    import java.io.File;
    import java.io.FileOutputStream;
    import java.io.IOException;
    import java.io.OutputStreamWriter;

    void writeFile(String data, String fileName) throws IOException {
        if (data == null) {
            throw new IllegalArgumentException("data is null");
        }
        if (fileName == null) {
            throw new IllegalArgumentException("file name is null");
        }
        File f = new File(fileName);
        BufferedWriter bw = null;
        try {
            // instantiate the writer specifying the character set
            bw = new BufferedWriter(
                    new OutputStreamWriter(
                            new FileOutputStream(f), "UTF-8"));
            // write the whole string; the writer encodes it as UTF-8
            bw.write(data, 0, data.length());
        } finally {
            // closing also flushes any buffered characters to disk
            if (bw != null) {
                bw.close();
            }
        }
    }

figure 2 - writing UTF-8 data to a file

Detecting File Encoding

--- under construction ---