Comparing Strings
Applications that sort through text perform frequent string comparisons. For example, a report generator performs string comparisons when sorting a list of strings in alphabetical order.
If your application audience is limited to people who speak English, you can probably perform string comparisons with the String.compareTo
method. The String.compareTo
method performs a binary comparison of the Unicode characters within the two strings. For most languages, however, this binary comparison cannot be relied on to sort strings, because the Unicode values do not correspond to the relative order of the characters.
Fortunately the Collator
class allows your application to perform string comparisons for different languages. In this section, you'll learn how to use the Collator
class when sorting text.
Performing Locale-Independent Comparisons
Collation rules define the sort sequence of strings. These rules vary with locale, because various natural languages sort words differently. You can use the predefined collation rules provided by the Collator
class to sort strings in a locale-independent manner.
To instantiate the Collator
class invoke the getInstance
method. Usually, you create a Collator
for the default Locale
, as in the following example:
Collator myDefaultCollator = Collator.getInstance();
You can also specify a particular Locale
when you create a Collator
, as follows:
Collator myFrenchCollator = Collator.getInstance(Locale.FRENCH);
The getInstance
method returns a RuleBasedCollator
, which is a concrete subclass of Collator
. The RuleBasedCollator
contains a set of rules that determine the sort order of strings for the locale you specify. These rules are predefined for each locale. Because the rules are encapsulated within the RuleBasedCollator
, your program won't need special routines to deal with the way collation rules vary with language.
You invoke the Collator.compare
method to perform a locale-independent string comparison. The compare
method returns an integer less than, equal to, or greater than zero when the first string argument is less than, equal to, or greater than the second string argument. The following table contains some sample calls to Collator.compare
:
Example | Return Value | Explanation |
---|---|---|
myCollator.compare("abc", "def") |
-1 |
"abc" is less than "def" |
myCollator.compare("rtf", "rtf") |
0 |
the two strings are equal |
myCollator.compare("xyz", "abc") |
1 |
"xyz" is greater than "abc" |
You use the compare
method when performing sort operations. The sample program called CollatorDemo
uses the compare
method to sort an array of English and French words. This program shows what can happen when you sort the same list of words with two different collators:
Collator fr_FRCollator = Collator.getInstance(new Locale("fr","FR"));
Collator en_USCollator = Collator.getInstance(new Locale("en","US"));
The method for sorting, called sortStrings
, can be used with any Collator
. Notice that the sortStrings
method invokes the compare
method:
public static void sortStrings(Collator collator, String[] words) {
String tmp;
for (int i = 0; i < words.length; i++) {
for (int j = i + 1; j < words.length; j++) {
if (collator.compare(words[i], words[j]) > 0) {
tmp = words[i];
words[i] = words[j];
words[j] = tmp;
}
}
}
}
The English Collator
sorts the words as follows:
peach
péché
pêche
sin
According to the collation rules of the French language, the preceding list is in the wrong order. In French péché should follow pêche in a sorted list. The French Collator
sorts the array of words correctly, as follows:
peach
pêche
péché
sin
Customizing Collation Rules
The previous section discussed how to use the predefined rules for a locale to compare strings. These collation rules determine the sort order of strings. If the predefined collation rules do not meet your needs, you can design your own rules and assign them to a RuleBasedCollator
object.
Customized collation rules are contained in a String
object that is passed to the RuleBasedCollator
constructor. Here's a simple example:
String simpleRule = "< a < b < c < d";
RuleBasedCollator simpleCollator = new RuleBasedCollator(simpleRule);
For the simpleCollator
object in the previous example, a
is less than b
, which is less that c
, and so forth. The simpleCollator.compare
method references these rules when comparing strings. The full syntax used to construct a collation rule is more flexible and complex than this simple example. For a full description of the syntax, refer to the API documentation for the RuleBasedCollator
class.
The example that follows sorts a list of Spanish words with two collators. Full source code for this example is in RulesDemo.java
.
The RulesDemo
program starts by defining collation rules for English and Spanish. The program will sort the Spanish words in the traditional manner. When sorting by the traditional rules, the letters ch and ll and their uppercase equivalents each have their own positions in the sort order. These character pairs compare as if they were one character. For example, ch sorts as a single letter, following cz in the sort order. Note how the rules for the two collators differ:
String englishRules = (
"< a,A < b,B < c,C < d,D < e,E < f,F " +
"< g,G < h,H < i,I < j,J < k,K < l,L " +
"< m,M < n,N < o,O < p,P < q,Q < r,R " +
"< s,S < t,T < u,U < v,V < w,W < x,X " +
"< y,Y < z,Z");
String smallnTilde = new String("\u00F1"); // ñ
String capitalNTilde = new String("\u00D1"); // Ñ
String traditionalSpanishRules = (
"< a,A < b,B < c,C " +
"< ch, cH, Ch, CH " +
"< d,D < e,E < f,F " +
"< g,G < h,H < i,I < j,J < k,K < l,L " +
"< ll, lL, Ll, LL " +
"< m,M < n,N " +
"< " + smallnTilde + "," + capitalNTilde + " " +
"< o,O < p,P < q,Q < r,R " +
"< s,S < t,T < u,U < v,V < w,W < x,X " +
"< y,Y < z,Z");
The following lines of code create the collators and invoke the sort routine:
try {
RuleBasedCollator enCollator = new RuleBasedCollator(englishRules);
RuleBasedCollator spCollator =
new RuleBasedCollator(traditionalSpanishRules);
sortStrings(enCollator, words);
printStrings(words);
System.out.println();
sortStrings(spCollator, words);
printStrings(words);
} catch (ParseException pe) {
System.out.println("Parse exception for rules");
}
The sort routine, called sortStrings
, is generic. It will sort any array of words according to the rules of any Collator
object:
public static void sortStrings(Collator collator, String[] words) {
String tmp;
for (int i = 0; i < words.length; i++) {
for (int j = i + 1; j < words.length; j++) {
if (collator.compare(words[i], words[j]) > 0) {
tmp = words[i];
words[i] = words[j];
words[j] = tmp;
}
}
}
}
When sorted with the English collation rules, the array of words is as follows:
chalina
curioso
llama
luz
Compare the preceding list with the following, which is sorted according to the traditional Spanish rules of collation:
curioso
chalina
luz
llama
Improving Collation Performance
Sorting long lists of strings is often time consuming. If your sort algorithm compares strings repeatedly, you can speed up the process by using the CollationKey
class.
A CollationKey
object represents a sort key for a given String
and Collator
. Comparing two CollationKey
objects involves a bitwise comparison of sort keys and is faster than comparing String
objects with the Collator.compare
method. However, generating CollationKey
objects requires time. Therefore if a String
is to be compared just once, Collator.compare
offers better performance.
The example that follows uses a CollationKey
object to sort an array of words. Source code for this example is in KeysDemo.java
.
The KeysDemo
program creates an array of CollationKey
objects in the main
method. To create a CollationKey
, you invoke the getCollationKey
method on a Collator
object. You cannot compare two CollationKey
objects unless they originate from the same Collator
. The main
method is as follows:
static public void main(String[] args) {
Collator enUSCollator = Collator.getInstance(new Locale("en","US"));
String [] words = {
"peach",
"apricot",
"grape",
"lemon"
};
CollationKey[] keys = new CollationKey[words.length];
for (int k = 0; k < keys.length; k ++) {
keys[k] = enUSCollator. getCollationKey(words[k]);
}
sortArray(keys);
printArray(keys);
}
The sortArray
method invokes the CollationKey.compareTo
method. The compareTo
method returns an integer less than, equal to, or greater than zero if the keys[i]
object is less than, equal to, or greater than the keys[j]
object. Note that the program compares the CollationKey
objects, not the String
objects from the original array of words. Here is the code for the sortArray
method:
public static void sortArray(CollationKey[] keys) {
CollationKey tmp;
for (int i = 0; i < keys.length; i++) {
for (int j = i + 1; j < keys.length; j++) {
if (keys[i].compareTo(keys[j]) > 0) {
tmp = keys[i];
keys[i] = keys[j];
keys[j] = tmp;
}
}
}
}
The KeysDemo
program sorts an array of CollationKey
objects, but the original goal was to sort an array of String
objects. To retrieve the String
representation of each CollationKey
, the program invokes getSourceString
in the displayWords
method, as follows:
static void displayWords(CollationKey[] keys) {
for (int i = 0; i < keys.length; i++) {
System.out.println(keys[i].getSourceString());
}
}
The displayWords
method prints the following lines:
apricot
grape
lemon
peach