Comparing Strings

Domains: Java

Applications that sort through text perform frequent string comparisons. For example, a report generator performs string comparisons when sorting a list of strings in alphabetical order.

If your application audience is limited to people who speak English, you can probably perform string comparisons with the String.compareTo method. The String.compareTomethod performs a binary comparison of the Unicode characters within the two strings. For most languages, however, this binary comparison cannot be relied on to sort strings, because the Unicode values do not correspond to the relative order of the characters.

Fortunately the Collator class allows your application to perform string comparisons for different languages. In this section, you'll learn how to use the Collator class when sorting text.

Performing Locale-Independent Comparisons

Collation rules define the sort sequence of strings. These rules vary with locale, because various natural languages sort words differently. You can use the predefined collation rules provided by the Collator class to sort strings in a locale-independent manner.

To instantiate the Collator class invoke the getInstance method. Usually, you create a Collator for the default Locale, as in the following example:

			Collator myDefaultCollator = Collator.getInstance();

You can also specify a particular Locale when you create a Collator, as follows:

			Collator myFrenchCollator = Collator.getInstance(Locale.FRENCH);

The getInstance method returns a RuleBasedCollator, which is a concrete subclass of Collator. The RuleBasedCollator contains a set of rules that determine the sort order of strings for the locale you specify. These rules are predefined for each locale. Because the rules are encapsulated within the RuleBasedCollator, your program won't need special routines to deal with the way collation rules vary with language.

You invoke the Collator.compare method to perform a locale-independent string comparison. The compare method returns an integer less than, equal to, or greater than zero when the first string argument is less than, equal to, or greater than the second string argument. The following table contains some sample calls to Collator.compare:

Example Return Value Explanation
myCollator.compare("abc", "def") -1 "abc" is less than "def"
myCollator.compare("rtf", "rtf") 0 the two strings are equal
myCollator.compare("xyz", "abc") 1 "xyz" is greater than "abc"

You use the compare method when performing sort operations. The sample program called CollatorDemo uses the compare method to sort an array of English and French words. This program shows what can happen when you sort the same list of words with two different collators:

			Collator fr_FRCollator = Collator.getInstance(new Locale("fr","FR"));
Collator en_USCollator = Collator.getInstance(new Locale("en","US"));

The method for sorting, called sortStrings, can be used with any Collator. Notice that the sortStrings method invokes the compare method:

			public static void sortStrings(Collator collator, String[] words) {
    String tmp;
    for (int i = 0; i < words.length; i++) {
        for (int j = i + 1; j < words.length; j++) { 
            if (collator.compare(words[i], words[j]) > 0) {
                tmp = words[i];
                words[i] = words[j];
                words[j] = tmp;
            }
        }
    }
}

The English Collator sorts the words as follows:

			peach
péché
pêche
sin

According to the collation rules of the French language, the preceding list is in the wrong order. In French péché should follow pêche in a sorted list. The French Collator sorts the array of words correctly, as follows:

			peach
pêche
péché
sin

Customizing Collation Rules

The previous section discussed how to use the predefined rules for a locale to compare strings. These collation rules determine the sort order of strings. If the predefined collation rules do not meet your needs, you can design your own rules and assign them to a RuleBasedCollator object.

Customized collation rules are contained in a String object that is passed to the RuleBasedCollator constructor. Here's a simple example:

		String simpleRule = "< a < b < c < d";
RuleBasedCollator simpleCollator =  new RuleBasedCollator(simpleRule);

For the simpleCollator object in the previous example, a is less than b, which is less that c, and so forth. The simpleCollator.compare method references these rules when comparing strings. The full syntax used to construct a collation rule is more flexible and complex than this simple example. For a full description of the syntax, refer to the API documentation for the RuleBasedCollator class.

The example that follows sorts a list of Spanish words with two collators. Full source code for this example is in RulesDemo.java.

The RulesDemo program starts by defining collation rules for English and Spanish. The program will sort the Spanish words in the traditional manner. When sorting by the traditional rules, the letters ch and ll and their uppercase equivalents each have their own positions in the sort order. These character pairs compare as if they were one character. For example, ch sorts as a single letter, following cz in the sort order. Note how the rules for the two collators differ:

		String englishRules = (
    "< a,A < b,B < c,C < d,D < e,E < f,F " +
    "< g,G < h,H < i,I < j,J < k,K < l,L " +
    "< m,M < n,N < o,O < p,P < q,Q < r,R " +
    "< s,S < t,T < u,U < v,V < w,W < x,X " +
    "< y,Y < z,Z");

String smallnTilde = new String("\u00F1");    // ñ
String capitalNTilde = new String("\u00D1");  // Ñ

String traditionalSpanishRules = (
    "< a,A < b,B < c,C " +
    "< ch, cH, Ch, CH " +
    "< d,D < e,E < f,F " +
    "< g,G < h,H < i,I < j,J < k,K < l,L " +
    "< ll, lL, Ll, LL " +
    "< m,M < n,N " +
    "< " + smallnTilde + "," + capitalNTilde + " " +
    "< o,O < p,P < q,Q < r,R " +
    "< s,S < t,T < u,U < v,V < w,W < x,X " +
    "< y,Y < z,Z");

The following lines of code create the collators and invoke the sort routine:

		try {
    RuleBasedCollator enCollator = new RuleBasedCollator(englishRules);
    RuleBasedCollator spCollator =
        new RuleBasedCollator(traditionalSpanishRules);

    sortStrings(enCollator, words);
    printStrings(words);
    System.out.println();

    sortStrings(spCollator, words);
    printStrings(words);
} catch (ParseException pe) {
    System.out.println("Parse exception for rules");
}

The sort routine, called sortStrings, is generic. It will sort any array of words according to the rules of any Collator object:

		public static void sortStrings(Collator collator, String[] words) {
    String tmp;
    for (int i = 0; i < words.length; i++) {
        for (int j = i + 1; j < words.length; j++) {
            if (collator.compare(words[i], words[j]) > 0) {
                tmp = words[i];
                words[i] = words[j];
                words[j] = tmp;
            }
        }
    }
}

When sorted with the English collation rules, the array of words is as follows:

		chalina
curioso
llama
luz

Compare the preceding list with the following, which is sorted according to the traditional Spanish rules of collation:

		curioso
chalina
luz
llama

Improving Collation Performance

Sorting long lists of strings is often time consuming. If your sort algorithm compares strings repeatedly, you can speed up the process by using the CollationKey class.

CollationKey object represents a sort key for a given String and Collator. Comparing two CollationKey objects involves a bitwise comparison of sort keys and is faster than comparing String objects with the Collator.compare method. However, generating CollationKey objects requires time. Therefore if a String is to be compared just once, Collator.compare offers better performance.

The example that follows uses a CollationKey object to sort an array of words. Source code for this example is in KeysDemo.java.

The KeysDemo program creates an array of CollationKey objects in the main method. To create a CollationKey, you invoke the getCollationKey method on a Collatorobject. You cannot compare two CollationKey objects unless they originate from the same Collator. The main method is as follows:

		static public void main(String[] args) {
    Collator enUSCollator = Collator.getInstance(new Locale("en","US"));
    String [] words = {
        "peach",
        "apricot",
        "grape",
        "lemon"
    };

    CollationKey[] keys = new CollationKey[words.length];

    for (int k = 0; k < keys.length; k ++) {
        keys[k] = enUSCollator. getCollationKey(words[k]);
    }

    sortArray(keys);
    printArray(keys);
}

The sortArray method invokes the CollationKey.compareTo method. The compareTo method returns an integer less than, equal to, or greater than zero if the keys[i]object is less than, equal to, or greater than the keys[j] object. Note that the program compares the CollationKey objects, not the String objects from the original array of words. Here is the code for the sortArray method:

		public static void sortArray(CollationKey[] keys) {
    CollationKey tmp;

    for (int i = 0; i < keys.length; i++) {
        for (int j = i + 1; j < keys.length; j++) {
            if (keys[i].compareTo(keys[j]) > 0) {
                tmp = keys[i];
                keys[i] = keys[j];
                keys[j] = tmp; 
            }
        }
    }
}

The KeysDemo program sorts an array of CollationKey objects, but the original goal was to sort an array of String objects. To retrieve the String representation of each CollationKey, the program invokes getSourceString in the displayWords method, as follows:

		static void displayWords(CollationKey[] keys) {
    for (int i = 0; i < keys.length; i++) {
        System.out.println(keys[i].getSourceString());
    }
}

The displayWords method prints the following lines:

		apricot
grape
lemon
peach

Similar pages

Page structure
Terms

Object

Class

Java

Sorting Algorithm

Exceptions

long

Subclass

Processes