Detecting Text Boundaries

Domains: Java

Applications that manipulate text need to locate boundaries within the text. For example, consider some of the common functions of a word processor: highlighting a character, cutting a word, moving the cursor to the next sentence, and wrapping a word at a line ending. To perform each of these functions, the word processor must be able to detect the logical boundaries in the text. Fortunately you don't have to write your own routines to perform boundary analysis. Instead, you can take advantage of the methods provided by the BreakIterator class.

About the BreakIterator Class

The BreakIterator class is locale-sensitive, because text boundaries vary with language. For example, the syntax rules for line breaks are not the same for all languages. To determine which locales the BreakIterator class supports, invoke the getAvailableLocales method, as follows:

		Locale[] locales = BreakIterator.getAvailableLocales();

You can analyze four kinds of boundaries with the BreakIterator class: character, word, sentence, and potential line break. When instantiating a BreakIterator, you invoke the appropriate factory method:

getCharacterInstance
getWordInstance
getSentenceInstance
getLineInstance

Each instance of BreakIterator can detect just one type of boundary. If you want to locate both character and word boundaries, for example, you create two separate instances.

A BreakIterator has an imaginary cursor that points to the current boundary in a string of text. You can move this cursor within the text with the previous and the next methods. For example, if you've created a BreakIterator with getWordInstance, the cursor moves to the next word boundary in the text every time you invoke the next method. The cursor-movement methods return an integer indicating the position of the boundary. This position is the index of the character in the text string that would follow the boundary. Like string indexes, the boundaries are zero-based. The first boundary is at 0, and the last boundary is the length of the string. The following figure shows the word boundaries detected by the next and previous methods in a line of text:

This figure has been reduced to fit on the page.

You should use the BreakIterator class only with natural-language text. To tokenize a programming language, use the StreamTokenizer class.The sections that follow give examples for each type of boundary analysis. The coding examples are from the source code file named BreakIteratorDemo.java.

Character Boundaries

You need to locate character boundaries if your application allows the end user to highlight individual characters or to move a cursor through text one character at a time. To create a BreakIterator that locates character boundaries, you invoke the getCharacterInstance method, as follows:

			BreakIterator characterIterator =
    BreakIterator.getCharacterInstance(currentLocale);

This type of BreakIterator detects boundaries between user characters, not just Unicode characters.

A user character may be composed of more than one Unicode character. For example, the user character ü can be composed by combining the Unicode characters \u0075 (u) and \u00a8 (¨). This isn't the best example, however, because the character ü may also be represented by the single Unicode character \u00fc. We'll draw on the Arabic language for a more realistic example.

In Arabic the word for house is:

This word contains three user characters, but it is composed of the following six Unicode characters:

			String house = "\u0628" + "\u064e" + "\u064a" + "\u0652" + "\u067a" + "\u064f";

The Unicode characters at positions 1, 3, and 5 in the house string are diacritics. Arabic requires diacritics because they can alter the meanings of words. The diacritics in the example are nonspacing characters, since they appear above the base characters. In an Arabic word processor you cannot move the cursor on the screen once for every Unicode character in the string. Instead you must move it once for every user character, which may be composed by more than one Unicode character. Therefore you must use a BreakIterator to scan the user characters in the string.

The sample program BreakIteratorDemo, creates a BreakIterator to scan Arabic characters. The program passes this BreakIterator, along with the String object created previously, to a method named listPositions:

			BreakIterator arCharIterator = BreakIterator.getCharacterInstance(
                                   new Locale ("ar","SA"));
listPositions (house, arCharIterator);

The listPositions method uses a BreakIterator to locate the character boundaries in the string. Note that the BreakIteratorDemo assigns a particular string to the BreakIterator with the setText method. The program retrieves the first character boundary with the first method and then invokes the next method until the constant BreakIterator.DONE is returned. The code for this routine is as follows:

			static void listPositions(String target, BreakIterator iterator) {
                
    iterator.setText(target);
    int boundary = iterator.first();

    while (boundary != BreakIterator.DONE) {
        System.out.println (boundary);
        boundary = iterator.next();
    }
}

The listPositions method prints out the following boundary positions for the user characters in the string house. Note that the positions of the diacritics (1, 3, 5) are not listed:

Word Boundaries

You invoke the getWordIterator method to instantiate a BreakIterator that detects word boundaries:

		BreakIterator wordIterator =
    BreakIterator.getWordInstance(currentLocale);

You'll want to create such a BreakIterator when your application needs to perform operations on individual words. These operations might be common word- processing functions, such as selecting, cutting, pasting, and copying. Or, your application may search for words, and it must be able to distinguish entire words from simple strings.

When a BreakIterator analyzes word boundaries, it differentiates between words and characters that are not part of words. These characters, which include spaces, tabs, punctuation marks, and most symbols, have word boundaries on both sides.

The example that follows, which is from the program BreakIteratorDemo, marks the word boundaries in some text. The program creates the BreakIterator and then calls the markBoundaries method:

		Locale currentLocale = new Locale ("en","US");

BreakIterator wordIterator =
    BreakIterator.getWordInstance(currentLocale);

String someText = "She stopped. " +
    "She said, \"Hello there,\" and then went " +
    "on.";

markBoundaries(someText, wordIterator);

The markBoundaries method is defined in BreakIteratorDemo.java. This method marks boundaries by printing carets (^) beneath the target string. In the code that follows, notice the while loop where markBoundaries scans the string by calling the next method:

		static void markBoundaries(String target, BreakIterator iterator) {

    StringBuffer markers = new StringBuffer();
    markers.setLength(target.length() + 1);
    for (int k = 0; k < markers.length(); k++) {
        markers.setCharAt(k,' ');
    }

    iterator.setText(target);
    int boundary = iterator.first();

    while (boundary != BreakIterator.DONE) {
        markers.setCharAt(boundary,'^');
        boundary = iterator.next();
    }

    System.out.println(target);
    System.out.println(markers);
}

The output of the markBoundaries method follows. Note where the carets (^) occur in relation to the punctuation marks and spaces:

		She stopped.  She said, "Hello there," and then
^  ^^      ^^ ^  ^^   ^^^^    ^^    ^^^^  ^^   ^

went on.
^   ^^ ^^

The BreakIterator class makes it easy to select words from within text. You don't have to write your own routines to handle the punctuation rules of various languages; the BreakIterator class does this for you.

The extractWords method in the following example extracts and prints words for a given string. Note that this method uses Character.isLetterOrDigit to avoid printing "words" that contain space characters.

		static void extractWords(String target, BreakIterator wordIterator) {

    wordIterator.setText(target);
    int start = wordIterator.first();
    int end = wordIterator.next();

    while (end != BreakIterator.DONE) {
        String word = target.substring(start,end);
        if (Character.isLetterOrDigit(word.charAt(0))) {
            System.out.println(word);
        }
        start = end;
        end = wordIterator.next();
    }
}

The BreakIteratorDemo program invokes extractWords, passing it the same target string used in the previous example. The extractWords method prints out the following list of words:

		She
stopped
She
said
Hello
there
and
then
went
on

Sentence Boundaries

You can use a BreakIterator to determine sentence boundaries. You start by creating a BreakIterator with the getSentenceInstance method:

		BreakIterator sentenceIterator =
    BreakIterator.getSentenceInstance(currentLocale);

To show the sentence boundaries, the program uses the markBoundaries method, which is discussed in the section Word Boundaries. The markBoundaries method prints carets (^) beneath a string to indicate boundary positions. Here are some examples:

		She stopped.  She said, "Hello there," and then went on.
^             ^                                         ^

He's vanished!  What will we do?  It's up to us.
^               ^                 ^             ^

Please add 1.5 liters to the tank.

Line Boundaries

Applications that format text or that perform line wrapping must locate potential line breaks. You can find these line breaks, or boundaries, with a BreakIterator that has been created with the getLineInstance method:

		BreakIterator lineIterator =
    BreakIterator.getLineInstance(currentLocale);

This BreakIterator determines the positions in a string where text can break to continue on the next line. The positions detected by the BreakIterator are potential line breaks. The actual line breaks displayed on the screen may not be the same.

The two examples that follow use the markBoundaries method of BreakIteratorDemo.java to show the line boundaries detected by a BreakIterator. The markBoundaries method indicates line boundaries by printing carets (^) beneath the target string.

According to a BreakIterator, a line boundary occurs after the termination of a sequence of whitespace characters (space, tab, new line). In the following example, note that you can break the line at any of the boundaries detected:

		She stopped.  She said, "Hello there," and then went on.
^   ^         ^   ^     ^      ^     ^ ^   ^    ^    ^  ^

Potential line breaks also occur immediately after a hyphen:

		There are twenty-four hours in a day.
^     ^   ^      ^    ^     ^  ^ ^   ^

The next example breaks a long string of text into fixed-length lines with a method called formatLines. This method uses a BreakIterator to locate the potential line breaks. The formatLines method is short, simple, and, thanks to the BreakIterator, locale-independent. Here is the source code:

		static void formatLines(
    String target, int maxLength,
    Locale currentLocale) {

    BreakIterator boundary = BreakIterator.
        getLineInstance(currentLocale);
    boundary.setText(target);
    int start = boundary.first();
    int end = boundary.next();
    int lineLength = 0;

    while (end != BreakIterator.DONE) {
        String word = target.substring(start,end);
        lineLength = lineLength + word.length();
        if (lineLength >= maxLength) {
            System.out.println();
            lineLength = word.length();
        }
        System.out.print(word);
        start = end;
        end = boundary.next();
    }
}

The BreakIteratorDemo program invokes the formatLines method as follows:

		String moreText =
    "She said, \"Hello there,\" and then " +
    "went on down the street. When she stopped " +
    "to look at the fur coats in a shop + "
    "window, her dog growled. \"Sorry Jake,\" " +
    "she said. \"I didn't know you would take " +
    "it personally.\"";

formatLines(moreText, 30, currentLocale);

The output from this call to formatLines is:

		She said, "Hello there," and
then went on down the
street. When she stopped to
look at the fur coats in a
shop window, her dog
growled. "Sorry Jake," she
said. "I didn't know you
would take it personally."

In addition to the eight primitive data types listed above, the Java programming language also provides special support for character strings via the java.lang.String class. Enclosing your character string within double quotes will automatically create a new String object; for example, String s = "this is a string";..

An Iterator is an object that enables you to traverse through a collection and to remove elements from the collection selectively, if desired. You get an Iterator for a collection by calling its iterator method.. Use Iterator instead of the for-each construct when you need to: remove the current element; iterate over multiple collections in parallel..

Detecting Text Boundaries

About the BreakIterator Class

Character Boundaries

Word Boundaries

Sentence Boundaries

Line Boundaries

Similar pages

Catching and Handling Exceptions

Custom Collection Implementations

Algorithms

String

Class

int

Java

Iterator

Output

Object

short

long