Detecting Text Boundaries
Applications that manipulate text need to locate boundaries within the text. For example, consider some of the common functions of a word processor: highlighting a character, cutting a word, moving the cursor to the next sentence, and wrapping a word at a line ending. To perform each of these functions, the word processor must be able to detect the logical boundaries in the text. Fortunately, you don't have to write your own routines to perform boundary analysis. Instead, you can take advantage of the methods provided by the BreakIterator class.
About the BreakIterator Class
The BreakIterator class is locale-sensitive, because text boundaries vary with language. For example, the syntax rules for line breaks are not the same for all languages. To determine which locales the BreakIterator class supports, invoke the getAvailableLocales method, as follows:
Locale[] locales = BreakIterator.getAvailableLocales();
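The call above can be tried in a small standalone program. This is a minimal sketch; the class name AvailableBreakLocales and the helper supportedLocales are illustrative and are not part of BreakIteratorDemo.java. The locales printed depend on the Java runtime in use.

```java
import java.text.BreakIterator;
import java.util.Locale;

class AvailableBreakLocales {
    // Returns the locales for which this runtime can supply BreakIterator instances.
    static Locale[] supportedLocales() {
        return BreakIterator.getAvailableLocales();
    }

    public static void main(String[] args) {
        for (Locale locale : supportedLocales()) {
            // Print the language tag next to a human-readable name.
            System.out.println(locale.toLanguageTag() + "  " + locale.getDisplayName());
        }
    }
}
```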
You can analyze four kinds of boundaries with the BreakIterator class: character, word, sentence, and potential line break. When instantiating a BreakIterator, you invoke the appropriate factory method:
- getCharacterInstance
- getWordInstance
- getSentenceInstance
- getLineInstance
Each instance of BreakIterator can detect just one type of boundary. If you want to locate both character and word boundaries, for example, you create two separate instances.
A BreakIterator has an imaginary cursor that points to the current boundary in a string of text. You can move this cursor within the text with the previous and the next methods. For example, if you've created a BreakIterator with getWordInstance, the cursor moves to the next word boundary in the text every time you invoke the next method. The cursor-movement methods return an integer indicating the position of the boundary. This position is the index of the character in the text string that would follow the boundary. Like string indexes, the boundaries are zero-based. The first boundary is at 0, and the last boundary is the length of the string. The following figure shows the word boundaries detected by the next and previous methods in a line of text:
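The cursor behavior described above can be sketched in a few lines. The class CursorWalk and its two helpers are hypothetical names chosen for this illustration. The sketch walks a word iterator forward with next and backward with previous, and shows that the first boundary is 0 and the last is the length of the string.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class CursorWalk {
    // Collects every boundary from first() onward by calling next().
    static List<Integer> forward(String text, BreakIterator iterator) {
        iterator.setText(text);
        List<Integer> boundaries = new ArrayList<>();
        for (int b = iterator.first(); b != BreakIterator.DONE; b = iterator.next()) {
            boundaries.add(b);
        }
        return boundaries;
    }

    // Collects every boundary from last() backward by calling previous().
    static List<Integer> backward(String text, BreakIterator iterator) {
        iterator.setText(text);
        List<Integer> boundaries = new ArrayList<>();
        for (int b = iterator.last(); b != BreakIterator.DONE; b = iterator.previous()) {
            boundaries.add(b);
        }
        return boundaries;
    }

    public static void main(String[] args) {
        String text = "Hello there";
        BreakIterator words = BreakIterator.getWordInstance(Locale.US);
        System.out.println(forward(text, words));   // starts at 0, ends at text.length()
        System.out.println(backward(text, words));  // the same boundaries in reverse order
    }
}
```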
You should use the BreakIterator class only with natural-language text. To tokenize a programming language, use the StreamTokenizer class.
The sections that follow give examples for each type of boundary analysis. The coding examples are from the source code file named BreakIteratorDemo.java.
Character Boundaries
You need to locate character boundaries if your application allows the end user to highlight individual characters or to move a cursor through text one character at a time. To create a BreakIterator that locates character boundaries, you invoke the getCharacterInstance method, as follows:
BreakIterator characterIterator =
BreakIterator.getCharacterInstance(currentLocale);
This type of BreakIterator detects boundaries between user characters, not just Unicode characters.
A user character may be composed of more than one Unicode character. For example, the user character ü can be composed by combining the Unicode characters \u0075 (u) and \u0308 (the combining diaeresis). This isn't the best example, however, because the character ü may also be represented by the single Unicode character \u00fc. We'll draw on the Arabic language for a more realistic example.
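The two representations of ü can be compared with a character iterator. The following is a minimal sketch; the class ComposedCharacter and the helper characterBoundaries are illustrative names, and Locale.ROOT is an arbitrary choice for this example. The decomposed form occupies two char positions but should still be reported as a single user character.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class ComposedCharacter {
    // Lists the user-character boundaries that a character iterator reports for text.
    static List<Integer> characterBoundaries(String text) {
        BreakIterator iterator = BreakIterator.getCharacterInstance(Locale.ROOT);
        iterator.setText(text);
        List<Integer> boundaries = new ArrayList<>();
        for (int b = iterator.first(); b != BreakIterator.DONE; b = iterator.next()) {
            boundaries.add(b);
        }
        return boundaries;
    }

    public static void main(String[] args) {
        String decomposed = "u\u0308";   // base letter u followed by a combining diaeresis
        String precomposed = "\u00fc";   // the single precomposed character
        System.out.println(decomposed.length() + " chars, boundaries " + characterBoundaries(decomposed));
        System.out.println(precomposed.length() + " char, boundaries " + characterBoundaries(precomposed));
    }
}
```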
In Arabic the word for house is:
This word contains three user characters, but it is composed of the following six Unicode characters:
String house = "\u0628" + "\u064e" + "\u064a" + "\u0652" + "\u067a" + "\u064f";
The Unicode characters at positions 1, 3, and 5 in the house string are diacritics. Arabic requires diacritics because they can alter the meanings of words. The diacritics in the example are nonspacing characters, since they appear above the base characters. In an Arabic word processor you cannot move the cursor on the screen once for every Unicode character in the string. Instead you must move it once for every user character, which may be composed of more than one Unicode character. Therefore you must use a BreakIterator to scan the user characters in the string.
The sample program BreakIteratorDemo creates a BreakIterator to scan Arabic characters. The program passes this BreakIterator, along with the String object created previously, to a method named listPositions:
BreakIterator arCharIterator = BreakIterator.getCharacterInstance(
new Locale ("ar","SA"));
listPositions (house, arCharIterator);
The listPositions method uses a BreakIterator to locate the character boundaries in the string. Note that BreakIteratorDemo assigns a particular string to the BreakIterator with the setText method. The program retrieves the first character boundary with the first method and then invokes the next method until the constant BreakIterator.DONE is returned. The code for this routine is as follows:
static void listPositions(String target, BreakIterator iterator) {
    iterator.setText(target);
    int boundary = iterator.first();
    while (boundary != BreakIterator.DONE) {
        System.out.println(boundary);
        boundary = iterator.next();
    }
}
The listPositions method prints out the following boundary positions for the user characters in the string house. Note that the positions of the diacritics (1, 3, 5) are not listed:
0
2
4
6
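Once the boundaries are known, the text between each pair of adjacent boundaries is one user character. The sketch below is an illustration only; the class UserCharacters and the method split are hypothetical and do not appear in BreakIteratorDemo.java. It cuts the house string into its three user characters, each made of a base letter plus one diacritic.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class UserCharacters {
    // Splits text into user characters, one substring per pair of adjacent boundaries.
    static List<String> split(String text, Locale locale) {
        BreakIterator iterator = BreakIterator.getCharacterInstance(locale);
        iterator.setText(text);
        List<String> pieces = new ArrayList<>();
        int start = iterator.first();
        for (int end = iterator.next(); end != BreakIterator.DONE; end = iterator.next()) {
            pieces.add(text.substring(start, end));
            start = end;
        }
        return pieces;
    }

    public static void main(String[] args) {
        // The same six Unicode characters as the house string in the text.
        String house = "\u0628\u064e\u064a\u0652\u067a\u064f";
        List<String> pieces = split(house, Locale.forLanguageTag("ar-SA"));
        System.out.println(house.length() + " Unicode characters, "
                + pieces.size() + " user characters");
    }
}
```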
Word Boundaries
You invoke the getWordInstance method to instantiate a BreakIterator that detects word boundaries:
BreakIterator wordIterator =
BreakIterator.getWordInstance(currentLocale);
You'll want to create such a BreakIterator when your application needs to perform operations on individual words. These operations might be common word-processing functions, such as selecting, cutting, pasting, and copying. Or, your application may search for words, and it must be able to distinguish entire words from simple strings.
When a BreakIterator analyzes word boundaries, it differentiates between words and characters that are not part of words. These characters, which include spaces, tabs, punctuation marks, and most symbols, have word boundaries on both sides.
The example that follows, which is from the program BreakIteratorDemo, marks the word boundaries in some text. The program creates the BreakIterator and then calls the markBoundaries method:
Locale currentLocale = new Locale("en", "US");
BreakIterator wordIterator =
    BreakIterator.getWordInstance(currentLocale);
String someText = "She stopped.  " +
    "She said, \"Hello there,\" and then went " +
    "on.";
markBoundaries(someText, wordIterator);
The markBoundaries method is defined in BreakIteratorDemo.java. This method marks boundaries by printing carets (^) beneath the target string. In the code that follows, notice the while loop where markBoundaries scans the string by calling the next method:
static void markBoundaries(String target, BreakIterator iterator) {
    StringBuffer markers = new StringBuffer();
    markers.setLength(target.length() + 1);
    for (int k = 0; k < markers.length(); k++) {
        markers.setCharAt(k, ' ');
    }
    iterator.setText(target);
    int boundary = iterator.first();
    while (boundary != BreakIterator.DONE) {
        markers.setCharAt(boundary, '^');
        boundary = iterator.next();
    }
    System.out.println(target);
    System.out.println(markers);
}
The output of the markBoundaries method follows. Note where the carets (^) occur in relation to the punctuation marks and spaces:
She stopped.  She said, "Hello there," and then
^  ^^      ^^ ^  ^^   ^^^^    ^^    ^^^^  ^^   ^
went on.
^   ^^ ^^
The BreakIterator class makes it easy to select words from within text. You don't have to write your own routines to handle the punctuation rules of various languages; the BreakIterator class does this for you.
The extractWords method in the following example extracts and prints words for a given string. Note that this method uses Character.isLetterOrDigit to avoid printing "words" that begin with a space or a punctuation character.
static void extractWords(String target, BreakIterator wordIterator) {
    wordIterator.setText(target);
    int start = wordIterator.first();
    int end = wordIterator.next();
    while (end != BreakIterator.DONE) {
        String word = target.substring(start, end);
        if (Character.isLetterOrDigit(word.charAt(0))) {
            System.out.println(word);
        }
        start = end;
        end = wordIterator.next();
    }
}
The BreakIteratorDemo program invokes extractWords, passing it the same target string used in the previous example. The extractWords method prints out the following list of words:
She
stopped
She
said
Hello
there
and
then
went
on
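Word boundaries also support selection, one of the word-processing functions mentioned earlier. The following sketch is not taken from BreakIteratorDemo.java; the class WordAtOffset and the method segmentAt are hypothetical. It uses the following and previous methods of BreakIterator to find the boundaries on either side of a given character offset, much as a double-click selection might.

```java
import java.text.BreakIterator;
import java.util.Locale;

class WordAtOffset {
    // Returns the segment (a word, or a run of non-word characters) containing the
    // character at the given offset. The offset must satisfy 0 <= offset < text.length().
    static String segmentAt(String text, int offset, Locale locale) {
        if (offset < 0 || offset >= text.length()) {
            throw new IndexOutOfBoundsException("offset " + offset);
        }
        BreakIterator iterator = BreakIterator.getWordInstance(locale);
        iterator.setText(text);
        int end = iterator.following(offset);  // first boundary strictly after offset
        int start = iterator.previous();       // the boundary just before that one
        return text.substring(start, end);
    }

    public static void main(String[] args) {
        String text = "She stopped. She said";
        System.out.println(segmentAt(text, 6, Locale.US));  // offset 6 lies inside "stopped"
    }
}
```

A caller that wants only real words could apply the same Character.isLetterOrDigit check that extractWords uses to the returned segment.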
Sentence Boundaries
You can use a BreakIterator to determine sentence boundaries. You start by creating a BreakIterator with the getSentenceInstance method:
BreakIterator sentenceIterator =
BreakIterator.getSentenceInstance(currentLocale);
To show the sentence boundaries, the program uses the markBoundaries method, which is discussed in the section Word Boundaries. The markBoundaries method prints carets (^) beneath a string to indicate boundary positions. Here are some examples:
She stopped. She said, "Hello there," and then went on.
^            ^                                         ^
He's vanished! What will we do? It's up to us.
^              ^                ^             ^
Please add 1.5 liters to the tank.
^                                 ^
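The same boundaries can be used to cut text into sentences. This is a minimal sketch under stated assumptions: the class SentenceSplitter and the method split are illustrative names, and trimming the whitespace that trails each sentence is a choice made for this example, not something the iterator does.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class SentenceSplitter {
    // Returns the sentences in text, with surrounding whitespace trimmed from each one.
    static List<String> split(String text, Locale locale) {
        BreakIterator iterator = BreakIterator.getSentenceInstance(locale);
        iterator.setText(text);
        List<String> sentences = new ArrayList<>();
        int start = iterator.first();
        for (int end = iterator.next(); end != BreakIterator.DONE; end = iterator.next()) {
            sentences.add(text.substring(start, end).trim());
            start = end;
        }
        return sentences;
    }

    public static void main(String[] args) {
        for (String sentence : split("He's vanished! What will we do? It's up to us.", Locale.US)) {
            System.out.println(sentence);
        }
        // The decimal point in 1.5 is not followed by a space, so it should not end a sentence.
        System.out.println(split("Please add 1.5 liters to the tank.", Locale.US).size());
    }
}
```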
Line Boundaries
Applications that format text or that perform line wrapping must locate potential line breaks. You can find these line breaks, or boundaries, with a BreakIterator that has been created with the getLineInstance method:
BreakIterator lineIterator =
BreakIterator.getLineInstance(currentLocale);
This BreakIterator determines the positions in a string where text can break to continue on the next line. The positions detected by the BreakIterator are potential line breaks. The actual line breaks displayed on the screen may not be the same.
The two examples that follow use the markBoundaries method of BreakIteratorDemo.java to show the line boundaries detected by a BreakIterator. The markBoundaries method indicates line boundaries by printing carets (^) beneath the target string.
According to a BreakIterator, a line boundary occurs after the termination of a sequence of whitespace characters (space, tab, new line). In the following example, note that you can break the line at any of the boundaries detected:
She stopped. She said, "Hello there," and then went on.
^ ^ ^ ^ ^ ^ ^ ^ ^ ^ ^ ^
Potential line breaks also occur immediately after a hyphen:
There are twenty-four hours in a day.
^     ^   ^      ^    ^     ^  ^ ^   ^
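The hyphen case can be checked programmatically. The sketch below uses hypothetical names (the class LineBreakOpportunities and the method positions) and simply collects the potential line breaks for the sentence above. Position 17 is the index of the f in four, immediately after the hyphen.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class LineBreakOpportunities {
    // Returns every potential line-break position reported for text.
    static List<Integer> positions(String text, Locale locale) {
        BreakIterator iterator = BreakIterator.getLineInstance(locale);
        iterator.setText(text);
        List<Integer> result = new ArrayList<>();
        for (int b = iterator.first(); b != BreakIterator.DONE; b = iterator.next()) {
            result.add(b);
        }
        return result;
    }

    public static void main(String[] args) {
        String text = "There are twenty-four hours in a day.";
        List<Integer> breaks = positions(text, Locale.US);
        System.out.println(breaks);
        // Reports whether a break is allowed right after the hyphen in "twenty-four".
        System.out.println("break after hyphen: " + breaks.contains(text.indexOf("four")));
    }
}
```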
The next example breaks a long string of text into fixed-length lines with a method called formatLines. This method uses a BreakIterator to locate the potential line breaks. The formatLines method is short, simple, and, thanks to the BreakIterator, locale-independent. Here is the source code:
static void formatLines(
        String target, int maxLength,
        Locale currentLocale) {
    BreakIterator boundary = BreakIterator.
        getLineInstance(currentLocale);
    boundary.setText(target);
    int start = boundary.first();
    int end = boundary.next();
    int lineLength = 0;
    while (end != BreakIterator.DONE) {
        String word = target.substring(start, end);
        lineLength = lineLength + word.length();
        if (lineLength >= maxLength) {
            System.out.println();
            lineLength = word.length();
        }
        System.out.print(word);
        start = end;
        end = boundary.next();
    }
}
The BreakIteratorDemo program invokes the formatLines method as follows:
String moreText =
    "She said, \"Hello there,\" and then " +
    "went on down the street. When she stopped " +
    "to look at the fur coats in a shop " +
    "window, her dog growled. \"Sorry Jake,\" " +
    "she said. \"I didn't know you would take " +
    "it personally.\"";
formatLines(moreText, 30, currentLocale);
The output from this call to formatLines is:
She said, "Hello there," and
then went on down the
street. When she stopped to
look at the fur coats in a
shop window, her dog
growled. "Sorry Jake," she
said. "I didn't know you
would take it personally."