Free Online Courses for Software Developers - MrBool
How to find the correct string with Java

In this article we will discuss how the Java language supports text search across natural languages.

Introduction:

Text search is a common algorithm that is taught in most engineering institutions as an introductory programming exercise. Most of these algorithms use tables that hold an entry for every character. This approach works well for traditional character sets such as ASCII or ISO Latin-1, where there are only 128 or 256 possible characters. Java uses the Unicode character set; a Java char is a 16-bit value with 65,536 possible values, which covers almost all modern languages in the world, including Chinese characters, so a per-character table becomes far less practical.

Comparison rules also vary by language. Some accented letters are treated as minor variants of the base letter: the character "é" in the word "café" is treated as a variant of "e". Other accented characters are letters in their own right: "Å" in Danish is a separate letter that sorts at the end of the alphabet, after "Z", "Æ" and "Ø". In German, the characters "ä", "ö" and "ü" may be treated as "ae", "oe" and "ue" respectively, and "ß" may be spelled "ss".
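To make the "minor variant" idea concrete, here is a minimal sketch using java.text.Collator, the class covered in the next section. The class name AccentCompareSketch and the helper sameBaseLetters are hypothetical names chosen for illustration, and the French locale is just one possible choice.

```java
import java.text.Collator;
import java.util.Locale;

public class AccentCompareSketch {

    // Hypothetical helper: true when both strings have the same base letters,
    // ignoring accents and case (PRIMARY strength).
    static boolean sameBaseLetters(String a, String b, Locale locale) {
        Collator collator = Collator.getInstance(locale);
        collator.setStrength(Collator.PRIMARY);
        return collator.compare(a, b) == 0;
    }

    public static void main(String[] args) {
        String plain = "cafe";
        String accented = "caf\u00e9"; // "café", written with an escape to avoid encoding issues

        // Character-by-character comparison sees two different strings.
        System.out.println(plain.equals(accented));
        // Locale-aware comparison at PRIMARY strength treats the accent as a minor variant.
        System.out.println(sameBaseLetters(plain, accented, Locale.FRANCE));
    }
}
```

The first line printed is false and the second is true, which is the difference between a raw character comparison and a collation-based one.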

Use of Collator class:

Java 1.1 introduced the Collator and RuleBasedCollator classes in the java.text package. Collator is an abstract class that provides methods for comparing text. RuleBasedCollator is a concrete subclass of Collator that implements rule-driven comparison algorithms. The following code shows a simple use of Collator.

Listing 1: Sample implementation of Collator

package com.home.searchText;

import java.text.Collator;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

public class SampleTextSearch {

	public static void main(String[] args) {

		// This data is based on an example in Java Class Libraries,

		List<String> wordList = Arrays.asList( "Äbc", "äbc", "Àbc", "àbc", "Abc", "abc", "ABC" ) ;

		System.out.println( "Sort results are based on the  'Collation Strength' values " );
		System.out.println(wordList + " - Original Data");
		sort(wordList, Strength.Primary);
		sort(wordList, Strength.Secondary);
		sort(wordList, Strength.Tertiary);

		System.out.println(EMPTY_LINE);
		System.out.println("Tertiary Collation - Case sensitive : ");
		List<String> wordsForCase = Arrays.asList("cache", "CACHE", "Cache");
		System.out.println(wordsForCase + " - Original Data");
		sort(wordsForCase, Strength.Primary);
		sort(wordsForCase, Strength.Secondary);
		sort(wordsForCase, Strength.Tertiary);

		System.out.println(EMPTY_LINE);
		System.out.println("Secondary Collation - Accent sensitive.");
		System.out.println("Compare with no accents present: ");
		compare("abc", "ABC", Strength.Primary);
		compare("abc", "ABC", Strength.Secondary);
		compare("abc", "ABC", Strength.Tertiary);

		System.out.println(EMPTY_LINE);
		System.out.println("Compare with accents present: ");
		compare("abc", "ÀBC", Strength.Primary);
		compare("abc", "ÀBC", Strength.Secondary);
		compare("abc", "ÀBC", Strength.Tertiary);
	}

	// PRIVATE //
	private static final String EMPTY_LINE = "";
	private static final Locale TEST_LOCALE = Locale.FRANCE;

	/** Transform some Collator 'int' consts into an equivalent enum. */
	private enum Strength {
		Primary(Collator.PRIMARY), // base char
		Secondary(Collator.SECONDARY), // base char + accent
		Tertiary(Collator.TERTIARY), // base char + accent + case
		Identical(Collator.IDENTICAL); // base char + accent + case + bits

		int getStrength() {
			return fStrength;
		}

		private int fStrength;

		private Strength(int aStrength) {
			fStrength = aStrength;
		}
	}

	private static void sort(List<String> aWords, Strength aStrength) {
		Collator collator = Collator.getInstance(TEST_LOCALE);
		collator.setStrength(aStrength.getStrength());
		Collections.sort(aWords, collator);
		System.out.println(aWords.toString() + " " + aStrength);
	}

	private static void compare(String str1, String str2, Strength aStrength) {
		Collator collator = Collator.getInstance(TEST_LOCALE);
		collator.setStrength(aStrength.getStrength());
		int comparison = collator.compare(str1, str2);
		if (comparison == 0) {
			System.out.println("String : " + str1 + " and " + str2 + " are same with Strength - " + aStrength);
		} else {
			System.out.println("String : " + str1 + " and " + str2 + " are different with Strength- " + aStrength);
		}
	}
}

The above program produces the following output:

Sort results are based on the  'Collation Strength' values

[Äbc, äbc, Àbc, àbc, Abc, abc, ABC] - Original Data
[Äbc, äbc, Àbc, àbc, Abc, abc, ABC] Primary
[Abc, abc, ABC, Àbc, àbc, Äbc, äbc] Secondary
[abc, Abc, ABC, àbc, Àbc, äbc, Äbc] Tertiary

Tertiary Collation - Case sensitive:

[cache, CACHE, Cache] - Original Data
[cache, CACHE, Cache] Primary
[cache, CACHE, Cache] Secondary
[cache, Cache, CACHE] Tertiary

Secondary Collation - Accent sensitive.

Compare with no accents present:

String : abc and ABC are same with Strength - Primary
String : abc and ABC are same with Strength - Secondary
String : abc and ABC are different with Strength- Tertiary

Compare with accents present:

String : abc and ÀBC are same with Strength - Primary
String : abc and ÀBC are different with Strength- Secondary
String : abc and ÀBC are different with Strength- Tertiary

In a real application, Collator.getInstance should be called only where a collator is actually needed. If we do not pass a locale as a parameter, the getInstance method determines the current default locale, loads the appropriate rules and returns a Collator object with all the default properties.
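As a rough sketch of that advice, the example below obtains a collator at the point of use and passes the same instance to a sort, once with the default locale and once with an explicit one. The class name CollatorReuseSketch and the helper sortedCopy are hypothetical names used only for illustration.

```java
import java.text.Collator;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

public class CollatorReuseSketch {

    // Hypothetical helper: sorts a copy of 'words' with the given collator,
    // leaving the input list untouched.
    static List<String> sortedCopy(List<String> words, Collator collator) {
        List<String> copy = new ArrayList<>(words);
        copy.sort(collator); // Collator implements Comparator<Object>
        return copy;
    }

    public static void main(String[] args) {
        // No argument: the rules for Locale.getDefault() are loaded.
        Collator defaultCollator = Collator.getInstance();
        // Explicit locale: the result does not depend on the machine's settings.
        Collator english = Collator.getInstance(Locale.ENGLISH);

        List<String> words = Arrays.asList("banana", "Cherry", "apple");
        System.out.println(sortedCopy(words, defaultCollator));
        System.out.println(sortedCopy(words, english)); // [apple, banana, Cherry]
    }
}
```

Passing an explicit locale is a design choice for reproducible results; the no-argument form is convenient when the output should follow the user's own settings.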

The Rule-based Collator:

The compare method of the rule-based collator has a lot of bookkeeping to do. The strcmp function in C, which compares strings byte by byte, cannot give linguistically correct results for special European characters such as "ß", "ä", "ë" or "ö". The RuleBasedCollator first translates the input strings into a series of collation elements, each of which corresponds to a single entity in the input string. In English, each character in the input string maps to one collation element, but the character "Æ" produces two elements, while the Spanish "ch" produces just one element.

This translation is done by the utility class CollationElementIterator, which uses a set of mapping tables built from the locale for which the collator was created. CollationElementIterator is a public class, so we can use it to implement our own searches. A sample use of CollationElementIterator is shown below:

Listing 2: Code snippet for CollationElementIterator implementation

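import java.text.CollationElementIterator;
import java.text.Collator;
import java.text.RuleBasedCollator;

// The wrapper class name is illustrative only.
public class CollationElementDemo {

	public static void main(String[] args) {
		// In the standard JDK, the default Collator is a RuleBasedCollator.
		RuleBasedCollator c = (RuleBasedCollator) Collator.getInstance();
		CollationElementIterator iter = c.getCollationElementIterator("Foo");
		int element;
		// next() returns NULLORDER once the end of the string is reached.
		while ((element = iter.next()) != CollationElementIterator.NULLORDER) {
			// Print each collation element in hexadecimal.
			System.out.println("Collation element is: " + Integer.toString(element, 16));
		}
	}
}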

As we can see here, obtaining the collation elements is simple. Each value stored in the int variable 'element' describes where the character or group of characters falls in the sorting sequence. As the strength levels in Listing 1 suggest, each collation element can be broken down into three components:

  • Primary: This corresponds to the base letter of the alphabet, e.g. 'A' or 'B' and so on.
  • Secondary: This corresponds to the accent, e.g. 'á' or 'é' or other European language characters.
  • Tertiary: This represents the case of the character, so 'a' and 'A' differ in this component.
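The three components can be read from a collation element with the static methods primaryOrder, secondaryOrder and tertiaryOrder of CollationElementIterator. The following is a minimal sketch; the class name CollationComponentsSketch and the helpers firstElement and describe are hypothetical, and the cast assumes the JDK returns a RuleBasedCollator, as the standard implementation does. The numeric values depend on the collation rules in use, so no specific output is claimed.

```java
import java.text.CollationElementIterator;
import java.text.Collator;
import java.text.RuleBasedCollator;
import java.util.Locale;

public class CollationComponentsSketch {

    // Hypothetical helper: returns the first collation element of 'text',
    // or NULLORDER when the text is empty.
    static int firstElement(RuleBasedCollator collator, String text) {
        return collator.getCollationElementIterator(text).next();
    }

    // Hypothetical helper: formats every collation element of 'text'
    // as [primary/secondary/tertiary].
    static String describe(RuleBasedCollator collator, String text) {
        StringBuilder sb = new StringBuilder();
        CollationElementIterator it = collator.getCollationElementIterator(text);
        int element;
        while ((element = it.next()) != CollationElementIterator.NULLORDER) {
            sb.append('[')
              .append(CollationElementIterator.primaryOrder(element)).append('/')
              .append(CollationElementIterator.secondaryOrder(element)).append('/')
              .append(CollationElementIterator.tertiaryOrder(element)).append(']');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Assumption: the standard JDK returns a RuleBasedCollator here.
        RuleBasedCollator collator = (RuleBasedCollator) Collator.getInstance(Locale.ENGLISH);
        // "a", "A", "á" and "Æ": compare the components printed for each.
        for (String s : new String[] {"a", "A", "\u00e1", "\u00c6"}) {
            System.out.println(s + " -> " + describe(collator, s));
        }
    }
}
```

Running it shows 'a' and 'A' sharing a primary value while differing in the tertiary one, and the number of bracketed groups printed for a string is the number of collation elements it produced.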

Improvements in Java 1.2:

Although these concepts were already mature in Java 1.1, they needed some performance improvements. Java 1.2 made those enhancements to the internationalization classes. One significant change is that in Java 1.2 the CollationElementIterator was modified to make text search faster.
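The article does not show a search routine, so the following is only one possible sketch of how collation elements could drive an accent-insensitive and case-insensitive text search. It reduces both strings to their primary components and looks for the pattern's sequence inside the text's sequence. The class name CollationSearchSketch and the helpers primaryKeys and containsAtPrimaryStrength are hypothetical, and the approach is a simple illustration rather than the algorithm used inside the JDK.

```java
import java.text.CollationElementIterator;
import java.text.Collator;
import java.text.RuleBasedCollator;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

public class CollationSearchSketch {

    // Hypothetical helper: collects the primary component of every collation
    // element in 's', skipping elements whose primary component is 0
    // (elements that are ignorable at primary strength).
    static List<Integer> primaryKeys(RuleBasedCollator collator, String s) {
        List<Integer> keys = new ArrayList<>();
        CollationElementIterator it = collator.getCollationElementIterator(s);
        int element;
        while ((element = it.next()) != CollationElementIterator.NULLORDER) {
            int primary = CollationElementIterator.primaryOrder(element);
            if (primary != 0) {
                keys.add(primary);
            }
        }
        return keys;
    }

    // Hypothetical helper: true when the primary keys of 'pattern' occur as a
    // contiguous run inside the primary keys of 'text'.
    static boolean containsAtPrimaryStrength(String text, String pattern, Locale locale) {
        // Assumption: the standard JDK returns a RuleBasedCollator here.
        RuleBasedCollator collator = (RuleBasedCollator) Collator.getInstance(locale);
        List<Integer> textKeys = primaryKeys(collator, text);
        List<Integer> patternKeys = primaryKeys(collator, pattern);
        return Collections.indexOfSubList(textKeys, patternKeys) >= 0;
    }

    public static void main(String[] args) {
        // "Un café noir" contains "CAFE" once accents and case are ignored.
        System.out.println(containsAtPrimaryStrength("Un caf\u00e9 noir", "CAFE", Locale.FRANCE));
        // "Un thé" does not contain "cafe".
        System.out.println(containsAtPrimaryStrength("Un th\u00e9", "cafe", Locale.FRANCE));
    }
}
```

Because ignorable elements are skipped, this sketch also ignores any character the collation rules treat as ignorable at primary strength, which may or may not be what a given application wants.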

Conclusion:

Let us summarize what we discussed above. The following bullets recap the discussion in brief:

  • Java uses the Unicode character set; a 16-bit char has 65,536 possible values.
  • Java text search covers almost all international languages of the modern era.
  • Java 1.1 introduced the classes:
    • java.text.Collator and
    • java.text.RuleBasedCollator.
  • These classes are used to compare and search text in any international language of the modern era.
  • Any collation element can be broken down into three components:
    • Primary.
    • Secondary.
    • Tertiary.
  • Java 1.2 improved the performance of these classes to make searching faster.


Website: www.techalpine.com Have 16 years of experience as a technical architect and software consultant in enterprise application and product development. Have interest in new technology and innovation area along with technical...
