murthy vardhineedi: July 2010

Read the following article to master Lucene scoring.
This information is hard to find anywhere else.

Internal Flow in Apache Solr:

Whenever we enter something to search in the Solr admin console, it is received by QueryComponent.java, which makes this class the starting point of the search process. QueryComponent.java is part of Solr.

Functionalities of QueryComponent.java

-Receives the query string.

-Splits the query string based on certain characters.

-Collects the unique ids from schema.xml.

-Finds the ids that match the query string.

-Sends the response to the user.

INTERNAL FLOW IN LUCENE:

The QueryParser is used to transform user-submitted query strings into Query objects.

Lucene's Query API:

When a human-readable query is parsed by Lucene's QueryParser, it is converted to a single concrete subclass of the Query class. However, we need some understanding of the underlying concrete Query subclasses. The relevant subclasses, their purpose, and some example expressions for each are listed in the following table:

Query Implementation	Purpose	Sample expressions
TermQuery	Single term query, which effectively is a single word.	reynolds
PhraseQuery	A match of several terms in order, or in near vicinity to one another.	"light up ahead"
RangeQuery	Matches documents with terms between beginning and ending terms, including or excluding the end points.	[A TO Z] {A TO Z}
WildcardQuery	Lightweight, regular-expression-like term-matching syntax.	j*v? f??bar
PrefixQuery	Matches all terms that begin with a specified string.	cheese*
FuzzyQuery	Levenshtein algorithm for closeness matching.	tree~
BooleanQuery	Aggregates other Query instances into complex expressions allowing AND, OR, and NOT logic.	reynolds AND "light up ahead" cheese* -cheesewhiz

QueryParser expression syntax:

The following items in this section describe the syntax QueryParser supports to create the various query types.

Single-term query:

A query string of only a single word is converted to an underlying TermQuery.

Phrase query:

To search for a group of words together in a field, surround the words with double-quotes. The query "hello world" corresponds to an exact phrase match, requiring "hello" and "world" to be successive terms for a match.

Lucene also supports sloppy phrase queries, where the terms between quotes do not necessarily have to be in the exact order. The slop factor limits how many moves it may take to rearrange the terms into the exact order. If the number of moves does not exceed the specified slop factor, it is a match. QueryParser parses the expression "hello world"~2 as a PhraseQuery with a slop factor of 2, allowing matches on the phrases "world hello", "hello world", "hello * world", and "hello * * world", where the asterisks represent irrelevant words in the index. Note that "world * hello" does not match with a slop factor of 2, because the number of moves needed to get back to "hello world" is 3: hopping the word "world" to the asterisk position is one, to the "hello" position is two, and the third hop makes the exact match.
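The move counting described above can be checked with a small sketch. This is a simplified model and not Lucene's own scorer: it assumes every phrase term occurs exactly once in the document, and the names SlopSketch, movesNeeded and matches are made up for illustration. For each phrase term it subtracts the term's offset inside the phrase from its position in the document; the spread between the largest and smallest result is taken as the number of moves.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Lucene.
class SlopSketch {
    // Number of moves needed to line the document up with the phrase,
    // or -1 if a phrase term is missing from the document.
    static int movesNeeded(List<String> phrase, List<String> doc) {
        int min = Integer.MAX_VALUE;
        int max = Integer.MIN_VALUE;
        for (int offset = 0; offset < phrase.size(); offset++) {
            int position = doc.indexOf(phrase.get(offset));
            if (position < 0) {
                return -1;
            }
            int shifted = position - offset; // where the phrase would start for this term
            min = Math.min(min, shifted);
            max = Math.max(max, shifted);
        }
        return max - min;
    }

    // A document matches when the moves do not exceed the slop factor.
    static boolean matches(List<String> phrase, List<String> doc, int slop) {
        int moves = movesNeeded(phrase, doc);
        return moves >= 0 && moves <= slop;
    }

    public static void main(String[] args) {
        List<String> phrase = Arrays.asList("hello", "world");
        System.out.println(movesNeeded(phrase, Arrays.asList("world", "hello")));       // 2
        System.out.println(movesNeeded(phrase, Arrays.asList("hello", "x", "world")));  // 1
        System.out.println(movesNeeded(phrase, Arrays.asList("world", "x", "hello")));  // 3
        System.out.println(matches(phrase, Arrays.asList("world", "x", "hello"), 2));   // false
    }
}
```

With a slop factor of 2 the first two documents match and "world * hello" does not, which agrees with the examples in the text.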

Range query:

Text or date range queries use bracketed syntax, with TO between the beginning term and ending term. The type of bracket determines whether the range is inclusive (square brackets) or exclusive (curly brackets).

NOTES: Non-date range queries use the start and end terms as the user entered them without modification. In the case of {Aardvark TO Zebra}, the terms are not lowercased. Start and end terms must not contain whitespace, or parsing fails; only single words are allowed. The analyzer is not run on the start and end terms.
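To see why the missing lowercasing matters, here is a minimal sketch of the inclusive and exclusive checks using plain string comparison. It assumes terms are compared in ordinary Java String order; RangeSketch and inRange are illustrative names, not Lucene classes.

```java
// Hypothetical helper, not part of Lucene.
class RangeSketch {
    // inclusive = true models [lower TO upper], false models {lower TO upper}.
    static boolean inRange(String term, String lower, String upper, boolean inclusive) {
        int fromLower = term.compareTo(lower);
        int fromUpper = term.compareTo(upper);
        if (inclusive) {
            return fromLower >= 0 && fromUpper <= 0;
        }
        return fromLower > 0 && fromUpper < 0;
    }

    public static void main(String[] args) {
        System.out.println(inRange("A", "A", "Z", true));                    // true: end points included
        System.out.println(inRange("A", "A", "Z", false));                   // false: end points excluded
        System.out.println(inRange("Monkey", "Aardvark", "Zebra", false));   // true
        // A lowercased index term sorts after every uppercase letter,
        // so it falls outside a range whose end points kept their capitals.
        System.out.println(inRange("monkey", "Aardvark", "Zebra", false));   // false
    }
}
```

If the index holds lowercased terms, the range end points in the query would have to be typed in lowercase as well to match under this model.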

Date range handling:

When a range query (such as [1/1/03 TO 12/31/03]) is encountered, the parser code first attempts to convert the start and end terms to dates. If the terms are valid dates, according to DateFormat.SHORT and lenient parsing, then the dates are converted to their internal textual representation (however, date field indexing is beyond the scope of this article). If either of the two terms fails to parse as a valid date, they both are used as is for a textual range.
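The fallback rule can be sketched with the JDK's own DateFormat. This is only an illustration of the decision, not the parser's code: the class name DateRangeSketch, its methods, and the choice of Locale.US are assumptions made for the example.

```java
import java.text.DateFormat;
import java.text.ParseException;
import java.util.Date;
import java.util.Locale;

// Hypothetical helper, not part of Lucene.
class DateRangeSketch {
    // Returns the parsed date, or null when the term is not a valid short-format date.
    static Date tryParse(String term) {
        DateFormat format = DateFormat.getDateInstance(DateFormat.SHORT, Locale.US);
        format.setLenient(true); // lenient parsing, as described in the text
        try {
            return format.parse(term);
        } catch (ParseException e) {
            return null;
        }
    }

    // Both end points must parse as dates; otherwise both are kept as plain text.
    static String classify(String start, String end) {
        boolean bothDates = tryParse(start) != null && tryParse(end) != null;
        return bothDates ? "date range" : "text range";
    }

    public static void main(String[] args) {
        System.out.println(classify("1/1/03", "12/31/03"));
        System.out.println(classify("1/1/03", "Zebra"));
        System.out.println(classify("Aardvark", "Zebra"));
    }
}
```

Only the first call is treated as a date range; a single end point that fails to parse sends both terms down the textual path.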

Wildcard and prefix queries:

If a term contains an asterisk or question mark, it is treated as a WildcardQuery. The exception is a term whose only wildcard is a trailing asterisk, which QueryParser optimizes to a PrefixQuery instead. While the WildcardQuery API itself supports a leading wildcard character, the QueryParser does not allow it. An example wildcard query is w*ldc?rd, whereas the query prefix* is optimized as a PrefixQuery.
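The decision described above can be written down as a few string checks. This is a sketch of the rule as stated in this article, not QueryParser's source; WildcardSketch and classify are illustrative names.

```java
// Hypothetical helper, not part of Lucene.
class WildcardSketch {
    static String classify(String term) {
        if (term.startsWith("*") || term.startsWith("?")) {
            return "rejected: leading wildcard";
        }
        boolean hasStar = term.indexOf('*') >= 0;
        boolean hasQuestionMark = term.indexOf('?') >= 0;
        if (!hasStar && !hasQuestionMark) {
            return "TermQuery";
        }
        // The first asterisk being the last character means it is the only wildcard.
        boolean onlyTrailingStar = !hasQuestionMark && term.indexOf('*') == term.length() - 1;
        return onlyTrailingStar ? "PrefixQuery" : "WildcardQuery";
    }

    public static void main(String[] args) {
        System.out.println(classify("w*ldc?rd")); // WildcardQuery
        System.out.println(classify("prefix*"));  // PrefixQuery
        System.out.println(classify("reynolds")); // TermQuery
        System.out.println(classify("*foo"));     // rejected: leading wildcard
    }
}
```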

Fuzzy query:

Lucene's FuzzyQuery matches terms close to a specified term. The Levenshtein distance algorithm determines how close terms in the index are to a specified target term. "Edit distance" is another term for "Levenshtein distance," and is a measure of similarity between two strings, where distance is measured as the number of character deletions, insertions, or substitutions required to transform one string into the other. For example, the edit distance between "three" and "tree" is one, as only one character deletion is needed. The number of moves is used in a threshold calculation, which is a ratio of distance to string length.

QueryParser supports fuzzy-term queries using a trailing tilde on a term. For example, searching for wuzza~ will find documents that contain "fuzzy" and "wuzzy". Edit distance affects scoring, such that lower edit distances score higher.
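The edit distance and the threshold ratio can be worked through in a short sketch. The distance routine is the standard Levenshtein algorithm. The similarity formula (1 minus distance divided by the shorter string's length) and the 0.5 cutoff are assumptions modeled on older Lucene releases, and FuzzySketch is an illustrative name.

```java
// Hypothetical helper, not part of Lucene.
class FuzzySketch {
    // Classic Levenshtein distance using two rolling rows.
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // turning "" into the first j characters of b takes j insertions
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // turning the first i characters of a into "" takes i deletions
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                int insertOrDelete = Math.min(curr[j - 1], prev[j]) + 1;
                curr[j] = Math.min(insertOrDelete, prev[j - 1] + cost);
            }
            int[] swap = prev;
            prev = curr;
            curr = swap;
        }
        return prev[b.length()];
    }

    // Assumed ratio: 1 - distance / length of the shorter string.
    static double similarity(String a, String b) {
        int shorter = Math.min(a.length(), b.length());
        if (shorter == 0) {
            return a.length() == b.length() ? 1.0 : 0.0;
        }
        return 1.0 - (double) editDistance(a, b) / shorter;
    }

    public static void main(String[] args) {
        double minimumSimilarity = 0.5; // assumed cutoff
        System.out.println(editDistance("three", "tree"));  // 1
        System.out.println(editDistance("wuzza", "wuzzy")); // 1
        System.out.println(editDistance("wuzza", "fuzzy")); // 2
        System.out.println(similarity("wuzza", "fuzzy") > minimumSimilarity); // true
    }
}
```

Under these assumptions "wuzzy" (distance 1) scores closer to wuzza than "fuzzy" (distance 2), and both clear the cutoff, which is consistent with the example above.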

Boolean query:

Constructing Boolean queries textually is done using the operators AND, OR, and NOT. Terms listed without an operator specified use an implicit operator, which by default is OR. A query of abc xyz will be interpreted as abc OR xyz. Placing a NOT in front of a term excludes documents containing the following term. Negating a term must be combined with at least one non-negated term to return documents. Each of the uppercase word operators has shortcut syntax shown in the following table.

Verbose syntax	Shortcut syntax
a AND b	+a +b
a OR b	a b
a AND NOT b	+a -b
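The shortcut syntax in the table maps each term to one of three roles: required (+), prohibited (-), or optional (no prefix). A minimal sketch of how those roles decide a match follows. It models a document as a set of terms and is not Lucene's BooleanQuery implementation; BooleanSketch and its method are illustrative names.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper, not part of Lucene.
class BooleanSketch {
    static boolean matches(Set<String> docTerms, List<String> required,
                           List<String> optional, List<String> prohibited) {
        for (String term : prohibited) {
            if (docTerms.contains(term)) {
                return false; // -term: the document must not contain it
            }
        }
        for (String term : required) {
            if (!docTerms.contains(term)) {
                return false; // +term: the document must contain it
            }
        }
        if (!required.isEmpty()) {
            return true;
        }
        // With no required terms, at least one optional term has to be present.
        // A query made only of negated terms therefore matches nothing.
        for (String term : optional) {
            if (docTerms.contains(term)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        Set<String> doc = new HashSet<String>(Arrays.asList("a", "c"));
        List<String> none = Collections.<String>emptyList();
        System.out.println(matches(doc, Arrays.asList("a", "b"), none, none)); // +a +b -> false
        System.out.println(matches(doc, none, Arrays.asList("a", "b"), none)); // a b   -> true
        System.out.println(matches(doc, Arrays.asList("a"), none, Arrays.asList("b"))); // +a -b -> true
        System.out.println(matches(doc, none, none, Arrays.asList("b"))); // -b alone -> false
    }
}
```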

Implementing Scoring:

The Scorer abstract class provides common scoring functionality for all Scorer implementations and is the heart of the Lucene scoring process. The Scorer defines the abstract methods which must be implemented in child classes.

The following methods matter most when implementing a Scorer:

1. DocIdSetIterator#nextDoc(): Advances to the next document that matches this Query and returns its id, or NO_MORE_DOCS if no further document matches.

2. DocIdSetIterator#docID(): Returns the id of the document the iterator is currently positioned on. It is not valid until nextDoc() or advance(int) has been called at least once.

3. Scorer#score(Collector): Scores and collects all matching documents using the given Collector.

4. Scorer#score(): Abstract method that child classes must implement. Returns the score of the current document matching the query. This value can be determined in any way appropriate for the application. For instance, TermScorer returns tf * Weight.getValue() * fieldNorm.

NOTE: The value of score() is initially invalid. It becomes valid once DocIdSetIterator.nextDoc() or DocIdSetIterator.advance(int) has been called for the first time, or when score() is called from within Collector.collect(int).

5. DocIdSetIterator#advance(int): Skips ahead in the document matches to the first document whose id is greater than or equal to the passed-in value. In many instances, advance can be implemented more efficiently than simply looping through all the matching documents until the target document is identified.
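How these methods fit together is easier to see on a toy example. The class below is not a Lucene Scorer: ToyTermScorer is a made-up, self-contained stand-in that keeps its matches in sorted arrays, returns -1 from docID() before iteration starts as its own convention, and multiplies tf, a weight value, and a field norm as in the TermScorer formula mentioned above. Its advance uses a binary search to show how skipping can beat stepping through every match.

```java
// Hypothetical stand-in that mirrors the method names; not part of Lucene.
class ToyTermScorer {
    static final int NO_MORE_DOCS = Integer.MAX_VALUE;

    private final int[] docs;      // sorted ids of the matching documents
    private final float[] tfs;     // term frequency factor per match
    private final float[] norms;   // field norm per match
    private final float weightValue;
    private int index = -1;        // -1 means iteration has not started

    ToyTermScorer(int[] docs, float[] tfs, float[] norms, float weightValue) {
        this.docs = docs;
        this.tfs = tfs;
        this.norms = norms;
        this.weightValue = weightValue;
    }

    int docID() {
        if (index < 0) {
            return -1;
        }
        return index < docs.length ? docs[index] : NO_MORE_DOCS;
    }

    int nextDoc() {
        if (index < docs.length) {
            index++;
        }
        return docID();
    }

    // Binary search for the first match after the current one with id >= target.
    int advance(int target) {
        int lo = Math.min(index + 1, docs.length);
        int hi = docs.length;
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            if (docs[mid] < target) {
                lo = mid + 1;
            } else {
                hi = mid;
            }
        }
        index = lo;
        return docID();
    }

    // Only meaningful while positioned on a match; throws otherwise.
    float score() {
        return tfs[index] * weightValue * norms[index];
    }

    public static void main(String[] args) {
        ToyTermScorer scorer = new ToyTermScorer(
                new int[] {2, 5, 9},
                new float[] {1f, 2f, 3f},
                new float[] {0.5f, 0.5f, 1f},
                2f);
        // The usual loop: step through every match and read its score.
        while (scorer.nextDoc() != NO_MORE_DOCS) {
            System.out.println("doc " + scorer.docID() + " score " + scorer.score());
        }
    }
}
```

A Collector-style consumer would run the same loop and hand each document id to its collect method instead of printing it.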

CODING:

If you would like more details about the scoring implementation, send a mail to the following address:

murthy.mca53@gmail.com

Thanks & Regards

v.s.n murthy


Monday, July 5, 2010

Basic Flow of Lucene , Apache Solr. (Internal Flow /Process involved in Apache Lucene)