What’s New in Apache Lucene 2.9

"With the new release of version 2.9, Apache Lucene is now faster and more flexible than before. Learn about these new features and improvements: near-real-time search, flexible indexing and segments, high-performance numerical range queries, new APIs, and critical bug fixes." http://www.lucidimagination.com/developer/whitepaper/Whats-New-in-Apache-Lucene-2-9

What’s New
in Apache Lucene 2.9
A Lucid Imagination
Technical White Paper































Abstract
Apache Lucene is a high-performance, cross-platform, full-featured Information Retrieval
library in open source, suitable for nearly every application that requires full-text search
features.
Since its introduction nearly 10 years ago, Apache Lucene has become a competitive player
for developing extensible, high-performance full-text search solutions. The experience
accumulated over time by the community of Lucene committers and contributors and the
innovations they have engineered have delivered significant ongoing advances in Lucene’s
capabilities.
This white paper describes the new features and improvements in the latest version,
Apache Lucene 2.9. It is intended mainly for programmers familiar with the broad base of
Lucene’s capabilities, though those new to Lucene should also find it a useful exploration of
the newest features.
In the simplest terms, Lucene is now faster and more flexible than before. Historic weak
points have been improved to open the way for innovative new features like near-real-time
search, flexible indexing, and high-performance numerical range queries. Many new
features have been added, new APIs introduced, and critical bugs fixed, all with
the same goal: improving Lucene’s state-of-the-art search capabilities.


What’s New in Lucene 2.9
A Lucid Imagination Technical White Paper • October 2009
Page ii


























Table of Contents
Introduction
Core Features and Improvements
Numeric Capabilities and Numeric Range Queries
New TokenStream API
Per-Segment Search
Near Realtime Search (NRS)
MultiTermQuery-Related Improvements
Payloads
Additions to Lucene Contrib
New Contrib Analyzers
Lucene Spatial (formerly known as LocalLucene)
Lucene Remote and Java RMI
New Flexible QueryParser
Minor Changes and Improvements in Lucene 2.9
Strategies for Upgrading to Lucene 2.9
Upgrade to 2.9—Recommended Actions
Upgrade to 2.9—Optional Actions
References
Next Steps
APPENDIX: Choosing Lucene or Solr



Introduction
Apache Lucene is a high-performance, cross-platform, full-featured Information Retrieval
library, in open source, suitable for nearly every application that requires full-text search
features. Lucene currently ranks among the top 15 open source projects and is one of the
top 5 Apache projects, with installations at over 4,000 companies. Downloads of Lucene,
and its server implementation Solr, have grown nearly tenfold over the past three years;
Solr is the fastest-growing Lucene subproject. Lucene and Solr offer an attractive
alternative to proprietary licensed search and discovery software vendors.1 With the
release of version 2.9 in September 2009, the Apache Lucene community delivered the
latest upgrade of Lucene.
This white paper addresses the key issues you will face if you have an Apache Lucene-based
application and need to upgrade existing code to work well with this latest version, so that
you can take advantage of the various improvements and prepare for the next major
release. If you do not have a Lucene application, the paper should also give you a good
overview of the innovations in this release.
Unlike the previous 2.4.1 release (March 2009), Lucene 2.9 is more than just a bug-fix
release. It introduces multiple performance improvements, new features, better runtime
behavior, API changes, and bug-fixes at a variety of levels. The 2.9 release improves Lucene
in several key aspects, which make it an even more compelling alternative to other
solutions. Most notably:
• Improvements for Near-Realtime Search capabilities make documents searchable
almost instantaneously.
• A new, straightforward API for handling Numeric Ranges both simplifies
development and virtually wipes out performance overhead.
• The Analysis API has been replaced for more streamlined, flexible text handling.


1 See the Appendix for a discussion of when to choose Lucene or Solr.

And, behind the scenes, the groundwork has been laid for yet more indexing flexibility in
future releases.
Lucene Contrib also adds new utility packages, introduced with this release:
• An extremely flexible query parser framework opens new possibilities for
programmers to more easily create their own query parsing syntax.
• Local-Lucene and its geo-search capabilities, now donated to Apache, provide this
near-mandatory functionality for state-of-the-art search.
• Various contributions have markedly improved support for languages like Arabic,
Persian, and Chinese.
Some important notes on compatibility: because previous minor releases also contained
performance improvements and bug fixes, programmers have been accustomed to
upgrading to a new Lucene version just by replacing the JAR file in their classpath. And, in
those past cases, Lucene-based apps could be upgraded flawlessly without recompiling the
software components accessing or extending Apache Lucene. However, this may not be so
with Lucene 2.9.
Lucene 2.9 introduces several back-compatibility-breaking changes that may well require
changes in your code that uses the library. A drop-in library replacement is not guaranteed
to be successful; at a minimum, it is not likely to be flawless. As a result, we recommend
that if you are upgrading from a previous Lucene release, you should at least recompile any
software components directly accessing or extending the library. For components that
extend the library, recompilation alone will most likely not be sufficient. More details on
these dependencies are discussed in the “Strategies for Upgrading to Lucene 2.9” section of
the paper. We’ve also noted any significant compatibility issues with this label:
[BACK-COMPATIBILITY].
Finally, it is important to note that Lucene 2.9 will be the last release supporting the Java
1.4 platform. The majority of programmers are already running on either version 1.5
or 1.6 platforms (1.6 is our recommended JVM), and Java 1.4 reached its end of service life in
October 2008.
This document is not intended to be a comprehensive overview of Lucene 2.9 in all its
functions, but rather an overview of its key new features and capabilities. Always check the
Lucid Imagination Certified distribution and the official Lucene Website
(http://lucene.apache.org) for the most up-to-date release information.


Core Features and Improvements
Numeric Capabilities and Numeric Range Queries
One of Apache Lucene's basic properties is its representation of internal searchable values
(terms) as UTF-8 encoded characters. Every value passed to Lucene must be converted into
a string in order to be searchable. At the same time, Lucene is frequently applied to search
numeric values and ranges, such as prices, dates, or other numeric field attributes.
Historically, searching over numeric ranges has been a weak point of the library. However,
the 2.9 release comes with a tremendous improvement for searching numeric values,
especially for range queries.
Prior to Lucene 2.9, numeric values were encoded with leading zeros, essentially as a full-
precision value. Values stored with full precision ended up creating many unique terms in
the index. Thus, if you needed to retrieve all documents in a certain range (e.g., from $1.50
to $1500.0) Lucene had to iterate through a lot of terms whenever many documents with
unique values were indexed. Consequently, execution of queries with large ranges and lots
of unique terms could be extremely slow as a result of this overhead.
Many workaround techniques have evolved over the years to improve the performance of
ranges, such as encoding date ranges in multiple fields with separate fields for year, month,
and day. But at the end of the day, every programmer had to roll his or her own way of
searching ranges efficiently.
In Lucene 2.9, NumericUtils and its relatives (NumericRangeQuery /
NumericRangeFilter) introduce native numeric encoding and search capabilities.
Numeric Java primitives (long, int, float, and double) are transformed into prefix-
encoded representations with increasing precision. Internally each prefix precision is
generated by stripping off the least significant bits indicated by the precisionStep. Each
value is subsequently converted to a sequence of 7-bit ASCII characters (due to the UTF-8
term encoding in the index, 8 or more bits would split into two or more bytes) resulting in
a predictable number of prefix-terms that can be calculated ahead of time. The figure below
illustrates such a Prefix Tree.

Example of a Prefix Tree, where the leaves of the tree hold the actual term values and all the descendants of a
node have a common prefix associated with the node. Bold circles mark all relevant nodes to retrieve a range
from 215 to 977.
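
The idea of generating one prefix per precision level can be sketched in a few lines of plain Java. This is only an illustration under stated assumptions: the class and method names are hypothetical, the prefixes are printed as hexadecimal labels, and the sketch omits how Lucene serializes each prefix into 7-bit ASCII characters.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative only: hypothetical helper, not part of Lucene. */
class PrefixTermsSketch {

    /**
     * Lists the prefixes of a 32-bit value, from full precision (shift 0)
     * to the coarsest prefix. Each step strips precisionStep more of the
     * least significant bits. Labels have the form "shift:prefixInHex".
     */
    static List<String> prefixTerms(int value, int precisionStep) {
        List<String> terms = new ArrayList<String>();
        for (int shift = 0; shift < 32; shift += precisionStep) {
            terms.add(shift + ":" + Integer.toHexString(value >>> shift));
        }
        return terms;
    }

    public static void main(String[] args) {
        // 976 and 977 differ at full precision but share every coarser prefix.
        System.out.println(prefixTerms(976, 4));
        System.out.println(prefixTerms(977, 4));
    }
}
```

With a precisionStep of 4, a 32-bit value always yields eight labels, which matches the point above that the number of prefix terms per value is predictable ahead of time.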


The generated terms are indexed just like any other string values passed to Lucene. Under
the hood, Lucene associates each distinct term with all documents containing that term, so
all documents containing a numeric value with the same prefix are “grouped” together,
which tremendously reduces the number of terms that need to be searched. This
stands in contrast to the less efficient encoding scheme in previous releases,
where each unique numeric value was indexed as a single distinct term, so the number of
terms a range query had to visit grew with the number of unique values in the index.
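
To make the reduction concrete, the following sketch splits a range into runs of prefix terms and counts them. It is a simplified illustration, not Lucene's implementation: the class, its methods, and the restriction to non-negative values are assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative only: hypothetical range splitter, limited to non-negative values. */
class PrefixRangeSketch {

    /** A run of consecutive prefixes: all values v with first <= (v >>> shift) <= last. */
    static final class PrefixRange {
        final int shift;
        final long first;
        final long last;

        PrefixRange(int shift, long first, long last) {
            this.shift = shift;
            this.first = first;
            this.last = last;
        }

        long termCount() {
            return last - first + 1;
        }

        boolean covers(long value) {
            long prefix = value >>> shift;
            return prefix >= first && prefix <= last;
        }
    }

    /**
     * Splits the inclusive range [min, max] into prefix runs. Assumes
     * 0 <= min <= max, values that fit in valueBits bits, and
     * valueBits + precisionStep < 64.
     */
    static List<PrefixRange> split(long min, long max, int precisionStep, int valueBits) {
        List<PrefixRange> out = new ArrayList<PrefixRange>();
        for (int shift = 0; ; shift += precisionStep) {
            long diff = 1L << (shift + precisionStep);
            long mask = ((1L << precisionStep) - 1L) << shift;
            // A ragged edge exists when a bound is not aligned to the next coarser level.
            boolean hasLower = (min & mask) != 0L;
            boolean hasUpper = (max & mask) != mask;
            long nextMin = (hasLower ? min + diff : min) & ~mask;
            long nextMax = (hasUpper ? max - diff : max) & ~mask;
            if (shift + precisionStep >= valueBits || nextMin > nextMax) {
                // Nothing left for coarser levels: emit the remainder at this precision.
                out.add(new PrefixRange(shift, min >>> shift, max >>> shift));
                return out;
            }
            if (hasLower) {
                out.add(new PrefixRange(shift, min >>> shift, (min | mask) >>> shift));
            }
            if (hasUpper) {
                out.add(new PrefixRange(shift, (max & ~mask) >>> shift, max >>> shift));
            }
            min = nextMin;
            max = nextMax;
        }
    }

    public static void main(String[] args) {
        long terms = 0;
        for (PrefixRange r : split(10, 9999, 4, 32)) {
            terms += r.termCount();
        }
        System.out.println("prefix terms visited: " + terms);
    }
}
```

For the range 10 to 9999 with a precisionStep of 4, the ragged edges are matched at full precision and the middle of the range at coarser precisions, so the query touches a few dozen prefix terms instead of up to 9,990 full-precision terms.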



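import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.WhitespaceAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;

Directory directory = new RAMDirectory();
Analyzer analyzer = new WhitespaceAnalyzer();
IndexWriter writer = new IndexWriter(directory, analyzer,
    IndexWriter.MaxFieldLength.UNLIMITED);
for (int i = 0; i < 20000; i++) {
    Document doc = new Document();
    doc.add(new Field("id", String.valueOf(i), Field.Store.YES,
        Field.Index.NOT_ANALYZED_NO_NORMS));
    // Left-pad to five digits so that string order matches numeric order.
    String num = Integer.toString(i);
    String paddedValue = "00000".substring(0, 5 - num.length()) + num;
    doc.add(new Field("oldNumeric", paddedValue, Field.Store.YES,
        Field.Index.NOT_ANALYZED_NO_NORMS));
    writer.addDocument(doc);
}
writer.close();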
Indexing a zero-padded numeric value for use with an ordinary RangeQuery.
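
Why the padding matters can be shown without Lucene at all: terms are compared as strings, so unpadded numbers sort in the wrong order. The sketch below is a minimal illustration; the class and the pad helper are hypothetical names introduced for this example.

```java
/** Illustrative only: shows why full-precision string terms need zero-padding. */
class ZeroPaddingSketch {

    /** Left-pads a non-negative value with zeros to the given width. */
    static String pad(int value, int width) {
        return String.format("%0" + width + "d", value);
    }

    public static void main(String[] args) {
        // Compared as strings, "9" sorts after "10" because '9' > '1'.
        System.out.println("9".compareTo("10") > 0);
        // With padding, string order and numeric order agree.
        System.out.println(pad(9, 5).compareTo(pad(10, 5)) < 0);
    }
}
```

A string-based range query over unpadded terms would therefore match the wrong set of values, which is why the old approach needed this extra coding step.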

The native encoding of numeric values is useful beyond range searches. Numeric
fields can be loaded into the internal FieldCache, where they are used for sorting.
Zero-padding of numeric primitives (see the code example above) is no longer needed, as the
trie encoding guarantees the correct ordering without requiring execution overhead or extra
coding.
The code listing below instead uses the new NumericField to index a numeric Java
primitive with a precisionStep of 4 bits. Like NumericField, the API for querying
numeric ranges is straightforward and type-safe: NumericRangeQuery instances are
created using one of the provided static factory methods for the corresponding Java primitive.



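import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.WhitespaceAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.NumericField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;

Directory directory = new RAMDirectory();
Analyzer analyzer = new WhitespaceAnalyzer();
IndexWriter writer = new IndexWriter(directory, analyzer,
    IndexWriter.MaxFieldLength.UNLIMITED);
for (int i = 0; i < 20000; i++) {
    Document doc = new Document();
    doc.add(new Field("id", String.valueOf(i), Field.Store.YES,
        Field.Index.NOT_ANALYZED_NO_NORMS));
    // precisionStep = 4; the value is stored and indexed, with no manual padding.
    doc.add(new NumericField("newNumeric", 4,
        Field.Store.YES, true).setIntValue(i));
    writer.addDocument(doc);
}
writer.close();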
Indexing numeric values with the new NumericField type

The example below shows a numeric range query using an int primitive with the same
precision used in the indexing example. If different precision values are used at index and
search time, numeric queries can yield unexpected behavior.


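import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.NumericRangeQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;

// The assert* calls below are JUnit assertions.
IndexSearcher searcher = new IndexSearcher(directory, true);
// Range from 10 (inclusive) to 10000 (exclusive), precisionStep = 4 as at index time.
Query query = NumericRangeQuery.newIntRange("newNumeric", 4, 10,
    10000, true, false);
TopDocs docs = searcher.search(query, null, 10);
assertNotNull("Docs is null", docs);
assertEquals(9990, docs.totalHits);
for (int i = 0; i < docs.scoreDocs.length; i++) {
    ScoreDoc sd = docs.scoreDocs[i];
    // Document IDs equal the indexed values here because documents were added in order.
    assertTrue(sd.doc >= 10 && sd.doc < 10000);
}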
Searching numeric values with the new NumericRangeQuery

Improvements resulting from new Lucene numeric capabilities are equally significant in
versatility and performance. Now, Lucene can cover almost every use-case related to
numeric values. Moreover, everything from range searches and sorting on float or double
values to fast date searches (with dates converted to time stamps) will execute in less than
100 milliseconds in most cases. By comparison, the old approach using padded full-precision
values could take up to 30 seconds or more depending on the underlying index.
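
Converting a date to a time stamp, as mentioned above, can be sketched as follows. This is a minimal illustration assuming dates are interpreted as midnight UTC; the class and method names are hypothetical. The resulting long is the kind of primitive value that the numeric encoding described in this section operates on.

```java
import java.util.Calendar;
import java.util.GregorianCalendar;
import java.util.TimeZone;

/** Illustrative only: turns a calendar date into a sortable long value. */
class DateAsLongSketch {

    /** Returns milliseconds since the epoch for midnight UTC on the given date. */
    static long toTimestamp(int year, int month, int day) {
        Calendar cal = new GregorianCalendar(TimeZone.getTimeZone("UTC"));
        cal.clear();                    // reset all fields, including time of day
        cal.set(year, month - 1, day);  // Calendar months are zero-based
        return cal.getTimeInMillis();
    }

    public static void main(String[] args) {
        // Later dates always produce larger values, so numeric order equals date order.
        System.out.println(toTimestamp(2009, 9, 25) < toTimestamp(2009, 10, 1));
    }
}
```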
New TokenStream API
Almost every programmer who has extended Lucene has worked with its analysis function.
Text analysis is common to almost every use-case, and is among the best known Lucene
APIs.
Since its early days, Lucene has used a “Decorator Pattern” to provide a pluggable and
flexible analysis API, allowing a combination of existing and customized analysis
implementations. The central analysis class TokenStream enumerates a sequence of
tokens from either a document's fields or from a query. Commonly, multiple
TokenStream instances are chained, each applying a separate analysis step to text terms
represented by a Token class that encodes all relevant information about a term.
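
The chaining described above can be illustrated with a small, self-contained sketch of the Decorator Pattern. The types below are hypothetical stand-ins written for this example; they are not Lucene's TokenStream or Token classes, and they pass plain strings rather than full term information.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Hypothetical stand-in for a token stream; not Lucene's TokenStream. */
interface TermStream {
    /** Returns the next term, or null when the stream is exhausted. */
    String next();
}

/** Source of the chain: splits the input text on whitespace. */
class WhitespaceTerms implements TermStream {
    private final String[] parts;
    private int pos = 0;

    WhitespaceTerms(String text) {
        String trimmed = text.trim();
        parts = trimmed.length() == 0 ? new String[0] : trimmed.split("\\s+");
    }

    public String next() {
        return pos < parts.length ? parts[pos++] : null;
    }
}

/** Decorator: lower-cases every term produced by the wrapped stream. */
class LowerCaseTerms implements TermStream {
    private final TermStream input;

    LowerCaseTerms(TermStream input) {
        this.input = input;
    }

    public String next() {
        String term = input.next();
        return term == null ? null : term.toLowerCase(Locale.ROOT);
    }
}

/** Decorator: drops terms that appear in a stop-word set. */
class StopTerms implements TermStream {
    private final TermStream input;
    private final Set<String> stopWords;

    StopTerms(TermStream input, Set<String> stopWords) {
        this.input = input;
        this.stopWords = stopWords;
    }

    public String next() {
        String term;
        while ((term = input.next()) != null) {
            if (!stopWords.contains(term)) {
                return term;
            }
        }
        return null;
    }
}

class AnalysisChainSketch {
    /** Builds the chain source -> lower-case -> stop filter and collects its output. */
    static List<String> analyze(String text) {
        Set<String> stop = new HashSet<String>(Arrays.asList("the", "a"));
        TermStream chain = new StopTerms(new LowerCaseTerms(new WhitespaceTerms(text)), stop);
        List<String> out = new ArrayList<String>();
        for (String t = chain.next(); t != null; t = chain.next()) {
            out.add(t);
        }
        return out;
    }
}
```

Each decorator wraps another stream and applies exactly one analysis step, so existing and custom steps can be combined freely by changing only how the chain is constructed.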
Prior to Lucene 2.9, TokenStream operated exclusively on Token instances transporting
term information through the analysis chain. With this release, the token-based API has
been marked as deprecated. It is completely replaced by an attribute-based API.