Indexing and Searching with Apache Lucene 4.7 with Example

This article covers indexing and searching documents with Apache Lucene 4.7. Before jumping into the example and its explanation, let's see what Apache Lucene is.

Introduction to Apache Lucene

Lucene is a high-performance, scalable information retrieval (IR) library. IR refers to the process of searching for documents, information within documents, or metadata about documents. Lucene lets you add searching capabilities to your application. [Ref.: Lucene in Action, Second Edition, which covers Apache Lucene v3.0]

The main reason for Lucene's popularity is its simplicity. You don't need in-depth knowledge of the indexing and searching process to get started with Lucene. You can start by learning a handful of classes that actually do the indexing and searching. At the time of writing, the latest released version is 4.7, while books are available only for v3.0.

Important note

Lucene is not a ready-to-use application like a file-search program, web crawler, or search engine. It is a software toolkit, or library, with which you can build your own search applications or libraries. Many frameworks are built on top of the Lucene Core API for searching.

Libraries and Environment used
  • Eclipse Kepler
  • JDK 1.7
  • lucene-core-4.7.2.jar
  • lucene-queryparser-4.7.2.jar
  • lucene-demo-4.7.2.jar
  • lucene-analyzers-common-4.7.2.jar

Indexing with Lucene

Let's jump into the indexing process in Lucene with an example, and then explain the classes used and their purpose.

1. IndexerTest is the class used to run the demo.

package lucene.indexer;

import java.io.File;
import java.io.FileFilter;

/**
 * @author Gaurav Rai Mazra
 */
public class IndexerTest {
 
 public static void main(String[] args) throws Exception {
  String indexDir = "index";
  String dataDir = "dir";
  
  long start = System.currentTimeMillis();
  final IndexingHelper indexHelper = new IndexingHelper(indexDir);
  int numIndexed;
 
  try {
   numIndexed = indexHelper.index(dataDir, new TextFilesFilter());
  }
  finally {
   indexHelper.close();
  }
  
  long end = System.currentTimeMillis();
  System.out.println("Indexing " + numIndexed + " files took " + (end - start) + " milliseconds");
 }
}

// class filters only .txt files for indexing
class TextFilesFilter implements FileFilter {
 @Override
 public boolean accept(File pathname) {
  return pathname.getName().toLowerCase().endsWith(".txt");
 }
}

2. The IndexingHelper class shows how the indexing is done.

package lucene.indexer;

import java.io.File;
import java.io.FileFilter;
import java.io.FileReader;
import java.io.IOException;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

/**
 * @author Gaurav Rai Mazra
 */
public class IndexingHelper {
 // class that actually creates and maintains the index on disk
 private IndexWriter indexWriter;
 
 public IndexingHelper(String indexDir) throws Exception {
  //To represent actual directory
  Directory directory = FSDirectory.open(new File(indexDir));
  //Holds configuration required in creation of IndexWriter object
  IndexWriterConfig indexWriterConfig = new IndexWriterConfig(Version.LUCENE_47, new StandardAnalyzer(Version.LUCENE_47));
  indexWriter = new IndexWriter(directory, indexWriterConfig);
 }
 
 public void close() throws IOException {
  indexWriter.close();
 }
 
 // exposed method to index files 
 public int index(String dataDir, FileFilter fileFilter) throws Exception {
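  File[] files = new File(dataDir).listFiles();
  // listFiles() returns null when dataDir does not exist or is not a directory
  if (files == null) {
   throw new IOException("Cannot list files in data directory: " + dataDir);
  }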
  for (File f : files)
  {
   if (!f.isDirectory() && !f.isHidden() && f.exists() && f.canRead() && (fileFilter == null || fileFilter.accept(f)))
    indexFile(f);
  } 
  
  return indexWriter.numDocs();
 }
 
 private void indexFile(File f) throws Exception {
  System.out.println("  " + f.getCanonicalPath());
  Document doc = getDocument(f);
  indexWriter.addDocument(doc);
 }

 private Document getDocument(File f) throws Exception {
  // class used by Lucene's IndexWriter and IndexReader to store and retrieve indexed data
  Document document = new Document();
  document.add(new TextField("contents", new FileReader(f)));
  document.add(new StringField("filename", f.getName(), Field.Store.YES));
  document.add(new StringField("fullpath", f.getCanonicalPath(), Field.Store.YES));
  return document;
 }
}

In the IndexingHelper class, we have used the following classes of the Lucene library to index .txt files.

  • IndexWriter class.
  • IndexWriterConfig class.
  • Directory class.
  • FSDirectory class.
  • Document class.

Explanation

1. IndexWriter: It is the central component of the indexing process. This class creates a new index or opens an existing one, and adds, removes, and updates documents in the index. It has one public constructor, which takes a Directory object and an IndexWriterConfig object as parameters.

This class exposes several methods for adding Document objects to the index.

This class also exposes methods for deleting documents from the index, along with informative methods such as numDocs(), which returns the number of documents in the index, including deleted ones whose deletion has not yet been flushed to disk.
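To make the idea of an index more concrete, here is a minimal in-memory sketch of an inverted index, a structure that maps each term to the documents containing it. It is a toy model under simplifying assumptions, not how IndexWriter is implemented: the class ToyInvertedIndex and its methods are hypothetical names, and text is split on whitespace instead of being passed through an Analyzer.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

// Hypothetical toy class for illustration; not part of Lucene.
final class ToyInvertedIndex {
    // term -> sorted ids of the documents that contain the term
    private final Map<String, TreeSet<Integer>> postings = new HashMap<String, TreeSet<Integer>>();
    private final List<String> documents = new ArrayList<String>();

    // Adds a document and returns the id assigned to it (0, 1, 2, ...).
    int addDocument(String text) {
        int docId = documents.size();
        documents.add(text);
        for (String term : text.toLowerCase().split("\\s+")) {
            if (term.isEmpty()) {
                continue;
            }
            TreeSet<Integer> ids = postings.get(term);
            if (ids == null) {
                ids = new TreeSet<Integer>();
                postings.put(term, ids);
            }
            ids.add(docId);
        }
        return docId;
    }

    // Number of documents added so far.
    int numDocs() {
        return documents.size();
    }

    // Ids of the documents that contain the given term, in ascending order.
    List<Integer> search(String term) {
        TreeSet<Integer> ids = postings.get(term.toLowerCase());
        if (ids == null) {
            return new ArrayList<Integer>();
        }
        return new ArrayList<Integer>(ids);
    }

    public static void main(String[] args) {
        ToyInvertedIndex index = new ToyInvertedIndex();
        index.addDocument("the direwolf howled");
        index.addDocument("a raven flew north");
        index.addDocument("Direwolf pups in the snow");
        System.out.println(index.numDocs());          // prints 3
        System.out.println(index.search("direwolf")); // prints [0, 2]
    }
}
```

Looking up a term is a single map access, which is why searching an index is much faster than scanning every file for each query.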

2. IndexWriterConfig: It holds the configuration required to create an IndexWriter object. It has one public constructor, which takes two parameters. The first is a Version enum value, i.e. the Lucene version to match for compatibility. The second is an Analyzer object. Analyzer is itself an abstract class with many implementations, such as WhitespaceAnalyzer and StandardAnalyzer, which break text into tokens during the analysis process.

3. Directory: The Directory class represents the location of a Lucene index. It is an abstract class with many concrete implementations. No single implementation is best suited to every computer architecture, so use the abstract FSDirectory class (through FSDirectory.open, as in the example) to get the best available concrete implementation.

4. Analyzer: Before any text is indexed, it is passed to an Analyzer, which extracts the tokens that should be indexed and discards the rest.
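The following sketch illustrates what an analysis step does in principle: lowercase the text, split it into tokens, and drop words that are not worth indexing. It is only an approximation under stated assumptions and not StandardAnalyzer's algorithm: ToyAnalyzer is a hypothetical class, its stop-word list is made up for the example, and it handles ASCII letters and digits only.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical toy class for illustration; it is not Lucene's StandardAnalyzer.
final class ToyAnalyzer {
    // A tiny, made-up stop-word list; real analyzers ship their own.
    private static final Set<String> STOP_WORDS =
            new HashSet<String>(Arrays.asList("a", "an", "and", "the", "of", "in"));

    // Lowercases the text, splits it on anything that is not an ASCII
    // letter or digit, and drops stop words and empty fragments.
    static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<String>();
        for (String raw : text.toLowerCase().split("[^a-z0-9]+")) {
            if (!raw.isEmpty() && !STOP_WORDS.contains(raw)) {
                tokens.add(raw);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // prints [direwolf, raven, winter]
        System.out.println(analyze("The Direwolf and the Raven, in Winter!"));
    }
}
```

The same analysis has to be applied to the query text at search time, which is why the search example further down creates the same kind of analyzer that was used for indexing.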

5. Document: The Document class represents a collection of fields. It is a chunk of data that we want to index and make retrievable at a later time.

6. Field: Each document has one or more fields. Each field has a name and a corresponding value. Many of the Field class's general-purpose constructors are deprecated. It is preferable to use the specialized Field subclasses such as IntField, LongField, FloatField, DoubleField, BinaryDocValuesField, NumericDocValuesField, SortedDocValuesField, StringField, TextField, and StoredField.
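The indexing example uses StringField for "filename" and "fullpath" but TextField for "contents". The difference is that a StringField value is indexed as a single token, while a TextField value is tokenized. The sketch below models that difference with two hypothetical helper methods. It is a simplification for illustration, not Lucene code, and the crude splitting stands in for a real Analyzer.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical toy class for illustration; not part of Lucene.
final class ToyFieldTerms {
    // Models a StringField-like field: the whole value is one term, unchanged.
    static List<String> untokenized(String value) {
        return Collections.singletonList(value);
    }

    // Models a TextField-like field: the value is lowercased and split
    // into separate terms (a crude stand-in for a real Analyzer).
    static List<String> tokenized(String value) {
        List<String> terms = new ArrayList<String>();
        for (String part : value.toLowerCase().split("[^a-z0-9]+")) {
            if (!part.isEmpty()) {
                terms.add(part);
            }
        }
        return terms;
    }

    public static void main(String[] args) {
        System.out.println(untokenized("Winter Notes.txt")); // prints [Winter Notes.txt]
        System.out.println(tokenized("Winter Notes.txt"));   // prints [winter, notes, txt]
    }
}
```

In practice this means an untokenized field suits identifiers such as paths, which should match only as a whole, while a tokenized field suits body text that should match on individual words.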

Searching with Lucene

Let's jump into searching with Lucene, and then explain the classes used.

package lucene.searcher;

import java.io.File;
import java.io.IOException;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.queryparser.classic.ParseException;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

/**
 * @author Gaurav Rai Mazra
 */
public class SearcherTest {

 public static void main(String[] args) throws IOException, ParseException {
  String indexDir = "index";
  String q = "direwolf";
  
  search(indexDir, q);
 }
 
 //Search in lucene index
 private static void search(String indexDir, String q) throws IOException, ParseException {
  //get a directory to search from
  Directory directory = FSDirectory.open(new File(indexDir));
  // get reader to read directory
  IndexReader indexReader = DirectoryReader.open(directory);
  //create indexSearcher
  IndexSearcher is = new IndexSearcher(indexReader);
  // Create analyzer to analyse documents
  Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_47); 
  //create query parser
  QueryParser queryParser = new QueryParser(Version.LUCENE_47, "contents", analyzer);
  //get query
  Query query = queryParser.parse(q);
  
  //Query query1 = new TermQuery(new Term("contents", q));

  long start = System.currentTimeMillis();
  //hit query
  TopDocs hits = is.search(query, 10);
  long end = System.currentTimeMillis();
  
  System.err.println("Found " + hits.totalHits + " document(s) in " + (end-start) + " milliseconds");
  for (ScoreDoc scoreDoc : hits.scoreDocs)
  {
   Document document = is.doc(scoreDoc.doc);
   System.out.println(document.get("fullpath"));
  }
  // release the index files held by the reader
  indexReader.close();
 }
}

Explanation

1. IndexReader: This is an abstract class that provides an interface for accessing an index. To obtain a concrete implementation, use the helper class DirectoryReader: its open method takes a Directory reference and returns an IndexReader object.

2. IndexSearcher: IndexSearcher is used to search the data indexed by IndexWriter. You can think of IndexSearcher as a class that opens the index in read-only mode. It requires an IndexReader instance to be created, and it has methods for searching and for fetching documents.

3. QueryParser: This class parses a query string and generates a Query object from it.
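In the example, "contents" is passed to the QueryParser constructor as the default field, so a bare term such as direwolf is searched in that field. The classic query syntax also allows a term to name its field explicitly, as in filename:Notes.txt. The sketch below shows that one rule in isolation. ToyQueryParser is a hypothetical class that handles a single clause only, whereas the real parser supports a far richer syntax and also runs the text through the analyzer.

```java
// Hypothetical toy class for illustration; not Lucene's QueryParser.
final class ToyQueryParser {
    private final String defaultField;

    ToyQueryParser(String defaultField) {
        this.defaultField = defaultField;
    }

    // Parses a single clause of the form "term" or "field:term" and
    // returns a two-element array: { field, term }.
    String[] parse(String clause) {
        String trimmed = clause.trim();
        if (trimmed.isEmpty()) {
            throw new IllegalArgumentException("empty query");
        }
        int colon = trimmed.indexOf(':');
        // A colon counts as a field separator only with text on both sides.
        if (colon > 0 && colon < trimmed.length() - 1) {
            return new String[] { trimmed.substring(0, colon), trimmed.substring(colon + 1) };
        }
        return new String[] { defaultField, trimmed };
    }

    public static void main(String[] args) {
        ToyQueryParser parser = new ToyQueryParser("contents");
        String[] first = parser.parse("direwolf");
        String[] second = parser.parse("filename:Notes.txt");
        System.out.println(first[0] + " -> " + first[1]);   // prints contents -> direwolf
        System.out.println(second[0] + " -> " + second[1]); // prints filename -> Notes.txt
    }
}
```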

4. Query: It is an abstract class that represents the query used in searching. It has many concrete subclasses, such as TermQuery, BooleanQuery, and PhraseQuery. It contains several utility methods, one of which is setBoost(float).

5. TopDocs: It represents the hits returned by the search method of IndexSearcher. It has one public constructor, which takes three parameters: int totalHits, ScoreDoc[] scoreDocs, and float maxScore. Each ScoreDoc contains the score and the document ID of a matching document.
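In the example, is.search(query, 10) asks for at most 10 hits, so scoreDocs can be shorter than totalHits, which counts every match. The sketch below shows one common way to keep only the n best-scoring hits, using a bounded min-heap. It is a toy model under stated assumptions: ToyTopDocs and its nested Hit class are hypothetical names, and this is not a description of how Lucene collects hits internally.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical toy class for illustration; not Lucene's TopDocs.
final class ToyTopDocs {
    // Stand-in for a (document id, score) pair.
    static final class Hit {
        final int doc;
        final float score;

        Hit(int doc, float score) {
            this.doc = doc;
            this.score = score;
        }
    }

    final int totalHits;       // number of all matching documents
    final List<Hit> scoreDocs; // at most n hits, best score first

    private ToyTopDocs(int totalHits, List<Hit> scoreDocs) {
        this.totalHits = totalHits;
        this.scoreDocs = scoreDocs;
    }

    // Keeps only the n highest-scoring hits using a min-heap of size n.
    static ToyTopDocs topN(List<Hit> allHits, int n) {
        Comparator<Hit> byScore = new Comparator<Hit>() {
            @Override
            public int compare(Hit a, Hit b) {
                return Float.compare(a.score, b.score);
            }
        };
        PriorityQueue<Hit> heap = new PriorityQueue<Hit>(Math.max(1, n), byScore);
        for (Hit hit : allHits) {
            heap.add(hit);
            if (heap.size() > n) {
                heap.poll(); // evict the current lowest score
            }
        }
        List<Hit> best = new ArrayList<Hit>(heap);
        Collections.sort(best, Collections.reverseOrder(byScore));
        return new ToyTopDocs(allHits.size(), best);
    }

    public static void main(String[] args) {
        List<Hit> hits = new ArrayList<Hit>();
        hits.add(new Hit(0, 0.4f));
        hits.add(new Hit(1, 1.2f));
        hits.add(new Hit(2, 0.9f));
        ToyTopDocs top = topN(hits, 2);
        System.out.println("Found " + top.totalHits + " document(s)"); // prints Found 3 document(s)
        for (Hit hit : top.scoreDocs) {
            System.out.println(hit.doc + " " + hit.score); // prints 1 1.2, then 2 0.9
        }
    }
}
```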
