Indexing German Basketball News with ZEND Search Lucene
Last week I did a small experiment with the ZEND Search Lucene library.
I ran an indexing script on a collection of 2,329 German news articles. Some of them were already offline or could not be downloaded or indexed for other reasons, but I still ended up with a total of 1,742 indexed documents. The indexing took roughly 100 minutes and resulted in an index size of 7.6 MB.
Then I wrote a simple GUI in PHP to search all the news articles. You can see the result in the screenshot on the left.
So besides the hard facts, what else did I learn from using ZEND Search Lucene? I definitely encountered some problems that are related to the fact that I am indexing German news articles. This might only be interesting for German readers, so everybody else may skip the first two points.
1) Stopwords
Make sure that you get a stopword list for the language that you are indexing. In case you don’t know what stopwords are, just think of them as words that don’t carry much semantic information and can therefore be removed from a search index. If you want to find out more, start with Wikipedia: Stopwords. I found a usable list of German stopwords here, but I am sure there are better ones out there: 70 Deutsche Stopwords
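To make the idea concrete, here is a minimal sketch of stopword removal in plain PHP. It is not how ZEND Search Lucene does it internally; the helper name removeStopwords() and the six-word list are made up for illustration, and in a real setup you would load the full list from a file. As far as I know, the framework also ships a stopword token filter that can be attached to an analyzer, which is the cleaner route once you have your list.

```php
<?php
// Hypothetical helper (not part of Zend): splits a text into lowercase
// tokens and drops every token that appears in the stopword list.
function removeStopwords($text, array $stopwords)
{
    // Prefer mbstring so uppercase umlauts are lowercased correctly;
    // fall back to strtolower() if the extension is not installed.
    $text = function_exists('mb_strtolower')
        ? mb_strtolower($text, 'UTF-8')
        : strtolower($text);

    // Split on anything that is not a Unicode letter or digit.
    $tokens = preg_split('/[^\p{L}\p{N}]+/u', $text, -1, PREG_SPLIT_NO_EMPTY);

    // Flip the list once so that each lookup is a cheap isset().
    $lookup = array_flip($stopwords);

    $kept = array();
    foreach ($tokens as $token) {
        if (!isset($lookup[$token])) {
            $kept[] = $token;
        }
    }
    return $kept;
}

// Tiny illustrative sample, not the linked 70-word list.
$stopwords = array('der', 'die', 'das', 'und', 'in', 'für');
$tokens = removeStopwords('Alba Berlin gewinnt das Spiel in der Verlängerung', $stopwords);
print_r($tokens);
```

The sample sentence loses “das”, “in” and “der”, and only the words that actually say something about the article remain.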
2) Special characters aka “Umlaute”
I still have some problems with the stemming and analysis of German words, as ZEND does not provide a dedicated German stemmer. Maybe I should implement a German Snowball stemmer for PHP, but that was a bit too much effort for this experiment.
I also had some problems with the German special characters called “Umlaute”. ZEND Search Lucene lets you specify the encoding of your text at multiple points. You can specify the encoding during the indexing phase, choose a suitable Analyzer for your encoding (e.g. Zend_Search_Lucene_Analysis_Analyzer_Common_Utf8Num_CaseInsensitive in my case) and also specify the encoding when you execute a query.
In the end I got the encoding part right, so at least I can store all the characters in their original form in the index. At search time, though, they still seem to lead to unexpected query results. I saw that some people replace them, e.g. ‘ö’ becomes ‘oe’. If you do that before indexing and also before sending a query, your queries will work out fine. This still does not seem like a very clean approach to me. So for now I have acknowledged the problem but did not work on it any further, as it was not crucial for the purposes of my experiment.
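For reference, the replacement approach could look roughly like the sketch below. The function name foldUmlauts() is hypothetical and the code assumes that both the source file and the input text are UTF-8. The important part is to run the same function over the article text before indexing and over the user’s query before searching.

```php
<?php
// Hypothetical helper (not part of Zend): replaces German umlauts and
// the sharp s with their usual ASCII digraphs.
function foldUmlauts($text)
{
    // strtr() with an array never re-scans its own output and compares
    // bytes, so it is safe on UTF-8 input as long as this file is
    // saved as UTF-8 too.
    return strtr($text, array(
        'ä' => 'ae', 'ö' => 'oe', 'ü' => 'ue',
        'Ä' => 'Ae', 'Ö' => 'Oe', 'Ü' => 'Ue',
        'ß' => 'ss',
    ));
}

echo foldUmlauts('Verlängerung in Göttingen, Fußball in München'), "\n";
```

One known drawback of this scheme is that words which naturally contain “oe” or “ue” become indistinguishable from folded umlauts, which is part of why it does not feel clean.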
3) WARNING: Searching for integer IDs in Keyword fields
Don’t fall into the same trap I fell into! Let’s assume that each of the documents that you want to index contains a documentID of type integer. In order to add this documentID to your index documents, you might add it as a keyword field like this:
$doc->addField(Zend_Search_Lucene_Field::Keyword('documentID', $meldung->id));
When you later want to search your index for the index document with this documentID, you cannot simply run this query:
$hits = $index->find("documentID: 2");
The problem here is that find() runs a query string through the default analyzer. Unless you have set a different one, this analyzer keeps only letters, so it throws away all the numbers in your query and the search returns no results. If you want to search for a document with a given documentID, you have to build the query through the API instead, like this:
/**
 * Searches for the given documentID in the index.
 * Returns true if the documentID is found, false otherwise.
 * This function is used when reindexing the news so that
 * a news article with the same documentID is never indexed twice.
 *
 * @param object $index
 *   The ZEND Lucene Search Index in which we want to search.
 * @param int $documentID
 *   documentID to search for
 * @return bool
 *   true - if $documentID was found in $index,
 *   false - otherwise
 */
function documentAlreadyIndexed($index, $documentID) {
    $pathTerm = new Zend_Search_Lucene_Index_Term($documentID, 'documentID');
    $pathQuery = new Zend_Search_Lucene_Search_Query_Term($pathTerm);
    $query = new Zend_Search_Lucene_Search_Query_Boolean();
    $query->addSubquery($pathQuery, true);
    $hits = $index->find($query);
    return (count($hits) > 0);
}
4) Using only ZEND Search Lucene but not the whole framework
If you want to use only the Lucene search part of the Zend Framework, copy these files to your web folder and make sure that they are in PHP’s include path:
Zend/Search/ folder
Zend/Exception.php file
So this was my little something on ZEND Search Lucene. I hope you find some of this information helpful. One last interesting thing: guess what the most frequent terms in the data set were? Right, “Alba” and “Berlin”, with about 1,600 articles containing both of them. Very strange in a data set that consists of news articles about the basketball team “Alba Berlin” :) See the Luke screenshot for other top terms in the index.
Link List
Programmer’s Reference Guide
Zend Framework 1.6 Help
Zend Framework API Documentation
Creating A Fulltext Search Engine In PHP 5 With The Zend Framework’s Zend Search Lucene
ZEND Search Lucene
Wikipedia: Stopwords
70 Deutsche Stopwords
Luke










Nice article! Though the deletion approach you described works fine for me (using ZF 1.9.2). I also use the utf-8-num-caseinsensitive analyzer. Don’t know why it didnt worked for you…maybe they changed s.th. since ZF 1.6