rock’n roll

May 23, 2006

Write an efficient search module

Filed under: mail — dinhviethoa @ 1:10 am

1. Efficient search

An efficient search will avoid iterating through all the messages to search for something. For example a n-ary tree that will give the list of messages corresponding to a keyword can give an efficiency of O(length_of_keyword). I guess most indexing engine will build such a tree. Obviously, to avoid the user wait for the result, we have to build this tree before the user ask the search results. So that, we have to index as soon as possible. After fetching the list of messages, we will index the headers (subject, from field and recipient fields). That will correspond to most searches. Then, we will index the text content of the message. Less users will make this kind of searches but it remains useful.

2. Reuse existing components

Lucene has implemented this kind of indexer in Java. But etpanX is written in C language and Java is not that easy to interface with C. Lucene4c exists and has a C API but is mostly an abandonned software. Then, the last possible choice was CLucene, which has a C++ API. That made me write a glue between a C API for internal use of etpanX and C++ CLucene. An other possible choice could be Hyper Estraier but Lucene is more tested than the latter.

3. Regexp or not to regexp ?

Lucene won’t be able to work with regular expressions but basic users won’t understand what a regular expression is and even developers of system administrators have to make an effort to write a regular expression that will fit what they are searching. And in most case search by keyword will be sufficient.

4. Reading

That is not related to indexing algorithm though, the following are still interesting.

Designing Visual Interfaces: Communication Oriented Techniques
ISBN: 0133033899

The Blind Watchmaker: Why the Evidence of Evolution Reveals a Universe Without Design
ISBN: 0393315703

5. Music

Tool

2 Comments »

  1. I wonder if you looked at adapting mairix? It is written in C, and is specially designed for searching email — you can do searches on various headers such as subject, to, from or a date range, or message size, etc.

    It is currently a standalone program, but creating a library out of it shouldn’t be hard.

    http://www.rc0.org.uk/mairix

    Comment by Ivan — May 30, 2006 @ 2:34 am

  2. There are several reason for that :

    1. I did not know mairix.
    2. mairix is GPL.
    3. CLucene or hyper estraier are already libraries and they have a flexible enough API for my needs, so that I mostly don’t have to look into their implementation. I don’t have to change them.

    Comment by hoa — May 31, 2006 @ 4:27 am


RSS feed for comments on this post. TrackBack URI

Leave a comment

Blog at WordPress.com.