1. Efficient search
An efficient search will avoid iterating through all the messages to search for something. For example a n-ary tree that will give the list of messages corresponding to a keyword can give an efficiency of O(length_of_keyword). I guess most indexing engine will build such a tree. Obviously, to avoid the user wait for the result, we have to build this tree before the user ask the search results. So that, we have to index as soon as possible. After fetching the list of messages, we will index the headers (subject, from field and recipient fields). That will correspond to most searches. Then, we will index the text content of the message. Less users will make this kind of searches but it remains useful.
2. Reuse existing components
Lucene has implemented this kind of indexer in Java. But etpanX is written in C language and Java is not that easy to interface with C. Lucene4c exists and has a C API but is mostly an abandonned software. Then, the last possible choice was CLucene, which has a C++ API. That made me write a glue between a C API for internal use of etpanX and C++ CLucene. An other possible choice could be Hyper Estraier but Lucene is more tested than the latter.
3. Regexp or not to regexp ?
Lucene won’t be able to work with regular expressions but basic users won’t understand what a regular expression is and even developers of system administrators have to make an effort to write a regular expression that will fit what they are searching. And in most case search by keyword will be sufficient.
4. Reading
That is not related to indexing algorithm though, the following are still interesting.
Designing Visual Interfaces: Communication Oriented Techniques
ISBN: 0133033899
The Blind Watchmaker: Why the Evidence of Evolution Reveals a Universe Without Design
ISBN: 0393315703
5. Music
I wonder if you looked at adapting mairix? It is written in C, and is specially designed for searching email — you can do searches on various headers such as subject, to, from or a date range, or message size, etc.
It is currently a standalone program, but creating a library out of it shouldn’t be hard.
http://www.rc0.org.uk/mairix
Comment by Ivan — May 30, 2006 @ 2:34 am
There are several reason for that :
1. I did not know mairix.
2. mairix is GPL.
3. CLucene or hyper estraier are already libraries and they have a flexible enough API for my needs, so that I mostly don’t have to look into their implementation. I don’t have to change them.
Comment by hoa — May 31, 2006 @ 4:27 am