1. Efficient search
An efficient search avoids iterating through all the messages to find something. For example, an n-ary tree that maps a keyword to the list of matching messages can answer a query in O(length_of_keyword). I guess most indexing engines build such a tree. Obviously, so that the user does not have to wait for the results, we have to build this tree before the user asks for them, which means indexing as soon as possible. After fetching the list of messages, we index the headers (subject, from field and recipient fields). That covers most searches. Then we index the text content of the messages. Fewer users will run this kind of search, but it remains useful.
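To make the tree idea concrete, here is a minimal sketch of such an index in C. It is not what etpanX or Lucene actually does; the names (`index_node`, `index_add`, `index_lookup`) and the restriction to ASCII letters and digits are assumptions made for illustration. Each character of the keyword selects one child, so a lookup takes a number of steps proportional to the keyword length, whatever the number of indexed messages.

```c
#include <assert.h>
#include <stdlib.h>

#define FANOUT 36 /* 'a'-'z' (case-insensitive) and '0'-'9' */

/* Hypothetical node type: one node per keyword prefix. */
struct index_node {
    struct index_node *child[FANOUT];
    unsigned *msg_ids; /* messages whose keyword ends at this node */
    size_t count;
    size_t capacity;
};

/* Map a character to a child slot, or -1 if it is not indexed. */
static int slot_of(char c)
{
    if (c >= 'a' && c <= 'z') return c - 'a';
    if (c >= 'A' && c <= 'Z') return c - 'A';
    if (c >= '0' && c <= '9') return 26 + (c - '0');
    return -1;
}

/* Allocate an empty node; returns NULL if out of memory. */
struct index_node *index_new(void)
{
    static const struct index_node empty = {0};
    struct index_node *node = malloc(sizeof *node);
    if (node != NULL)
        *node = empty;
    return node;
}

/* Record that message msg_id contains keyword.
 * Returns 0 on success, -1 on an unsupported character or out of memory. */
int index_add(struct index_node *root, const char *keyword, unsigned msg_id)
{
    struct index_node *node = root;
    assert(root != NULL && keyword != NULL);
    for (const char *p = keyword; *p != '\0'; p++) {
        int slot = slot_of(*p);
        if (slot < 0)
            return -1;
        if (node->child[slot] == NULL) {
            node->child[slot] = index_new();
            if (node->child[slot] == NULL)
                return -1;
        }
        node = node->child[slot];
    }
    if (node->count == node->capacity) {
        /* Grow the id list geometrically. */
        size_t cap = node->capacity ? node->capacity * 2 : 4;
        unsigned *ids = realloc(node->msg_ids, cap * sizeof *ids);
        if (ids == NULL)
            return -1;
        node->msg_ids = ids;
        node->capacity = cap;
    }
    node->msg_ids[node->count++] = msg_id;
    return 0;
}

/* Walk one node per character: O(length of keyword).
 * Stores the number of matches in *count and returns the id list
 * (NULL when there is no match). */
const unsigned *index_lookup(const struct index_node *root,
                             const char *keyword, size_t *count)
{
    const struct index_node *node = root;
    assert(root != NULL && keyword != NULL && count != NULL);
    *count = 0;
    for (const char *p = keyword; *p != '\0'; p++) {
        int slot = slot_of(*p);
        if (slot < 0 || node->child[slot] == NULL)
            return NULL;
        node = node->child[slot];
    }
    *count = node->count;
    return node->msg_ids;
}

/* Free the whole tree. */
void index_free(struct index_node *node)
{
    if (node == NULL)
        return;
    for (int i = 0; i < FANOUT; i++)
        index_free(node->child[i]);
    free(node->msg_ids);
    free(node);
}
```

In this sketch, the headers would be split into words and each word passed to `index_add` as soon as the message list is fetched, and the body words added later. A real engine also has to handle non-ASCII text, persistence and ranking, which this leaves out.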
2. Reuse existing components
Lucene implements this kind of indexer in Java. But etpanX is written in C, and Java is not that easy to interface with C. Lucene4c exists and has a C API, but it is mostly abandoned software. So the last possible choice was CLucene, which has a C++ API. That made me write glue code between a C API for internal use in etpanX and the C++ CLucene. Another possible choice could have been Hyper Estraier, but Lucene is more widely tested.
3. Regexp or not to regexp?
Lucene won’t be able to work with regular expressions, but basic users won’t understand what a regular expression is, and even developers or system administrators have to make an effort to write a regular expression that fits what they are searching for. In most cases, searching by keyword will be sufficient.
4. Reading
These are not related to indexing algorithms, but they are still interesting.
Designing Visual Interfaces: Communication Oriented Techniques
ISBN: 0133033899
The Blind Watchmaker: Why the Evidence of Evolution Reveals a Universe Without Design
ISBN: 0393315703
5. Music


