Arya: a MongoDB-based Search Engine
TL;DR
A Python-powered, MongoDB-based search engine
https://github.com/SupermanScott/Arya
Motivation
I wanted to explore MongoDB's Map Reduce framework and build something non-trivial with it. The result is a system that provides an Indexer and a Searcher and stores their data in MongoDB. Search is realtime because the index is just a collection in MongoDB.
Solution
The solution uses Python and pymongo to store the data. It uses a port of Readability's JavaScript bookmarklet to extract the content of a given page. The extracted text goes through a tokenizer that splits on whitespace and a single analyzer that runs the Porter stemming algorithm. This choice means it really only works on English text. The system does allow for multiple analyzers; the obvious addition would be a stop word analyzer that removes common terms like 'the'. The order of the analyzers matters. The indexing phase runs every field of a document through the tokenizer and the analyzers.
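To make the tokenizer and analyzer chain concrete, here is a minimal sketch of such an indexing pass. It is not Arya's actual code: the function names, the tiny stop word list, and the lowercase analyzer are all assumptions for illustration, and a Porter stemmer would plug in as one more analyzer in the list.

```python
from collections import Counter

# Illustrative stop word list only; a real analyzer would use a fuller one.
STOP_WORDS = {"the", "a", "an", "of", "and"}


def tokenize(text):
    """Split on whitespace, as the post describes."""
    return text.split()


def lowercase(tokens):
    """Hypothetical analyzer: normalize case."""
    return [token.lower() for token in tokens]


def remove_stop_words(tokens):
    """Hypothetical analyzer: drop very common terms."""
    return [token for token in tokens if token not in STOP_WORDS]


def analyze(text, analyzers):
    """Tokenize, then run the analyzers in the order given."""
    tokens = tokenize(text)
    for analyzer in analyzers:
        tokens = analyzer(tokens)
    return tokens


def term_counts(document, analyzers):
    """Return {field: Counter(term -> occurrences)} for each string field."""
    return {
        field: Counter(analyze(value, analyzers))
        for field, value in document.items()
        if isinstance(value, str)
    }
```

The order of the analyzers matters even in this toy version: running `remove_stop_words` before `lowercase` lets a capitalized 'The' slip through, because the stop word check here is case-sensitive.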
The indexing code calculates the number of times each term shows up in each field of the document. This count is then added to the reverse index for that term.
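As a rough illustration of what a reverse index entry could look like, the sketch below builds one entry per term for a single document. The 'term' and 'matches' names follow the description in this post; the `doc_id`, `field`, `original`, and `count` keys and the function name are assumptions, not Arya's real schema.

```python
from collections import Counter, defaultdict


def build_index_entries(doc_id, analyzed_fields):
    """Build one reverse-index entry per term for a single document.

    analyzed_fields maps a field name to a list of (original_word, term)
    pairs, for example {"title": [("Searching", "search")]}.
    """
    matches_by_term = defaultdict(list)
    for field, pairs in analyzed_fields.items():
        # Count how often each (original word, term) pair occurs in the field.
        counts = Counter(pairs)
        for (original, term), count in counts.items():
            matches_by_term[term].append(
                {"field": field, "original": original, "count": count}
            )
    # Key names other than "term" and "matches" are hypothetical.
    return [
        {"term": term, "doc_id": doc_id, "matches": matches}
        for term, matches in sorted(matches_by_term.items())
    ]
```

With pymongo, entries like these would be inserted into the index collection, and `collection.create_index("term")` would add the index on the term field.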
Structuring the reverse index document around the term and an array of matches allows for a number of different things. First, it is important to add an index on the 'term' field in the collection that holds the reverse index, which makes querying by term fast. The matches array of embedded documents provides information about each match: the field the match was found in, the original word (which can differ from the term), and the number of times the term was seen in that field. Combining that count with a global count of the number of documents that contain the term allows the tf*idf score to be calculated.
The scoring is done by a finalize function and is straightforward.
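The real finalize function is JavaScript that runs inside MongoDB. The following is only a Python sketch of the kind of tf*idf arithmetic it might perform, assuming the common `tf * log(N / df)` form and the hypothetical `count` key for per-field frequencies. A small `rank` helper shows the sort and offset/limit step applied to the scored results.

```python
import math


def tf_idf(term_frequency, docs_with_term, total_docs):
    """tf*idf: term frequency times the log of inverse document frequency."""
    if docs_with_term == 0:
        return 0.0
    return term_frequency * math.log(total_docs / docs_with_term)


def score_document(matches, docs_with_term, total_docs):
    """Sum the per-field counts of one index entry, then weight by idf."""
    term_frequency = sum(match["count"] for match in matches)
    return tf_idf(term_frequency, docs_with_term, total_docs)


def rank(scores, offset=0, limit=10):
    """Sort {doc_id: score} by descending score and return one page."""
    ordered = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return ordered[offset:offset + limit]
```

A term that appears in every document gets an idf of `log(1) = 0`, so it contributes nothing to the score, which is the usual reason very common words rank poorly under this weighting.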
Bringing this all together is a call to MongoDB's map reduce with inline output and a query on the terms key. The call also passes the total number of documents in the database to the finalize function through the JavaScript scope. The results of the map reduce call are then sorted by score and sliced according to the requested offset and limit.
Running the Demo
The code is on GitHub: https://github.com/SupermanScott/Arya. Just check out the repo, run pip install -r requirements.txt, and follow the instructions in the README. It is currently pretty bare bones; it was intended as a non-trivial exercise in MongoDB, not a full-on search solution.
Issues and Improvements
The system is currently hard-coded with one tokenizer and one analyzer, which can easily be changed. The searcher returns the document and the score it received, but not where the term is or any information on how to 'highlight' the result. This is doable by adding the required information to the match embedded document and processing it in the map reduce phase. There is no query caching in this system, so paging through the results repeats the same work. It would be best to cache the output of the map reduce in Redis using a sorted set. The Redis key would have to be derived from the query.
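One way the caching idea could look is sketched below. Nothing here exists in Arya: the key prefix, the function names, the five minute expiry, and the choice to treat the query as an unordered bag of terms are all assumptions. The `client` argument is expected to behave like a redis-py `redis.Redis` instance, using its `zadd`, `expire`, and `zrevrange` methods.

```python
import hashlib


def cache_key(terms, prefix="arya:query:"):
    """Derive a stable key from the analyzed query terms.

    Sorting makes the key order-insensitive, which assumes that term
    order does not affect scoring.
    """
    normalized = " ".join(sorted(terms))
    return prefix + hashlib.sha1(normalized.encode("utf-8")).hexdigest()


def store_results(client, key, scores, ttl_seconds=300):
    """Store {doc_id: score} in a sorted set and let it expire."""
    if scores:
        client.zadd(key, scores)  # mapping of member -> score (redis-py 3+)
        client.expire(key, ttl_seconds)


def fetch_page(client, key, offset=0, limit=10):
    """Return one page of (doc_id, score) pairs, highest score first."""
    return client.zrevrange(key, offset, offset + limit - 1, withscores=True)
```

With this in place, the searcher would run the map reduce only when the key is missing, and every later page for the same query would be a single `fetch_page` call. The expiry matters because the index is realtime, so cached scores go stale as new documents are indexed.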