Arya: a MongoDB-based Search Engine

TL;DR

A Python-powered, MongoDB-based search engine:

https://github.com/SupermanScott/Arya

Motivation

I wanted to explore MongoDB's Map Reduce framework and to build something non-trivial. The result is a system that provides an Indexer and a Searcher, both of which store their data in MongoDB. Search is realtime because the index is just a collection in MongoDB.

Solution

The solution uses Python and pymongo to store the data. It uses a port of Readability's Javascript bookmarklet to extract the content of a given page. The extracted text then goes through a tokenizer that splits on whitespace and one analyzer that runs the Porter Stem algorithm. This choice means it really only works on English text. The system does allow for multiple analyzers; the obvious case would be a stop word analyzer that removes common terms like 'the'. The order of the analyzers does matter. The indexing phase goes like this:
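As a rough sketch of that pipeline (not the code from the repo), assume each analyzer is a function that takes a list of tokens and returns a new list. The names `tokenize`, `analyze`, `STOP_WORDS` and `toy_stem` are made up for illustration, and `toy_stem` is only a stand-in for the real Porter stemmer.

```python
from collections import Counter

# Hypothetical, tiny stop word list for illustration only.
STOP_WORDS = {"the", "a", "an", "of"}


def tokenize(text):
    """Split the text on whitespace."""
    return text.split()


def lowercase(tokens):
    """Analyzer: lowercase every token."""
    return [t.lower() for t in tokens]


def remove_stop_words(tokens):
    """Analyzer: drop tokens that appear in STOP_WORDS (case sensitive)."""
    return [t for t in tokens if t not in STOP_WORDS]


def toy_stem(tokens):
    """Analyzer: stand-in for Porter stemming, strips one trailing 's'."""
    return [t[:-1] if len(t) > 3 and t.endswith("s") else t for t in tokens]


def analyze(text, analyzers):
    """Tokenize the text, then run the analyzers in the order given."""
    tokens = tokenize(text)
    for analyzer in analyzers:
        tokens = analyzer(tokens)
    return tokens


def term_frequencies(text, analyzers):
    """Count how often each analyzed term occurs in the text."""
    return Counter(analyze(text, analyzers))
```

With `[lowercase, remove_stop_words, toy_stem]` the text "The Redis servers" becomes `redi` and `server`. Putting `remove_stop_words` first lets "The" slip through, because it is still capitalised when the stop word check runs. That is the sense in which analyzer order matters.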

This snippet calculates the number of times that each term shows up in a given field of the document. That count is then added to the reverse index for that term.

Structuring the document like this allows for a number of different things to occur. First, it is important to add an index on the 'term' field in the proper collection. This will make querying by that term fast. The matches array of embedded documents provides information about each match: the field that the match was found in, the original word (which is different from the term), and the number of times that term was seen in that field. Combining the last property with a global count of the number of documents that contain this term allows the tf*idf score to be calculated.
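To make the shape of a reverse index entry concrete, here is a minimal sketch that uses a plain dict in place of the MongoDB collection. The key names `doc_id`, `field`, `original` and `count` are assumptions made for this example; the real documents may use different names.

```python
def add_to_index(index, doc_id, field, original, term, count):
    """Append one match to the entry for `term`, creating the entry if needed.

    `index` is a dict keyed by term that stands in for the collection.
    """
    entry = index.setdefault(term, {"term": term, "matches": []})
    entry["matches"].append(
        {"doc_id": doc_id, "field": field, "original": original, "count": count}
    )
    return entry


def document_frequency(entry):
    """Number of distinct documents that contain the term."""
    return len({match["doc_id"] for match in entry["matches"]})
```

A term that appears in both the title and the body of one page gets two matches but still counts as a single document for the document frequency.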

Searching

Searching must apply the same tokenizer and analyzer pipeline to the query string. Doing this ensures that when I search for "Redis" the mongo query will use the term "redi". The tokens derived from the query are then used to provide a 'query' to the Map Reduce operation. Here is the map function:

The map function's job is to emit every document listed in the term's matches, along with the match's term frequency and the term's document frequency. These numbers will be used to calculate the tf*idf score. The reduce function does not calculate that score, but instead sums up the term frequencies for the document. This is needed because the term can be in many fields of the document. This is what the reducer looks like:
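The real map and reduce functions are Javascript that runs inside MongoDB. The following is only a pure Python emulation of the same logic for a single query term, assuming index entries shaped as `{"term": ..., "matches": [{"doc_id": ..., "count": ...}]}`. Those key names and the `run_map_reduce` driver are assumptions made for this sketch.

```python
from collections import defaultdict


def map_entry(entry):
    """Yield (doc_id, value) for every match in one index entry."""
    # Document frequency: distinct documents that contain the term.
    df = len({match["doc_id"] for match in entry["matches"]})
    for match in entry["matches"]:
        yield match["doc_id"], {"tf": match["count"], "df": df}


def reduce_values(doc_id, values):
    """Sum the term frequencies emitted for one document.

    The result has the same shape as the inputs, so it can be reduced again.
    """
    return {"tf": sum(v["tf"] for v in values), "df": values[0]["df"]}


def run_map_reduce(entries):
    """Group mapped values by key, then reduce each group."""
    grouped = defaultdict(list)
    for entry in entries:
        for key, value in map_entry(entry):
            grouped[key].append(value)
    return {key: reduce_values(key, values) for key, values in grouped.items()}
```

A document that matches in both its title and its body produces two emitted values, and the reducer folds them into one total term frequency for that document.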

The scoring is done by a finalize function and it is straightforward:
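A Python sketch of that scoring step could look like the following. It assumes the common form of the score, raw term frequency multiplied by log(N / df), which may not be the exact variant used in the repo. The `total_docs` argument stands in for the value passed through Javascript scope.

```python
import math


def finalize(key, value, total_docs):
    """Attach a tf*idf score to the reduced value for one document."""
    # Inverse document frequency: rarer terms get a larger weight.
    idf = math.log(total_docs / value["df"])
    return {"tf": value["tf"], "df": value["df"], "score": value["tf"] * idf}
```

A term that appears in every document gets an idf of zero under this formula, so it contributes nothing to the ranking.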

Bringing this all together is a call to MongoDB map reduce with inline output and a query on the terms key. The call also passes the total number of documents in the database to the finalize function using Javascript scope. The results from the Map Reduce call are then sorted by score and sliced to the requested offset and limit.
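The last step, sorting and paging, can be sketched in plain Python. This assumes the scored results have already been collected into a dict that maps document id to score; the function name `top_results` is made up for this example.

```python
def top_results(scored, offset=0, limit=10):
    """Return (doc_id, score) pairs, highest score first, for one page."""
    ranked = sorted(scored.items(), key=lambda item: item[1], reverse=True)
    return ranked[offset:offset + limit]
```

Every page request sorts the full result set again, which is part of why paging repeats work when nothing is cached.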

Running the Demo

The code is pushed to Github here: https://github.com/SupermanScott/Arya. Just check out the repo, run pip install -r requirements.txt and follow the instructions in the README. It is currently pretty bare bones; it was intended as a non-trivial exercise in MongoDB, not a full-on search solution.

Issues and Improvements

The system is currently hard coded with one tokenizer and one analyzer. This can easily be changed. The searcher returns the document and the score it received, but not where the term is or any information on how to 'highlight' the result. This is doable by adding the required information to the embedded match document and passing it through the Map Reduce phase. There is no query caching in this system, so paging through the results repeats the same work for every page. It would be best to cache the output of the map reduce in Redis using a sorted set. The Redis key would have to be derived from the query.
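One way to derive such a key, sketched here under a few assumptions: the key is built from the analyzed terms rather than the raw query string, term order and duplicates do not affect the score, and the `arya:query:` prefix is a made-up namespace.

```python
import hashlib


def cache_key(terms, prefix="arya:query:"):
    """Build a stable Redis key from the analyzed query terms."""
    # Sort and de-duplicate so equivalent queries share one cache entry.
    normalized = " ".join(sorted(set(terms)))
    digest = hashlib.sha1(normalized.encode("utf-8")).hexdigest()
    return prefix + digest
```

With the scores stored as sorted set scores under that key, a page of results could then be read back with the Redis ZREVRANGE command instead of running the map reduce again.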

Elva Realtime Log System

TL;DR

Realtime log server for Python processes here:

https://github.com/SupermanScott/Elva

Motivation

Expose long running and inter-connected Python processes to a central spot without using SSH or tailing log files.

Technologies

The solution uses Redis, python-redis-log, Tornado and Server-Sent Events to push the log info to Javascript. Redis was chosen because it was an existing part of my infrastructure and provided a straightforward Publish/Subscribe system. There are plenty of other PubSub systems out there and some may fit your situation better.

Python Redis Log provides the connection between the Python logging framework and Redis, and produces rich JSON output.

Tornado is an asynchronous webserver that avoids the thread-per-request model that Apache uses. This makes better use of CPU and memory for requests that sit idle awaiting new messages, which describes this workload well.

This solution also uses Server-Sent Events to push the new messages to the browser. This was chosen over Websockets because Websockets are for two-way communication and the problem calls for just one way (push). The solution would work just as well with Websockets, and would be less code, as a Websocket handler exists in the Tornado code base where a Server-Sent Events handler does not.

Problem

At my current position we have a number of long running Python processes that do a lot of key work behind the scenes for our website. I want to resist the urge to SSH into a machine to inspect log files, and I want an easy way to determine the health of these systems quickly. Providing a central spot to view all these log files aggregated together seemed like a good solution (along with providing markup so that Javascript could hide certain messages).

Using Redis and PubSub was an easy decision. My search for a way to connect Redis-Py and the Tornado IO Loop, though, didn't net any results. The most common solution was to spawn a thread, separate from Tornado, that just listened to Redis. The result of this was that Ctrl-C failed to shut down the Python process. Another solution was to create a new Redis Python library tied to the Tornado IO Loop. That seems extreme, and I couldn't get those solutions to work for me. I only need the Subscribe command to be on the Tornado IO Loop, not the whole thing.

Solution

The solution was to create a subclass of redis.client.PubSub that creates a tornado.iostream.IOStream from the socket created by Redis-Py. It then reads on that stream until \r\n and processes those messages. The server also needs a Server-Sent Events RequestHandler that can push messages out to the waiting Javascript. This ended up being really straightforward, as Server-Sent Events are really basic, unlike Websockets (with all the various drafts...).
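To show what "processing those messages" involves, here is a minimal sketch of parsing one Redis PubSub reply from raw bytes. It is not the handler from the repo: it works on a byte buffer instead of a Tornado stream, and the function name `parse_multibulk` is made up. In the Redis protocol a published message arrives as a multi-bulk reply with three items: the word "message", the channel name and the payload.

```python
def parse_multibulk(buffer):
    """Parse one multi-bulk reply from the start of `buffer`.

    Returns (items, remaining_bytes), or (None, buffer) if the reply
    has not fully arrived yet. Raises ValueError for other reply types.
    """
    if not buffer:
        return None, buffer
    if not buffer.startswith(b"*"):
        raise ValueError("not a multi-bulk reply: %r" % buffer[:20])
    head, sep, rest = buffer.partition(b"\r\n")
    if not sep:
        return None, buffer
    items = []
    for _ in range(int(head[1:])):
        head, sep, rest = rest.partition(b"\r\n")
        if not sep:
            return None, buffer
        if head.startswith(b"$"):
            # Bulk string: a length line, the payload, then \r\n.
            length = int(head[1:])
            if len(rest) < length + 2:
                return None, buffer
            items.append(rest[:length])
            rest = rest[length + 2:]
        elif head.startswith(b":"):
            # Integer, as in the reply that confirms a subscribe.
            items.append(int(head[1:]))
        else:
            raise ValueError("unexpected item: %r" % head)
    return items, rest
```

Note that an error reply from Redis (a line starting with "-") raises ValueError here instead of being silently skipped.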

The PubSub handler looks like this:

The Server-Sent Event handler is also simple. It is a simple protocol that involves a Content-type header and plain text output.
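The wire format itself can be sketched without Tornado. The helper below, whose name `format_sse` is made up for this example, builds the plain text for one event. A handler would send the `text/event-stream` content type once and then write one such block per log message.

```python
# Headers for the response. Cache-Control is a common addition here,
# not something the protocol strictly requires.
SSE_HEADERS = {"Content-Type": "text/event-stream", "Cache-Control": "no-cache"}


def format_sse(data, event=None, event_id=None):
    """Format one Server-Sent Event as plain text."""
    lines = []
    if event_id is not None:
        lines.append("id: %s" % event_id)
    if event is not None:
        lines.append("event: %s" % event)
    # Multi-line payloads need one "data:" line per line of text.
    for chunk in data.splitlines() or [""]:
        lines.append("data: %s" % chunk)
    # A blank line marks the end of the event.
    return "\n".join(lines) + "\n\n"
```

For example, `format_sse("hello")` produces the text `data: hello` followed by a blank line, which is all the browser needs to fire a message event.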

Then all it takes is a simple Javascript:

Then all that is missing is connecting the Python processes to the Python Redis logger. To experiment using the Python REPL:

Or to use logging.config.dictConfig:
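As a sketch of the dictConfig wiring only, the example below uses a hypothetical stand-in handler, `ChannelHandler`, that collects JSON payloads in a list instead of publishing them to Redis. The handler class, the `logs` channel name and the `worker` logger name are all assumptions. In a real setup the `()` entry would point at the handler that python-redis-log provides.

```python
import json
import logging
import logging.config


class ChannelHandler(logging.Handler):
    """Hypothetical stand-in for a handler that publishes to Redis.

    It stores each JSON payload in a list instead of publishing it.
    """

    def __init__(self, channel):
        super().__init__()
        self.channel = channel
        self.published = []

    def emit(self, record):
        payload = {
            "name": record.name,
            "level": record.levelname,
            "msg": record.getMessage(),
        }
        self.published.append(json.dumps(payload))


LOGGING = {
    "version": 1,
    "disable_existing_loggers": False,
    "handlers": {
        # "()" names a factory; the remaining keys are passed as arguments.
        "channel": {"()": ChannelHandler, "channel": "logs", "level": "INFO"},
    },
    "loggers": {
        "worker": {"handlers": ["channel"], "level": "INFO", "propagate": False},
    },
}

logging.config.dictConfig(LOGGING)
log = logging.getLogger("worker")
log.info("job %s finished", 42)
log.debug("below the INFO level, so not handled")
```

Only the INFO message reaches the handler; the DEBUG call is dropped by the logger's level before any handler sees it.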

Running the Tornado application should then create a server on localhost:8888 that prints all log statements into a table.

Running the Demo

The code is pushed to Github here: https://github.com/SupermanScott/Elva. It uses all that is described in the post along with Twitter Bootstrap and mustache.js to create a decent looking interface. It buffers the messages and provides a range control to adjust the speed of the output. Just check out the repo, run pip install -r requirements.txt and then python app.py.

Issues and Improvements

The big issue is how fragile the PubSub client is. It only looks at the message responses from Redis. If an ERR response is returned, it is ignored, and the client seems poorly protected against errors in general. Extending this work should involve mechanisms to filter the log output by error level and by which logger produced the message.