Map/Reduce: A Simple Explanation

Indexes are a very important part of RavenDB. Without them you couldn't find your documents. But before I can show how they are created and used, I have to explain the Map/Reduce algorithm. Map/Reduce is used for processing large amounts of data and was invented at Google. You can find many explanations and even more formulas on Map/Reduce, but I always found them hard to understand, and it took me a long time to recognize the benefit of this algorithm. Therefore I will attempt a simple (and most likely not 100% correct) explanation.

This post is part of the RavenDB series. You can find the other parts here:

 

The Problem

As explained in “Designing Documents for RavenDB” we can create arbitrarily complex documents. We have the freedom to nest objects inside other objects and lists within lists. Anything that can be represented in JSON can be used for our documents – even if we shouldn’t go that far if we want a maintainable application.

When we look for a specific document, RavenDB uses an index to find it. That index, however, has a predefined structure that knows nothing about our document. It contains the indexed fields and a reference to the document. For a relational database this is not a problem, since all data is already reduced to fields in rows. In a NoSQL database like RavenDB there are no rows, nor do all documents of one type need to have the same fields. Instead, we have to close the gap between the complex structure of the document and the simple one of the index.

 

A Similar Problem with a Solution

Google once had a similar problem. In order to let you search the web by keywords, they needed to know how often a certain word was used on a web page. A document formatted in HTML can be just as nested as one in JSON. But all those documents have one thing in common: words. If we concentrate on the words, the problem splits into two parts: one where we break the text into words, and one where we count them.

Google found out that this approach could be generalized and called the algorithm Map/Reduce. As the name implies, there are two steps involved:

  1. Map: Transform one structure into another, like splitting the document into a collection of words
  2. Reduce: Group the result of the map step by a key and sum up a value

With those two steps we can go from a text document to a list containing every word and how often it is used:

(Figure: Map/Reduce turning a text document into a list of words and their counts)

The reduce function is basically a GROUP BY as we know it from SQL or LINQ. All you need is one value you can use as the key for your group. But as shown in the picture above, you most likely also want another field as a counter, so you can add up the occurrences of the key value.
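In LINQ terms (the language RavenDB indexes are written in), the two steps of the word-count example could be sketched like this – a minimal illustration, not RavenDB-specific code:

```csharp
using System;
using System.Linq;

var text = "the quick fox jumps over the lazy dog";

// Map: transform the document into a collection of words.
var words = text.Split(' ');

// Reduce: group by the word itself (the key) and count each group.
var counts = words
    .GroupBy(word => word)
    .Select(g => new { Word = g.Key, Count = g.Count() });

foreach (var entry in counts)
    Console.WriteLine($"{entry.Word}: {entry.Count}");
// "the" appears twice; every other word once.
```

The `GroupBy` call is exactly the GROUP BY mentioned above; the key is the word, the counter is the size of each group.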

To work together, the map and the reduce function only need a common data format between them. Since there is no other coupling, you can easily replace one map function with another. If you need to count words in a PDF document, all you need is a different map function that can read a PDF file; the reduce function remains the same and can be reused. The same is true the other way around: the map function can stay while the reduce function is swapped.
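That loose coupling can be made visible in code. In this sketch the common data format is simply a sequence of words; the reduce function is reused unchanged, and `PdfTextExtractor` is a hypothetical helper standing in for whatever PDF library you would actually use:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

static class WordCount
{
    // Reusable reduce: group words and count the occurrences of each.
    public static Dictionary<string, int> Reduce(IEnumerable<string> words) =>
        words.GroupBy(w => w)
             .ToDictionary(g => g.Key, g => g.Count());

    // One map function for plain text...
    public static IEnumerable<string> MapText(string text) =>
        text.Split(' ', StringSplitOptions.RemoveEmptyEntries);

    // ...and another for PDFs could produce the same word stream,
    // e.g. MapText(PdfTextExtractor.GetText(path)) – PdfTextExtractor
    // is an assumed helper here, not a real API.
}
```

As long as both map functions emit `IEnumerable<string>`, `Reduce` never needs to change.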

 

The Solution

Map/Reduce can also close the gap between the JSON document and the RavenDB index. With the map function we can extract values and build the simple flat list that is used in the index. The reduce function may not always be necessary, but it helps when we need to group values by a certain key – for example, how many items are shipped with a specific order.
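For the order example, a map/reduce index in RavenDB's LINQ syntax could look roughly like this. It is a sketch under assumptions: `Orders`, `Lines` and `Quantity` are assumed document fields, and the `Map =` / `Reduce =` form mirrors how RavenDB index definitions are written:

```csharp
// Map: flatten each order into one entry per order line.
// (docs.Orders is how RavenDB addresses the Orders collection.)
Map = from order in docs.Orders
      from line in order.Lines
      select new { OrderId = order.Id, Quantity = line.Quantity };

// Reduce: group the mapped entries by order and sum up the quantities.
Reduce = from result in results
         group result by result.OrderId into g
         select new { OrderId = g.Key, Quantity = g.Sum(x => x.Quantity) };
```

Note that map and reduce again share one common shape (`OrderId`, `Quantity`), just as in the word-count example.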

All we need is a little bit of LINQ and the ability to split a problem into smaller parts. The code we have to write to extract the title field from a book looks like this:
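A map function of that kind, in RavenDB's LINQ index syntax, might look like this – a sketch assuming documents in a `Books` collection with a `Title` property:

```csharp
from book in docs.Books
select new { book.Title }
```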

And these two lines are all that’s needed to build a map function.

 

Next

This is all the background knowledge you need about Map/Reduce for the time being. In the next post I will show how an index is built and used in RavenDB.
