MapReduce Explained

Intro


Have you ever thought how does Google Search engine - the core Google product that brought up the company to the position of one of the biggest, if not the biggest, leaders on ICT market - work? It all came up thanks to MapReduce data processing framework. Although nowadays Google Search leverages much more powerful engines like BigTable and Caffeine, it has it origins in MapReduce.

It's everywhere around nowadays. Search engines, bank systems, colaboration platforms - they're all using MapReduce. It's de facto standard for processing and analysing big data sets. Any time you'll hear about Big Data, you'll also hear about MapReduce.

But what MapReduce actually is? According to "MapReduce: Simplied Data Processing on Large Clusters" - an official publication made by Google - a MapReduce "is a programming model and an associated implementation for processing and generating large data sets. Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key". Although you could just read the research paper, that I do encouraged you to do anyway, I believe the best way to understand it is just to cover it based on an simple example. Lets go into the next step then where I'll show you a real, live problem where MapReduce comes with the simplest and lowest cost solution.


Job


I live in Krakow, Poland that is a mid-sized EU city. Lets consider the following problem:

"I need a list of all streets in Krakow, both with the highest house number on particular street. The above needs to be completed in 2 days".

Sounds unworkable. I would need to visit each street, search for a house with the highest number, note it down and so on until I'll have a full list. As the overall length of streets in Krakow is around 1200 km, assuming that I would walk 15 km per day (I am not type of a sportsman and need to spend some time on looking for a house with the highest number too) the above will take me 80 days. Is there any way to accomplish the job then? The answer is: Yes, there's a solution for that and it's called MapReduce! Lets start with the Map step then.

Map


First of all you've forgotten that I have a friends that will help me with that! I have 39 friends, so there are 40 of us. If we split the job that each of us walks 30 km per day, we will all walk 1200 km per day. That means we all could walk the whole Krakow in 1 day, so we would complete the job even before the deadline!

OK. But I said that I could walk 15 km per day only and lets assume that it's indeed an average of all of us. I also mentioned that this comes from both our bodies limitations - performance - and due to the fact that we need to spend some time on looking for a house with the highest number - processing time. I doubt we would be able to bypass the performance, but could we speed up processing time a bit?

Of course we could. Instead of looking for a house with the highest number on particular streets, we could just pass the street quickly and note down its name - key - and each house number - value. As the house number are placed in the best visible places, we would even not need to stop. Lets assume that thanks to the above we save so much time that would enable us to double the distance that we could walk, up to 30 km.

We all would then walk the whole Krakow in 1 day. But will the job be completed after that? Not yet as we would not have a list of streets with the highest house number, but a list of streets with all the house numbers on them instead. This is where the Reduce step comes.

Reduce


So far we've all spent 1 day on completing the job and we're almost done. The only task that is left is to sort the house numbers on particular streets and choose the one with the highest number. Lets assume that it's done just be me and it takes me 1 day. The job will be completed on time then!

But why did we hurry so? Couldn't we just walk those 15 km per day and have the job completed in 2 days anyway, without performing the Reduce step? In each case I'll need to spend 2 days on the job.

That's right, but you might have forgotten about my friends. They won't be involved in the Reduce step, so they'll have a day off and can help someone else, e.g in Warsaw (the capital of Poland). I'm not going to hold them anymore. Each of them did a really great task for me and we all did a really great job. Obviously once I'll sit and manually sort the house numbers that actually sounds like a nightmare ;).

Summary and Explanation


MapReduce is a framework for processing and generating large data sets in a quick and effective manner thanks to distributed computing paradigm. The above example shows that concept in a direct and understandable way. There's a job that gets split into 40 Map tasks and 1 Reduce task. Each task is executed either by me or my friend - tasktrackers - while the whole jobs is executed and coordinated by me only - jobtracker. After each task a list of key - value pairs gets compiled that finally leads to an ultimate one. Simple Map functions enables to speed up processing time. Assigning only one tasktracker to the Map function allows to free up the resources of the remaining tasktrackers.

In a real MapReduce framework jobtrackers and tasktrackers are the computer instances that cooperate together in a distributed computing paradigm. Map and Reduce are the functions written in Java, Python or actually any other language enable for data processing. The whole engine gets coordinated by a dedicated software like Apache Hadoop or proprietary software like in case of Google.

No comments:

Post a Comment