Thursday, March 12, 2009

MapReduce Framework (Article 2) : Distributed computing

Distributed computing - CPUs having seperate memory. All such CPUs are connected by some means (ethernet, gigabit ethernet, wan etc). 

Basic element of distributed computing is to identify the subproblems that can run simultaneously without any data dependency e.g. in an animation movie, all frames can be rendered simultaneously as to render frame #10 we don't need data from frame #11.

In distributed computing environment, there are multiple processes running on hundreds of CPUs. They work on a copy of original input. Nobody will modify the original input. As the computation proceeds, there will be data transfer happening between nodes. We mainly use TCP/IP protocol to transfer data between nodes.

Challenges in distributed computing environment
  1. Reliable messaging is a MUST.
  2. In distributed computing environment, data will move from one node to another, the intermediate nodes can read your data along with IP headers. So, either you need to trust all the intermediate nodes or build your own protocol, so that data is encrypted but not IP headers.
  3. We need to make sure all data packets originated from host machine.
  4. We need to make sure data sits close to logical computing unit. So Router position is very important.
Hadoop (hadoop.apache.org/core/) is an open source java implementation of distributed computing platform for MapReduce framework that supports the above mentioned features. I will discuss Hadoop distributed File System in article #4.

No comments: