Hadoop in 5 minutes for beginners

So, you don’t know anything about Hadoop and just want a simple picture of it? This post is for you!

So, you have a lot of data (TBs or more), spread all over the place, sometimes structured and sometimes not, and you want to query it. By now you are thinking: I will need a lot of power to query data “organized” like this. Yes, you do; you need Hadoop and all the Big Data technologies around it.

What is Hadoop?

Wikipedia has this interesting fact: “Hadoop was created by Doug Cutting and Mike Cafarella in 2005. Cutting, who was working at Yahoo! at the time, named it after his son’s toy elephant.”

As Apache states, “The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage.”

What?

You have a system (that can be logically organized as cluster >> racks >> nodes) with several components that handle distributed processing and distributed files across a lot of machines. You have, among others, HDFS, a distributed file system, and an implementation of the MapReduce pattern.

HDFS is a file system that works across all the machines in the system, but you see it as a single file system, even though it is distributed over several machines. How about my local file system? It still exists; HDFS works on top of your local file system. (For example, “hadoop fs -ls” is a command you run from your local shell that performs an “ls” on HDFS.)
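
To get a feel for it, here are a few everyday HDFS commands (a minimal sketch; the /user/demo path and file names are just made-up examples):

hadoop fs -mkdir /user/demo               # create a directory in HDFS
hadoop fs -put access.log /user/demo/     # copy a local file into HDFS
hadoop fs -ls /user/demo                  # list that HDFS directory
hadoop fs -cat /user/demo/access.log      # print the file stored in HDFS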

MapReduce is a pattern for processing large data sets (you can use it for small data sets too, because it is a pattern, not a product, and you can implement it in any language with very little code). Hadoop uses this pattern to run your queries over the data. (It uses tasks, jobs, etc. to handle your requests, but always applies this pattern during execution.)
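
To make the pattern concrete, here is the classic word-count example in Java (essentially the one from the Apache Hadoop MapReduce tutorial): the map function emits a (word, 1) pair for every word it sees, and the reduce function sums those counts per word.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map: emit (word, 1) for every word in the input split
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // Reduce: sum the counts for each word
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  // Driver: configure the job and point it at input/output paths in HDFS
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

You would package this as a jar and run it with “hadoop jar wordcount.jar WordCount <input dir> <output dir>”, and Hadoop takes care of spreading the map and reduce tasks over the cluster.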

So, by now you have a distributed file system and an engine of tasks and jobs that runs applications implemented with the MapReduce pattern. Yes, you are right.

So, how can I query all this data? Well, you can implement applications in any language, usually Java, where you control the tasks, the jobs, the Map and Reduce functions, etc., like the word-count sketch above. That is a lot of work. Alternatively, you can use other Big Data technologies that help you implement these queries and handle operations over your data. These are some of the languages (platforms) you can use to simplify your programmer’s life:

Pig (yes, that really is its name) – an example extracted from Apache. Loading and saving data…

/* id.pig */

A = load 'passwd' using PigStorage(':');  -- load the passwd file 
B = foreach A generate $0 as id;  -- extract the user IDs 
store B into 'id.out';  -- write the results to a file named id.out

Hive (“The Apache Hive™ data warehouse software facilitates querying and managing large datasets residing in distributed storage.”) – some examples from Apache. It is SQL-like, but it is not SQL as we know it.

CREATE TABLE invites (foo INT, bar STRING) PARTITIONED BY (ds STRING);
LOAD DATA LOCAL INPATH './examples/files/kv1.txt' OVERWRITE INTO TABLE invites PARTITION (ds='2008-08-15');
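
And to actually query the data, something along these lines (a sketch that assumes the invites table and partition loaded above):

SELECT a.foo, a.bar FROM invites a WHERE a.ds='2008-08-15';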

Jaql – a JSON query language, now used by IBM BigInsights. An example from the project page:

//
// Bind to variable
$log  = read(hdfs("log"));
$user = read(hdfs("user"));

//
// Query 1: filter and transform
$log
-> filter $.from == 101
-> transform { mandatory: $.msg };

// result …
[
  {
    "mandatory": "Hello, world!"
  }
]

Others – there are other projects and languages, but for an introduction I think these three show different ways of querying the data.

The overall picture

When you install Hadoop, you get HDFS and a MapReduce engine. To query the data, you can develop your own code, or you can use languages like Pig, Hive, or Jaql that handle all the MapReduce work behind the scenes. Yes, the queries from these languages are always translated into tasks that run the MapReduce pattern; you don’t have to worry about the MapReduce implementation, and that is why it is fast and why your processing and data can be spread over thousands of machines!
