Create Your First Hadoop Program

Goal: Find out the number of products sold in each country.

Input: Our input data set is a CSV file, SalesJan2009.csv

Prerequisites:

  • This tutorial is developed on the Linux (Ubuntu) operating system.
  • You should have Hadoop (version 2.2.0 is used for this tutorial) already installed.
  • You should have Java (version 1.8.0 is used for this tutorial) already installed on the system.

Before we start with the actual process, change the user to 'hduser' (the user used for Hadoop).

su - hduser

Steps:

1. Create a new directory named MapReduceTutorial

sudo mkdir MapReduceTutorial

Give permissions

sudo chmod -R 777 MapReduceTutorial

Copy the files SalesMapper.java, SalesCountryReducer.java, and SalesCountryDriver.java into this directory.


If you want to understand the code in these files, refer to the accompanying guide.

Check the file permissions of all these files, and if 'read' permission is missing, grant it.
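For example, assuming the three .java files sit in the MapReduceTutorial directory you just created, read permission can be added with:

chmod +r SalesMapper.java SalesCountryReducer.java SalesCountryDriver.java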

2. Export classpath

export CLASSPATH="$HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-client-core-2.2.0.jar:$HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-client-common-2.2.0.jar:$HADOOP_HOME/share/hadoop/common/hadoop-common-2.2.0.jar:~/MapReduceTutorial/SalesCountry/*:$HADOOP_HOME/lib/*"

3. Compile the Java files (these files are present in the directory Final-MapReduceHandsOn). Their class files will be put in the package directory

javac -d . SalesMapper.java SalesCountryReducer.java SalesCountryDriver.java

If the compiler prints a warning (for example, about the use of deprecated APIs), it can be safely ignored.

This compilation will create a directory in the current directory named after the package specified in the Java source files (i.e., SalesCountry in our case) and put all the compiled class files in it.

4. Create a new file Manifest.txt

sudo gedit Manifest.txt

and add the following line to it:

Main-Class: SalesCountry.SalesCountryDriver

SalesCountry.SalesCountryDriver is the name of the main class. Please note that you have to hit the Enter key at the end of this line.

5. Create a jar file

jar cfm ProductSalePerCountry.jar Manifest.txt SalesCountry/*.class

Check that the jar file has been created.
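For example, a directory listing run from the same location where the jar command was executed should now show the new jar:

ls -l ProductSalePerCountry.jar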

6. Start Hadoop

$HADOOP_HOME/sbin/start-dfs.sh

$HADOOP_HOME/sbin/start-yarn.sh
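Optionally, you can confirm that the HDFS and YARN daemons (NameNode, DataNode, ResourceManager, NodeManager, etc.) are running with the JDK's jps tool; the exact process list depends on your setup:

jps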

7. Copy the file SalesJan2009.csv into ~/inputMapReduce

Now use the command below to copy ~/inputMapReduce to HDFS.

$HADOOP_HOME/bin/hdfs dfs -copyFromLocal ~/inputMapReduce /

If a warning is printed (for example, about the native Hadoop library), it can be safely ignored.

Verify whether the file was actually copied.

$HADOOP_HOME/bin/hdfs dfs -ls /inputMapReduce

8. Run MapReduce job

$HADOOP_HOME/bin/hadoop jar ProductSalePerCountry.jar /inputMapReduce /mapreduce_output_sales

This will create an output directory named mapreduce_output_sales on HDFS. The contents of this directory will be a file containing the product sales per country.

9. The result can be seen through the command-line interface:

$HADOOP_HOME/bin/hdfs dfs -cat /mapreduce_output_sales/part-00000


            OR

Results can also be seen via the web interface:

Open the NameNode web interface (by default at http://localhost:50070 in Hadoop 2.x) in a web browser.

Now select 'Browse the filesystem' and navigate to /mapreduce_output_sales


Open the output file (named part-r-00000 or part-00000, as in the command above, depending on the MapReduce API used by the driver).


Introduction To Flume and Sqoop

Before we learn more about Flume and Sqoop, let's study the

Issues with Data Load into Hadoop

Analytical processing using Hadoop requires loading of huge amounts of data from diverse sources into Hadoop clusters.

This process of bulk data load into Hadoop from heterogeneous sources, and then processing it, comes with a certain set of challenges.

Maintaining data consistency and ensuring efficient utilization of resources are some of the factors to consider before selecting the right approach for data load.

Major Issues:

1. Data load using Scripts

The traditional approach of using scripts to load data is not suitable for bulk data load into Hadoop; this approach is inefficient and very time-consuming.

2. Direct access to external data via Map-Reduce application

Providing MapReduce applications with direct access to data residing in external systems (without loading it into Hadoop) complicates these applications. So, this approach is not feasible.

In addition to having the ability to work with enormous data, Hadoop can work with data in several different forms. So, to load such heterogeneous data into Hadoop, different tools have been developed. Sqoop and Flume are two such data loading tools.

Introduction to SQOOP

Apache Sqoop (SQL-to-Hadoop) is designed to support bulk import of data into HDFS from structured data stores such as relational databases, enterprise data warehouses, and NoSQL systems. Sqoop is based upon a connector architecture which supports plugins to provide connectivity to new external systems.

An example use case of Sqoop is an enterprise that runs a nightly Sqoop import to load the day's data from a production transactional RDBMS into a Hive data warehouse for further analysis.
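As an illustration only (the connection string, credentials, and table name below are placeholders, not part of this tutorial), such a nightly import from a MySQL database into Hive could look roughly like this:

sqoop import \
  --connect jdbc:mysql://dbserver/sales \
  --username dbuser -P \
  --table transactions \
  --hive-import

Here -P prompts for the password interactively, and --hive-import loads the imported data into a Hive table after it lands in HDFS.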

Sqoop Connectors

All existing Database Management Systems are designed with the SQL standard in mind. However, each DBMS differs to some extent with respect to its dialect. This difference poses challenges when it comes to data transfer across systems. Sqoop connectors are components which help overcome these challenges.

Data transfer between Sqoop and external storage system is made possible with the help of Sqoop’s connectors.

Sqoop has connectors for working with a range of popular relational databases, including MySQL, PostgreSQL, Oracle, SQL Server, and DB2. Each of these connectors knows how to interact with its associated DBMS. There is also a generic JDBC connector for connecting to any database that supports Java’s JDBC protocol. In addition, Sqoop provides optimized MySQL and PostgreSQL connectors that use database-specific APIs to perform bulk transfers efficiently.
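For a database without a dedicated connector, the generic JDBC connector can be selected by specifying the JDBC driver class explicitly. The driver class, connection string, table, and target directory below are placeholders for illustration only:

sqoop import \
  --driver org.example.SomeJdbcDriver \
  --connect jdbc:example://dbserver/warehouse \
  --table orders \
  --target-dir /sqoop/orders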

In addition to these built-in connectors, Sqoop has various third-party connectors for data stores, ranging from enterprise data warehouses (including Netezza, Teradata, and Oracle) to NoSQL stores (such as Couchbase). However, these connectors do not come with the Sqoop bundle; they need to be downloaded separately and can be added easily to an existing Sqoop installation.

Introduction to FLUME

Apache Flume is a system used for moving massive quantities of streaming data into HDFS. Collecting log data present in log files from web servers and aggregating it in HDFS for analysis is one common example use case of Flume.

Flume supports multiple sources, such as:

  • 'tail' (which pipes data from a local file and writes it into HDFS via Flume, similar to the Unix command 'tail')
  • System logs
  • Apache log4j (enables Java applications to write events to files in HDFS via Flume)

Data Flow in Flume

A Flume agent is a JVM process which has three components (Flume Source, Flume Channel, and Flume Sink) through which events propagate after being initiated at an external source. A minimal agent configuration illustrating this flow is sketched after the list below.

  1. The events generated by an external source (e.g., a web server) are consumed by the Flume Data Source. The external source sends events to the Flume source in a format that is recognized by the target source.
  2. The Flume Source receives an event and stores it into one or more channels. The channel acts as a store which keeps the event until it is consumed by the Flume sink. This channel may use a local file system in order to store these events.
  3. The Flume sink removes the event from the channel and stores it into an external repository such as HDFS. There could be multiple Flume agents, in which case the Flume sink forwards the event to the Flume source of the next Flume agent in the flow.
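The sketch below shows a minimal single-agent configuration for this source, channel, and sink chain. The agent and component names (a1, r1, c1, k1), the log file path, and the HDFS path are illustrative assumptions, not values from this tutorial:

cat > example-agent.conf <<'EOF'
# One source, one in-memory channel, one HDFS sink
a1.sources  = r1
a1.channels = c1
a1.sinks    = k1

# Tail a local log file (exec source)
a1.sources.r1.type    = exec
a1.sources.r1.command = tail -F /var/log/webserver/access.log

# Buffer events in memory until the sink consumes them
a1.channels.c1.type = memory

# Write events to HDFS
a1.sinks.k1.type      = hdfs
a1.sinks.k1.hdfs.path = hdfs://localhost:9000/flume/events

# Wire the source and the sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel    = c1
EOF

# Launch the agent (assumes flume-ng is on the PATH)
flume-ng agent --conf conf --conf-file example-agent.conf --name a1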

Some Important features of FLUME

  • Flume has a flexible design based upon streaming data flows. It is fault-tolerant and robust, with multiple failover and recovery mechanisms. Flume offers different levels of reliability, including 'best-effort delivery' and 'end-to-end delivery'. Best-effort delivery does not tolerate any Flume node failure, whereas 'end-to-end delivery' mode guarantees delivery even in the event of multiple node failures.
  • Flume carries data between sources and sinks. This gathering of data can either be scheduled or event-driven. Flume has its own query processing engine which makes it easy to transform each new batch of data before it is moved to the intended sink.
  • Possible Flume sinks include HDFS and HBase. Flume can also be used to transport event data, including but not limited to network traffic data, data generated by social media websites, and email messages.

Since July 2012, Flume has been released as Flume NG (New Generation), as it differs significantly from its original release, known as Flume OG (Original Generation).

Sqoop vs. Flume vs. HDFS

  • Purpose: Sqoop is used for importing data from structured data sources such as an RDBMS. Flume is used for moving bulk streaming data into HDFS. HDFS is a distributed file system used by the Hadoop ecosystem to store data.
  • Architecture: Sqoop has a connector-based architecture; connectors know how to connect to the respective data source and fetch the data. Flume has an agent-based architecture; here, code (called an 'agent') is written which takes care of fetching the data. HDFS has a distributed architecture in which data is spread across multiple data nodes.
  • Relation to HDFS: With Sqoop, HDFS is a destination for data import. With Flume, data flows to HDFS through zero or more channels. HDFS itself is the ultimate destination for data storage.
  • Event-driven loading: Sqoop data load is not event driven. Flume data load can be driven by events. HDFS just stores whatever data is provided to it, by whatever means.
  • When to use: In order to import data from structured data sources, one has to use Sqoop, because its connectors know how to interact with structured data sources and fetch data from them. In order to load streaming data, such as tweets generated on Twitter or the log files of a web server, Flume should be used, since Flume agents are built for fetching streaming data. HDFS has its own built-in shell commands to store data into it, but HDFS cannot import streaming data on its own.

Hadoop – HDFS Overview

The Hadoop File System (HDFS) was developed using a distributed file system design. It runs on commodity hardware. Unlike other distributed systems, HDFS is highly fault-tolerant and designed using low-cost hardware.

HDFS holds very large amounts of data and provides easy access. To store such huge data, the files are stored across multiple machines. These files are stored in a redundant fashion to rescue the system from possible data loss in case of failure. HDFS also makes applications available for parallel processing.

Features of HDFS

  • It is suitable for distributed storage and processing.
  • Hadoop provides a command interface to interact with HDFS (see the example after this list).
  • The built-in servers of the namenode and datanode help users to easily check the status of the cluster.
  • It provides streaming access to file system data.
  • HDFS provides file permissions and authentication.
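As an example of that command interface, a few common HDFS shell commands are shown below; the paths are illustrative only:

$HADOOP_HOME/bin/hdfs dfs -mkdir /user/hduser/demo                 # create a directory in HDFS
$HADOOP_HOME/bin/hdfs dfs -put localfile.txt /user/hduser/demo     # copy a local file into HDFS
$HADOOP_HOME/bin/hdfs dfs -ls /user/hduser/demo                    # list the directory
$HADOOP_HOME/bin/hdfs dfs -cat /user/hduser/demo/localfile.txt     # print the file contents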

HDFS Architecture


HDFS follows the master-slave architecture and it has the following elements.

Namenode

The namenode is the commodity hardware that contains the GNU/Linux operating system and the namenode software. The namenode software can be run on commodity hardware. The system having the namenode acts as the master server and it does the following tasks:

  • Manages the file system namespace.
  • Regulates client’s access to files.
  • It also executes file system operations such as renaming, closing, and opening files and directories.

Datanode

The datanode is commodity hardware having the GNU/Linux operating system and the datanode software. For every node (commodity hardware/system) in a cluster, there will be a datanode. These nodes manage the data storage of their system.

  • Datanodes perform read-write operations on the file systems, as per client request.
  • They also perform operations such as block creation, deletion, and replication according to the instructions of the namenode (see the example below).
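A quick way to see the namenode's view of the cluster, including which datanodes are live and how much capacity each one reports, is the dfsadmin report (run as the HDFS user):

$HADOOP_HOME/bin/hdfs dfsadmin -report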

Block

Generally, the user data is stored in the files of HDFS. A file in the file system is divided into one or more segments and/or stored on individual data nodes. These file segments are called blocks. In other words, the minimum amount of data that HDFS can read or write is called a block. The default block size is 64 MB (128 MB in Hadoop 2.x), but it can be changed as needed in the HDFS configuration.
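For example, the blocks of a file already stored in HDFS can be inspected with fsck, and a different block size can be requested for a new file at write time by overriding the dfs.blocksize property. The property name and the 128 MB value assume Hadoop 2.x, and the paths are illustrative:

# Show block details of an existing HDFS path
$HADOOP_HOME/bin/hdfs fsck /inputMapReduce -files -blocks

# Upload a file with a 128 MB block size (128 * 1024 * 1024 bytes)
$HADOOP_HOME/bin/hdfs dfs -D dfs.blocksize=134217728 -put SalesJan2009.csv /demo_blocksize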

Goals of HDFS

  • Fault detection and recovery: Since HDFS includes a large number of commodity hardware components, failure of components is frequent. Therefore, HDFS should have mechanisms for quick and automatic fault detection and recovery.
  • Huge datasets: HDFS should have hundreds of nodes per cluster to manage applications having huge datasets.
  • Hardware at data: A requested task can be done efficiently when the computation takes place near the data. Especially where huge datasets are involved, this reduces the network traffic and increases the throughput.


Hadoop – Big Data Solutions

In the traditional approach, an enterprise will have a computer to store and process big data. Here, data will be stored in an RDBMS like Oracle Database, MS SQL Server, or DB2, and sophisticated software can be written to interact with the database, process the required data, and present it to the users for analysis purposes.


Limitation

This approach works well for smaller volumes of data that can be accommodated by standard database servers, or up to the limit of the processor that is processing the data. But when it comes to dealing with huge amounts of data, it is a tedious task to process such data through a single traditional database server.

Google’s Solution

Google solved this problem using an algorithm called MapReduce. This algorithm divides the task into small parts and assigns those parts to many computers connected over the network, and collects the results to form the final result dataset.

The MapReduce approach distributes the work across commodity machines, which could be single-CPU machines or servers with higher capacity.

Hadoop

Doug Cutting, Mike Cafarella, and their team took the solution provided by Google and started an open-source project called Hadoop in 2005, and Doug named it after his son's toy elephant. Now Apache Hadoop is a registered trademark of the Apache Software Foundation.

Hadoop runs applications using the MapReduce algorithm, where the data is processed in parallel on different CPU nodes. In short, the Hadoop framework is capable of developing applications that run on clusters of computers and perform complete statistical analysis of huge amounts of data.


Advanced Cloud Computing Summer Internship

More about cloud computing

In today's era of the Internet, modern computing is a remarkable thing. Have you ever wondered, while sitting at home and watching a video on YouTube, how this system works? Our advanced cloud computing training will give you all the knowledge that is necessary for such computing. Your computer is connected with other computers, and the information it serves can spread all over the world; this model is called cloud computing. Advanced cloud computing offers other benefits as well. Today our programs are travelling beyond the realm of the personal computers in our homes. When you watch a game or a news telecast at home, remember that there are hundreds of other computers connected to these services. They pass the signal to various machines using a protocol, thereby creating a cloud of information.

Major topics that are included in Cloud Computing:

  • Building a data center for the cloud.
  • Understanding the components of cloud computing.
  • Working with database services.
  • Understanding cloud computing concepts.
  • Taking a look at security risks.
  • Obtaining protocols and cloud storage.
  • Working with major service providers.
  • Private cloud management.

Summer Training and its Importance

A summer break can be well utilised by doing summer training under an artist, a designer, or in industry. If you are smart in your work, you can get such a position easily, and it also helps you to build up your CV. Summer internships serve the purpose of both the intern and the entrepreneur: they are an opportunity to learn from more experienced people in the field you want to work in.

For college students, it is like a test drive for your career, and it also gives you a broad way to determine what environment suits you. For instance, if you are in a creative field like designing or fine arts, summer training will carve a path for your future. While working under a professional, you will come to know what exactly professionals do and how a firm or company works. You will also build up a list of contacts for future reference. While working in an export house or fashion house, you may discover that this is not for you and that you should focus on something else: perhaps fashion designing is not your cup of tea, and pattern making, technical designing, or another branch suits you better. You may find that you would like to work under a designer, in an export house or a fashion house, or that you have the capability to run your own brand.

The strongest benefit of being involved in a summer internship is that you can be hired as an employee as soon as you finish college. In the current scenario of fierce recession, this is a huge factor. If your involvement during the summer training is good, then as many as 90 percent of such students are hired by the company for permanent positions, and that serves your purpose.


OpenStack training and the benefits.

Cloud computing is getting more attention within enterprises of all sizes and shapes. However, very few technologists know exactly how to design, scope, and construct cloud deployments. Simply taking standard databases, software applications, and user interfaces and deploying them in a cloud environment is a recipe for disaster. Correct scoping, careful design, and usage modeling are all important elements to get right in the cloud. This OpenStack cloud architecture training will allow you to gain a practical and deep understanding of the challenges and advantages associated with cloud computing. It also teaches you how to deal with them at the design stage of your projects.

The training also uses the work of the OpenStack project to introduce students to the finest references and practices for cloud deployment. For example, it provides precious insight for anyone who is building a brand-new cloud environment, through cloud configurations, internal programs, and implementations. Qualified experts who hold this level of knowledge have always helped their corporations achieve excellence in migrations and deployments. Jaipur has many training institutes offering some of the best-known advanced cloud computing training. A candidate can do the following through this qualification.

First, one can identify how a cloud-based IT infrastructure varies from a conventional data center design. Second, one can configure and deploy one's own private or public cloud without any automation tools or frameworks. Third, one can be in a position to deploy a private cloud environment based on OpenStack. Fourth, one can gain knowledge of the different storage, Linux networking, provisioning techniques, and configurations that are available in the enterprise world, especially those that are part of the OpenStack project.


Ways to crack the interview for B. Tech

In order to become a qualified and well-known professional, you need to go for industrial training. It is very important for students in streams such as MBA, Polytechnic, B. Tech, and many more. Having these skills in today's world is essential, as most companies want strong candidates who are well trained, well equipped, and will take less time to grasp new things. If one wants to stay in the market or in a company, he or she has to improve his or her talents through this practice-oriented program of six months. Because of the prevailing socio-economic and political climate, most of these companies are not ready to spend much on fresh graduates.

They either hire trained engineering graduates who have this knowledge and specialized talents, or engineering and management students hired through campus placements. One may also be from an average college; this six-month training for engineering students is crucial, as it gives ample time and scope to those who undertake industrial training. You will get to learn the practical aspects, as you will be working on existing projects. Therefore, this summer training for B. Tech students is very important and beneficial, as they can look forward to being selected by reputed and attractive recruiters. Reputed and good training companies always follow the trends and important practices from other industries, and therefore they have tie-ups with industry and a strong placement cell. It helps you learn to solve problems and think analytically while working on real-world software.

The training program exposes you to an extensive range of technology. The main advantages and benefits of the six-month training are as follows: practice-based training is facilitated by project-oriented learning, and professionals who have knowledge and experience of working as software developers in hard-core development companies train the new students. One becomes aware of all the up-to-date aspects of the technology, which eventually helps you crack the interview and impress employers.