Summer Internships For College Students – Step By Step Guide

Choosing summer internships is a big decision for college students. It is important that they choose one where they will learn as much as possible but, after all, it is their summer vacation and they should have a bit of fun as well. There are a few steps that every student should follow in order to land the internship that will truly fit their educational and personal needs.
There are a few questions that every student should ask themselves before they even start looking for their dream internship. For example, do they want an internship that is full time or part time? Does the internship need to be paid or can they afford an unpaid internship? Do they want to travel or find a position close to home? The answers to all of these questions will have a significant impact on the kind of internship that a student will apply for.
Once a student has an idea of what kind of internship they want, they need to find the specific internship they will apply for. A great way to start looking for an internship is meeting with a career counselor at the college or university the student attends. They will have extensive information on a variety of opportunities that the student may never have known about if they simply did an online search for internships.
Students need to remember that just because they need an internship for school does not mean that a company is required to take them on, and they will have to go through an interview process. It is important to be as prepared as possible for the interview. The student should research some background on the company, prepare questions they may want to ask the interviewer and prepare answers to some basic interview questions.
Once a student lands their internship, their work is just beginning. They need to make the most out of their experience, taking everything in and learning as much as they can from the experience. They should also make an effort to go the extra mile in all of their tasks because they never know if they will return to the company for a job interview someday.
If students follow these simple steps, they should have no trouble finding summer internships that they will love. They need to be very clear about the kind of internship they are looking for and seek the advice of a career counselor. Once they get the internship, they need to truly shine and show the company that they would make an excellent employee in the future.
Anyone who wants to do 6 weeks of summer training in Jaipur in high-end technologies such as cloud computing, Big Data Hadoop, Linux, OpenStack and many more job-oriented technologies can visit LinuxWorld Informatics Pvt. Ltd, Jaipur.

What’s an IT job? Mislabeling could be behind lower growth numbers

Foote Partners’ latest IT jobs report shows lower growth than last year, but also says that the existing job categories don’t show the full picture

The slump in U.S. IT jobs might not be a slump at all, but a case of misreporting.

Foote Partners, LLC, floats the thesis in its latest analysis of the national IT job market, which details the fifth-worst month for IT employment since August 2014.

Only 6,500 jobs were added to the IT sector in February 2016. That’s up from the 5,500 of January, but way down from 2015’s monthly average of 12,300. It’s a net gain, but it’s terribly soft compared to the previous year. The slump doesn’t appear to be a seasonal variation either, as last year at this time, the sector added 13,700 jobs.

The job segments tracked by the BLS that represent IT jobs as we know them are filed under two categories: Management and Technical Consulting Services (3,400 jobs last month), and Computer Systems Design/Related Services (4,400 jobs).

The vast majority of the jobs created in IT (92 percent) fall into those two categories, with Telecommunications and Data Processing, Hosting and Related Services comprising the remaining 8 percent.

According to Foote, the problem with using these categories as an index for IT hiring is that they “only [represent] approximately 40 percent of the true IT labor market.”

Foote believes many of the hot job segments in IT — “cloud computing, mobile computing, big data analytics, cybersecurity, certain areas of software development and engineering, and a large portion of hybrid IT business positions” — aren’t adequately described by these two categories. As a result, such positions are “distributed throughout companies in administrative areas, functional departments, and products groups,” where Foote has observed “aggressive hiring.”

The obvious long-term solution is for the Department of Labor to revise its job descriptions, but Foote doesn’t believe that’ll happen any time soon. “[The DoL] can’t afford to render decades of historical employment trend data obsolete,” it wrote.

The general strength of the job market shows the imprecision of the current methods for tabulating IT hiring. Unemployment fell to a low of 4.9 percent in February, and the labor force participation rate is climbing once again after free-falling for several years. If more IT jobs are being added to the economy, it’s reflected more in the overall numbers than in familiar categories.

The controversies about DoL job classifications typically focus on listing a worker as an independent contractor or a salaried employee exempt from overtime or minimum wage. There’s been relatively little discussion about the categories not reflecting the duties IT hires are performing. That’s likely to accelerate as more cutting-edge IT positions overlap multiple categories, such as engineers for blockchain technology or cognitive computing/machine learning.

Red Hat has become the leading vendor of OpenStack, but the company — and others — freely acknowledge serious issues related to complexity, scalability, and availability

OpenStack found its way into four of Red Hat’s top 30 deals last quarter, a quarter that saw the company pushing toward $2 billion in annual revenues — good for Red Hat, good for OpenStack.

Despite that rosy news, however, Red Hat’s earnings call highlighted several areas where OpenStack continues to fall short.
OpenStack still has a long way to go

Take, for example, the OpenStack code base, which continues to be a cauldron of competing projects and a fair amount of poor-quality code.

This isn’t exactly news, given that early OpenStack leader Andrew Shafer once pilloried parts of the project as “a mishmash of naive ideas and pipe dreams.” More recently, OpenStack luminary Randy Bias has candidly derided the silos that different vendors impose on OpenStack, containing “special features that only you have.”

The result? “Every OpenStack deployment is its own unique snowflake,” Bias notes, due to the “hundreds upon hundreds of configuration options.”

Even Red Hat has gotten into the criticism game, with CEO Jim Whitehurst acknowledging OpenStack’s scalability issues on the Red Hat earnings call: “One of the issues … with OpenStack is its scalability, but it’s basically … assuming applications that are stateless,” leaving Red Hat needing to “continu[e] to build more high-availability features to allow it to run traditional applications.”

Despite such deficiencies, OpenStack’s complexity is like red meat for Red Hat, which thrives on taking complex infrastructure and making it easily consumable by mainstream enterprises. As former Red Hat CTO Brian Stevens told me in 2006, “Red Hat’s model works because of the complexity of the technology we work with. An operating platform has a lot of moving parts, and customers are willing to pay to be insulated from that complexity.”

While Bias may not like the vendor-imposed complexity, you won’t hear Red Hat complaining about it. In fact, Red Hat is counting on complexity all the way to the bank.
Enterprises still aren’t thinking about OpenStack correctly

However, OpenStack is still early days for Red Hat customers, Whitehurst revealed during the earnings call. The only ones embracing OpenStack are “earlier adopters,” he suggested, with OpenStack deals “lumpy,” meaning sporadic and large when they do close (due to professional services required to make it work).

Among those early adopters, Red Hat may have a problem. They may not be ready for what OpenStack was designed for: cloud-native applications. As Bias says:

There is no doubt that OpenStack was designed as an AWS clone — that is its lineage. OpenStack is for cloud-native applications… It’s not for running applications that require a 5-9’s infrastructure. [Those] don’t belong on OpenStack.

But Whitehurst, speaking on the Red Hat earnings call, gave a somewhat different view as to Red Hat’s customers and their intentions:

One of the issues or features with OpenStack is its scalability, but it’s basically … assuming applications that are stateless, and so continuing to build more high-availability features to allow it to run traditional applications is something we’ve been talking a lot to customers about.

[T]here’s a general belief that OpenStack is going to be a kind of low-cost platform of choice with customers going forward, but there’s a sense that, hey, Red Hat, you need to help us take some of our existing applications and migrate them onto OpenStack. So we’re actively working with some customers on that.

Such legacy applications aren’t likely to be a good fit for a cloud-native platform.

Of course, the kinds of early adopters interested in OpenStack also have big wallets. While Whitehurst didn’t go into detail, he did highlight the kinds of companies that currently are willing to put up with OpenStack’s rough edges and pay up for professional services to fill in the many blanks it leaves: one of the “very large global telcos” and “a very, very large financial service institution.”

In other words, unless you’re big with serious technology chops, you’re probably not safe going into the OpenStack water. Otherwise, be prepared to pay equally big professional services fees.

But wait! There’s more.
Docker is a much bigger deal than OpenStack

As big as the community behind OpenStack has been, Whitehurst declared Docker the “single biggest topic that comes up among … [Red Hat’s] leading [customers].” In fact, Whitehurst noted that he hears more from customers about Docker than OpenStack.

I’ve argued before that Red Hat should forget OpenStack and double-down on Docker. Listening to the earnings call, this argument gains even more force. Whitehurst talked about why Docker containers are such a big deal. It’s “not because the infrastructure people necessarily want [Docker],” he said, “but [because] developers are picking it up because it’s so much more productive for developers.”

OpenStack is a vendor response to Amazon Web Services — and a half-baked one. Containers, by contrast, immediately make developers’ lives easier (just as AWS does), so they’re being adopted in droves. Developers aren’t asking for OpenStack, and it’s the developer that Red Hat must satisfy.

Let’s review: According to the leading vendor of OpenStack, the technology isn’t mature and as a result is expensive to implement successfully. Enterprises continue looking to OpenStack to solve problems it’s ill-equipped to solve, and Docker attracts vastly more interest because it meets real developer needs.

Why continue to pour resources into OpenStack?

What is Infrastructure as a Service?

The definition of infrastructure as a service (IaaS) is pretty simple. You rent cloud infrastructure (servers, storage and networking) on demand, in a pay-as-you-go model.

Since you don’t need to invest in your own hardware, IaaS is perfect for start-ups or businesses testing out a new idea.

Also, since the infrastructure scales on demand, it’s great for workloads that fluctuate rapidly.
Public IaaS

Your business rents infrastructure from the cloud provider and accesses that infrastructure over the Internet in order to create or use applications.


IaaS is the fastest growing area of cloud computing.

Enterprise public cloud spending is expected to reach $207 billion by 2016.

Common public IaaS workloads: dev/test, website hosting, storage, simple application development
Managed IaaS

But some workloads require an advanced solution.
Managed IaaS is suited for large enterprises running production workloads.

Introduction to MapReduce

MapReduce is a programming model suitable for processing huge amounts of data. Hadoop is capable of running MapReduce programs written in various languages: Java, Ruby, Python, and C++. MapReduce programs are parallel in nature, and are therefore very useful for performing large-scale data analysis using multiple machines in a cluster.

MapReduce programs work in two phases:

  1. Map phase
  2. Reduce phase.

The input to each phase is a set of key-value pairs. In addition, every programmer needs to specify two functions: a map function and a reduce function.

The whole process goes through four phases of execution: splitting, mapping, shuffling, and reducing.

How MapReduce works

Let’s understand this with an example.

Consider the following input data for your MapReduce program:

Welcome to Hadoop Class

Hadoop is good

Hadoop is bad

The final output of the MapReduce task is

bad 1
Class 1
good 1
Hadoop 3
is 2
to 1
Welcome 1

The data goes through the following phases:

Input Splits:

Input to a MapReduce job is divided into fixed-size pieces called input splits. An input split is a chunk of the input that is consumed by a single map.

Mapping:

This is the very first phase in the execution of a MapReduce program. In this phase, the data in each split is passed to a mapping function to produce output values. In our example, the job of the mapping phase is to count the number of occurrences of each word in the input splits (more details about input splits are given below) and prepare a list in the form of <word, frequency>.

Shuffling:

This phase consumes the output of the Mapping phase. Its task is to consolidate the relevant records from the Mapping phase output. In our example, identical words are grouped together along with their respective frequencies.

Reducing:

In this phase, output values from the Shuffling phase are aggregated. This phase combines values from the Shuffling phase and returns a single output value. In short, this phase summarizes the complete dataset.

In our example, this phase aggregates the values from the Shuffling phase, i.e., it calculates the total occurrences of each word.
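To make the phases concrete, here is a minimal sketch in plain Python that mimics mapping, shuffling and reducing on the example input. It does not use Hadoop at all; the function names are made up for illustration, and each input line is assumed to be one input split.

```python
from collections import defaultdict

def map_phase(split):
    """Mapping: emit a (word, 1) pair for every word in one input split."""
    return [(word, 1) for word in split.split()]

def shuffle_phase(mapped_pairs):
    """Shuffling: group together all values that share the same word."""
    grouped = defaultdict(list)
    for word, count in mapped_pairs:
        grouped[word].append(count)
    return grouped

def reduce_phase(grouped):
    """Reducing: sum the grouped values to get one total per word."""
    return {word: sum(counts) for word, counts in grouped.items()}

# Assumption: each line of the example input is one input split.
splits = ["Welcome to Hadoop Class", "Hadoop is good", "Hadoop is bad"]

mapped = [pair for split in splits for pair in map_phase(split)]
word_counts = reduce_phase(shuffle_phase(mapped))

# Sort case-insensitively so the order matches the example output.
for word in sorted(word_counts, key=str.lower):
    print(word, word_counts[word])
```

On a real cluster the splits would be processed by map tasks on different machines; here everything runs in one process, which is enough to see how the data changes shape between phases.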

The overall process in detail

  • One map task is created for each split, which then executes the map function for each record in the split.
  • It is always beneficial to have multiple splits, because the time taken to process a split is small compared to the time taken to process the whole input. When the splits are smaller, the processing is better load-balanced, since the splits are processed in parallel.
  • However, it is also not desirable to have splits that are too small. When splits are too small, the overhead of managing the splits and creating map tasks begins to dominate the total job execution time.
  • For most jobs, it is better to make the split size equal to the size of an HDFS block (64 MB by default in Hadoop 1.x, 128 MB in Hadoop 2.x).
  • Map tasks write their output to a local disk on the respective node, not to HDFS.
  • The reason for choosing local disk over HDFS is to avoid the replication that takes place in an HDFS store operation.
  • Map output is intermediate output, which is processed by reduce tasks to produce the final output.
  • Once the job is complete, the map output can be thrown away, so storing it in HDFS with replication would be overkill.
  • If a node fails before the map output is consumed by the reduce task, Hadoop reruns the map task on another node and re-creates the map output.
  • Reduce tasks do not work on the concept of data locality. The output of every map task is fed to the reduce task: map output is transferred to the machine where the reduce task is running.
  • On this machine, the output is merged and then passed to the user-defined reduce function.
  • Unlike the map output, reduce output is stored in HDFS (the first replica is stored on the local node and other replicas are stored on off-rack nodes). So, writing the reduce output consumes network bandwidth.
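The trade-off around split size can be illustrated with a little arithmetic. The sketch below is only a back-of-the-envelope calculation, assuming a hypothetical 1 GB input file and the 64 MB block size quoted above; the function name is invented for this example.

```python
import math

def number_of_splits(file_size_mb, split_size_mb):
    """Number of input splits, and therefore map tasks, for one file."""
    return math.ceil(file_size_mb / split_size_mb)

BLOCK_SIZE_MB = 64    # block size quoted in the text
FILE_SIZE_MB = 1024   # hypothetical 1 GB input file

# Very small splits mean many map tasks to manage; very large splits
# leave little room for parallelism.
for split_size in (1, BLOCK_SIZE_MB, 512):
    print(split_size, "MB splits ->", number_of_splits(FILE_SIZE_MB, split_size), "map tasks")
```

With 1 MB splits the job needs over a thousand map tasks, while 512 MB splits leave only two tasks to run in parallel. A block-sized split sits between the two extremes.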

How MapReduce Organizes Work?

Hadoop divides the job into tasks. There are two types of tasks:

  1. Map tasks (Splits & Mapping)
  2. Reduce tasks (Shuffling, Reducing)

as mentioned above.

The complete execution process (execution of both Map and Reduce tasks) is controlled by two types of entities:

  1. Jobtracker: acts like a master (responsible for complete execution of the submitted job)
  2. Multiple Task Trackers: act like slaves, each of them performing the job

For every job submitted for execution in the system, there is one jobtracker, which resides on the Namenode, and there are multiple tasktrackers, which reside on the Datanodes.

  • A job is divided into multiple tasks, which are then run on multiple data nodes in a cluster.
  • It is the responsibility of the jobtracker to coordinate the activity by scheduling tasks to run on different data nodes.
  • Execution of an individual task is then looked after by the tasktracker, which resides on every data node executing part of the job.
  • The tasktracker’s responsibility is to send progress reports to the jobtracker.
  • In addition, the tasktracker periodically sends a ‘heartbeat’ signal to the jobtracker to notify it of the current state of the system.
  • Thus the jobtracker keeps track of the overall progress of each job. In the event of task failure, the jobtracker can reschedule it on a different tasktracker.
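The scheduling and rescheduling behaviour described above can be sketched as a toy simulation. The two classes below are hypothetical stand-ins written for this illustration, not Hadoop's actual classes or API, and the heartbeat is reduced to a simple healthy/unhealthy flag.

```python
class TaskTracker:
    """Hypothetical stand-in for a tasktracker running on one data node."""

    def __init__(self, name, healthy=True):
        self.name = name
        self.healthy = healthy

    def run(self, task):
        if not self.healthy:
            raise RuntimeError(f"{self.name} failed while running {task}")
        return f"{task} done on {self.name}"


class JobTracker:
    """Hypothetical stand-in for the jobtracker: schedules and reschedules tasks."""

    def __init__(self, trackers):
        self.trackers = trackers
        self.progress = {}  # task name -> latest progress report

    def run_job(self, tasks):
        for i, task in enumerate(tasks):
            # Try the trackers in turn, starting at a different one per task.
            for offset in range(len(self.trackers)):
                tracker = self.trackers[(i + offset) % len(self.trackers)]
                try:
                    self.progress[task] = tracker.run(task)
                    break
                except RuntimeError:
                    continue  # task failed: reschedule on the next tracker
            else:
                self.progress[task] = "failed on every tracker"
        return self.progress


trackers = [TaskTracker("node1"), TaskTracker("node2", healthy=False), TaskTracker("node3")]
progress = JobTracker(trackers).run_job(["map-0", "map-1", "reduce-0"])
for task, report in progress.items():
    print(report)
```

Here the task first scheduled on the failing node2 is simply retried on node3, which is the essence of the last bullet above.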


Create Your First Hadoop Program

Find out Number of Products Sold in Each Country.

Input: Our input data set is a CSV file, SalesJan2009.csv


  • This tutorial is developed on Linux – Ubuntu operating System.
  • You should have Hadoop (version 2.2.0 used for this tutorial) already installed.
  • You should have Java (version 1.8.0 used for this tutorial) already installed on the system.

Before we start with the actual process, change user to ‘hduser’ (user used for Hadoop ).

su - hduser_


1. Create a new directory with name MapReduceTutorial

sudo mkdir MapReduceTutorial

Give permissions

sudo chmod -R 777 MapReduceTutorial

Copy the tutorial’s Java source files into this directory.


If you want to understand the code in these files, refer to this guide.

Check the file permissions of all these files, and if ‘read’ permissions are missing, grant them.

2. Export classpath

export CLASSPATH="$HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-client-core-2.2.0.jar:$HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-client-common-2.2.0.jar:$HADOOP_HOME/share/hadoop/common/hadoop-common-2.2.0.jar:$HOME/MapReduceTutorial/SalesCountry/*:$HADOOP_HOME/lib/*"

3. Compile the Java files (these files are present in the directory Final-MapReduceHandsOn). The class files will be put in the package directory.

javac -d .

This warning can be safely ignored.

This compilation will create a directory in the current directory, named after the package specified in the Java source files (i.e. SalesCountry in our case), and put all compiled class files in it.

4. Create a new file Manifest.txt

sudo gedit Manifest.txt

Add the following line to it:

Main-Class: SalesCountry.SalesCountryDriver

SalesCountry.SalesCountryDriver is the name of the main class. Please note that you have to hit the Enter key at the end of this line.

5. Create a Jar file

jar cfm ProductSalePerCountry.jar Manifest.txt SalesCountry/*.class

Check that the jar file is created

6. Start Hadoop



7. Copy the file SalesJan2009.csv into ~/inputMapReduce

Now use the command below to copy ~/inputMapReduce to HDFS.

$HADOOP_HOME/bin/hdfs dfs -copyFromLocal ~/inputMapReduce /

We can safely ignore this warning.

Verify whether the file was actually copied.

$HADOOP_HOME/bin/hdfs dfs -ls /inputMapReduce

8. Run MapReduce job

$HADOOP_HOME/bin/hadoop jar ProductSalePerCountry.jar /inputMapReduce /mapreduce_output_sales

This will create an output directory named mapreduce_output_sales on HDFS. This directory will contain a file with the product sales per country.
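For readers who want to see what the job computes without a cluster, here is a rough single-machine sketch in Python. The sample rows and column names below are invented, because the actual layout of SalesJan2009.csv is not shown in this tutorial; the function name is also made up.

```python
import csv
import io
from collections import Counter

# Made-up sample rows. The real SalesJan2009.csv layout is not shown in
# this tutorial, so the column names here are assumptions.
SAMPLE_CSV = """Product,Price,Country
Product1,1200,United Kingdom
Product1,1200,United States
Product2,3600,United States
Product1,1200,Australia
"""

def sales_per_country(csv_text, country_column="Country"):
    """Count one sale per row, keyed by the country of that row."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return Counter(row[country_column] for row in reader)

counts = sales_per_country(SAMPLE_CSV)
for country in sorted(counts):
    print(country, counts[country])
```

In MapReduce terms, the mapper would emit a (country, 1) pair for each row and the reducer would sum the ones for each country. The Counter does both steps at once.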

9. The result can be seen through the command interface:

$HADOOP_HOME/bin/hdfs dfs -cat /mapreduce_output_sales/part-00000



Results can also be seen via the web interface.

Open the NameNode web interface in a web browser.

Now select ‘Browse the filesystem’ and navigate to /mapreduce_output_sales


Open part-r-00000

Introduction To Flume and Sqoop

Before we learn more about Flume and Sqoop, let’s study the issues with loading data into Hadoop.

Issues with Data Load into Hadoop

Analytical processing using Hadoop requires loading of huge amounts of data from diverse sources into Hadoop clusters.

This process of bulk data load into Hadoop from heterogeneous sources, and then processing it, comes with a certain set of challenges.

Maintaining data consistency and ensuring efficient utilization of resources are some of the factors to consider before selecting the right approach for data load.

Major Issues:

1. Data load using Scripts

The traditional approach of using scripts to load data is not suitable for bulk data load into Hadoop; it is inefficient and very time-consuming.

2. Direct access to external data via Map-Reduce application

Giving MapReduce applications direct access to data residing in external systems (without loading it into Hadoop) complicates these applications. So, this approach is not feasible.

In addition to being able to work with enormous amounts of data, Hadoop can work with data in several different forms. So, to load such heterogeneous data into Hadoop, different tools have been developed. Sqoop and Flume are two such data loading tools.

Introduction to SQOOP

Apache Sqoop (SQL-to-Hadoop) is designed to support bulk import of data into HDFS from structured data stores such as relational databases, enterprise data warehouses, and NoSQL systems. Sqoop is based upon a connector architecture which supports plugins to provide connectivity to new external systems.

An example use case of Sqoop is an enterprise that runs a nightly Sqoop import to load the day’s data from a production transactional RDBMS into a Hive data warehouse for further analysis.

Sqoop Connectors

All existing database management systems are designed with the SQL standard in mind. However, each DBMS differs to some extent in its dialect. This difference poses challenges when it comes to data transfers across systems. Sqoop connectors are components which help overcome these challenges.

Data transfer between Sqoop and an external storage system is made possible with the help of Sqoop’s connectors.

Sqoop has connectors for working with a range of popular relational databases, including MySQL, PostgreSQL, Oracle, SQL Server, and DB2. Each of these connectors knows how to interact with its associated DBMS. There is also a generic JDBC connector for connecting to any database that supports Java’s JDBC protocol. In addition, Sqoop provides optimized MySQL and PostgreSQL connectors that use database-specific APIs to perform bulk transfers efficiently.

In addition to this, Sqoop has various third-party connectors for data stores, ranging from enterprise data warehouses (including Netezza, Teradata, and Oracle) to NoSQL stores (such as Couchbase). However, these connectors do not come with the Sqoop bundle; they need to be downloaded separately and can be added easily to an existing Sqoop installation.

Introduction to FLUME

Apache Flume is a system used for moving massive quantities of streaming data into HDFS. Collecting log data from web server log files and aggregating it in HDFS for analysis is one common example use case of Flume.

Flume supports multiple sources, such as:

  • ‘tail’ (which pipes data from a local file and writes it into HDFS via Flume, similar to the Unix command ‘tail’)
  • System logs
  • Apache log4j (enables Java applications to write events to files in HDFS via Flume).

Data Flow in Flume

A Flume agent is a JVM process which has 3 components (Flume Source, Flume Channel and Flume Sink) through which events propagate after being initiated at an external source.

  1. In the above diagram, the events generated by the external source (WebServer) are consumed by the Flume Data Source. The external source sends events to the Flume source in a format that is recognized by the target source.
  2. The Flume Source receives an event and stores it into one or more channels. The channel acts as a store which keeps the event until it is consumed by the Flume sink. This channel may use the local file system to store these events.
  3. The Flume sink removes the event from the channel and stores it in an external repository such as HDFS. There could be multiple Flume agents, in which case the Flume sink forwards the event to the Flume source of the next Flume agent in the flow.
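The source, channel and sink hand-off described in these steps can be mimicked in a few lines of Python. This is a toy model only: the class names are invented for illustration, the channel lives in memory, and a plain list stands in for the external repository (HDFS in the text). It is not Flume's API or configuration.

```python
from collections import deque

class Channel:
    """Keeps events until a sink consumes them (held in memory here)."""

    def __init__(self):
        self._events = deque()

    def put(self, event):
        self._events.append(event)

    def has_events(self):
        return bool(self._events)

    def take(self):
        return self._events.popleft()


class Source:
    """Receives events from an external source and stores them in channels."""

    def __init__(self, channels):
        self.channels = channels

    def receive(self, event):
        for channel in self.channels:
            channel.put(event)


class Sink:
    """Removes events from a channel and stores them in a repository."""

    def __init__(self, channel, repository):
        self.channel = channel
        self.repository = repository

    def drain(self):
        while self.channel.has_events():
            self.repository.append(self.channel.take())


channel = Channel()
repository = []  # stand-in for the external repository, e.g. HDFS
source = Source([channel])
sink = Sink(channel, repository)

# Hypothetical web server log lines acting as events.
for line in ["GET /index.html 200", "GET /missing 404"]:
    source.receive(line)

sink.drain()
print(repository)
```

Because the channel sits between the two, the source can keep accepting events even when the sink is slow, and events stay in the channel until the sink has taken them.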

Some Important features of FLUME

  • Flume has a flexible design based upon streaming data flows. It is fault tolerant and robust, with multiple failover and recovery mechanisms. Flume offers different levels of reliability, including ‘best-effort delivery’ and ‘end-to-end delivery’. Best-effort delivery does not tolerate any Flume node failure, whereas ‘end-to-end delivery’ mode guarantees delivery even in the event of multiple node failures.
  • Flume carries data between sources and sinks. This gathering of data can either be scheduled or event-driven. Flume has its own query processing engine, which makes it easy to transform each new batch of data before it is moved to the intended sink.
  • Possible Flume sinks include HDFS and HBase. Flume can also be used to transport event data including, but not limited to, network traffic data, data generated by social-media websites, and email messages.

Since July 2012, Flume has been released as Flume NG (New Generation), as it differs significantly from its original release, also known as Flume OG (Original Generation).

Sqoop | Flume | HDFS
Sqoop is used for importing data from structured data sources such as RDBMS. | Flume is used for moving bulk streaming data into HDFS. | HDFS is a distributed file system used by the Hadoop ecosystem to store data.
Sqoop has a connector-based architecture. Connectors know how to connect to the respective data source and fetch the data. | Flume has an agent-based architecture. Here, code is written (called an ‘agent’) which takes care of fetching data. | HDFS has a distributed architecture where data is distributed across multiple data nodes.
HDFS is a destination for data import using Sqoop. | Data flows to HDFS through zero or more channels. | HDFS is the ultimate destination for data storage.
Sqoop data load is not event-driven. | Flume data load can be driven by events. | HDFS just stores data provided to it by whatever means.
In order to import data from structured data sources, one has to use Sqoop, because its connectors know how to interact with structured data sources and fetch data from them. | In order to load streaming data, such as tweets generated on Twitter or log files of a web server, Flume should be used. Flume agents are built for fetching streaming data. | HDFS has its own built-in shell commands to store data into it. HDFS cannot import streaming data.

Introduction To Pig And Hive

In this tutorial we will discuss Pig and Hive.


Introduction to PIG

In the MapReduce framework, programs need to be translated into a series of Map and Reduce stages. However, this is not a programming model which data analysts are familiar with. So, in order to bridge this gap, an abstraction called Pig was built on top of Hadoop.

Pig is a high-level programming language useful for analyzing large data sets. Pig was the result of a development effort at Yahoo!

Pig enables people to focus more on analyzing bulk data sets and to spend less time writing MapReduce programs.

Just as pigs eat anything, the Pig programming language is designed to work on any kind of data. Hence the name, Pig!

Pig consists of two components:

  1. Pig Latin, which is a language
  2. Runtime environment, for running Pig Latin programs.

A Pig Latin program consists of a series of operations or transformations which are applied to the input data to produce output. These operations describe a data flow which is translated into an executable representation by the Pig execution environment. Underneath, these transformations are turned into a series of MapReduce jobs, which the programmer is unaware of. So, in a way, Pig allows the programmer to focus on the data rather than the nature of execution.

Pig Latin is a relatively rigid language which uses familiar keywords from data processing, e.g., Join, Group and Filter.
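To give a feel for the data-flow style, the sketch below expresses a filter, group and aggregate pipeline in plain Python. This is not Pig Latin and does not run on Hadoop; the records and field layout are invented, and the comments only indicate which kind of Pig operation each step loosely corresponds to.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical (customer, category, amount) records.
records = [
    ("alice", "books", 12.0),
    ("bob", "games", 55.0),
    ("alice", "games", 40.0),
    ("carol", "books", 8.0),
]

# Filter step: keep only orders above 10.0.
filtered = [r for r in records if r[2] > 10.0]

# Group step: collect rows by category (groupby needs sorted input).
filtered.sort(key=itemgetter(1))
grouped = {key: list(rows) for key, rows in groupby(filtered, key=itemgetter(1))}

# Aggregate step: total amount per category.
totals = {category: sum(r[2] for r in rows) for category, rows in grouped.items()}

for category in sorted(totals):
    print(category, totals[category])
```

Each step takes the output of the previous one, which is the same "series of transformations" idea that Pig turns into MapReduce jobs behind the scenes.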

Execution modes:

Pig has two execution modes:

  1. Local mode: In this mode, Pig runs in a single JVM and makes use of the local file system. This mode is suitable only for analysis of small data sets using Pig.
  2. MapReduce mode: In this mode, queries written in Pig Latin are translated into MapReduce jobs and are run on a Hadoop cluster (the cluster may be pseudo-distributed or fully distributed). MapReduce mode with a fully distributed cluster is useful for running Pig on large data sets.


Introduction to HIVE

The size of the data sets being collected and analyzed in the industry for business intelligence is growing, and this is making traditional data warehousing solutions more expensive. Hadoop, with its MapReduce framework, is being used as an alternative solution for analyzing huge data sets. Though Hadoop has proved useful for working on huge data sets, its MapReduce framework is very low level and requires programmers to write custom programs which are hard to maintain and reuse. Hive comes to the rescue of programmers here.

Hive evolved as a data warehousing solution built on top of Hadoop Map-Reduce framework.

Hive provides a SQL-like declarative language, called HiveQL, which is used for expressing queries. Using HiveQL, users familiar with SQL are able to perform data analysis very easily.

The Hive engine compiles these queries into MapReduce jobs to be executed on Hadoop. In addition, custom MapReduce scripts can also be plugged into queries. Hive operates on data stored in tables, which consist of primitive data types and collection data types like arrays and maps.

Hive comes with a command-line shell interface which can be used to create tables and execute queries.

The Hive query language is similar to SQL, and it supports subqueries. With the Hive query language, it is possible to perform MapReduce joins across Hive tables. It supports simple SQL-like functions (CONCAT, SUBSTR, ROUND etc.) and aggregation functions (SUM, COUNT, MAX etc.). It also supports GROUP BY and SORT BY clauses. It is also possible to write user-defined functions in the Hive query language.

Comparing MapReduce, Pig and Hive

Sqoop Flume HDFS
Sqoop is used for importing data from structured data sources such as RDBMS. Flume is used for moving bulk streaming data into HDFS. HDFS is a distributed file system used by Hadoop ecosystem to store data.
Sqoop has a connector based architecture. Connectors know how to connect to the respective data source and fetch the data. Flume has an agent based architecture. Here, code is written (which is called as ‘agent’) which takes care of fetching data. HDFS has a distributed architecture where data is distributed across multiple data nodes.
HDFS is a destination for data import using Sqoop. Data flows to HDFS through zero or more channels. HDFS is an ultimate destination for data storage.
Sqoop data load is not event driven. Flume data load can be driven by event. HDFS just stores data provided to it by whatsoever means.
To import data from structured data sources, Sqoop is the tool to use, because its connectors know how to interact with structured data sources and fetch data from them.

In order to load streaming data such as tweets generated on Twitter or log files of a web server, Flume should be used. Flume agents are built for fetching streaming data.

HDFS has its own built-in shell commands for storing data in it. By itself, HDFS cannot be used to import structured or streaming data.
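The difference between a batch import and an event-driven load can be illustrated with a small toy model. In the sketch below, batch_import copies whatever rows exist when the job runs (the Sqoop-style case), while on_event is called once per incoming event and passes it through a channel to a sink (the Flume-style case). Every class and function name here is hypothetical and written for this example; none of it is the Sqoop or Flume API, and a Python list stands in for files in HDFS.

```python
from collections import deque


class Channel:
    """In-memory buffer between a source and a sink (a toy stand-in)."""

    def __init__(self):
        self._events = deque()

    def put(self, event):
        self._events.append(event)

    def take_all(self):
        events = list(self._events)
        self._events.clear()
        return events


def batch_import(table_rows):
    """Batch-style load: copy everything that exists when the job runs."""
    return list(table_rows)


def on_event(channel, event):
    """Event-driven source: invoked per event, hands the event to the channel."""
    channel.put(event)


def sink_flush(channel, storage):
    """Sink: drain the channel into the destination storage."""
    storage.extend(channel.take_all())


storage = []          # stands in for files in HDFS
channel = Channel()

# Events arrive over time; each one is pushed as soon as it happens.
for line in ["GET /index.html", "GET /about.html"]:
    on_event(channel, line)
sink_flush(channel, storage)
on_event(channel, "POST /login")
sink_flush(channel, storage)

# A batch import sees only the rows present at the time it is run.
snapshot = batch_import([(1, "alice"), (2, "bob")])
print(storage)
print(snapshot)
```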

Hadoop – HDFS Overview

The Hadoop Distributed File System (HDFS) was developed using a distributed file system design and runs on commodity hardware. Unlike many other distributed systems, HDFS is highly fault-tolerant and designed to use low-cost hardware.

HDFS holds very large amounts of data and provides easy access to it. To store such huge data, files are spread across multiple machines. They are stored redundantly to protect the system from data loss in case of failure. HDFS also makes the data available to applications for parallel processing.

Features of HDFS

  • It is suitable for distributed storage and processing.
  • Hadoop provides a command interface to interact with HDFS.
  • The built-in servers of the namenode and datanode help users easily check the status of the cluster.
  • It provides streaming access to file system data.
  • HDFS provides file permissions and authentication.

HDFS Architecture

Given below is the architecture of a Hadoop File System.

[Figure: HDFS Architecture]

HDFS follows a master-slave architecture and has the following elements.


Namenode

The namenode is software that runs on commodity hardware with the GNU/Linux operating system. The system running the namenode acts as the master server and performs the following tasks:

  • Manages the file system namespace.
  • Regulates client’s access to files.
  • Executes file system operations such as renaming, closing, and opening files and directories.


Datanode

A datanode is software that runs on commodity hardware with the GNU/Linux operating system. Every node (commodity hardware/system) in a cluster runs a datanode. These nodes manage the data storage of their system.

  • Datanodes perform read-write operations on the file systems, as per client request.
  • They also perform operations such as block creation, deletion, and replication according to the instructions of the namenode.
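The division of work between the namenode and the datanodes can be sketched with a toy model: the namenode keeps only metadata (which blocks, that is, file segments, make up a file and which datanodes hold a copy), while the datanodes would hold the actual bytes. The class below is a hypothetical illustration. Its round-robin placement and the replication factor of 3 are simplifying assumptions, and real HDFS uses a more elaborate placement policy.

```python
class ToyNameNode:
    """Keeps metadata only: the blocks of each file and where replicas live."""

    def __init__(self, datanodes, replication=3):
        self.datanodes = list(datanodes)
        if replication > len(self.datanodes):
            raise ValueError("need at least as many datanodes as replicas")
        self.replication = replication
        self.block_map = {}   # filename -> list of (block_id, [datanode, ...])
        self._offset = 0      # rotating start position for round-robin placement

    def add_file(self, filename, num_blocks):
        """Record a new file and assign each block to distinct datanodes."""
        placements = []
        count = len(self.datanodes)
        for index in range(num_blocks):
            nodes = [
                self.datanodes[(self._offset + k) % count]
                for k in range(self.replication)
            ]
            self._offset = (self._offset + 1) % count
            placements.append((f"{filename}#blk{index}", nodes))
        self.block_map[filename] = placements
        return placements

    def locate(self, filename):
        """Tell a client which datanodes to contact for each block of a file."""
        return self.block_map[filename]


namenode = ToyNameNode(["dn1", "dn2", "dn3", "dn4"], replication=3)
placements = namenode.add_file("logs.txt", num_blocks=2)
for block_id, nodes in namenode.locate("logs.txt"):
    print(block_id, nodes)
```

A client would first ask the namenode where the blocks are and then read or write the data directly on the listed datanodes.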


Block

User data is generally stored in HDFS files. A file is divided into one or more segments, which are stored on individual datanodes. These file segments are called blocks. In other words, a block is the minimum amount of data that HDFS reads or writes. The default block size is 64 MB in Hadoop 1.x (128 MB in Hadoop 2.x and later), and it can be changed in the HDFS configuration.
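As a quick worked example of the block arithmetic, the sketch below computes how a file would be cut into blocks. It assumes the 64 MB default mentioned above; block_size is a parameter, so a different configured size can be passed in. The function name is hypothetical and the code only does the arithmetic; it does not talk to HDFS.

```python
MB = 1024 * 1024


def split_into_blocks(file_size, block_size=64 * MB):
    """Return the sizes (in bytes) of the blocks a file would be divided into."""
    if file_size < 0 or block_size <= 0:
        raise ValueError("file_size must be >= 0 and block_size must be > 0")
    full_blocks, remainder = divmod(file_size, block_size)
    sizes = [block_size] * full_blocks
    if remainder:
        # The last block holds only the leftover bytes.
        sizes.append(remainder)
    return sizes


blocks = split_into_blocks(200 * MB)
print(len(blocks), [size // MB for size in blocks])  # 4 [64, 64, 64, 8]
```

A 200 MB file therefore occupies three full 64 MB blocks plus one 8 MB block.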

Goals of HDFS

  • Fault detection and recovery: Because HDFS runs on a large number of commodity hardware components, component failure is frequent. HDFS therefore needs mechanisms for quick, automatic fault detection and recovery.
  • Huge datasets: HDFS should scale to hundreds of nodes per cluster to serve applications with huge datasets.
  • Hardware at data: A requested task can be done efficiently when the computation takes place near the data. Especially where huge datasets are involved, this reduces network traffic and increases throughput.
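The first goal, fault detection and recovery, can be illustrated with a small sketch. Given a record of which datanodes hold each block and the set of nodes that are still alive, it finds blocks that have lost replicas and picks live nodes to hold new copies. The function names, the block and node identifiers, and the replication factor of 3 are assumptions for this example; the selection rule (first live node in sorted order that lacks the block) is deliberately naive and is not HDFS's actual policy.

```python
def under_replicated(block_locations, live_nodes, replication=3):
    """Return {block_id: number_of_missing_replicas} for affected blocks."""
    missing = {}
    for block_id, nodes in block_locations.items():
        alive = [node for node in nodes if node in live_nodes]
        if len(alive) < replication:
            missing[block_id] = replication - len(alive)
    return missing


def re_replicate(block_locations, live_nodes, replication=3):
    """Return a new mapping in which lost replicas are recreated on live nodes."""
    ordered_live = sorted(live_nodes)
    repaired = {}
    for block_id, nodes in block_locations.items():
        survivors = [node for node in nodes if node in live_nodes]
        candidates = [node for node in ordered_live if node not in survivors]
        needed = max(0, replication - len(survivors))
        repaired[block_id] = survivors + candidates[:needed]
    return repaired


# Hypothetical cluster state: dn2 has stopped responding.
block_locations = {
    "blk0": ["dn1", "dn2", "dn3"],
    "blk1": ["dn2", "dn3", "dn4"],
}
live_nodes = {"dn1", "dn3", "dn4"}

missing = under_replicated(block_locations, live_nodes)
repaired = re_replicate(block_locations, live_nodes)
print(missing)
print(repaired)
```

After the repair step every block is back to three replicas, all of them on nodes that are still alive.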

Let your summer speak of knowledge

We at Linux informatics Pvt. Ltd provide summer training for all students. A summer training program holds a prominent place in the life of any student. The education offered during this period should be as informative as possible, because it shapes your future in the industry. We are among the best institutes in Jaipur for Summer Training for B. Tech. Our team consists of highly experienced and hard-working technocrats from all industries. They make sure that you are provided with the best of what we have.

In this training, we allow students to learn new things and ideas. We also offer them an opportunity to refresh the knowledge they already have. The programs offered in this summer training cover subjects such as .NET, Java, and web development. There are also many Cisco-certified programs and Python. Other programs include shell scripting and Perl scripting. The fundamental feature of this training program is the opportunity to work on live projects, which gives students better exposure to the IT industry.

Associated partners such as Red Hat also play a significant role in providing the training.