Introduction To Pig And Hive

In this tutorial we will discuss Pig & Hive


In Map Reduce framework, programs need to be translated into a series of Map and Reduce stages. However, this is not a programming model which data analysts are familiar with. So, in order to bridge this gap, an abstraction called Pig was built on top of Hadoop.

Pig is a high level programming language useful for analyzing large data sets. Pig was a result of development effort at Yahoo!

Pig enables people to focus more on analyzing bulk data sets and to spend less time in writing Map-Reduce programs.

Similar to Pigs, who eat anything, the Pig programming language is designed to work upon any kind of data. That’s why the name, Pig!

Pig consists of two components:

  1. Pig Latin, which is a language
  2. Runtime environment, for running PigLatin programs.

A Pig Latin program consist of a series of operations or transformations which are applied to the input data to produce output. These operations describe a data flow which is translated into an executable representation, by Pig execution environment. Underneath, results of these transformations are series of MapReduce jobs which a programmer is unaware of. So, in a way, Pig allows programmer to focus on data rather than the nature of execution.

PigLatin is a relatively stiffened language which uses familiar keywords from data processing e.g., Join, Group and Filter.

Execution modes:

Pig has two execution modes:

  1. Local mode : In this mode, Pig runs in a single JVM and makes use of local file system. This mode is suitable only for analysis of small data sets using Pig
  2. Map Reduce mode: In this mode, queries written in Pig Latin are translated into MapReduce jobs and are run on a Hadoop cluster (cluster may be pseudo or fully distributed). MapReduce mode with fully distributed cluster is useful of running Pig on large data sets.


The size of data sets being collected and analyzed in the industry for business intelligence is growing and in a way, it is making traditional data warehousing solutions more expensive. Hadoop with MapReduce framework, is being used as an alternative solution for analyzing data sets with huge size. Though, Hadoop has proved useful for working on huge data sets, its MapReduce framework is very low level and it requires programmers to write custom programs which are hard to maintain and reuse. Hive comes here for rescue of programmers.

Hive evolved as a data warehousing solution built on top of Hadoop Map-Reduce framework.

Hive provides SQL-like declarative language, called HiveQL, which is used for expressing queries. Using Hive-QL users associated with SQL are able to perform data analysis very easily.

Hive engine compiles these queries into Map-Reduce jobs to be executed on Hadoop. In addition, custom Map-Reduce scripts can also be plugged into queries. Hive operates on data stored in tables which consists of primitive data types and collection data types like arrays and maps.

Hive comes with a command-line shell interface which can be used to create tables and execute queries.

Hive query language is similar to SQL wherein it supports subqueries. With Hive query language, it is possible to take a MapReduce joins across Hive tables. It has a support for simple SQL like functions– CONCAT, SUBSTR, ROUND etc., and aggregation functions– SUM, COUNT, MAX etc. It also supports GROUP BY and SORT BY clauses. It is also possible to write user defined functions in Hive query language.

Comparing MapReduce, Pig and Hive

Sqoop Flume HDFS
Sqoop is used for importing data from structured data sources such as RDBMS. Flume is used for moving bulk streaming data into HDFS. HDFS is a distributed file system used by Hadoop ecosystem to store data.
Sqoop has a connector based architecture. Connectors know how to connect to the respective data source and fetch the data. Flume has an agent based architecture. Here, code is written (which is called as ‘agent’) which takes care of fetching data. HDFS has a distributed architecture where data is distributed across multiple data nodes.
HDFS is a destination for data import using Sqoop. Data flows to HDFS through zero or more channels. HDFS is an ultimate destination for data storage.
Sqoop data load is not event driven. Flume data load can be driven by event. HDFS just stores data provided to it by whatsoever means.
In order to import data from structured data sources, one has to use Sqoop only, because its connectors know how to interact with structured data sources and fetch data from them.

In order to load streaming data such as tweets generated on Twitter or log files of a web server, Flume should be used. Flume agents are built for fetching streaming data.

HDFS has its own built-in shell commands to store data into it. HDFS cannot be used to import structured or streaming data