Vote for the DevOps Dozen

At the end of July we opened nominations for our inaugural DevOps Dozen, to recognize the top 12 companies in DevOps. Over 1500 of you voted (thank you), selecting from over 120 different companies. We have taken the top vote-getters in each category to come up with a list of 32 finalists for the DevOps Dozen.

Frankly, 32 finalists was more than we were planning on. But so many companies were so closely grouped in the nomination voting that we didn’t feel comfortable selecting one company for the finals and leaving another out.

So now we are opening the final phase of voting to you, our readers. Voting will be open for the entire month of September, but you can only vote once. On our DevOps Dozen site, most of the finalists have sent us information that they would like you to know before casting your vote. Please take the time to read some of this, especially for companies you may not be familiar with.

The finalists range from open source projects and cloud service providers to DevOps tools vendors and service providers. When voting, you are allowed to select 12 of the 32 as DevOps Dozen winners. So take the time and make your selections carefully.

Just a word about the 32 finalists. Every one of them is already a winner. Making it past the nominations means that at least 350 of the 1500 voters chose them. On top of that, if you take the time to check out these companies, it is pretty obvious why they are here.

So help us recognize excellence in DevOps and vote today!

Amazon EMR Update – Apache Spark 1.5.2, Ganglia, Presto, Zeppelin, and Oozie

Today we are announcing Amazon EMR release 4.2.0, which adds support for Apache Spark 1.5.2, Ganglia 3.6 for Apache Hadoop and Spark monitoring, and new sandbox releases for Presto (0.125), Apache Zeppelin (0.5.5), and Apache Oozie (4.2.0).

New Applications in Release 4.2.0
Amazon EMR provides an easy way to install and configure distributed big data applications in the Hadoop and Spark ecosystems on managed clusters of Amazon EC2 instances. You can create Amazon EMR clusters from the Amazon EMR Create Cluster page in the AWS Management Console, from the AWS Command Line Interface (CLI), or by using an SDK with the EMR API. In the latest release, we added support for several new versions of applications:

  • Spark 1.5.2 – Spark 1.5.2 was released on November 9th, and we’re happy to give you access to it within two weeks of general availability. This version is a maintenance release, with improvements to Spark SQL, SparkR, the DataFrame API, and miscellaneous enhancements and bug fixes. Also, Spark documentation now includes information on enabling wire encryption for the block transfer service. For a complete set of changes, view the JIRA. To learn more about Spark on Amazon EMR, click here.
  • Ganglia 3.6 – Ganglia is a scalable, distributed monitoring system which can be installed on your Amazon EMR cluster to display Amazon EC2 instance level metrics which are also aggregated at the cluster level. We also configure Ganglia to ingest and display Hadoop and Spark metrics along with general resource utilization information from instances in your cluster, and metrics are displayed in a variety of time spans. You can view these metrics using the Ganglia web-UI on the master node of your Amazon EMR cluster. To learn more about Ganglia on Amazon EMR, click here.
  • Presto 0.125 – Presto is an open-source, distributed SQL query engine designed for low-latency queries on large datasets in Amazon S3 and the Hadoop Distributed Filesystem (HDFS). Presto 0.125 is a maintenance release, with optimizations to SQL operations, performance enhancements, and general bug fixes. To learn more about Presto on Amazon EMR, click here.
  • Zeppelin 0.5.5 – Zeppelin is an open-source interactive and collaborative notebook for data exploration using Spark. You can use Scala, Python, SQL, or HiveQL to manipulate data and visualize results. Zeppelin 0.5.5 is a maintenance release, and contains miscellaneous improvements and bug fixes. To learn more about Zeppelin on Amazon EMR, click here.
  • Oozie 4.2.0 – Oozie is a workflow designer and scheduler for Hadoop and Spark. This version now includes Spark and HiveServer2 actions, making it easier to incorporate Spark and Hive jobs in Oozie workflows. Also, you can create and manage your Oozie workflows using the Oozie Editor and Dashboard in Hue, an application which offers a web-UI for Hive, Pig, and Oozie. Please note that in Hue 3.7.1, you must still use Shell actions to run Spark jobs. To learn more about Oozie in Amazon EMR, click here.

Launch an Amazon EMR Cluster with Release 4.2.0 Today
To create an Amazon EMR cluster with 4.2.0, select release 4.2.0 on the Create Cluster page in the AWS Management Console, or use the release label emr-4.2.0 when creating your cluster from the AWS CLI or using an SDK with the EMR API.

Jon Fritz, Senior Product Manager

Now Available: Version 1.0 of the AWS SDK for Go

    by Jeff Barr | in Developers
    Earlier this year, my colleague Peter Moon shared our plans to launch an AWS SDK for Go. As you will read in Peter’s guest post below, the SDK is now generally available!

    – Jeff

    At AWS, we work hard to promote and serve the developer community around our products. This is one of the reasons we open-source many of our libraries and tools on GitHub, where we cherish the ability to directly communicate and collaborate with our developer customers. Of all the experiences we’ve had in the open source community, the story of how the AWS SDK for Go came about is one we particularly love to share.

    Since the day we took ownership of the project 10 months ago, community feedback and contributions have made it possible for us to progress through the experimental and preview stages, and today we are excited to announce that the AWS SDK for Go is now at version 1.0 and recommended for production use. Like many of our projects, the SDK follows Semantic Versioning, which means that starting from 1.0, you can upgrade the SDK within the same major version 1.x and have confidence that your existing code will continue to work.

    Since the Developer Preview announcement in June, we have added a number of key improvements to the SDK, including:

    • Sessions – Easily share configuration and request handlers between clients.
    • JMESPath support – Query and reshape complex API responses and other structures using simple expressions.
    • Paginators – Iterate over multiple pages of list-type API responses.
    • Waiters – Wait for asynchronous state changes in AWS resources.
    • Documentation – Revamped developer guide.

    Here’s a code sample that exercises some of these new features:

    // Assumed imports: "fmt" plus, from github.com/aws/aws-sdk-go, the packages
    // aws, aws/awsutil, aws/request, aws/session, and service/ec2.

    // Create a session
    s := session.New(aws.NewConfig().WithRegion("us-west-2"))

    // Add a handler to print every API request for the session
    s.Handlers.Send.PushFront(func(r *request.Request) {
    	fmt.Printf("Request: %s/%s\n", r.ClientInfo.ServiceName, r.Operation)
    })

    // We want to start all instances in a VPC, so let's get their IDs first.
    ec2client := ec2.New(s)
    var instanceIDsToStart []*string
    describeInstancesInput := &ec2.DescribeInstancesInput{
    	Filters: []*ec2.Filter{
    		{
    			Name:   aws.String("vpc-id"),
    			Values: aws.StringSlice([]string{"vpc-82977de9"}),
    		},
    	},
    }

    // Use a paginator to easily iterate over multiple pages of response
    ec2client.DescribeInstancesPages(describeInstancesInput,
    	func(page *ec2.DescribeInstancesOutput, lastPage bool) bool {
    		// Use JMESPath expressions to query complex structures
    		ids, _ := awsutil.ValuesAtPath(page, "Reservations[].Instances[].InstanceId")
    		for _, id := range ids {
    			instanceIDsToStart = append(instanceIDsToStart, id.(*string))
    		}
    		return !lastPage
    	})

    // The SDK provides several utility functions for literal <--> pointer transformation
    fmt.Println("Starting:", aws.StringValueSlice(instanceIDsToStart))

    // Skipped for brevity here, but *always* handle errors in the real world 🙂
    ec2client.StartInstances(&ec2.StartInstancesInput{
    	InstanceIds: instanceIDsToStart,
    })

    // Finally, use a waiter function to wait until the instances are running
    ec2client.WaitUntilInstanceRunning(describeInstancesInput)
    fmt.Println("Instances are now running.")

    We would like to again thank Coda Hale and our friends at Stripe for contributing the original code base and giving us a wonderful starting point for the AWS SDK for Go. Now that it is fully production-ready, we can’t wait to see all the innovative applications our customers will build with the SDK!

    Peter Moon, Senior Product Manager

AWS Device Farm Update – Test Web Apps on Mobile Devices

    by Jeff Barr | in AWS Device Farm
    If you build mobile apps, you know that you have two implementation choices. You can build native or hybrid applications that compile to an executable file. You can also build applications that run within the device’s web browser.

    We launched the AWS Device Farm in July with support for testing native and hybrid applications on iOS and Android devices (see my post, AWS Device Farm – Test Mobile Apps on Real Devices, to learn more).

    Today we are adding support for testing browser-based applications on iOS and Android devices. Many customers have asked for this option and we are happy to be able to announce it. You can now create a single test run that spans any desired combination of supported devices and makes use of the Appium Java JUnit or Appium Java TestNG frameworks (we’ll add additional frameworks over time; please let us know what you need).

    Testing a Web App
    I tested a simple web app. It opens and searches for the string “Kindle”. I opened the Device Farm Console and created a new project (Test Amazon Site). Then I created a new run (this was my second test, so I called it Web App Test #2):

    Then I configured the test by choosing the test type (TestNG) and uploading the tests (prepared for me by one of my colleagues):

    The uploaded file contains the compiled test and the dependencies (a bunch of JAR files):

    Next, I chose the devices. I had already created a “pool” of Android devices, so I used it:

    I started the run and then checked in on it a few minutes later:

    Then I inspected the output, including screen shots, from a single test:

    Available Now
    This new functionality is available now and you can start using it today! Read the Device Farm Documentation to learn more.

Apache Spark 1.5.2 and new versions of Ganglia monitoring, Presto, Zeppelin, and Oozie now available in Amazon EMR

You can now deploy new applications on your Amazon EMR cluster. Amazon EMR release 4.2.0 now offers Ganglia 3.6, an upgraded version of Apache Spark (1.5.2), and upgraded sandbox releases of Apache Oozie (4.2.0), Presto (0.125), and Apache Zeppelin (0.5.5). Ganglia provides resource utilization monitoring for Hadoop and Spark. Oozie 4.2.0 includes several new features, such as adding Spark actions and HiveServer2 actions in your Oozie workflows. Spark 1.5.2, Presto 0.125, and Zeppelin 0.5.5 are maintenance releases, and contain bug fixes and other optimizations.

You can create an Amazon EMR cluster with release 4.2.0 by choosing release label “emr-4.2.0” from the AWS Management Console, AWS CLI, or SDK. You can specify Ganglia, Spark, Oozie-Sandbox, Presto-Sandbox, and Zeppelin-Sandbox to install these applications on your cluster. To view metrics in Ganglia or create a Zeppelin notebook, you can connect to the web-based UIs for these applications on the master node of your cluster. Please visit the Amazon EMR documentation for more information about Ganglia 3.6, Spark 1.5.2, Oozie 4.2.0, Presto 0.125, and Zeppelin 0.5.5.

What is Infrastructure as a Service?

The definition of infrastructure as a service (IaaS) is pretty simple. You rent cloud infrastructure (servers, storage and networking) on demand, in a pay-as-you-go model.

Since you don’t need to invest in your own hardware, IaaS is perfect for start-ups or businesses testing out a new idea.

Also, since the infrastructure scales on demand, it’s great for workloads that fluctuate rapidly.

Public IaaS

Your business rents infrastructure from the cloud provider, and accesses that infrastructure over the Internet, in order to create or use applications.


IaaS is the fastest growing area of cloud computing.

Enterprise public cloud spending is expected to reach $207 billion by 2016.

Common public IaaS workloads: dev/test, website hosting, storage, simple application development

Managed IaaS

But some workloads require an advanced solution.
Managed IaaS is suited for large enterprises running production workloads.

Create Your First Hadoop Program

Find out Number of Products Sold in Each Country.

Input: Our input data set is a CSV file, SalesJan2009.csv
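The job built in this tutorial counts sales records per country. As a rough sketch of the idea only (plain Python, not the tutorial's Java classes), the map phase emits a (country, 1) pair for each record and the reduce phase sums the pairs for each country. The sample rows and the position of the country column are assumptions made for illustration; they are not taken from SalesJan2009.csv.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample rows; the real SalesJan2009.csv layout may differ.
SAMPLE_CSV = """product,price,country
Product1,1200,United Kingdom
Product1,1200,United States
Product2,3600,United States
Product1,1200,Australia
"""

COUNTRY_COLUMN = 2  # assumed column index of the country field


def map_phase(rows):
    """Emit one (country, 1) pair per sales record."""
    for row in rows:
        yield row[COUNTRY_COLUMN], 1


def reduce_phase(pairs):
    """Sum the counts emitted for each country."""
    totals = defaultdict(int)
    for country, count in pairs:
        totals[country] += count
    return dict(totals)


def count_sales_per_country(csv_text):
    """Run the map and reduce phases over CSV text with a header line."""
    reader = csv.reader(io.StringIO(csv_text))
    next(reader)  # skip the header line
    return reduce_phase(map_phase(reader))


if __name__ == "__main__":
    for country, total in sorted(count_sales_per_country(SAMPLE_CSV).items()):
        print(country, total)
```

On a real cluster, Hadoop runs many map tasks in parallel and groups the emitted pairs by key before the reduce step; the single-process version above only mirrors the data flow.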


  • This tutorial is developed on Linux – Ubuntu operating System.
  • You should have Hadoop (version 2.2.0 used for this tutorial) already installed.
  • You should have Java (version 1.8.0 used for this tutorial) already installed on the system.

Before we start with the actual process, change user to ‘hduser’ (the user used for Hadoop).

su - hduser_


1. Create a new directory with name MapReduceTutorial

sudo mkdir MapReduceTutorial

Give permissions

sudo chmod -R 777 MapReduceTutorial

Copy the tutorial's Java source files into this directory.

Download Files Here

If you want to understand the code in these files, refer to this guide.

Check the file permissions of all these files, and if ‘read’ permissions are missing, grant them.

2. Export classpath

export CLASSPATH="$HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-client-core-2.2.0.jar:$HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-client-common-2.2.0.jar:$HADOOP_HOME/share/hadoop/common/hadoop-common-2.2.0.jar:~/MapReduceTutorial/SalesCountry/*:$HADOOP_HOME/lib/*"

3. Compile the Java files (these files are present in the directory Final-MapReduceHandsOn). The class files will be put in the package directory.

javac -d .

This warning can be safely ignored.

This compilation creates a directory in the current directory, named after the package specified in the Java source files (SalesCountry in our case), and puts all compiled class files in it.

4. Create a new file Manifest.txt

sudo gedit Manifest.txt

Add the following line to it:

Main-Class: SalesCountry.SalesCountryDriver

SalesCountry.SalesCountryDriver is the name of the main class. Please note that you have to hit the Enter key at the end of this line.

5. Create a Jar file

jar cfm ProductSalePerCountry.jar Manifest.txt SalesCountry/*.class

Check that the jar file is created

6. Start Hadoop



7. Copy the File SalesJan2009.csv into ~/inputMapReduce

Now use the command below to copy ~/inputMapReduce to HDFS.

$HADOOP_HOME/bin/hdfs dfs -copyFromLocal ~/inputMapReduce /

We can safely ignore this warning.

Verify whether the file was actually copied.

$HADOOP_HOME/bin/hdfs dfs -ls /inputMapReduce

8. Run MapReduce job

$HADOOP_HOME/bin/hadoop jar ProductSalePerCountry.jar /inputMapReduce /mapreduce_output_sales

This will create an output directory named mapreduce_output_sales on HDFS. Contents of this directory will be a file containing product sales per country.

9. The result can be seen through the command interface:

$HADOOP_HOME/bin/hdfs dfs -cat /mapreduce_output_sales/part-00000

Results can also be seen via the web interface.

Open the Hadoop web interface in a web browser.

Now select ‘Browse the filesystem’ and navigate to /mapreduce_output_sales

Open part-00000

Introduction To Flume and Sqoop

Before we learn more about Flume and Sqoop, let's study the issues with loading data into Hadoop.

Issues with Data Load into Hadoop

Analytical processing using Hadoop requires loading of huge amounts of data from diverse sources into Hadoop clusters.

This process of bulk-loading data into Hadoop from heterogeneous sources, and then processing it, comes with a certain set of challenges.

Maintaining data consistency and ensuring efficient utilization of resources are some of the factors to consider before selecting the right approach for data load.

Major Issues:

1. Data load using Scripts

The traditional approach of using scripts to load data is not suitable for bulk data load into Hadoop; this approach is inefficient and very time consuming.

2. Direct access to external data via Map-Reduce application

Providing map-reduce applications with direct access to data residing in external systems (without loading it into Hadoop) complicates these applications. So, this approach is not feasible.

3. In addition to being able to work with enormous amounts of data, Hadoop can work with data in several different forms. So, to load such heterogeneous data into Hadoop, different tools have been developed. Sqoop and Flume are two such data loading tools.

Introduction to SQOOP

Apache Sqoop (SQL-to-Hadoop) is designed to support bulk import of data into HDFS from structured data stores such as relational databases, enterprise data warehouses, and NoSQL systems. Sqoop is based upon a connector architecture which supports plugins to provide connectivity to new external systems.

An example use case of Sqoop, is an enterprise that runs a nightly Sqoop import to load the day’s data from a production transactional RDBMS into a Hive data warehouse for further analysis.

Sqoop Connectors

Existing database management systems are designed with the SQL standard in mind. However, each DBMS differs to some extent in its dialect, and this difference poses challenges when transferring data across systems. Sqoop connectors are the components that help overcome these challenges.

Data transfer between Sqoop and an external storage system is made possible by Sqoop’s connectors.

Sqoop has connectors for working with a range of popular relational databases, including MySQL, PostgreSQL, Oracle, SQL Server, and DB2. Each of these connectors knows how to interact with its associated DBMS. There is also a generic JDBC connector for connecting to any database that supports Java’s JDBC protocol. In addition, Sqoop provides optimized MySQL and PostgreSQL connectors that use database-specific APIs to perform bulk transfers efficiently.

In addition to this, Sqoop has various third-party connectors for data stores, ranging from enterprise data warehouses (including Netezza, Teradata, and Oracle) to NoSQL stores (such as Couchbase). However, these connectors do not come with the Sqoop bundle; they need to be downloaded separately and can be added easily to an existing Sqoop installation.

Introduction to FLUME

Apache Flume is a system used for moving massive quantities of streaming data into HDFS. Collecting log data present in log files from web servers and aggregating it in HDFS for analysis, is one common example use case of Flume.

Flume supports multiple sources like –

  • ‘tail’ (which pipes data from a local file and writes it into HDFS via Flume, similar to the Unix command ‘tail’)
  • System logs
  • Apache log4j (enables Java applications to write events to files in HDFS via Flume).

Data Flow in Flume

A Flume agent is a JVM process with three components (Flume Source, Flume Channel and Flume Sink) through which events propagate after being initiated at an external source.

  1. The events generated by an external source (for example, a web server) are consumed by the Flume source. The external source sends events to the Flume source in a format that the source recognizes.
  2. The Flume source receives an event and stores it in one or more channels. The channel acts as a store which keeps the event until it is consumed by the Flume sink. The channel may use the local file system to store these events.
  3. The Flume sink removes the event from the channel and stores it in an external repository such as HDFS. There can be multiple Flume agents, in which case the Flume sink forwards the event to the Flume source of the next Flume agent in the flow.
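The three steps above can be sketched as a toy pipeline. This is a minimal illustration in plain Python, not Flume's actual API: the class names `Source`, `Channel` and `Sink`, the in-memory queue, and the list standing in for HDFS are all assumptions made for the example.

```python
from collections import deque


class Channel:
    """Buffers events until a sink takes them (held in memory here)."""

    def __init__(self):
        self._events = deque()

    def put(self, event):
        self._events.append(event)

    def take(self):
        """Return the oldest buffered event, or None if the channel is empty."""
        return self._events.popleft() if self._events else None


class Source:
    """Receives events from an external producer and stores them in channels."""

    def __init__(self, channels):
        self.channels = channels

    def receive(self, event):
        for channel in self.channels:
            channel.put(event)


class Sink:
    """Removes events from a channel and writes them to a repository."""

    def __init__(self, channel, repository):
        self.channel = channel
        self.repository = repository

    def drain(self):
        event = self.channel.take()
        while event is not None:
            self.repository.append(event)
            event = self.channel.take()


def run_demo():
    """Push two log lines through source -> channel -> sink."""
    channel = Channel()
    source = Source([channel])
    repository = []  # stands in for HDFS in this sketch
    sink = Sink(channel, repository)
    for line in ["GET /index.html", "GET /about.html"]:
        source.receive(line)
    sink.drain()
    return repository, channel.take()


if __name__ == "__main__":
    print(run_demo())
```

Chaining agents, as described in step 3, would amount to giving one agent's sink the next agent's source as its repository.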

Some Important features of FLUME

  • Flume has a flexible design based upon streaming data flows. It is fault tolerant and robust, with multiple failover and recovery mechanisms. Flume offers different levels of reliability, including ‘best-effort delivery’ and ‘end-to-end delivery’. Best-effort delivery does not tolerate any Flume node failure, whereas ‘end-to-end delivery’ mode guarantees delivery even in the event of multiple node failures.
  • Flume carries data between sources and sinks. This gathering of data can either be scheduled or event driven. Flume has its own query processing engine which makes it easy to transform each new batch of data before it is moved to the intended sink.
  • Possible Flume sinks include HDFS and Hbase. Flume can also be used to transport event data including but not limited to network traffic data, data generated by social-media websites and email messages.

Since July 2012, Flume has been released as Flume NG (New Generation), as it differs significantly from its original release, also known as Flume OG (Original Generation).

| Sqoop | Flume | HDFS |
|---|---|---|
| Sqoop is used for importing data from structured data sources such as RDBMS. | Flume is used for moving bulk streaming data into HDFS. | HDFS is a distributed file system used by Hadoop ecosystem to store data. |
| Sqoop has a connector based architecture. Connectors know how to connect to the respective data source and fetch the data. | Flume has an agent based architecture. Here, code is written (which is called as ‘agent’) which takes care of fetching data. | HDFS has a distributed architecture where data is distributed across multiple data nodes. |
| HDFS is a destination for data import using Sqoop. | Data flows to HDFS through zero or more channels. | HDFS is an ultimate destination for data storage. |
| Sqoop data load is not event driven. | Flume data load can be driven by event. | HDFS just stores data provided to it by whatsoever means. |
| In order to import data from structured data sources, one has to use Sqoop only, because its connectors know how to interact with structured data sources and fetch data from them. | In order to load streaming data such as tweets generated on Twitter or log files of a web server, Flume should be used. Flume agents are built for fetching streaming data. | HDFS has its own built-in shell commands to store data into it. HDFS cannot import streaming data. |

Introduction To Pig And Hive

In this tutorial we will discuss Pig & Hive


In Map Reduce framework, programs need to be translated into a series of Map and Reduce stages. However, this is not a programming model which data analysts are familiar with. So, in order to bridge this gap, an abstraction called Pig was built on top of Hadoop.

Pig is a high level programming language useful for analyzing large data sets. Pig was a result of development effort at Yahoo!

Pig enables people to focus more on analyzing bulk data sets and to spend less time in writing Map-Reduce programs.

Just as pigs eat anything, the Pig programming language is designed to work on any kind of data. Hence the name, Pig!

Pig consists of two components:

  1. Pig Latin, which is a language
  2. Runtime environment, for running Pig Latin programs.

A Pig Latin program consists of a series of operations or transformations which are applied to the input data to produce output. These operations describe a data flow which the Pig execution environment translates into an executable representation. Underneath, these transformations become a series of MapReduce jobs that the programmer does not need to be aware of. So, in a way, Pig allows the programmer to focus on the data rather than the nature of execution.

Pig Latin is a relatively rigid language which uses familiar keywords from data processing, e.g., Join, Group and Filter.
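To make the idea of a data flow concrete, here is a small sketch in plain Python (not Pig Latin syntax) that chains a filter step and a group step over a tiny in-memory relation. The relation, its fields, and the helper names `filter_rows` and `group_by` are hypothetical and exist only for this illustration.

```python
from collections import defaultdict

# Hypothetical input relation: (user, page, seconds_on_page)
VISITS = [
    ("alice", "home", 12),
    ("bob", "home", 3),
    ("alice", "docs", 40),
    ("carol", "docs", 25),
]


def filter_rows(rows, predicate):
    """Keep only the rows for which predicate(row) is true."""
    return [row for row in rows if predicate(row)]


def group_by(rows, key):
    """Collect rows into lists keyed by key(row)."""
    groups = defaultdict(list)
    for row in rows:
        groups[key(row)].append(row)
    return dict(groups)


def total_seconds_per_page(rows):
    """Filter short visits, group by page, then sum seconds per group."""
    long_visits = filter_rows(rows, lambda row: row[2] >= 10)
    grouped = group_by(long_visits, key=lambda row: row[1])
    return {page: sum(row[2] for row in members)
            for page, members in grouped.items()}


if __name__ == "__main__":
    print(total_seconds_per_page(VISITS))
```

Each step takes a relation and returns a new one, which is the shape of program that Pig turns into MapReduce jobs on the programmer's behalf.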

Execution modes:

Pig has two execution modes:

  1. Local mode: In this mode, Pig runs in a single JVM and makes use of the local file system. This mode is suitable only for analysis of small data sets using Pig.
  2. Map Reduce mode: In this mode, queries written in Pig Latin are translated into MapReduce jobs and are run on a Hadoop cluster (the cluster may be pseudo or fully distributed). MapReduce mode with a fully distributed cluster is useful for running Pig on large data sets.


The size of the data sets being collected and analyzed in the industry for business intelligence is growing, and this is making traditional data warehousing solutions more expensive. Hadoop, with its MapReduce framework, is being used as an alternative solution for analyzing very large data sets. Although Hadoop has proved useful for working on huge data sets, its MapReduce framework is very low level and requires programmers to write custom programs which are hard to maintain and reuse. Hive comes to the rescue of programmers here.

Hive evolved as a data warehousing solution built on top of Hadoop Map-Reduce framework.

Hive provides a SQL-like declarative language, called HiveQL, which is used for expressing queries. Using HiveQL, users familiar with SQL are able to perform data analysis very easily.

The Hive engine compiles these queries into Map-Reduce jobs to be executed on Hadoop. In addition, custom Map-Reduce scripts can also be plugged into queries. Hive operates on data stored in tables, which consist of primitive data types and collection data types like arrays and maps.

Hive comes with a command-line shell interface which can be used to create tables and execute queries.

The Hive query language is similar to SQL in that it supports subqueries. With the Hive query language, it is possible to perform MapReduce joins across Hive tables. It supports simple SQL-like functions (CONCAT, SUBSTR, ROUND etc.) and aggregation functions (SUM, COUNT, MAX etc.). It also supports GROUP BY and SORT BY clauses. It is also possible to write user-defined functions in the Hive query language.

Comparing MapReduce, Pig and Hive

Hadoop – HDFS Overview

The Hadoop File System was developed using a distributed file system design. It runs on commodity hardware. Unlike other distributed systems, HDFS is highly fault-tolerant and designed to use low-cost hardware.

HDFS holds very large amounts of data and provides easier access. To store such huge data, the files are stored across multiple machines. These files are stored in a redundant fashion to protect the system from possible data loss in case of failure. HDFS also makes the data available to applications for parallel processing.

Features of HDFS

  • It is suitable for the distributed storage and processing.
  • Hadoop provides a command interface to interact with HDFS.
  • The built-in servers of namenode and datanode help users to easily check the status of cluster.
  • Streaming access to file system data.
  • HDFS provides file permissions and authentication.

HDFS Architecture

Given below is the architecture of a Hadoop File System.

HDFS follows the master-slave architecture and has the following elements.

Namenode

The namenode is software that runs on commodity hardware with the GNU/Linux operating system. The system hosting the namenode acts as the master server and does the following tasks:

  • Manages the file system namespace.
  • Regulates client’s access to files.
  • It also executes file system operations such as renaming, closing, and opening files and directories.


Datanode

The datanode is commodity hardware running the GNU/Linux operating system and the datanode software. For every node (commodity hardware/system) in a cluster, there will be a datanode. These nodes manage the data storage of their system.

  • Datanodes perform read-write operations on the file systems, as per client request.
  • They also perform operations such as block creation, deletion, and replication according to the instructions of the namenode.


Block

Generally, user data is stored in the files of HDFS. A file is divided into one or more segments, which are stored on individual datanodes. These file segments are called blocks. In other words, a block is the minimum amount of data that HDFS can read or write. The default block size is 64 MB (128 MB in Hadoop 2.x and later), and it can be changed in the HDFS configuration.
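As a worked illustration of how a file breaks down into blocks, the sketch below computes block sizes for a given file size and spreads the blocks over datanodes. It assumes the 64 MB block size quoted above; the round-robin placement and the datanode names are simplifications invented for the example, and HDFS's real placement and replication logic is more involved.

```python
BLOCK_SIZE = 64 * 1024 * 1024  # 64 MB, the block size quoted in the text


def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return the size in bytes of each block for a file of file_size bytes."""
    if file_size < 0:
        raise ValueError("file_size must be non-negative")
    full_blocks, remainder = divmod(file_size, block_size)
    sizes = [block_size] * full_blocks
    if remainder:
        sizes.append(remainder)  # the last block holds only the leftover bytes
    return sizes


def place_blocks(num_blocks, datanodes):
    """Assign block indexes to datanodes round-robin (illustrative only)."""
    return {index: datanodes[index % len(datanodes)]
            for index in range(num_blocks)}


if __name__ == "__main__":
    mb = 1024 * 1024
    blocks = split_into_blocks(200 * mb)
    print([size // mb for size in blocks])
    print(place_blocks(len(blocks), ["datanode1", "datanode2", "datanode3"]))
```

A 200 MB file therefore occupies three full 64 MB blocks plus one 8 MB block, and the final block takes up only as much space as the data it holds.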

Goals of HDFS

  • Fault detection and recovery : Since HDFS includes a large number of commodity hardware components, failure of components is frequent. Therefore HDFS should have mechanisms for quick and automatic fault detection and recovery.
  • Huge datasets : HDFS should have hundreds of nodes per cluster to manage the applications having huge datasets.
  • Hardware at data : A requested task can be done efficiently, when the computation takes place near the data. Especially where huge datasets are involved, it reduces the network traffic and increases the throughput.

Summer Internship 2016

The Summer Internship at LinuxWorld Informatics Pvt Ltd is a great opportunity to take the first step into a bright career. The internship or training program is set to begin in the middle of 2016 and will be held in Jaipur. If you are a graduate in a technical field and are looking for a place to begin, this option will reap great rewards in your career. It will add shine to your resume, and you can also learn many professional skills that will give you a head start.

The internship program is for all students who have finished a course in computer science or any stream related to computer science. It will be conducted at a company-specified location in Jaipur. We are seeking aspiring students in this field who wish to take a step further in their learning. Interested students can have a fruitful summer research experience.

The main objective of this program is to offer students industry-oriented and job-specific advanced training that will hone their skills and make them experts in their field. The course covers Linux administration, Big Data Hadoop implementation, cloud computing deployment, web development, mobile application development, software development and security programs. Knowledge of all these will not only help you become efficient but will also help the industry build a quality workforce.

All the training programs will be 100% practical, and we specifically invite students from B.E, M.E, BCA, MCA, PGDCA, MSC IT, BSC IT, and other related fields. All the topics we cover are taught in a highly interactive fashion, and we make sure that students leave with full knowledge of them. LinuxWorld is arguably the best Summer Internship Institute in Jaipur. Our study materials are regularly updated and we have the best infrastructure to impart such training.

We have more than twelve years of experience in this industry and all our teachers are well equipped with materials so that they can teach the students with clarity. We aspire to provide our students with the best quality of summer training and to connect them to the latest and swiftest of technologies.

The selected students will be challenged to face some real world systems and applications and learn their prototypes. They will also have the opportunity to connect with the top students from other universities.

Summer Training and its Importance

Summer break can be put to good use by doing summer training under an artist, a designer, or in industry. If you are good at your work, you can get a placement easily, and it also helps you build up your CV. Summer internships serve the purpose of both the intern and the entrepreneur. They are an opportunity to learn from more experienced people in the field you want to work in.

For college students, an internship is like a test drive of your career, and it also gives you a broad way to determine what environment suits you. For instance, if you are in a creative field like designing or fine arts, summer training will also carve a path for your future. While working under professionals, you will come to know what exactly professionals do and how a firm or company works. You will also build up a list of contacts for future reference. While working in an export house or fashion house, you may realize that the work is not for you and that you should focus on something else: perhaps fashion designing is not your cup of tea, and pattern making, technical designing, or another branch suits you better. You will also find out whether you like working under a designer, in an export house, or in a fashion house, or whether you have the capability to run your own brand.

The strongest benefit of being involved in a summer internship is that you can be hired as an employee as soon as you finish college. In the current scenario of fierce recession, this is a huge factor. If your involvement during the summer training is good, 90 percent of students are hired by the company into permanent positions, and that serves your purpose.