What Is Apache Pig?
Data is addictive. Our ability to collect and store data has grown massively over the last several decades, yet our appetite for ever more data shows no sign of being satiated. The computer and Internet revolutions have driven much of this growth by making it far easier to collect data in the first place. For example, every time someone clicks a link on a website, the web server can record which page the user was on and which link he or she clicked.
The high cost and unneeded features of RDBMSs have led to the development of many alternative data-processing systems. One such alternative system is Apache Hadoop.
The development of new data-processing systems such as Hadoop has spurred the porting of existing tools and languages and the construction of new tools, such as Apache Pig. Tools like Pig provide a higher level of abstraction for data users, giving them access to the power and flexibility of Hadoop without requiring them to write extensive data-processing applications in low-level Java code.
What is Pig?
Pig provides an engine for executing data flows in parallel on Hadoop. It includes a language, Pig Latin, for expressing these data flows. Pig Latin includes operators for many of the traditional data operations (join, sort, filter, etc.), as well as the ability for users to develop their own functions for reading, processing, and writing data.
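To make this concrete, here is a minimal sketch of a Pig Latin script; the file name, schema, and filter condition are all invented for illustration. It loads click records, filters some out, groups them by page, counts clicks per page, and stores the result:

```pig
-- Load the raw click log; path and schema are illustrative.
clicks   = LOAD 'clicks.log' AS (user:chararray, page:chararray, link:chararray);
-- Keep only clicks from external users (hypothetical condition).
external = FILTER clicks BY user != 'internal';
-- Group by page and count the clicks in each group.
by_page  = GROUP external BY page;
counts   = FOREACH by_page GENERATE group AS page, COUNT(external) AS clicks;
STORE counts INTO 'click_counts';
```

Each statement defines a relation in terms of earlier ones; Pig translates the whole script into parallel jobs on the cluster.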
Pig on Hadoop
Pig runs on Hadoop. It makes use of both the Hadoop Distributed File System, HDFS, and Hadoop’s processing system, MapReduce.
Pig Latin is a dataflow language. This means it allows users to describe how data from one or more inputs should be read, processed, and then stored to one or more outputs in parallel. A Pig Latin script describes a directed acyclic graph (DAG), where the edges are data flows and the nodes are operators that process the data.
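As a rough sketch of such a graph (the input names are invented), the script below reads two inputs and joins them; drawn as a DAG, the LOAD, FILTER, and JOIN statements are the nodes, and the relations flowing between them are the edges:

```pig
users  = LOAD 'users' AS (name:chararray, age:int);
-- One branch of the graph: filter one input...
adults = FILTER users BY age >= 18;
pages  = LOAD 'pages' AS (name:chararray, url:chararray);
-- ...and merge it with the other branch, forming a DAG
-- rather than a single linear chain.
joined = JOIN adults BY name, pages BY name;
STORE joined INTO 'adult_pages';
```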
How Pig differs from MapReduce
Pig provides users with several advantages over using MapReduce directly. Pig Latin provides all of the standard data-processing operations, such as join, filter, group by, order by, union, etc. MapReduce provides the group by operation directly (that is what the shuffle plus reduce phases are), and it provides the order by operation indirectly through the way it implements the grouping. Filter and projection can be implemented trivially in the map phase. But other operators, particularly join, are not provided and must instead be written by the user.
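As a rough illustration of the difference, a join that would require a substantial amount of hand-written map and reduce code is a single statement in Pig Latin (the relation and field names here are invented):

```pig
daily = LOAD 'daily_prices' AS (symbol:chararray, price:double);
divs  = LOAD 'dividends'   AS (symbol:chararray, dividend:double);
-- The join itself: one line instead of custom Java classes.
joined = JOIN daily BY symbol, divs BY symbol;
```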
All of this means that Pig Latin scripts cost far less to write and maintain than equivalent Java MapReduce code.
Grunt
Grunt is Pig’s interactive shell. It enables users to enter Pig Latin interactively and provides a shell for users to interact with HDFS.
To enter Grunt, invoke Pig with no script or command to run. Typing:
pig -x local
will result in the prompt:

grunt>

This gives you a Grunt shell to interact with your local filesystem. If you omit the -x local, the Grunt shell will interact with HDFS on your cluster.
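A short session might look like the following sketch (the paths and relation names are invented); Grunt’s fs command passes its arguments through to the Hadoop filesystem shell:

```pig
grunt> fs -ls /data;
grunt> clicks = LOAD '/data/clicks.log' AS (user:chararray, page:chararray);
grunt> top = LIMIT clicks 10;
grunt> DUMP top;
```

DUMP runs the pipeline and prints the resulting records to the console, which makes Grunt convenient for trying out statements before putting them in a script.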