Introduction to Oracle Big Data Cloud Service – Compute Edition (Part V) – Pig

This is the fifth blog post in my introduction series for Oracle Big Data Cloud Service – Compute Edition. In this blog post, I'll cover "Apache Pig". It's a tool/platform created at Yahoo! to analyze large data sets without the complexities of writing a traditional MapReduce program. It's designed to process any kind of data (structured or unstructured), so it's a great tool for ETL jobs. Pig comes installed and ready to use with "Oracle Big Data Cloud Service – Compute Edition". In this blog post, I'll show how we can use Pig to read, parse and analyze data.

Pig has a high-level SQL-like programming language called Pig Latin. We need to learn the basics of this language to be able to use Pig. Each statement in a Pig script is processed by the Pig interpreter to build a logical plan, which is then used to produce MapReduce jobs. The steps in the logical plan are not "executed" until a DUMP or STORE statement is used.

Pig scripts generally have the following structure:

  1. Data is read by using LOAD statements.
  2. Data is transformed/processed.
  3. The result is dumped (to screen) or stored to a file (or a Hive table).

In my previous post, I used PySpark (the Spark Python API) to load flight data and ran a very simple analysis using Spark SQL. I'll do the same thing with the same data using Pig.

Pig has an interactive shell named "Grunt". Instead of writing Pig Latin statements to a file and executing it with the pig command, we can also enter each statement into Grunt and run them interactively. To enter the Grunt shell, I'll log in to my first node, switch to the hdfs user (to be able to reach the CSV files I used in my previous blog post) and then run the pig executable (pig) in the terminal:
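(Roughly like this; the host name below is just a placeholder for my first node.)

  ssh opc@node1-bdcsce          # placeholder host name of the first cluster node
  sudo su - hdfs                # switch to the hdfs user
  pig                           # starts Pig and opens the Grunt shell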

Grunt can also be used to run HDFS commands. So I'll run the following commands to see if I can access the CSV files (2008.csv, carriers.csv). These files are located in the /user/zeppelin folder because I used Zeppelin to download them.
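(For example, assuming the files sit directly under /user/zeppelin:)

  grunt> fs -ls /user/zeppelin
  grunt> fs -tail /user/zeppelin/carriers.csv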

When I ran the above commands, I saw 2008.csv and carriers.csv, as well as the last lines of carriers.csv. So I'm sure that my Pig script will also be able to access them.

So first of all, I need to read the data from these CSV files. The following statement will tell Pig to read data from carriers.csv and declare a data set named "carriers_file":
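(A minimal sketch; I use the default PigStorage loader with a comma delimiter here.)

  carriers_file = LOAD '/user/zeppelin/carriers.csv' USING PigStorage(',');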

We can tell the LOAD statement that we will use a CSV file, that the rows are not multi-line, that UNIX-style line endings are used, and that the header row should be skipped. Don't worry, it's a very simple and commonly used option; all we need to do is pass a few parameters to the loader.
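(For example, the CSVExcelStorage loader that ships with piggybank takes exactly these options; depending on your installation you may need to REGISTER the piggybank jar first.)

  carriers_file = LOAD '/user/zeppelin/carriers.csv'
      USING org.apache.pig.piggybank.storage.CSVExcelStorage(',', 'NO_MULTILINE', 'UNIX', 'SKIP_INPUT_HEADER');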

We can also define our schema while loading data:
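(For instance, the same file could be loaded with an explicit schema like this.)

  carriers_file = LOAD '/user/zeppelin/carriers.csv' USING PigStorage(',') AS (Code:chararray, Description:chararray);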

Be careful about the case-sensitivity of variable names. In Pig scripts, variable names are case sensitive while Pig Latin keywords are not. So if we want to use carriers_file in the next statements, we need to write "carriers_file" (all lower case). I won't define a schema on load, because the 2008.csv file has lots of fields we do not use, and I don't want to define all of them.

If you do not define a schema for a data set, you can access the fields using the $0, $1 … variables. In our CSV files, fields are quoted ("ZMZ","Winnipesaukee Aviation Inc."), so I also need to remove the quote signs using the REPLACE function. To do that, I need to tell Pig to process each row in the "carriers_file" data set; the FOREACH … GENERATE statement is used for this. The statement below processes each row, removes the '"' quote signs from $0 and $1 (the first and second columns) and names them Code and Description.
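(A sketch of that statement; REPLACE treats its second parameter as a regular expression, so the double quote can be passed as-is.)

  carriers = FOREACH carriers_file GENERATE REPLACE($0, '"', '') AS Code, REPLACE($1, '"', '') AS Description;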

As you will see, these commands take no time to process, and they won't trigger any real work (a MapReduce job). If you want to see the content of a data set, you can use the DUMP command. But first, don't forget to limit the data. For example, I can use the following statements to get the first 10 rows of the "carriers" data set:
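(A sketch using LIMIT and then DUMP on the limited data set:)

  carriers_limited = LIMIT carriers 10;
  DUMP carriers_limited;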

I will do a very similar thing with 2008.csv, but it will be a little bit more complex because there are more columns, and I also want to define the variable type for each column:
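(A sketch of those statements. I keep only the columns I'll actually reference; the positions ($8 for UniqueCarrier, $24 for CarrierDelay and so on) follow the column order of the 2008.csv flight data.)

  flights_file = LOAD '/user/zeppelin/2008.csv' USING PigStorage(',');

  flights = FOREACH flights_file GENERATE
      (int)$0                       AS Year,
      (int)$1                       AS Month,
      (int)$2                       AS DayofMonth,
      (chararray)$8                 AS UniqueCarrier,
      (int)REPLACE($14, 'NA', '0')  AS ArrDelay,
      (int)REPLACE($15, 'NA', '0')  AS DepDelay,
      (int)REPLACE($24, 'NA', '0')  AS CarrierDelay;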

In my last query I will only use "CarrierDelay", but I wanted to show how we can get the other columns. This time I didn't remove quote signs; instead, I replaced "NA" with 0. You might wonder why we put "(int)" before the REPLACE calls, like this:
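(Repeating the relevant line from the sketch above:)

  (int)REPLACE($24, 'NA', '0') AS CarrierDelay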

I tell Pig that "the REPLACE call will process the 25th field of "flights_file" and return an integer, and I want to store it as CarrierDelay (an integer column)". I define the data types returned from the function to prevent casting problems in the MapReduce jobs. This is a very important tip, especially if you use a release earlier than Pig 0.17.

Now we need to join these two data sets, so we will use the JOIN command:
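(A sketch; the carrier code of each flight is the UniqueCarrier column, which matches the Code column of the carriers data set.)

  flights_joined = JOIN carriers BY Code, flights BY UniqueCarrier;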

Then we group the joined data by the Code and the Description columns:
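(After a join, Pig prefixes the field names with the original data set names, so the grouping keys are written as carriers::Code and carriers::Description.)

  flight_grouped = GROUP flights_joined BY (carriers::Code, carriers::Description);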

In my previous blog post, I found the total number of delayed flights, the total number of flights, the average delay time (based on CarrierDelay) and the ratio of delayed/total flights for each airline/carrier, and sorted them. That was very easy to do with SQL; Pig requires more thinking to build such a query:
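(Here's a sketch of such a query, written as a nested FOREACH block; note that the field names still carry the carriers::/flights:: prefixes added by the join.)

  result = FOREACH flight_grouped {
      Total = COUNT(flights_joined);
      delayed_flights = FILTER flights_joined BY flights::CarrierDelay > 0;
      avg_CarrierDelay = AVG(flights_joined.flights::CarrierDelay);
      Delayed = COUNT(delayed_flights);
      DelayedTotalRatio = (double)Delayed / (double)Total;
      GENERATE FLATTEN(group) AS (Code, Description), avg_CarrierDelay AS avg_CarrierDelay, Delayed AS Delayed, Total AS Total, DelayedTotalRatio AS DelayedTotalRatio;
  };

  ordered_result = ORDER result BY DelayedTotalRatio DESC;

  DUMP ordered_result;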

Let me try to explain the above code (item numbers match line numbers of the code block):
Line 1) I define "result" as the result data set. Pig should process each row in the "flight_grouped" data set.
Line 2) The number of flights for each airline will be counted and stored as Total.
Line 3) A new data set for delayed flights will be created.
Line 4) Average delay will be calculated and stored as avg_CarrierDelay.
Line 5) The number of delayed flights will be counted and stored as Delayed.
Line 6) The delayed/total flight ratio is calculated and stored as DelayedTotalRatio.
Line 7) The row containing (Airline) Code, Description, average delay time, number of delayed flights, total number of flights and the ratio is returned as a row (for the result set).
Line 10) A new data set (ordered_result) is generated by ordering the result data set in descending order by the DelayedTotalRatio column.
Line 12) DUMP is used to output the result (to the screen). It will trigger the MapReduce job produced by Pig.

This is the whole script (that you can copy and paste into grunt):
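(Putting the sketches above together, the whole script, under the same assumptions about paths and field positions, would look roughly like this.)

  carriers_file = LOAD '/user/zeppelin/carriers.csv' USING PigStorage(',');
  carriers = FOREACH carriers_file GENERATE REPLACE($0, '"', '') AS Code, REPLACE($1, '"', '') AS Description;

  flights_file = LOAD '/user/zeppelin/2008.csv' USING PigStorage(',');
  flights = FOREACH flights_file GENERATE
      (int)$0                       AS Year,
      (int)$1                       AS Month,
      (int)$2                       AS DayofMonth,
      (chararray)$8                 AS UniqueCarrier,
      (int)REPLACE($14, 'NA', '0')  AS ArrDelay,
      (int)REPLACE($15, 'NA', '0')  AS DepDelay,
      (int)REPLACE($24, 'NA', '0')  AS CarrierDelay;

  flights_joined = JOIN carriers BY Code, flights BY UniqueCarrier;
  flight_grouped = GROUP flights_joined BY (carriers::Code, carriers::Description);

  result = FOREACH flight_grouped {
      Total = COUNT(flights_joined);
      delayed_flights = FILTER flights_joined BY flights::CarrierDelay > 0;
      avg_CarrierDelay = AVG(flights_joined.flights::CarrierDelay);
      Delayed = COUNT(delayed_flights);
      DelayedTotalRatio = (double)Delayed / (double)Total;
      GENERATE FLATTEN(group) AS (Code, Description), avg_CarrierDelay AS avg_CarrierDelay, Delayed AS Delayed, Total AS Total, DelayedTotalRatio AS DelayedTotalRatio;
  };

  ordered_result = ORDER result BY DelayedTotalRatio DESC;
  DUMP ordered_result;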

The following are the statistics of the MapReduce job produced when you run the last statement (DUMP) of the script:

The whole process took less than 3 minutes, and here's the result:

Pig Latin is a very powerful language for processing data, but it requires a different mindset than SQL. I hope this blog post helps you, or at least gives you an idea about Pig.


