Introduction to Oracle Big Data Cloud Service – Compute Edition (Part VI) – Hive

I though I would stop writing about “Oracle Big Data Cloud Service – Compute Edition” after my fifth blog post, but then I noticed that I didn’t mention about the Apache Hive, another important component of the Big Data. Hive is a data warehouse infrastructure built on top of Hadoop, designed to work with large datasets. Why is it so important? Because it includes support for SQL (SQL:2003 and SQL:2011), and helps users to utilize existing SQL skillsets to quickly derive value from big data.

Although new improvements of Hive project enables sub-second query retrieval (Hive LLAP) but it’s not designed for online transaction processing (OLTP) workloads. Hive is best used for traditional data warehousing tasks.

In this blog post, I’ll demonstrate how we can import data from CSV files into hive tables, and run SQL queries to analyze the date stored in these tables.

We can use a 3rd party Hive client (such as Zeppelin, HUE, DBeaver), ambari Hive view or Hive command line to connect Hive server. I’ll use Zeppelin notebook although it has some limitations when running with Hive: Each block can only run one hive SQL (you can not run several queries by separating them with a semicolon), and in one notebook, you can run up to 10 queries.

If you already read my previous blog posts about the Oracle Big Data Cloud Service – Compute Edition, you probably remember that I used a dataset about flights. This time, I’ll use a different dataset. I found movielens database (a movie recommendation service) on the net, and it’s much better than the one I used in my previous posts. It contains movie information and ratings/votes for each movie. It’s very clean and simple to understand.

First, I create a new Zeppelin notebook named Hive (its name doesn’t matter), and in first code block, I write a simple bash scrip to download and unzip the movielens database. Here’s the code:

In case you missed my blog post about Zeppelin, the first statement (%sh) indicates that this is a shell block, so Zeppelin uses shell interpreter to run the content of the block. When we run the above block, the ml-latest.zip will be downloaded and unzipped into the “/tmp/ml-latest” directory. We’ll only use “movies.csv” and “ratings.csv”.

We have various options to import data to Hive. For example, we can use “PIG” to import the data. We could also create “external tables” which pointing the existing CSV files, but I’ll use Hive’s native LOAD command to import the data.

When using LOAD command, I’ll use OpenCSVSerde. It’s a SerDe designed to read CSV files. You probably wonder what SerDE is. SerDe is short for Serializer/Deserializer. Hive uses the SerDe interface for reading and writing data in any custom format. OpenCSVSerde treats all columns to be of type String so I’ll create temporary tables, load data to them using OpenCSVSerde and then convert the columns of the tables to proper data types.

The above blocks will create the Hive tables to hold the movies and ratings data to the tables. As you can see, the ratings table has 4 columns (userId, movieId, rating, timestamp) and the movies table has 3 columns (movieId, title, genres). “skip.header.line.count” lets us skip the first line (the header of CSV).

The above two blocks will import the data to the Hive files. The keyword “LOCAL”, tells that we read from a local directory on the server. If we want to read from an hdfs directory, we need to remove the keyword “LOCAL”. The keyword “OVERWRITE”, tells that we want to replace the existing data. If we don’t use that keyword, it appends new data to the existing data.

By the way, Hive blocks are special blocks like Spark SQL blocks and their outputs are shown as data grid, and they come with buttons to generate graphs. On the other hand, the recent queries we executed, do not provide a resul set. So the output won’t be fancy and all we see is “an update count equals -1”. Ignore this -1 result. It’s not an error code. If you get a real error, Hive will return a clean error message.

So we imported the data to our hive tables, now we’ll create our real tables in ORC format. Hive supports several formats, you can pick one according to your needs. When I examine the data, I noticed that titles which contains a comma character have an interesting error. For examples, the title is stored as “American President, The (1995)” instead of “The American President (1995)”. I also fixed the titles with a simple CASE statement when copying data from temporary table. Here are the code blocks:

After running the above blocks, our tables will be ready so we can use traditional SQL queries to fetch data from these tables.

I tried to find the top 10 “Thriller” movies which are voted by at least 20.000 users. Here’s the code of the above block:

Have you noticed that there is a problem with the average ratings? All of them are “4”. First I thought it’s a problem about the data type of the rating column, but It seems it’s just a bug of the current Zeppelin’s data grid.

I logged in to the cloud server, switch to hive user and issue the “hive” command to start the Hive command line. When I run the above query from the hive command line, I got the correct output.

As you see, Hive is a great software for traditional SQL guys who want to work with Big Data, and it’s ready to use with the Oracle Big Data Cloud Service – Compute Edition.

Related Posts

Please share this post Share on Facebook4Share on Google+0Share on LinkedIn127Share on Reddit0Tweet about this on Twitter

Gokhan Atil is a database architect who has hands-on experience with both RDBMS and noSQL databases (Oracle, PostgreSQL, Microsoft SQL Server, Sybase IQ, MySQL, Cassandra, MongoDB and ElasticSearch), and strong background on software development. He is certified as Oracle Certified Professional (OCP) and is awarded as Oracle ACE (in 2011) and Oracle ACE Director (in 2016) for his continuous contributions to the Oracle users community.

1 Comment

Leave a Reply to chatchai Cancel reply

Your email address will not be published. Required fields are marked *