HIVE Installation And MySql configuration

As we all know HIVE is based on Hadoop and works on MapReduce and HDFS.

Step 1) Go to http://apache.claz.org/hive/stable/ and download apache-hive-1.2.1-bin.tar.gz

Step 2) Go to downloaded directory and extract

tar -xvf apache-hive-1.2.2-bin.tar.gz

Step 3) Different Configuration properties to be placed in Hive.

In this step, we are going to do two things

  1. Placing Hive Home path in bashrc file
  2. Placing Hadoop Home path location in hive-config.sh
  • Open the bashrc file as shown in above screenshot
  • Mention Hive home path i.e., HIVE_HOME path in bashrc file and export it as shown in below

Code to be placed in bashrc

export HIVE_HOME=”/home/radhe/apache-hive-1.2.0-bin”

export PATH=$PATH:$HIVE_HOME/bin

Step 4: Download and install derby/MySQL Database

Why to Use MySQL in Hive as Meta store:

  • By Default, Hive comes with derby database as metastore.
  • Derby database can support only single active user at a time
  • Derby is not recommended in production environment

So the solution here is

  • Use MYSQL as Meta storage at backend to connect multiple users with Hive at a time
  • MYSQL is Best choice for the standalone metastore.

Step 5) Download MySQL and install. Once installation done, Create MySQL user

To install the MySQL connector on a Debian/Ubuntu system:

Install mysql-connector-java and symbolically link the file into the /usr/lib/hive/lib/ directory.

$ sudo apt-get install libmysql-java
$ ln -s /usr/share/java/libmysql-java.jar /usr/lib/hive/lib/libmysql-java.jar

Execute the commands as shown below,

mysql> CREATE USER ‘hiveuser’@’%’ IDENTIFIED BY ‘hivepassword’;

mysql> GRANT all on *.* to ‘hiveuser’@localhost identified by ‘hivepassword’;

mysql> flush privileges;

Step 6) Configuring hive-site.xml

  • After Step 5 assign username and password to MySQL database and given privileges.
  • Here we will configure some properties in Hive to get a connection with MySQL database.

Place this code in hive-site.xml

hive-site.xml

<configuration>
	<property>
		<name>javax.jdo.option.ConnectionURL</name>
		<value>jdbc:mysql://localhost/metastore?createDatabaseIfNotExist=true</value>
		<description>metadata is stored in a MySQL server</description>
	</property>
	<property>
		<name>javax.jdo.option.ConnectionDriverName</name>
		<value>com.mysql.jdbc.Driver</value>
		<description>MySQL JDBC driver class</description>
	</property>
	<property>
		<name>javax.jdo.option.ConnectionUserName</name>
		<value>hiveuser</value>
		<description>user name for connecting to mysql server</description>
	</property>
	<property>
		<name>javax.jdo.option.ConnectionPassword</name>
		<value>hivepassword</value>
		<description>password for connecting to mysql server</description>
	</property>
</configuration>

Step 7) Create table using hive command
hive> create table products(id int, name string);
Step 8) Go to MySql prompt and execute following command
mysql> use metastore ;//select database which is hive using
mysql>show tables;
+——————————+
Tables in metastore
+——————————+
.
.
Products

 

WHAT IS HIVE?

Hive is a data warehousing tool based on Hadoop. As we know Hadoop provides massive scale out on distributed infrastructure with high degree of fault tolerance for data storage and processing. Hadoop uses Map Reduce algorithm to process huge amount of data with minimal cost as it does not require high end machines to process such amount of data. Hive processor converts most of its queries into a Map Reduce job which runs on Hadoop cluster. Hive is designed for easy and effective data aggregation, ad-hoc querying and analysis of huge volumes of data.

Hive Is Not

Even though Hive gives SQL dialect it does not give SQL like latency as it ultimately runs Map Reduce programs underneath. As we all know, Map Reduce framework is built for batch processing jobs it has high latency, even the fastest hive query would take several minutes to get executed on relatively smaller set of data in few megabytes. We cannot simply compare the performance of traditional SQL systems like Oracle, MySQL or SQL Server as these systems are meant to do something and Hive is meant to do else. Hive aims to provide acceptable (but not optimal) latency for interactive querying over small data sets for sample queries.

hive is not an OLTP (Online transaction Processing) application and not meant to be connected with systems which needs interactive processing. It is meant to be used to process batch jobs on huge data which is immutable. A good example of such kind of data would be Web logs, Application Logs, call data records (CDR) etc.

 

Features of Hive

  • Stores schema in database and processed data into HDFS.
  • Designed for OLAP
  • Provides SQL type language for querying called HiveQL
  • Fast, Scalable and extensible

Hive supports Data definition Language(DDL), Data Manipulation Language(DML) and user defined functions.

  • DDL: create table, create index, create views.
  • DML: Select, Where, group by, Join, Order By
  • Pluggable Functions:
    • UDF: User Defined Function
    • UDAF: User Defined Aggregate Function
    • UDTF: User Defined Table Function