ZooKeeper Architecture and Basics

A Distributed Coordination Service for Distributed Applications

ZooKeeper is a distributed, open-source coordination service for distributed applications.

It provides services for:

  • Distributed synchronization
  • Configuration maintenance
  • Coordination that is not prone to race conditions and deadlocks
  • Naming services
  • Group services

Why Is ZooKeeper Required?

In a non-distributed application, we use multiple threads to process tasks for better performance, and we keep shared resources consistent through synchronization.

Access to shared resources is synchronized through some logic, API, and so on. Thread synchronization, however, works only within a single process space, not across separate processes. To synchronize a system in a clustered environment, or across processes running on different machines, we need a coordination service that synchronizes and monitors those processes and keeps the distributed application consistent. ZooKeeper, or a similar system, fills that role.


ZooKeeper allows distributed processes to coordinate with each other through a shared hierarchical namespace which is organized similarly to a standard file system. The namespace consists of data registers, called znodes in ZooKeeper parlance, and these are similar to files and directories. Unlike a typical file system, which is designed for storage, ZooKeeper data is kept in memory, which means ZooKeeper can achieve high throughput and low latency.

The servers that make up the ZooKeeper service must all know about each other. They maintain an in-memory image of state, along with transaction logs and snapshots in a persistent store. As long as a majority of the servers are available, the ZooKeeper service will be available.
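The majority rule above can be made concrete with a little arithmetic. The following is a minimal sketch, not ZooKeeper code; the function names quorum_size and tolerated_failures are made up for this illustration.

```python
def quorum_size(ensemble_size: int) -> int:
    """Smallest number of servers that forms a strict majority."""
    if ensemble_size < 1:
        raise ValueError("ensemble must contain at least one server")
    return ensemble_size // 2 + 1


def tolerated_failures(ensemble_size: int) -> int:
    """How many servers can be down while a majority is still available."""
    return ensemble_size - quorum_size(ensemble_size)


if __name__ == "__main__":
    # Print ensemble size, quorum size and tolerated failures side by side.
    for n in (1, 3, 4, 5, 7):
        print(n, quorum_size(n), tolerated_failures(n))
```

Note that a 4-server ensemble tolerates only one failure, the same as a 3-server ensemble, which is why odd ensemble sizes are the usual recommendation.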


Clients connect to a single ZooKeeper server. The client maintains a TCP connection through which it sends requests, gets responses, gets watch events, and sends heartbeats. If the TCP connection to the server breaks, the client will connect to a different server.



Leader Selection

The leader is elected by a majority (quorum) of the servers. If the leader goes down, the remaining servers elect a new leader from among themselves.

Atomic Broadcast
  • All write requests are forwarded to the leader
  • The leader broadcasts the request to the followers
  • When a majority of the servers have persisted the change:
    • The leader commits the update
    • The client gets a success response
  • Servers write to disk before updating memory.
  • A protocol similar to two-phase commit (Zab, ZooKeeper Atomic Broadcast) is used for atomic operations
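The write path in the list above can be sketched as a toy simulation. This is only an illustration of the propose / acknowledge / commit idea, not the real Zab protocol: the Follower and Leader classes are hypothetical, there is no networking, no recovery, and no leader election.

```python
class Follower:
    """Toy server: persists a proposal to a log, applies it on commit."""

    def __init__(self, name, online=True):
        self.name = name
        self.online = online
        self.log = []      # stands in for the on-disk transaction log
        self.state = {}    # stands in for the in-memory data tree

    def propose(self, txn):
        if not self.online:
            return False
        self.log.append(txn)   # "write to disk" before touching memory
        return True            # acknowledgement sent back to the leader

    def commit(self, txn):
        if self.online:
            path, value = txn
            self.state[path] = value


class Leader(Follower):
    def __init__(self, name, followers):
        super().__init__(name)
        self.followers = followers

    def write(self, path, value):
        txn = (path, value)
        acks = 1 if self.propose(txn) else 0          # the leader logs it too
        acks += sum(f.propose(txn) for f in self.followers)
        ensemble = len(self.followers) + 1
        if acks <= ensemble // 2:
            return False       # no majority persisted it: not committed
        self.commit(txn)
        for follower in self.followers:
            follower.commit(txn)
        return True            # the client would see a success response
```

With one leader and two followers, a write still succeeds while one follower is offline, and is rejected once both are offline, because the leader alone is not a majority of three.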


Data model and the hierarchical namespace

The namespace provided by ZooKeeper is much like that of a standard file system. A name is a sequence of path elements separated by a slash (/). Every node in ZooKeeper's namespace is identified by a path, much like a file or directory in a file system.

Each node (also called a znode) in a ZooKeeper namespace can have data associated with it as well as children.

The data stored at each znode in a namespace is read and written atomically. Reads get all the data bytes associated with a znode and a write replaces all the data. Each node has an Access Control List (ACL) that restricts who can do what.

ZooKeeper also has the notion of ephemeral nodes. These znodes exist as long as the session that created the znode is active. When the session ends, the znode is deleted.
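To make the data model concrete, here is a small in-memory sketch of a znode tree with ephemeral nodes. It is a toy model under stated assumptions, not the ZooKeeper client API: the ZNodeTree class, its method names, and the session ids are invented for this example, and ACLs and versions are left out.

```python
class ZNodeTree:
    """Toy namespace: path -> (data, owning session id or None)."""

    def __init__(self):
        self.nodes = {"/": (b"", None)}

    @staticmethod
    def parent(path):
        head = path.rsplit("/", 1)[0]
        return head or "/"

    def create(self, path, data=b"", ephemeral_owner=None):
        parent = self.parent(path)
        if path in self.nodes:
            raise KeyError(f"node exists: {path}")
        if parent not in self.nodes:
            raise KeyError(f"no parent for: {path}")
        if self.nodes[parent][1] is not None:
            raise ValueError("ephemeral nodes cannot have children")
        self.nodes[path] = (data, ephemeral_owner)

    def get_data(self, path):
        return self.nodes[path][0]          # a read returns all the bytes

    def set_data(self, path, data):
        _, owner = self.nodes[path]
        self.nodes[path] = (data, owner)    # a write replaces all the data

    def get_children(self, path):
        return sorted(p.rsplit("/", 1)[1] for p in self.nodes
                      if p != "/" and self.parent(p) == path)

    def close_session(self, session_id):
        if session_id is None:
            return
        # Ephemeral znodes disappear together with the session that made them.
        for p in [p for p, (_, o) in self.nodes.items() if o == session_id]:
            del self.nodes[p]
```

A typical use of ephemeral nodes is group membership: each worker creates an ephemeral child under a shared parent, and the list of children then reflects which workers still have a live session.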



ZooKeeper supports the concept of watches. Clients can set a watch on a znode. A watch is triggered and then removed when the znode changes. Watches fire only once, so a client that wants further notifications must register the watch again. When a watch is triggered, the client receives a packet saying that the znode has changed. If the connection between the client and one of the ZooKeeper servers is broken, the client will receive a local notification.
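The one-shot behavior of watches is easy to get wrong, so here is a minimal sketch of it. WatchedNode is a hypothetical stand-in for a single znode, not a real client class; only the event name NodeDataChanged is borrowed from ZooKeeper.

```python
class WatchedNode:
    """Toy znode whose data watches fire once and are then removed."""

    def __init__(self, data=b""):
        self.data = data
        self._watchers = []

    def get_data(self, watcher=None):
        # Reading with a watcher registers a one-shot watch on this node.
        if watcher is not None:
            self._watchers.append(watcher)
        return self.data

    def set_data(self, data):
        self.data = data
        # Clear the list before firing, so each watch is delivered only once.
        fired, self._watchers = self._watchers, []
        for watcher in fired:
            watcher("NodeDataChanged")
```

A client that wants to keep following a znode would read the data again inside its callback and pass the watcher along once more, which both picks up the latest value and re-registers the watch.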

Simple API

One of the design goals of ZooKeeper is to provide a very simple programming interface. As a result, it supports only these operations:

create
creates a node at a location in the tree

delete
deletes a node

exists
tests if a node exists at a location

get data
reads the data from a node

set data
writes data to a node

get children
retrieves a list of children of a node

sync
waits for data to be propagated

Checking Server (Follower) Status

Connect to a server with telnet and then issue a command.


Example: telnet ip port, for example telnet localhost 2181

Once connected, type the command string 'ruok'. A healthy server replies with 'imok'.
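The same check can be scripted instead of typed into telnet. The sketch below is one possible way to do it with a plain TCP socket; the helper name four_letter_word is made up here, and it assumes the server closes the connection after replying. On newer ZooKeeper releases the command may also need to be allowed through the 4lw.commands.whitelist setting.

```python
import socket


def four_letter_word(host: str, port: int, command: str = "ruok",
                     timeout: float = 5.0) -> str:
    """Send a four-letter command and return the server's text reply."""
    if len(command) != 4:
        raise ValueError("expected a four-letter command such as 'ruok'")
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(command.encode("ascii"))
        chunks = []
        while True:
            chunk = sock.recv(4096)
            if not chunk:          # an empty read means the peer closed
                break
            chunks.append(chunk)
    return b"".join(chunks).decode("utf-8", errors="replace")


# Example call, assuming a server is listening on localhost:2181:
#   four_letter_word("localhost", 2181, "ruok")
```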


Server statistic details


Thanks for visiting. I hope this clarifies the ZooKeeper concepts.


Hive Installation and MySQL Configuration

Hive is built on Hadoop and works on top of MapReduce and HDFS.

Step 1) Go to http://apache.claz.org/hive/stable/ and download apache-hive-1.2.1-bin.tar.gz

Step 2) Go to the download directory and extract the archive

tar -xvf apache-hive-1.2.1-bin.tar.gz

Step 3) Place the required configuration properties in Hive.

In this step, we are going to do two things:

  1. Place the Hive home path in the bashrc file
  2. Place the Hadoop home path in hive-config.sh
  • Open the bashrc file (typically ~/.bashrc)
  • Set the Hive home path, i.e. HIVE_HOME, in the bashrc file and export it as shown below

Code to be placed in bashrc:

export HIVE_HOME="/home/radhe/apache-hive-1.2.1-bin"

export PATH=$PATH:$HIVE_HOME/bin

Step 4) Download and install the Derby/MySQL database

Why use MySQL as the Hive metastore:

  • By default, Hive comes with a Derby database as its metastore.
  • The Derby database can support only a single active user at a time
  • Derby is not recommended in a production environment

So the solution here is:

  • Use MySQL as the backend metastore so that multiple users can connect to Hive at the same time
  • MySQL is the best choice for a standalone metastore.

Step 5) Download and install MySQL. Once the installation is done, create a MySQL user.

To install the MySQL connector on a Debian/Ubuntu system:

Install mysql-connector-java and symbolically link the file into the /usr/lib/hive/lib/ directory.

$ sudo apt-get install libmysql-java
$ ln -s /usr/share/java/libmysql-java.jar /usr/lib/hive/lib/libmysql-java.jar

Execute the commands as shown below:

mysql> CREATE USER 'hiveuser'@'%' IDENTIFIED BY 'hivepassword';

mysql> GRANT all on *.* to 'hiveuser'@localhost identified by 'hivepassword';

mysql> flush privileges;

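Step 6) Configuring hive-site.xml

  • In Step 5 we created a MySQL user, assigned it a password, and granted it privileges.
  • Here we will configure some properties in Hive to get a connection to the MySQL database.

Place this code in hive-site.xml. The property names are the standard Hive metastore JDBC settings. The values assume MySQL running on localhost, the metastore database used in Step 8, and the user created in Step 5; adjust them to your setup.

<configuration>
	<property>
		<name>javax.jdo.option.ConnectionURL</name>
		<value>jdbc:mysql://localhost/metastore?createDatabaseIfNotExist=true</value>
		<description>metadata is stored in a MySQL server</description>
	</property>
	<property>
		<name>javax.jdo.option.ConnectionDriverName</name>
		<value>com.mysql.jdbc.Driver</value>
		<description>MySQL JDBC driver class</description>
	</property>
	<property>
		<name>javax.jdo.option.ConnectionUserName</name>
		<value>hiveuser</value>
		<description>user name for connecting to mysql server</description>
	</property>
	<property>
		<name>javax.jdo.option.ConnectionPassword</name>
		<value>hivepassword</value>
		<description>password for connecting to mysql server</description>
	</property>
</configuration>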

Step 7) Create a table using the hive command

hive> create table products(id int, name string);

Step 8) Go to the MySQL prompt and execute the following commands

mysql> use metastore; -- select the database that Hive is using

mysql> show tables;
Tables in metastore



Hive is a data warehousing tool based on Hadoop. Hadoop provides massive scale-out on distributed infrastructure with a high degree of fault tolerance for data storage and processing. It uses the MapReduce algorithm to process huge amounts of data at minimal cost, since it does not require high-end machines. The Hive processor converts most of its queries into MapReduce jobs that run on the Hadoop cluster. Hive is designed for easy and effective data aggregation, ad hoc querying, and analysis of huge volumes of data.

Hive Is Not

Even though Hive offers a SQL dialect, it does not offer SQL-like latency, because it ultimately runs MapReduce programs underneath. The MapReduce framework is built for batch processing and has high latency; even the fastest Hive query can take several minutes on a relatively small data set of a few megabytes. We cannot simply compare its performance with traditional SQL systems like Oracle, MySQL, or SQL Server, as those systems are built for a different purpose than Hive. Hive aims to provide acceptable (but not optimal) latency for interactive querying over small data sets for sample queries.

Hive is not an OLTP (Online Transaction Processing) application and is not meant to be connected to systems that need interactive processing. It is meant for batch jobs on huge, immutable data. Good examples of such data are web logs, application logs, and call data records (CDR).


Features of Hive

  • Stores the schema in a database and the processed data in HDFS.
  • Designed for OLAP
  • Provides a SQL-type query language called HiveQL
  • Fast, scalable, and extensible

Hive supports Data Definition Language (DDL), Data Manipulation Language (DML), and user-defined functions.

  • DDL: create table, create index, create view.
  • DML: select, where, group by, join, order by
  • Pluggable functions:
    • UDF: User-Defined Function
    • UDAF: User-Defined Aggregate Function
    • UDTF: User-Defined Table-Generating Function






Hadoop vs. Spark

Spark is not, despite the hype, a replacement for Hadoop. Nor is MapReduce dead.

Spark can run on top of Hadoop, benefiting from Hadoop’s cluster manager (YARN) and underlying storage (HDFS, HBase, etc.). Spark can also run completely separately from Hadoop, integrating with alternative cluster managers like Mesos and alternative storage platforms like Cassandra and Amazon S3.

Much of the confusion around Spark’s relationship to Hadoop dates back to the early years of Spark’s development. At that time, Hadoop relied upon MapReduce for the bulk of its data processing. Hadoop MapReduce also managed scheduling and task allocation processes within the cluster; even workloads that were not best suited to batch processing were passed through Hadoop’s MapReduce engine, adding complexity and reducing performance.

MapReduce is really a programming model. In Hadoop MapReduce, multiple MapReduce jobs would be strung together to create a data pipeline. In between every stage of that pipeline, the MapReduce code would read data from the disk, and when completed, would write the data back to the disk. This process was inefficient because it had to read all the data from disk at the beginning of each stage of the process. This is where Spark comes into play. Taking the same MapReduce programming model, Spark was able to get an immediate 10x increase in performance, because it didn't have to store the data back to the disk, and all activities stayed in memory. Spark offers a far faster way to process data than passing it through unnecessary Hadoop MapReduce processes.
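To show what "programming model" means here, the following sketch runs a word count through map, shuffle, and reduce phases in plain Python. It is a single-process illustration only, with made-up function names; it uses neither Hadoop nor Spark, and everything stays in memory between phases.

```python
from collections import defaultdict


def map_phase(records, mapper):
    """Apply the mapper to every record and stream out (key, value) pairs."""
    for record in records:
        yield from mapper(record)


def shuffle(pairs):
    """Group all values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups, reducer):
    """Collapse each key's list of values into a single result."""
    return {key: reducer(key, values) for key, values in groups.items()}


def word_mapper(line):
    for word in line.lower().split():
        yield word, 1


def sum_reducer(_key, values):
    return sum(values)


def word_count(lines):
    return reduce_phase(shuffle(map_phase(lines, word_mapper)), sum_reducer)
```

In a pipeline of several such jobs, Hadoop MapReduce would write each job's result to disk and read it back for the next job, whereas here the output dictionary could simply be handed to the next stage in memory, which is the difference the paragraph above describes.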

Hadoop has since moved on with the development of the YARN cluster manager, thus freeing the project from its early dependence upon Hadoop MapReduce. Hadoop MapReduce is still available within Hadoop for running static batch processes for which MapReduce is appropriate. Other data processing tasks can be assigned to different processing engines (including Spark), with YARN handling the management and allocation of cluster resources.

Spark is a viable alternative to Hadoop MapReduce in a range of circumstances. Spark is not a replacement for Hadoop, but is instead a great companion to a modern Hadoop cluster deployment.

What Hadoop Gives Spark

Apache Spark is often deployed in conjunction with a Hadoop cluster, and Spark is able to benefit from a number of capabilities as a result. On its own, Spark is a powerful tool for processing large volumes of data. But, on its own, Spark is not yet well-suited to production workloads in the enterprise. Integration with Hadoop gives Spark many of the capabilities that broad adoption and use in production environments will require, including:

  • YARN resource manager, which takes responsibility for scheduling tasks across available nodes in the cluster;
  • Distributed File System, which stores data when the cluster runs out of free memory, and which persistently stores historical data when Spark is not running;
  • Disaster Recovery capabilities, inherent to Hadoop, which enable recovery of data when individual nodes fail. These capabilities include basic (but reliable) data mirroring across the cluster and richer snapshot and mirroring capabilities such as those offered by the MapR Data Platform;
  • Data Security, which becomes increasingly important as Spark tackles production workloads in regulated industries such as healthcare and financial services. Projects like Apache Knox and Apache Ranger offer data security capabilities that augment Hadoop. Each of the big three vendors has alternative approaches for security implementations that complement Spark. Hadoop's core code, too, is increasingly recognizing the need to expose advanced security capabilities that Spark is able to exploit;
  • A distributed data platform, benefiting from all of the preceding points, meaning that Spark jobs can be deployed on available resources anywhere in a distributed cluster, without the need to manually allocate and track those individual jobs.

What Spark Gives Hadoop

Hadoop has come a long way since its early versions which were essentially concerned with facilitating the batch processing of MapReduce jobs on large volumes of data stored in HDFS. Particularly since the introduction of the YARN resource manager, Hadoop is now better able to manage a wide range of data processing tasks, from batch processing to streaming data and graph analysis.

Spark is able to contribute, via YARN, to Hadoop-based jobs. In particular, Spark’s machine learning module delivers capabilities not easily exploited in Hadoop without the use of Spark. Spark’s original design goal, to enable rapid in-memory processing of sizeable data volumes, also remains an important contribution to the capabilities of a Hadoop cluster.

In certain circumstances, Spark’s SQL capabilities, streaming capabilities (otherwise available to Hadoop through Storm, for example), and graph processing capabilities (otherwise available to Hadoop through Neo4J or Giraph) may also prove to be of value in enterprise use cases.


CoordinatorLayout is a powerful FrameLayout that specifies behavior for child views for various interactions. All the components within your XML layout should be wrapped inside this CoordinatorLayout.

We can define how views inside the CoordinatorLayout relate to each other and what should happen when one of them changes. Behaviors identify which views another view depends on, and how it behaves when any of those views change.

Android Comparisons: Service, IntentService, AsyncTask & Thread


  • A Service is used to perform tasks with no UI. The task itself should not be long running; use threads within the service for long tasks.
  • To trigger it, call startService().
  • It can be triggered from any thread.
  • It runs on the main thread and may block it.


  • A Thread can be used for long-running tasks.
  • For tasks in parallel, use multiple threads (traditional mechanisms).
  • To trigger it, call the thread's start() method.
  • Threads must be managed manually.
  • It can be started from any thread.


  • IntentService is used for long-running tasks, usually with no communication to the main thread.
  • Clients send requests through startService(Intent) calls; the service is started as needed, handles each Intent in turn using a worker thread, and stops itself when it runs out of work.
  • To use it, extend IntentService and implement onHandleIntent(Intent).
  • IntentService receives the Intents and launches a worker thread. Only one request is processed at a time.
  • To communicate with the main thread, the worker thread can use a main-thread Handler or broadcast Intents.
  • The Intent is received on the main thread, and then the worker thread is spawned.
  • Multiple Intents are queued on the same worker thread.


  • AsyncTask allows you to perform background operations and publish results on the UI thread without having to manipulate threads and/or handlers.
  • One instance can be executed only once; execute() cannot be called again on the same object.
  • It must be created and executed from the main thread.
  • For tasks in parallel, use multiple instances or an Executor.

Android ListView with EditText loses focus when the keyboard displays

The issue arises when a ListView contains input-capable fields (EditText) that display the soft keyboard on focus: the EditText loses focus the first time, but gains focus and works the second time.


To solve this issue, add android:descendantFocusability="beforeDescendants" to your ListView and android:windowSoftInputMode="adjustPan" to your activity in the app manifest.

<activity android:name="com.yourActivity" android:windowSoftInputMode="adjustPan"/>


and in the ListView:

<ListView android:id="@+id/list" android:layout_width="fill_parent" android:descendantFocusability="beforeDescendants"
android:layout_height="fill_parent" android:dividerHeight="1.0dp"/>


You also have to listen for focus changes on the EditText with an OnFocusChangeListener and set a variable to keep track of the focused field. When the view is rendered again in your getView or getChildView method, you can set the focus back to that EditText.