Software Testing Class

Tutorial 11: Apache Oozie

Introduction

Apache Oozie is a workflow scheduler for Hadoop. It is used as a system to run workflows of dependent jobs. Apache Oozie lets users describe a workflow as a Directed Acyclic Graph (DAG) that specifies the dependencies between the jobs; depending on those dependencies, the jobs can be executed sequentially or in parallel on Hadoop. Apache Oozie consists of two parts:

  1. Workflow engine: The workflow engine is responsible for storing and executing workflows composed of Hadoop jobs, e.g., MapReduce, Pig, and Hive jobs.
  2. Coordinator engine: The coordinator engine executes workflow jobs based on predefined schedules and data availability (a minimal coordinator definition is sketched below).
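To make the coordinator engine more concrete, the following is a minimal sketch of a coordinator definition. It is not taken from this tutorial's application: the app name, dates, workflow path, and schema version (uri:oozie:coordinator:0.4) are illustrative assumptions.

Sketch of a coordinator definition (coordinator.xml)

<!-- Minimal coordinator sketch: runs the referenced workflow once a day
     between the start and end dates (names, dates, and paths are hypothetical). -->
<coordinator-app name="daily-max-temp-coord" frequency="${coord:days(1)}"
                 start="2021-01-01T00:00Z" end="2021-12-31T00:00Z"
                 timezone="UTC" xmlns="uri:oozie:coordinator:0.4">
  <action>
    <workflow>
      <!-- HDFS path of the deployed workflow application to trigger;
           nameNode and user.name are expected to come from the job properties file -->
      <app-path>${nameNode}/user/${user.name}/max-temp-workflow</app-path>
    </workflow>
  </action>
</coordinator-app>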

Apache Oozie is highly scalable: it can manage the timely execution of thousands of workflows within a Hadoop cluster, where each workflow may be composed of dozens of jobs. Apache Oozie is also very flexible: it lets you start, stop, suspend, and rerun jobs, so failed workflows can be rerun very easily. It even allows a rerun to skip a specific failed node, which makes it much easier to catch up on jobs that were missed or that failed because of an outage or planned downtime.

 

Features of Apache Oozie


The main features of Apache Oozie are summarized below.

  1. Scalability: it can manage the timely execution of thousands of workflows, each composed of dozens of jobs, within a Hadoop cluster.
  2. Flexibility: jobs can be started, stopped, suspended, and rerun.
  3. Easy recovery: failed workflows can be rerun, and a specific failed node can be skipped, so that missed or failed jobs are caught up after a failure or planned downtime.
  4. Integration with the Hadoop stack: it supports MapReduce, Pig, Hive, and Sqoop jobs as well as system-specific jobs such as Java programs and shell scripts.

 

Apache Oozie Operations

Apache Oozie runs as a service in the Hadoop cluster, and clients submit workflow definitions to it for immediate or later processing. An Oozie workflow is composed of action nodes and control-flow nodes.

  1. Action nodes: An action node represents a workflow task, such as running a MapReduce job, moving files into HDFS, running Hive or Pig jobs, importing or exporting data with Sqoop, or executing a shell script that triggers a program written in Java.
  2. Control-flow nodes: A control-flow node controls the workflow execution between actions. It permits constructs such as conditional logic, where different branches may be followed depending on the outcome of an earlier action node. The start node, end node, and error node fall into this category: the start node kick-starts the workflow job, the end node signals the end of the job, and the error node flags that an error occurred and gives the corresponding error message to be published. (A workflow that uses these node types is shown in the Workflow Example below.)

After the workflow execution has completed, Apache Oozie uses an HTTP callback to update the client with the workflow status. The callback may also be triggered on entry to or exit from action nodes.

Workflow Example
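The following is a minimal sketch of what the workflow definition (workflow.xml) might look like for a workflow with a single MapReduce action. It mirrors the max-temp example used in the steps below, but the schema version, mapper/reducer class names, and input/output paths are illustrative assumptions, not part of this tutorial's actual application.

Sketch of a workflow definition (workflow.xml)

<workflow-app name="max-temp-workflow" xmlns="uri:oozie:workflow:0.4">
  <!-- Start node: kicks off the workflow by transitioning to the first action -->
  <start to="max-temp-mr"/>

  <!-- Action node: runs a MapReduce job -->
  <action name="max-temp-mr">
    <map-reduce>
      <!-- jobTracker and nameNode are supplied via the job properties file -->
      <job-tracker>${jobTracker}</job-tracker>
      <name-node>${nameNode}</name-node>
      <configuration>
        <!-- Hypothetical mapper/reducer classes and HDFS paths, for illustration only -->
        <property>
          <name>mapred.mapper.class</name>
          <value>MaxTemperatureMapper</value>
        </property>
        <property>
          <name>mapred.reducer.class</name>
          <value>MaxTemperatureReducer</value>
        </property>
        <property>
          <name>mapred.input.dir</name>
          <value>/user/${wf:user()}/input/ncdc</value>
        </property>
        <property>
          <name>mapred.output.dir</name>
          <value>/user/${wf:user()}/output</value>
        </property>
      </configuration>
    </map-reduce>
    <!-- Control flow: where to go on success and on failure -->
    <ok to="end"/>
    <error to="fail"/>
  </action>

  <!-- Error (kill) node: ends the job and publishes an error message -->
  <kill name="fail">
    <message>MapReduce failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
  </kill>

  <!-- End node: signals the successful end of the job -->
  <end name="end"/>
</workflow-app>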

 

Deployment of an Oozie Workflow Application

An Oozie workflow application is comprised of the workflow definition and all the linked resources, such as Pig scripts and MapReduce JAR files. The workflow application must follow a simple directory structure and is deployed to HDFS so that it can be accessed by Apache Oozie.

Given below is an example of the directory structure used for the Oozie Workflow Application.

The directory structure used for an Oozie workflow application

<workflow name>/
├── lib/
│   └── hadoop-application-examples.jar
└── workflow.xml

Command to copy the workflow application to HDFS

% hadoop fs -put hadoop-application-examples/target/<workflow dir name> <workflow name>
 

How to run an Oozie Workflow Job?

We are going to discuss how to run Oozie workflow jobs. In order to run the jobs, we use the Oozie command-line tool, a simple client program that communicates with the Oozie server.

Step 1: Export the OOZIE_URL environment variable, which tells the oozie command which Oozie server to use.

OOZIE_URL environment variable

% export OOZIE_URL="http://localhost:11000/oozie"

Step 2: Run the Oozie workflow job using the -config option, which refers to a local Java properties file. The file contains definitions for the parameters used in the workflow XML file. When the job is submitted, the command prints the ID of the workflow job.

Oozie Workflow Job

% oozie job -config app/src/main/resources/max-temp-workflow.properties -run

Step 3: Set the oozie.wf.application.path property. It tells Oozie the location of the workflow application in HDFS.

Contents of the properties file

nameNode=hdfs://localhost:8020
jobTracker=localhost:8021
oozie.wf.application.path=${nameNode}/user/${user.name}/<workflow name>

Step 4: The status of a workflow job can be obtained by using the ‘job’ subcommand with the ‘-info’ option, i.e., specifying the job id after ‘-info’ as shown below. The job status will be shown as one of RUNNING, KILLED, or SUCCEEDED.

Status of Workflow Job

% oozie job -info <workflow job id>
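
In addition to checking a single job with ‘-info’, the Oozie command-line tool can also list recent jobs and their statuses (a related command, not part of the numbered steps above):

List recent Oozie jobs

% oozie jobs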

Step 5: Execute the following Hadoop command in order to view the result of a successful workflow execution.

Hadoop command to obtain the outcome of successful workflow execution

% hadoop fs -cat <location of the job execution result>
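
For example, if the workflow writes its result into an output directory under the user's HDFS home directory (a hypothetical location used purely for illustration), the result files could be printed as follows:

% hadoop fs -cat output/part-*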

Conclusion

In this tutorial, we discussed Apache Oozie's features, how to deploy a workflow application package so the scheduler can use it, and its operations with the help of suitable commands.

