Introduction
Apache Oozie is a workflow scheduler system for Hadoop. It is used to run workflows of dependent jobs. Apache Oozie allows users to define workflows as Directed Acyclic Graphs (DAGs), where the graph specifies the dependencies between the jobs. Depending on those dependencies, the jobs can be executed sequentially or in parallel on the Hadoop cluster. Apache Oozie consists of two parts.
- Workflow engine: The workflow engine is responsible for storing and executing workflows composed of Hadoop jobs, e.g., MapReduce, Pig, Hive, etc.
- Coordinator engine: The coordinator engine executes workflow jobs based on predefined schedules and data availability (a minimal coordinator definition is sketched below).
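To illustrate how the coordinator engine drives a workflow on a schedule, the sketch below shows a minimal coordinator definition that runs a workflow once a day. The coordinator name, the start/end dates, and the workflowAppPath parameter are illustrative assumptions, not values taken from this tutorial.

A minimal coordinator.xml sketch

<!-- Runs the referenced workflow once a day between the start and end dates. -->
<!-- The name, dates, and workflowAppPath parameter below are placeholders. -->
<coordinator-app name="daily-max-temp-coord"
                 frequency="${coord:days(1)}"
                 start="2021-01-01T00:00Z" end="2021-12-31T00:00Z"
                 timezone="UTC"
                 xmlns="uri:oozie:coordinator:0.4">
    <action>
        <workflow>
            <!-- HDFS path of the deployed workflow application; -->
            <!-- workflowAppPath would be defined in the coordinator's properties file. -->
            <app-path>${workflowAppPath}</app-path>
        </workflow>
    </action>
</coordinator-app>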
Apache Oozie is highly scalable: it can manage the timely execution of thousands of workflows, each composed of dozens of jobs, within a Hadoop cluster. Apache Oozie is also very flexible: jobs can be started, stopped, suspended, and rerun with ease. Failed workflows can be rerun easily, and a specific failed node can be skipped, which makes it straightforward to catch up on jobs that were missed or failed because of an outage or planned downtime.
Features of Apache Oozie
The following are the features of Apache Oozie.
- Apache Oozie has a client API and a command-line interface that can be used to launch, monitor, and control jobs from a Java application.
- Apache Oozie has a facility to run jobs that are scheduled to execute periodically.
- Apache Oozie has a facility to send email notifications upon the completion of the job.
- By using Oozie’s Web Service APIs, we can easily control jobs from anywhere.
Apache OOZIE Operations
Apache Oozie runs as a service in the Hadoop cluster; clients submit workflow definitions to it for immediate or later processing. An Apache Oozie workflow is composed of action nodes and control-flow nodes.
- Action Nodes: An action node represents a workflow task, such as running a MapReduce job, moving files into HDFS, running Hive or Pig jobs, importing or exporting data with Sqoop, or executing a shell script or a program written in Java.
- Control-flow Nodes: Control-flow nodes govern the execution flow between actions. They permit constructs such as conditional logic, where different branches may be followed depending on the outcome of an earlier action node. The Start node, End node, and Error node fall under this category: the Start node kicks off the workflow job, the End node signals the end of the job, and the Error node flags that an error occurred and specifies the error message to be published.
After workflow execution has completed, Apache Oozie uses an HTTP callback to update the client with the workflow status. The callback may also be triggered on entry to or exit from action nodes. A minimal workflow definition illustrating these node types is sketched below.
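For illustration, the sketch below shows what a minimal workflow.xml for a single MapReduce job might look like: a start node, one action node, a kill (error) node, and an end node. The workflow name follows the max-temp-workflow example used in the run steps later in this tutorial, but the mapper and reducer classes and the input/output paths are illustrative assumptions; ${nameNode} and ${jobTracker} are parameters supplied from the properties file shown later.

A minimal workflow.xml sketch

<!-- The workflow name, classes, and paths below are placeholders. -->
<workflow-app name="max-temp-workflow" xmlns="uri:oozie:workflow:0.5">
    <!-- Start node: kicks off the first action. -->
    <start to="max-temp-mr"/>

    <!-- Action node: runs a MapReduce job. -->
    <action name="max-temp-mr">
        <map-reduce>
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <configuration>
                <property>
                    <name>mapred.mapper.class</name>
                    <value>MaxTemperatureMapper</value>
                </property>
                <property>
                    <name>mapred.reducer.class</name>
                    <value>MaxTemperatureReducer</value>
                </property>
                <property>
                    <name>mapred.input.dir</name>
                    <value>/user/${wf:user()}/input/ncdc</value>
                </property>
                <property>
                    <name>mapred.output.dir</name>
                    <value>/user/${wf:user()}/output</value>
                </property>
            </configuration>
        </map-reduce>
        <!-- Control flow: go to the end node on success, to the error node on failure. -->
        <ok to="end"/>
        <error to="fail"/>
    </action>

    <!-- Error (kill) node: ends the job and publishes an error message. -->
    <kill name="fail">
        <message>MapReduce failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
    </kill>

    <!-- End node: signals successful completion of the workflow. -->
    <end name="end"/>
</workflow-app>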
Workflow Example
Deployment of an Oozie Workflow Application
An Oozie workflow application comprises the workflow definition and all the associated resources such as Pig scripts, MapReduce JAR files, etc. The workflow application must follow a simple directory structure and is deployed to HDFS so that Apache Oozie can access it.
Given below is an example of the directory structure used for the Oozie Workflow Application.
The directory structure used for Oozie Workflow Application
<workflow name>/
├── lib/
│   └── hadoop-application-examples.jar
└── workflow.xml
- The workflow.xml file is the workflow definition file and should always be placed in the top-level directory, i.e. the parent directory named after the workflow.
- The lib directory holds the JAR files containing the MapReduce classes.
- An Oozie workflow application that follows this layout can be built with any build tool such as Ant or Maven. The resulting build can then be copied to HDFS using the command given below.
Command to copy the build to HDFS
% hadoop fs -put hadoop-application-examples/target/<workflow name> <workflow name>
How to run an Oozie Workflow Job?
This section discusses how to run Oozie workflow jobs. To execute a job, we use the Oozie command-line tool, a simple client program that communicates with the Oozie server.
Step 1: Export the OOZIE_URL environment variable, which tells the oozie command which Oozie server to use.
OOZIE_URL environment variable
% export OOZIE_URL="http://localhost:11000/oozie"
Step 2: Run the Oozie workflow job using the -config option, which refers to a local Java properties file. The file contains definitions for the parameters used in the workflow XML file.
Oozie Workflow Job
% oozie job -config app/src/main/resources/max-temp-workflow.properties -run
Step 3: Set the oozie.wf.application.path property, which tells Oozie the location of the workflow application in HDFS.
Contents of the properties file
nameNode=hdfs://localhost:8020
jobTracker=localhost:8021
oozie.wf.application.path=${nameNode}/user/${user.name}/<workflow name>
Step 4: The status of the workflow job can be obtained by using the 'job' subcommand with the '-info' option, specifying the job id after '-info' as shown below. The job status will be shown as one of RUNNING, KILLED, or SUCCEEDED.
Status of Workflow Job
% oozie job -info <workflow job id>
Step 5: Execute the following Hadoop command to view the output of a successful workflow execution.
Hadoop command to view the output of a successful workflow execution
% hadoop fs -cat <location of the Job execution result>
Conclusion
In this chapter, we discussed Apache Oozie's features, how to package and deploy a workflow application for scheduling, and how to run and monitor workflow jobs with the appropriate commands.
Big Data Course Syllabus
- Introduction to BigData
- Introduction to Hadoop Architecture, and Components
- Hadoop installation on Windows
- HDFS Read & Write Operation using Java API
- Hadoop MapReduce
- Hadoop MapReduce First Program
- Hadoop MapReduce and Counter
- Apache Sqoop
- Apache Flume
- Hadoop Pig
- Apache Oozie
- Big Data Testing
Happy Testing!!!