Sunday 18 November 2018

Oozie in Hadoop-2

Why use Oozie instead of just cascading jobs one after another?




  • Major flexibility – Start, Stop, Suspend, and re-run jobs

  • Oozie allows you to restart from a failure – You can tell Oozie to restart a job from a specific node in the graph or to skip specific failed nodes


Other Features

• Java Client API/ Command Line Interface

– Launch, control, and monitor jobs from your Java Apps

• Web Service API

 – You can control jobs from anywhere

• Run Periodic jobs

 – Have jobs that you need to run every hour, day, week? Have Oozie run the jobs for you

• Receive an email when a job is complete

Oozie in Hadoop-1

What is Oozie?

  • Oozie is a workflow scheduler for Hadoop
  • Originally designed at Yahoo! for their complex search engine workflows.
  • Now it is an open-source Apache project.
  • Oozie allows a user to create Directed Acyclic Graphs (DAGs) of workflows, and these can be run in parallel and sequentially in Hadoop.
  • Oozie can also run plain Java classes, Pig workflows, and interact with HDFS.

  • Oozie can run jobs sequentially (one after the other) and in parallel (multiple at a time).
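As a sketch of what such a DAG looks like on disk, a minimal workflow.xml might be shaped like the following (the workflow name, action name, and transitions here are placeholders, not from this post):

```xml
<!-- Hypothetical minimal workflow: one MapReduce action, then end or fail -->
<workflow-app name="demo-wf" xmlns="uri:oozie:workflow:0.4">
    <start to="mr-step"/>
    <action name="mr-step">
        <map-reduce>
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
        </map-reduce>
        <!-- Transitions are how the DAG is expressed: ok/error edges -->
        <ok to="end"/>
        <error to="fail"/>
    </action>
    <kill name="fail">
        <message>Job failed: ${wf:errorMessage(wf:lastErrorNode())}</message>
    </kill>
    <end name="end"/>
</workflow-app>
```

Parallel branches are expressed with fork and join nodes in the same XML vocabulary.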

Monday 9 April 2018

Wednesday 13 September 2017

Reduce Side Join

 
A common situation in many companies is that transaction records are kept separate from the customer data. There is, of course, a relationship between the two; usually a transaction record contains the unique ID of the customer through which the sale was performed.

In the Hadoop world, these would be represented by two types of data files: one containing the transaction records (each carrying the ID of the customer involved), and the other containing the full data for each customer.

Reporting tasks frequently require data from both of these sources; say, for example, we wanted to see the total number of transactions and the total value for each customer, but associated with a name rather than an anonymous ID number. This may be valuable when customer service representatives wish to call the most frequent customers (data from the sales records) but want to be able to refer to the person by name and not just a number.


We can produce the report described above using a reduce-side join.



/* Mayank  */

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.MultipleInputs;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class ReduceJoin {

    // Emits (customer ID, "custs" + name) for each customer record
    public static class CustsMapper extends
            Mapper<Object, Text, Text, Text> {
        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            String record = value.toString();
            String[] parts = record.split(",");
            context.write(new Text(parts[0]), new Text("custs\t" + parts[1]));
        }
    }

    // Emits (customer ID, "txns" + amount) for each transaction record
    public static class TxnsMapper extends
            Mapper<Object, Text, Text, Text> {
        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            String record = value.toString();
            String[] parts = record.split(",");
            context.write(new Text(parts[2]), new Text("txns\t" + parts[3]));
        }
    }

    // Joins the tagged values for each customer ID: picks the name out of
    // the "custs" record and sums/counts the "txns" records
    public static class ReduceJoinReducer extends
            Reducer<Text, Text, Text, Text> {
        @Override
        public void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            String name = "";
            double total = 0.0;
            int count = 0;
            for (Text t : values) {
                String[] parts = t.toString().split("\t");
                if (parts[0].equals("txns")) {
                    count++;
                    total += Double.parseDouble(parts[1]);
                } else if (parts[0].equals("custs")) {
                    name = parts[1];
                }
            }
            String str = String.format("%d\t%f", count, total);
            context.write(new Text(name), new Text(str));
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "Reduce-side join");
        job.setJarByClass(ReduceJoin.class);
        job.setReducerClass(ReduceJoinReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);

        // Each input path gets its own mapper class
        MultipleInputs.addInputPath(job, new Path(args[0]), TextInputFormat.class, CustsMapper.class);
        MultipleInputs.addInputPath(job, new Path(args[1]), TextInputFormat.class, TxnsMapper.class);

        // Delete the output directory if it already exists, then set it on the job
        Path outputPath = new Path(args[2]);
        outputPath.getFileSystem(conf).delete(outputPath, true);
        FileOutputFormat.setOutputPath(job, outputPath);

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
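Assuming the class above has been packaged into a jar (the jar name and paths below are hypothetical), the job takes the customers file, the transactions file, and an output directory as its three arguments:

```shell
# args[0] = customers file, args[1] = transactions file, args[2] = output directory
hadoop jar reducejoin.jar ReduceJoin /input/custs /input/txns /output/join
```

The output is one line per customer: name, transaction count, and total value.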

Monday 27 March 2017

Hive Exercise-I


Mail me for the raw data.

These are the steps for performing the exercise:


i) Create Database
------------------
create database retail_store;


ii) Select Database
------------------
use retail_store;

iii) Create table for storing transactional records
-------------------------------------------------
create table txn_records(txnno INT, txndate STRING, custno INT, amount DOUBLE, 
category STRING, product STRING, city STRING, state STRING, spendby STRING)
row format delimited
fields terminated by ','
stored as textfile;

iv) Load the data into the table
-------------------------------
LOAD DATA LOCAL INPATH 'txns' OVERWRITE INTO TABLE txn_records;

v) Describing metadata or schema of the table
---------------------------------------------
describe txn_records;

vi) Counting the number of records
-------------------------
select count(*) from txn_records;

vii) Counting total spending by category of products
--------------------------------------------------
select category, sum(amount) from txn_records group by category;

viii) Top 10 customers by total spend
-------------------------------------
select custno, sum(amount) as total from txn_records group by custno order by total desc limit 10;




Friday 24 March 2017

Hive Introduction (Cont...)

Hive Introduction-2

The main components of Hive are:
  • Metastore: It stores all the metadata of Hive: information about databases, tables, columns, etc.
  • Driver: It includes the compiler, optimizer and executor used to break down Hive query language statements.
  • Query compiler: It compiles HiveQL into a directed acyclic graph of MapReduce tasks.
  • Execution engine: It executes the tasks produced by the compiler.
  • Thrift server: It provides an interface through which external applications can connect to Hive via JDBC/ODBC drivers.
  • Command line interface: It is also called the Hive shell. It is used for working with data either interactively or in batch processing.
  • Web interface: It is a visual interface to Hive used for interaction with data.
Data Storage in Hive:
Hive has different forms of storage options and they include:
  • Metastore: The metastore keeps track of all the metadata of databases, tables, columns, datatypes, etc. in Hive. It also keeps track of HDFS mappings.
  • Tables: There can be 2 types of tables in Hive. First, normal (managed) tables, like any other table in a database. Second, external tables, which are like normal tables except for deletion: an external table is only a pointer to data that already resides in HDFS, so when an external table is dropped only its metadata is deleted and the data stays in HDFS, whereas dropping a normal table deletes its data as well.
  • Partitions: A partition is a slice of a table stored in its own subdirectory within the table's directory. It enhances query performance, especially for select statements with a "WHERE" clause.
  • Buckets: Buckets are hashed partitions, and they speed up joins and sampling of data.
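As an illustration of partitions and buckets together, a table could be declared like this (the table and column names are hypothetical, not part of the exercise above):

```sql
-- Hypothetical table: partitioned by state, bucketed by customer number
create table txns_by_state (
    txnno int,
    custno int,
    amount double
)
partitioned by (state string)
clustered by (custno) into 16 buckets
row format delimited
fields terminated by ','
stored as textfile;
```

Each state's rows land in their own subdirectory, and within each partition the rows are hashed on custno into 16 bucket files.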
Hive vs. RDBMS (Relational database)
Hive and RDBMS appear similar on the surface, but they serve different applications and are based on different schema models.
  • RDBMS are built for OLTP (online transaction processing), that is, real-time reads and writes in a database. They can also perform some OLAP (online analytical processing).
  • Hive is built for OLAP, that is, analytical reporting over large datasets. Hive does not support inserting into an existing row or updating table data the way an RDBMS does, which is an important part of the OLTP process. All data is either inserted into a new table or overwritten in an existing table.
  • RDBMS is based on schema-on-write, which means that when data is entered into a table it is checked against the table's schema to ensure it meets the requirements. Thus loading data into an RDBMS is slower, but reading is very fast.
  • Hive is based on schema-on-read, which means data is not checked when it is loaded, so data loading is fast but reading is slower.
Hive Query Language (HQL)
HQL is very similar to traditional SQL. It stores data in tables, where each table consists of columns and each column consists of a specific number of rows. Each column has its own data type. Hive supports primitive as well as complex data types. Primitive types like Integer, Bigint, Smallint, Tinyint, Float, Double, Boolean, String, and Binary are supported. Complex types include associative arrays (map), structs (struct), and lists (array).
Data Definition (DDL) statements like create table, alter table, and drop table are supported. These DDL statements can be used on databases, tables, partitions, views, functions, indexes, etc. Data Manipulation (DML) statements like load, insert, select, and explain are supported. Load is used for taking data from the file system and moving it into Hive. Insert is used for moving data from one Hive table to another. Select is used for querying data. Explain shows the execution plan for a query.
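As a small sketch of these statements working together (the table names and paths here are hypothetical):

```sql
-- DDL: an external table whose data stays in HDFS even after a DROP
create external table staging_txns (txnno int, amount double)
row format delimited fields terminated by ','
location '/data/staging/txns';

-- DML: load a local file, then copy the rows into another (hypothetical) table
load data local inpath '/tmp/txns.csv' into table staging_txns;
insert overwrite table txn_archive select * from staging_txns;

-- Inspect the execution plan of a query
explain select count(*) from staging_txns;
```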

Hive Introduction

Hive is a data warehouse infrastructure tool for processing structured data in Hadoop. It resides on top of Hadoop to summarise Big Data and make querying and analysing it easy.

Hive Architecture



Hive was started by Facebook and is now an open-source Apache project.