Saturday, 5 March 2016

Hadoop Multi Node Cluster Installation

A multi node installation involves one master and many slaves. To build this setup, we first install a single node cluster on each node and test it, then merge the nodes with the required settings so that one node acts as the master and the others as slaves. Because a single node cluster is less complex, it is much easier to track down any problems on each machine before the nodes are combined.

Step 1: Configuring Single Node Cluster for each node

Download and configure a single node cluster on each node of this cluster by following http://rachana706.blogspot.in/2015/05/first-step-with-apache-hadoop.html

Step 2: Networking

It is easiest to put all the machines on the same network with regard to hardware and software configuration. Here we take three machines as an example: connect them via a single hub or switch and configure the network interfaces to use a common network such as 192.168.1.x/24. We then need to add the IP address of each node (master node and slaves) to the file /etc/hosts on every node.

$ sudo nano /etc/hosts
Add the following lines to this file:
192.168.1.100 master
192.168.1.101 slave1
192.168.1.102 slave2
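
Before moving on, it can help to confirm that every node name really is present in the file on each machine. The following is a minimal sketch and not part of the original setup; check_hosts_entries is a hypothetical helper name, and the host names are the ones assumed in this example.

```shell
#!/bin/sh
# check_hosts_entries FILE NAME...
# Prints each NAME that has no entry in FILE (hosts-file format:
# "IP name [alias...]"). Returns 0 only if every name was found.
check_hosts_entries() {
    file="$1"
    shift
    missing=0
    for name in "$@"; do
        # Skip comment lines, then look for the name in any column after the IP.
        if ! awk -v n="$name" '
                /^[[:space:]]*#/ { next }
                { for (i = 2; i <= NF; i++) if ($i == n) found = 1 }
                END { exit found ? 0 : 1 }' "$file"; then
            echo "missing: $name"
            missing=1
        fi
    done
    return "$missing"
}

# Intended use on each node:
#   check_hosts_entries /etc/hosts master slave1 slave2
```

If the function prints nothing, all three names are listed; any "missing:" line points at an entry that still has to be added.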

Step 3: SSH Access

The master node must be able to connect to itself and to all slaves in order to start services and perform cluster management tasks. For this we need to add the master's SSH public key to the authorized_keys file of all the slaves.

hduser@master:$ ssh-copy-id -i $HOME/.ssh/id_rsa.pub hduser@slave1
hduser@master:$ ssh-copy-id -i $HOME/.ssh/id_rsa.pub hduser@slave2
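
The commands above assume that hduser already has a key pair on the master. As a rough sketch, the same step can be wrapped so that a key is generated when none exists and then copied to every slave. The names run, distribute_key, KEY and DRY_RUN are hypothetical and introduced only for this illustration; with DRY_RUN=1 the commands are printed instead of executed.

```shell
#!/bin/sh
# Hypothetical helper: copy the master's public key to each slave.
# KEY is the private key path; DRY_RUN=1 prints commands instead of running them.
KEY="${KEY:-$HOME/.ssh/id_rsa}"

run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

distribute_key() {
    # Create an RSA key pair with an empty passphrase if none exists yet.
    if [ ! -f "$KEY" ]; then
        run ssh-keygen -t rsa -N "" -f "$KEY"
    fi
    # Append the public key to authorized_keys of hduser on every given host.
    for host in "$@"; do
        run ssh-copy-id -i "$KEY.pub" "hduser@$host"
    done
}

# Intended use on the master, as hduser:
#   distribute_key slave1 slave2
```

An empty passphrase is assumed here only because the start scripts need to log in without prompting; whether that is acceptable depends on your environment.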

Step 4: Test SSH connection

We can test our configuration by opening an SSH session from the master node to itself and to each slave node.

hduser@master:$ ssh master
hduser@master:$ ssh slave1
hduser@master:$ ssh slave2
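
After the first manual logins have been done (they also accept each host key), the check can be repeated for all nodes in one go. This is only a sketch: check_ssh and the SSH_BIN override are hypothetical names, and the five-second timeout is an arbitrary choice.

```shell
#!/bin/sh
# Hypothetical helper: try a non-interactive login to each host and report the result.
# SSH_BIN can be overridden (for example for a dry run); it defaults to ssh.
check_ssh() {
    failed=0
    for host in "$@"; do
        # BatchMode=yes makes ssh fail instead of prompting for a password.
        if "${SSH_BIN:-ssh}" -o BatchMode=yes -o ConnectTimeout=5 "$host" true; then
            echo "ok: $host"
        else
            echo "FAILED: $host"
            failed=1
        fi
    done
    return "$failed"
}

# Intended use on the master, as hduser:
#   check_ssh master slave1 slave2
```

A "FAILED" line usually means the key was not copied to that host or the host name does not resolve.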

Step 6: Configuring Files For All Nodes

Make the following changes on every node (master and slaves).
  • Open conf/core-site.xml and edit it.
    $ nano conf/core-site.xml
    Add between <configuration></configuration> tags:
    <property>
    <name>fs.default.name</name>
    <value>hdfs://master:9000</value>
    </property> 
  • Open conf/yarn-site.xml and add the following properties to this file.
    $ nano conf/yarn-site.xml
    Add between <configuration></configuration> tags:
    <property>
    <name>yarn.resourcemanager.resource-tracker.address</name>
    <value>master:8025</value>
    </property>
    
    <property>
    <name>yarn.resourcemanager.scheduler.address</name>
    <value>master:8030</value>
    </property>
    
    <property>
    <name>yarn.resourcemanager.address</name>
    <value>master:8050</value>
    </property>
     
  • Open conf/hdfs-site.xml and change the value of the replication property (dfs.replication) from 1 to 3.
    $ nano conf/hdfs-site.xml
    Replace 1 with 3 inside the <property></property> block of this property.
    
  • Open conf/mapred-site.xml
    $ nano conf/mapred-site.xml
    Add between <configuration></configuration> tags:
    <property>
    <name>mapred.job.tracker</name>
    <value>master:54311</value>
    </property>
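
To see at a glance what the finished entries should look like, the settings from this step can be printed with a small helper. This is a sketch for illustration only: hadoop_property is a hypothetical function, and it prints text for copying rather than editing any file.

```shell
#!/bin/sh
# hadoop_property NAME VALUE
# Prints one <property> element in the layout used by Hadoop's *-site.xml files.
hadoop_property() {
    printf '<property>\n  <name>%s</name>\n  <value>%s</value>\n</property>\n' "$1" "$2"
}

# The settings discussed above, grouped by the file they belong to.
echo "core-site.xml:";   hadoop_property fs.default.name hdfs://master:9000
echo "hdfs-site.xml:";   hadoop_property dfs.replication 3
echo "mapred-site.xml:"; hadoop_property mapred.job.tracker master:54311
```

Each printed element goes between the <configuration></configuration> tags of the file named above it.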

Step 7: Configuring Master Node Files

$ cd /usr/local/Hadoop/etc/Hadoop
  • Open conf/masters and add the name of the master node to it.
  • Open conf/slaves and add the names of all slave nodes to this file, one per line.
  • Open conf/hdfs-site.xml and remove the property related to the DataNode directory.
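
The masters and slaves files are plain lists with one host name per line, so they can also be written from the shell instead of an editor. A minimal sketch, assuming the host names from Step 2; write_node_list is a hypothetical helper, and it overwrites the target file.

```shell
#!/bin/sh
# write_node_list FILE HOST...
# Overwrites FILE with the given host names, one per line.
write_node_list() {
    file="$1"
    shift
    printf '%s\n' "$@" > "$file"
}

# Intended use on the master, from the Hadoop configuration directory:
#   write_node_list masters master
#   write_node_list slaves slave1 slave2
```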

Step 8: Configuring Slave Node Files

Open conf/hdfs-site.xml and remove the property for the NameNode directory.

Step 9: Format the new Hadoop distributed file system

Restart the terminal as the Hadoop user, then format the file system on the master node:

$ hdfs namenode -format

Step 10: Starting Hadoop cluster

$ start-all.sh
or
$ start-dfs.sh
$ start-yarn.sh

Step 11: Checking running services in Hadoop cluster

$ jps
Services for Master node:

18221 ResourceManager
20582 Jps
17706 SecondaryNameNode
17085 NameNode

Services for Slave node:
4916 DataNode
5053 NodeManager
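
Comparing the jps output by eye works for three machines; for a quick repeatable check, the expected daemon names can be tested against the output. The following is a sketch under the assumption that jps prints "PID Name" per line as shown above; missing_daemons is a hypothetical helper.

```shell
#!/bin/sh
# missing_daemons "EXPECTED NAMES" < jps-output
# Reads jps output ("PID Name" per line) on stdin and prints every expected
# daemon name that is not listed. Returns 0 only if none is missing.
missing_daemons() {
    running=$(awk '{ print $2 }')
    status=0
    for daemon in $1; do
        # -x matches whole lines, so NameNode does not match SecondaryNameNode.
        if ! printf '%s\n' "$running" | grep -qx "$daemon"; then
            echo "not running: $daemon"
            status=1
        fi
    done
    return "$status"
}

# Intended use:
#   on the master: jps | missing_daemons "NameNode SecondaryNameNode ResourceManager"
#   on a slave:    jps | missing_daemons "DataNode NodeManager"
```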

Wednesday, 17 February 2016

Spark Standalone mode installation on Microsoft Azure Linux VM

Apache Spark is an open source cluster computing framework. Originally developed at the University of California, Berkeley, the Spark codebase was later donated to the Apache Software Foundation, which has maintained it since. Spark provides an interface for programming entire clusters with implicit data parallelism and fault-tolerance.

Step 1: Download and install Spark


We need to choose a Spark build that matches our Hadoop version; otherwise the default download is the source package.

$ wget http://www.eu.apache.org/dist/spark/spark-1.5.2/spark-1.5.2-bin-hadoop2.6.tgz

Step 2: Extracting Spark


After the download finishes, typing ls in the terminal shows the file "spark-1.5.2-bin-hadoop2.6.tgz". To untar it, type the following:

$ tar xvf spark-1.5.2-bin-hadoop2.6.tgz

Step 3: Moving Spark to respective Directory


The following commands move the Spark software files to their directory (/usr/local/spark).

$ su -
Password:

$ cd /home/Hadoop/Downloads/
$ mv spark-1.5.2-bin-hadoop2.6 /usr/local/spark
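
The download, extract and move steps all have to name the same archive, and a mismatched version number is an easy mistake to make. One way to avoid it is to derive every name from a single pair of variables. This is a sketch only: SPARK_VERSION, HADOOP_PROFILE and the three functions are hypothetical names, and the mirror URL is the one used in Step 1, which may no longer carry this release.

```shell
#!/bin/sh
# Keep the version numbers in one place so that the download, extract and
# move steps always refer to the same archive.
SPARK_VERSION="${SPARK_VERSION:-1.5.2}"
HADOOP_PROFILE="${HADOOP_PROFILE:-hadoop2.6}"

# Directory name inside the archive, archive file name, and download URL.
spark_dir()     { echo "spark-${SPARK_VERSION}-bin-${HADOOP_PROFILE}"; }
spark_archive() { echo "$(spark_dir).tgz"; }
spark_url()     { echo "http://www.eu.apache.org/dist/spark/spark-${SPARK_VERSION}/$(spark_archive)"; }

# Intended use (commented out because it downloads and unpacks files):
#   wget "$(spark_url)"
#   tar xvf "$(spark_archive)"
#   mv "$(spark_dir)" /usr/local/spark
```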

Step 4: Working with Spark Dependencies

Apache Spark is a fast and general-purpose cluster computing system. It provides high-level APIs in Java, Scala, Python and R, and an optimized engine that supports general execution graphs. It also supports a rich set of higher-level tools including Spark SQL for SQL and structured data processing, MLlib for machine learning, GraphX for graph processing, and Spark Streaming.
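  • Downloading and installing Scala: download the archive, then extract it and move it to /usr/local/scala, the path used for SCALA_HOME below.
    $ wget http://www.scala-lang.org/files/archive/scala-2.11.7.tgz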
    
  • Update the bashrc file:
  • Add the paths of Spark and Scala to this file.
    $ nano ~/.bashrc
    Add the lines:
    export SCALA_HOME=/usr/local/scala
    export PATH=$PATH:$SCALA_HOME/bin
    export SPARK_HOME=/usr/local/spark
    export PATH=$PATH:$SPARK_HOME/bin
    
  • Verify installation:
    scala -version
    
  • Install Maven:
  • Apache Maven is a software project management and comprehension tool. Based on the concept of a project object model (POM), Maven can manage a project's build, reporting and documentation from a central piece of information.
     $ sudo apt-get install maven

  • Install Git
  • Git (/ɡɪt/) is a widely used source code management system for software development. It is a distributed revision control system with an emphasis on speed, data integrity, and support for distributed, non-linear workflows.
     $ sudo apt-get install git
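
Going back to the bashrc update in this step: if the setup is run more than once, the same export lines pile up in ~/.bashrc. A small helper can add a line only when it is not there yet. This is a sketch; append_once is a hypothetical name, and the example line is the SCALA_HOME export from the list above.

```shell
#!/bin/sh
# append_once LINE FILE
# Appends LINE to FILE unless an identical line is already present,
# so re-running the setup does not duplicate the exports.
append_once() {
    touch "$2"
    # -x: whole-line match, -F: treat LINE as a literal string, not a pattern.
    if ! grep -qxF -- "$1" "$2"; then
        printf '%s\n' "$1" >> "$2"
    fi
}

# Intended use (single quotes keep $PATH from being expanded too early):
#   append_once 'export SCALA_HOME=/usr/local/scala' ~/.bashrc
#   append_once 'export PATH=$PATH:$SCALA_HOME/bin' ~/.bashrc
```

After editing the file, open a new terminal or run source ~/.bashrc so that the variables take effect.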
    

Step 5: Build Assembly

Build the Spark assembly from the Spark directory. This step is needed only when Spark was downloaded as source; the pre-built package from Step 1 already includes the assembly JAR.

$ cd /usr/local/spark
$ sudo sbt/sbt assembly

Step 6: Start using Spark Shell

Spark’s interactive shell provides a simple way to learn the API, as well as a powerful tool to analyze datasets interactively. Start the shell by running ./bin/spark-shell in the Spark directory.

$ cd /usr/local/spark
$ ./bin/spark-shell