Apache Spark is an open-source cluster computing framework. Originally developed at the University of California, Berkeley, the Spark codebase was later donated to the Apache Software Foundation, which has maintained it since. Spark provides an interface for programming entire clusters with implicit data parallelism and fault tolerance.
Step 1: Download and install Spark
We need to choose a Spark package built against our Hadoop version; otherwise we will get the default source distribution.
$wget http://www.eu.apache.org/dist/spark/spark-1.5.2/spark-1.5.2-bin-hadoop2.6.tgz
Step 2: Extracting Spark
After downloading it, typing ls in the terminal shows the file "spark-1.5.2-bin-hadoop2.6.tgz". To untar it, type the following:
$tar xvf spark-1.5.2-bin-hadoop2.6.tgz
Step 3: Moving Spark to respective Directory
The following commands move the Spark software files to their respective directory (/usr/local/spark).
$su -
Give Password:
$cd /home/Hadoop/Downloads/
$mv spark-1.5.2-bin-hadoop2.6 /usr/local/spark
Step 4: Working with Spark dependencies
Apache Spark is a fast and general-purpose cluster computing system. It provides high-level APIs in Java, Scala, Python and R, and an optimized engine that supports general execution graphs. It also supports a rich set of higher-level tools, including Spark SQL for SQL and structured data processing, MLlib for machine learning, GraphX for graph processing, and Spark Streaming.
- Downloading and installing Scala
$wget http://www.scala-lang.org/files/archive/scala-2.11.7.tgz
Extract the archive and move it to /usr/local/scala:
$tar xvf scala-2.11.7.tgz
$mv scala-2.11.7 /usr/local/scala
- Update the bashrc file: add the paths of Scala and Spark to this file.
$nano ~/.bashrc
Add the lines:
export SCALA_HOME=/usr/local/scala
export PATH=$PATH:$SCALA_HOME/bin
export SPARK_HOME=/usr/local/spark
export PATH=$PATH:$SPARK_HOME/bin
- Verify installation:
$scala -version
- Install Maven: Apache Maven is a software project management and comprehension tool. Based on the concept of a project object model (POM), Maven can manage a project's build, reporting and documentation from a central piece of information.
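As a sketch of the POM concept mentioned above, a minimal pom.xml for a Spark application might look like the following (the groupId, artifactId and version of the project itself are illustrative placeholders, not taken from this tutorial; the Spark dependency matches the 1.5.2 build used above):

```xml
<!-- Minimal illustrative POM; project coordinates are placeholders. -->
<project xmlns="http://maven.apache.org/POM/4.0.0">
  <modelVersion>4.0.0</modelVersion>
  <groupId>com.example</groupId>
  <artifactId>spark-app</artifactId>
  <version>1.0-SNAPSHOT</version>
  <dependencies>
    <!-- Spark core for Scala 2.11, matching the version installed above. -->
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-core_2.11</artifactId>
      <version>1.5.2</version>
    </dependency>
  </dependencies>
</project>
```

With this file in place, mvn package builds the project from that single central piece of information.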
$sudo apt-get install maven
- Install git Git (/ɡɪt/) is a widely used source code management system for software development. It is a distributed revision control system with an emphasis on speed, data integrity, and support for distributed, non-linear workflows.
$sudo apt-get install git
Step 5: Build Assembly
If you downloaded a source distribution, build the Spark assembly with sbt from the Spark directory (a prebuilt -bin- package like the one above already ships assembled and can skip this step):
$cd /usr/local/spark
$sudo sbt/sbt assembly
Step 6: Start using Spark Shell
Spark's interactive shell provides a simple way to learn the API, as well as a powerful tool to analyze datasets interactively. Start the shell by running ./bin/spark-shell in the Spark directory:
$cd /usr/local/spark
$./bin/spark-shell
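Spark's RDD API closely mirrors Scala's own collection API, so as a first exercise at the shell prompt you can try the classic word count. The sketch below uses plain Scala collections so it also runs without a cluster; inside spark-shell you would start from sc.textFile("README.md") instead of the in-memory Seq (the sample lines are made up for illustration):

```scala
// Word count over an in-memory collection. In spark-shell the same
// chain of operations works on an RDD:
//   sc.textFile("README.md").flatMap(_.split("\\s+")).map((_, 1)).reduceByKey(_ + _)
val lines = Seq("spark is fast", "spark is general")

val counts = lines
  .flatMap(_.split("\\s+"))              // split each line into words
  .groupBy(identity)                     // group identical words together
  .map { case (w, ws) => (w, ws.size) }  // count the occurrences of each word

println(counts.toSeq.sortBy(_._1).mkString(", "))
```

Typing these lines at the scala> prompt prints each word with its count; swapping the Seq for an RDD gives the distributed version of the same computation.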