Wednesday 17 February 2016

Spark Standalone Mode Installation on a Microsoft Azure Linux VM

Apache Spark is an open source cluster computing framework. Originally developed at the University of California, Berkeley, the Spark codebase was later donated to the Apache Software Foundation, which has maintained it since. Spark provides an interface for programming entire clusters with implicit data parallelism and fault tolerance.

Step 1: Download and install Spark


We need to choose a Spark package built for our Hadoop version; otherwise we end up with the default source distribution. Here we use Spark 1.5.2 prebuilt for Hadoop 2.6:

$wget http://www.eu.apache.org/dist/spark/spark-1.5.2/spark-1.5.2-bin-hadoop2.6.tgz

Step 2: Extracting Spark


After the download finishes, typing ls in the terminal shows the archive "spark-1.5.2-bin-hadoop2.6.tgz". To untar it, type the following:

$tar xvf spark-1.5.2-bin-hadoop2.6.tgz

Step 3: Moving Spark to respective Directory


The following commands move the Spark software files to their target directory (/usr/local/spark).

$su -
Password:

$cd /home/Hadoop/Downloads/ 
$mv spark-1.5.2-bin-hadoop2.6 /usr/local/spark
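The download, extract, and move steps above all repeat the same version string, which is easy to get wrong. A minimal sketch of deriving every name from one variable, assuming the mirror URL from Step 1 (SPARK_VERSION, HADOOP_VERSION, PKG, and URL are illustrative names, not anything Spark defines):

```shell
# Derive the download URL, archive name, and extracted directory from
# a single version string so wget, tar, and mv cannot drift apart.
SPARK_VERSION=1.5.2
HADOOP_VERSION=2.6
PKG="spark-${SPARK_VERSION}-bin-hadoop${HADOOP_VERSION}"
URL="http://www.eu.apache.org/dist/spark/spark-${SPARK_VERSION}/${PKG}.tgz"

echo "$URL"       # pass to wget
echo "$PKG.tgz"   # pass to tar xvf
echo "$PKG"       # directory to mv to /usr/local/spark
```

To change Spark versions later, only SPARK_VERSION needs to be edited.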

Step 4: Installing Spark Dependencies

Apache Spark is a fast and general-purpose cluster computing system. It provides high-level APIs in Java, Scala, Python and R, and an optimized engine that supports general execution graphs. It also supports a rich set of higher-level tools including Spark SQL for SQL and structured data processing, MLlib for machine learning, GraphX for graph processing, and Spark Streaming.
  • Downloading and installing Scala (extract the archive and move it to /usr/local/scala so the SCALA_HOME path below works):
    $wget http://www.scala-lang.org/files/archive/scala-2.11.7.tgz
    $tar xvf scala-2.11.7.tgz
    $mv scala-2.11.7 /usr/local/scala
    
  • Update the ~/.bashrc file:
  • Adding the paths of Spark and Scala to this file.
    nano ~/.bashrc
    add lines:
    export SCALA_HOME=/usr/local/scala
    export PATH=$PATH:$SCALA_HOME/bin
    export SPARK_HOME=/usr/local/spark
    export PATH=$PATH:$SPARK_HOME/bin
    then reload the file:
    source ~/.bashrc
    
  • Verify installation:
    scala -version
    
  • Install Maven:
  • Apache Maven is a software project management and comprehension tool. Based on the concept of a project object model (POM), Maven can manage a project's build, reporting and documentation from a central piece of information.
$sudo apt-get install maven

  • Install git
  • Git (/ɡɪt/) is a widely used source code management system for software development. It is a distributed revision control system with an emphasis on speed, data integrity, and support for distributed, non-linear workflows. 
$sudo apt-get install git
    
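Before moving on, it is worth confirming that each dependency actually landed on the PATH. A minimal sketch (check_dep is a hypothetical helper name, not part of any tool):

```shell
# Print OK or MISSING for a given command name by probing the PATH.
check_dep() {
  if command -v "$1" >/dev/null 2>&1; then
    echo "OK: $1"
  else
    echo "MISSING: $1"
  fi
}

# Typical use after the installs above:
#   for dep in scala mvn git; do check_dep "$dep"; done
check_dep sh   # sh exists on any POSIX system, so this prints: OK: sh
```

If anything reports MISSING, re-run the corresponding install step (or re-source ~/.bashrc) before continuing.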

Step 5: Build Assembly

If you downloaded a prebuilt binary package (as in Step 1), Spark is ready to run and this step can be skipped. To build Spark from source instead, run the sbt assembly from the Spark directory:

$cd /usr/local/spark
$sudo sbt/sbt assembly

Step 6: Start using Spark Shell

Spark’s interactive shell provides a simple way to learn the API, as well as a powerful tool to analyze datasets interactively. Start the shell by running ./bin/spark-shell in the Spark directory.

$ cd /usr/local/spark
$ ./bin/spark-shell
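For a quick non-interactive first test, the shell can also read commands from a heredoc. A hedged sketch, assuming the /usr/local/spark layout from Step 3 (the Scala snippet simply counts the even numbers in 1..100; sc is the SparkContext that spark-shell creates automatically):

```shell
cd /usr/local/spark
# Feed a short Scala program to the interactive shell on stdin.
./bin/spark-shell <<'EOF'
val data = sc.parallelize(1 to 100)
println(data.filter(_ % 2 == 0).count())
EOF
```

If the installation is working, the job output appears among the shell's log lines.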
