Apache Spark is an open-source cluster computing framework. Originally developed at the University of California, Berkeley, the Spark codebase was later donated to the Apache Software Foundation, which has maintained it since. Spark provides an interface for programming entire clusters with implicit data parallelism and fault tolerance.
Step 1: Download and install Spark
We need to choose a Spark package built against our Hadoop version; otherwise we will get the default source distribution.
$wget http://www.eu.apache.org/dist/spark/spark-1.5.2/spark-1.5.2-bin-hadoop2.6.tgz
Step 2: Extracting Spark
After downloading it, typing ls in the terminal shows the file "spark-1.5.2-bin-hadoop2.6.tgz". To untar it, type the following:
$tar xvf spark-1.5.2-bin-hadoop2.6.tgz
Step 3: Moving Spark to respective Directory
The following commands move the Spark software files to their respective directory (/usr/local/spark).
$su -
Give Password:
$cd /home/Hadoop/Downloads/
$mv spark-1.5.2-bin-hadoop2.6 /usr/local/spark
Step 4: Working with Spark dependencies
Apache Spark is a fast and general-purpose cluster computing system. It provides high-level APIs in Java, Scala, Python and R, and an optimized engine that supports general execution graphs. It also supports a rich set of higher-level tools, including Spark SQL for SQL and structured data processing, MLlib for machine learning, GraphX for graph processing, and Spark Streaming.
- Downloading and installing Scala
$wget http://www.scala-lang.org/files/archive/scala-2.11.7.tgz
Extract the archive and move it to /usr/local/scala:
$tar xvf scala-2.11.7.tgz
$mv scala-2.11.7 /usr/local/scala
- Update the bashrc file: add the paths of Scala and Spark to this file.
$nano ~/.bashrc
Add the lines:
export SCALA_HOME=/usr/local/scala
export PATH=$PATH:$SCALA_HOME/bin
export SPARK_HOME=/usr/local/spark
export PATH=$PATH:$SPARK_HOME/bin
- Verify installation:
$scala -version
- Install Maven: Apache Maven is a software project management and comprehension tool. Based on the concept of a project object model (POM), Maven can manage a project's build, reporting and documentation from a central piece of information.
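As a sketch of the POM concept mentioned above, a minimal pom.xml for a Spark application might look like the following (the groupId, artifactId and version of the project itself are illustrative placeholders, not taken from this tutorial; the Spark dependency matches the 1.5.2 build used above):

```xml
<!-- Minimal illustrative POM; project coordinates are placeholders. -->
<project xmlns="http://maven.apache.org/POM/4.0.0">
  <modelVersion>4.0.0</modelVersion>
  <groupId>com.example</groupId>
  <artifactId>spark-app</artifactId>
  <version>1.0-SNAPSHOT</version>
  <dependencies>
    <!-- Spark core for Scala 2.11, matching the version installed above. -->
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-core_2.11</artifactId>
      <version>1.5.2</version>
    </dependency>
  </dependencies>
</project>
```

With this file in place, mvn package builds the project from that single central piece of information.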
$sudo apt-get install maven
- Install git Git (/ɡɪt/) is a widely used source code management system for software development. It is a distributed revision control system with an emphasis on speed, data integrity, and support for distributed, non-linear workflows.
$sudo apt-get install git
Step 5: Build Assembly
If you downloaded a source distribution, build the Spark assembly with sbt from the Spark directory (a prebuilt -bin- package like the one above already ships assembled and can skip this step):
$cd /usr/local/spark
$sudo sbt/sbt assembly
Step 6: Start using Spark Shell
Spark's interactive shell provides a simple way to learn the API, as well as a powerful tool to analyze datasets interactively. Start the shell by running ./bin/spark-shell in the Spark directory:
$cd /usr/local/spark
$./bin/spark-shell
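Spark's RDD API closely mirrors Scala's own collection API, so as a first exercise at the shell prompt you can try the classic word count. The sketch below uses plain Scala collections so it also runs without a cluster; inside spark-shell you would start from sc.textFile("README.md") instead of the in-memory Seq (the sample lines are made up for illustration):

```scala
// Word count over an in-memory collection. In spark-shell the same
// chain of operations works on an RDD:
//   sc.textFile("README.md").flatMap(_.split("\\s+")).map((_, 1)).reduceByKey(_ + _)
val lines = Seq("spark is fast", "spark is general")

val counts = lines
  .flatMap(_.split("\\s+"))              // split each line into words
  .groupBy(identity)                     // group identical words together
  .map { case (w, ws) => (w, ws.size) }  // count the occurrences of each word

println(counts.toSeq.sortBy(_._1).mkString(", "))
```

Typing these lines at the scala> prompt prints each word with its count; swapping the Seq for an RDD gives the distributed version of the same computation.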