How to Install Apache Spark on Ubuntu 18.04 | 20.04


In this article, we are going to show you how to install Apache Spark on Ubuntu 18.04 | 20.04.

Apache Spark is an open-source, general-purpose cluster computing framework for large-scale data processing and machine learning.

It comes with built-in modules and higher-level libraries, including support for SQL queries (Spark SQL), streaming data (Spark Streaming), machine learning (MLlib) and graph processing (GraphX).

Apache Spark can analyze large volumes of data by distributing it across the cluster and processing it in parallel, and it supports a wide range of programming languages, including Java, Scala, Python, and R.


Install Apache Spark on Ubuntu

Simply follow the steps below to install Apache Spark on Ubuntu 18.04 | 20.04:

Step 1 : Install Java JDK

To install the default version of Java (OpenJDK), run the commands below:

sudo apt update
sudo apt install default-jdk

To check the version of Java, run the command below:

java --version

When you run the command above, it will show output similar to the following:

openjdk 11.0.10 2021-01-19
OpenJDK Runtime Environment (build 11.0.10+9-Ubuntu-0ubuntu1.20.04)
OpenJDK 64-Bit Server VM (build 11.0.10+9-Ubuntu-0ubuntu1.20.04, mixed mode, sharing)
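Note: Spark 2.4.x is documented against Java 8 (Java 11 support arrived in Spark 3.0), although it starts fine on Java 11 as shown above. If you hit Java compatibility errors later, you can optionally install OpenJDK 8 alongside the default JDK and switch the system default — an optional step that is not part of the original guide:

# Optional: install OpenJDK 8 and choose the default Java version
sudo apt install openjdk-8-jdk
sudo update-alternatives --config java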

Step 2 : Install Scala

Scala is required to run Apache Spark because Apache Spark is written in Scala.

To install Scala, run the command below:

sudo apt install scala

Run the below command to check the version of Scala:

scala -version

The above command will show output similar to the following:

Scala code runner version 2.11.12 -- Copyright 2002-2017, LAMP/EPFL
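Optionally, you can run a one-line expression to confirm that Scala itself executes code correctly — a quick sanity check that is not part of the original steps:

# Optional: evaluate a simple Scala expression from the command line
scala -e 'println("Scala is working")'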

Step 3 : Install Apache Spark

At this point, all required packages are installed on your Ubuntu system. Now run the commands below to download Apache Spark:

cd /tmp
wget https://archive.apache.org/dist/spark/spark-2.4.6/spark-2.4.6-bin-hadoop2.7.tgz

Now extract the downloaded archive and move it to the /opt directory by running the commands below:

tar -xvzf spark-2.4.6-bin-hadoop2.7.tgz
sudo mv spark-2.4.6-bin-hadoop2.7 /opt/spark
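You can optionally verify that the files landed in the right place by listing the installation directory; you should see subdirectories such as bin, sbin, conf and jars:

# Optional: confirm Spark was moved to /opt/spark
ls /opt/spark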

Next, you will need to create and configure Spark environment variables so you can easily run Spark commands. You can do that by editing the .bashrc file:

nano ~/.bashrc

And add the below lines at the end of the file:

export SPARK_HOME=/opt/spark
export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin

Save and close the file. Now run the command below to apply changes and activate the environment:

source ~/.bashrc
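To confirm the environment is configured correctly, you can check that SPARK_HOME is set and that the Spark scripts are now on your PATH:

# Verify the new environment variables
echo $SPARK_HOME
which spark-shell

The first command should print /opt/spark and the second /opt/spark/bin/spark-shell.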

Step 4 : Start Apache Spark

To start the Apache Spark master server, run the command below:

start-master.sh
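The master listens on port 7077 by default and writes its logs under /opt/spark/logs. To confirm that it started, you can check for the Master process with jps (shipped with the JDK) or look at the log file — a quick check, assuming a default installation:

# Check that the Master process is running
jps
# The master URL (spark://<hostname>:7077) is printed in this log
tail $SPARK_HOME/logs/spark-*-org.apache.spark.deploy.master.Master-*.out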

Next, run the command below to start the Spark worker process:

start-slave.sh spark://localhost:7077

Replace localhost with the server hostname or IP address, as shown below:

start-slave.sh spark://your-server-ip:7077
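By default the worker offers all CPU cores and most of the machine's memory to Spark. If you want to limit that, start-slave.sh accepts core and memory options — a hedged example, assuming the standalone worker flags -c (cores) and -m (memory) in this Spark version:

# Optional: start a worker limited to 2 cores and 2 GB of memory
start-slave.sh -c 2 -m 2G spark://your-server-ip:7077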

After the Spark worker process starts, open your favorite browser and browse to the server hostname or IP address as shown below:

http://localhost:8080

Again, replace localhost with the server hostname or IP address.

Browsing to this address will open the Spark dashboard:

[Screenshot: Spark dashboard]

Step 5 : Spark Shell

You can also connect to the Spark server using its command shell. To do that, run the command below:

spark-shell

It will launch the Spark shell.

Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.4.6
      /_/
         
Using Scala version 2.11.12 (OpenJDK 64-Bit Server VM, Java 11.0.10)
Type in expressions to have them evaluated.
Type :help for more information.

scala> 
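From here you can run Spark code interactively. By default spark-shell uses a local master; if you want to attach it to the standalone master started above, pass the master URL. As a minimal example (not part of the original article), you could count the even numbers in a small dataset at the scala> prompt:

# Optionally connect the shell to your standalone master
spark-shell --master spark://your-server-ip:7077

scala> val nums = spark.sparkContext.parallelize(1 to 100)
scala> nums.filter(_ % 2 == 0).count()
res0: Long = 50

Type :quit to exit the shell.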

That’s all

If you face any errors or issues with the above steps, please use the comment box below to report them.
