How to Install Apache Spark on Ubuntu 20.04

Apache Spark is an open-source, general-purpose cluster computing framework. It provides high-level APIs in Java, Scala, Python and R, along with an engine that supports general execution graphs. It comes with built-in modules for streaming, SQL, machine learning and graph processing. Spark can analyze large amounts of data by distributing it across a cluster and processing it in parallel.

In this tutorial, we will explain how to install the Apache Spark cluster computing stack on Ubuntu 20.04.

Prerequisites

  • A server running Ubuntu 20.04.
  • A root password configured on the server.

Getting Started

First, refresh your system's package index so that the steps below install the latest available package versions. You can refresh it with the following command:

apt-get update -y

Once the package index is updated, you can proceed to the next step.

Install Java

Apache Spark runs on the Java Virtual Machine, so Java must be installed on your system. You can install it with the following command:

apt-get install default-jdk -y

Once Java is installed, verify the installed version with the following command:

java --version

You should see the following output:

openjdk 11.0.8 2020-07-14
OpenJDK Runtime Environment (build 11.0.8+10-post-Ubuntu-0ubuntu120.04)
OpenJDK 64-Bit Server VM (build 11.0.8+10-post-Ubuntu-0ubuntu120.04, mixed mode, sharing)
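
If you want to check the Java version from a script rather than by eye, the following is a minimal sketch. The helper name java_major is hypothetical (it is not part of Java or Spark), and it assumes the version line has the same layout as the output shown above.

```shell
# java_major: hypothetical helper, not part of Java or Spark.
# Takes the text printed by `java --version` and prints the major version.
# Assumes the first line looks like "openjdk 11.0.8 2020-07-14".
java_major() {
    # First line only, second field ("11.0.8"), then the part before the first dot.
    printf '%s\n' "$1" | head -n 1 | awk '{ print $2 }' | cut -d. -f1
}

# Example with the version line from the output above:
java_major "openjdk 11.0.8 2020-07-14"   # prints 11
```

On a live system, you could pass the real output instead, for example: java_major "$(java --version 2>&1)"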

Install Scala

Apache Spark is written in Scala, so this tutorial also installs Scala on the system. You can install it with the following command:

apt-get install scala -y

After installing Scala, you can verify the Scala version using the following command:

scala -version

You should see the following output:

Scala code runner version 2.11.12 -- Copyright 2002-2017, LAMP/EPFL

Now, start the Scala interpreter with the following command:

scala

You should get the following output:

Welcome to Scala 2.11.12 (OpenJDK 64-Bit Server VM, Java 11.0.8).
Type in expressions for evaluation. Or try :help.

Now, test Scala with the following command:

scala> println("Hitesh Jethva")

You should get the following output:

Hitesh Jethva

Install Apache Spark

First, you will need to download the latest version of Apache Spark from its official website. At the time of writing this tutorial, the latest version of Apache Spark is 2.4.6. You can download it to the /opt directory with the following command:

cd /opt
wget https://archive.apache.org/dist/spark/spark-2.4.6/spark-2.4.6-bin-hadoop2.7.tgz
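
The download URL above embeds the Spark version and the Hadoop build in a fixed pattern. The sketch below rebuilds that URL from its parts. The helper name spark_download_url is hypothetical, and the pattern is inferred only from the single URL used in this tutorial, so other releases may not follow it.

```shell
# spark_download_url: hypothetical helper that rebuilds the archive URL
# used in this tutorial from a Spark version and a Hadoop build version.
# The URL pattern is an assumption based on the one URL shown above.
spark_download_url() {
    # $1 = Spark version (e.g. 2.4.6), $2 = Hadoop build (e.g. 2.7)
    printf 'https://archive.apache.org/dist/spark/spark-%s/spark-%s-bin-hadoop%s.tgz\n' \
        "$1" "$1" "$2"
}

# Reproduces the URL passed to wget above:
spark_download_url 2.4.6 2.7
```

Check that the resulting file actually exists on the archive server before relying on the URL for another version.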

Once the download completes, extract the archive with the following command:

tar -xvzf spark-2.4.6-bin-hadoop2.7.tgz

Next, rename the extracted directory to spark as shown below:

mv spark-2.4.6-bin-hadoop2.7 spark

Next, you will need to configure the Spark environment so that you can easily run Spark commands. You can configure it by editing the ~/.bashrc file:

nano ~/.bashrc

Add the following lines at the end of the file:

export SPARK_HOME=/opt/spark
export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin

Save and close the file, then activate the environment with the following command:

source ~/.bashrc
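
To confirm that the Spark directories really ended up on the search path, a small check like the following can help. This is only a sketch: in_path is a hypothetical helper name, and the sample path string stands in for your real PATH.

```shell
# in_path: hypothetical helper. Succeeds if directory $1 is an entry
# of the colon-separated, PATH-style string $2.
in_path() {
    case ":$2:" in
        *":$1:"*) return 0 ;;
        *) return 1 ;;
    esac
}

SPARK_HOME=/opt/spark   # same value as exported in ~/.bashrc above
SAMPLE_PATH="/usr/bin:/bin:$SPARK_HOME/bin:$SPARK_HOME/sbin"   # stand-in for $PATH

if in_path "$SPARK_HOME/sbin" "$SAMPLE_PATH"; then
    echo "sbin is on PATH"
fi
```

In a real session, pass "$PATH" as the second argument instead of the sample string.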

Start Spark Master Server

At this point, Apache Spark is installed and configured. Now, start the Spark master server using the following command:

start-master.sh

You should see the following output:

starting org.apache.spark.deploy.master.Master, logging to /opt/spark/logs/spark-root-org.apache.spark.deploy.master.Master-1-ubuntu2004.out

By default, the Spark master web interface listens on port 8080. You can check it using the following command:

ss -tpln | grep 8080

You should see the following output:

LISTEN   0        1                               *:8080                *:*      users:(("java",pid=4930,fd=249))   
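
Note that grep 8080 also matches other lines that merely contain those digits, such as a listener on port 18080. The sketch below matches the port exactly. The helper name has_listener is hypothetical, and it assumes the column layout of the ss output shown above, where the fourth column is the local address and port.

```shell
# has_listener: hypothetical helper. Reads `ss -tln` style lines on stdin
# and succeeds only if some LISTEN line has a local address ending in ":$1".
has_listener() {
    awk -v port="$1" '
        $1 == "LISTEN" && $4 ~ (":" port "$") { found = 1 }
        END { exit !found }
    '
}

# Example using the line from the output above:
sample='LISTEN   0   1   *:8080   *:*   users:(("java",pid=4930,fd=249))'
if printf '%s\n' "$sample" | has_listener 8080; then
    echo "port 8080 is listening"
fi
```

On a live system, you could pipe the real output into it, for example: ss -tln | has_listener 8080 && echo "master web UI is up"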

Now, open your web browser and access the Spark web interface using the URL http://your-server-ip:8080. You should see the following screen:

[Screenshot: Apache Spark Web UI]

Start Spark Worker Process

As the web interface shows, the Spark master service is running at spark://your-server-ip:7077. Use this address to start the Spark worker process with the following command:

start-slave.sh spark://your-server-ip:7077

You should see the following output:

starting org.apache.spark.deploy.worker.Worker, logging to /opt/spark/logs/spark-root-org.apache.spark.deploy.worker.Worker-1-ubuntu2004.out
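
The master address passed to the worker is simply the spark:// scheme, the master's host, and port 7077. If you start workers from a script, a sketch like the following can build it. The helper name spark_master_url is hypothetical, and 192.0.2.10 is a placeholder documentation address, not a real server.

```shell
# spark_master_url: hypothetical helper that builds a standalone master URL.
# $1 = master host or IP, $2 = port (defaults to 7077, the port shown above).
spark_master_url() {
    printf 'spark://%s:%s\n' "$1" "${2:-7077}"
}

# 192.0.2.10 is a placeholder address; substitute your server IP.
spark_master_url 192.0.2.10   # prints spark://192.0.2.10:7077
```

The result can then be handed to the worker script, for example: start-slave.sh "$(spark_master_url your-server-ip)"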

Now, go to the Spark dashboard and refresh the screen. You should see the Spark worker process in the following screen:

[Screenshot: Apache Spark Worker]

Working with Spark Shell

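You can also work with Spark from the command line using the spark-shell command, as shown below. Started without arguments, it runs Spark in local mode; to attach it to the standalone master started earlier, pass --master spark://your-server-ip:7077.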

spark-shell

Once connected, you should see the following output:

WARNING: An illegal reflective access operation has occurred
WARNING: Illegal reflective access by org.apache.spark.unsafe.Platform (file:/opt/spark/jars/spark-unsafe_2.11-2.4.6.jar) to method java.nio.Bits.unaligned()
WARNING: Please consider reporting this to the maintainers of org.apache.spark.unsafe.Platform
WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
WARNING: All illegal access operations will be denied in a future release
20/08/29 14:35:07 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Spark context Web UI available at http://ubuntu2004:4040
Spark context available as 'sc' (master = local[*], app id = local-1598711719335).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.4.6
      /_/
         
Using Scala version 2.11.12 (OpenJDK 64-Bit Server VM, Java 11.0.8)
Type in expressions to have them evaluated.
Type :help for more information.

scala> 

If you want to use Python with Spark, you can use the pyspark command-line utility.

First, install Python 2 with the following command:

apt-get install python -y

Once installed, you can start the PySpark shell with the following command:

pyspark

Once connected, you should get the following output:

Python 2.7.18rc1 (default, Apr  7 2020, 12:05:55) 
[GCC 9.3.0] on linux2
Type "help", "copyright", "credits" or "license" for more information.
WARNING: An illegal reflective access operation has occurred
WARNING: Illegal reflective access by org.apache.spark.unsafe.Platform (file:/opt/spark/jars/spark-unsafe_2.11-2.4.6.jar) to method java.nio.Bits.unaligned()
WARNING: Please consider reporting this to the maintainers of org.apache.spark.unsafe.Platform
WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
WARNING: All illegal access operations will be denied in a future release
20/08/29 14:36:40 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.4.6
      /_/

Using Python version 2.7.18rc1 (default, Apr  7 2020 12:05:55)
SparkSession available as 'spark'.
>>> 

To stop the worker and master processes, run the following commands:

stop-slave.sh
stop-master.sh

Conclusion

Congratulations! You have successfully installed Apache Spark on an Ubuntu 20.04 server. You should now be able to perform basic tests before you start configuring a Spark cluster. Feel free to ask me if you have any questions.
