Apache Spark 1.3.0 on YARN on HDP 2.2

Apache Spark 1.3.0 was released in mid-March this year, and if you are familiar with the Hortonworks technical preview for Spark 1.2.0, you might also want to try out the newly released version on your Hadoop cluster. This short tutorial covers the basic steps to get you up and running with Spark on YARN on HDP.

Download and Configure

First we need to get the latest version of Apache Spark from the official download page. Since we are running on Hadoop/YARN, we have to select the pre-built version for Hadoop 2.4 or later.
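
If you prefer the command line, something like the following should fetch the same archive (the URL is an assumption based on the Apache archive layout; use whatever link the download page gives you):

wget https://archive.apache.org/dist/spark/spark-1.3.0/spark-1.3.0-bin-hadoop2.4.tgz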

Untar the downloaded archive

tar xvfz spark-1.3.0-bin-hadoop2.4.tgz  

Since spark-1.3.0-bin-hadoop2.4 will be our base directory ($SPARK_HOME) for the rest of this tutorial, let's cd into it.

cd spark-1.3.0-bin-hadoop2.4  

Before we start Spark for the first time, we need to make some changes to the default configuration. Spark ships with an example file in its conf/ directory which we will extend.

cp $SPARK_HOME/conf/spark-defaults.conf.template $SPARK_HOME/conf/spark-defaults.conf  

Now let's fire up an editor and change the spark-defaults.conf file.

vi $SPARK_HOME/conf/spark-defaults.conf  

Add the following lines to the spark-defaults.conf file

spark.driver.extraJavaOptions -Dhdp.version=2.2.0.0-2041  
spark.yarn.am.extraJavaOptions -Dhdp.version=2.2.0.0-2041  

This tells YARN which HDP jar files to use when starting the Spark components: the hdp.version system property is substituted into the ${hdp.version} placeholders that HDP uses in its classpath configuration.
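
To see why this is needed: HDP references its stack version through a ${hdp.version} placeholder in the Hadoop configuration, roughly like the abridged, illustrative mapred-site.xml snippet below (the exact values on your cluster may differ). If the hdp.version system property is not set, the literal ${hdp.version} ends up in the container launch script, which then fails with a bad substitution error.

<property>
  <name>mapreduce.application.framework.path</name>
  <value>/hdp/apps/${hdp.version}/mapreduce/mapreduce.tar.gz#mr-framework</value>
</property>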

Running Spark on YARN

To actually run Spark on YARN you need to tell Spark where your YARN configuration resides. This is done by exporting YARN_CONF_DIR, which should point to your YARN configuration directory.

export YARN_CONF_DIR=/PATH/TO/YARN/CONF  

Typically this location would be /etc/hadoop/yarn/; on HDP clusters the client configuration is often found under /etc/hadoop/conf as well.
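
Whichever directory you point it at should at least contain your cluster's yarn-site.xml (and usually core-site.xml for the HDFS settings). A quick sanity check before starting the shell:

ls $YARN_CONF_DIR/yarn-site.xml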

Now we can try to run the spark-shell on top of YARN. To do so type

./bin/spark-shell --master yarn-client

inside your $SPARK_HOME directory. Here --master yarn-client tells Spark to launch its ApplicationMaster and executors as YARN containers on the NodeManagers, while the driver itself runs locally in the shell process.

You might run into a bad substitution exception when trying to execute the command above. The error comes from the unresolved ${hdp.version} placeholder mentioned earlier; to avoid it, pass the property explicitly on the command line via an additional --conf parameter, mirroring the spark-defaults.conf entry.

./bin/spark-shell --master yarn-client --conf "spark.driver.extraJavaOptions=-Dhdp.version=2.2.0.0-2041"

When everything works fine, you will see the Scala prompt and can start interacting with your Spark shell.
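
As a quick smoke test (just an illustrative snippet, any small job will do), run a trivial computation and watch the tasks execute in the YARN containers:

scala> sc.parallelize(1 to 1000).map(_ * 2).sum()
res0: Double = 1001000.0

If you prefer to verify the setup with spark-submit instead of the interactive shell, the bundled SparkPi example works as well (the exact name of the examples jar under lib/ may differ in your download):

./bin/spark-submit --class org.apache.spark.examples.SparkPi --master yarn-cluster lib/spark-examples-1.3.0-hadoop2.4.0.jar 10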

Troubleshooting

  • File permissions issue: In case you are getting an access permission error on HDFS, you might need to create a user directory for the user you are running the Spark shell/job as. The Spark staging process typically writes to the /user/MYUSER directory on HDFS, so make sure this directory exists and you have the proper permissions on it (see the commands below). In a development setup you could also switch off HDFS permission checking completely to avoid this issue: set dfs.permissions (dfs.permissions.enabled on newer Hadoop versions) to false in your hdfs-site.xml and restart your NameNode.
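
A minimal sketch for creating that staging directory as the HDFS superuser (MYUSER is a placeholder for the user you run Spark as; the hdfs service user may be named differently on your cluster):

sudo -u hdfs hdfs dfs -mkdir -p /user/MYUSER
sudo -u hdfs hdfs dfs -chown MYUSER:MYUSER /user/MYUSER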

Andreas Fritzler
