Apache Spark 1.3.0 was released in mid-March this year, and if you are familiar with the Hortonworks technical preview for Spark 1.2.0, you might also want to try out the newly released version on your Hadoop cluster. This short tutorial covers the basic steps to get you up and running with Spark on YARN on HDP.
Download and Configure
First, we need to get the latest version of Apache Spark from the official download page. Since we are running on Hadoop/YARN, we have to select the pre-built version for Hadoop 2.4 or later.
Untar the downloaded archive:
tar xvfz spark-1.3.0-bin-hadoop2.4.tgz
spark-1.3.0-bin-hadoop2.4 will be our base directory ($SPARK_HOME) for the rest of this tutorial, so let's cd into it.
Before we start Spark for the first time, we need to make some changes to the default configuration file. Spark ships with an example file in its /conf directory, which we will be extending.
cp $SPARK_HOME/conf/spark-defaults.conf.template $SPARK_HOME/conf/spark-defaults.conf
Now let's fire up an editor and add the following lines to $SPARK_HOME/conf/spark-defaults.conf:
spark.driver.extraJavaOptions -Dhdp.version=188.8.131.52-2041
spark.yarn.am.extraJavaOptions -Dhdp.version=184.108.40.206-2041
This tells YARN which HDP jar files to use when starting the Spark components. In detail, hdp.version is used as a substitution in the classpath variable, so its value has to match the HDP build installed on your cluster.
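As a rough sketch, the same edit could be scripted instead of done by hand. The helper below appends both options to a configuration file; the function name add_hdp_version and its two arguments are made up for illustration, and the version string you pass in has to be the one from your own HDP installation.

```shell
#!/bin/sh
# Hypothetical helper: append the two hdp.version options to a Spark
# defaults file.
# Usage: add_hdp_version <conf-file> <hdp-version>
add_hdp_version() {
    conf_file="$1"
    hdp_version="$2"

    # If the file already mentions hdp.version, leave it alone so that
    # running the helper twice does not duplicate the entries.
    if [ -f "$conf_file" ] && grep -qF 'hdp.version' "$conf_file"; then
        return 0
    fi

    # Append one line for the driver and one for the YARN application master.
    {
        printf 'spark.driver.extraJavaOptions -Dhdp.version=%s\n' "$hdp_version"
        printf 'spark.yarn.am.extraJavaOptions -Dhdp.version=%s\n' "$hdp_version"
    } >> "$conf_file"
}

# Example call (the second argument is a placeholder, not a real version):
#   add_hdp_version "$SPARK_HOME/conf/spark-defaults.conf" "<your-hdp-version>"
```

The existence check is a deliberate simplification: it only looks for the string hdp.version and does not update an entry that carries a different version.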
Running Spark on YARN
To actually run Spark on YARN, you need to tell Spark where your YARN configuration resides. This is done by exporting YARN_CONF_DIR, which should point to your YARN configuration directory.
Typically this location would be
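A small sanity check before launching can save a confusing startup failure. The sketch below assumes that a usable YARN configuration directory contains a yarn-site.xml file; the function name check_yarn_conf_dir is hypothetical, and the path in the usage comment is a placeholder you need to replace with your own location.

```shell
#!/bin/sh
# Hypothetical helper: return 0 if the given directory looks like a YARN
# configuration directory, i.e. it contains a yarn-site.xml file.
check_yarn_conf_dir() {
    dir="$1"

    # An empty argument usually means YARN_CONF_DIR was never exported.
    if [ -z "$dir" ]; then
        echo "YARN_CONF_DIR is not set" >&2
        return 1
    fi

    # yarn-site.xml is where the ResourceManager address and friends live.
    if [ ! -f "$dir/yarn-site.xml" ]; then
        echo "no yarn-site.xml found in $dir" >&2
        return 1
    fi

    return 0
}

# Example (placeholder path, adjust to your cluster):
#   export YARN_CONF_DIR=/path/to/your/yarn/conf
#   check_yarn_conf_dir "$YARN_CONF_DIR" && echo "configuration found"
```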
Now we can try to run the spark-shell on top of YARN. To do so, type the following from within your $SPARK_HOME directory:
./bin/spark-shell --master yarn-client
Here, --master yarn-client tells Spark to run its executors as YARN containers on the corresponding NodeManagers, while the driver (the shell itself) stays on your local machine.
You might run into a bad substitution exception when trying to execute the command above. To avoid this issue, you need to add one additional --conf parameter to your startup command:
./bin/spark-shell --master yarn-client --conf hdp.version=220.127.116.11-2014
If everything works fine, you will see the Scala console and can start interacting with your Spark shell.
- File permissions issue: If you get an access permission error on HDFS, you might want to create a user directory for the user you are running the Spark shell/job under. Typically, the Spark staging process uses the /user/MYUSER directory on HDFS, so make sure you have the proper permissions for this directory. In a development setup, you could also switch off HDFS permission handling completely to avoid this issue. Just change/add the relevant permission property in hdfs-site.xml and restart your NameNode.
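Creating the user directory boils down to two HDFS commands. The sketch below only prints them so you can review them before running anything; it assumes that the HDFS superuser account is called hdfs and that you can switch to it via sudo, which may differ on your cluster. The function name hdfs_userdir_cmds is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: print the commands that create an HDFS home
# directory for a user and hand ownership to that user.
# Assumption: the HDFS superuser account is named "hdfs".
hdfs_userdir_cmds() {
    user="$1"
    # Create /user/<name>, including missing parent directories.
    printf 'sudo -u hdfs hdfs dfs -mkdir -p /user/%s\n' "$user"
    # Make the user the owner so Spark can stage its files there.
    printf 'sudo -u hdfs hdfs dfs -chown %s /user/%s\n' "$user" "$user"
}

# Print the commands for the placeholder user from the text above.
hdfs_userdir_cmds "MYUSER"
```

Once the output looks right for your setup, the printed lines can be pasted into a shell on a node that has the HDFS client installed.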