Getting Apache Spark running on YARN is straightforward. I covered in a previous post how to get started with Spark 1.3.0 on HDP 2.2. However, you might have noticed that bootstrapping a Spark environment on YARN can take a couple of seconds. This is because the Spark jars have to be copied to HDFS before the containers can be created on the NodeManager machines.
The uploading step shows up in the following log snippet when you run the ./bin/spark-shell --master yarn-client command:
INFO yarn.Client: Uploading resource file:/$SPARK_HOME/lib/spark-assembly-1.3.0-hadoop2.4.0.jar -> hdfs://NAMENODE:8020/user/USER_ID/.sparkStaging/application_1427707672069_0012/spark-assembly-1.3.0-hadoop2.4.0.jar
To avoid uploading the spark-assembly jar every time you run a Spark job or shell, first copy the jar file to a location of your choice inside HDFS:
hadoop fs -put $SPARK_HOME/lib/spark-assembly-1.3.0-hadoop2.4.0.jar /apps/
Here, we uploaded the jar file into the /apps/ directory in HDFS.
Now you have to export the SPARK_JAR environment variable to tell Spark where the jar file is located.
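A minimal sketch of that export, assuming the jar was uploaded to /apps/ as above; NAMENODE is a placeholder for your NameNode host, and 8020 is the default NameNode RPC port on HDP 2.2:

```shell
# Point Spark at the assembly jar already stored in HDFS,
# so spark-submit/spark-shell skip the upload step.
# NAMENODE is a placeholder for your actual NameNode host.
export SPARK_JAR=hdfs://NAMENODE:8020/apps/spark-assembly-1.3.0-hadoop2.4.0.jar
```

Add the line to your shell profile if you want it set for every session.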
UPDATE April 8th 2015
With Spark 1.3.0, the SPARK_JAR property has been deprecated. Instead of SPARK_JAR, you should set the spark.yarn.jar = hdfs://NAMENODE:PORT/apps/spark-assembly-1.3.0-hadoop2.4.0.jar property in your spark-defaults.conf configuration file.
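A sketch of the relevant spark-defaults.conf entry, assuming the default NameNode port 8020 and the /apps/ upload location used earlier (NAMENODE stands in for your actual host):

```
# $SPARK_HOME/conf/spark-defaults.conf
# Reuse the assembly jar already in HDFS instead of re-uploading it per job.
spark.yarn.jar    hdfs://NAMENODE:8020/apps/spark-assembly-1.3.0-hadoop2.4.0.jar
```

Spark accepts either whitespace or an equals sign between the property name and its value in this file.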
That's it! You can now run e.g. ./bin/spark-shell --master yarn-client without the uploading step.