Apache Spark on Alderaan (Single‑Node Usage)

Overview

Apache Spark is a distributed data‑processing framework designed for large‑scale parallel computation using resilient distributed datasets (RDDs), DataFrames, and SQL‑like operations. Spark is typically deployed on clusters using a resource manager such as YARN, Kubernetes, or Spark’s own standalone cluster manager.

On Alderaan, none of these Spark cluster managers are installed. However, Spark can still be used effectively in local mode within a single SLURM job allocation. In this mode:

  • Spark runs entirely on one compute node.
  • Parallelism comes from worker threads in a single JVM on that node, not from distributed executors.
  • SLURM controls resource allocation (nodes, CPUs, memory).
  • Spark uses the allocated CPUs through the local[N] master setting, where N is the number of worker threads.

This setup is suitable for:

  • Prototyping Spark applications
  • Moderate‑scale data processing on a single node (up to 64 cores)
  • Teaching and experimentation

It is not a multi‑node Spark cluster.


Key Constraints on Alderaan

  • No system‑wide Spark installation
  • No Hadoop / YARN
  • No HADOOP_CONF_DIR
  • No Spark standalone cluster

Spark must therefore be:

  • Installed in user space (e.g., via conda)
  • Run inside a single SLURM job
  • Launched with spark-submit --master local[N]

Installing Spark with Conda

Create a dedicated conda environment:

conda create -n spark1node -c conda-forge python=3.11 pyspark
conda activate spark1node

Verify the installation:

which spark-submit
spark-submit --version

You should see Spark reporting a version (e.g. 4.x) and Java runtime information.
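
If the version check fails, one common cause is that a required command is not on PATH in the active environment. The following is a minimal sketch of that check from Python, using only the standard library. The helper name find_tools is hypothetical, and the assumption that java must be resolvable on PATH is ours, not something this page states.

```python
import shutil


def find_tools(names=("spark-submit", "java")):
    """Map each command name to its resolved path, or None if it is not on PATH."""
    return {name: shutil.which(name) for name in names}


if __name__ == "__main__":
    # Report where each command resolves in the currently active environment.
    for name, path in find_tools().items():
        print(f"{name}: {path or 'NOT FOUND'}")
```

With the spark1node environment active, spark-submit would be expected to resolve to a path inside that environment.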


Example Spark Program (π Estimation)

Create a file spark_pi.py:

from pyspark.sql import SparkSession
import random

spark = SparkSession.builder.appName("pi").getOrCreate()
sc = spark.sparkContext

n = 1_000_000
count = sc.parallelize(range(n)).map(
    lambda _: 1 if (random.random()**2 + random.random()**2) <= 1 else 0
).sum()

pi_est = 4.0 * count / n
print(f"pi_est={pi_est}")

spark.stop()
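
For reference, the map and sum steps above compute a standard Monte Carlo estimate: the fraction of random points in the unit square that fall inside the quarter circle, multiplied by 4. A plain-Python sketch of the same estimator (no Spark involved) can serve as a sanity check on small inputs. The function name estimate_pi and the seed parameter are illustrative additions, not part of spark_pi.py.

```python
import random


def estimate_pi(n, seed=None):
    """Estimate pi from n random points in the unit square (serial, no Spark)."""
    rng = random.Random(seed)
    # Count points (x, y) that satisfy x^2 + y^2 <= 1, i.e. lie in the quarter circle.
    inside = sum(1 for _ in range(n) if rng.random() ** 2 + rng.random() ** 2 <= 1)
    return 4.0 * inside / n


if __name__ == "__main__":
    print(f"pi_est={estimate_pi(1_000_000)}")
```

Both versions are randomized, so their outputs should agree only approximately, not digit for digit.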

Test interactively on a single node with a small thread count (keep full‑size runs inside a SLURM job):

spark-submit --master local[4] spark_pi.py

Running Spark in a SLURM Job (1 Node)

Create a SLURM batch script spark1node.sbatch:

#!/bin/bash
#SBATCH -p math-alderaan
#SBATCH -N 1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=64
#SBATCH -t 00:10:00
#SBATCH -J spark1node
#SBATCH -o slurm-%j.out
#SBATCH -e slurm-%j.err

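# Make conda available in the batch shell, then activate the Spark environment.
source ~/.bashrc
conda activate spark1node

# Per-job scratch directory for Spark's temporary (shuffle and spill) files.
SCRATCH_DIR="/tmp/$USER/spark-$SLURM_JOB_ID"
mkdir -p "$SCRATCH_DIR"

# One Spark worker thread per CPU allocated by SLURM.
spark-submit \
  --master "local[${SLURM_CPUS_PER_TASK}]" \
  --conf "spark.local.dir=$SCRATCH_DIR" \
  ~/spark_pi.py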

Submit the job:

sbatch spark1node.sbatch

Check job status:

squeue -j <jobid>
scontrol show job <jobid>

View output:

cat slurm-<jobid>.out
cat slurm-<jobid>.err
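
When several jobs have been run, it can be convenient to pull the result line out of each output file programmatically. The sketch below assumes the output format printed by spark_pi.py (a line of the form pi_est=<number>). The helper name read_pi_estimates is hypothetical.

```python
import re
from pathlib import Path

# Matches the line printed by spark_pi.py, e.g. "pi_est=3.14152".
PI_LINE = re.compile(r"^pi_est=([0-9]+(?:\.[0-9]+)?)\s*$")


def read_pi_estimates(path):
    """Return every pi estimate found in a SLURM output file, in file order."""
    estimates = []
    for line in Path(path).read_text().splitlines():
        match = PI_LINE.match(line)
        if match:
            estimates.append(float(match.group(1)))
    return estimates
```

Calling read_pi_estimates on a slurm-<jobid>.out file returns an empty list if the job did not reach the print statement, which is a quick hint to look at the matching .err file.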

Why --master local[N] Is Required

  • SLURM enforces CPU limits using cgroups.
  • Spark does not automatically understand SLURM allocations.
  • local[N] explicitly tells Spark how many worker threads to use.

Using:

--master "local[${SLURM_CPUS_PER_TASK}]"

ensures:

  • Spark runs exactly one worker thread per CPU allocated by SLURM
  • No oversubscription
  • Predictable performance

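Do not rely on Spark defaults such as local[*], which sizes the thread pool from the number of cores the JVM reports; that number is not guaranteed to match the SLURM allocation.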
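
The same rule can be applied from inside a Python script. The following is a minimal sketch that derives the master string from SLURM_CPUS_PER_TASK and falls back to a single thread when the variable is missing or invalid. The helper name spark_master_from_slurm and the fallback policy are assumptions made for illustration.

```python
import os


def spark_master_from_slurm(env=None, default_threads=1):
    """Return a Spark master string such as 'local[64]' from SLURM_CPUS_PER_TASK."""
    env = os.environ if env is None else env
    raw = env.get("SLURM_CPUS_PER_TASK", "")
    try:
        threads = int(raw)
    except ValueError:
        # Variable unset or not a number: fall back to a conservative default.
        threads = default_threads
    if threads < 1:
        threads = default_threads
    return f"local[{threads}]"


if __name__ == "__main__":
    print(spark_master_from_slurm())
```

One possible use is SparkSession.builder.master(spark_master_from_slurm()) in a script that is started with plain python rather than spark-submit. Falling back to one thread, rather than to local[*], keeps the script from claiming more CPUs than were allocated.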


About Hadoop and HADOOP_CONF_DIR

  • Hadoop is not installed on Alderaan.
  • Spark falls back to its built‑in Java classes in place of the native Hadoop library.
  • Warnings such as "Unable to load native-hadoop library" are expected and harmless in this configuration.
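
When reading a job's stderr file, it can help to set the expected warning aside so that anything else stands out. The sketch below separates lines containing the known‑harmless message from all other lines. The helper name split_stderr and the KNOWN_HARMLESS tuple are hypothetical, and the tuple lists only the one warning named above.

```python
# Substrings of warnings that this page describes as expected and harmless.
KNOWN_HARMLESS = ("Unable to load native-hadoop library",)


def split_stderr(lines, harmless=KNOWN_HARMLESS):
    """Split log lines into (expected, other) based on known-harmless substrings."""
    expected, other = [], []
    for line in lines:
        if any(marker in line for marker in harmless):
            expected.append(line)
        else:
            other.append(line)
    return expected, other
```

A typical call would pass the lines of slurm-<jobid>.err and then inspect only the second list.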


What This Setup Can and Cannot Do

Supported

  • Parallel Spark workloads on one node
  • Up to 64 CPU cores
  • RDDs, DataFrames, Spark SQL
  • PySpark applications

Not Supported

  • Multi‑node Spark jobs
  • YARN / Kubernetes / Standalone Spark clusters
  • Inter‑job Spark communication
  • Long‑running Spark services

For Alderaan users:

  • Use Spark only in local mode
  • Always run Spark inside a SLURM job
  • Always specify --master local[N]
  • Treat Spark as a threaded analytics framework, not a cluster service
