Apache Spark on Alderaan (Single‑Node Usage)
Overview
Apache Spark is a distributed data‑processing framework designed for large‑scale parallel computation using resilient distributed datasets (RDDs), DataFrames, and SQL‑like operations. Spark is typically deployed on clusters using a resource manager such as YARN, Kubernetes, or Spark’s own standalone cluster manager.
On Alderaan, none of these Spark cluster managers are installed. However, Spark can still be used effectively in local mode within a single SLURM job allocation. In this mode:
- Spark runs entirely on one compute node.
- Parallelism is provided by threads, not distributed executors.
- SLURM controls resource allocation (nodes, CPUs, memory).
- Spark uses the allocated CPUs via local[N] execution.
This setup is suitable for:
- Prototyping Spark applications
- Moderate‑scale data processing on a single node (up to 64 cores)
- Teaching and experimentation
It is not a multi‑node Spark cluster.
Key Constraints on Alderaan
- No system‑wide Spark installation
- No Hadoop / YARN
- No HADOOP_CONF_DIR
- No Spark standalone cluster
All Spark usage must therefore:
- Be installed in user space (e.g., via conda)
- Run inside a single SLURM job
- Use spark-submit --master local[N]
Installing Spark with Conda
Create a dedicated conda environment:
conda create -n spark1node -c conda-forge python=3.11 pyspark
conda activate spark1node
Verify the installation:
which spark-submit
spark-submit --version
You should see Spark reporting a version (e.g. 4.x) and Java runtime information.
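The same check can be done from Python, which is handy inside a job script or notebook. The following is a minimal sketch using only the standard library; the function name spark_install_report is illustrative and not part of Spark. It only reports what is visible on PATH and in the active environment.

```python
import shutil
from importlib import metadata


def spark_install_report():
    """Collect basic facts about the Spark installation visible in this environment."""
    report = {
        # Absolute path of each executable, or None when it is not on PATH.
        "spark_submit": shutil.which("spark-submit"),
        "java": shutil.which("java"),
    }
    try:
        # Version of the installed pyspark distribution, if any.
        report["pyspark_version"] = metadata.version("pyspark")
    except metadata.PackageNotFoundError:
        report["pyspark_version"] = None
    return report


if __name__ == "__main__":
    for key, value in spark_install_report().items():
        print(f"{key}: {value or 'NOT FOUND'}")
```

If spark_submit points outside the spark1node environment, the environment is probably not activated.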
Example Spark Program (π Estimation)
Create a file spark_pi.py:
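import random

from pyspark.sql import SparkSession

# Start (or reuse) a Spark session; the master is supplied by spark-submit.
spark = SparkSession.builder.appName("pi").getOrCreate()
sc = spark.sparkContext

n = 1_000_000  # number of random points to sample

# Count the points that land inside the unit quarter circle.
count = sc.parallelize(range(n)).map(
    lambda _: 1 if (random.random()**2 + random.random()**2) <= 1 else 0
).sum()

# The quarter circle covers pi/4 of the unit square.
pi_est = 4.0 * count / n
print(f"pi_est={pi_est}")

spark.stop()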
Test interactively (single node, login shell):
spark-submit --master local[4] spark_pi.py
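A possible variant of the estimator gives each partition its own seeded random generator, so that repeated runs with the same settings should produce the same estimate. This is a sketch, not part of the original example: count_hits, estimate_pi, and the default base_seed are illustrative names and values, and estimate_pi expects an existing SparkContext passed in as sc.

```python
import math
import random


def count_hits(seed, samples):
    """Count random points in the unit square that fall inside the quarter circle."""
    rng = random.Random(seed)  # private generator, independent of global state
    hits = 0
    for _ in range(samples):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            hits += 1
    return hits


def estimate_pi(sc, n=1_000_000, partitions=4, base_seed=12345):
    """Estimate pi with one seeded task per partition on SparkContext `sc`."""
    per_partition = n // partitions
    hits = (
        # One placeholder element per partition; only the partition index is used.
        sc.parallelize(range(partitions), partitions)
        .mapPartitionsWithIndex(
            lambda index, _rows: [count_hits(base_seed + index, per_partition)]
        )
        .sum()
    )
    return 4.0 * hits / (per_partition * partitions)
```

In spark_pi.py this would be called as estimate_pi(sc, n, partitions=...) with the partition count chosen to match the N in local[N]. Sampling inside one function call per partition also avoids invoking a Python lambda once per point.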
Running Spark in a SLURM Job (1 Node)
Create a SLURM batch script spark1node.sbatch:
#!/bin/bash
#SBATCH -p math-alderaan
#SBATCH -N 1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=64
#SBATCH -t 00:10:00
#SBATCH -J spark1node
#SBATCH -o slurm-%j.out
#SBATCH -e slurm-%j.err
source ~/.bashrc
conda activate spark1node
spark-submit \
    --master "local[${SLURM_CPUS_PER_TASK}]" \
    --conf "spark.local.dir=/tmp/$USER/spark-$SLURM_JOB_ID" \
    ~/spark_pi.py
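For users who prefer a Python launcher, the spark-submit line of the batch script can be assembled from the SLURM environment. The sketch below mirrors the script above. The function name build_spark_submit is hypothetical, and the optional --driver-memory handling is an addition that assumes SLURM_MEM_PER_NODE (in MB) is set, which is typically only the case when the job requests memory with --mem. The 80% factor is an illustrative choice, not an Alderaan recommendation.

```python
import os
import shlex


def build_spark_submit(script, env=None):
    """Return the spark-submit argument list for a single-node SLURM job."""
    env = os.environ if env is None else env
    # Fall back to one thread when not running under SLURM.
    cpus = int(env.get("SLURM_CPUS_PER_TASK", "1"))
    user = env.get("USER", "unknown")
    job_id = env.get("SLURM_JOB_ID", "nojob")
    cmd = [
        "spark-submit",
        "--master", f"local[{cpus}]",
        "--conf", f"spark.local.dir=/tmp/{user}/spark-{job_id}",
    ]
    # Assumption: SLURM_MEM_PER_NODE holds the --mem request in megabytes.
    mem_mb = env.get("SLURM_MEM_PER_NODE")
    if mem_mb and mem_mb.isdigit():
        # Give the driver JVM 80% of the allocation (illustrative headroom).
        cmd += ["--driver-memory", f"{int(mem_mb) * 8 // 10}m"]
    cmd.append(script)
    return cmd


if __name__ == "__main__":
    # Print the command instead of running it, so it can be inspected first.
    print(shlex.join(build_spark_submit(os.path.expanduser("~/spark_pi.py"))))
```

In local mode all tasks run inside the driver JVM, so driver memory is the setting that governs how much heap Spark can use.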
Submit the job:
sbatch spark1node.sbatch
Check job status:
squeue -j <jobid>
scontrol show job <jobid>
View output:
cat slurm-<jobid>.out
cat slurm-<jobid>.err
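With Spark's default logging, log messages go to standard error and the program's own print output goes to standard output, so the result line should be in the .out file. A small sketch for pulling the estimate out of that file programmatically follows; the helper names are illustrative, and it assumes the pi_est=... line format printed by spark_pi.py.

```python
import re
from pathlib import Path

# Matches a line such as "pi_est=3.141592" anywhere in the output.
PI_LINE = re.compile(r"^pi_est=([0-9.eE+-]+)\s*$", re.MULTILINE)


def read_pi_estimate(text):
    """Return the last pi_est value found in the job output, or None if absent."""
    matches = PI_LINE.findall(text)
    return float(matches[-1]) if matches else None


def read_pi_from_job(job_id, directory="."):
    """Read slurm-<job_id>.out from `directory` and extract the estimate."""
    out_file = Path(directory) / f"slurm-{job_id}.out"
    return read_pi_estimate(out_file.read_text())
```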
Why --master local[N] Is Required
- SLURM enforces CPU limits using cgroups.
- Spark does not automatically understand SLURM allocations.
- local[N] explicitly tells Spark how many worker threads to use.
Using:
--master "local[${SLURM_CPUS_PER_TASK}]"
ensures:
- Spark uses exactly the CPUs allocated by SLURM
- No oversubscription
- Predictable performance
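Do not rely on local[*], which is also what spark-submit uses when no master is given: it sizes the thread pool from the cores Spark detects on the node, not from the SLURM allocation.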
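To see why the allocation and the detected core count can differ, the following sketch prints the CPU counts visible from inside a job. It uses only the Python standard library; cpu_views is an illustrative name. What the JVM itself reports for local[*] depends on the Java version and cgroup configuration, so treat these numbers as a diagnostic from the Python side only.

```python
import os


def cpu_views(env=None):
    """Report the CPU counts seen from different vantage points."""
    env = os.environ if env is None else env
    slurm = env.get("SLURM_CPUS_PER_TASK")
    # sched_getaffinity is Linux-only; report None on other platforms.
    if hasattr(os, "sched_getaffinity"):
        affinity = len(os.sched_getaffinity(0))
    else:
        affinity = None
    return {
        "node_logical_cpus": os.cpu_count(),   # all logical CPUs on the node
        "process_affinity_cpus": affinity,     # CPUs this process may run on
        "slurm_cpus_per_task": int(slurm) if slurm else None,
    }


if __name__ == "__main__":
    for name, value in cpu_views().items():
        print(f"{name}: {value}")
```

When slurm_cpus_per_task is smaller than node_logical_cpus, a thread count taken from the node size would oversubscribe the allocation.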
About Hadoop and HADOOP_CONF_DIR
- Hadoop is not installed on Alderaan.
- Spark falls back to built‑in Java classes for file I/O instead of the native Hadoop library.
- Warnings such as:
Unable to load native-hadoop library
are expected and harmless in this configuration.
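Because there is no HDFS, input data is read from the node's ordinary file system. One way to make that explicit is to pass Spark a file:// URI, as in this sketch. The helper names local_uri and read_lines are illustrative, and read_lines expects an existing SparkSession passed in as spark.

```python
from pathlib import Path


def local_uri(path):
    """Return an absolute file:// URI for a path on the node's file system."""
    return Path(path).expanduser().resolve().as_uri()


def read_lines(spark, path):
    """Load a local text file as a DataFrame with one `value` column per line."""
    return spark.read.text(local_uri(path))
```

For example, read_lines(spark, "~/data/input.txt").count() would return the number of lines in that (hypothetical) file.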
What This Setup Can and Cannot Do
Supported
- Parallel Spark workloads on one node
- Up to 64 CPU cores
- RDDs, DataFrames, Spark SQL
- PySpark applications
Not Supported
- Multi‑node Spark jobs
- YARN / Kubernetes / Standalone Spark clusters
- Inter‑job Spark communication
- Long‑running Spark services
Recommended User Guidance
For Alderaan users:
- Use Spark only in local mode
- Always run Spark inside a SLURM job
- Always specify --master local[N]
- Treat Spark as a threaded analytics framework, not a cluster service
References
- Apache Spark documentation: https://spark.apache.org/docs/latest/
- PySpark API: https://spark.apache.org/docs/latest/api/python/