Guide to CCM Clusters

Logging in

How to log in

The main entry point to the system is math-alderaan.ucdenver.pvt, which is similar to the Alderaan nodes. You can also log in through clas-compute.ucdenver.pvt, which is similar to the nodes of the other clusters, Score and Colibri. You can submit and monitor batch jobs on any of the clusters from either entry point. The system uses the CentOS 7 and 8 operating systems: the Alderaan cluster runs CentOS 8, while the other clusters and clas-compute run CentOS 7.

At this time, the main way of using the system is to use an SSH client to log in to a terminal session on math-alderaan or clas-compute. You will need to be on the CU Denver private network (wired or CU Denver wireless, not CU Denver Guest). To connect from the internet, you need to use the university's VPN or remote access. It is highly recommended to download and use the VMware Horizon app instead of continuing in the browser. Either way, log in and click on the "Complimentary" button, which will give you a Windows virtual machine on the campus network. Then open a Powershell window (press the Windows button, a search box opens, type shell, select Powershell). Type ssh math-alderaan in the Powershell window. See here for more on the remote client.

This system uses your normal portal/email username and password, but your account must be set up before using the system. Please go to accounts to request an account; if you are a student, the faculty project lead should request your account.

On Linux or a Mac, you can simply use the Terminal app, which is built into the operating system. It is hidden away in the Applications -> Utilities folder on a Mac and in similar places on various Linux desktops. You may want to drag it to your dock (on a Mac) or desktop (on Linux) so that it is more conveniently available next time.

Current Windows 10/11 has a native ssh client - just type ssh in a terminal window (also called a Powershell window or command window). The ssh client also comes with scp and sftp for file transfer.

You can also use the Windows Subsystem for Linux (WSL), where you install a Linux distribution as an app and can use it to ssh out just like from a terminal window on any Linux machine. Note that WSL may not work with VPN.

Either way, from a terminal window, at the command line prompt type in:

ssh username@math-alderaan.ucdenver.pvt

or

ssh username@clas-compute.ucdenver.pvt

The username is your account name: the single short word which you can use to log into the university portal instead of your email address, not the firstname.lastname from your university email. Contact us or the OIT helpdesk at ucd-oit-helpdesk@cuanschutz.edu if you do not know what your account name is.

After connecting, ssh will ask for your CU Denver password; enter it at this point. You should then be at the math-alderaan or clas-compute prompt and in your home directory, which is /home/username.

Interactive use limitations

Using a server 'interactively' (that is, not through the scheduler) is often needed for troubleshooting a job or just watching what it is doing in real time. After SSH'ing into a head node, you can type ssh math-colibri-i01 or the name of whatever interactive server you want to go to directly.

Please do not run anything directly on compute nodes, which are reserved for jobs under the control of the scheduler, even if you may be able to ssh there. These are nodes with names like math-alderaan-c01, with something other than "i" before the number. Using compute nodes, where other people run jobs through the scheduler, will interfere with their work and make you very unpopular. It is OK to ssh to a compute node to check on your job, but do not run anything there.

Screen virtual terminal in interactive usage

If you use screen and you get disconnected, whatever you were running keeps going and you can connect to it later. This is called a virtual terminal session. It is generally a good idea to use screen on clas-compute, math-alderaan, or on the interactive nodes.

Typing screen creates a new terminal session. You can give it a name if you want to juggle more sessions, by screen -S name (make the name whatever you want).

If you want to disconnect from the session but leave it running, press Control-A and then the D key to detach. Control-A is the combination that lets screen know you want to perform an action.

When you want to reconnect to your screen session later, log back onto wherever you started the screen and type screen -r. If you have more than one screen, it will complain and tell you which screens are available to reconnect to. Type screen -r name to reconnect to that screen.

You can't just scroll in screen to see your terminal history as you normally would. Press Control-A and then Esc, and scrolling up and down will temporarily work the usual way. When you type anything, screen will leave the scrolling mode.
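
For example, a typical screen workflow might look like this (the session name and the command are placeholders):

 screen -S analysis        # start a new session named "analysis"
 ./long_job.sh             # start your work inside the session
 # press Control-A, then D, to detach; the job keeps running
 screen -r analysis        # later, reattach to the same session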

File Storage

You are responsible for keeping copies of your important files elsewhere. Files and entire filesystems can be lost.

The home directories are on a shared file server and linked as /home/username. Everyone can also have a project directory /storage/department/projects/username (where department may be one of the many departments which use this system). Users from some departments have project directories created in /data001/projects/username instead. These are currently accessible from Alderaan nodes only, i.e., not accessible from clas-compute or the math-colibri and math-score nodes. The location of the project directory is emailed to the user when the directory is created, usually as a part of onboarding.

The difference between project and home directories is that home directories are backed up occasionally while project directories are not. Please keep your home directory small to make the backups possible.

In addition, groups can request shared project directories in /storage/department/projects or /data001/projects.

Please monitor the usage of the partition you are on by

 df -h .

and if it is nearing full, check that you are not using more space than you are aware of with du. If you need a lot of data storage, please contact us before filling everything you can find.
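
For example, to see where your space is going (department and username are placeholders):

 du -sh ~                                        # total size of your home directory
 du -sh /storage/department/projects/username    # total size of your project directory
 du -h --max-depth=1 ~ | sort -h                 # largest directories under your home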

On Alderaan only, you can make your own directory in /scratch, which is on a large fast filesystem. When /scratch starts filling up, the oldest files will be purged automatically.

Do not keep any confidential or sensitive files on this system. We are not equipped for the level of security this would take. In particular, no proprietary data, health records, grades, social security numbers, and the like are allowed. If you use ssh keys to connect elsewhere from this system (such as github or another computer account), it is highly recommended to make an ssh key with a passphrase for that. Otherwise, the security of the account you are connecting to is only as good as the read protection of your files here.

Files and directories, including your home directory, are created with permissions which allow anyone to read them but not write. This is the Linux default, meant to encourage collaboration. If you want to keep a file or directory private, you need to change the permissions yourself. Type chmod og-rwx file_or_directory_name to make the file or directory not accessible by others (except system administrators, of course).
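
For example (the directory name is a placeholder):

 ls -ld ~/private_data         # check the current permissions
 chmod og-rwx ~/private_data   # remove all access for group and others
 ls -ld ~/private_data         # verify that only the owner retains access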

Where is the software? Modules and Singularity containers

We normally do not install application software directly on the system because of software and version conflicts. Instead, we install software in modules or singularity containers.

Modules

To see what software is available in modules,

module avail

provides a list of available software packages and their versions. The command module load modulename/version will change your environment (such as the PATH variable) temporarily so that the software and its various parts can be found, for example

module load R/4.2.1

or

module load matlab

Try that and use the env command to see what has changed.

You may need to load multiple modules at the same time. When you are done with a module you can module unload it, but it is strongly recommended to do

module purge

and start over loading exactly the modules you need, or simply log out and back in again.
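
A typical session might look like this (module names and versions are examples; see module avail for what is actually installed):

 module load R/4.2.1       # make this R version available
 which R                   # R is now found on your PATH
 module list               # show currently loaded modules
 module purge              # unload everything and start clean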

Installed software and environment modules on our different clusters are generally different. See modules for more information.

Singularity containers

A singularity container is a bit like a separate computer in itself which just runs on the same hardware. Thus, software in different containers won't conflict, and a container can provide a complete environment including a different operating system, libraries, etc. A disadvantage, however, is that you can use only the software installed in the container; software on the system outside of the container is not visible from the inside. Containers are read-only and cannot be changed. An exception is that some package managers, like conda, may allow installing software while you are inside the container. Additions made by conda this way actually reside in the .local directory in your home directory.

Using singularity is easy. Type, for example,

singularity shell /storage/singularity/tensorflow.sif
python3

and you can use many Python packages for machine learning. We have containers with statistics software, optimization, molecular chemistry, and more. See Singularity for more details and a list of software in our containers.
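
You can also run a single command inside a container without starting an interactive shell, using singularity exec; for example (the script name is a placeholder):

 singularity exec /storage/singularity/tensorflow.sif python3 myscript.py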

Old software versions

Sometimes, you may need a specific version of some software package from a few years ago. We'll try. If the software version is not too far in the past, we may be able to install the software in a module or in a singularity container. However, installing an older or more complicated package may require recreating an entire software ecosystem from a certain point in computing history years ago, which could be overwhelming or impossible.

File Transfer

Linux or Mac

On a Linux or Mac computer, you can use the file transfer utilities rsync, scp, and sftp on your computer to transfer files and entire directories between your computer and the clusters. These utilities are normally a part of the system; if not, you can install them from your Linux distribution. Rsync is recommended. Typing man rsync should give you the manual for the system you are on. Rsync can transfer file trees recursively and resume a transfer which was interrupted.
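
For example, to upload a local directory to your home directory on the cluster, and to download it back (mydata and username are placeholders):

 rsync -av mydata/ username@math-alderaan.ucdenver.pvt:mydata/   # upload recursively
 rsync -av username@math-alderaan.ucdenver.pvt:mydata/ mydata/   # download recursively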

Windows

On a current Windows PC, you can use scp and sftp from the command window (a.k.a. the Powershell window). Current Windows 10 and 11 have the OpenSSH client built in.

From a Website

You can download a file from a website simply using wget followed by the URL of the file. You can get the URL of a file posted on the web by right-clicking it and selecting something like "Copy link address".
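
For example (the URL is a placeholder):

 wget https://example.com/dataset.csv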

Github

The easiest way to download files from Github is to clone the entire repository. On the repository main page, click the green "Code" button and copy the link. Then

git clone <the link you just copied>

You can use the https link if you want to clone the repository for reading only. If you want to push your changes to Github in the future, you need to use ssh. It is strongly recommended to create a separate key secured by a strong passphrase for this, as sketched below. Otherwise, the security of your Github account is only as good as the protection of your files here - anyone who gains administrator access here can log into your Github account.
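
A sketch of creating such a key (the file name and comment are placeholders; ssh-keygen will prompt for the passphrase):

 ssh-keygen -t ed25519 -f ~/.ssh/id_ed25519_github -C "github key"

Then add the public key ~/.ssh/id_ed25519_github.pub to your Github account settings.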

Dropbox

coming soon

Globus

Globus is a free service which can transfer large files (many GB and TB) between servers on the internet using a simple web interface and without supervision. See the Globus section for how to use Globus here.

Requesting Information about the Environment

Queues

Jobs are submitted to compute nodes through the scheduler. To see the queues (called "partitions") on the scheduler, type

$ sinfo
PARTITION         AVAIL  TIMELIMIT  NODES  STATE NODELIST
math-alderaan        up 7-00:00:00      1 drain* math-alderaan-c28
math-alderaan        up 7-00:00:00      4    mix math-alderaan-c[01,08-10]
math-alderaan        up 7-00:00:00     25  alloc math-alderaan-c[04-07,11-27,29-32]
math-alderaan        up 7-00:00:00      2   idle math-alderaan-c[02-03]
math-alderaan-gpu    up 7-00:00:00      1    mix math-alderaan-h01
math-alderaan-gpu    up 7-00:00:00      1  alloc math-alderaan-h02
math-colibri-gpu     up   infinite     24   idle math-colibri-c[01-24]
math-score           up   infinite      5   idle math-score-c[01-05]
chem-xenon           up   infinite      6   unk* chem-xenon-c[01-06]
clas-interactive     up   infinite      1   idle math-colibri-i02
math-alderaan-osg    up 1-00:00:00      1 drain* math-alderaan-c28
math-alderaan-osg    up 1-00:00:00      4    mix math-alderaan-c[01,08-10]
math-alderaan-osg    up 1-00:00:00     25  alloc math-alderaan-c[04-07,11-27,29-32]
math-alderaan-osg    up 1-00:00:00      2   idle math-alderaan-c[02-03]
clas-dev             up   infinite      1   idle clas-devnode-c01

Nodes

To see a list of all nodes, use:

 $ sinfo -N
 NODELIST           NODES         PARTITION STATE 
 chem-xenon-c01         1        chem-xenon unk*  
 chem-xenon-c02         1        chem-xenon unk*  
 chem-xenon-c03         1        chem-xenon unk*  
 chem-xenon-c04         1        chem-xenon unk*  
 chem-xenon-c05         1        chem-xenon unk*  
 chem-xenon-c06         1        chem-xenon unk*  
 clas-rcdesktop-01      1    clas-rcdesktop down* 
 math-alderaan-c01      1     math-alderaan alloc 
 math-alderaan-c02      1     math-alderaan alloc 
 math-alderaan-c03      1     math-alderaan alloc 
 math-alderaan-c04      1     math-alderaan alloc 
 math-alderaan-c05      1     math-alderaan alloc 
 math-alderaan-c06      1     math-alderaan alloc 
 math-alderaan-c07      1     math-alderaan alloc 
 math-alderaan-c08      1     math-alderaan alloc 
 math-alderaan-c09      1     math-alderaan alloc 
 math-alderaan-c10      1     math-alderaan alloc 
 math-alderaan-c11      1     math-alderaan alloc 
 math-alderaan-c12      1     math-alderaan alloc 
 math-alderaan-c13      1     math-alderaan alloc 
 math-alderaan-c14      1     math-alderaan alloc 
 math-alderaan-c15      1     math-alderaan alloc 
 math-alderaan-c16      1     math-alderaan mix   
 math-alderaan-c17      1     math-alderaan idle  
 math-alderaan-c18      1     math-alderaan idle  
 math-alderaan-c19      1     math-alderaan idle  
 math-alderaan-c20      1     math-alderaan idle  
 math-alderaan-c21      1     math-alderaan idle  
 math-alderaan-c22      1     math-alderaan idle  
 math-alderaan-c23      1     math-alderaan idle  
 math-alderaan-c24      1     math-alderaan idle  
 math-alderaan-c25      1     math-alderaan idle  
 math-alderaan-c26      1     math-alderaan idle  
 math-alderaan-c27      1     math-alderaan idle  
 math-alderaan-c28      1     math-alderaan idle  
 math-alderaan-c29      1     math-alderaan idle  
 math-alderaan-c30      1     math-alderaan idle  
 math-alderaan-c31      1     math-alderaan idle  
 math-alderaan-c32      1     math-alderaan idle  
 math-alderaan-h01      1 math-alderaan-gpu idle  
 math-alderaan-h02      1 math-alderaan-gpu idle  
 math-colibri-c01       1  math-colibri-gpu idle  
 math-colibri-c02       1  math-colibri-gpu idle  
 math-colibri-c03       1  math-colibri-gpu idle  
 math-colibri-c04       1  math-colibri-gpu unk*  
 math-colibri-c05       1  math-colibri-gpu unk*  
 math-colibri-c06       1  math-colibri-gpu unk*  
 math-colibri-c07       1  math-colibri-gpu unk*  
 math-colibri-c08       1  math-colibri-gpu unk*  
 math-colibri-c09       1  math-colibri-gpu unk*  
 math-colibri-c10       1  math-colibri-gpu unk*  
 math-colibri-c11       1  math-colibri-gpu unk*  
 math-colibri-c12       1  math-colibri-gpu unk*  
 math-colibri-c13       1  math-colibri-gpu idle  
 math-colibri-c14       1  math-colibri-gpu idle  
 math-colibri-c15       1  math-colibri-gpu idle  
 math-colibri-c16       1  math-colibri-gpu idle  
 math-colibri-c17       1  math-colibri-gpu idle  
 math-colibri-c18       1  math-colibri-gpu idle  
 math-colibri-c19       1  math-colibri-gpu idle  
 math-colibri-c20       1  math-colibri-gpu idle  
 math-colibri-c21       1  math-colibri-gpu idle  
 math-colibri-c22       1  math-colibri-gpu idle  
 math-colibri-c23       1  math-colibri-gpu idle  
 math-colibri-c24       1  math-colibri-gpu idle  
 math-score-c01         1        math-score unk*  
 math-score-c02         1        math-score unk*  
 math-score-c03         1        math-score idle  
 math-score-c04         1        math-score idle  
 math-score-c05         1        math-score idle

It looks confusing, but there is a method to the madness in the naming convention. Obviously, math-colibri and math-score identify the cluster/building the servers are in, while the -c## and -i## suffixes stand for compute and interactive. The c## servers are usually part of the queuing system and the i## ones are for interactive use. Again, never ssh to compute nodes directly.

Submitting Jobs to the Scheduler

Submitting a job

The sbatch job_script command is used to submit a job into a queue. Your job starts executing in the directory where it was submitted, so submit it from a directory accessible to all compute nodes, such as a subdirectory of your home directory. You can add switches to the sbatch command, but it is recommended to make them a part of your batch script so that you do not have to type them every time. Please do not use more cores than the number of tasks specified in your script.
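
For example (my_job.sh is a placeholder for your batch script):

 sbatch my_job.sh          # submit; SLURM prints the new job ID
 squeue -u $USER           # check the status of your jobs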

Template batch job scripts

The template batch scripts and simple examples to run are available. Get your copy by

    git clone https://github.com/ccmucdenver/templates.git

To build the examples, type make in the examples directory.

Please do not request the number of nodes on Alderaan by --nodes or -N, unless you really need entire nodes for some reason. Request only the CPU cores you need by --ntasks, then the node or nodes you use can be shared with others.

Single-core job

This script will be sufficient for many jobs, such as those you code yourself which do not use multiprocessing.

 #!/bin/bash
 # A simple single core job template
 #SBATCH --job-name=mpi_hello_single
 #SBATCH --partition=math-alderaan
 #SBATCH --time=1:00:00                    # Max wall-clock time
 #SBATCH --ntasks=1                        # number of cores, leave at 1
 examples/hello_world_fortran.exe          # replace by your own executable

If you run an application that can use more cores, you can request the number of cores in the --ntasks parameter instead of 1. Your allocation will be charged for the time of all the cores you requested, whether you use them or not.

If you expect that your application will use more than 8GB of memory (our nodes have 512GB of memory and 64 cores each), you should request more tasks: about the expected memory usage in GB divided by 8. For example, a job expected to use 64GB should request --ntasks=8. Otherwise the node memory may get overloaded when the machine gets busy with many jobs, and everyone's jobs may stall or crash. Note: this may change once we start allocating memory use, but at the moment we do not.

Multiple single-core jobs using arrays

 #!/bin/bash
 # Multiple single core jobs using array template
 #SBATCH --job-name=mpi_hello_single
 #SBATCH --partition=math-alderaan
 #SBATCH --time=1:00:00                    # Max wall-clock time
 #SBATCH --ntasks=1                        # number of cores, leave at 1
 #SBATCH --array=1-5,10-11                 # specifies to submit this script 7 times where array values are 1, 2, 3, 4, 5, 10, and 11.

 examples/hello_world_fortran.exe          # replace by your own executable

SLURM job arrays simplify running multiple instances of the same job script using a single batch script. The above example demonstrates submitting the hello_world_fortran.exe executable seven times, with array values 1, 2, 3, 4, 5, 10, and 11.

Helpful Directives/Variables:

  • %a: adds the array task ID to a file name. (Filename patterns such as %a are expanded in the --output and --error options, not in --job-name.)

    #SBATCH --output=mpi_hello_single_%a.out
    
  • %[number]: limits the number of array tasks running at a time.

    #SBATCH --array=1-1000%10
    

    A SLURM array job automatically schedules its tasks within your allocated resources. If you wish to conserve resources for other work, it can be advantageous to limit how many array tasks run simultaneously. In the example above, a total of 1000 jobs are executed, with at most 10 running concurrently at any given time.

  • SLURM_ARRAY_TASK_ID: an environment variable that holds the array value. You can use it to pass the array value to the script you intend to execute, as in the sketch after this list.

    python example_script.py ${SLURM_ARRAY_TASK_ID}
    
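
A minimal sketch of a complete array job script, assuming a hypothetical program my_program and input files data_1.txt, data_2.txt, etc.:

 #!/bin/bash
 #SBATCH --job-name=array_example
 #SBATCH --partition=math-alderaan
 #SBATCH --time=1:00:00                    # Max wall-clock time
 #SBATCH --ntasks=1                        # one core per array task
 #SBATCH --array=1-5,10-11                 # task IDs 1-5, 10, and 11
 # Each array task picks its own input file based on its task ID
 INPUT="data_${SLURM_ARRAY_TASK_ID}.txt"
 ./my_program "$INPUT"                     # my_program is a placeholder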

A simple MPI job template

 #!/bin/bash
 # alderaan_mpi.sh
 # A simple MPI job template
 #SBATCH --job-name=mpi_hello
 #SBATCH --partition=math-alderaan
 #SBATCH --time=1:00:00                    # Max wall-clock time
 #SBATCH --ntasks=360                      # Total number of MPI processes, no need for --nodes
 mpirun examples/mpi_hello_world.exe       # replace by your own executable, no need for -np

A more general MPI job template

You can request the number of nodes. The scheduler will then split the tasks over the nodes.

 #!/bin/bash
 # alderaan_mpi_general.sh
 # A more general MPI job template
 #SBATCH --job-name=mpi_hello
 #SBATCH --partition=math-alderaan    
 #SBATCH --nodes=2                   # Number of requested nodes
 #SBATCH --time=1:00:00              # Max wall-clock time
 #SBATCH --ntasks=10                 # Total number of tasks over all nodes, max 64*nodes
 mpirun -np 10 examples/mpi_hello_world.exe # replace by your own executable and number of processes
 # do not use more MPI processes than --ntasks

As noted above, please request whole nodes only when you really need them; otherwise request just the CPU cores you need with --ntasks so the nodes can be shared with others.

How to use GPU

How to run with GPUs on Alderaan

The partition math-alderaan-gpu has two high-memory/GPU nodes, math-alderaan-h[01,02], with two NVIDIA A100 40GB GPUs and 2TB of memory each. Use --partition=math-alderaan-gpu with --gres=gpu:a100:1 to request one GPU or --gres=gpu:a100:2 to request two GPUs. At the moment, Alderaan does not support explicit memory allocation by the --mem flag.

Please do not use Alderaan GPUs without allocating them by --gres as above first. Please do not request an entire node on Alderaan by --nodes or -N unless you really need all of it; request only the CPU cores you need by --ntasks. Large-memory jobs and GPU jobs can share the same node.

An example job script:

 #!/bin/bash
 #SBATCH --job-name=gpu
 #SBATCH --gres=gpu:a100:1
 #SBATCH --partition=math-alderaan-gpu
 #SBATCH --time=1:00:00                    # Max wall-clock time
 #SBATCH --ntasks=1                        # number of cores
 singularity exec /storage/singularity/tensorflow.sif python3 yourgpucode.py

Of course, instead of singularity you can run other GPU code on one of the GPU nodes directly. The nodes currently have CUDA 11.2 installed. You will have to install tensorflow in your account yourself. A compatible version is [tensorflow 2.4.0](https://docs.nvidia.com/deeplearning/frameworks/tensorflow-release-notes/rel_21-03.html).

It is recommended to use the tensorflow singularity container because it has updated CUDA (11.4) and a version of tensorflow compatible with the CUDA version.

Interactive jobs with GPU on Alderaan

From the command line,

 srun -p math-alderaan-gpu --time=2:00:0 -n 1 --gres=gpu:a100:1 --pty bash -i

will give you an interactive shell on one of the GPU nodes with one GPU allocated. You can then start the Singularity shell:

singularity shell /storage/singularity/tensorflow.sif

You can also start the Singularity shell directly:

srun -p math-alderaan-gpu --time=2:00:0 -n 1 --gres=gpu:a100:1 --pty singularity shell /storage/singularity/tensorflow.sif

will allocate one GPU and one core, and run an interactive Singularity shell.

How to run with GPUs on Colibri

To use the Colibri GPUs, do not use --gres; instead reserve a whole node with --nodes=1. Singularity containers work on Colibri, but current versions of tensorflow do not support the CPUs on Colibri. You can use an older version instead:

#!/bin/bash
#SBATCH --job-name=gpu
#SBATCH --partition=math-colibri-gpu
#SBATCH --time=1:00:00                  # Max wall-clock time
#SBATCH --nodes=1                       # number of nodes
singularity exec /storage/singularity/tensorflow-v1.3.sif python3 yourgpucode.py

Interactive jobs

Remember you should not ssh directly to a node because it would interfere with jobs scheduled to run on that node. For interactive access to a compute node, do this instead:

srun -p math-alderaan --time=2:00:0 -n 1 --pty bash -i

This will request a session for you as a job in a single core slot on a compute node in the math-alderaan partition for up to 2 hours. After the job starts, your session is transferred to the node. The job will end when you exit or the time runs out. Of course, you can do the same for other partitions and add other flags, such as to request more cores or a GPU.

To start an interactive job on Alderaan with a GPU:

srun -p math-alderaan-gpu --time=2:00:0 -n 1 --gres=gpu:a100:1 --pty bash -i

Viewing Job Queues, Job Status, and System Status

The command squeue will show one line for each job running on the system.

The command sinfo will show a summary of jobs and partitions status on the system:

PARTITION         AVAIL  TIMELIMIT  NODES  STATE NODELIST
math-alderaan        up 7-00:00:00     10    mix math-alderaan-c[01-10]
math-alderaan        up 7-00:00:00      8  alloc math-alderaan-c[11-15,29,31-32]
math-alderaan        up 7-00:00:00     14   idle math-alderaan-c[16-28,30]
math-alderaan-gpu    up 7-00:00:00      1   drng math-alderaan-h01
math-alderaan-gpu    up 7-00:00:00      1    mix math-alderaan-h02
math-colibri-gpu     up   infinite     24   idle math-colibri-c[01-24]
math-score           up   infinite      5   idle math-score-c[01-05]
chem-xenon           up   infinite      6  down* chem-xenon-c[01-06]
clas-interactive     up   infinite      1  down* math-score-i01
clas-interactive     up   infinite      1   idle math-colibri-i02
math-alderaan-osg    up 1-00:00:00     10    mix math-alderaan-c[01-10]
math-alderaan-osg    up 1-00:00:00      8  alloc math-alderaan-c[11-15,29,31-32]
math-alderaan-osg    up 1-00:00:00     14   idle math-alderaan-c[16-28,30]
clas-dev             up   infinite      1   idle clas-devnode-c01

Real-time system status, including temperature, load, and the partitions from sinfo, is available in News and Status Updates.

We will be happy to install software and build containers for you, do not hesitate to ask!

Building Your Own Software

Here are the best practices when you compile and link your own software:

  • Use the math-alderaan head node to build software for use on the Alderaan cluster. Use module avail to see which tools are available in modules; a minimal build session is sketched after this list. We can add other tools and package them in modules on request.

  • Use clas-compute or math-colibri-i02 to build software for the Colibri cluster, and clas-compute or math-score-i01 for the Score cluster. You can download and build libraries and other packages in your own account.

  • Alderaan runs CentOS 8, while clas-compute and the Colibri and Score clusters run CentOS 7. Software built on one will normally not work on the other.
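
For example, a minimal build session on math-alderaan might look like this (the module name and source file are placeholders; check module avail for actual names):

 module load gcc                  # hypothetical compiler module
 gcc -O2 -o mycode mycode.c       # build your program
 ./mycode                         # quick test only; run real jobs through the scheduler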