Setting up and using your research cluster account
Modules
The cluster has a lot of scientific libraries installed, most of which you won’t need. Rather than having everything always accessible to everyone, you choose what will be in your paths using modules. All modules can be seen by running module avail; the ones currently set up for you can be seen by running module list. You load a module by running module load <modulename>, and remove one by running module rm <modulename>. Loading a module will also load any dependencies.
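For example, a typical session (anaconda3 is the module we'll use below):

module avail            # show every module installed on the cluster
module list             # show what's currently loaded in your session
module load anaconda3   # load a module (plus its dependencies)
module rm anaconda3     # unload it again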
Drives
Your home directory is on a relatively small disk - it's OK for code, but not for large datasets, because everybody has to share that small disk. /mnt/beegfs is where you want to store datasets for your computation to use. It's big (640 TB) and fast for the compute nodes to read from.
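For example, to copy a dataset from your own machine into your beegfs folder (the dataset name is just illustrative; creating the folder itself is covered in the setup steps below):

scp -r my_dataset <username>@borg.hpc.usna.edu:/mnt/beegfs/<username>/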
Setting up your account
- ssh -X <username>@borg.hpc.usna.edu
- Now, get yourself access to the already-installed anaconda: module load anaconda3
- In order to install new packages locally, you need to make a python "virtual environment": conda create -y -n <some short name you like> python=3.9
- Change your ~/.bashrc to include module load anaconda3 so that happens automatically.
- Run conda init
- To the bottom of your .bashrc, add conda activate <whatever you named the environment>
- Sign out, sign back in, and make sure that your virtual environment's name is in the parentheses to the left of your command prompt.
- Run conda install -c conda-forge numpy scipy matplotlib scikit-learn pandas to install all those packages in your virtual environment.
- cd /mnt/beegfs and mkdir <your username>. This is where you'll put datasets.
- If you want, go back to your home directory and make a soft link to that folder in beegfs, so it's easy for you to navigate to. (A sketch of these last two steps follows this list.)
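A minimal sketch of those last two steps; the link name "beegfs" is just an example:

cd /mnt/beegfs
mkdir <your username>                       # your personal dataset folder
cd ~
ln -s /mnt/beegfs/<your username> beegfs    # soft link back home; "beegfs" is an example name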
 
OK, how do I run code?
When you first log in, you are on a login node. This is appropriate for coding, small scripts, data transfer, etc. Login nodes are not appropriate for actual computation. Running code on a login node is slower, and it also slows down everybody else’s login node experience. Rude!
Instead, you need to ask the machine for resources on a compute node. We do this using a program called Slurm, which allocates the machine's finite resources among all the computation jobs the various users want to run. You may even have to wait! (You will probably not have to wait long.)
You can see all the jobs that are currently running (or waiting to run) by running the command squeue. You can cancel one of your jobs by running scancel <jobnumber>.
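For example (the job number is made up):

squeue                  # everything queued or running on the cluster
squeue -u <username>    # just your own jobs
scancel 12345           # cancel job 12345, using the JOBID column from squeue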
The most common way to submit a computation job is with a submission script. Below is an example script called cluster.sh:
#!/bin/bash
#SBATCH --job-name=test_job
#SBATCH --output=afile.txt
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=20
#SBATCH --time=10:00
#SBATCH --partition=hpc
cd some_directory             # replace with the directory your code lives in
python some_script.py         # hypothetical placeholder; put the commands you actually want to run here
The shebang line must be first, and says what shell you're using to interpret the rest of the code. The comments starting with #SBATCH tell Slurm what requirements your job has. The job will show up in squeue as test_job. Output will appear in afile.txt in the directory you submitted the file from. We're only running 1 task (we're not doing any parallel computation using MPI or something). I've requested all 20 CPUs on the 1 node I'm going to get. (If you're taking HPC: ntasks is the number of processes you'll run (i.e., MPI), while cpus-per-task is useful for threads (i.e., OpenMP).) I've put in a walltime line telling the queue that if this job is still running after 10 minutes, kill it. That line is optional, but voluntarily including some reasonable ceiling is encouraged, in case your code has an infinite loop or something. A good list of SBATCH options is in the Slurm sbatch documentation. partition is used to tell the machine what kind of queue you want to be in. Here are some you may be interested in:
- hpc - the default, which gives you nodes with 20 cores and 192 GB of memory. There are 35 of these. Unless you're doing something special, you're on this partition.
- debug - same nodes as hpc. You go to the front of the line, but you shouldn't use this for more than a short while. Not appropriate for running jobs, just appropriate for actually debugging and developing code.
- himem - gives you nodes with 20 cores and 768 GB of memory. There are 4 of these.
- gpu - gives you access to nodes with 4 Nvidia V100s, 10 cores per node, and 128 GB of memory. There are 2 of these.
The non-SBATCH lines are the code you want to have run: cd-ing to directories, running python code, whatever it is.
To submit this, I run sbatch cluster.sh on the login node, and the work is submitted to the job scheduler. Once the requirements I've asked for in the SBATCH lines can be met, the job starts to run.
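As an illustration, a submission script for the gpu partition might look like the sketch below. The --gres line is standard Slurm syntax for requesting GPUs, but the exact GRES name can vary by cluster, so check before relying on it; the python line is a hypothetical placeholder.

#!/bin/bash
#SBATCH --job-name=gpu_test
#SBATCH --output=gpu_out.txt
#SBATCH --ntasks=1
#SBATCH --partition=gpu
#SBATCH --gres=gpu:1      # assumed GRES name; requests 1 of the node's 4 V100s
#SBATCH --time=30:00
python train_model.py     # hypothetical placeholder for your GPU code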
Interactive jobs
If you want to see the job run, or interact with it by having a shell on the compute node, you can submit an interactive job. For example, srun --pty bash -i gets it done. Once it’s scheduled, you get another prompt. From there, you can do whatever it is you’re there to do.
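srun accepts the same kinds of resource requests you'd put in #SBATCH lines; for example (the specific numbers are just illustrative):

srun --pty bash -i                                                    # defaults: one task on the default partition
srun --partition=debug --cpus-per-task=4 --time=30:00 --pty bash -i   # a 30-minute, 4-CPU shell on the debug partition

When you're finished, exit the shell so the allocation is released back to the pool.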