HPC Training
Parallel computing in High-Performance Computing (HPC) involves breaking down large computational tasks into smaller parts that can be processed simultaneously across multiple processors or computing nodes. This approach accelerates problem-solving by dividing the workload, enabling the system to handle vast amounts of data and complex simulations much faster than a single processor could.
Parallel computing is fundamental in HPC because it maximizes resource efficiency, reduces computational time, and allows large-scale scientific, engineering, and data-intensive problems to be solved. Techniques include data parallelism, where large datasets are split across nodes, and task parallelism, where distinct tasks are executed concurrently. In practice, HPC applications commonly use frameworks such as MPI (Message Passing Interface), for communication between processes across nodes, and OpenMP (Open Multi-Processing), for shared-memory parallelism within a node.
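As a rough sketch of how these two frameworks are typically combined on a cluster, the hypothetical Slurm batch script below launches an MPI program with several ranks per node and gives each rank a pool of OpenMP threads. The module names, node counts, and the executable name (`mpi_app`) are placeholder assumptions that vary from system to system.

```bash
#!/bin/bash
#SBATCH --job-name=parallel-demo
#SBATCH --nodes=2                 # distribute the job across two compute nodes
#SBATCH --ntasks-per-node=4       # 4 MPI ranks per node (message passing between ranks)
#SBATCH --cpus-per-task=8         # 8 cores per rank for OpenMP threads (shared memory)
#SBATCH --time=00:30:00

# Hypothetical module names; sites publish their own compiler/MPI module lists.
module load gcc openmpi

# Each MPI rank may spawn this many OpenMP threads.
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK

# "mpi_app" stands in for an MPI+OpenMP executable built for this system.
srun ./mpi_app input.dat
```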
For anyone using advanced computing resources, it's helpful to understand a bit about the hardware so you can see what factors influence how well applications run. This session will cover some key points:
- CPUs (Processors): Learn about what makes a CPU powerful, such as the number of cores, hyperthreading (which lets a single physical core run two hardware threads), and instruction sets (the kinds of operations a CPU can execute).
- Compute Node Anatomy: Each compute node is like a mini-computer in the cluster, with its own processors, memory, storage devices, and sometimes special accelerators like GPUs.
- Cluster Structure: How clusters are set up, including the different types of nodes (login and compute nodes), how they are interconnected, and how they handle file storage.
- Using Tools to Check Hardware: Finally, you'll learn to use Linux tools to view hardware details, check system usage, and monitor performance; a few such commands are sketched below.
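The exact tools are not fixed, but on most Linux clusters the following standard commands are a reasonable starting point (the GPU query applies only to GPU-equipped nodes):

```bash
lscpu        # CPU model, core and thread counts, and supported instruction-set flags
free -h      # total and available memory in human-readable units
lsblk        # block devices: local disks, partitions, and mount points
nvidia-smi   # NVIDIA GPU inventory and utilization, if GPUs are present
top          # live view of per-process CPU and memory usage
```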
This video introduces the concept of a distributed batch job scheduler (what it is, why it exists, and how it works) using the Slurm Workload Manager as a reference tool and testbed. It explains how to write and submit a first job script to an HPC system using Slurm as the scheduler. The video also covers best practices for structuring batch job scripts, leveraging Slurm environment variables, and requesting resources from the scheduler to optimize task completion time.
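A minimal sketch of such a first job script is shown below; the partition name, memory, and time limit are placeholder values that depend on the system you are using.

```bash
#!/bin/bash
#SBATCH --job-name=first-job
#SBATCH --partition=compute       # placeholder partition; check your site's documentation
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --mem=2G
#SBATCH --time=00:10:00
#SBATCH --output=%x-%j.out        # %x = job name, %j = job ID

# Slurm environment variables describe the allocation at run time.
echo "Job $SLURM_JOB_ID running on $SLURM_JOB_NODELIST"
echo "CPUs available on this node: $SLURM_CPUS_ON_NODE"

# Replace with the real work; this just proves where the job ran.
hostname
sleep 30
```

The script would be submitted with `sbatch first-job.sh` and monitored with `squeue -u $USER`.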
This session covers typical methods for transitioning computations to HPC resources, including using pre-installed applications via Linux environment modules, compiling code from source with recommended compilers, libraries, and optimization flags, setting up Python and R environments (including conda), managing workflows, and utilizing containerized solutions with Singularity. General principles are explained, with hands-on activities provided on SDSC resources.
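A condensed sketch of what these approaches can look like on the command line follows; the module names, the `-march=native` optimization flag, and the file names (`my_app.c`, `my_image.sif`, `analysis.py`) are illustrative assumptions rather than site-specific recommendations.

```bash
# Pre-installed software via Linux environment modules (names vary by site).
module avail
module load gcc openmpi

# Compiling from source with optimization flags suited to the node's CPUs.
gcc -O3 -march=native -o my_app my_app.c

# An isolated conda environment for a Python or R workflow.
conda create -n myproject python=3.11 numpy pandas
conda activate myproject

# Running a tool inside a Singularity container image.
singularity exec my_image.sif python analysis.py
```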
This session introduces high-throughput computing (HTC) and many-task computing (MTC) on HPC systems, focusing on how to harness consistent, aggregate compute power over time. HTC workloads tackle large problems by completing numerous smaller subtasks, ideal for parameter sweeps or data analysis tasks. The session covers using the Slurm Workload Manager to set up these workflows with job arrays and dependencies, discusses common challenges in HTC/MTC setups, and explains job bundling strategies and when to apply them. Additional HTC/MTC workflow topics will be addressed as time allows.
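For example, a parameter sweep can be expressed as a Slurm job array in which each subtask selects its input from the array index; the executable and file naming below are placeholders.

```bash
#!/bin/bash
#SBATCH --job-name=sweep
#SBATCH --array=1-100%10          # 100 subtasks, at most 10 running at the same time
#SBATCH --ntasks=1
#SBATCH --time=00:15:00

# SLURM_ARRAY_TASK_ID tells each subtask which piece of the sweep to handle.
echo "Running parameter set $SLURM_ARRAY_TASK_ID"
./simulate --input params_${SLURM_ARRAY_TASK_ID}.txt   # placeholder executable and inputs
```

A downstream analysis job can then be chained to the whole array with a dependency, e.g. `jid=$(sbatch --parsable sweep.sh)` followed by `sbatch --dependency=afterok:$jid collect.sh` (script names hypothetical).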