AWS PCS Adds Support for Slurm v25.05 to Boost HPC Reliability

AiTech365 Bureau

6 hours ago

AWS Parallel Computing Service (PCS) now supports Slurm 25.05, and users can create PCS clusters that run the newer version. The release introduces improved multi-cluster sackd configuration and better requeue behavior for failed instance launches: login nodes can handle multiple clusters without a sackd reconfiguration or restart, and jobs are retried automatically when an instance launch fails because of a capacity shortage, which improves overall cluster reliability. PCS is an AWS managed service that makes it easier to deploy and scale high performance computing (HPC) workloads on AWS with Slurm.
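To make the requeue behavior concrete, here is a minimal Python sketch of the general idea: a job whose instance fails to launch goes back into the queue instead of being marked as failed. It is a toy simulation, not Slurm or PCS code. The names `Job`, `run_with_requeue`, and `launch_instance`, as well as the `max_attempts` limit, are assumptions made up for illustration.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    attempts: int = 0  # number of launch attempts made so far


def run_with_requeue(jobs, launch_instance, max_attempts=3):
    """Toy scheduler loop that requeues a job when its launch fails.

    launch_instance is a hypothetical callable that returns True when an
    instance starts and False when capacity is unavailable. The
    max_attempts cap is an assumption of this sketch, not a Slurm setting.
    """
    queue = deque(jobs)
    completed, failed = [], []
    while queue:
        job = queue.popleft()
        job.attempts += 1
        if launch_instance(job):
            completed.append(job.name)
        elif job.attempts < max_attempts:
            queue.append(job)  # requeue instead of failing the job
        else:
            failed.append(job.name)  # give up after max_attempts tries
    return completed, failed
```

In this sketch a job that hits a temporary capacity shortage simply waits for its next turn in the queue, which is the user-visible effect the release describes: the job is retried without anyone resubmitting it.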

With support for Slurm v25.05, AWS PCS addresses two operational pain points in HPC: administrators have more control over multi-cluster access, and users see fewer interruptions from failed instance launches. The automatic requeue makes scheduling more robust during capacity shortages, so workloads keep moving when capacity is tight. With this release, organisations using AWS for HPC get simpler operations and higher reliability for compute-intensive jobs.

Read More: AWS Parallel Computing Service (PCS) now supports Slurm v25.05