Slurm Bridge

This functionality requires CAESES version 5.4.0 or later.

The Slurm Bridge module included in the CTC Server replaces the SSH resource manager. It acts as a bridge to Slurm (Simple Linux Utility for Resource Management), enabling CAESES to:

  • Submit jobs to Slurm queues
  • Monitor job states
  • Cancel jobs
  • Allow monitoring of jobs via a lightweight web interface

Configuration

The Slurm Bridge uses the following additional configuration section in the CTC Server configuration file.

[SlurmBridge]

  • datadir / core-data-dir
    • Base directory used by the CTC Server to manage runtime data.
    • Default: datadir setting in the [Core] section.
    • Contains subdirectories:
      • work/ → scratch directory for CAESES input data and Slurm job execution
      • scripts/ → stores user-defined bash scripts that are submitted to Slurm using sbatch
  • removeInputDir
    • Controls handling of existing job input directories.
    • 1 → Always remove and recreate the directory for a new job
    • 0 → Keep the existing directory; if it already exists and is not empty, create a new directory with a numeric counter appended (e.g., _1, _2, …)
  • oversubscribe
    • Enables or disables Slurm oversubscription.
    • 0 → Oversubscription disabled
    • 1 → Oversubscription enabled
Example ctcconfig.ini
[...]

[SlurmBridge]
datadir=/mnt/data/slurmbridge
removeInputDir=0
oversubscribe=1

[...]

Adding an Application

Applications are generated automatically from the bash scripts located in the scripts/ directory inside the datadir. To add a new application, create a bash script in that folder (it will be submitted to Slurm using sbatch) and restart the CTC Server.
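As a minimal sketch of what such an application script could look like, the snippet below writes a small script named allrun.sh into a scripts directory and marks it executable. The SCRIPTS_DIR variable, its temporary default, and the `srun hostname` placeholder workload are assumptions made for illustration; point SCRIPTS_DIR at <datadir>/scripts and replace the workload with your solver call.

```shell
# Assumption: SCRIPTS_DIR should be <datadir>/scripts of your CTC Server.
# It defaults to a throwaway temp directory so the sketch is safe to try.
SCRIPTS_DIR="${SCRIPTS_DIR:-$(mktemp -d)/scripts}"
mkdir -p "$SCRIPTS_DIR"

# Write the application script. The quoted 'EOF' keeps $(...) and ${...}
# unexpanded so they are evaluated when Slurm runs the job, not now.
cat > "$SCRIPTS_DIR/allrun.sh" <<'EOF'
#!/bin/bash
# Placeholder workload: print the node each task runs on.
# Replace this with the actual solver invocation.
echo "Job ${SLURM_JOB_ID:-unknown} started in $(pwd)"
srun hostname
EOF

chmod +x "$SCRIPTS_DIR/allrun.sh"
```

After placing the script in the real scripts/ directory, restart the CTC Server so that the new application is picked up.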

Usage with CAESES

  1. In CAESES, add a new Host for the Resource Manager in the Execution Settings of the software connector. Set the SSH Resource Manager Address to http://<hostname>:<port>/RPC2 and enter your CTC Server credentials.

[Figure: Resource Manager setup]

  2. Configure it to connect to your running CTC Server.
  3. Additional Slurm-specific parameters become available when this option is selected. In this example, allrun.sh is the script configured on the CTC Server. The simulation uses 16 cores with one thread per core/CPU (16 tasks × 1 CPU per task), and all CPUs are on the same node. No particular priority (nice) is given to the job, and no particular node is prescribed. For further information, refer to the documentation of your Slurm installation.

[Figure: Configuration]

  4. Job submission is then handled via CAESES.
Authentication Required

CAESES requires valid CTC Server credentials to submit and monitor jobs. Default credentials are admin / admin, but you should change the password immediately for security.

Job Directory Structure

When jobs are submitted, directories inside the work/ folder in the datadir are created automatically. The naming follows a deterministic pattern to allow multiple users and CAESES sessions to run in parallel without conflicts.

Naming Scheme

The job directory name is built step by step:

  1. Project name → Base identifier

  2. CAESES instance ID → Appended with an underscore. Every time CAESES is started, it generates a new unique ID.

    <projectName>_<instanceID>
  3. Design engine name and ID (if applicable) → Appended to distinguish design engine jobs from project-level jobs.

    <projectName>_<instanceID>_<designEngineName>_<instanceID>
  4. Design name → Final part of the directory name.

    <projectName>_<instanceID>_<designEngineName>_<instanceID>_<designName>
Example
MyProject_1234_DE1_1234_DesignA
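
The naming steps can be sketched as a small shell function. This is only an illustration of the scheme described above, not the code used by the CTC Server; the function name job_dir_name and its argument order are hypothetical.

```shell
# Build a job directory name from its parts (illustrative only).
# Usage: job_dir_name <project> <instanceID> <designEngine> <design>
# Pass an empty string for <designEngine> for project-level jobs.
job_dir_name() {
    project=$1
    instance=$2
    engine=$3
    design=$4
    name="${project}_${instance}"
    if [ -n "$engine" ]; then
        # Design engine jobs carry the engine name and the ID as well.
        name="${name}_${engine}_${instance}"
    fi
    printf '%s\n' "${name}_${design}"
}
```

For example, `job_dir_name MyProject 1234 DE1 DesignA` reproduces the directory name shown in the example.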

Behavior with removeInputDir

  • If removeInputDir=1 → Existing directories are removed and replaced.

  • If removeInputDir=0 → Directories are reused if empty; otherwise, a numeric counter is appended:

    MyProject_1234_DesignA
    MyProject_1234_DesignA_1
    MyProject_1234_DesignA_2
    ...
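
A minimal sketch of this directory handling is shown below. It assumes that "reused if empty" means an existing empty directory is taken as-is and that counters start at 1; the function name resolve_job_dir is hypothetical and the actual server logic may differ in detail.

```shell
# Resolve the directory for a new job (illustrative sketch, not server code).
# Usage: resolve_job_dir <base_path> <removeInputDir: 0|1>
# Prints the directory that was prepared.
resolve_job_dir() {
    base=$1
    remove=$2
    # Refuse to operate on an empty path, since rm -rf follows below.
    [ -n "$base" ] || return 1
    if [ "$remove" = "1" ]; then
        # Always start from a fresh, empty directory.
        rm -rf "$base"
        mkdir -p "$base"
        printf '%s\n' "$base"
        return 0
    fi
    candidate=$base
    counter=0
    # Skip existing, non-empty directories by appending _1, _2, ...
    while [ -d "$candidate" ] && [ -n "$(ls -A "$candidate")" ]; do
        counter=$((counter + 1))
        candidate="${base}_${counter}"
    done
    mkdir -p "$candidate"
    printf '%s\n' "$candidate"
}
```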

Execution

  • Input files from CAESES are copied into the job directory inside work/.
  • Slurm jobs are executed via sbatch using scripts from the scripts/ directory.

Job Submission to Slurm

When CAESES submits a job through the Slurm Bridge, the bridge prepares arguments for Slurm and sets them via the environment variable JOB_ARGS before invoking sbatch.

Argument Mapping

  • --job-name=<name>
    • The name of the job in the form <ProjectName> - <DesignName>. All spaces in the name are replaced by _.
  • -n <tasks> (short form of --ntasks=<tasks>)
    • Number of tasks configured via the Number of Tasks input field in CAESES.
  • --nodes=<nodes>
    • Number of nodes added if the Number of Nodes field is greater than 0.
  • --cpus-per-task=<cpus>
    • CPUs per task added if the CPUs per Task field is greater than 0.
  • --partition=<name>
    • Partition added if a partition name is specified.
  • --oversubscribe
    • Oversubscribe added if the oversubscribe option is set in the configuration.
Example Environment Variable
export JOB_ARGS="--job-name=Test_-_baseline -n 16 --nodes=2 --cpus-per-task=4 --partition=short --oversubscribe"
sbatch $JOB_ARGS my_job_script.sh
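
The mapping can be sketched as a shell function that assembles the argument string from the individual inputs. The function name build_job_args and its positional parameters are hypothetical; the sketch only mirrors the rules listed above and is not the bridge's own code.

```shell
# Assemble sbatch arguments following the mapping above (illustrative sketch).
# Usage: build_job_args <project> <design> <tasks> <nodes> <cpus> <partition> <oversubscribe>
build_job_args() {
    # "<ProjectName> - <DesignName>" with every space replaced by "_".
    job_name=$(printf '%s' "$1 - $2" | tr ' ' '_')
    args="--job-name=${job_name} -n $3"
    # Optional arguments are only added when they carry a value.
    if [ "$4" -gt 0 ]; then args="$args --nodes=$4"; fi
    if [ "$5" -gt 0 ]; then args="$args --cpus-per-task=$5"; fi
    if [ -n "$6" ]; then args="$args --partition=$6"; fi
    if [ "$7" = "1" ]; then args="$args --oversubscribe"; fi
    printf '%s\n' "$args"
}
```

Calling `build_job_args Test baseline 16 2 4 short 1` yields the JOB_ARGS value from the example.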

Job Tracking

  • After submission, Slurm Bridge records the Slurm Job ID returned by sbatch.

  • Job state queries are executed by the CTC Server using:

    squeue --job <jobID>
  • Cancel requests are handled by:

    scancel <jobID>

All results are passed back to CAESES and can be viewed using the web interface. Slurm log files (slurm-???.out) are saved to the job directory itself. After the job finishes, they are copied back to the host on which CAESES is running.
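
To reproduce the tracking steps by hand, the following sketch extracts the job ID from the confirmation line that sbatch prints ("Submitted batch job <id>") and queries the job state with squeue. The function names parse_job_id and job_state are hypothetical, and how the bridge itself parses the output is not documented here.

```shell
# Extract the numeric job ID from sbatch's confirmation line,
# e.g. "Submitted batch job 4711". Reads from standard input.
parse_job_id() {
    sed -n 's/^Submitted batch job \([0-9][0-9]*\).*$/\1/p'
}

# Query the state of one job; prints e.g. PENDING or RUNNING, or nothing
# if the job has already left the queue. Requires a working Slurm client.
job_state() {
    squeue --noheader --format=%T --jobs="$1"
}
```

A manual round trip could then look like `id=$(sbatch my_job_script.sh | parse_job_id)` followed by `job_state "$id"`.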

Local Data Storage

The Slurm Bridge maintains a local configuration in the CTC Server database in the installation folder.

  • Stored in SQLite format
  • Includes:
    • Job counter (tracks submitted jobs)
    • Mapping of CAESES job IDs to Slurm job IDs
      • Ensures that if CAESES crashes or is restarted, existing Slurm jobs can still be tracked and controlled
    • Other local runtime configuration data

This database ensures that job numbering and state mappings remain consistent between sessions, supporting crash recovery and reliable monitoring.

Authentication & Permissions

  • The CTC Server communicates with Slurm via the command line interface.
  • The user account running the server must have permission to run Slurm commands such as squeue, scontrol, sbatch, etc.
  • No direct password/SSH key handling is needed — permissions are inherited from the user environment.
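
A quick way to check this requirement is to test, as the service user, whether the Slurm commands can be found on the PATH. The helper name missing_commands is hypothetical.

```shell
# Print every command from the argument list that is not found on PATH.
# Prints nothing if all commands are available.
missing_commands() {
    for cmd in "$@"; do
        command -v "$cmd" >/dev/null 2>&1 || printf '%s\n' "$cmd"
    done
}
```

Run `missing_commands sbatch squeue scancel scontrol` in a shell of the user account that runs the CTC Server; an empty result means all four commands were found.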

Web Interface

The Slurm Bridge includes a comprehensive web interface for real-time job monitoring, management, and cluster utilization tracking.

[Figure: CTC Server Slurm Bridge web interface]

Access & Authentication

  • URL: http://<hostname>:<port>/static/slurmbridge
  • Port: Configured via Core/port (default: 5170)
  • Bind address: Configured via Core/bind_address
  • Protocol: HTTP (no HTTPS support yet)
  • Authentication: Required for all access

Job Monitoring Features

Real-Time Job List

  • Automatic refresh for live updates
  • Jobs are organized in a sortable, filterable table
  • Job grouping by run name with collapsible groups (click run name badge to expand/collapse)

Job Information Display

Each job entry shows:

  • Job ID: Internal CTC Server job identifier
  • Slurm ID: SLURM job identifier for tracking in the cluster
  • Start Time: When the job was submitted
  • Project Name: CAESES project name
  • Design Name: Name of the design being evaluated
  • Status: Current job state with color-coded badges
  • Job Directory: Full path to the job working directory
  • Allocated Nodes: Number of nodes assigned by SLURM

Job States

Jobs can be in the following states (with visual color coding):

  • new - Job created but not yet submitted
  • running - Currently executing on the cluster
  • queued - Waiting in SLURM queue
  • tobesubmitted - Prepared for submission
  • finalized - Execution complete, results ready
  • finished - Successfully completed
  • failed - Execution failed or was cancelled
  • unknown - Status cannot be determined
  • waiting for input - Waiting for user input or data
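
For orientation, the sketch below shows one plausible way Slurm's own job states could correspond to the states listed above. The mapping is an assumption derived from the state descriptions, not the documented behavior of the bridge, and the function name map_state is hypothetical.

```shell
# Map a Slurm job state (as printed by squeue or sacct) to a bridge state.
# Assumed mapping for illustration only.
map_state() {
    case $1 in
        PENDING)   echo queued ;;
        RUNNING)   echo running ;;
        COMPLETED) echo finished ;;
        FAILED|CANCELLED|TIMEOUT) echo failed ;;
        *)         echo unknown ;;
    esac
}
```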

Column Filters

Each column has an individual filter field:

  • Project Name: Text search to filter by project
  • Design Name: Text search to filter by design
  • Status: Dropdown with all available states
  • Job Directory: Text search for directory paths

Job Management

Individual Job Operations

  • Cancel Job: Click the cancel/trash icon on any job row
    • Confirmation dialog prevents accidental cancellation
    • Cancels the job in SLURM and removes it from the queue

Bulk Operations

  • Multi-select: Use checkboxes to select multiple jobs
  • Bulk Remove: Remove all selected jobs at once
  • Remove All Failed: One-click button to remove all jobs in "failed" state
    • Can be applied per run or globally across all runs
Job Removal

Removing a job from the web interface performs two actions:

  1. Cancels the job in SLURM using scancel
  2. Removes the job from the CTC Server queue

Table Customization

Column Management

  • Show/Hide Columns: Dropdown menu to toggle column visibility
  • Column Reordering: Drag and drop column headers to reorder
  • Column Resizing: Drag column borders to adjust width
  • Persistence: Column settings are saved in your browser session

Sorting

  • Click any column header to sort ascending/descending
  • Multi-column sorting supported (hold Shift while clicking)

User Management

Password Management

  • Change Password: Access via user menu in the top navigation
  • Opens a dialog to change your password securely
  • Current password verification required

User List (Admin Only)

  • View all registered users in the system
  • Access via user menu or settings
  • Current version supports only the admin user

Logout

  • Click the user icon in the top navigation
  • Select "Logout" to end your session
  • Redirects to the login page

Troubleshooting

Cannot Access Web Interface

Symptoms: Browser cannot connect to http://<hostname>:<port>/static/slurmbridge

Solutions:

  1. Verify the CTC Server is running:

    systemctl status ctc-server
  2. Check the port configuration:

    • Review Core/port in ctcconfig.ini (default: 5170)
    • Ensure no other service is using this port
  3. Check bind address:

    • If Core/bind_address=127.0.0.1, the server only accepts local connections
    • Change to 0.0.0.0 to accept connections from other machines
  4. Firewall rules:

    • Ensure firewall allows connections on the configured port
    • Example: sudo ufw allow 5170/tcp
  5. Network connectivity:

    • Verify you can reach the host: ping <hostname>
    • Try accessing from localhost first: http://localhost:5170/static/slurmbridge
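
The connectivity check can also be scripted. The sketch below builds the web interface URL and asks curl for the HTTP status code; the function names bridge_url and probe_bridge are hypothetical, and curl is assumed to be installed.

```shell
# Build the web interface URL for a host and port (default port 5170).
bridge_url() {
    printf 'http://%s:%s/static/slurmbridge\n' "$1" "${2:-5170}"
}

# Print the HTTP status code returned for the URL, or 000 if no
# connection could be made. Requires curl.
probe_bridge() {
    curl --silent --output /dev/null --max-time 5 \
         --write-out '%{http_code}\n' "$(bridge_url "$@")"
}
```

Any status other than 000 from `probe_bridge <hostname> <port>` indicates that the server is reachable; because authentication is required, the exact code returned before login may vary.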

Authentication Failed Errors

Symptoms: Login fails with "Invalid credentials" or "Authentication failed"

Solutions:

  1. Verify credentials: Default is admin / admin
  2. Check if password was changed: Ask the administrator for current credentials
  3. Review authentication logs: Check <logpath>/ctcserver.log for authentication errors
  4. Database issues: Verify ctcserver.db file exists and is readable
  5. Clear browser cache: Old tokens or cached login data may cause issues

Jobs Not Appearing in Queue

Symptoms: CAESES submits jobs but they don't appear in the web interface

Solutions:

  1. Check CAESES connection:

    • Verify Resource Manager address: http://<hostname>:<port>/RPC2
    • Confirm credentials match CTC Server credentials
    • Test connection from CAESES
  2. Verify SLURM is working:

    squeue # Should show SLURM jobs
    sinfo # Should show cluster status
  3. Check CTC Server logs:

    • Location: <logpath>/ctcserver.log
    • Look for submission errors or SLURM communication issues
  4. Database permissions:

    • Ensure the service user can read/write ctcserver.db
    • Check directory permissions in datadir
  5. Refresh the web interface:

    • Jobs auto-refresh every 300ms
    • Try clearing filters (click reset/clear filter button)
    • Check if jobs are grouped under a run name (expand the group)

Jobs Stuck in "new" or "tobesubmitted" State

Symptoms: Jobs remain in initial state and never transition to "queued" or "running"

Solutions:

  1. SLURM submission errors:

    • Check <logpath>/ctcserver.log for sbatch errors
    • Verify the bash script in scripts/ directory is executable
    • Test manual submission: sbatch <script_path>
  2. SLURM permissions:

    • Ensure the service user can run SLURM commands
    • Test: squeue, sbatch, scancel as the service user
  3. Resource constraints:

    • Check if requested resources exceed cluster capacity
    • Review SLURM logs: /var/log/slurm-llnl/slurmd.log
  4. Directory permissions:

    • Verify datadir/work/ is writable
    • Check job directory was created successfully
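
The script and directory checks above can be combined into a small helper that is run as the service user. This is a sketch under the assumption that the datadir contains the work/ and scripts/ subdirectories described in the configuration section; the function name check_submission_paths is hypothetical.

```shell
# Report problems with the work directory and an application script.
# Prints nothing if everything looks fine.
# Usage: check_submission_paths <datadir> <script_name>
check_submission_paths() {
    datadir=$1
    script=$2
    if [ ! -d "$datadir/work" ] || [ ! -w "$datadir/work" ]; then
        echo "work directory missing or not writable: $datadir/work"
    fi
    if [ ! -f "$datadir/scripts/$script" ]; then
        echo "script not found: $datadir/scripts/$script"
    elif [ ! -x "$datadir/scripts/$script" ]; then
        echo "script not executable: $datadir/scripts/$script"
    fi
}
```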

Web Interface Shows "Unknown" Job Status

Symptoms: Jobs display status as "unknown" instead of actual state

Solutions:

  1. SLURM communication issue:

    • CTC Server cannot query job status from SLURM
    • Verify SLURM is running: systemctl status slurmd
  2. Job no longer in SLURM queue:

    • Job may have completed or been cancelled externally
    • Check SLURM history: sacct -j <slurm_job_id>
  3. Permissions issue:

    • Service user may lack permission to run squeue --job <jobID>
    • Test manually as service user
  4. Database sync issue:

    • Restart CTC Server: systemctl restart ctc-server
    • This resyncs job states with SLURM

Log File Locations

For debugging CTC Server and Slurm Bridge issues:

  • CTC Server logs: <logpath>/ctcserver.log (configured in Core/logpath)
  • Default log location: <installation_dir>/bin/logs/ (if logpath not configured)
  • SLURM job output: <datadir>/work/<job_directory>/slurm-<jobid>.out
  • SLURM system logs: /var/log/slurm-llnl/ (distribution-dependent)
  • systemd service logs: journalctl -u ctc-server

Check these logs for detailed error messages and diagnostic information.
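
When scanning the CTC Server log, a simple filter is often enough as a first step. The sketch below assumes that relevant lines contain the word "error" or "sbatch" in any letter case; the actual log format is not specified here, and the function name recent_errors is hypothetical.

```shell
# Show the last N lines (default 20) of a log file that mention an error
# or an sbatch call, matched case-insensitively.
# Usage: recent_errors <logfile> [N]
recent_errors() {
    grep -i -E 'error|sbatch' "$1" | tail -n "${2:-20}"
}
```

For example, `recent_errors <logpath>/ctcserver.log 50` lists the 50 most recent matching lines.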