
Auto-Submit/Resubmit Tasks

Lin Ziyue

These two scripts are presented separately because they are extremely useful in day-to-day work:
* The auto-submit script caps the number of jobs you have in the queue;
* The resubmit script monitors the runtime of submitted jobs and cancels or resubmits them when needed.

Below are the core concepts and sample code. The actual commands to execute should be filled in at the TODO locations.


Auto-Submit

The script does only two things:

  1. Poll the scheduler queue until your job count drops below the limit;
  2. Check a local marker file to decide whether the follow-up commands should run.

#!/usr/bin/env bash
#
# queue-watch + continue-or-not
#
# ────────────────────────────────────────────────
# Directory location (modify as needed)
workdir="$(pwd)"
cd "$workdir" || exit 1
# ────────────────────────────────────────────────
# ① Monitor queue: wait if current user's job count ≥ 21
#    Note: Replace 21, sleep time and other thresholds as needed
while true; do
    # Count only your own jobs in queue, removing header line
    njob=$(squeue -u "$USER" | tail -n +2 | wc -l)
    if (( njob < 21 )); then              # Queue has space, proceed
        break
    fi
    printf "Queue still has %d tasks, sleeping 20 s...\n" "$njob"
    sleep 20
done
# ────────────────────────────────────────────────
# ② Decide whether to continue: proceed if file doesn't exist or doesn't contain keyword
flagfile="zToten.dia"
if [[ ! -f $flagfile ]] || ! grep -q "TOTEN" "$flagfile"; then
    echo "⚠️ Need to continue subsequent operations (file missing or incomplete)"
    # =====================================================
    # TODO: Write the actual commands to execute here, e.g.:
    # sbatch Runscript
    # python postprocess.py
    # ...
    # =====================================================
else
    echo "✅ Normal completion marker detected, skipping subsequent operations"
fi

Key Points

  • squeue -u "$USER" lists only your own jobs; replace it with the equivalent command if you are not using Slurm.
  • tail -n +2 drops the header line so that wc -l counts only job rows.
  • grep -q runs silently and reports only through its exit status whether the keyword was found.
  • The threshold (21), sleep interval, keyword and marker filename can all be adjusted per project.
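
The same two checks can also be written in Python, which may be convenient if the rest of your workflow is already Python. The following is only a minimal sketch, not part of the original script: the function names (count_jobs, needs_run, wait_for_slot, my_queue_length) are hypothetical, and the counter and sleep function are passed in as arguments so the loop can be tried out without a real scheduler.

```python
import getpass
import subprocess
import time
from pathlib import Path
from typing import Callable


def count_jobs(squeue_output: str) -> int:
    """Count job rows in squeue's default output, ignoring the header line."""
    lines = [ln for ln in squeue_output.splitlines() if ln.strip()]
    return max(len(lines) - 1, 0)


def needs_run(flagfile: str, keyword: str = "TOTEN") -> bool:
    """Return True when the marker file is missing or lacks the keyword."""
    path = Path(flagfile)
    if not path.is_file():
        return True
    return keyword not in path.read_text(errors="ignore")


def wait_for_slot(
    get_count: Callable[[], int],
    limit: int = 21,
    interval: float = 20.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Block until get_count() drops below limit; return the last count."""
    while True:
        njob = get_count()
        if njob < limit:
            return njob
        print(f"Queue still has {njob} tasks, sleeping {interval:g} s...")
        sleep(interval)


def my_queue_length() -> int:
    """Ask Slurm for the current user's jobs (requires squeue on PATH)."""
    out = subprocess.check_output(
        ["squeue", "-u", getpass.getuser()], text=True
    )
    return count_jobs(out)


# Typical use on a cluster (commented out so the sketch imports cleanly):
# wait_for_slot(my_queue_length, limit=21, interval=20)
# if needs_run("zToten.dia"):
#     subprocess.run(["sbatch", "Runscript"], check=False)
```

Passing get_count and sleep as parameters is a design choice for testability; on a real cluster you would simply pass my_queue_length and keep the default time.sleep.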

Resubmit Tasks

This script is pared down to the essentials:

  1. Monitor the runtime of jobs in the SLURM queue;
  2. Select the jobs to process by job name and by a keyword in the working directory;
  3. (Optional) Double-check via ps aux, as a sanity check that the job belongs to the current account;
  4. Offer two actions: cancel the timed-out jobs only (scancel), or cancel them and then run custom follow-up operations.

The actual "subsequent operations" are left in TODO placeholders for you to fill in as needed.

#!/usr/bin/env python3
"""
watch_slurm_timeout.py

• Monitor current user's SLURM job runtime
• Support filtering by job name and working directory keywords
• Provides:
  --only-cancel  Only cancel timeout tasks
  Interactive [C]ancel / [O]ther / [S]kip selection
"""
import argparse
import os
import re
import subprocess
import sys
from typing import List, Tuple

# ---------- General utilities ---------- #
def run(cmd: str) -> str:
    """Execute shell command and return string result (exit directly on error)."""
    try:
        return subprocess.check_output(
            cmd, shell=True, text=True, stderr=subprocess.STDOUT
        )
    except subprocess.CalledProcessError as e:
        sys.exit(f"[ERROR] Command failed: {cmd}\n{e.output}")

def hms_to_seconds(t: str) -> int:
    """
    Convert squeue TIME column to seconds.
    Possible formats:
        D-HH:MM:SS
        HH:MM:SS
        MM:SS
        SS
    """
    if "-" in t:                           # D-HH:MM:SS
        days, hms = t.split("-")
        h, m, s = map(int, hms.split(":"))
        return int(days) * 86400 + h * 3600 + m * 60 + s
    parts = list(map(int, t.split(":")))
    if len(parts) == 3:                    # HH:MM:SS
        h, m, s = parts
    elif len(parts) == 2:                  # MM:SS
        h, m, s = 0, *parts
    else:                                  # SS
        h, m, s = 0, 0, parts[0]
    return h * 3600 + m * 60 + s

# ---------- SLURM related ---------- #
def get_job_workdir(jobid: str) -> str:
    """Extract job working directory via scontrol (WorkDir or parent directory of StdOut path)."""
    info = run(f"scontrol show job {jobid}")
    # WorkDir field
    m = re.search(r"WorkDir=(\S+)", info)
    if m:
        return m.group(1)
    # Fallback: StdOut=.../slurm-xxx.out → extract directory
    m = re.search(r"StdOut=(.*)/slurm-\d+\.out", info)
    if m:
        return m.group(1)
    return "UNKNOWN"

def list_timeout_jobs(
    threshold_sec: int,
    job_name: str,
    path_keyword: str
) -> List[Tuple[str, str, str]]:
    """Return list of (jobid, workdir, raw_time_string) that meet criteria."""
    user = os.getenv("USER")
    sq = run(f"squeue -u {user}")
    lines = sq.strip().splitlines()[1:]        # Skip header
    result = []
    for ln in lines:
        cols = ln.split()
        if len(cols) < 6:
            continue
        jobid, name, time_used = cols[0], cols[2], cols[5]
        if job_name and name != job_name:
            continue
        if hms_to_seconds(time_used) < threshold_sec:
            continue
        workdir = get_job_workdir(jobid)
        if path_keyword and path_keyword not in workdir:
            continue
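        # ------------- Optional: ps aux double-check ------------- #
        # Heuristic: keep the job only if its ID appears somewhere in the
        # local process list. Delete this block if job processes are not
        # visible from the node where this script runs.
        ps_ok = any(jobid in p for p in run("ps aux").splitlines())
        if not ps_ok:
            continue
        # --------------------------------------------------------- #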
        result.append((jobid, workdir, time_used))
    return result

def cancel_job(jobid: str):
    subprocess.run(f"scancel {jobid}", shell=True, check=False)

# ---------- Main program ---------- #
def main():
    parser = argparse.ArgumentParser(
        description="Monitor and handle SLURM jobs that exceed runtime"
    )
    parser.add_argument(
        "-t", "--threshold", type=float, default=5,
        help="Timeout duration (hours), default 5h"
    )
    parser.add_argument(
        "-n", "--name", default="Runscrip",
        help="Only match this job name (squeue's default NAME column is "
             "truncated to 8 characters, e.g. Runscript -> Runscrip)"
    )
    parser.add_argument(
        "-k", "--keyword",
        default="sc90805/autobackup_BSCC-A2-sc90805/lzy",
        help="Substring keyword required in working directory"
    )
    parser.add_argument(
        "--only-cancel", action="store_true",
        help="Directly scancel when timeout detected, no subsequent operations"
    )
    args = parser.parse_args()
    timeout_sec = int(args.threshold * 3600)
    jobs = list_timeout_jobs(timeout_sec, args.name, args.keyword)
    if not jobs:
        print(">> No timeout tasks found matching criteria.")
        return

    print("\n============= Timeout Tasks Found =============")
    for jid, wdir, tstr in jobs:
        print(f"{jid:>8} TIME={tstr:<10} DIR={wdir}")
    print("===============================================\n")

    # --- Cancel only ---
    if args.only_cancel:
        for jid, *_ in jobs:
            cancel_job(jid)
        print(">> All timeout tasks cancelled.")
        return

    # --- Interactive ---
    choice = input(
        "[C]ancel only / [O]ther operations / [S]kip ? "
    ).strip().lower()

    if choice.startswith("c"):
        for jid, *_ in jobs:
            cancel_job(jid)
        print(">> Selected tasks cancelled.")

    elif choice.startswith("o"):
        for jid, wdir, _ in jobs:
            print(f"\n>>> Processing Job {jid} ({wdir})")
            cancel_job(jid)
            # ========================================================
            # TODO: Write your own file operations, resubmission logic here
            # For example:
            # subprocess.run("sbatch Runscript", cwd=wdir, shell=True)
            # ========================================================
    else:
        print(">> No operations performed.")

if __name__ == "__main__":
    main()
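
The script above parses squeue's default table, whose NAME column is truncated, and calls scontrol once per job to find the working directory. One possible alternative, sketched here under the assumption that your squeue supports the -h (no header) and -o (output format) options with the %i, %j, %M and %Z fields (job ID, full job name, elapsed time, working directory), is to request exactly those fields in one call. The names Job, elapsed_to_seconds, parse_squeue and fetch_jobs are hypothetical, and the "|" separator is an arbitrary choice that would break on job names containing that character.

```python
import subprocess
from typing import List, NamedTuple


class Job(NamedTuple):
    jobid: str
    name: str
    elapsed_sec: int
    workdir: str


def elapsed_to_seconds(t: str) -> int:
    """Convert an elapsed time of the form [D-][HH:]MM:SS (or SS) to seconds."""
    days = 0
    if "-" in t:
        d, t = t.split("-", 1)
        days = int(d)
    parts = [int(p) for p in t.split(":")]
    while len(parts) < 3:          # pad missing hours/minutes with zeros
        parts.insert(0, 0)
    h, m, s = parts
    return days * 86400 + h * 3600 + m * 60 + s


def parse_squeue(text: str, sep: str = "|") -> List[Job]:
    """Parse lines of 'jobid|name|elapsed|workdir'; skip malformed lines."""
    jobs = []
    for line in text.splitlines():
        fields = line.strip().split(sep, 3)
        if len(fields) != 4:
            continue
        jobid, name, elapsed, workdir = fields
        try:
            seconds = elapsed_to_seconds(elapsed)
        except ValueError:         # unexpected time string: ignore the row
            continue
        jobs.append(Job(jobid, name, seconds, workdir))
    return jobs


def fetch_jobs(user: str) -> List[Job]:
    """Query Slurm once for all of the user's jobs (requires squeue on PATH)."""
    out = subprocess.check_output(
        ["squeue", "-h", "-u", user, "-o", "%i|%j|%M|%Z"], text=True
    )
    return parse_squeue(out)
```

With this variant the job-name filter would compare against the full name (for example Runscript rather than Runscrip), so the default of --name would need to change accordingly.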

Usage Examples

  1. Cancel, without prompting, jobs named Runscrip (the default --name) that have been running ≥ 6 h

    python watch_slurm_timeout.py -t 6 --only-cancel
    

  2. List jobs running ≥ 4 h first, then choose the follow-up action interactively

    python watch_slurm_timeout.py -t 4
    

The script structure completely decouples "finding tasks & (optionally) canceling" from "subsequent custom operations".
To do what you want, simply add your own code in the # TODO sections.
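
As an illustration of what might go into the "[O]ther" TODO section, here is a minimal resubmission sketch. It assumes the job script in each working directory is called Runscript (as in the commented example in the script) and that sbatch prints its usual "Submitted batch job <id>" line on success; the helper names parse_sbatch_jobid and resubmit are hypothetical.

```python
import re
import subprocess
from typing import Optional


def parse_sbatch_jobid(output: str) -> Optional[str]:
    """Extract the job ID from sbatch's 'Submitted batch job N' message."""
    m = re.search(r"Submitted batch job (\d+)", output)
    return m.group(1) if m else None


def resubmit(workdir: str, script: str = "Runscript") -> Optional[str]:
    """Run sbatch inside workdir; return the new job ID, or None on failure."""
    proc = subprocess.run(
        ["sbatch", script],
        cwd=workdir,
        capture_output=True,
        text=True,
        check=False,
    )
    if proc.returncode != 0:
        print(f"[WARN] sbatch failed in {workdir}: {proc.stderr.strip()}")
        return None
    return parse_sbatch_jobid(proc.stdout)


# Inside the "[O]ther" branch, after cancel_job(jid), one could write:
# new_id = resubmit(wdir)
# print(f"    resubmitted as {new_id}" if new_id else "    resubmission failed")
```

Returning None instead of exiting lets the loop carry on with the remaining jobs when a single resubmission fails.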