Safety Systems

VM Cleanup (vm_guard.py)

A quadruple safety net ensures that VMs are always deleted, either by a handler at exit or, when nothing can run, at the next startup:

Scenario                  | Handler                         | VM deleted?
--------------------------|---------------------------------|------------------------------------
Normal exit               | atexit                          | Yes
Exception in with block   | __exit__                        | Yes
Ctrl+C                    | SIGINT handler                  | Yes
taskkill (graceful)       | SIGTERM + SIGBREAK              | Yes
Console window closed     | SetConsoleCtrlHandler (Windows) | Yes
Windows logoff/shutdown   | SetConsoleCtrlHandler           | Yes
taskkill /F (force kill)  | Nothing runs                    | Next startup via session_state.json
Power cut                 | Nothing runs                    | Next startup via session_state.json

How it works

with VMGuard(config, instance_name) as guard:
    # ... do work ...
    delete_instance(config, instance_name)
    guard.disarm()  # prevent double-delete

If anything goes wrong inside the with block, VMGuard's __exit__ handler runs and deletes the VM.
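The layering can be illustrated with a minimal sketch. This is not the code in vm_guard.py: the class name GuardSketch and the delete_fn callback are hypothetical, and the Windows-only SetConsoleCtrlHandler layer is left out. It only shows how atexit, signal handlers, and __exit__ can share one idempotent cleanup routine so that disarm() prevents a double delete.

```python
import atexit
import signal


class GuardSketch:
    """Illustrative stand-in for VMGuard (hypothetical, not the real class)."""

    def __init__(self, instance_name, delete_fn):
        self.instance_name = instance_name
        self._delete_fn = delete_fn   # hypothetical callback that deletes the VM
        self._armed = False
        self._old_handlers = {}

    def disarm(self):
        """Mark the VM as already deleted so later cleanup is a no-op."""
        self._armed = False

    def _cleanup(self):
        # Idempotent: only the first call while armed deletes the VM.
        if self._armed:
            self._armed = False
            self._delete_fn(self.instance_name)

    def _on_signal(self, signum, frame):
        self._cleanup()
        raise SystemExit(128 + signum)

    def __enter__(self):
        self._armed = True
        atexit.register(self._cleanup)  # covers normal interpreter exit
        # SIGBREAK only exists on Windows, so look each signal up by name.
        for name in ("SIGINT", "SIGTERM", "SIGBREAK"):
            sig = getattr(signal, name, None)
            if sig is None:
                continue
            try:
                self._old_handlers[sig] = signal.signal(sig, self._on_signal)
            except ValueError:
                pass  # handlers can only be installed from the main thread
        return self

    def __exit__(self, exc_type, exc, tb):
        self._cleanup()  # runs whether or not an exception was raised
        for sig, old in self._old_handlers.items():
            if old is not None:
                signal.signal(sig, old)  # restore the previous handler
        atexit.unregister(self._cleanup)
        return False  # let the original exception propagate
```

None of these layers can run after taskkill /F or a power cut, which is why those cases depend on session_state.json at the next startup.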

Crash Recovery (session_state.json)

Written on every VM create/delete and job start/finish.

On GUI startup:

  1. Read session_state.json (what we think is running)
  2. Query GCP (what actually exists)
  3. Cross-reference: clean up stale entries, warn about orphan VMs
  4. Reset interrupted downloads for auto-retry

Cross-referencing

session_state says | GCP says   | Action
-------------------|------------|--------------------------------------------
VM registered      | VM running | Warn user, enable panic buttons
VM registered      | VM gone    | Clean session state (Google terminated it)
Nothing            | VM running | Warn user (orphan from another session)
Job running        | VM gone    | Warn about interrupted analysis
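The cross-reference rules amount to a few set operations. The sketch below is an assumption about how they could be expressed; the function name reconcile, its arguments, and the action labels are all hypothetical and do not reflect the actual layout of session_state.json.

```python
def reconcile(session_vms, gcp_vms, running_jobs):
    """Return (name, action) pairs following the cross-reference table.

    session_vms:  VM names recorded in the session state
    gcp_vms:      VM names that GCP reports as existing
    running_jobs: mapping of job id -> VM name for jobs recorded as running
    """
    session_vms, gcp_vms = set(session_vms), set(gcp_vms)
    actions = []
    for vm in sorted(session_vms & gcp_vms):   # registered and still running
        actions.append((vm, "warn_still_running"))
    for vm in sorted(session_vms - gcp_vms):   # registered but gone: stale entry
        actions.append((vm, "clean_session_state"))
    for vm in sorted(gcp_vms - session_vms):   # running but unknown: orphan
        actions.append((vm, "warn_orphan"))
    for job, vm in sorted(running_jobs.items()):
        if vm not in gcp_vms:                  # the job's VM disappeared
            actions.append((job, "warn_interrupted_analysis"))
    return actions
```

For example, reconcile({"a", "b"}, {"b", "c"}, {"job1": "a"}) reports "b" as still running, "a" as a stale entry, "c" as an orphan, and "job1" as an interrupted analysis.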

Cost Protection

max_run_duration

Every VM gets a max_run_duration — GCP auto-kills it after this time even if all other cleanup fails.

  • Build VMs: 2 hours
  • Analysis VMs: sum of all analysis timeouts + last upload timeout + 1 min (rounded up to whole hours)
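As a worked example of the analysis VM rule, the sketch below sums the timeouts, adds the one-minute margin, and rounds up to whole hours. The function and parameter names are hypothetical; only the formula comes from the rule above.

```python
def analysis_max_run_seconds(analysis_timeouts_s, last_upload_timeout_s):
    """Sum of analysis timeouts + last upload timeout + 1 min, rounded up to whole hours."""
    total = sum(analysis_timeouts_s) + last_upload_timeout_s + 60
    hours = -(-total // 3600)  # integer ceiling division
    return hours * 3600


# Two analyses (60 min and 90 min) plus a 10 min upload timeout:
# 3600 + 5400 + 600 + 60 = 9660 s, which rounds up to 3 hours.
print(analysis_max_run_seconds([3600, 5400], 600))  # 10800
```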

Status monitoring

ops-hpc status           # shows running VMs and costs
ops-hpc status --usage   # shows per-VM cost breakdown for last 30 days

The GUI startup check also warns about running VMs.

On-demand by default

Standard (non-preemptible) VMs are the default. Spot VMs require the explicit --spot flag. This prevents data loss from preemption during long analyses.

Results Protection

Download verification

Before deleting from GCS, every file is verified:

  • The local file size must match the GCS blob size
  • If any file mismatches, the files are kept in GCS, the result set is marked as failed, and the download is retried later
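A size check of this kind could look like the following sketch. It assumes the GCS blob sizes have already been fetched into a plain dict; the function names and the remote_sizes argument are hypothetical, and no GCS client calls are shown.

```python
import os


def find_size_mismatches(local_dir, remote_sizes):
    """Return names whose local size differs from the expected blob size.

    remote_sizes: mapping of file name -> GCS blob size in bytes (assumed input).
    A missing local file counts as a mismatch.
    """
    mismatched = []
    for name, remote_size in remote_sizes.items():
        path = os.path.join(local_dir, name)
        local_size = os.path.getsize(path) if os.path.isfile(path) else None
        if local_size != remote_size:
            mismatched.append(name)
    return mismatched


def safe_to_delete_from_gcs(local_dir, remote_sizes):
    """Only allow deletion from GCS when every file matched."""
    return not find_size_mismatches(local_dir, remote_sizes)
```

A single mismatch makes safe_to_delete_from_gcs return False, so nothing is removed from GCS until a later retry succeeds.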

Download tracking

download_state.json tracks every result set through its lifecycle:

pending → downloading → downloaded → cleaned
                    ↘ failed (auto-retry)

Survives GUI restarts. Failed downloads auto-retry on next poll.
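The lifecycle can be modeled as a small state machine persisted to JSON. The sketch below is illustrative only: the transition table is an assumed reading of the diagram (failed is reached from downloading and retries by going back to downloading), and the flat id-to-status layout is hypothetical, not the real download_state.json schema.

```python
import json

# Assumed reading of the lifecycle diagram.
ALLOWED = {
    "pending": {"downloading"},
    "downloading": {"downloaded", "failed"},
    "downloaded": {"cleaned"},
    "failed": {"downloading"},  # auto-retry on the next poll
    "cleaned": set(),
}


def advance(state, result_id, new_status):
    """Move one result set to new_status, rejecting transitions not in ALLOWED."""
    current = state.get(result_id, "pending")
    if new_status not in ALLOWED[current]:
        raise ValueError(f"{result_id}: {current} -> {new_status} not allowed")
    state[result_id] = new_status
    return state


def reset_interrupted(state):
    """On startup, put downloads that were cut off mid-transfer back in the queue."""
    return {rid: ("pending" if status == "downloading" else status)
            for rid, status in state.items()}


def save_state(path, state):
    with open(path, "w", encoding="utf-8") as f:
        json.dump(state, f, indent=2)


def load_state(path):
    try:
        with open(path, encoding="utf-8") as f:
            return json.load(f)
    except FileNotFoundError:
        return {}  # first run: nothing tracked yet
```

Saving after every call to advance is what would let the tracking survive a GUI restart.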

Emergency download

If analysis is killed, partial results on the VM disk can be saved:

  1. ABORT ANALYSIS — kills OpenSees, VM stays alive
  2. EMERGENCY DOWNLOAD — uploads whatever exists to GCS
  3. STOP VM NOW — deletes the VM