Safety Systems
VM Cleanup (vm_guard.py)
A quadruple safety net deletes the VM on every exit path where code can still run; if no handler can run, the VM is cleaned up at the next startup:
| Scenario | Handler | VM deleted? |
|---|---|---|
| Normal exit | atexit | Yes |
| Exception in with block | __exit__ | Yes |
| Ctrl+C | SIGINT handler | Yes |
| taskkill (graceful) | SIGTERM + SIGBREAK | Yes |
| Console window closed | SetConsoleCtrlHandler (Windows) | Yes |
| Windows logoff/shutdown | SetConsoleCtrlHandler | Yes |
| taskkill /F (force kill) | Nothing runs | Next startup via session_state.json |
| Power cut | Nothing runs | Next startup via session_state.json |
How it works
```python
with VMGuard(config, instance_name) as guard:
    # ... do work ...
    delete_instance(config, instance_name)
    guard.disarm()  # prevent double-delete
```
If anything goes wrong inside the with block, VMGuard catches it and deletes the VM.
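To make the layering concrete, here is a minimal sketch of how a guard of this kind could be built. It is not the code in vm_guard.py: the class name `VMGuardSketch` and the injected `delete_fn` callable are assumptions made for illustration, and the Windows-only SetConsoleCtrlHandler layer is left out.

```python
import atexit
import signal


class VMGuardSketch:
    """Hypothetical stand-in for VMGuard: deletes the VM unless disarmed."""

    def __init__(self, delete_fn, instance_name):
        self._delete_fn = delete_fn  # assumed callable: delete_fn(instance_name)
        self._instance_name = instance_name
        self._armed = False
        self._old_handlers = {}

    def __enter__(self):
        self._armed = True
        atexit.register(self._cleanup)  # covers normal interpreter exit
        # Ctrl+C, graceful termination, and Ctrl+Break (SIGBREAK exists on Windows only).
        for name in ("SIGINT", "SIGTERM", "SIGBREAK"):
            signum = getattr(signal, name, None)
            if signum is None:
                continue
            try:
                self._old_handlers[signum] = signal.signal(signum, self._on_signal)
            except ValueError:
                break  # signal handlers can only be installed from the main thread
        return self

    def __exit__(self, exc_type, exc, tb):
        self._cleanup()  # covers exceptions and normal exit from the with block
        for signum, handler in self._old_handlers.items():
            if handler is not None:
                signal.signal(signum, handler)  # restore the previous handler
        atexit.unregister(self._cleanup)
        return False  # never swallow the original exception

    def _on_signal(self, signum, frame):
        self._cleanup()
        raise SystemExit(128 + signum)

    def _cleanup(self):
        if self._armed:  # idempotent: the delete runs at most once
            self._armed = False
            self._delete_fn(self._instance_name)

    def disarm(self):
        """Call after a successful manual delete to prevent a double-delete."""
        self._armed = False
```

Because `_cleanup` checks and clears the armed flag before deleting, it does not matter which layer fires first; later layers become no-ops.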
Crash Recovery (session_state.json)
Written on every VM create/delete and job start/finish.
On GUI startup:
- Read session_state.json (what we think is running)
- Query GCP (what actually exists)
- Cross-reference: clean up stale entries, warn about orphan VMs
- Reset interrupted downloads for auto-retry
Cross-referencing
| session_state says | GCP says | Action |
|---|---|---|
| VM registered | VM running | Warn user, enable panic buttons |
| VM registered | VM gone | Clean session state (Google terminated it) |
| Nothing | VM running | Warn user (orphan from another session) |
| Job running | VM gone | Warn about interrupted analysis |
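The table can be read as a small reconciliation function. The sketch below assumes a simplified, hypothetical view of the data (a set of registered VM names, a mapping of running job IDs to VM names, and the set of VM names GCP reports); the real session_state.json schema and the action names are not specified here.

```python
def reconcile(registered_vms, running_jobs, gcp_vms):
    """Return (subject, action) pairs following the cross-reference table.

    registered_vms: set of VM names recorded in session state (assumed shape)
    running_jobs:   dict of job id -> VM name the job was running on (assumed shape)
    gcp_vms:        set of VM names that GCP reports as existing
    """
    actions = []
    for vm in sorted(registered_vms):
        if vm in gcp_vms:
            actions.append((vm, "warn_running"))  # warn user, enable panic buttons
        else:
            actions.append((vm, "clean_session_state"))  # stale entry
    for vm in sorted(gcp_vms - registered_vms):
        actions.append((vm, "warn_orphan"))  # VM from another session
    for job, vm in sorted(running_jobs.items()):
        if vm not in gcp_vms:
            actions.append((job, "warn_interrupted_analysis"))
    return actions


if __name__ == "__main__":
    for subject, action in reconcile({"vm-a", "vm-b"}, {"job-1": "vm-b"}, {"vm-a", "vm-c"}):
        print(subject, action)
```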
Cost Protection
max_run_duration
Every VM gets a max_run_duration — GCP auto-kills it after this time even if all other cleanup fails.
- Build VMs: 2 hours
- Analysis VMs: sum of all analysis timeouts + last upload timeout + 1 min (rounded up to whole hours)
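As a worked example of the analysis-VM rule, the following sketch computes the limit in seconds. The function and constant names are hypothetical; only the arithmetic (sum of analysis timeouts, plus the last upload timeout, plus one minute, rounded up to whole hours) comes from the rule above.

```python
import math

BUILD_VM_MAX_RUN_S = 2 * 3600  # build VMs: fixed 2 hours


def analysis_max_run_seconds(analysis_timeouts_s, last_upload_timeout_s):
    """Sum of timeouts + last upload timeout + 1 min, rounded up to whole hours."""
    total = sum(analysis_timeouts_s) + last_upload_timeout_s + 60
    return math.ceil(total / 3600) * 3600


if __name__ == "__main__":
    # Two analyses (60 min and 90 min) with a 10 min upload timeout:
    # 3600 + 5400 + 600 + 60 = 9660 s, which rounds up to 3 hours.
    print(analysis_max_run_seconds([3600, 5400], 600))  # 10800
```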
Status monitoring
```shell
ops-hpc status          # shows running VMs and costs
ops-hpc status --usage  # shows per-VM cost breakdown for last 30 days
```
The GUI startup check also warns about running VMs.
On-demand by default
Standard (non-preemptible) VMs are the default. Spot VMs require the explicit --spot flag. This prevents data loss from preemption during long analyses.
Results Protection
Download verification
Before anything is deleted from GCS, every file is verified:
- Local file size must match GCS blob size
- On any mismatch, the files are kept in GCS, the download is marked as failed, and it is retried later
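The size check could look roughly like the sketch below. To stay self-contained it takes the remote sizes as a plain dict rather than calling a GCS client; the function names and the dict shape are assumptions, not the tool's actual interface.

```python
from pathlib import Path


def find_size_mismatches(local_dir, remote_sizes):
    """Return names whose local size differs from the GCS blob size.

    remote_sizes: dict of relative file name -> blob size in bytes (assumed shape).
    A missing local file counts as a mismatch.
    """
    mismatches = []
    for name, remote_size in sorted(remote_sizes.items()):
        local_path = Path(local_dir) / name
        local_size = local_path.stat().st_size if local_path.is_file() else None
        if local_size != remote_size:
            mismatches.append(name)
    return mismatches


def safe_to_delete_from_gcs(local_dir, remote_sizes):
    """Only allow cleanup when every file matches."""
    return not find_size_mismatches(local_dir, remote_sizes)
```

A single mismatch blocks cleanup for the whole result set, which matches the "keep files in GCS" behaviour described above.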
Download tracking
download_state.json tracks every result set through its lifecycle:
pending → downloading → downloaded → cleaned
↘ failed (auto-retry)
Survives GUI restarts. Failed downloads auto-retry on next poll.
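One way to model this lifecycle is a transition table persisted as JSON. The sketch below is illustrative only: the file layout (result ID mapped to a status string), the function name, and the assumption that `failed` can be reached from either `downloading` or `downloaded` (for example after a size mismatch) are not taken from the actual download_state.json.

```python
import json
from pathlib import Path

# Allowed transitions; the exact origin of "failed" is an assumption.
ALLOWED = {
    "pending": {"downloading"},
    "downloading": {"downloaded", "failed"},
    "downloaded": {"cleaned", "failed"},
    "failed": {"downloading"},  # auto-retry on the next poll
    "cleaned": set(),
}


def advance(state_path, result_id, new_status):
    """Move one result set to new_status and persist the change to disk."""
    path = Path(state_path)
    state = json.loads(path.read_text()) if path.exists() else {}
    current = state.get(result_id, "pending")
    if new_status not in ALLOWED[current]:
        raise ValueError(f"{result_id}: cannot go from {current} to {new_status}")
    state[result_id] = new_status
    # Write to a temp file and rename it, so a crash cannot leave a half-written file.
    tmp = path.with_name(path.name + ".tmp")
    tmp.write_text(json.dumps(state, indent=2))
    tmp.replace(path)
    return new_status
```

Writing the state on every transition is what lets the tracking survive a GUI restart: on startup, anything still marked `failed` or `downloading` can be picked up again.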
Emergency download
If analysis is killed, partial results on the VM disk can be saved:
- ABORT ANALYSIS — kills OpenSees, VM stays alive
- EMERGENCY DOWNLOAD — uploads whatever exists to GCS
- STOP VM NOW — deletes the VM