Never waste a training run. Your model watchdog. 20KB.
Checkpoint watchdog that detects training failures with real inference.
The Problem
Training runs cost hundreds to thousands of dollars in GPU time. When gradients explode at 2 a.m., the run keeps going for hours, burning money on corrupted checkpoints. TensorBoard shows loss curves but cannot detect semantic degradation: the loss can look fine while the model generates garbage. Weights & Biases requires a cloud account and Python.
The Solution
OmniGuardian watches your checkpoint directory using inotify. When a new checkpoint appears, it scans every tensor for NaN/Inf, measures weight drift from baseline, and runs a quick inference test to verify the model still generates coherent output. If quality degrades, it alerts immediately. Runs on CPU — never competes with your GPU training.
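The core of the watch loop is a few dozen lines of inotify. Below is a minimal sketch, not the shipped implementation: it assumes the trainer writes each checkpoint to a temp file and renames it into the watched directory, so `IN_MOVED_TO` (or `IN_CLOSE_WRITE` for direct writes) marks a complete file. `check_checkpoint` is a placeholder for the scan/drift/inference pass, not a real OmniGuardian symbol.

```c
/* Illustrative watch loop. check_checkpoint() stands in for the
 * NaN scan, drift measurement, and inference test. */
#include <stdio.h>
#include <sys/inotify.h>
#include <unistd.h>

void check_checkpoint(const char *dir, const char *name);

int watch_dir(const char *dir) {
    int fd = inotify_init();
    if (fd < 0) { perror("inotify_init"); return -1; }
    /* Fire when a writer closes a checkpoint or renames one into the dir. */
    if (inotify_add_watch(fd, dir, IN_CLOSE_WRITE | IN_MOVED_TO) < 0) {
        perror("inotify_add_watch");
        return -1;
    }
    /* Buffer must be aligned for struct inotify_event. */
    char buf[4096] __attribute__((aligned(__alignof__(struct inotify_event))));
    for (;;) {
        ssize_t len = read(fd, buf, sizeof buf);  /* blocks until events arrive */
        if (len <= 0) break;
        for (char *p = buf; p < buf + len; ) {
            const struct inotify_event *ev = (const struct inotify_event *)p;
            if (ev->len > 0)
                check_checkpoint(dir, ev->name);  /* run the full check pass */
            p += sizeof(struct inotify_event) + ev->len;
        }
    }
    close(fd);
    return 0;
}
```

Watching for close/rename rather than creation avoids scanning a checkpoint while the trainer is still writing it.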
Why Bare-Metal Matters
A training watchdog must be invisible: it must not steal GPU memory, compete for CPU, or add latency. At 20KB with zero dependencies, OmniGuardian uses mmap to read checkpoints and runs inference on the CPU. It is the only tool that can verify semantic quality during training without touching the GPU.
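The mmap-based scan is simple to sketch. The example below assumes a flat fp32 payload for clarity; real checkpoint formats (safetensors, GGUF, PyTorch archives) carry headers and per-tensor dtypes that the actual scanner would have to parse first.

```c
/* Sketch of the NaN/Inf scan over an mmap'd checkpoint.
 * Assumes a flat float32 payload; illustrative only. */
#include <fcntl.h>
#include <math.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int scan_f32(const char *path) {
    int fd = open(path, O_RDONLY);
    if (fd < 0) { perror("open"); return -1; }
    struct stat st;
    if (fstat(fd, &st) < 0) { perror("fstat"); close(fd); return -1; }
    /* mmap lets the kernel page the file in lazily; nothing is copied
     * onto the heap, so even multi-GB checkpoints stay cheap to scan. */
    const float *w = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);
    if (w == MAP_FAILED) { perror("mmap"); return -1; }
    size_t n = st.st_size / sizeof(float), bad = 0;
    for (size_t i = 0; i < n; i++)
        if (!isfinite(w[i])) bad++;           /* counts NaN and +/-Inf */
    munmap((void *)w, st.st_size);
    if (bad) fprintf(stderr, "ALERT: %zu non-finite weights in %s\n", bad, path);
    return bad ? 1 : 0;
}
```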
Technical Specifications
| Feature | Value |
|---|---|
| Binary Size | ~20KB |
| Function | Training checkpoint watchdog with inference testing |
| Detection | NaN/Inf, weight drift, semantic degradation |
| Dependencies | None — runs on CPU alongside GPU training |
| Monitoring | inotify — detects new checkpoints instantly |
| Alert | Stdout + optional webhook |
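The weight-drift check from the table can be as simple as a relative L2 distance from a baseline snapshot. Here is a sketch, assuming both checkpoints are already mapped as equal-length fp32 arrays (as above); the threshold in the usage comment is illustrative, not a tuned default.

```c
/* Weight-drift sketch: relative L2 distance between the current
 * checkpoint and a baseline snapshot. */
#include <math.h>
#include <stddef.h>

double weight_drift(const float *cur, const float *base, size_t n) {
    double diff2 = 0.0, base2 = 0.0;
    for (size_t i = 0; i < n; i++) {
        double d = (double)cur[i] - (double)base[i];
        diff2 += d * d;
        base2 += (double)base[i] * (double)base[i];
    }
    /* ||cur - base|| / ||base||: near 0 means unchanged; a sudden jump
     * usually means an exploded update rather than normal learning. */
    return sqrt(diff2) / (sqrt(base2) + 1e-12);
}

/* Usage (threshold illustrative):
 *   if (weight_drift(cur, base, n) > 0.25) emit alert / fire webhook; */
```

Accumulating in double keeps the sums numerically stable even across billions of weights.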
Comparison
| | OmniGuardian | TensorBoard | Weights & Biases |
|---|---|---|---|
| Size | ~20KB | Python + TensorFlow | Python + cloud agent |
| Runs inference | Yes — detects semantic degradation | No — only loss curves | No — only metrics |
| NaN detection | Scans every tensor in checkpoint | Only if training logs it | Only if training logs it |
| Uses GPU | No — CPU only, does not compete with training | No | No |
| Dependencies | None | Python, TensorFlow | Python, cloud account |
| Works offline | Yes | Yes | No (cloud required) |
Use Cases
Overnight Training
Launch a training run and go to sleep. OmniGuardian watches every checkpoint. If the model degrades, you get an alert, and the last good checkpoint is always identified.
Fine-tune Quality Gate
During fine-tuning, verify that the model is learning the target domain without forgetting general capability. Detect catastrophic forgetting early.
Distributed Training
Run on each training node. Verify that gradient synchronization is not causing divergence across workers.
Coming Soon
This product is under active development. Contact us for early access or to be notified when binaries are available.
Talk to the Team