Where is your model slow? Layer by layer. CPU-cycle precision. 80KB.
Per-layer inference profiler with hardware bottleneck analysis.
The problem
Your model generates at 2 tokens per second. Why? Is it attention? FFN? Memory bandwidth? Compute? Today, answering this requires NVIDIA Nsight (2GB+ CUDA toolkit) or PyTorch profiler (4GB+ install). Neither is portable. Neither works without a full development environment.
The solution
OmniProbe runs your model and measures every layer, every operation, with CPU-cycle precision (rdtsc). It identifies whether the bottleneck is compute or memory bandwidth, which layers are slowest, and what hardware changes would improve performance. 80KB, portable to any machine.
Why Bare-Metal Matters
Profiling inference at the hardware level requires measuring CPU cycles, cache behavior, and memory bandwidth. Tools built on Python or CUDA add their own overhead to the measurement. OmniProbe runs the transformer with no interpreter or runtime layers between the code and the hardware: the measurements reflect the workload itself, not the tooling.
Technical Specifications
| Feature | Value |
|---|---|
| Binary Size | ~80KB |
| Function | Per-layer inference profiler with hardware analysis |
| Precision | CPU cycle-level (rdtsc) |
| Dependencies | None — no NVIDIA toolkit, no Python |
| Output | Layer timing, bandwidth, bottleneck ID |
| Portable | scp to any machine, run immediately |
Comparison
| | OmniProbe | NVIDIA Nsight | PyTorch Profiler |
|---|---|---|---|
| Size | ~80KB | 2GB+ (CUDA toolkit) | 4GB+ (PyTorch) |
| Installation | wget (80KB) | CUDA toolkit + account | pip install torch |
| CPU profiling | Yes (rdtsc cycle-level) | GPU focused | Yes |
| Per-layer timing | Built-in | Manual instrumentation | Manual instrumentation |
| Portable | scp + run | No (requires toolkit) | No (requires Python) |
| Bottleneck analysis | Bandwidth + compute breakdown | GPU kernel analysis | Op-level timing |
Use Cases
Optimization Research
Identify which layers and operations consume the most time. Know exactly where to focus optimization effort.
Hardware Selection
Profile the same model on different machines (scp + run). Compare DDR4 vs DDR5, Intel vs AMD, x86 vs ARM.
Quantization Decisions
See exactly how much slower Q6_K layers are vs Q4_K. Make informed decisions about which tensors to keep at higher precision.
Try Now — Free
Coming Soon
This product is under active development. Contact us for early access or to be notified when binaries are available.
Talk to the Team