OmniProbe Coming Soon


Where is your model slow? Layer by layer. CPU-cycle precision. 80KB.

Per-layer inference profiler with hardware bottleneck analysis.

Linux

The Problem

Your model generates at 2 tokens per second. Why? Is it attention? FFN? Memory bandwidth? Compute? Today, answering this requires NVIDIA Nsight (2GB+ CUDA toolkit) or PyTorch profiler (4GB+ install). Neither is portable. Neither works without a full development environment.

The Solution

OmniProbe runs your model and measures every layer, every operation, with CPU cycle precision (rdtsc). It identifies whether the bottleneck is compute or memory bandwidth, which layers are slowest, and what hardware changes would improve performance. 80KB, portable to any machine.

Why Bare-Metal Matters

Profiling inference at the hardware level requires measuring CPU cycles, cache behavior, and memory bandwidth. Tools built on Python or CUDA add their own overhead to the measurement. OmniProbe runs the transformer natively, with no interpreter or runtime layer between the code and the hardware, so the measurements are the ground truth.

Technical Specifications

Feature        Value
Binary Size    ~80KB
Function       Per-layer inference profiler with hardware analysis
Precision      CPU cycle-level (rdtsc)
Dependencies   None (no NVIDIA toolkit, no Python)
Output         Layer timing, bandwidth, bottleneck ID
Portability    scp to any machine, run immediately

Comparison

                     OmniProbe                       NVIDIA Nsight            PyTorch Profiler
Size                 ~80KB                           2GB+ (CUDA toolkit)      4GB+ (PyTorch)
Installation         wget (80KB)                     CUDA toolkit + account   pip install torch
CPU profiling        Yes (rdtsc cycle-level)         GPU-focused              Yes
Per-layer timing     Built-in                        Manual instrumentation   Manual instrumentation
Portable             scp + run                       No (requires toolkit)    No (requires Python)
Bottleneck analysis  Bandwidth + compute breakdown   GPU kernel analysis      Op-level timing

Use Cases

Optimization Research

Identify which layers and operations consume the most time. Know exactly where to focus optimization effort.

Hardware Selection

Profile the same model on different machines (scp + run). Compare DDR4 vs DDR5, Intel vs AMD, x86 vs ARM.

Quantization Decisions

See exactly how much slower Q6_K layers are vs Q4_K. Make informed decisions about which tensors to keep at higher precision.


Coming Soon

This product is under active development. Contact us for early access or to be notified when binaries are available.
