Where is your model slow? Layer by layer. CPU-cycle precision. 80KB.
Per-layer inference profiler with hardware bottleneck analysis.
The problem
Your model generates at 2 tokens per second. Why? Is it attention? FFN? Memory bandwidth? Compute? Today, answering this requires NVIDIA Nsight (2GB+ CUDA toolkit) or PyTorch profiler (4GB+ install). Neither is portable. Neither works without a full development environment.
The solution
OmniProbe runs your model and measures every layer, every operation, with CPU-cycle precision (rdtsc). It identifies whether the bottleneck is compute or memory bandwidth, which layers are slowest, and what hardware changes would improve performance. 80KB, portable to any machine.
Why Bare-Metal Matters
Profiling inference at the hardware level requires measuring CPU cycles, cache behavior, and memory bandwidth. Tools built on Python or CUDA add their own overhead to the measurement. OmniProbe runs the transformer with no interpreter or runtime layers between the code and the hardware: the measurements reflect the workload itself, not the tooling.
Technical Specifications
| Feature | Value |
|---|---|
| Binary Size | ~80KB |
| Function | Per-layer inference profiler with hardware analysis |
| Precision | CPU cycle-level (rdtsc) |
| Dependencies | None — no NVIDIA toolkit, no Python |
| Output | Layer timing, bandwidth, bottleneck ID |
| Portable | scp to any machine, run immediately |
Comparison
| | OmniProbe | NVIDIA Nsight | PyTorch Profiler |
|---|---|---|---|
| Size | ~80KB | 2GB+ (CUDA toolkit) | 4GB+ (PyTorch) |
| Installation | wget (80KB) | CUDA toolkit + account | pip install torch |
| CPU profiling | Yes (rdtsc cycle-level) | GPU focused | Yes |
| Per-layer timing | Built-in | Manual instrumentation | Manual instrumentation |
| Portable | scp + run | No (requires toolkit) | No (requires Python) |
| Bottleneck analysis | Bandwidth + compute breakdown | GPU kernel analysis | Op-level timing |
Use Cases
Optimization Research
Identify which layers and operations consume the most time. Know exactly where to focus optimization effort.
Hardware Selection
Profile the same model on different machines (scp + run). Compare DDR4 vs DDR5, Intel vs AMD, x86 vs ARM.
Quantization Decisions
See exactly how much slower Q6_K layers are vs Q4_K. Make informed decisions about which tensors to keep at higher precision.
Try Now — Free
Coming Soon
This product is under active development. Contact us for early access or to be notified when binaries are available.
Talk to the Team