pprof, benchmarks, and finding the real hotspot
Performance work in Go is unusually pleasant because the toolchain ships with everything you need. testing.B, pprof, the race detector, the trace viewer. No third-party agent, no APM bill, no language-specific quirks to learn. The discipline is to use them in the right order: benchmark to confirm there is a problem, profile to find the hotspot, then change code, then re-benchmark.
The number one mistake juniors make is optimising by intuition. The number two mistake is running go test -bench=. and trusting the timing without -benchmem. The number three mistake is optimising code that runs once when the actual bottleneck is the hot loop two functions away. We will fix all three by following the data instead of the gut.
Benchmark anatomy
A benchmark is a function in a _test.go file that takes a *testing.B. The framework calls it with growing values of b.N until the run lasts long enough (one second by default, adjustable with -benchtime) to give a stable timing.
func BenchmarkConcat(b *testing.B) {
	parts := []string{"a", "b", "c", "d", "e"}
	b.ReportAllocs()
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		var s string
		for _, p := range parts {
			s += p
		}
		_ = s
	}
}
Three lines deserve attention. b.ReportAllocs() makes the benchmark report B/op and allocs/op columns. b.ResetTimer() discards setup time so only the timed loop counts. The _ = s is there so the result counts as used and the compiler is less likely to optimise the loop body away; assigning the result to a package-level variable is a stronger guarantee.
Run it:
go test -bench=. -benchmem -benchtime=2s ./...
The output looks like:
BenchmarkConcat-8 3000000 412 ns/op 80 B/op 4 allocs/op
The -8 is GOMAXPROCS. 3000000 is the iteration count (b.N). The remaining three numbers are nanoseconds, bytes, and allocations per iteration. The allocations column is the one juniors ignore and seniors stare at.
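The same columns can also be read from code. The sketch below assumes you want to run a benchmark outside go test; it uses testing.Benchmark and prints the fields that correspond to the output line above. The concat helper and the sink variable are names invented for this illustration, and no expected numbers are shown because they depend on the machine and the Go version.

```go
package main

import (
	"fmt"
	"testing"
)

// sink is a package-level variable (hypothetical name). Storing each result
// here gives the benchmarked work an observable effect.
var sink string

// concat joins parts with repeated +=, the pattern measured in BenchmarkConcat.
func concat(parts []string) string {
	var s string
	for _, p := range parts {
		s += p
	}
	return s
}

func main() {
	parts := []string{"a", "b", "c", "d", "e"}

	// testing.Benchmark chooses b.N itself, as go test does.
	res := testing.Benchmark(func(b *testing.B) {
		b.ReportAllocs()
		for i := 0; i < b.N; i++ {
			sink = concat(parts)
		}
	})

	fmt.Println("iterations:", res.N)
	fmt.Println("ns/op:     ", res.NsPerOp())
	fmt.Println("B/op:      ", res.AllocedBytesPerOp())
	fmt.Println("allocs/op: ", res.AllocsPerOp())
}
```

For routine work the go test command line is still the simpler path; this form is mainly useful when a program needs the numbers as values.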
Interview trap: go test -bench=. without -benchmem hides 80 percent of the truth. Memory allocations drive GC pressure, GC pressure drives tail latency, tail latency drives your SLO. Always pass -benchmem.
Comparing alternatives
Benchmarks shine when you compare two implementations side by side.
func BenchmarkConcatBuilder(b *testing.B) {
	parts := []string{"a", "b", "c", "d", "e"}
	b.ReportAllocs()
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		var sb strings.Builder
		for _, p := range parts {
			sb.WriteString(p)
		}
		_ = sb.String()
	}
}
Run both, eyeball the difference, then run benchstat for statistical significance:
go test -bench=. -benchmem -count=10 > old.txt
# change code
go test -bench=. -benchmem -count=10 > new.txt
benchstat old.txt new.txt
benchstat runs a significance test (a Mann-Whitney U-test by default) and tells you whether the delta is real or just noise. Without -count=10, you are reading one noisy sample and drawing conclusions from it.
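For a quick side-by-side look at allocation counts before a full benchstat run, testing.AllocsPerRun can be called directly. This is a minimal sketch: concatPlus, concatBuilder, and sink are illustrative names, and the printed counts are not shown here because they vary with the Go version and with escape analysis.

```go
package main

import (
	"fmt"
	"strings"
	"testing"
)

// sink keeps the results reachable so the calls are not dead code.
var sink string

// concatPlus builds the string with repeated +=.
func concatPlus(parts []string) string {
	var s string
	for _, p := range parts {
		s += p
	}
	return s
}

// concatBuilder builds the same string with strings.Builder.
func concatBuilder(parts []string) string {
	var sb strings.Builder
	for _, p := range parts {
		sb.WriteString(p)
	}
	return sb.String()
}

func main() {
	parts := []string{"a", "b", "c", "d", "e"}

	// AllocsPerRun reports the average number of allocations per call of f.
	plus := testing.AllocsPerRun(1000, func() { sink = concatPlus(parts) })
	builder := testing.AllocsPerRun(1000, func() { sink = concatBuilder(parts) })

	fmt.Printf("+=      allocs/run: %.0f\n", plus)
	fmt.Printf("Builder allocs/run: %.0f\n", builder)
}
```

This only counts allocations; timing comparisons still belong in benchmarks run with -count and compared with benchstat.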
CPU profiling: where the seconds go
The -cpuprofile flag tells the test binary to record a sampling profile.
go test -bench=BenchmarkHotPath -cpuprofile=cpu.out -benchmem
go tool pprof cpu.out
Inside the pprof shell:
(pprof) top
(pprof) top -cum
(pprof) list HotFunction
(pprof) web
top ranks functions by self-time. top -cum ranks by cumulative time including callees, which is what you want for finding wrappers. list FuncName shows source code with per-line cost. web opens an SVG callgraph in your browser.
Senior rule: when reading top, look for surprises. The function you expected to dominate is rarely the actual hotspot. The hotspot is very often runtime.mallocgc, runtime.gcBgMarkWorker, or a fmt.Sprintf you forgot you were calling in a loop.
Memory profiling: where the bytes go
go test -bench=BenchmarkHotPath -memprofile=mem.out
go tool pprof -alloc_space mem.out
-alloc_space shows total bytes allocated over the run. -inuse_space shows live bytes at the moment the profile was captured. For tuning GC pressure, you want -alloc_space, because every allocation eventually becomes garbage.
Common allocation hotspots:
// String concatenation in a loop: O(N^2) copies, N allocations
s := ""
for _, p := range parts { s += p }
// fmt.Sprintf on the hot path: reflection + heap allocation per call
key := fmt.Sprintf("user:%d", id)
// Returning a pointer to a small struct: usually escapes to the heap
func newPoint(x, y int) *Point { return &Point{x, y} }
// Appending to a slice without preallocating: O(log N) reallocations
var out []int
for _, v := range in { out = append(out, v*2) }
The fixes are mechanical. strings.Builder for concatenation. strconv.AppendInt plus a reusable buffer instead of Sprintf. Return small structs by value. make([]int, 0, len(in)) to preallocate.
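A minimal sketch of three of those fixes follows. The helper names (appendUserKey, doubled) and the field names of Point are assumptions made for the example, not part of the lesson's code.

```go
package main

import (
	"fmt"
	"strconv"
)

// Point is a small struct; returning it by value avoids forcing it to the heap.
type Point struct{ X, Y int }

func newPoint(x, y int) Point { return Point{x, y} }

// appendUserKey appends "user:<id>" to buf and returns the extended slice.
// Reusing buf across calls replaces the per-call work of fmt.Sprintf.
func appendUserKey(buf []byte, id int64) []byte {
	buf = append(buf, "user:"...)
	return strconv.AppendInt(buf, id, 10)
}

// doubled preallocates the result so append never has to grow the slice.
func doubled(in []int) []int {
	out := make([]int, 0, len(in))
	for _, v := range in {
		out = append(out, v*2)
	}
	return out
}

func main() {
	buf := make([]byte, 0, 32) // reusable scratch buffer
	for _, id := range []int64{7, 42} {
		buf = appendUserKey(buf[:0], id) // reset length, keep capacity
		fmt.Println(string(buf))         // prints user:7, then user:42
	}
	fmt.Println(doubled([]int{1, 2, 3}), newPoint(1, 2)) // prints [2 4 6] {1 2}
}
```

Note that string(buf) itself copies; on a real hot path the byte slice would be written to its destination directly.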
Senior rule: optimise for memory before CPU. Much "slow Go code" is GC pressure in disguise. Halving allocations often cuts p99 latency sharply, even when the per-operation CPU cost looks unchanged.
Profiling a live service
For HTTP services, import net/http/pprof and the profile endpoints register themselves on http.DefaultServeMux.
import (
	"log"
	"net/http"
	_ "net/http/pprof"
)

func main() {
	go func() {
		log.Println(http.ListenAndServe("localhost:6060", nil))
	}()
	runApp()
}
Now fetch profiles live:
go tool pprof "http://localhost:6060/debug/pprof/profile?seconds=30"
go tool pprof http://localhost:6060/debug/pprof/heap
go tool pprof http://localhost:6060/debug/pprof/goroutine
The goroutine profile is especially useful for tracking down leaks. If your goroutine count climbs without bound, fetch this profile and top will show where those goroutines are parked, which usually points at the call site that spawned them.
For non-HTTP services, use runtime/pprof to write profiles to disk on demand or on a signal.
f, err := os.Create("cpu.prof")
if err != nil {
	log.Fatal(err)
}
pprof.StartCPUProfile(f) // also returns an error; check it in real code
defer pprof.StopCPUProfile()
Reading flame graphs
go tool pprof -http=:8080 cpu.out opens a web UI with a flame graph. Width is time, stacking is call depth. Find the widest box that is your code (not runtime, not stdlib). That is your hotspot. Click it, read the source, ask "do I need to call this at all?" before "can I make this faster?".
Most real wins come from eliminating work, not speeding up work. Cache the result. Hoist the computation out of the loop. Move the JSON encoding to a background worker. Replace a per-request regexp.Compile with a package-level var re = regexp.MustCompile(...) so the parsing happens once. These changes routinely move p99 latency by 50 percent and they do not require any clever algorithm; they require attention to where time is actually spent.
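The regexp case can be sketched as follows. The pattern and the function names are made up for the illustration; the point is only where compilation happens.

```go
package main

import (
	"fmt"
	"regexp"
)

// orderID is compiled once, at package initialisation (hypothetical pattern).
var orderID = regexp.MustCompile(`^ORD-\d{6}$`)

// validSlow recompiles the pattern on every call, as a per-request
// regexp.Compile would.
func validSlow(s string) bool {
	re, err := regexp.Compile(`^ORD-\d{6}$`)
	if err != nil {
		return false
	}
	return re.MatchString(s)
}

// validFast reuses the precompiled package-level regexp.
func validFast(s string) bool {
	return orderID.MatchString(s)
}

func main() {
	for _, s := range []string{"ORD-123456", "ORD-12"} {
		// Both functions agree; only the cost per call differs.
		fmt.Println(s, validSlow(s), validFast(s))
	}
}
```

A compiled *regexp.Regexp is safe for concurrent use, so sharing one package-level value across request handlers is fine.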
A second tip on reading profiles: always compare against a known-good baseline. Capture a profile of the same workload before your change, capture another after, and diff them with go tool pprof -base old.out new.out. The diff view shows only what changed, which makes it obvious whether your fix actually helped or just moved the cost somewhere else.
Performance toolkit
- b.N: Iteration count the testing framework picks so the benchmark runs long enough for stable timing.
- b.ReportAllocs: Tells the benchmark to include B/op and allocs/op columns. It has the same effect as the -benchmem flag, but for a single benchmark.
- pprof: Go's built-in sampling profiler. Captures CPU, heap, goroutine, mutex, and block profiles.
- benchstat: Command-line tool that runs a significance test across multiple benchmark runs to confirm a delta is statistically real.
- escape analysis: Compiler pass that decides whether a value lives on the stack or escapes to the heap. Pointers to locals that outlive the function force a heap allocation.
- flame graph: Stacked horizontal bar visualisation of a profile. Width = time, stack = call depth. The widest box in your own code is your hotspot.
What we will cover next
The next lesson moves from benchmark microcosm to production reality: HTTP server patterns. Middleware, graceful shutdown, timeouts, and the one configuration knob (ReadHeaderTimeout) that protects you from a Slowloris attack. The pprof skills from this lesson plug in directly: once your service is running, you profile it the same way you profile a benchmark.