ToolPop
Go · Lesson 19 of 22

pprof, benchmarks, and finding the real hotspot

8 min read · Updated 25 Jun 2026

Performance work in Go is unusually pleasant because the toolchain ships with nearly everything you need: testing.B, pprof, the race detector, the trace viewer. No third-party agent, no APM bill, no language-specific quirks to learn. The discipline is to use them in the right order: benchmark to confirm there is a problem, profile to find the hotspot, then change code, then re-benchmark.

The number one mistake juniors make is optimising by intuition. The number two mistake is running go test -bench=. and trusting the timing without -benchmem. The number three mistake is optimising code that runs once when the actual bottleneck is the hot loop two functions away. We will fix all three by following the data instead of the gut.

Benchmark anatomy

A benchmark is a function in a _test.go file that takes a *testing.B. The framework calls it with growing values of b.N until the run lasts long enough to time reliably (one second by default, adjustable with -benchtime).

go
func BenchmarkConcat(b *testing.B) {
    parts := []string{"a", "b", "c", "d", "e"}
    b.ReportAllocs()
    b.ResetTimer()
    for i := 0; i < b.N; i++ {
        var s string
        for _, p := range parts {
            s += p
        }
        _ = s
    }
}

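Three lines deserve attention. b.ReportAllocs() adds the B/op and allocs/op columns for this benchmark, the same effect -benchmem has for every benchmark in the run. b.ResetTimer() discards setup time so only the timed loop counts. The _ = s marks the result as used; it is a hint rather than a guarantee against the compiler optimising the work away, and storing the result in a package-level variable is the more defensive idiom.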
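
A more defensive variant of the same benchmark can be sketched as follows. It assumes a hypothetical package-level variable named sink, and it uses testing.Benchmark so the sketch runs as a plain program; in a real project benchConcat would be named BenchmarkConcat, live in a _test.go file, and run under go test.

```go
package main

import (
	"fmt"
	"testing"
)

// sink is a hypothetical package-level variable. Storing the result here
// makes it observable outside the loop, so the work cannot be discarded.
var sink string

// benchConcat has the same shape as BenchmarkConcat in the lesson.
func benchConcat(b *testing.B) {
	parts := []string{"a", "b", "c", "d", "e"}
	b.ReportAllocs()
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		var s string
		for _, p := range parts {
			s += p
		}
		sink = s
	}
}

// runConcat runs the benchmark outside "go test" via testing.Benchmark.
func runConcat() testing.BenchmarkResult {
	return testing.Benchmark(benchConcat)
}

func main() {
	r := runConcat()
	// The same four figures "go test -bench" prints on its result line.
	fmt.Printf("N=%d  %d ns/op  %d B/op  %d allocs/op\n",
		r.N, r.NsPerOp(), r.AllocedBytesPerOp(), r.AllocsPerOp())
}
```

The exact numbers depend on the machine and Go version, so treat them as relative measurements rather than constants.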

Run it:

bash
go test -bench=. -benchmem -benchtime=2s ./...

The output looks like:

code
BenchmarkConcat-8   3000000   412 ns/op   80 B/op   4 allocs/op

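The -8 suffix is the GOMAXPROCS value the benchmark ran with. The first number, 3000000, is the iteration count (the final b.N). The remaining three are nanoseconds, bytes allocated, and allocations per iteration. The allocations column is the one juniors ignore and seniors stare at.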

Interview trap: go test -bench=. without -benchmem hides 80 percent of the truth. Memory allocations drive GC pressure, GC pressure drives tail latency, tail latency drives your SLO. Always pass -benchmem.

Comparing alternatives

Benchmarks shine when you compare two implementations side by side.

go
func BenchmarkConcatBuilder(b *testing.B) {
    parts := []string{"a", "b", "c", "d", "e"}
    b.ReportAllocs()
    b.ResetTimer()
    for i := 0; i < b.N; i++ {
        var sb strings.Builder
        for _, p := range parts {
            sb.WriteString(p)
        }
        _ = sb.String()
    }
}

Run both, eyeball the difference, then run benchstat for statistical significance:

bash
go test -bench=. -benchmem -count=10 > old.txt
# change code
go test -bench=. -benchmem -count=10 > new.txt
benchstat old.txt new.txt

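benchstat is not bundled with the Go distribution; install it with go install golang.org/x/perf/cmd/benchstat@latest. It applies a significance test across the runs (a Mann-Whitney U-test by default) and reports whether the delta is real or just noise. Without -count=10, you are reading one noisy sample and drawing conclusions from it.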
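
For a quick allocation-only comparison that stays inside Go, testing.AllocsPerRun reports the average number of heap allocations per call. The sketch below is an illustration under assumptions, not a replacement for the benchstat workflow: the helpers concatPlus, concatBuilder, sampleParts, and allocsFor are hypothetical names, and the 64-element input is arbitrary. It says nothing about time, only allocations.

```go
package main

import (
	"fmt"
	"strings"
	"testing"
)

// result is a package-level sink so the calls below are not optimised away.
var result string

// concatPlus builds the string with +=, copying the prefix on every step.
func concatPlus(parts []string) string {
	var s string
	for _, p := range parts {
		s += p
	}
	return s
}

// concatBuilder builds the same string with strings.Builder.
func concatBuilder(parts []string) string {
	var sb strings.Builder
	for _, p := range parts {
		sb.WriteString(p)
	}
	return sb.String()
}

// sampleParts returns an arbitrary 64-element input for the comparison.
func sampleParts() []string {
	parts := make([]string, 64)
	for i := range parts {
		parts[i] = "chunk-of-text"
	}
	return parts
}

// allocsFor averages heap allocations per call of f over 100 runs.
func allocsFor(f func([]string) string, parts []string) float64 {
	return testing.AllocsPerRun(100, func() { result = f(parts) })
}

func main() {
	parts := sampleParts()
	fmt.Println("+= allocs/op:     ", allocsFor(concatPlus, parts))
	fmt.Println("Builder allocs/op:", allocsFor(concatBuilder, parts))
}
```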

CPU profiling: where the seconds go

The -cpuprofile flag tells the test binary to record a sampling profile.

bash
go test -bench=BenchmarkHotPath -cpuprofile=cpu.out -benchmem
go tool pprof cpu.out

Inside the pprof shell:

code
(pprof) top
(pprof) top -cum
(pprof) list HotFunction
(pprof) web

top ranks functions by self-time (the flat column). top -cum ranks by cumulative time including callees, which is what you want for finding wrappers. list FuncName shows source code with per-line cost. web opens an SVG callgraph in your browser; it needs Graphviz installed.

Senior rule: when reading top, look for surprises. The function you expected to dominate is rarely the actual hotspot. The hotspot is often runtime.mallocgc, runtime.gcBgMarkWorker, or a fmt.Sprintf you forgot you were calling in a loop.

Memory profiling: where the bytes go

bash
go test -bench=BenchmarkHotPath -memprofile=mem.out
go tool pprof -alloc_space mem.out

-alloc_space shows total bytes allocated over the run. -inuse_space shows live bytes at the moment the profile was captured. For tuning GC pressure, you want -alloc_space, because every allocation eventually becomes garbage.

Common allocation hotspots:

go
// String concatenation in a loop: O(N^2) copies, N allocations
s := ""
for _, p := range parts { s += p }
 
// fmt.Sprintf on the hot path: reflection + heap allocation per call
key := fmt.Sprintf("user:%d", id)
 
// Returning a pointer to a small struct: usually escapes to the heap (unless the call is inlined and the pointer stays local)
func newPoint(x, y int) *Point { return &Point{x, y} }
 
// Appending to a slice without preallocating: log(N) reallocations
var out []int
for _, v := range in { out = append(out, v*2) }

The fixes are mechanical. strings.Builder for concatenation. strconv.AppendInt plus a reusable buffer instead of Sprintf. Return value types when the struct is small. make([]int, 0, len(in)) to preallocate.
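
Three of those fixes can be sketched in a few lines. The names appendUserKey, doubleAll, and the Point fields are hypothetical, chosen only to mirror the snippets above; the 32-byte buffer capacity is an arbitrary assumption.

```go
package main

import (
	"fmt"
	"strconv"
)

// Point is a small struct that is cheap to copy.
type Point struct{ X, Y int }

// newPoint returns the struct by value, so nothing is forced onto the heap.
func newPoint(x, y int) Point { return Point{x, y} }

// appendUserKey writes "user:<id>" into buf and returns the extended slice.
// Passing buf[:0] of a reused buffer avoids a fresh allocation per call.
func appendUserKey(buf []byte, id int64) []byte {
	buf = append(buf, "user:"...)
	return strconv.AppendInt(buf, id, 10)
}

// doubleAll preallocates the output so append never has to grow it.
func doubleAll(in []int) []int {
	out := make([]int, 0, len(in))
	for _, v := range in {
		out = append(out, v*2)
	}
	return out
}

func main() {
	buf := make([]byte, 0, 32) // reused across iterations
	for id := int64(1); id <= 3; id++ {
		buf = appendUserKey(buf[:0], id)
		// Converting to string here allocates; a hot path would keep
		// working with the []byte instead.
		fmt.Println(string(buf))
	}
	fmt.Println(doubleAll([]int{1, 2, 3}), newPoint(1, 2))
}
```

Whether each change pays off depends on the call site, so confirm with -benchmem before and after rather than assuming.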

Senior rule: optimise for memory before CPU. A lot of "slow Go code" is GC pressure in disguise. Cutting allocations often improves p99 latency noticeably, even when the per-operation CPU cost looks unchanged.

Profiling a live service

For HTTP services, import net/http/pprof and the profile endpoints register themselves on http.DefaultServeMux.

go
import (
    "log"
    "net/http"
    _ "net/http/pprof" // registers /debug/pprof/* on http.DefaultServeMux
)
 
func main() {
    go func() {
        log.Println(http.ListenAndServe("localhost:6060", nil))
    }()
    runApp()
}
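
If the application already serves traffic from http.DefaultServeMux, one option is to mount the profile handlers on a separate mux and expose that only on a private address. The sketch below assumes hypothetical helpers newDebugMux and statusFor; httptest is used only so the example runs without opening a listener. Note that importing net/http/pprof at all still registers the handlers on http.DefaultServeMux through its init function, so this helps only when the public server uses its own mux.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"net/http/pprof"
)

// newDebugMux returns a mux that serves only the pprof endpoints.
func newDebugMux() *http.ServeMux {
	mux := http.NewServeMux()
	// Index also serves the named profiles: heap, goroutine, mutex, ...
	mux.HandleFunc("/debug/pprof/", pprof.Index)
	mux.HandleFunc("/debug/pprof/cmdline", pprof.Cmdline)
	mux.HandleFunc("/debug/pprof/profile", pprof.Profile)
	mux.HandleFunc("/debug/pprof/symbol", pprof.Symbol)
	mux.HandleFunc("/debug/pprof/trace", pprof.Trace)
	return mux
}

// statusFor sends an in-memory GET to the debug mux and returns the status code.
func statusFor(path string) int {
	rec := httptest.NewRecorder()
	req := httptest.NewRequest(http.MethodGet, path, nil)
	newDebugMux().ServeHTTP(rec, req)
	return rec.Code
}

func main() {
	fmt.Println(statusFor("/debug/pprof/goroutine?debug=1"))
	// In a service, the mux would be served on a private address, e.g.:
	//   go func() { log.Println(http.ListenAndServe("localhost:6060", newDebugMux())) }()
}
```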

Now fetch profiles live:

bash
go tool pprof "http://localhost:6060/debug/pprof/profile?seconds=30"
go tool pprof http://localhost:6060/debug/pprof/heap
go tool pprof http://localhost:6060/debug/pprof/goroutine

The goroutine profile is especially useful for tracking down leaks. If your goroutine count climbs unboundedly, fetch this profile and top will show which functions the stuck goroutines are parked in, which usually points straight at the call site that spawned them.
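
The same data is available in-process through runtime/pprof, which makes a leak easy to demonstrate. A minimal sketch, assuming hypothetical helpers goroutineCount and leak; the blocked goroutines stand in for workers that were never told to exit.

```go
package main

import (
	"fmt"
	"runtime/pprof"
)

// goroutineCount reports how many goroutines the goroutine profile holds.
func goroutineCount() int {
	return pprof.Lookup("goroutine").Count()
}

// leak starts n goroutines that block until stop is closed.
func leak(n int) (stop chan struct{}) {
	stop = make(chan struct{})
	for i := 0; i < n; i++ {
		go func() { <-stop }()
	}
	return stop
}

func main() {
	before := goroutineCount()
	stop := leak(10)
	after := goroutineCount()
	fmt.Printf("goroutines: before=%d after=%d\n", before, after)

	// pprof.Lookup("goroutine").WriteTo(os.Stdout, 1) would print the stacks
	// grouped by call site, the same text /debug/pprof/goroutine?debug=1 serves.
	close(stop) // release the goroutines so the sketch exits cleanly
}
```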

For non-HTTP services, use runtime/pprof to write profiles to disk on demand or on a signal.

go
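// Needs imports: "log", "os", "runtime/pprof".
f, err := os.Create("cpu.prof")
if err != nil { log.Fatal(err) }
defer f.Close()
if err := pprof.StartCPUProfile(f); err != nil { log.Fatal(err) }
defer pprof.StopCPUProfile() // runs before f.Close, flushing the profile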
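
The same calls can be wrapped in a small helper that profiles one function into any io.Writer. This is a sketch under assumptions: profileCPU, spin, and captureProfile are hypothetical names, and spin is a placeholder workload. In a real service the writer would typically be the *os.File returned by os.Create.

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"log"
	"runtime/pprof"
)

// spinResult keeps the placeholder workload observable.
var spinResult int

// profileCPU runs work while a CPU profile is recorded into w.
func profileCPU(w io.Writer, work func()) error {
	if err := pprof.StartCPUProfile(w); err != nil {
		return err // for example, profiling is already enabled
	}
	defer pprof.StopCPUProfile() // stops sampling and flushes the profile to w
	work()
	return nil
}

// spin is a placeholder CPU-bound workload.
func spin() {
	total := 0
	for i := 0; i < 50_000_000; i++ {
		total += i % 7
	}
	spinResult = total
}

// captureProfile profiles spin and returns the raw profile bytes.
func captureProfile() ([]byte, error) {
	var buf bytes.Buffer
	err := profileCPU(&buf, spin)
	return buf.Bytes(), err
}

func main() {
	data, err := captureProfile()
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("captured %d bytes of profile data\n", len(data))
}
```

Written to a file instead of a buffer, the output opens with go tool pprof like any other CPU profile.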

Diagram: the performance loop (benchmark, profile, change, re-benchmark)

Reading flame graphs

go tool pprof -http=:8080 cpu.out opens a web UI with a flame graph. Width is time, stacking is call depth. Find the widest box that is your code (not runtime, not stdlib). That is your hotspot. Click it, read the source, ask "do I need to call this at all?" before "can I make this faster?".

Most real wins come from eliminating work, not speeding up work. Cache the result. Hoist the computation out of the loop. Move the JSON encoding to a background worker. Replace a per-request regexp.Compile with a package-level var re = regexp.MustCompile(...) so the parsing happens once. These changes can move p99 latency substantially, and they do not require any clever algorithm; they require attention to where time is actually spent.
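
The regexp case makes a compact illustration of eliminating work. A minimal sketch, assuming a hypothetical pattern and the hypothetical helpers isUserKeySlow and isUserKey:

```go
package main

import (
	"fmt"
	"regexp"
)

// userKeyRE is compiled once, when the package is initialised.
var userKeyRE = regexp.MustCompile(`^user:\d+$`)

// isUserKeySlow recompiles the pattern on every call: the work to eliminate.
func isUserKeySlow(s string) bool {
	re, err := regexp.Compile(`^user:\d+$`)
	if err != nil {
		return false
	}
	return re.MatchString(s)
}

// isUserKey reuses the precompiled pattern. A *regexp.Regexp is safe for
// concurrent use, so sharing it across requests is fine.
func isUserKey(s string) bool {
	return userKeyRE.MatchString(s)
}

func main() {
	for _, s := range []string{"user:42", "order:7"} {
		fmt.Println(s, isUserKeySlow(s), isUserKey(s))
	}
}
```

Both functions return the same answers; only the amount of work per call differs, which is exactly what a benchmark with -benchmem should confirm.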

A second tip on reading profiles: always compare against a known-good baseline. Capture a profile of the same workload before your change, capture another after, and diff them with go tool pprof -base old.out new.out. The diff view shows only what changed, which makes it obvious whether your fix actually helped or just moved the cost somewhere else.

Performance toolkit

b.N
Iteration count the testing framework picks so the benchmark runs long enough for stable timing.
b.ReportAllocs
Tells the benchmark to include B/op and allocs/op columns. Same effect as the -benchmem flag, but scoped to that one benchmark.
pprof
Go's built-in sampling profiler. Captures CPU, heap, goroutine, mutex, and block profiles.
benchstat
Command-line tool from golang.org/x/perf that applies a significance test (a Mann-Whitney U-test by default) across multiple benchmark runs to confirm a delta is statistically real.
escape analysis
Compiler pass that decides whether a value lives on the stack or escapes to the heap. Pointers to locals that outlive the function force a heap allocation.
flame graph
Stacked horizontal bar visualisation of a profile. Width = time, stack = call depth. The widest box in your own code is your hotspot.

What we will cover next

The next lesson moves from benchmark microcosm to production reality: HTTP server patterns. Middleware, graceful shutdown, timeouts, and the one configuration knob (ReadHeaderTimeout) that protects you from a Slowloris attack. The pprof skills from this lesson plug in directly: once your service is running, you profile it the same way you profile a benchmark.
