Real-World Scenarios: Argus vs Traditional Toolchains
Twelve production incidents that a senior JVM engineer recognises on sight. For each one we show the commands you would actually type today, then the single Argus command that collapses the same evidence into one screen. We are honest about the cases where Argus does not replace the host shell — the OS-boundary cases (safepoint logs, swap, file descriptors, kernel OOM events) still need vmstat, lsof, or dmesg alongside Argus.
1. G1GC Humongous Object Allocation Bomb
Symptom. Heap looks fine (6 GB committed, 40% usage), yet every few minutes the application stalls for 2–4 seconds. The culprit is multi-megabyte byte[] allocations from an Excel reader or a streaming file API. G1 allocates them directly in humongous regions, fragmentation builds up, and eventually an evacuation failure triggers a multi-second STW.
Metrics required to diagnose:
- Humongous allocation events — frequency, region count, source call sites
- G1 region fragmentation and evacuation failure markers in the GC log
- Allocation hotspots, ordered by allocated bytes per call site
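The humongous threshold itself is plain arithmetic. As a minimal sketch, assuming a 16-byte array header (typical for 64-bit HotSpot but flag-dependent) and the "half a region or more" rule of thumb (HotSpot's exact boundary comparison may differ by a word), the hypothetical helper below checks whether a buffer would be allocated as humongous and how many regions it would occupy:

```java
/** Sketch: estimate whether an array allocation is humongous under G1. */
final class HumongousCheck {
    // Assumption: typical 64-bit HotSpot array header; the real size depends on JVM flags.
    static final long ARRAY_HEADER_BYTES = 16;

    /** Approximate heap footprint of a byte[] of the given length, 8-byte aligned. */
    static long byteArrayBytes(long length) {
        long raw = ARRAY_HEADER_BYTES + length;
        return (raw + 7) / 8 * 8;
    }

    /** Conservative rule of thumb: half a region or more is treated as humongous. */
    static boolean isHumongous(long objectBytes, long regionBytes) {
        return objectBytes >= regionBytes / 2;
    }

    /** Humongous objects occupy whole contiguous regions, so round up. */
    static long regionsNeeded(long objectBytes, long regionBytes) {
        return (objectBytes + regionBytes - 1) / regionBytes;
    }

    public static void main(String[] args) {
        long region = 4L * 1024 * 1024;                  // G1HeapRegionSize = 4194304
        long buffer = byteArrayBytes(3L * 1024 * 1024);  // a 3 MiB read buffer
        System.out.println("humongous: " + isHumongous(buffer, region));
        System.out.println("regions:   " + regionsNeeded(buffer, region));
    }
}
```

With 4 MB regions, any read buffer of roughly 2 MB or more falls on the humongous side of this check, which is why shrinking the buffer or raising the region size both work as fixes.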
Traditional toolchain
# Enable verbose G1 logging, then re-correlate against allocation profile
$ jcmd 12345 VM.flags | grep -E 'UseG1GC|G1HeapRegionSize'
uintx G1HeapRegionSize = 4194304
bool UseG1GC = true
$ jcmd 12345 GC.heap_info
garbage-first heap total 6291456K, used 2516201K [0x0000...]
region size 4096K, 14 young (57344K), 3 survivors (12288K)
Metaspace used 84321K, committed 86016K, reserved 1130496K
# Restart the app with -Xlog:gc*,gc+humongous=trace and re-grep
$ grep -E 'Humongous|To-space exhausted' gc.log | tail -20
[3214.876s][info ][gc,humongous] GC(412) Humongous region: 1 -> 1
[3214.880s][info ][gc ] GC(412) Pause Young (Concurrent Start) (G1 Humongous Allocation)
[3401.122s][warn ][gc ] GC(488) To-space exhausted
# Get the call sites
$ asprof -e alloc -d 30 -f /tmp/alloc.html 12345
$ open /tmp/alloc.html # click into HumongousObjAllocator path
Argus
$ argus doctor 12345
╭─ argus doctor ──────────────────────────────────────────────╮
│ JVM Health Report pid:12345 HotSpot uptime:1h 14m │
│ │
│ Heap: 2.4 GB/6.0 GB (40%) | CPU: 22% | Threads: 184 │
│ GC: 14.6% overhead │
│ ──────────────────────────────────────────────────────────── │
│ │
│ 1 critical 1 warning │
│ │
│ ✘ CRITICAL: G1 humongous allocations dominating pauses │
│ 14 humongous events in last 5 min; max pause 2,840 ms │
│ → Reduce object size below G1HeapRegionSize/2 (2 MB) │
│ → Or raise -XX:G1HeapRegionSize=8m │
│ │
│ ⚠ WARNING: GC overhead 14.6% (threshold 10%) │
│ → argus profile 12345 --event alloc --duration 30 │
│ │
╰─ ✘ critical ────────────────────────────────────────────────╯
$ argus profile 12345 --event alloc --duration 30
Top allocation sites (n=18,204 samples, 30s window)
1. com.acme.report.XlsxReader.readSheet 41.8% byte[]
2. java.util.zip.Inflater.inflateBytes 17.2% byte[]
3. com.acme.report.XlsxReader.rowBuffer 9.6% byte[]
Verdict.
| Metric | Traditional | Argus |
|---|---|---|
| Humongous event count | ⚠️ requires -Xlog:gc* + restart | ✅ surfaced by doctor |
| Allocation hotspots | ⚠️ async-profiler + HTML viewer | ✅ argus profile --event alloc |
| Region size + flag fix | ✅ jcmd VM.flags | ✅ included as recommendation |
One Argus run replaces a restart-for-logs cycle plus a separate profiler session.
2. Time To Safepoint (TTSP) Black Hole
Symptom. The GC log says Pause Young: 11 ms but the application reports stop-the-world freezes of several seconds. The culprit is TTSP: a counted loop scanning a giant array does not poll for safepoints, so every other thread sits idle waiting for the laggard to check in.
Metrics required to diagnose:
- Safepoint entry latency — time-to-reach-safepoint per VM operation
- Wall-clock samples of application threads during the stall
- JIT-compiled methods using counted loops on large arrays
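To make the counted-loop pattern concrete, here is a minimal sketch of the loop shape in question and one common workaround: splitting the scan into bounded chunks driven by a long-indexed outer loop. Whether C2 actually elides the safepoint poll depends on JDK version and flags (recent JDKs strip-mine counted loops by default with G1), so treat this as an illustration rather than a guaranteed fix. The class name and chunk size are hypothetical.

```java
import java.util.Arrays;

/** Sketch: the same scan as a plain counted loop and as a chunked loop. */
final class ChunkedScan {
    // Hypothetical chunk size: small enough that each inner loop finishes quickly.
    static final int CHUNK = 1 << 16;

    /** Plain int-indexed counted loop: the shape C2 may compile without a poll on the back edge. */
    static long sumCounted(int[] data) {
        long sum = 0;
        for (int i = 0; i < data.length; i++) {
            sum += data[i];
        }
        return sum;
    }

    /** Same work, but each inner counted loop is bounded by CHUNK iterations. */
    static long sumChunked(int[] data) {
        long sum = 0;
        for (long start = 0; start < data.length; start += CHUNK) {
            int end = (int) Math.min(data.length, start + CHUNK);
            for (int i = (int) start; i < end; i++) {
                sum += data[i];
            }
        }
        return sum;
    }

    public static void main(String[] args) {
        int[] data = new int[200_000];
        Arrays.setAll(data, i -> i);
        System.out.println(sumCounted(data) == sumChunked(data));
    }
}
```

Both methods return the same result; the chunked form only bounds how long a thread can run between opportunities to reach a safepoint.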
Traditional toolchain
# Add the safepoint logging flag and restart
$ java -Xlog:safepoint*=info:file=safepoint.log:time,uptime,level ...
$ grep 'Total time for which application threads were stopped' safepoint.log
[uptime=621.142s] Total time for which application threads were stopped: 4.8214s
Stopping threads took: 4.8101s
# 4.81 of 4.82 seconds were spent reaching the safepoint, not in the GC itself
# (newer JDKs log the same split as "Reaching safepoint" / "At safepoint", in nanoseconds)
$ asprof -e wall -d 10 -f /tmp/wall.html 12345
# Open in flame-graph viewer, look for hot frames during the stall window
Argus
$ argus profile 12345 --event wall --duration 10
╭─ argus profile --event wall ────────────────────────────────╮
│ Wall-clock profile pid:12345 duration:10s samples:9,873 │
│ │
│ Top wall-time methods │
│ 1. com.acme.search.MatrixScan.find 58.4% │
│ 2. java.util.HashMap.getNode 6.1% │
│ 3. sun.nio.ch.EPollSelectorImpl.doSelect 5.8% │
│ │
│ ⚠ A single method dominates wall time — review for │
│ counted loops without safepoint polls (TTSP risk). │
│ │
╰──────────────────────────────────────────────────────────────╯
Verdict.
| Metric | Traditional | Argus |
|---|---|---|
| Safepoint entry latency | ✅ -Xlog:safepoint | ❌ Argus does not parse safepoint logs |
| Wall-clock hotspots | ⚠️ async-profiler + HTML viewer | ✅ argus profile --event wall |
| Root cause identification | ✅ once both signals correlated | ⚠️ partial — points at the suspect method |
Honest gap: Argus shows the dominant wall-time method, which is usually the offending counted loop, but you still want -Xlog:safepoint in production to prove the stop-the-world time is in safepoint entry, not in the collector.
3. Linux OOMKilled — The Silent Assassin
Symptom. Heap stays at 50%, no OutOfMemoryError is thrown, yet the pod is killed and restarted. The kernel OOM-killer fired because resident memory blew past the container limit. The usual suspects are Netty DirectByteBuffer, gRPC native channels, or native allocator (malloc arena) fragmentation, all of which live outside the heap.
Metrics required to diagnose:
- Native memory committed by category (Class, Thread, Code, GC, Internal, …)
- Container working-set vs limit (cgroup memory.max)
- Kernel kill event in dmesg / journalctl
Traditional toolchain
# Confirm the kernel killed it
$ dmesg | grep -i 'killed process'
[7421.882] Out of memory: Killed process 12345 (java) total-vm:9120384kB,
anon-rss:7842112kB, file-rss:0kB, shmem-rss:0kB
# Enable NMT in the manifest (-XX:NativeMemoryTracking=summary), restart, then capture a baseline
$ jcmd 12345 VM.native_memory baseline
$ # ... 30 min later ...
$ jcmd 12345 VM.native_memory summary.diff
Native Memory Tracking:
Total: reserved=9628700KB +384200KB, committed=8420112KB +402112KB
- Class (reserved=180224KB, committed=178892KB)
- Thread (reserved=520304KB +12000KB, committed=520304KB +12000KB)
- Internal (reserved=445188KB +321000KB, committed=445188KB +321000KB)
... scan all 18 categories manually ...
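The manual scan can be scripted. As a rough sketch, assuming the summary.diff category lines keep the shape shown above (the exact layout varies between JDK versions), a hypothetical parser can pull out each category and sort by committed growth:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Sketch: rank NMT summary.diff categories by committed-memory growth. */
final class NmtDiff {
    static final class Row {
        final String category;
        final long committedKb;
        final long committedDeltaKb;
        Row(String category, long committedKb, long committedDeltaKb) {
            this.category = category;
            this.committedKb = committedKb;
            this.committedDeltaKb = committedDeltaKb;
        }
    }

    // Matches e.g. "- Thread (reserved=520304KB +12000KB, committed=520304KB +12000KB)".
    // The delta parts are optional: categories that did not change print no delta.
    private static final Pattern LINE = Pattern.compile(
        "-\\s*([A-Za-z ]+?)\\s*\\(reserved=(\\d+)KB(?:\\s+([+-]\\d+)KB)?,\\s*"
        + "committed=(\\d+)KB(?:\\s+([+-]\\d+)KB)?\\)");

    static List<Row> topGrowers(List<String> lines) {
        List<Row> rows = new ArrayList<>();
        for (String line : lines) {
            Matcher m = LINE.matcher(line);
            if (!m.find()) continue;  // skips the "Total:" line and anything else
            long delta = m.group(5) == null ? 0 : Long.parseLong(m.group(5));
            rows.add(new Row(m.group(1), Long.parseLong(m.group(4)), delta));
        }
        rows.sort(Comparator.comparingLong((Row r) -> r.committedDeltaKb).reversed());
        return rows;
    }

    public static void main(String[] args) {
        List<String> sample = Arrays.asList(
            "- Class (reserved=180224KB, committed=178892KB)",
            "- Thread (reserved=520304KB +12000KB, committed=520304KB +12000KB)");
        for (Row r : topGrowers(sample)) {
            System.out.println(r.category + " " + r.committedDeltaKb + "KB");
        }
    }
}
```

Feeding it the captured jcmd output line by line puts the largest grower first, which is the same ordering the Argus diff presents.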
Argus
$ argus nmt 12345 --save baseline.json
Saved NMT baseline to: /home/ops/baseline.json
# ... 30 minutes of traffic ...
$ argus nmt 12345 --diff baseline.json
╭─ argus nmt --diff ──────────────────────────────────────────╮
│ NMT diff pid:12345 source:jdk elapsed:31m 04s │
│ │
│ Since 2026-05-14 09:12:01: committed +392.7 MB, reserved │
│ +375.2 MB │
│ │
│ Category Reserved Δ Committed Δ Reserved now │
│ ────────────────────────────────────────────────────── │
│ Internal +321 MB +321 MB 445 MB │
│ Thread +12 MB +12 MB 520 MB │
│ GC +9 MB +9 MB 142 MB │
│ Class +6 MB +5 MB 178 MB │
│ │
│ ✘ Internal (DirectByteBuffer + JNI scratch) is the │
│ dominant grower — usually Netty/gRPC native pools. │
│ │
╰──────────────────────────────────────────────────────────────╯
Verdict.
| Metric | Traditional | Argus |
|---|---|---|
| NMT baseline + diff | ⚠️ manual scan of 18 categories | ✅ banner + sorted growers |
| Container working-set | ✅ kubectl top / cAdvisor | ❌ out of scope |
| Kernel kill event | ✅ dmesg | ❌ out of scope — still need shell |
Honest gap: Argus identifies the JVM-side grower in seconds; dmesg and container metrics are still required to confirm the kernel killed the process.
4. JIT Deoptimization Storm
Symptom. CPU usage spikes and throughput collapses without any code change. Heavy use of runtime-generated lambdas and reflection fills the CodeCache; the JIT marks hot methods made-not-entrant, the JVM falls back to the interpreter, and the cycle repeats.
Metrics required to diagnose:
- CodeCache used vs total, max-used watermark
- nmethod count, deoptimization counter trend
- Compiler queue depth
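The first of these metrics is also visible from inside the process. A minimal sketch, assuming HotSpot's memory pool naming ("CodeHeap '...'" with a segmented code cache, "CodeCache" or "Code Cache" otherwise), sums code cache usage through the standard java.lang.management API. Deoptimization counts and compiler queue depth are, as far as I know, not exposed through that API, so they still need the tools shown in this section. The class name is hypothetical.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryUsage;

/** Sketch: total code cache usage via the platform memory pool MXBeans. */
final class CodeCacheUsage {
    /** Returns {usedBytes, maxBytes}; maxBytes is -1 when any code pool reports no limit. */
    static long[] usedAndMax() {
        long used = 0;
        long max = 0;
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            String name = pool.getName();
            // Assumption: HotSpot pool names; other JVMs name their code pools differently.
            boolean codePool = name.startsWith("CodeHeap")
                || name.replace(" ", "").equals("CodeCache");
            if (!codePool) continue;
            MemoryUsage u = pool.getUsage();
            if (u == null) continue;  // pool is no longer valid
            used += u.getUsed();
            max = (max < 0 || u.getMax() < 0) ? -1 : max + u.getMax();
        }
        return new long[] {used, max};
    }

    public static void main(String[] args) {
        long[] r = usedAndMax();
        System.out.printf("code cache used=%d max=%d%n", r[0], r[1]);
    }
}
```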
Traditional toolchain
$ jcmd 12345 Compiler.codecache
CodeHeap 'non-profiled nmethods': size=120000Kb used=119742Kb max_used=119988Kb free=258Kb
CodeHeap 'profiled nmethods': size=120000Kb used=119410Kb max_used=119410Kb free=590Kb
CodeHeap 'non-nmethods': size=5760Kb used=4012Kb max_used=4040Kb free=1748Kb
# Restart with -XX:+PrintCompilation to see made-not-entrant frequency
$ grep 'made not entrant' compilation.log | wc -l
21847
Argus
$ argus compiler 12345
╭─ argus compiler ────────────────────────────────────────────╮
│ JIT Compiler pid:12345 source:jdk │
│ │
│ ✔ Compilation enabled │
│ │
│ Code cache ████████████████████ 239.4 MB / 245.7 MB (97%)│
│ Max used: 239.4 MB Free: 6.3 MB │
│ │
│ Total blobs: 28,412 nmethods: 24,108 adapters: 1,820 │
│ Compiler queue: 312 │
│ Deoptimizations: 21,847 │
│ │
│ ⚠ Code cache > 80% full — JIT compiler may stop. Increase │
│ -XX:ReservedCodeCacheSize. │
│ │
╰──────────────────────────────────────────────────────────────╯
$ argus doctor 12345
✘ CRITICAL: CodeCache near exhaustion (97%)
21,847 deoptimizations since boot; compiler queue 312
→ -XX:ReservedCodeCacheSize=512m
→ -XX:-UseCodeCacheFlushing (only after raising the size)
Verdict.
| Metric | Traditional | Argus |
|---|---|---|
| CodeCache used / max | ✅ Compiler.codecache | ✅ progress bar + watermark |
| Deoptimization count | ⚠️ requires PrintCompilation restart | ✅ surfaced live |
| Suggested fix | ⚠️ requires JVM-tuning knowledge | ✅ doctor emits the flag |
5. Phantom Thread Swamp
Symptom. A payment-gateway client was deployed without a read timeout. Tomcat worker threads accumulate stuck in socketRead0, the pool saturates, every other endpoint starts returning 503. Load average is low; the JVM looks idle.
Metrics required to diagnose:
- Thread state distribution (RUNNABLE vs WAITING vs BLOCKED)
- Frequency of socketRead0 in stacks
- Per-pool concentration of stuck threads
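All three metrics can be derived from a single in-process thread dump. The sketch below, with hypothetical class and method names, groups threads by pool prefix and counts those whose top frame is a given method. It keys on the top frame rather than the thread state because HotSpot generally reports a thread sitting in a native socket read as RUNNABLE, and the frame name itself differs between JDK versions.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.util.Map;
import java.util.TreeMap;

/** Sketch: count threads per pool that are parked in a given top-of-stack method. */
final class StuckThreads {
    /** "http-nio-8080-exec-12" becomes "http-nio-8080-exec-": strip the trailing worker number. */
    static String poolOf(String threadName) {
        return threadName.replaceAll("\\d+$", "");
    }

    /** Counts threads per pool whose top stack frame is the given method, e.g. "socketRead0". */
    static Map<String, Integer> countByPool(String topFrameMethod) {
        Map<String, Integer> counts = new TreeMap<>();
        ThreadInfo[] infos = ManagementFactory.getThreadMXBean().dumpAllThreads(false, false);
        for (ThreadInfo info : infos) {
            if (info == null) continue;  // thread died while the dump was taken
            StackTraceElement[] stack = info.getStackTrace();
            if (stack.length == 0) continue;
            if (!stack[0].getMethodName().equals(topFrameMethod)) continue;
            counts.merge(poolOf(info.getThreadName()), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(countByPool("socketRead0"));
    }
}
```

A count that stays flat across several samples, concentrated in one pool, is the same signal the three jstack dumps are used to establish.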
Traditional toolchain
$ jstack 12345 > /tmp/dump1.txt
$ sleep 5
$ jstack 12345 > /tmp/dump2.txt
$ sleep 5
$ jstack 12345 > /tmp/dump3.txt
$ grep -c 'socketRead0' /tmp/dump1.txt /tmp/dump2.txt /tmp/dump3.txt
/tmp/dump1.txt:182
/tmp/dump2.txt:184
/tmp/dump3.txt:183
# Confirm they are Tomcat workers
$ grep -B1 'socketRead0' /tmp/dump2.txt | grep '"http-nio'
"http-nio-8080-exec-12" ...
"http-nio-8080-exec-13" ...
... (183 occurrences) ...
Argus
$ argus threads 12345
╭─ argus threads ─────────────────────────────────────────────╮
│ Thread Dump pid:12345 source:jdk │
│ │
│ Total: 248 Virtual: 0 Platform: 248 Peak: 252 │
│ │
│ RUNNABLE ████░░░░░░░░░░░░ 24 ( 10%) │
│ WAITING █████████████░░░ 195 ( 79%) │
│ TIMED_WAITING ██░░░░░░░░░░░░░░ 27 ( 11%) │
│ BLOCKED ░░░░░░░░░░░░░░░░ 2 ( 1%) │
│ │
╰──────────────────────────────────────────────────────────────╯
$ argus pool 12345
Pool Count State
──────────────────────────────────────────────────────────
http-nio-8080-exec- 184 WAIT:183 RUN:1
catalina-utility- 8 WAIT:8
ForkJoinPool.commonPool- 8 WAIT:8
scheduling- 4 WAIT:4
Verdict.
| Metric | Traditional | Argus |
|---|---|---|
| State distribution | ⚠️ 3× jstack + grep | ✅ inline bars |
| Stuck-thread pool | ⚠️ manual aggregation | ✅ argus pool |
| Confirm I/O wait | ✅ stack grep | ⚠️ requires opening one stack |
6. Dynamic Proxy Betrayal (Metaspace Leak)
Symptom. A long-running service slowly grows Metaspace until it hits OutOfMemoryError: Metaspace a few weeks after deploy. Usually a Spring or Hibernate proxy class is pinned by a ThreadLocal that nobody clears, so its ClassLoader cannot be unloaded.
Metrics required to diagnose:
- Metaspace used / committed trend
- Loaded class count growth
- Top ClassLoaders by class count
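The first two trends can be sampled from inside the JVM with the standard ClassLoadingMXBean and the "Metaspace" memory pool (a HotSpot pool name, so an assumption on other JVMs). The sketch below takes two samples and applies a deliberately crude, hypothetical leak heuristic: many classes loaded, almost none unloaded. Per-loader counts are not available through this API.

```java
import java.lang.management.ClassLoadingMXBean;
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryUsage;

/** Sketch: sample class-loading counters and Metaspace usage to spot steady growth. */
final class MetaspaceTrend {
    static final class Sample {
        final int loaded;
        final long unloaded;
        final long metaspaceUsed;  // -1 when no pool named "Metaspace" exists
        Sample(int loaded, long unloaded, long metaspaceUsed) {
            this.loaded = loaded;
            this.unloaded = unloaded;
            this.metaspaceUsed = metaspaceUsed;
        }
    }

    static Sample sample() {
        ClassLoadingMXBean cl = ManagementFactory.getClassLoadingMXBean();
        long metaspace = -1;
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            MemoryUsage u = pool.getUsage();
            if (pool.getName().equals("Metaspace") && u != null) {
                metaspace = u.getUsed();
            }
        }
        return new Sample(cl.getLoadedClassCount(), cl.getUnloadedClassCount(), metaspace);
    }

    /** Hypothetical heuristic: loaded classes grew by minGrowth while under 10% as many were unloaded. */
    static boolean looksLikeLeak(Sample before, Sample after, int minGrowth) {
        int grown = after.loaded - before.loaded;
        long unloaded = after.unloaded - before.unloaded;
        return grown >= minGrowth && unloaded < grown / 10;
    }

    public static void main(String[] args) {
        Sample s = sample();
        System.out.printf("loaded=%d unloaded=%d metaspaceUsed=%d%n",
            s.loaded, s.unloaded, s.metaspaceUsed);
    }
}
```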
Traditional toolchain
$ jcmd 12345 VM.metaspace
Total Usage - 9 loaders, 84,221 classes: ...
Used: 412.3 MB Committed: 418.0 MB Reserved: 1.1 GB
$ jcmd 12345 GC.class_histogram | head -20
num #instances #bytes class name
1: 482,114 38,569,120 $Proxy188
2: 312,002 24,960,160 $Proxy42
... (the histogram does not say which ClassLoader is the offender; jcmd 12345 VM.classloader_stats lists per-loader class counts)
# To get the GC-root path you must dump the heap and load it in MAT
$ jcmd 12345 GC.heap_dump /tmp/heap.hprof
# Open in Eclipse MAT, run Leak Suspects ... 20 min later ...
Argus
$ argus metaspace 12345
╭─ argus metaspace ───────────────────────────────────────────╮
│ Metaspace pid:12345 source:jdk │
│ │
│ Used ████████████████████ 412.3 MB / 418.0 MB (99%) │
│ Reserved: 1.1 GB │
│ │
│ Space Used Committed Reserved │
│ ────────────────────────────────────────────────────── │
│ Metaspace 318 MB 320 MB 1.0 GB │
│ ClassSpace 94 MB 98 MB 128 MB │
│ │
│ ⚠ Metaspace usage above 90% — risk of OOM:Metaspace. │
│ │
╰──────────────────────────────────────────────────────────────╯
$ argus classloader 12345 --top 5
ClassLoader Classes Δ since boot
─────────────────────────────────────────────────────────────────────
AppClassLoader 24,118 +18
sun.reflect.DelegatingClassLoader 482,114 +480,002
org.springframework.cglib.core.ReflectUtils 12,402 +12,400
jdk.internal.loader.ClassLoaders$AppClass... 1,402 +0
Verdict.
| Metric | Traditional | Argus |
|---|---|---|
| Metaspace usage trend | ✅ VM.metaspace | ✅ progress bar + warning |
| Loaded class growth | ⚠️ GC.class_histogram only | ✅ argus classloader |
| Offending ClassLoader | ⚠️ VM.classloader_stats for counts; heap dump + MAT (20 min) for the GC root | ✅ top-N by class count |
7. Fatal OS Swapping
Symptom. GC pauses that used to be 30–50 ms balloon to tens of seconds. The reason is not the collector — the JVM RSS exceeds physical RAM, the kernel pages out heap regions, and every GC mark phase has to fault them back in. Garbage-collecting paged-out memory is catastrophic.
Metrics required to diagnose:
- Host swap-in / swap-out rates (vmstat si/so)
- GC pause User= vs Sys= ratio (high Sys = page faults)
- JVM GC overhead trend
Traditional toolchain
$ vmstat 1 5
procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
r b swpd free buff cache si so bi bo in cs us sy id wa
4 0 812224 31200 4012 142000 8412 9120 9408 9412 3211 4022 18 22 38 22
5 0 819440 29040 4012 141880 9120 8804 9120 8800 3402 4188 17 25 36 22
# si/so > 0 sustained = the JVM is paging
$ grep -i 'user=' gc.log | tail -3
[Times: user=0.42 sys=3.81, real=4.18 secs]
# sys time eclipses user time → kernel is dominating, the collector is waiting on I/O
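The user/sys comparison is easy to automate across a whole GC log. A minimal sketch, assuming the CPU-time fields appear either in the JDK 8 style shown above or in the unified-logging style (User=0.42s Sys=3.81s Real=4.18s); the class name and the "sys exceeds user" rule are illustrative, not a definitive detector:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Sketch: extract user/sys/real times from a GC log line and flag kernel-dominated pauses. */
final class GcCpuTimes {
    // Accepts "User=0.42s Sys=3.81s Real=4.18s" and "[Times: user=0.42 sys=3.81, real=4.18 secs]".
    private static final Pattern TIMES = Pattern.compile(
        "user=([0-9.]+)s?,?\\s+sys=([0-9.]+)s?,?\\s+real=([0-9.]+)",
        Pattern.CASE_INSENSITIVE);

    /** Returns {user, sys, real} in seconds, or null if the line carries no CPU times. */
    static double[] parse(String line) {
        Matcher m = TIMES.matcher(line);
        if (!m.find()) return null;
        return new double[] {
            Double.parseDouble(m.group(1)),
            Double.parseDouble(m.group(2)),
            Double.parseDouble(m.group(3))
        };
    }

    /** Heuristic from the text: kernel time exceeding user time points at paging. */
    static boolean kernelDominated(double[] times) {
        return times != null && times[1] > times[0];
    }

    public static void main(String[] args) {
        double[] t = parse("[Times: user=0.42 sys=3.81, real=4.18 secs]");
        System.out.println("kernel dominated: " + kernelDominated(t));
    }
}
```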
Argus
$ argus doctor 12345
✘ CRITICAL: GC overhead 38.2% (threshold 10%)
Mean pause 4,180 ms; last cause Allocation Failure
→ Check host memory pressure: vmstat 1 (look at si/so)
→ If swapping, the fix is at the OS layer, not the JVM
Verdict.
| Metric | Traditional | Argus |
|---|---|---|
| GC pause anomaly | ✅ GC log review | ✅ doctor flags overhead |
| vmstat si/so | ✅ vmstat 1 | ❌ Argus does not read OS counters |
| User vs Sys time | ✅ GC log parsing | ❌ not exposed |
Honest gap: Argus shouts "your GC overhead is wrong" within seconds, but the proof that the cause is swap-out lives in vmstat and the GC log timing fields. Argus and the host shell are complementary here.
8. Finalizer Queue Backpressure
Symptom. An older third-party library uses finalize() for stream cleanup. Allocation rate is fine but heap usage climbs anyway, because objects scheduled for finalization sit in the java.lang.ref.Finalizer queue waiting for the single Finalizer thread to drain them.
Metrics required to diagnose:
- Pending finalization count
- Finalizer thread state (should be WAITING; if RUNNABLE constantly, it cannot keep up)
- Reference-processing phase time in GC log
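The first two metrics are also readable in-process through standard APIs: MemoryMXBean.getObjectPendingFinalizationCount() returns an approximate pending count, and the thread named "Finalizer" (HotSpot's name, so an assumption elsewhere) can be looked up for its state. A minimal sketch with a hypothetical class name:

```java
import java.lang.management.ManagementFactory;

/** Sketch: read the finalizer backlog and the Finalizer thread state from inside the JVM. */
final class FinalizerBacklog {
    /** Approximate number of objects whose finalization is still pending. */
    static int pending() {
        return ManagementFactory.getMemoryMXBean().getObjectPendingFinalizationCount();
    }

    /** State of the thread named "Finalizer", or null if no such thread is running. */
    static Thread.State finalizerThreadState() {
        for (Thread t : Thread.getAllStackTraces().keySet()) {
            if (t.getName().equals("Finalizer")) {
                return t.getState();
            }
        }
        return null;
    }

    public static void main(String[] args) {
        System.out.println("pending=" + pending() + " state=" + finalizerThreadState());
    }
}
```

Sampling both values periodically shows whether the backlog is draining or only growing.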
Traditional toolchain
$ jcmd 12345 GC.finalizer_info
# JDK 9+: per-class counts of objects pending finalization (older JDKs: jmap -finalizerinfo).
# To see what those objects retain, you still dump the heap.
$ jcmd 12345 GC.heap_dump /tmp/heap.hprof
$ # Open in MAT, navigate to java.lang.ref.Finalizer.queue, count nodes
$ grep 'Reference Processing' gc.log | tail -5
[gc,ref] GC(721) Reference Processing 284.318ms
# 284 ms in reference processing alone is the smoking gun
Argus
$ argus finalizer 12345
╭─ argus finalizer ───────────────────────────────────────────╮
│ Finalizer queue pid:12345 source:jdk │
│ │
│ ⚠ 4,812 objects pending finalization │
│ │
│ Finalizer thread state RUNNABLE │
│ │
│ ⚠ Pending count above 100 — finalizers cannot keep up. │
│ Replace finalize() with try-with-resources / Cleaner. │
│ │
╰──────────────────────────────────────────────────────────────╯
Verdict.
| Metric | Traditional | Argus |
|---|---|---|
| Pending finalizer count | ⚠️ jcmd GC.finalizer_info (JDK 9+), else heap dump + MAT | ✅ shown inline |
| Finalizer thread state | ⚠️ jstack grep | ✅ shown inline |
| Reference-processing time | ✅ GC log | ⚠️ visible via argus gc |
9. FD Exhaustion + CLOSE_WAIT
Symptom. The application starts throwing java.net.SocketException: Too many open files. The HTTP client never calls close() on its response bodies; sockets pile up in CLOSE_WAIT on the kernel side, until the process file-descriptor ceiling is hit and every new connection fails.
Metrics required to diagnose:
- Process FD count vs limit
- TCP socket state distribution: count of CLOSE_WAIT entries
- Threads stuck in HTTP read on the application side
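The first metric is the one piece of this scenario a JVM can report about itself. As a sketch, on Unix-like platforms the JDK-specific com.sun.management.UnixOperatingSystemMXBean exposes the open and maximum descriptor counts; the class name and the 95% alert threshold below are hypothetical. Socket states such as CLOSE_WAIT are not visible this way and still need ss or netstat.

```java
import com.sun.management.UnixOperatingSystemMXBean;
import java.lang.management.ManagementFactory;
import java.lang.management.OperatingSystemMXBean;

/** Sketch: compare the process's open file descriptors against its limit. */
final class FdHeadroom {
    /** Returns {open, max}, or null on platforms without the Unix-specific bean. */
    static long[] openAndMax() {
        OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();
        if (!(os instanceof UnixOperatingSystemMXBean)) {
            return null;
        }
        UnixOperatingSystemMXBean unix = (UnixOperatingSystemMXBean) os;
        return new long[] {unix.getOpenFileDescriptorCount(), unix.getMaxFileDescriptorCount()};
    }

    /** True when open/max is at or above the given fraction, e.g. 0.95. */
    static boolean nearLimit(long open, long max, double threshold) {
        return max > 0 && (double) open / max >= threshold;
    }

    public static void main(String[] args) {
        long[] fds = openAndMax();
        if (fds == null) {
            System.out.println("descriptor counts not available on this platform");
        } else {
            System.out.printf("open=%d max=%d near limit=%b%n",
                fds[0], fds[1], nearLimit(fds[0], fds[1], 0.95));
        }
    }
}
```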
Traditional toolchain
$ lsof -p 12345 | wc -l
65411
$ cat /proc/12345/limits | grep 'open files'
Max open files 65536 65536 files
$ ss -tnp | awk '{print $1}' | sort | uniq -c
1 State
18 ESTAB
8412 CLOSE-WAIT
# CLOSE_WAIT dominates — the application is not closing its half of each socket
Argus
# Argus does not have an lsof equivalent. The application-side hint is:
$ argus threads 12345
RUNNABLE ░░░░░░░░░░░░░░░░ 4 ( 2%)
WAITING ███████████████░ 192 ( 96%) ← almost all HTTP-read waits
$ argus pool 12345
http-client-pool- 180 WAIT:180
Verdict.
| Metric | Traditional | Argus |
|---|---|---|
| Open FD count | ✅ lsof / /proc | ❌ out of scope |
| TCP socket state | ✅ ss / netstat | ❌ out of scope |
| Application-side stuck threads | ⚠️ jstack grep | ✅ argus threads + pool |
Honest gap: this scenario lives below the JVM. Argus can confirm the application threads are stuck on socket reads (consistent with a leak), but the FD count and the kernel-side socket states require lsof / ss on the host.
10. Invisible Implicit Deadlock — Lock Contention
Symptom. Throughput collapses, threads are not deadlocked in the JMX sense (there is no cycle), yet ninety percent of workers sit BLOCKED. The cause is a single synchronized cache map that serialises every read.
Metrics required to diagnose:
- BLOCKED thread count over time
- Lock-acquire wait hotspots — which monitor and from which call site
- Pool concentration of the blocked threads
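The second metric, which monitor the BLOCKED threads are queued on, can be sketched with ThreadMXBean. The hypothetical helper below groups BLOCKED threads by lock name; note that ThreadInfo reports the lock as class name plus identity hash, not the heap address that jstack prints. The demo method is only a self-check that manufactures contention.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.util.HashMap;
import java.util.Map;

/** Sketch: group BLOCKED threads by the monitor they are waiting to enter. */
final class BlockedOnMonitor {
    static Map<String, Integer> blockedByLock() {
        Map<String, Integer> byLock = new HashMap<>();
        for (ThreadInfo info : ManagementFactory.getThreadMXBean().dumpAllThreads(false, false)) {
            if (info == null || info.getThreadState() != Thread.State.BLOCKED) continue;
            String lock = info.getLockName();  // class name + "@" + identity hash
            if (lock != null) {
                byLock.merge(lock, 1, Integer::sum);
            }
        }
        return byLock;
    }

    /** Self-check: hold a monitor while `waiters` threads queue on it; returns the largest group seen. */
    static int demo(int waiters) {
        final Object monitor = new Object();
        Thread[] threads = new Thread[waiters];
        int largest = 0;
        try {
            synchronized (monitor) {
                for (int i = 0; i < waiters; i++) {
                    threads[i] = new Thread(() -> {
                        synchronized (monitor) { /* enter and leave */ }
                    });
                    threads[i].start();
                }
                long deadline = System.nanoTime() + 5_000_000_000L;
                while (largest < waiters && System.nanoTime() < deadline) {
                    for (int n : blockedByLock().values()) {
                        largest = Math.max(largest, n);
                    }
                    Thread.sleep(10);
                }
            }
            for (Thread t : threads) {
                t.join();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return largest;
    }

    public static void main(String[] args) {
        System.out.println("largest blocked group: " + demo(3));
    }
}
```

One lock name with a count close to the size of the worker pool is the contention signature described in the symptom.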
Traditional toolchain
$ for i in 1 2 3; do jstack 12345 > /tmp/d$i.txt; sleep 5; done
$ grep -E '^"|waiting to lock' /tmp/d2.txt | grep -B1 'waiting to lock <0x000000076ab1b9d0>' | head
"http-nio-8080-exec-42"
- waiting to lock <0x000000076ab1b9d0> (a java.util.HashMap)
"http-nio-8080-exec-43"
- waiting to lock <0x000000076ab1b9d0> (a java.util.HashMap)
# ... 178 more threads blocked on the same monitor
Argus
$ argus threads 12345
BLOCKED ███████████████░ 178 ( 72%) ⚠
$ argus profile 12345 --event lock --duration 20
╭─ argus profile --event lock ────────────────────────────────╮
│ Lock contention profile pid:12345 duration:20s │
│ │
│ Top contended monitors │
│ 1. com.acme.cache.HotCache.get 68.4% │
│ java.util.HashMap@0x76ab1b9d0 │
│ 2. java.util.concurrent.ConcurrentHashMap 8.1% │
│ 3. ch.qos.logback.core.OutputStreamAppender 3.2% │
│ │
╰──────────────────────────────────────────────────────────────╯
Verdict.
| Metric | Traditional | Argus |
|---|---|---|
| BLOCKED count | ⚠️ jstack × N + count | ✅ inline bar + warning |
| Contended monitor | ⚠️ manual aggregation across dumps | ✅ profile --event lock |
| Owning call site | ✅ once you grep the holder | ✅ top-N call sites |
11. False Sharing in the Multi-Core Era
Symptom. Independent counters live in adjacent fields of the same object. They land in the same 64-byte CPU cache line. Two cores end up invalidating each other's line on every write. Ops/sec collapses well below what the CPU should deliver, and no amount of pure-Java profiling tells you why.
Metrics required to diagnose:
- L1/LLC cache-miss rate
- Instructions per cycle (IPC)
- Hot method that correlates with the cache-miss events
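For reference, this is roughly what the padding fix looks like. It is a sketch under assumptions: 64-byte cache lines, and the class-hierarchy padding trick (used so the JVM cannot lay the padding fields out somewhere other than around the hot field). All names are hypothetical, and the exact field layout is ultimately up to the JVM.

```java
/** Sketch: two independent counters kept on separate cache lines by manual padding. */
final class PaddedCounters {
    // Padding is split across a class hierarchy so it stays on both sides of `value`.
    static class LeftPad { long p1, p2, p3, p4, p5, p6, p7; }
    static class Value extends LeftPad { volatile long value; }
    static final class PaddedLong extends Value { long q1, q2, q3, q4, q5, q6, q7; }

    final PaddedLong hits = new PaddedLong();
    final PaddedLong misses = new PaddedLong();

    /** Each thread is the only writer of its own counter, so value++ is safe here. */
    static long[] run(int iterations) {
        PaddedCounters c = new PaddedCounters();
        Thread a = new Thread(() -> {
            for (int i = 0; i < iterations; i++) c.hits.value++;
        });
        Thread b = new Thread(() -> {
            for (int i = 0; i < iterations; i++) c.misses.value++;
        });
        a.start();
        b.start();
        try {
            a.join();
            b.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return new long[] {c.hits.value, c.misses.value};
    }

    public static void main(String[] args) {
        long[] r = run(1_000_000);
        System.out.println(r[0] + " " + r[1]);
    }
}
```

Comparing throughput of this layout against two adjacent volatile long fields under the same two-thread load is the usual way to confirm that false sharing was the bottleneck.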
Traditional toolchain
$ perf stat -e cache-references,cache-misses,instructions,cycles -p 12345 sleep 10
Performance counter stats for process id '12345':
1,824,210,033 cache-misses # 42.18 % of all cache refs
4,212,902,118 instructions # 0.41 insn per cycle
10,302,002,418 cycles
# 0.41 IPC and 42% miss rate is a giant red flag. Now correlate with hot methods:
$ perf record -e cache-misses -p 12345 sleep 10
$ perf report --no-children | head -10
38.42% com.acme.metrics.Counters.recordHit
12.18% com.acme.metrics.Counters.recordMiss
Argus
$ argus profile 12345 --event cache-misses --duration 10
╭─ argus profile --event cache-misses ────────────────────────╮
│ Hardware profile (Linux PMU) pid:12345 duration:10s │
│ │
│ Top cache-miss methods │
│ 1. com.acme.metrics.Counters.recordHit 38.4% │
│ 2. com.acme.metrics.Counters.recordMiss 12.2% │
│ 3. java.util.concurrent.atomic.LongAdder 4.1% │
│ │
╰──────────────────────────────────────────────────────────────╯
Verdict.
| Metric | Traditional | Argus |
|---|---|---|
| Cache-miss rate | ✅ perf stat | ⚠️ method-level only (no global rate) |
| IPC | ✅ perf stat | ❌ not surfaced |
| Hot method ↔ misses | ✅ perf record/report | ✅ argus profile --event cache-misses |
Honest gap: Argus's cache-misses event surfaces the hot method on Linux, but interpreting it as false sharing still requires knowing that adjacent fields share a cache line. The fix is padding or @Contended (which needs -XX:-RestrictContended outside JDK classes); Argus points the finger but does not diagnose the pattern.
12. Direct Buffer Exhaustion (Off-Heap OOM)
Symptom. A high-traffic Netty service throws OutOfMemoryError: Direct buffer memory. Heap is healthy and the JVM is well below the container limit, but the direct buffer pool has hit -XX:MaxDirectMemorySize.
Metrics required to diagnose:
- Direct buffer count, total capacity, memory used
- Mapped buffer counters for comparison
- Allocation profile during the spike
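The first two metrics come from the platform BufferPoolMXBean, which can be read in-process as well as over JMX. A minimal sketch with a hypothetical class name; the pool names "direct" and "mapped" are what HotSpot registers:

```java
import java.lang.management.BufferPoolMXBean;
import java.lang.management.ManagementFactory;

/** Sketch: print NIO buffer pool counters from inside the JVM. */
final class BufferPools {
    /** Finds a platform buffer pool by name ("direct" or "mapped"), or returns null. */
    static BufferPoolMXBean find(String name) {
        for (BufferPoolMXBean pool : ManagementFactory.getPlatformMXBeans(BufferPoolMXBean.class)) {
            if (pool.getName().equals(name)) {
                return pool;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        for (BufferPoolMXBean pool : ManagementFactory.getPlatformMXBeans(BufferPoolMXBean.class)) {
            System.out.printf("%-8s count=%d capacity=%d used=%d%n",
                pool.getName(), pool.getCount(), pool.getTotalCapacity(), pool.getMemoryUsed());
        }
    }
}
```

Exporting these three numbers as application metrics gives an early warning well before the pool reaches its configured maximum.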
Traditional toolchain
# Read the BufferPool MBean by hand
$ jcmd 12345 ManagementAgent.status
# Then connect with jconsole or a custom JMX client to read
# java.nio:type=BufferPool,name=direct -> Count, MemoryUsed, TotalCapacity
# Or capture a JFR allocation profile
$ jcmd 12345 JFR.start name=direct duration=30s settings=profile filename=/tmp/direct.jfr
$ # wait 30 s: the recording stops on its own and is written to /tmp/direct.jfr
$ jmc /tmp/direct.jfr # GC > Allocation tab, filter on DirectByteBuffer
Argus
$ argus buffers 12345
╭─ argus buffers ─────────────────────────────────────────────╮
│ NIO Buffer Pools pid:12345 source:jdk │
│ │
│ Pool Count Capacity Used │
│ ────────────────────────────────────────────────────── │
│ direct 24,812 1.0 GB 1.0 GB │
│ mapped 18 412.0 MB 412.0 MB │
│ mapped - 'non-volatile' 0 0 0 │
│ │
│ Total 24,830 1.4 GB 1.4 GB │
│ │
╰──────────────────────────────────────────────────────────────╯
$ argus doctor 12345
✘ CRITICAL: Direct buffer pool at capacity (1.0 GB / 1.0 GB)
24,812 outstanding direct buffers — unreleased Netty allocations
→ Raise -XX:MaxDirectMemorySize only after ruling out a leak
→ argus profile 12345 --event alloc to see allocation sites
Verdict.
| Metric | Traditional | Argus |
|---|---|---|
| Direct buffer count / size | ⚠️ JMX MBean read | ✅ argus buffers |
| Cross-correlation with GC | ⚠️ JFR + JMC | ✅ doctor rule |
| Allocation call sites | ✅ JFR allocation profile | ✅ argus profile --event alloc |
Closing. Argus collapses roughly nine of these twelve scenarios into a single command answer (1, 4, 5, 6, 8, 10, 12, plus the JVM side of 3 and the application side of 9). The remaining cases (TTSP safepoint logs in 2, OS swap in 7, and false-sharing diagnosis in 11), together with the kernel-side halves of 3 and 9, sit at or below the JVM boundary, where Argus consciously stops. The intended workflow is to run Argus alongside the host shell, not in place of it.