
Real-World Scenarios: Argus vs Traditional Toolchains

Twelve production incidents that a senior JVM engineer recognises on sight. For each one we show the commands you would actually type today, then the single Argus command that collapses the same evidence into one screen. We are honest about the cases where Argus does not replace the host shell — the OS-boundary cases (safepoint logs, swap, file descriptors, kernel OOM events) still need vmstat, lsof, or dmesg alongside Argus.

1. G1GC Humongous Object Allocation Bomb

Symptom. Heap looks fine (6 GB committed, 40% usage), yet every few minutes the application stalls for 2–4 seconds. The culprit is multi-megabyte byte[] allocations from an Excel reader or a streaming file API. G1 does not promote these objects: anything at least half a region in size bypasses the young generation and is allocated directly in contiguous humongous regions. Fragmentation builds up, and eventually an evacuation failure triggers a multi-second STW pause.

Metrics required to diagnose:

  • Humongous allocation events — frequency, region count, source call sites
  • G1 region fragmentation and evacuation failure markers in the GC log
  • Allocation hotspots, ordered by allocated bytes per call site

Traditional toolchain

# Confirm the collector and region size first; verbose G1 logging comes after a restart
$ jcmd 12345 VM.flags | grep -E 'UseG1GC|G1HeapRegionSize'
   uintx G1HeapRegionSize                          = 4194304
   bool  UseG1GC                                    = true

$ jcmd 12345 GC.heap_info
 garbage-first heap   total 6291456K, used 2516201K [0x0000...]
  region size 4096K, 14 young (57344K), 3 survivors (12288K)
 Metaspace       used 84321K, committed 86016K, reserved 1130496K

# Restart the app with -Xlog:gc*,gc+humongous=trace and re-grep
$ grep -E 'Humongous|to-space exhausted' gc.log | tail -20
[3214.876s][info ][gc,humongous] GC(412) Humongous region: 1 -> 1
[3214.880s][info ][gc           ] GC(412) Pause Young (Concurrent Start) (G1 Humongous Allocation)
[3401.122s][warn ][gc           ] GC(488) To-space exhausted

# Get the call sites
$ asprof -e alloc -d 30 -f /tmp/alloc.html 12345   # async-profiler 3.x launcher; 2.x ships profiler.sh
$ open /tmp/alloc.html   # click into HumongousObjAllocator path

Argus

$ argus doctor 12345
╭─ argus doctor ──────────────────────────────────────────────╮
│ JVM Health Report     pid:12345  HotSpot  uptime:1h 14m     │
│                                                              │
│   Heap: 2.4 GB/6.0 GB (40%)  |  CPU: 22%  |  Threads: 184    │
│   GC: 14.6% overhead                                         │
│ ──────────────────────────────────────────────────────────── │
│                                                              │
│   1 critical  1 warning                                      │
│                                                              │
│   ✘ CRITICAL: G1 humongous allocations dominating pauses     │
│     14 humongous events in last 5 min; max pause 2,840 ms    │
│     → Reduce object size below G1HeapRegionSize/2 (2 MB)     │
│     → Or raise -XX:G1HeapRegionSize=8m                       │
│                                                              │
│   ⚠ WARNING: GC overhead 14.6% (threshold 10%)               │
│     → argus profile 12345 --event alloc --duration 30        │
│                                                              │
╰─ ✘ critical ────────────────────────────────────────────────╯

$ argus profile 12345 --event alloc --duration 30
Top allocation sites (n=18,204 samples, 30s window)
  1. com.acme.report.XlsxReader.readSheet            41.8%  byte[]
  2. java.util.zip.Inflater.inflateBytes             17.2%  byte[]
  3. com.acme.report.XlsxReader.rowBuffer             9.6%  byte[]

Verdict.

Metric                 | Traditional                        | Argus
Humongous event count  | ⚠️ requires -Xlog:gc* + restart    | ✅ surfaced by doctor
Allocation hotspots    | ⚠️ async-profiler + HTML viewer    | argus profile --event alloc
Region size + flag fix | jcmd VM.flags                      | ✅ included as recommendation

One Argus run replaces a restart-for-logs cycle plus a separate profiler session.
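
The doctor recommendation rests on one piece of arithmetic: G1 treats an object as humongous once it reaches about half a region. The following is a minimal sketch of that check, not part of Argus. The class and method names are hypothetical, the 16-byte array header is an assumption that holds for typical 64-bit HotSpot builds, and the comparison is approximate at the exact boundary.

```java
/** Hypothetical helper: estimates whether an allocation would be humongous under G1. */
final class HumongousCheck {
    // Assumed array header size on a typical 64-bit HotSpot JVM.
    static final long ARRAY_HEADER_BYTES = 16;

    /** An object of at least half a region is treated as humongous (approximate at the boundary). */
    static boolean isHumongous(long objectBytes, long regionBytes) {
        return objectBytes >= regionBytes / 2;
    }

    /** Largest byte[] length that stays below the threshold for the given region size. */
    static long maxRegularByteArrayLength(long regionBytes) {
        return regionBytes / 2 - ARRAY_HEADER_BYTES - 1;
    }

    public static void main(String[] args) {
        long region = 4L * 1024 * 1024;              // G1HeapRegionSize = 4194304 from the jcmd output
        long bufferBytes = 3L * 1024 * 1024 + ARRAY_HEADER_BYTES;
        System.out.println("3 MB buffer humongous? " + isHumongous(bufferBytes, region));
        System.out.println("max regular byte[] length: " + maxRegularByteArrayLength(region));
    }
}
```

Reading the sheet in buffers no larger than the computed length keeps every allocation in regular regions; raising the region size moves the threshold instead.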

2. Time To Safepoint (TTSP) Black Hole

Symptom. The GC log says Pause Young: 11 ms but the application reports stop-the-world freezes of several seconds. The culprit is TTSP: a counted loop scanning a giant array does not poll for safepoints, so every other thread sits idle waiting for the laggard to check in.

Metrics required to diagnose:

  • Safepoint entry latency — time-to-reach-safepoint per VM operation
  • Wall-clock samples of application threads during the stall
  • JIT-compiled methods using counted loops on large arrays

Traditional toolchain

# Add the safepoint logging flag and restart
$ java -Xlog:safepoint*=info:file=safepoint.log:time,uptime,level ...

$ grep 'Total time for which application threads were stopped' safepoint.log
[uptime=621.142s] Total time for which application threads were stopped: 4.8214s
                  Stopping threads took: 4.8101s
# 4.8 of 4.82 seconds were spent reaching safepoint, not in the GC itself

$ asprof -e wall -d 10 -f /tmp/wall.html 12345   # async-profiler 3.x launcher; 2.x ships profiler.sh
# Open in flame-graph viewer, look for hot frames during the stall window

Argus

$ argus profile 12345 --event wall --duration 10
╭─ argus profile --event wall ────────────────────────────────╮
│ Wall-clock profile   pid:12345  duration:10s  samples:9,873  │
│                                                              │
│   Top wall-time methods                                      │
│     1. com.acme.search.MatrixScan.find          58.4%        │
│     2. java.util.HashMap.getNode                 6.1%        │
│     3. sun.nio.ch.EPollSelectorImpl.doSelect     5.8%        │
│                                                              │
│   ⚠ A single method dominates wall time — review for         │
│     counted loops without safepoint polls (TTSP risk).       │
│                                                              │
╰──────────────────────────────────────────────────────────────╯

Verdict.

Metric                    | Traditional                       | Argus
Safepoint entry latency   | -Xlog:safepoint                   | ❌ Argus does not parse safepoint logs
Wall-clock hotspots       | ⚠️ async-profiler + HTML viewer   | argus profile --event wall
Root cause identification | ✅ once both signals correlated   | ⚠️ partial: points at the suspect method

Honest gap: Argus shows the dominant wall-time method, which is usually the offending counted loop, but you still want -Xlog:safepoint in production to prove the stop-the-world time is in safepoint entry, not in the collector.
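
Once the suspect method is confirmed, one common mitigation is to split the scan into bounded chunks so that no single counted loop runs for long. The sketch below is illustrative only: the class is hypothetical, and whether the JIT emits a safepoint poll between chunks depends on the JDK version and flags (recent JDKs also offer -XX:+UseCountedLoopSafepoints, which may make the rewrite unnecessary).

```java
/** Hypothetical rewrite of a long array scan into bounded chunks. */
final class ChunkedScan {
    /** Plain int-indexed loop: the shape C2 can compile as a counted loop. */
    static long sumCounted(int[] data) {
        long sum = 0;
        for (int i = 0; i < data.length; i++) {
            sum += data[i];
        }
        return sum;
    }

    /** Same result, but each inner loop covers at most chunkSize elements. */
    static long sumChunked(int[] data, int chunkSize) {
        if (chunkSize <= 0) {
            throw new IllegalArgumentException("chunkSize must be positive");
        }
        long sum = 0;
        int start = 0;
        while (start < data.length) {
            // Subtract before comparing so the bound cannot overflow int.
            int len = Math.min(chunkSize, data.length - start);
            for (int i = start; i < start + len; i++) {
                sum += data[i];
            }
            start += len;
        }
        return sum;
    }
}
```

The chunk size is a tuning knob: smaller chunks shorten the worst-case wait to reach a safepoint at the cost of a little loop overhead.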

3. Linux OOMKilled — The Silent Assassin

Symptom. Heap stays at 50%, no OutOfMemoryError is thrown, yet the pod is killed and restarted. The kernel OOM-killer fired because resident memory blew past the container limit — usually Netty DirectByteBuffer, gRPC native channels, or jemalloc fragmentation, all of which live outside the heap.

Metrics required to diagnose:

  • Native memory committed by category (Class, Thread, Code, GC, Internal, …)
  • Container working-set vs limit (cgroup memory.max)
  • Kernel kill event in dmesg / journalctl

Traditional toolchain

# Confirm the kernel killed it
$ dmesg | grep -i 'killed process'
[7421.882] Out of memory: Killed process 12345 (java) total-vm:9120384kB,
           anon-rss:7842112kB, file-rss:0kB, shmem-rss:0kB

# Re-enable NMT in the manifest and capture a baseline on next start
$ jcmd 12345 VM.native_memory baseline
$ # ... 30 min later ...
$ jcmd 12345 VM.native_memory summary.diff
Native Memory Tracking:
Total: reserved=9628700KB +384200KB, committed=8420112KB +402112KB
-                 Class (reserved=180224KB, committed=178892KB)
-                Thread (reserved=520304KB +12000KB, committed=520304KB +12000KB)
-              Internal (reserved=445188KB +321000KB, committed=445188KB +321000KB)
... scan all 18 categories manually ...

Argus

$ argus nmt 12345 --save baseline.json
Saved NMT baseline to: /home/ops/baseline.json

# ... 30 minutes of traffic ...

$ argus nmt 12345 --diff baseline.json
╭─ argus nmt --diff ──────────────────────────────────────────╮
│ NMT diff   pid:12345  source:jdk  elapsed:31m 04s            │
│                                                              │
│   Since 2026-05-14 09:12:01: committed +392.7 MB, reserved   │
│   +375.2 MB                                                  │
│                                                              │
│   Category          Reserved Δ   Committed Δ   Reserved now  │
│   ──────────────────────────────────────────────────────     │
│   Internal              +321 MB       +321 MB       445 MB   │
│   Thread                 +12 MB        +12 MB       520 MB   │
│   GC                      +9 MB         +9 MB       142 MB   │
│   Class                   +6 MB         +5 MB       178 MB   │
│                                                              │
│   ✘ Internal (DirectByteBuffer + JNI scratch) is the         │
│     dominant grower — usually Netty/gRPC native pools.       │
│                                                              │
╰──────────────────────────────────────────────────────────────╯

Verdict.

Metric                | Traditional                       | Argus
NMT baseline + diff   | ⚠️ manual scan of 18 categories   | ✅ banner + sorted growers
Container working-set | kubectl top / cAdvisor            | ❌ out of scope
Kernel kill event     | dmesg                             | ❌ out of scope, still need shell

Honest gap: Argus identifies the JVM-side grower in seconds; dmesg and container metrics are still required to confirm the kernel killed the process.
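
For teams without Argus, the manual scan of summary.diff can be scripted. The sketch below parses category lines in the format shown in the jcmd transcript and sorts them by committed growth. It is not how Argus is implemented; the class name and the regular expression are assumptions based on that sample output.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical parser for the category lines of `jcmd <pid> VM.native_memory summary.diff`. */
final class NmtGrowers {
    // Matches e.g. "-  Thread (reserved=520304KB +12000KB, committed=520304KB +12000KB)".
    private static final Pattern LINE = Pattern.compile(
        "^-\\s*(.+?) \\(reserved=(\\d+)KB(?: ([+-]\\d+)KB)?, committed=(\\d+)KB(?: ([+-]\\d+)KB)?\\)");

    static final class Growth {
        final String category;
        final long committedDeltaKb;

        Growth(String category, long committedDeltaKb) {
            this.category = category;
            this.committedDeltaKb = committedDeltaKb;
        }
    }

    /** Returns one entry per category line, largest committed growth first. */
    static List<Growth> parse(List<String> lines) {
        List<Growth> result = new ArrayList<>();
        for (String line : lines) {
            Matcher m = LINE.matcher(line.trim());
            if (m.find()) {
                // A missing delta means the category did not change since the baseline.
                long delta = m.group(5) == null ? 0L : Long.parseLong(m.group(5));
                result.add(new Growth(m.group(1), delta));
            }
        }
        result.sort(Comparator.comparingLong((Growth g) -> g.committedDeltaKb).reversed());
        return result;
    }
}
```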

4. JIT Deoptimization Storm

Symptom. CPU usage spikes and throughput collapses without any code change. Heavy use of runtime-generated lambdas and reflection fills the CodeCache; the JIT marks hot methods made-not-entrant, the JVM falls back to the interpreter, and the cycle repeats.

Metrics required to diagnose:

  • CodeCache used vs total, max-used watermark
  • nmethod count, deoptimization counter trend
  • Compiler queue depth

Traditional toolchain

$ jcmd 12345 Compiler.codecache
CodeHeap 'non-profiled nmethods': size=120000Kb  used=119742Kb  max_used=119988Kb  free=258Kb
CodeHeap 'profiled nmethods':     size=120000Kb  used=119410Kb  max_used=119410Kb  free=590Kb
CodeHeap 'non-nmethods':           size=5760Kb   used=4012Kb    max_used=4040Kb   free=1748Kb

# Restart with -XX:+PrintCompilation to see made-not-entrant frequency
$ grep 'made not entrant' compilation.log | wc -l
21847

Argus

$ argus compiler 12345
╭─ argus compiler ────────────────────────────────────────────╮
│ JIT Compiler   pid:12345  source:jdk                         │
│                                                              │
│   ✔ Compilation enabled                                      │
│                                                              │
│   Code cache  ████████████████████  239.4 MB / 245.7 MB (97%)│
│     Max used: 239.4 MB     Free: 6.3 MB                      │
│                                                              │
│   Total blobs: 28,412    nmethods: 24,108    adapters: 1,820 │
│   Compiler queue: 312                                        │
│   Deoptimizations: 21,847                                    │
│                                                              │
│   ⚠ Code cache > 80% full — JIT compiler may stop. Increase  │
│     -XX:ReservedCodeCacheSize.                               │
│                                                              │
╰──────────────────────────────────────────────────────────────╯

$ argus doctor 12345
   ✘ CRITICAL: CodeCache near exhaustion (97%)
     21,847 deoptimizations since boot; compiler queue 312
     → -XX:ReservedCodeCacheSize=512m
     → -XX:-UseCodeCacheFlushing (only after raising the size)

Verdict.

Metric               | Traditional                            | Argus
CodeCache used / max | Compiler.codecache                     | ✅ progress bar + watermark
Deoptimization count | ⚠️ requires PrintCompilation restart   | ✅ surfaced live
Suggested fix        | ⚠️ requires JVM-tuning knowledge       | doctor emits the flag
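
The used-versus-max figure can also be read from inside the process through the standard memory pool MXBeans, which is useful for exporting it as a metric. This is a minimal sketch under stated assumptions: the class is hypothetical, and the pool-name matching ("CodeHeap ..." for a segmented cache, "CodeCache" or "Code Cache" otherwise) reflects HotSpot naming and may differ on other JVMs.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryUsage;

/** Hypothetical in-process gauge for code cache occupancy. */
final class CodeCacheGauge {
    /** Returns {usedBytes, maxBytes} summed over pools that look like code cache segments. */
    static long[] usedAndMax() {
        long used = 0;
        long max = 0;
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            String name = pool.getName();
            // Assumed HotSpot pool names; adjust for other JVMs.
            boolean isCodePool = name.startsWith("CodeHeap")
                || name.replace(" ", "").equalsIgnoreCase("CodeCache");
            if (!isCodePool) {
                continue;
            }
            MemoryUsage usage = pool.getUsage();
            if (usage == null) {
                continue; // pool is no longer valid
            }
            used += usage.getUsed();
            if (usage.getMax() > 0) {
                max += usage.getMax(); // getMax() is -1 when undefined
            }
        }
        return new long[] {used, max};
    }

    public static void main(String[] args) {
        long[] um = usedAndMax();
        System.out.println("code cache used=" + um[0] + " max=" + um[1]);
    }
}
```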

5. Phantom Thread Swamp

Symptom. A payment-gateway client was deployed without a read timeout. Tomcat worker threads pile up in socketRead0, the pool saturates, and every other endpoint starts returning 503. Load average is low; the JVM looks idle.

Metrics required to diagnose:

  • Thread state distribution (RUNNABLE vs WAITING vs BLOCKED)
  • Frequency of socketRead0 in stacks
  • Per-pool concentration of stuck threads

Traditional toolchain

$ jstack 12345 > /tmp/dump1.txt
$ sleep 5
$ jstack 12345 > /tmp/dump2.txt
$ sleep 5
$ jstack 12345 > /tmp/dump3.txt

$ grep -c 'socketRead0' /tmp/dump1.txt /tmp/dump2.txt /tmp/dump3.txt
/tmp/dump1.txt:182
/tmp/dump2.txt:184
/tmp/dump3.txt:183

# Confirm they are Tomcat workers
$ grep -B1 'socketRead0' /tmp/dump2.txt | grep '"http-nio'
"http-nio-8080-exec-12"  ...
"http-nio-8080-exec-13"  ...
... (183 occurrences) ...

Argus

$ argus threads 12345
╭─ argus threads ─────────────────────────────────────────────╮
│ Thread Dump   pid:12345  source:jdk                          │
│                                                              │
│ Total: 248    Virtual: 0    Platform: 248    Peak: 252       │
│                                                              │
│ RUNNABLE      ████░░░░░░░░░░░░     24  ( 10%)                │
│ WAITING       █████████████░░░    195  ( 79%)                │
│ TIMED_WAITING ██░░░░░░░░░░░░░░     27  ( 11%)                │
│ BLOCKED       ░░░░░░░░░░░░░░░░      2  (  1%)                │
│                                                              │
╰──────────────────────────────────────────────────────────────╯

$ argus pool 12345
  Pool                              Count  State
  ──────────────────────────────────────────────────────────
  http-nio-8080-exec-                 184    WAIT:183  RUN:1
  catalina-utility-                     8    WAIT:8
  ForkJoinPool.commonPool-              8    WAIT:8
  scheduling-                           4    WAIT:4

Verdict.

Metric             | Traditional              | Argus
State distribution | ⚠️ 3× jstack + grep      | ✅ inline bars
Stuck-thread pool  | ⚠️ manual aggregation    | argus pool
Confirm I/O wait   | ✅ stack grep            | ⚠️ requires opening one stack
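
Whichever tool finds the stuck pool, the fix is the same: give the outbound call explicit timeouts. The sketch below uses java.net.HttpURLConnection because the symptom mentions socketRead0. The class name, URL, and timeout values are hypothetical, and the real gateway client may use a different HTTP library with its own timeout settings.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.HttpURLConnection;
import java.net.URI;

/** Hypothetical factory that never hands out a connection without timeouts. */
final class GatewayConnections {
    static HttpURLConnection open(String url, int connectMillis, int readMillis) {
        try {
            // openConnection() only creates the object; no network I/O happens yet.
            HttpURLConnection conn = (HttpURLConnection) URI.create(url).toURL().openConnection();
            conn.setConnectTimeout(connectMillis); // fail fast if the gateway is unreachable
            conn.setReadTimeout(readMillis);       // bound the blocking read; 0 means wait forever
            return conn;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

With a read timeout in place, a stalled gateway surfaces as a SocketTimeoutException on the worker thread instead of holding it indefinitely.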

6. Dynamic Proxy Betrayal (Metaspace Leak)

Symptom. A long-running service slowly grows Metaspace until it hits OutOfMemoryError: Metaspace a few weeks after deploy. Usually a Spring or Hibernate proxy class is pinned by a ThreadLocal that nobody clears, so its ClassLoader cannot be unloaded.

Metrics required to diagnose:

  • Metaspace used / committed trend
  • Loaded class count growth
  • Top ClassLoaders by class count

Traditional toolchain

$ jcmd 12345 VM.metaspace
Total Usage - 9 loaders, 84,221 classes: ...
  Used: 412.3 MB  Committed: 418.0 MB  Reserved: 1.1 GB

$ jcmd 12345 GC.class_histogram | head -20
 num     #instances         #bytes  class name
   1:         482,114       38,569,120  $Proxy188
   2:         312,002       24,960,160  $Proxy42
... (no information about which ClassLoader is the offender)

# To get the GC-root path you must dump the heap and load it in MAT
$ jcmd 12345 GC.heap_dump /tmp/heap.hprof
# Open in Eclipse MAT, run Leak Suspects ... 20 min later ...

Argus

$ argus metaspace 12345
╭─ argus metaspace ───────────────────────────────────────────╮
│ Metaspace   pid:12345  source:jdk                            │
│                                                              │
│   Used  ████████████████████  412.3 MB / 418.0 MB (99%)      │
│     Reserved: 1.1 GB                                         │
│                                                              │
│   Space           Used     Committed     Reserved            │
│   ──────────────────────────────────────────────────────     │
│   Metaspace      318 MB       320 MB         1.0 GB          │
│   ClassSpace      94 MB        98 MB        128 MB           │
│                                                              │
│   ⚠ Metaspace usage above 90% — risk of OOM:Metaspace.       │
│                                                              │
╰──────────────────────────────────────────────────────────────╯

$ argus classloader 12345 --top 5
  ClassLoader                                  Classes    Δ since boot
  ─────────────────────────────────────────────────────────────────────
  AppClassLoader                                 24,118        +18
  sun.reflect.DelegatingClassLoader             482,114    +480,002
  org.springframework.cglib.core.ReflectUtils    12,402     +12,400
  jdk.internal.loader.ClassLoaders$AppClass...    1,402         +0

Verdict.

Metric                | Traditional                     | Argus
Metaspace usage trend | VM.metaspace                    | ✅ progress bar + warning
Loaded class growth   | ⚠️ GC.class_histogram only      | argus classloader
Offending ClassLoader | ❌ heap dump + MAT (20 min)     | ✅ top-N by class count
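
The pinning described in the symptom usually disappears once every ThreadLocal write is paired with a remove() in a finally block. The following is a minimal sketch of that pattern; RequestContext and its methods are hypothetical names, not part of Spring, Hibernate, or Argus.

```java
import java.util.function.Supplier;

/** Hypothetical per-request context that always clears its ThreadLocal. */
final class RequestContext {
    private static final ThreadLocal<Object> CURRENT = new ThreadLocal<>();

    /** Runs work with the value bound to the current thread, then unbinds it. */
    static <T> T with(Object value, Supplier<T> work) {
        CURRENT.set(value);
        try {
            return work.get();
        } finally {
            // Without this, a pooled thread keeps the value (and its ClassLoader) reachable.
            CURRENT.remove();
        }
    }

    static Object current() {
        return CURRENT.get();
    }
}
```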

7. Fatal OS Swapping

Symptom. GC pauses that used to be 30–50 ms balloon to tens of seconds. The reason is not the collector — the JVM RSS exceeds physical RAM, the kernel pages out heap regions, and every GC mark phase has to fault them back in. Garbage-collecting paged-out memory is catastrophic.

Metrics required to diagnose:

  • Host swap-in / swap-out rates (vmstat si/so)
  • GC pause User= vs Sys= ratio (high Sys = page faults)
  • JVM GC overhead trend

Traditional toolchain

$ vmstat 1 5
procs  -----------memory----------  ---swap--  -----io----  -system--  ----cpu----
 r  b   swpd   free   buff  cache    si   so    bi    bo    in   cs  us sy id wa
 4  0  812224  31200  4012 142000   8412 9120  9408  9412  3211 4022  18 22 38 22
 5  0  819440  29040  4012 141880   9120 8804  9120  8800  3402 4188  17 25 36 22
# si/so > 0 sustained = the JVM is paging

$ grep -i 'user=' gc.log | tail -3
[Times: user=0.42 sys=3.81, real=4.18 secs]
# sys time eclipses user time → kernel is dominating, the collector is waiting on I/O

Argus

$ argus doctor 12345
   ✘ CRITICAL: GC overhead 38.2% (threshold 10%)
     Mean pause 4,180 ms; last cause Allocation Failure
     → Check host memory pressure: vmstat 1 (look at si/so)
     → If swapping, the fix is at the OS layer, not the JVM

Verdict.

Metric           | Traditional         | Argus
GC pause anomaly | ✅ GC log review    | doctor flags overhead
vmstat si/so     | vmstat 1            | ❌ Argus does not read OS counters
User vs Sys time | ✅ GC log parsing   | ❌ not exposed

Honest gap: Argus shouts "your GC overhead is wrong" within seconds, but the proof that the cause is swap-out lives in vmstat and the GC log timing fields. Argus and the host shell are complementary here.
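
The user-versus-sys comparison from the GC log is easy to automate. The sketch below extracts both fields from a timing line and flags pauses where kernel time exceeds user time. The class is hypothetical, the pattern is an assumption covering the two common spellings (user=0.42 sys=3.81 and User=0.42s Sys=3.81s), and the threshold of 1.0 is an illustrative choice rather than an established rule.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper that reads user/sys CPU time from a GC log timing line. */
final class GcTimes {
    private static final Pattern TIMES = Pattern.compile(
        "user=(\\d+(?:\\.\\d+)?)s?[ ,]+sys=(\\d+(?:\\.\\d+)?)s?", Pattern.CASE_INSENSITIVE);

    /** Returns sys/user for the line, or -1 if the fields are missing or user time is zero. */
    static double sysToUserRatio(String line) {
        Matcher m = TIMES.matcher(line);
        if (!m.find()) {
            return -1;
        }
        double user = Double.parseDouble(m.group(1));
        double sys = Double.parseDouble(m.group(2));
        return user == 0.0 ? -1 : sys / user;
    }

    /** Assumed heuristic: more kernel time than user time suggests page faults during GC. */
    static boolean looksLikePaging(String line) {
        return sysToUserRatio(line) > 1.0;
    }
}
```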

8. Finalizer Queue Backpressure

Symptom. An older third-party library uses finalize() for stream cleanup. Allocation rate is fine but heap usage climbs anyway, because objects scheduled for finalization sit in the java.lang.ref.Finalizer queue waiting for the single Finalizer thread to drain them.

Metrics required to diagnose:

  • Pending finalization count
  • Finalizer thread state (should be WAITING; if RUNNABLE constantly, it cannot keep up)
  • Reference-processing phase time in GC log

Traditional toolchain

$ jcmd 12345 GC.heap_info | grep -i 'pending'
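# No output: GC.heap_info has no finalization field. On recent JDKs, jcmd 12345 GC.finalizer_info
# lists per-class counts; to inspect the queue itself you must dump the heap.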

$ jcmd 12345 GC.heap_dump /tmp/heap.hprof
$ # Open in MAT, navigate to java.lang.ref.Finalizer.queue, count nodes
$ grep 'Reference Processing' gc.log | tail -5
[gc,ref] GC(721) Reference Processing                      284.318ms
# 284 ms in reference processing alone is the smoking gun

Argus

$ argus finalizer 12345
╭─ argus finalizer ───────────────────────────────────────────╮
│ Finalizer queue   pid:12345  source:jdk                      │
│                                                              │
│   ⚠ 4,812 objects pending finalization                       │
│                                                              │
│   Finalizer thread state  RUNNABLE                           │
│                                                              │
│   ⚠ Pending count above 100 — finalizers cannot keep up.     │
│     Replace finalize() with try-with-resources / Cleaner.    │
│                                                              │
╰──────────────────────────────────────────────────────────────╯

Verdict.

Metric                    | Traditional           | Argus
Pending finalizer count   | ❌ heap dump + MAT    | ✅ shown inline
Finalizer thread state    | ⚠️ jstack grep        | ✅ shown inline
Reference-processing time | ✅ GC log             | ⚠️ visible via argus gc
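
The replacement suggested in the Argus warning can look like the sketch below, which combines AutoCloseable with java.lang.ref.Cleaner (JDK 9+). ManagedStream is a hypothetical class; the boolean flag stands in for releasing whatever native handle the real library owns.

```java
import java.lang.ref.Cleaner;
import java.util.concurrent.atomic.AtomicBoolean;

/** Hypothetical resource that uses Cleaner as a safety net instead of finalize(). */
final class ManagedStream implements AutoCloseable {
    private static final Cleaner CLEANER = Cleaner.create();

    /** Cleanup state is a static nested class so it never references the ManagedStream. */
    private static final class Release implements Runnable {
        final AtomicBoolean released = new AtomicBoolean(false);

        @Override
        public void run() {
            released.set(true); // a real implementation would free the native handle here
        }
    }

    private final Release release = new Release();
    private final Cleaner.Cleanable cleanable = CLEANER.register(this, release);

    /** Deterministic path: try-with-resources calls this; the action runs at most once. */
    @Override
    public void close() {
        cleanable.clean();
    }

    boolean isReleased() {
        return release.released.get();
    }
}
```

Callers should still prefer try-with-resources; the Cleaner only covers instances that were never closed.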

9. FD Exhaustion + CLOSE_WAIT

Symptom. The application starts throwing java.net.SocketException: Too many open files. The HTTP client never calls close() on its response bodies; sockets pile up in CLOSE_WAIT on the kernel side, until the process file-descriptor ceiling is hit and every new connection fails.

Metrics required to diagnose:

  • Process FD count vs limit
  • TCP socket state distribution — count of CLOSE_WAIT entries
  • Threads stuck in HTTP read on the application side

Traditional toolchain

$ lsof -p 12345 | wc -l
65411
$ cat /proc/12345/limits | grep 'open files'
Max open files            65536                65536                files

$ ss -tnp | awk '{print $1}' | sort | uniq -c
   1 State
  18 ESTAB
8412 CLOSE-WAIT
# CLOSE_WAIT dominates — the application is not closing its half of each socket

Argus

# Argus does not have an lsof equivalent. The application-side hint is:
$ argus threads 12345
 RUNNABLE       ░░░░░░░░░░░░░░░░      4  (  2%)
 WAITING        ███████████████░    192  ( 96%)   ← almost all HTTP-read waits

$ argus pool 12345
  http-client-pool-                   180    WAIT:180

Verdict.

Metric                         | Traditional        | Argus
Open FD count                  | lsof / /proc       | ❌ out of scope
TCP socket state               | ss / netstat       | ❌ out of scope
Application-side stuck threads | ⚠️ jstack grep     | argus threads + pool

Honest gap: this scenario lives below the JVM. Argus can confirm the application threads are stuck on socket reads (consistent with a leak), but the FD count and the kernel-side socket states require lsof / ss on the host.
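
On the application side, the leak itself is fixed by closing every response body, and try-with-resources makes that hard to forget. The following is a minimal sketch with a hypothetical helper; it takes a plain InputStream because the page does not say which HTTP client the application uses.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

/** Hypothetical helper that always closes a response body after reading it. */
final class Bodies {
    /** Reads the body to the end and closes it, even if reading throws. */
    static int drainAndClose(InputStream body) {
        try (InputStream in = body) {
            return in.readAllBytes().length;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Closing the stream lets the client release or reuse the underlying socket, so it no longer lingers in CLOSE_WAIT holding a file descriptor.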

10. Invisible Implicit Deadlock — Lock Contention

Symptom. Throughput collapses, threads are not deadlocked in the JMX sense (there is no cycle), yet ninety percent of workers sit BLOCKED. The cause is a single synchronized cache map that serialises every read.

Metrics required to diagnose:

  • BLOCKED thread count over time
  • Lock-acquire wait hotspots — which monitor and from which call site
  • Pool concentration of the blocked threads

Traditional toolchain

$ for i in 1 2 3; do jstack 12345 > /tmp/d$i.txt; sleep 5; done
$ grep -E '^"|waiting to lock' /tmp/d2.txt | grep -B1 'waiting to lock 0x000000076ab1b9d0' | head
"http-nio-8080-exec-42"
        - waiting to lock <0x000000076ab1b9d0> (a java.util.HashMap)
"http-nio-8080-exec-43"
        - waiting to lock <0x000000076ab1b9d0> (a java.util.HashMap)
# ... 178 more threads blocked on the same monitor

Argus

$ argus threads 12345
 BLOCKED       ███████████████░    178  ( 72%)   ⚠

$ argus profile 12345 --event lock --duration 20
╭─ argus profile --event lock ────────────────────────────────╮
│ Lock contention profile   pid:12345  duration:20s            │
│                                                              │
│   Top contended monitors                                     │
│     1. com.acme.cache.HotCache.get               68.4%       │
│        java.util.HashMap@0x76ab1b9d0                         │
│     2. java.util.concurrent.ConcurrentHashMap     8.1%       │
│     3. ch.qos.logback.core.OutputStreamAppender   3.2%       │
│                                                              │
╰──────────────────────────────────────────────────────────────╯

Verdict.

Metric            | Traditional                           | Argus
BLOCKED count     | ⚠️ jstack × N + count                 | ✅ inline bar + warning
Contended monitor | ⚠️ manual aggregation across dumps    | profile --event lock
Owning call site  | ✅ once you grep the holder           | ✅ top-N call sites
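
Once the contended monitor is identified, the usual remedy is to replace the synchronized HashMap with a concurrent map so that reads no longer queue on one lock. This is a minimal sketch with a hypothetical class name; it is not the actual com.acme.cache.HotCache, and it assumes the loader function is cheap and free of side effects.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.function.Function;

/** Hypothetical read-mostly cache without a global monitor. */
final class HotCacheSketch<K, V> {
    private final ConcurrentMap<K, V> entries = new ConcurrentHashMap<>();
    private final Function<K, V> loader;

    HotCacheSketch(Function<K, V> loader) {
        this.loader = loader;
    }

    /** Returns the cached value, loading it at most once per key. */
    V get(K key) {
        return entries.computeIfAbsent(key, loader);
    }

    int size() {
        return entries.size();
    }
}
```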

11. False Sharing in the Multi-Core Era

Symptom. Independent counters live in adjacent fields of the same object. They land in the same 64-byte CPU cache line. Two cores end up invalidating each other's line on every write. Ops/sec collapses well below what the CPU should deliver, and no amount of pure-Java profiling tells you why.

Metrics required to diagnose:

  • L1/LLC cache-miss rate
  • Instructions per cycle (IPC)
  • Hot method that correlates with the cache-miss events

Traditional toolchain

$ perf stat -e cache-misses,instructions,cycles -p 12345 sleep 10

 Performance counter stats for process id '12345':

       1,824,210,033      cache-misses                #   42.18 % of all cache refs
       4,212,902,118      instructions                #    0.41  insn per cycle
      10,302,002,418      cycles

# 0.41 IPC and 42% miss rate is a giant red flag. Now correlate with hot methods:
$ perf record -e cache-misses -p 12345 sleep 10
$ perf report --no-children | head -10
   38.42%  com.acme.metrics.Counters.recordHit
   12.18%  com.acme.metrics.Counters.recordMiss

Argus

$ argus profile 12345 --event cache-misses --duration 10
╭─ argus profile --event cache-misses ────────────────────────╮
│ Hardware profile (Linux PMU)   pid:12345  duration:10s       │
│                                                              │
│   Top cache-miss methods                                     │
│     1. com.acme.metrics.Counters.recordHit       38.4%       │
│     2. com.acme.metrics.Counters.recordMiss      12.2%       │
│     3. java.util.concurrent.atomic.LongAdder      4.1%       │
│                                                              │
╰──────────────────────────────────────────────────────────────╯

Verdict.

Metric              | Traditional          | Argus
Cache-miss rate     | perf stat            | ⚠️ method-level only (no global rate)
IPC                 | perf stat            | ❌ not surfaced
Hot method ↔ misses | perf record/report   | argus profile --event cache-misses

Honest gap: Argus's cache-misses event surfaces the hot method on Linux, but interpreting it as false sharing still requires knowing that adjacent fields share a cache line. @Contended or padding is the fix; Argus points the finger but does not diagnose the pattern.
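
The padding fix mentioned above can be sketched as follows. Each counter lives in its own object alongside seven unused long fields, so two counters should not end up in the same 64-byte line. The class names are hypothetical, and field layout is ultimately decided by the JVM, so treat this as an illustration to validate with perf rather than a guarantee.

```java
import java.util.concurrent.atomic.AtomicLongFieldUpdater;

/** Hypothetical counter padded so that separate instances do not share a cache line. */
class PaddedCounter {
    private static final AtomicLongFieldUpdater<PaddedCounter> VALUE =
        AtomicLongFieldUpdater.newUpdater(PaddedCounter.class, "value");

    private volatile long value;

    // Padding only: 7 x 8 bytes next to the 8-byte value. Never read or written.
    @SuppressWarnings("unused")
    private long p1, p2, p3, p4, p5, p6, p7;

    void increment() {
        VALUE.incrementAndGet(this);
    }

    long get() {
        return value;
    }
}

/** Two independent counters, each in its own padded object. */
final class PaddedCounters {
    final PaddedCounter hits = new PaddedCounter();
    final PaddedCounter misses = new PaddedCounter();
}
```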

12. Direct Buffer Exhaustion (Off-Heap OOM)

Symptom. A high-traffic Netty service throws OutOfMemoryError: Direct buffer memory. Heap is healthy, the JVM is well below container limit, but the direct buffer pool has hit -XX:MaxDirectMemorySize.

Metrics required to diagnose:

  • Direct buffer count, total capacity, memory used
  • Mapped buffer counters for comparison
  • Allocation profile during the spike

Traditional toolchain

# Read the BufferPool MBean by hand
$ jcmd 12345 ManagementAgent.status
# Then connect with jconsole or a custom JMX client to read
# java.nio:type=BufferPool,name=direct -> Count, MemoryUsed, TotalCapacity

# Or capture a JFR allocation profile
$ jcmd 12345 JFR.start name=direct duration=30s settings=profile
$ jcmd 12345 JFR.stop name=direct filename=/tmp/direct.jfr
$ jmc /tmp/direct.jfr   # GC > Allocation tab, filter on DirectByteBuffer

Argus

$ argus buffers 12345
╭─ argus buffers ─────────────────────────────────────────────╮
│ NIO Buffer Pools   pid:12345  source:jdk                     │
│                                                              │
│   Pool                       Count    Capacity     Used      │
│   ──────────────────────────────────────────────────────     │
│   direct                    24,812      1.0 GB    1.0 GB     │
│   mapped                        18    412.0 MB    412.0 MB   │
│   mapped - 'non-volatile'        0          0          0     │
│                                                              │
│   Total                     24,830      1.4 GB    1.4 GB     │
│                                                              │
╰──────────────────────────────────────────────────────────────╯

$ argus doctor 12345
   ✘ CRITICAL: Direct buffer pool at capacity (1.0 GB / 1.0 GB)
     24,812 outstanding direct buffers — unreleased Netty allocations
     → Raise -XX:MaxDirectMemorySize after confirming a real leak
     → argus profile 12345 --event alloc to see allocation sites

Verdict.

Metric                     | Traditional                   | Argus
Direct buffer count / size | ⚠️ JMX MBean read             | argus buffers
Cross-correlation with GC  | ⚠️ JFR + JMC                  | doctor rule
Allocation call sites      | ✅ JFR allocation profile     | argus profile --event alloc
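
The BufferPool MBean mentioned in the traditional toolchain can also be read in-process, without jconsole, through the standard java.lang.management API. The following is a minimal sketch; the BufferPools class is hypothetical, while BufferPoolMXBean and its getters are part of the JDK.

```java
import java.lang.management.BufferPoolMXBean;
import java.lang.management.ManagementFactory;

/** Hypothetical reader for the java.nio BufferPool MXBeans. */
final class BufferPools {
    /** Returns the named pool ("direct" or "mapped"), or null if this JVM does not expose it. */
    static BufferPoolMXBean find(String name) {
        for (BufferPoolMXBean pool : ManagementFactory.getPlatformMXBeans(BufferPoolMXBean.class)) {
            if (pool.getName().equals(name)) {
                return pool;
            }
        }
        return null;
    }

    /** Formats the same three counters the MBean exposes: Count, MemoryUsed, TotalCapacity. */
    static String describe(BufferPoolMXBean pool) {
        return String.format("%s count=%d used=%d capacity=%d",
            pool.getName(), pool.getCount(), pool.getMemoryUsed(), pool.getTotalCapacity());
    }

    public static void main(String[] args) {
        BufferPoolMXBean direct = find("direct");
        if (direct != null) {
            System.out.println(describe(direct));
        }
    }
}
```

Polling these counters on a schedule gives an early warning well before the pool reaches -XX:MaxDirectMemorySize.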

Closing. Argus collapses roughly nine of these twelve scenarios into a single-command answer: 1, 4, 5, 6, 8, 10, and 12 in full, plus the JVM side of 3 and the application side of 9. The remaining evidence (TTSP safepoint logs in 2, OS swap in 7, the hardware-counter interpretation in 11, and the kernel-side halves of 3 and 9) sits at or below the JVM boundary, where Argus consciously stops. The intended workflow is to run Argus alongside the host shell, not in place of it.