Real-World Scenarios: Argus vs Traditional Toolchains

Thirteen production incidents that a senior JVM engineer recognises on sight. For each one we show the commands you would actually type today, then the single Argus command that collapses the same evidence into one screen. We are honest about the cases where Argus does not replace the host shell: the OS-boundary cases (safepoint logs, swap, file descriptors, kernel OOM events) still need vmstat, lsof, or dmesg alongside Argus.

1. G1GC Humongous Object Allocation Bomb

Symptom. Heap looks fine (6 GB committed, 40% usage), yet every few minutes the application stalls for 2–4 seconds. The culprit is multi-megabyte byte[] allocations from an Excel reader or a streaming file API. G1 allocates any object larger than about half a region directly into contiguous humongous regions, fragmentation builds up, and eventually an evacuation failure triggers a multi-second STW pause.

Metrics required to diagnose:

Humongous allocation events — frequency, region count, source call sites
G1 region fragmentation and evacuation failure markers in the GC log
Allocation hotspots, ordered by allocated bytes per call site

Traditional toolchain

# Confirm the collector and region size, then check heap occupancy
$ jcmd 12345 VM.flags | grep -E 'UseG1GC|G1HeapRegionSize'
   uintx G1HeapRegionSize                          = 4194304
   bool  UseG1GC                                    = true

$ jcmd 12345 GC.heap_info
 garbage-first heap   total 6291456K, used 2516201K [0x0000...]
  region size 4096K, 14 young (57344K), 3 survivors (12288K)
 Metaspace       used 84321K, committed 86016K, reserved 1130496K

# Restart the app with -Xlog:gc*,gc+humongous=trace and re-grep
$ grep -E 'Humongous|to-space exhausted' gc.log | tail -20
[3214.876s][info ][gc,humongous] GC(412) Humongous region: 1 -> 1
[3214.880s][info ][gc           ] GC(412) Pause Young (Concurrent Start) (G1 Humongous Allocation)
[3401.122s][warn ][gc           ] GC(488) To-space exhausted

# Get the call sites
$ asprof -e alloc -d 30 -f /tmp/alloc.html 12345
$ open /tmp/alloc.html   # click into HumongousObjAllocator path

Argus

$ argus doctor 12345
╭─ argus doctor ──────────────────────────────────────────────╮
│ JVM Health Report     pid:12345  HotSpot  uptime:1h 14m     │
│                                                              │
│   Heap: 2.4 GB/6.0 GB (40%)  |  CPU: 22%  |  Threads: 184    │
│   GC: 14.6% overhead                                         │
│ ──────────────────────────────────────────────────────────── │
│                                                              │
│   1 critical  1 warning                                      │
│                                                              │
│   ✘ CRITICAL: G1 humongous allocations dominating pauses     │
│     14 humongous events in last 5 min; max pause 2,840 ms    │
│     → Reduce object size below G1HeapRegionSize/2 (2 MB)     │
│     → Or raise -XX:G1HeapRegionSize=8m                       │
│                                                              │
│   ⚠ WARNING: GC overhead 14.6% (threshold 10%)               │
│     → argus profile 12345 --event alloc --duration 30        │
│                                                              │
╰─ ✘ critical ────────────────────────────────────────────────╯

$ argus profile 12345 --event alloc --duration 30
Top allocation sites (n=18,204 samples, 30s window)
  1. com.acme.report.XlsxReader.readSheet            41.8%  byte[]
  2. java.util.zip.Inflater.inflateBytes             17.2%  byte[]
  3. com.acme.report.XlsxReader.rowBuffer             9.6%  byte[]

Verdict.

Metric	Traditional	Argus
Humongous event count	⚠️ requires `-Xlog:gc*` + restart	✅ surfaced by `doctor`
Allocation hotspots	⚠️ async-profiler + HTML viewer	✅ `argus profile --event alloc`
Region size + flag fix	✅ `jcmd VM.flags`	✅ included as recommendation

One Argus run replaces a restart-for-logs cycle plus a separate profiler session.
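
The first recommendation in the doctor output (keep objects below half a region) usually means streaming through a small reusable buffer instead of materialising one giant byte[]. A minimal sketch, assuming a 4 MB region as in the flags above; the class name, the 64 KB chunk size, and the exact boundary test are illustrative assumptions, not Argus or G1 APIs:

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

final class ChunkedRead {
    /** Rough rule of thumb: an object larger than half a G1 region is treated as humongous. */
    static boolean likelyHumongous(long objectBytes, long regionBytes) {
        return objectBytes > regionBytes / 2;
    }

    /** Streams the input through one small reusable buffer and returns the byte count. */
    static long consume(InputStream in, int chunkBytes) {
        byte[] chunk = new byte[chunkBytes];
        long total = 0;
        try {
            int n;
            while ((n = in.read(chunk)) != -1) {
                total += n; // hypothetical: hand chunk[0..n) to the parser here
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return total;
    }

    public static void main(String[] args) {
        long region = 4L << 20; // 4 MB, matching G1HeapRegionSize in the scenario
        System.out.println("3 MB array humongous? " + likelyHumongous(3L << 20, region));
        System.out.println("bytes consumed: "
                + consume(new ByteArrayInputStream(new byte[1 << 20]), 64 * 1024));
    }
}
```

The buffer is allocated once per call and stays far below the humongous threshold, so the allocation never leaves the young generation path.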

2. Time To Safepoint (TTSP) Black Hole

Symptom. The GC log says Pause Young: 11 ms but the application reports stop-the-world freezes of several seconds. The culprit is TTSP: a counted loop scanning a giant array does not poll for safepoints, so every other thread sits idle waiting for the laggard to check in.

Metrics required to diagnose:

Safepoint entry latency — time-to-reach-safepoint per VM operation
Wall-clock samples of application threads during the stall
JIT-compiled methods using counted loops on large arrays

Traditional toolchain

# Add the safepoint logging flag and restart
$ java -Xlog:safepoint*=info:file=safepoint.log:time,uptime,level ...

$ grep 'Total time for which application threads were stopped' safepoint.log
[uptime=621.142s] Total time for which application threads were stopped: 4.8214s
                  Stopping threads took: 4.8101s
# 4.8 of 4.82 seconds were spent reaching safepoint, not in the GC itself

$ asprof -e wall -d 10 -f /tmp/wall.html 12345
# Open in flame-graph viewer, look for hot frames during the stall window

Argus

$ argus profile 12345 --event wall --duration 10
╭─ argus profile --event wall ────────────────────────────────╮
│ Wall-clock profile   pid:12345  duration:10s  samples:9,873  │
│                                                              │
│   Top wall-time methods                                      │
│     1. com.acme.search.MatrixScan.find          58.4%        │
│     2. java.util.HashMap.getNode                 6.1%        │
│     3. sun.nio.ch.EPollSelectorImpl.doSelect     5.8%        │
│                                                              │
│   ⚠ A single method dominates wall time — review for         │
│     counted loops without safepoint polls (TTSP risk).       │
│                                                              │
╰──────────────────────────────────────────────────────────────╯

Verdict.

Metric	Traditional	Argus
Safepoint entry latency	✅ `-Xlog:safepoint`	❌ Argus does not parse safepoint logs
Wall-clock hotspots	⚠️ async-profiler + HTML viewer	✅ `argus profile --event wall`
Root cause identification	✅ once both signals correlated	⚠️ partial — points at the suspect method

Honest gap: Argus shows the dominant wall-time method, which is usually the offending counted loop, but you still want -Xlog:safepoint in production to prove the stop-the-world time is in safepoint entry, not in the collector.
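
The arithmetic behind the "4.8 of 4.82 seconds" observation can be automated once the safepoint log exists. A small sketch, assuming the two-field log wording shown in the traditional toolchain output; the class and method names are hypothetical:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class TtspShare {
    private static final Pattern TOTAL = Pattern.compile("were stopped: ([0-9.]+)");
    private static final Pattern REACH = Pattern.compile("Stopping threads took: ([0-9.]+)");

    private static double firstNumber(Pattern p, String log) {
        Matcher m = p.matcher(log);
        if (!m.find()) {
            throw new IllegalArgumentException("pattern not found: " + p);
        }
        return Double.parseDouble(m.group(1));
    }

    /** Fraction of the stop-the-world time spent reaching the safepoint. */
    static double share(String log) {
        return firstNumber(REACH, log) / firstNumber(TOTAL, log);
    }

    public static void main(String[] args) {
        String entry = "Total time for which application threads were stopped: 4.8214s\n"
                + "Stopping threads took: 4.8101s";
        System.out.printf("time-to-safepoint share: %.1f%%%n", 100 * share(entry));
    }
}
```

A share close to 100% points at safepoint entry rather than at the collector, which is the distinction the wall-clock profile alone cannot prove.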

3. Linux OOMKilled — The Silent Assassin

Symptom. Heap stays at 50%, no OutOfMemoryError is thrown, yet the pod is killed and restarted. The kernel OOM-killer fired because resident memory blew past the container limit — usually Netty DirectByteBuffer, gRPC native channels, or jemalloc fragmentation, all of which live outside the heap.

Metrics required to diagnose:

Native memory committed by category (Class, Thread, Code, GC, Internal, …)
Container working-set vs limit (cgroup memory.max)
Kernel kill event in dmesg / journalctl

Traditional toolchain

# Confirm the kernel killed it
$ dmesg | grep -i 'killed process'
[7421.882] Out of memory: Killed process 12345 (java) total-vm:9120384kB,
           anon-rss:7842112kB, file-rss:0kB, shmem-rss:0kB

# NMT must be enabled at JVM startup (-XX:NativeMemoryTracking=summary); add it to the manifest, then capture a baseline after the next start
$ jcmd 12345 VM.native_memory baseline
$ # ... 30 min later ...
$ jcmd 12345 VM.native_memory summary.diff
Native Memory Tracking:
Total: reserved=9628700KB +384200KB, committed=8420112KB +402112KB
-                 Class (reserved=180224KB, committed=178892KB)
-                Thread (reserved=520304KB +12000KB, committed=520304KB +12000KB)
-              Internal (reserved=445188KB +321000KB, committed=445188KB +321000KB)
... scan all 18 categories manually ...

Argus

$ argus nmt 12345 --save baseline.json
Saved NMT baseline to: /home/ops/baseline.json

# ... 30 minutes of traffic ...

$ argus nmt 12345 --diff baseline.json
╭─ argus nmt --diff ──────────────────────────────────────────╮
│ NMT diff   pid:12345  source:jdk  elapsed:31m 04s            │
│                                                              │
│   Since 2026-05-14 09:12:01: committed +392.7 MB, reserved   │
│   +375.2 MB                                                  │
│                                                              │
│   Category          Reserved Δ   Committed Δ   Reserved now  │
│   ──────────────────────────────────────────────────────     │
│   Internal              +321 MB       +321 MB       445 MB   │
│   Thread                 +12 MB        +12 MB       520 MB   │
│   GC                      +9 MB         +9 MB       142 MB   │
│   Class                   +6 MB         +5 MB       178 MB   │
│                                                              │
│   ✘ Internal (DirectByteBuffer + JNI scratch) is the         │
│     dominant grower — usually Netty/gRPC native pools.       │
│                                                              │
╰──────────────────────────────────────────────────────────────╯

Once you have identified the growing NMT category, cross-check whether the current heap sizing itself puts the pod at OOMKill risk:

$ argus rightsize 12345 --limit=2g
╭─ argus rightsize ───────────────────────────────────────────╮
│ JVM Right-Sizing   floor:312MiB  safety:1.5x                 │
│                                                              │
│   -Xmx                   -Xmx768m                            │
│   -Xms                   -Xms256m                            │
│   Container request       900 MiB                            │
│   Container limit        1100 MiB                            │
│   CPU request            0.50 cores                          │
│                                                              │
│   [OOMKILL RISK] total footprint (heap + off-heap) exceeds   │
│     container limit 2g — kernel OOMKill likely               │
│                                                              │
╰──────────────────────────────────────────────────────────────╯

Verdict.

Metric	Traditional	Argus
NMT baseline + diff	⚠️ manual scan of 18 categories	✅ banner + sorted growers
OOMKill risk check	⚠️ manual math from kubectl + jcmd	✅ `argus rightsize --limit=<mem>`
Container working-set	✅ `kubectl top` / cAdvisor	❌ out of scope
Kernel kill event	✅ `dmesg`	❌ out of scope — still need shell

Honest gap: Argus identifies the JVM-side grower in seconds; dmesg and container metrics are still required to confirm the kernel killed the process.
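
If only jcmd is available, the manual scan of the summary.diff output can be scripted. The following is a rough sketch, assuming the line layout shown in the traditional toolchain output; the class name and sample lines are illustrative, and real NMT output may contain lines this regex does not cover:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class NmtDiffGrowers {
    // Matches e.g. "-  Thread (reserved=520304KB +12000KB, committed=520304KB +12000KB)"
    private static final Pattern LINE = Pattern.compile(
            "^-\\s*(.+?) \\(reserved=\\d+KB(?: [+-]\\d+KB)?, committed=\\d+KB(?: ([+-]\\d+)KB)?\\)");

    /** Returns (category, committed delta in KB), largest grower first. */
    static List<Map.Entry<String, Long>> growers(List<String> lines) {
        List<Map.Entry<String, Long>> out = new ArrayList<>();
        for (String line : lines) {
            Matcher m = LINE.matcher(line.trim());
            if (!m.find()) {
                continue; // not a category line
            }
            Long deltaKb = m.group(2) == null ? 0L : Long.parseLong(m.group(2));
            out.add(Map.entry(m.group(1), deltaKb));
        }
        out.sort(Map.Entry.<String, Long>comparingByValue().reversed());
        return out;
    }

    public static void main(String[] args) {
        List<String> sample = List.of(
                "-                 Class (reserved=180224KB, committed=178892KB)",
                "-                Thread (reserved=520304KB +12000KB, committed=520304KB +12000KB)",
                "-              Internal (reserved=445188KB +321000KB, committed=445188KB +321000KB)");
        growers(sample).forEach(e -> System.out.println(e.getKey() + "  +" + e.getValue() + "KB"));
    }
}
```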

4. JIT Deoptimization Storm

Symptom. CPU usage spikes and throughput collapses without any code change. Heavy use of runtime-generated lambdas and reflection fills the CodeCache, the JIT marks hot methods made-not-entrant, and the JVM falls back to the interpreter until they are recompiled, at which point the cycle repeats.

Metrics required to diagnose:

CodeCache used vs total, max-used watermark
nmethod count, deoptimization counter trend
Compiler queue depth

Traditional toolchain

$ jcmd 12345 Compiler.codecache
CodeHeap 'non-profiled nmethods': size=120000Kb  used=119742Kb  max_used=119988Kb  free=258Kb
CodeHeap 'profiled nmethods':     size=120000Kb  used=119410Kb  max_used=119410Kb  free=590Kb
CodeHeap 'non-nmethods':           size=5760Kb   used=4012Kb    max_used=4040Kb   free=1748Kb

# Restart with -XX:+PrintCompilation to see made-not-entrant frequency
$ grep 'made not entrant' compilation.log | wc -l
21847

Argus

$ argus compiler 12345
╭─ argus compiler ────────────────────────────────────────────╮
│ JIT Compiler   pid:12345  source:jdk                         │
│                                                              │
│   ✔ Compilation enabled                                      │
│                                                              │
│   Code cache  ████████████████████  239.4 MB / 245.7 MB (97%)│
│     Max used: 239.4 MB     Free: 6.3 MB                      │
│                                                              │
│   Total blobs: 28,412    nmethods: 24,108    adapters: 1,820 │
│   Compiler queue: 312                                        │
│   Deoptimizations: 21,847                                    │
│                                                              │
│   ⚠ Code cache > 80% full — JIT compiler may stop. Increase  │
│     -XX:ReservedCodeCacheSize.                               │
│                                                              │
╰──────────────────────────────────────────────────────────────╯

$ argus doctor 12345
   ✘ CRITICAL: CodeCache near exhaustion (97%)
     21,847 deoptimizations since boot; compiler queue 312
     → -XX:ReservedCodeCacheSize=512m
     → -XX:-UseCodeCacheFlushing (only after raising the size)

Verdict.

Metric	Traditional	Argus
CodeCache used / max	✅ `Compiler.codecache`	✅ progress bar + watermark
Deoptimization count	⚠️ requires PrintCompilation restart	✅ surfaced live
Suggested fix	⚠️ requires JVM-tuning knowledge	✅ `doctor` emits the flag
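
The same code-cache occupancy can also be read from inside the application through the standard memory-pool MXBeans, which is handy for exporting it as a metric. A minimal sketch; the pool names ("CodeHeap ..." for a segmented cache, "CodeCache" otherwise) are HotSpot conventions assumed here, and the class name is hypothetical:

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryUsage;

final class CodeCacheUsage {
    /** Returns {usedBytes, maxBytes} summed over all code-cache pools that report a usage. */
    static long[] usedAndMax() {
        long used = 0;
        long max = 0;
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            String name = pool.getName();
            if (!name.startsWith("CodeHeap") && !name.equals("CodeCache")) {
                continue;
            }
            MemoryUsage usage = pool.getUsage();
            if (usage == null) {
                continue; // pool no longer valid
            }
            used += usage.getUsed();
            if (usage.getMax() > 0) {
                max += usage.getMax(); // getMax() is -1 when undefined
            }
        }
        return new long[] {used, max};
    }

    public static void main(String[] args) {
        long[] um = usedAndMax();
        System.out.printf("code cache: %d KB used of %d KB%n", um[0] / 1024, um[1] / 1024);
    }
}
```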

5. Phantom Thread Swamp

Symptom. A payment-gateway client was deployed without a read timeout. Tomcat worker threads pile up in socketRead0 until the pool saturates and every other endpoint starts returning 503. Load average is low; the JVM looks idle.

Metrics required to diagnose:

Thread state distribution (RUNNABLE vs WAITING vs BLOCKED)
Frequency of socketRead0 in stacks
Per-pool concentration of stuck threads

Traditional toolchain

$ jstack 12345 > /tmp/dump1.txt
$ sleep 5
$ jstack 12345 > /tmp/dump2.txt
$ sleep 5
$ jstack 12345 > /tmp/dump3.txt

$ grep -c 'socketRead0' /tmp/dump1.txt /tmp/dump2.txt /tmp/dump3.txt
/tmp/dump1.txt:182
/tmp/dump2.txt:184
/tmp/dump3.txt:183

# Confirm they are Tomcat workers (the thread name sits two lines above the socketRead0 frame)
$ grep -B2 'socketRead0' /tmp/dump2.txt | grep '"http-nio'
"http-nio-8080-exec-12"  ...
"http-nio-8080-exec-13"  ...
... (184 occurrences) ...

Argus

$ argus threads 12345
╭─ argus threads ─────────────────────────────────────────────╮
│ Thread Dump   pid:12345  source:jdk                          │
│                                                              │
│ Total: 248    Virtual: 0    Platform: 248    Peak: 252       │
│                                                              │
│ RUNNABLE      ████░░░░░░░░░░░░     24  ( 10%)                │
│ WAITING       █████████████░░░    195  ( 79%)                │
│ TIMED_WAITING ██░░░░░░░░░░░░░░     27  ( 11%)                │
│ BLOCKED       ░░░░░░░░░░░░░░░░      2  (  1%)                │
│                                                              │
╰──────────────────────────────────────────────────────────────╯

$ argus pool 12345
  Pool                              Count  State
  ──────────────────────────────────────────────────────────
  http-nio-8080-exec-                 184    WAIT:183  RUN:1
  catalina-utility-                     8    WAIT:8
  ForkJoinPool.commonPool-              8    WAIT:8
  scheduling-                           4    WAIT:4

Verdict.

Metric	Traditional	Argus
State distribution	⚠️ 3× `jstack` + grep	✅ inline bars
Stuck-thread pool	⚠️ manual aggregation	✅ `argus pool`
Confirm I/O wait	✅ stack grep	⚠️ requires opening one stack
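
The per-pool aggregation boils down to stripping the numeric suffix from each thread name and counting. A rough sketch of that idea, not Argus's actual implementation; the class name is hypothetical, and for a remote process the names would come from a parsed jstack dump rather than the live JVM:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class PoolGrouping {
    /** Strips a trailing number: "http-nio-8080-exec-12" becomes "http-nio-8080-exec-". */
    static String poolOf(String threadName) {
        return threadName.replaceAll("\\d+$", "");
    }

    /** Counts threads per pool prefix, sorted by prefix. */
    static Map<String, Integer> countByPool(List<String> threadNames) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String name : threadNames) {
            counts.merge(poolOf(name), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Uses the threads of the current JVM as sample input.
        List<String> names = new ArrayList<>();
        for (Thread t : Thread.getAllStackTraces().keySet()) {
            names.add(t.getName());
        }
        countByPool(names).forEach((pool, n) -> System.out.println(pool + "  " + n));
    }
}
```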

6. Dynamic Proxy Betrayal (Metaspace Leak)

Symptom. A long-running service slowly grows Metaspace until it hits OutOfMemoryError: Metaspace a few weeks after deploy. Usually a Spring or Hibernate proxy class is pinned by a ThreadLocal that nobody clears, so its ClassLoader cannot be unloaded.

Metrics required to diagnose:

Metaspace used / committed trend
Loaded class count growth
Top ClassLoaders by class count

Traditional toolchain

$ jcmd 12345 VM.metaspace
Total Usage - 9 loaders, 84,221 classes: ...
  Used: 412.3 MB  Committed: 418.0 MB  Reserved: 1.1 GB

$ jcmd 12345 GC.class_histogram | head -20
 num     #instances         #bytes  class name
   1:         482,114       38,569,120  $Proxy188
   2:         312,002       24,960,160  $Proxy42
... (no information about which ClassLoader is the offender)

# To get the GC-root path you must dump the heap and load it in MAT
$ jcmd 12345 GC.heap_dump /tmp/heap.hprof
# Open in Eclipse MAT, run Leak Suspects ... 20 min later ...

Argus

$ argus metaspace 12345
╭─ argus metaspace ───────────────────────────────────────────╮
│ Metaspace   pid:12345  source:jdk                            │
│                                                              │
│   Used  ████████████████████  412.3 MB / 418.0 MB (99%)      │
│     Reserved: 1.1 GB                                         │
│                                                              │
│   Space           Used     Committed     Reserved            │
│   ──────────────────────────────────────────────────────     │
│   Metaspace      318 MB       320 MB         1.0 GB          │
│   ClassSpace      94 MB        98 MB        128 MB           │
│                                                              │
│   ⚠ Metaspace usage above 90% — risk of OOM:Metaspace.       │
│                                                              │
╰──────────────────────────────────────────────────────────────╯

$ argus classleak 12345 --top 5
  ClassLoader                                  Classes    Δ since boot
  ─────────────────────────────────────────────────────────────────────
  AppClassLoader                                 24,118        +18
  sun.reflect.DelegatingClassLoader             482,114    +480,002
  org.springframework.cglib.core.ReflectUtils    12,402     +12,400
  jdk.internal.loader.ClassLoaders$AppClass...    1,402         +0

Verdict.

Metric	Traditional	Argus
Metaspace usage trend	✅ `VM.metaspace`	✅ progress bar + warning
Loaded class growth	⚠️ `GC.class_histogram` only	✅ `argus classleak`
Offending ClassLoader	❌ heap dump + MAT (20 min)	✅ top-N by class count
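
For the "loaded class count growth" metric, the standard ClassLoadingMXBean gives a cheap in-process signal that can be sampled on a timer. A minimal sketch with a hypothetical class name; it reports counts only and cannot name the offending ClassLoader:

```java
import java.lang.management.ClassLoadingMXBean;
import java.lang.management.ManagementFactory;

final class ClassCountProbe {
    /** Classes currently loaded in this JVM. */
    static int residentClasses() {
        return ManagementFactory.getClassLoadingMXBean().getLoadedClassCount();
    }

    /** One-line snapshot: a rising 'loaded' with a flat 'unloaded' suggests a class leak. */
    static String snapshot() {
        ClassLoadingMXBean cl = ManagementFactory.getClassLoadingMXBean();
        return String.format("loaded=%d totalLoaded=%d unloaded=%d",
                cl.getLoadedClassCount(), cl.getTotalLoadedClassCount(), cl.getUnloadedClassCount());
    }

    public static void main(String[] args) {
        System.out.println(snapshot());
    }
}
```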

7. Fatal OS Swapping

Symptom. GC pauses that used to be 30–50 ms balloon to tens of seconds. The reason is not the collector — the JVM RSS exceeds physical RAM, the kernel pages out heap regions, and every GC mark phase has to fault them back in. Garbage-collecting paged-out memory is catastrophic.

Metrics required to diagnose:

Host swap-in / swap-out rates (vmstat si/so)
GC pause User= vs Sys= ratio (high Sys = page faults)
JVM GC overhead trend

Traditional toolchain

$ vmstat 1 5
procs  -----------memory----------  ---swap--  -----io----  -system--  ----cpu----
 r  b   swpd   free   buff  cache    si   so    bi    bo    in   cs  us sy id wa
 4  0  812224  31200  4012 142000   8412 9120  9408  9412  3211 4022  18 22 38 22
 5  0  819440  29040  4012 141880   9120 8804  9120  8800  3402 4188  17 25 36 22
# si/so > 0 sustained = the JVM is paging

$ grep -i 'user=' gc.log | tail -3
[Times: user=0.42 sys=3.81, real=4.18 secs]
# sys time eclipses user time → kernel is dominating, the collector is waiting on I/O

Argus

$ argus doctor 12345
   ✘ CRITICAL: GC overhead 38.2% (threshold 10%)
     Mean pause 4,180 ms; last cause Allocation Failure
     → Check host memory pressure: vmstat 1 (look at si/so)
     → If swapping, the fix is at the OS layer, not the JVM

Verdict.

Metric	Traditional	Argus
GC pause anomaly	✅ GC log review	✅ `doctor` flags overhead
vmstat si/so	✅ `vmstat 1`	❌ Argus does not read OS counters
User vs Sys time	✅ GC log parsing	❌ not exposed

Honest gap: Argus shouts "your GC overhead is wrong" within seconds, but the proof that the cause is swap-out lives in vmstat and the GC log timing fields. Argus and the host shell are complementary here.
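
The sys-versus-user check on the GC timing line is simple enough to script. A small sketch, assuming either the older "[Times: user=… sys=…]" wording or the unified-logging "User=…s Sys=…s" wording; the class name is hypothetical:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class GcTimes {
    private static final Pattern TIMES = Pattern.compile(
            "user=([0-9.]+)s?,?\\s+sys=([0-9.]+)", Pattern.CASE_INSENSITIVE);

    /** Ratio of kernel time to user time for one GC pause; well above 1 hints at paging. */
    static double sysToUser(String line) {
        Matcher m = TIMES.matcher(line);
        if (!m.find()) {
            throw new IllegalArgumentException("no GC timing fields in: " + line);
        }
        return Double.parseDouble(m.group(2)) / Double.parseDouble(m.group(1));
    }

    public static void main(String[] args) {
        System.out.printf("sys/user = %.1f%n",
                sysToUser("[Times: user=0.42 sys=3.81, real=4.18 secs]"));
    }
}
```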

8. Finalizer Queue Backpressure

Symptom. An older third-party library uses finalize() for stream cleanup. Allocation rate is fine but heap usage climbs anyway, because objects scheduled for finalization sit in the java.lang.ref.Finalizer queue waiting for the single Finalizer thread to drain them.

Metrics required to diagnose:

Pending finalization count
Finalizer thread state (should be WAITING; if RUNNABLE constantly, it cannot keep up)
Reference-processing phase time in GC log

Traditional toolchain

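$ jcmd 12345 GC.heap_info | grep -i 'pending'
# GC.heap_info has no pending-finalization field. On JDK 9+, jcmd 12345 GC.finalizer_info
# lists pending instances per class; on older JVMs you must dump the heap.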

$ jcmd 12345 GC.heap_dump /tmp/heap.hprof
# Open in MAT, navigate to java.lang.ref.Finalizer.queue, count nodes
$ grep 'Reference Processing' gc.log | tail -5
[gc,ref] GC(721) Reference Processing                      284.318ms
# 284 ms in reference processing alone is the smoking gun

Argus

$ argus finalizer 12345
╭─ argus finalizer ───────────────────────────────────────────╮
│ Finalizer queue   pid:12345  source:jdk                      │
│                                                              │
│   ⚠ 4,812 objects pending finalization                       │
│                                                              │
│   Finalizer thread state  RUNNABLE                           │
│                                                              │
│   ⚠ Pending count above 100 — finalizers cannot keep up.     │
│     Replace finalize() with try-with-resources / Cleaner.    │
│                                                              │
╰──────────────────────────────────────────────────────────────╯

Verdict.

Metric	Traditional	Argus
Pending finalizer count	⚠️ `jcmd GC.finalizer_info` (JDK 9+), else heap dump + MAT	✅ shown inline
Finalizer thread state	⚠️ `jstack` grep	✅ shown inline
Reference-processing time	✅ GC log	⚠️ visible via `argus gc`
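
The recommended replacement for finalize() combines AutoCloseable with java.lang.ref.Cleaner (JDK 9+). A minimal sketch; ManagedStream and its counter are hypothetical stand-ins for a class that owns a native handle:

```java
import java.lang.ref.Cleaner;
import java.util.concurrent.atomic.AtomicInteger;

final class ManagedStream implements AutoCloseable {
    private static final Cleaner CLEANER = Cleaner.create();

    /** Cleanup state must not reference the ManagedStream itself, or it never becomes unreachable. */
    private static final class Release implements Runnable {
        private final AtomicInteger releases;

        Release(AtomicInteger releases) {
            this.releases = releases;
        }

        @Override
        public void run() {
            releases.incrementAndGet(); // stand-in for closing the native handle
        }
    }

    private final Cleaner.Cleanable cleanable;

    ManagedStream(AtomicInteger releases) {
        this.cleanable = CLEANER.register(this, new Release(releases));
    }

    /** Deterministic release; Cleanable.clean() runs the action at most once. */
    @Override
    public void close() {
        cleanable.clean();
    }

    public static void main(String[] args) {
        AtomicInteger releases = new AtomicInteger();
        try (ManagedStream stream = new ManagedStream(releases)) {
            // use the stream
        }
        System.out.println("releases: " + releases.get());
    }
}
```

With try-with-resources the release happens on the calling thread, so nothing queues behind the Finalizer thread; the Cleaner only acts as a safety net for instances that were never closed.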

9. FD Exhaustion + CLOSE_WAIT

Symptom. The application starts throwing java.net.SocketException: Too many open files. The HTTP client never calls close() on its response bodies; sockets pile up in CLOSE_WAIT on the kernel side, until the process file-descriptor ceiling is hit and every new connection fails.

Metrics required to diagnose:

Process FD count vs limit
TCP socket state distribution — count of CLOSE_WAIT entries
Threads stuck in HTTP read on the application side

Traditional toolchain

$ lsof -p 12345 | wc -l
65411
$ cat /proc/12345/limits | grep 'open files'
Max open files            65536                65536                files

$ ss -tnp | awk '{print $1}' | sort | uniq -c
   1 State
  18 ESTAB
8412 CLOSE-WAIT
# CLOSE_WAIT dominates — the application is not closing its half of each socket

Argus

# Argus does not have an lsof equivalent. The application-side hint is:
$ argus threads 12345
 RUNNABLE       ░░░░░░░░░░░░░░░░      4  (  2%)
 WAITING        ███████████████░    192  ( 96%)   ← almost all HTTP-read waits

$ argus pool 12345
  http-client-pool-                   180    WAIT:180

Verdict.

Metric	Traditional	Argus
Open FD count	✅ `lsof` / `/proc`	❌ out of scope
TCP socket state	✅ `ss` / `netstat`	❌ out of scope
Application-side stuck threads	⚠️ `jstack` grep	✅ `argus threads` + `pool`

Honest gap: this scenario lives below the JVM. Argus can confirm the application threads are stuck on socket reads (consistent with a leak), but the FD count and the kernel-side socket states require lsof / ss on the host.
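
An application can watch its own descriptor headroom through the Unix-specific operating-system MXBean and alert before the ceiling is reached. A rough sketch, assuming a HotSpot-style JDK on a Unix-like host where com.sun.management.UnixOperatingSystemMXBean is available; the class name is hypothetical, and this only covers the JVM's own process:

```java
import com.sun.management.UnixOperatingSystemMXBean;
import java.lang.management.ManagementFactory;
import java.lang.management.OperatingSystemMXBean;

final class FdHeadroom {
    /** Returns {open, max} for this JVM process, or null where the Unix bean is unavailable. */
    static long[] openAndMax() {
        OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();
        if (os instanceof UnixOperatingSystemMXBean) {
            UnixOperatingSystemMXBean unix = (UnixOperatingSystemMXBean) os;
            return new long[] {unix.getOpenFileDescriptorCount(), unix.getMaxFileDescriptorCount()};
        }
        return null;
    }

    public static void main(String[] args) {
        long[] fd = openAndMax();
        if (fd == null) {
            System.out.println("FD counters not available on this platform");
        } else {
            System.out.printf("open=%d max=%d%n", fd[0], fd[1]);
        }
    }
}
```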

10. Invisible Implicit Deadlock — Lock Contention

Symptom. Throughput collapses, threads are not deadlocked in the JMX sense (there is no cycle), yet ninety percent of workers sit BLOCKED. The cause is a single synchronized cache map that serialises every read.

Metrics required to diagnose:

BLOCKED thread count over time
Lock-acquire wait hotspots — which monitor and from which call site
Pool concentration of the blocked threads

Traditional toolchain

$ for i in 1 2 3; do jstack 12345 > /tmp/d$i.txt; sleep 5; done
$ grep -E '^"|waiting to lock' /tmp/d2.txt | grep -B1 'waiting to lock <0x000000076ab1b9d0>' | head
"http-nio-8080-exec-42"
        - waiting to lock <0x000000076ab1b9d0> (a java.util.HashMap)
"http-nio-8080-exec-43"
        - waiting to lock <0x000000076ab1b9d0> (a java.util.HashMap)
# ... 178 more threads blocked on the same monitor

Argus

$ argus threads 12345
 BLOCKED       ███████████████░    178  ( 72%)   ⚠

$ argus profile 12345 --event lock --duration 20
╭─ argus profile --event lock ────────────────────────────────╮
│ Lock contention profile   pid:12345  duration:20s            │
│                                                              │
│   Top contended monitors                                     │
│     1. com.acme.cache.HotCache.get               68.4%       │
│        java.util.HashMap@0x76ab1b9d0                         │
│     2. java.util.concurrent.ConcurrentHashMap     8.1%       │
│     3. ch.qos.logback.core.OutputStreamAppender   3.2%       │
│                                                              │
╰──────────────────────────────────────────────────────────────╯

Verdict.

Metric	Traditional	Argus
BLOCKED count	⚠️ `jstack` × N + count	✅ inline bar + warning
Contended monitor	⚠️ manual aggregation across dumps	✅ `profile --event lock`
Owning call site	✅ once you grep the holder	✅ top-N call sites

Optional: confirm the exact hot synchronized path. Once profile --event lock names the suspect method, use argus instrument watch to capture per-invocation args, return value, and wall-clock cost without restarting the process:

# Requires --enable-instrument and ~/.argus/lib/argus-instrument.jar
$ argus instrument watch 12345 com.acme.Foo#method --enable-instrument
[argus-instrument]  com.acme.Foo#method  args=[...]  return=...  duration=4ms
[argus-instrument]  com.acme.Foo#method  args=[...]  return=...  duration=6ms
# Ctrl-C — bytecode restored automatically on detach

The agent is default-OFF, attaches on demand, and resets all transformed bytecode on detach — no residual instrumentation is left in the target JVM.
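
Once the contended monitor is confirmed, the usual remedy for a synchronized cache map is a ConcurrentHashMap, which serves hits without taking a lock. A minimal sketch; this HotCache is a hypothetical reconstruction named after the class in the profile output, not the real code:

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

final class HotCache<K, V> {
    private final ConcurrentHashMap<K, V> map = new ConcurrentHashMap<>();
    private final Function<K, V> loader;

    HotCache(Function<K, V> loader) {
        this.loader = loader;
    }

    /** Hits are served by a lock-free get(); only a miss locks the affected bin while loading. */
    V get(K key) {
        V cached = map.get(key);
        return cached != null ? cached : map.computeIfAbsent(key, loader);
    }

    public static void main(String[] args) {
        AtomicInteger loads = new AtomicInteger();
        HotCache<String, Integer> cache = new HotCache<>(k -> {
            loads.incrementAndGet();
            return k.length();
        });
        System.out.println(cache.get("abc") + " " + cache.get("abc") + " loads=" + loads.get());
    }
}
```

The loader passed to computeIfAbsent should be short and must not touch the same map, since it runs while the bin is locked.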

11. False Sharing in the Multi-Core Era

Symptom. Independent counters live in adjacent fields of the same object. They land in the same 64-byte CPU cache line. Two cores end up invalidating each other's line on every write. Ops/sec collapses well below what the CPU should deliver, and no amount of pure-Java profiling tells you why.

Metrics required to diagnose:

L1/LLC cache-miss rate
Instructions per cycle (IPC)
Hot method that correlates with the cache-miss events

Traditional toolchain

$ perf stat -e cache-misses,instructions,cycles -p 12345 sleep 10

 Performance counter stats for process id '12345':

       1,824,210,033      cache-misses                #   42.18 % of all cache refs
       4,212,902,118      instructions                #    0.41  insn per cycle
      10,302,002,418      cycles

# 0.41 IPC and 42% miss rate is a giant red flag. Now correlate with hot methods:
$ perf record -e cache-misses -p 12345 sleep 10
$ perf report --no-children | head -10
   38.42%  com.acme.metrics.Counters.recordHit
   12.18%  com.acme.metrics.Counters.recordMiss

Argus

$ argus profile 12345 --event cache-misses --duration 10
╭─ argus profile --event cache-misses ────────────────────────╮
│ Hardware profile (Linux PMU)   pid:12345  duration:10s       │
│                                                              │
│   Top cache-miss methods                                     │
│     1. com.acme.metrics.Counters.recordHit       38.4%       │
│     2. com.acme.metrics.Counters.recordMiss      12.2%       │
│     3. java.util.concurrent.atomic.LongAdder      4.1%       │
│                                                              │
╰──────────────────────────────────────────────────────────────╯

Verdict.

Metric	Traditional	Argus
Cache-miss rate	✅ `perf stat`	⚠️ method-level only (no global rate)
IPC	✅ `perf stat`	❌ not surfaced
Hot method ↔ misses	✅ `perf record/report`	✅ `argus profile --event cache-misses`

Honest gap: Argus's cache-misses event surfaces the hot method on Linux, but interpreting it as false sharing still requires knowing that adjacent fields share a cache line. @Contended or padding is the fix; Argus points the finger but does not diagnose the pattern.
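
The padding fix can be sketched with the superclass trick used by several concurrency libraries: surround the hot field with unused longs so two counters are unlikely to share a 64-byte line. This is a sketch under assumptions; field layout is a JVM implementation detail and is not guaranteed, the class names are hypothetical, and each counter is assumed to have a single writer thread:

```java
final class PaddedCounters {
    // Superclass fields are normally laid out before subclass fields, so these
    // seven longs (56 bytes) sit between the object header and the hot value.
    static class LeftPad { long p1, p2, p3, p4, p5, p6, p7; }

    static class Value extends LeftPad { volatile long value; }

    // Seven more longs after the value keep the next object off the same line.
    static final class PaddedLong extends Value { long q1, q2, q3, q4, q5, q6, q7; }

    final PaddedLong hits = new PaddedLong();
    final PaddedLong misses = new PaddedLong();

    void recordHit() { hits.value++; }     // single writer assumed, so ++ is safe here

    void recordMiss() { misses.value++; }  // single writer assumed

    public static void main(String[] args) throws InterruptedException {
        PaddedCounters counters = new PaddedCounters();
        Thread a = new Thread(() -> { for (int i = 0; i < 1_000_000; i++) counters.recordHit(); });
        Thread b = new Thread(() -> { for (int i = 0; i < 1_000_000; i++) counters.recordMiss(); });
        a.start();
        b.start();
        a.join();
        b.join();
        System.out.println(counters.hits.value + " " + counters.misses.value);
    }
}
```

Whether the padding actually helps should be confirmed by re-running the cache-miss profile; the sketch only shows the shape of the change.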

12. Direct Buffer Exhaustion (Off-Heap OOM)

Symptom. A high-traffic Netty service throws OutOfMemoryError: Direct buffer memory. Heap is healthy and the JVM is well below the container limit, but the direct buffer pool has hit -XX:MaxDirectMemorySize.

Metrics required to diagnose:

Direct buffer count, total capacity, memory used
Mapped buffer counters for comparison
Allocation profile during the spike

Traditional toolchain

# Read the BufferPool MBean by hand
$ jcmd 12345 ManagementAgent.status
# Then connect with jconsole or a custom JMX client to read
# java.nio:type=BufferPool,name=direct -> Count, MemoryUsed, TotalCapacity

# Or capture a JFR allocation profile
$ jcmd 12345 JFR.start name=direct duration=30s settings=profile
$ jcmd 12345 JFR.stop name=direct filename=/tmp/direct.jfr
$ jmc /tmp/direct.jfr   # GC > Allocation tab, filter on DirectByteBuffer

Argus

$ argus buffers 12345
╭─ argus buffers ─────────────────────────────────────────────╮
│ NIO Buffer Pools   pid:12345  source:jdk                     │
│                                                              │
│   Pool                       Count    Capacity     Used      │
│   ──────────────────────────────────────────────────────     │
│   direct                    24,812      1.0 GB    1.0 GB     │
│   mapped                        18    412.0 MB    412.0 MB   │
│   mapped - 'non-volatile'        0          0          0     │
│                                                              │
│   Total                     24,830      1.4 GB    1.4 GB     │
│                                                              │
╰──────────────────────────────────────────────────────────────╯

$ argus doctor 12345
   ✘ CRITICAL: Direct buffer pool at capacity (1.0 GB / 1.0 GB)
     24,812 outstanding direct buffers — unreleased Netty allocations
     → Raise -XX:MaxDirectMemorySize after confirming a real leak
     → argus profile 12345 --event alloc to see allocation sites

Verdict.

Metric	Traditional	Argus
Direct buffer count / size	⚠️ JMX MBean read	✅ `argus buffers`
Cross-correlation with GC	⚠️ JFR + JMC	✅ `doctor` rule
Allocation call sites	✅ JFR allocation profile	✅ `argus profile --event alloc`
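
The BufferPool MBean mentioned in the traditional toolchain can be read in-process with a few lines, which is often easier than attaching a JMX client. A minimal sketch with a hypothetical class name; it reports only buffers the JDK itself accounts for in the "direct" pool:

```java
import java.lang.management.BufferPoolMXBean;
import java.lang.management.ManagementFactory;
import java.nio.ByteBuffer;

final class DirectPoolProbe {
    /** Returns the "direct" buffer pool bean, or null if the platform does not expose one. */
    static BufferPoolMXBean directPool() {
        for (BufferPoolMXBean pool : ManagementFactory.getPlatformMXBeans(BufferPoolMXBean.class)) {
            if ("direct".equals(pool.getName())) {
                return pool;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        ByteBuffer held = ByteBuffer.allocateDirect(1 << 20); // held so it stays counted
        BufferPoolMXBean direct = directPool();
        if (direct != null) {
            System.out.printf("count=%d used=%d capacity=%d%n",
                    direct.getCount(), direct.getMemoryUsed(), direct.getTotalCapacity());
        }
        System.out.println("held buffer capacity: " + held.capacity());
    }
}
```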

13. Virtual-Thread Carrier Saturation / Pinning

Symptom. A Java 21+ service migrated to virtual threads shows healthy request counts under low load but throughput collapses under concurrency. Heap, GC, and CPU look fine. The root cause is that synchronized blocks or legacy blocking I/O inside virtual threads pin them to their carrier platform threads. Once all carriers are pinned, the scheduler cannot mount new virtual threads and the entire pool stalls — even though hundreds of virtual threads are nominally "runnable".

Metrics required to diagnose:

Virtual thread count vs. carrier (platform) thread count
Carrier thread states — all WAITING/BLOCKED means they are all pinned
Pinning event rate and taxonomy: native-frame pinning, foreign-call pinning, or object-monitor pinning (post-JEP-491 JDK 24+ breakdown)
jdk.VirtualThreadPinned and jdk.VirtualThreadSubmitFailed JFR events

Traditional toolchain

# Enable JVM pinning trace at startup and tail the log
$ java -Djdk.tracePinnedThreads=full -jar service.jar 2>&1 | grep -A4 'CarrierThreads'
Thread[#42,ForkJoinPool-1-worker-1,5,CarrierThreads]
    com.acme.PaymentService.charge(PaymentService.java:88) <== monitors:1
    com.acme.PaymentService$$Lambda.run(...)
# Requires restart; output is noisy with no count/rate aggregation.

# Capture a JFR recording and open in JMC
$ jcmd 12345 JFR.start name=vt duration=60s settings=profile
$ jcmd 12345 JFR.stop name=vt filename=/tmp/vt.jfr
$ jmc /tmp/vt.jfr   # Events tab → filter jdk.VirtualThreadPinned

Argus

$ argus threads 12345
╭─ argus threads ─────────────────────────────────────────────╮
│ Thread Dump   pid:12345  source:jdk                          │
│                                                              │
│ Total: 1,284    Virtual: 1,240    Platform: 44    Peak: 48   │
│                                                              │
│ RUNNABLE      ██░░░░░░░░░░░░░░     44  (  3%)   ← carriers  │
│ WAITING       █████████████░░░  1,228  ( 96%)   ← vt queue  │
│ TIMED_WAITING ░░░░░░░░░░░░░░░░      8  (  1%)               │
│ BLOCKED       ░░░░░░░░░░░░░░░░      4  (  0%)               │
│                                                              │
│ ⚠ All carrier threads busy — virtual thread scheduler        │
│   stall likely. Check for pinning events.                    │
│                                                              │
╰──────────────────────────────────────────────────────────────╯

$ argus doctor 12345
╭─ argus doctor ──────────────────────────────────────────────╮
│ JVM Health Report     pid:12345  HotSpot  uptime:22m         │
│                                                              │
│   Heap: 1.1 GB/4.0 GB (28%)  |  CPU: 18%  |  Threads: 1,284  │
│   GC: 1.2% overhead                                          │
│ ──────────────────────────────────────────────────────────── │
│                                                              │
│   1 critical                                                  │
│                                                              │
│   ✘ CRITICAL: Virtual-thread carrier saturation              │
│     44 carriers all active; 1,228 virtual threads waiting    │
│     Pinning rate: 312/min (jdk.VirtualThreadPinned)          │
│     Pinning taxonomy: object-monitor (287), native-frame (25) │
│     → Replace synchronized blocks with ReentrantLock in      │
│       pinning hot paths (see Dashboard → Virtual Threads)    │
│     → Or raise carrier parallelism:                          │
│       -Djdk.virtualThreadScheduler.parallelism=N             │
│                                                              │
╰─ ✘ critical ────────────────────────────────────────────────╯

The dashboard's Virtual Threads panel shows the jdk.VirtualThreadPinned event stream with stack traces (identifying the offending synchronized block) and the jdk.VirtualThreadSubmitFailed rate (carrier-pool submission failures, which appear when all carriers are saturated and cannot accept new mounts).

Fix. Replace synchronized with java.util.concurrent.locks.ReentrantLock (or ReentrantReadWriteLock) in the pinning hot paths — virtual threads unmount cleanly at lock.lock() whereas synchronized pins the carrier. On JDK 24+ with JEP 491, object-monitor operations no longer pin; updating the runtime eliminates the largest pinning category without code changes.
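
A minimal sketch of that replacement, assuming a simple critical section; Ledger and its methods are hypothetical, and the demo uses platform threads so it runs on any JDK:

```java
import java.util.concurrent.locks.ReentrantLock;

final class Ledger {
    private final ReentrantLock lock = new ReentrantLock();
    private long balanceCents;

    /** Same mutual exclusion as a synchronized method, without holding an object monitor. */
    void charge(long cents) {
        lock.lock();
        try {
            balanceCents += cents; // hypothetical critical section
        } finally {
            lock.unlock(); // always release, even if the body throws
        }
    }

    long balance() {
        lock.lock();
        try {
            return balanceCents;
        } finally {
            lock.unlock();
        }
    }

    /** Runs the given number of workers concurrently and returns the final balance. */
    static long demo(int threads, int chargesPerThread) {
        Ledger ledger = new Ledger();
        Thread[] workers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            workers[i] = new Thread(() -> {
                for (int j = 0; j < chargesPerThread; j++) {
                    ledger.charge(1);
                }
            });
            workers[i].start();
        }
        for (Thread worker : workers) {
            try {
                worker.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        return ledger.balance();
    }

    public static void main(String[] args) {
        System.out.println(demo(4, 1000));
    }
}
```

On JDK 21+ the same workers could be started with Thread.ofVirtual().start(...) to exercise the code on virtual threads.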

Verdict.

Metric	Traditional	Argus
Carrier saturation	⚠️ inferred from `jstack` + carrier thread names	✅ `threads` shows virtual vs. carrier counts + warning
Pinning event rate	⚠️ JFR + JMC or restart with `tracePinnedThreads`	✅ `doctor` surfaces rate + taxonomy inline
Pinning stack traces	✅ `-Djdk.tracePinnedThreads=full` (requires restart)	✅ dashboard Virtual Threads panel (`jdk.VirtualThreadPinned`)
`VirtualThreadSubmitFailed`	⚠️ JFR + manual event filter	✅ dashboard Virtual Threads panel

Honest gap: Argus surfaces the pinning rate and taxonomy live; confirming the exact line of the offending synchronized block still requires inspecting the stack traces in the dashboard's Virtual Threads panel or in a JFR recording.

Closing. Argus collapses roughly ten of these thirteen scenarios into a single command answer (1, 4, 5, 6, 8, 10, 12, 13, plus the JVM side of 3 and the application side of 9). The remaining cases (TTSP safepoint logs, OS swap, and the hardware-counter interpretation behind false sharing, together with the kernel side of OOM kills and FD/socket exhaustion) sit at or below the JVM boundary, where Argus consciously stops. The intended workflow is to run Argus alongside the host shell, not in place of it.