Resolving memory leaks in a legacy Node.js monolith to reduce AWS compute spend
Context: A B2B SaaS provider ran a long-lived Node.js monolith behind auto-scaling groups. Weekly releases masked a creeping heap issue until garbage collection pauses began tripping synthetic monitors.
Node diagnosis: Heap snapshots revealed retained buffers tied to an internal fan-out cache that had no eviction policy. Complementary event loop traces showed periodic stalls aligned with major GC cycles.
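A minimal sketch of the kind of instrumentation this diagnosis relies on, using built-in Node.js APIs (the interval, signal choice, and log format are assumptions, not the client's actual tooling):

```ts
// Watch event loop delay and dump heap snapshots on demand, the two signals
// described above. Diff two snapshots in Chrome DevTools to spot growing buffers.
import { writeHeapSnapshot } from "node:v8";
import { monitorEventLoopDelay } from "node:perf_hooks";

// Track event loop stalls; a rising p99 here tends to line up with major GC pauses.
const histogram = monitorEventLoopDelay({ resolution: 20 });
histogram.enable();

setInterval(() => {
  const p99Ms = histogram.percentile(99) / 1e6; // histogram reports nanoseconds
  console.log(`event loop delay p99: ${p99Ms.toFixed(1)} ms`);
  histogram.reset();
}, 10_000);

// Capture a heap snapshot when SIGUSR2 arrives (signal choice is an assumption).
process.on("SIGUSR2", () => {
  const file = writeHeapSnapshot(); // writes Heap.<timestamp>.heapsnapshot in cwd
  console.log(`heap snapshot written to ${file}`);
});
```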
Intervention: Rewrote the cache with an explicit TTL and a hard cap on entry count; isolated serialization-heavy routes behind worker threads to absorb CPU peaks; added bounded connection pooling for outbound HTTP agents.
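An illustrative sketch of that cache discipline and agent pooling, assuming a Map-backed cache with lazy expiry; the names (FanOutCache, TTL_MS, MAX_ENTRIES, outboundAgent) and limits are hypothetical, not the production values:

```ts
// Explicit TTL, a hard cap on entries, and one shared keep-alive agent for outbound HTTP.
import { Agent } from "node:https";

const TTL_MS = 30_000;
const MAX_ENTRIES = 5_000;

interface Entry<V> {
  value: V;
  expiresAt: number;
}

class FanOutCache<V> {
  private entries = new Map<string, Entry<V>>();

  get(key: string): V | undefined {
    const hit = this.entries.get(key);
    if (!hit) return undefined;
    if (hit.expiresAt < Date.now()) {
      this.entries.delete(key); // lazy eviction of expired entries
      return undefined;
    }
    return hit.value;
  }

  set(key: string, value: V): void {
    // Evict the oldest insertion once the cap is reached, so the heap stays bounded.
    if (this.entries.size >= MAX_ENTRIES && !this.entries.has(key)) {
      const oldestKey = this.entries.keys().next().value;
      if (oldestKey !== undefined) this.entries.delete(oldestKey);
    }
    this.entries.set(key, { value, expiresAt: Date.now() + TTL_MS });
  }
}

// A single keep-alive agent bounds outbound sockets instead of creating one per request.
export const outboundAgent = new Agent({ keepAlive: true, maxSockets: 64 });
export const cache = new FanOutCache<string>();
```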
Outcome: After stabilization, the client reported roughly forty percent lower AWS EC2 spend, attributable to needing fewer Node instances for the same peak traffic envelope, along with materially fewer latency spikes during weekday business hours.
Node.js gateway optimization to improve backend latency during traffic bursts
Context: An ecommerce brand routed mobile traffic through a Node.js gateway performing redundant JWT introspection and oversized payload validation on every hop.
Node diagnosis: Flamegraphs highlighted synchronous crypto-adjacent operations combined with JSON schema compilation happening on every request in hot paths.
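For illustration, this is the shape of hot-path code that such flamegraphs tend to surface, not the client's actual handler; the route, schema, and use of Ajv and jsonwebtoken are assumptions:

```ts
// Anti-pattern sketch: both schema compilation and signature verification run
// synchronously on every request, producing wide frames in a CPU profile.
import Ajv from "ajv";
import jwt from "jsonwebtoken";

const orderSchema = {
  type: "object",
  properties: { sku: { type: "string" }, qty: { type: "integer" } },
  required: ["sku", "qty"],
} as const;

export function handleOrder(token: string, body: unknown) {
  // Compiling the validator per request is CPU-heavy; it should happen once at startup.
  const validate = new Ajv().compile(orderSchema);
  if (!validate(body)) throw new Error("invalid payload");

  // Synchronous verification on every hop adds crypto work to the same hot path.
  const claims = jwt.verify(token, process.env.JWT_SECRET ?? "");
  return { claims, body };
}
```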
Intervention: Introduced memoized validators, moved introspection to short-lived edge caches with explicit revocation handling, and collapsed duplicate downstream calls via guarded batch endpoints.
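A hedged sketch of that remediation under assumed names (getValidator, cacheIntrospection, revokeToken, and the 60-second TTL are placeholders): validators are compiled once and memoized per schema, and introspection results live in a short-TTL map that revocation events clear immediately.

```ts
import Ajv, { type ValidateFunction } from "ajv";

const ajv = new Ajv();
const validators = new Map<string, ValidateFunction>();

// Compile each schema exactly once; later requests reuse the cached validator.
export function getValidator(schemaId: string, schema: object): ValidateFunction {
  let validate = validators.get(schemaId);
  if (!validate) {
    validate = ajv.compile(schema);
    validators.set(schemaId, validate);
  }
  return validate;
}

// Short-lived introspection cache: entries expire quickly, and revocation removes them at once.
const INTROSPECTION_TTL_MS = 60_000;
const introspectionCache = new Map<string, { active: boolean; expiresAt: number }>();

export function cacheIntrospection(tokenId: string, active: boolean): void {
  introspectionCache.set(tokenId, { active, expiresAt: Date.now() + INTROSPECTION_TTL_MS });
}

export function lookupIntrospection(tokenId: string): boolean | undefined {
  const hit = introspectionCache.get(tokenId);
  if (!hit || hit.expiresAt < Date.now()) return undefined; // force a fresh introspection
  return hit.active;
}

export function revokeToken(tokenId: string): void {
  introspectionCache.delete(tokenId); // explicit revocation handling
}
```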
Outcome: p95 gateway latency improved enough to eliminate emergency pre-scale rituals ahead of promotional drops; tail behavior remained sensitive to downstream inventory services, but Node ceased to be the dominant bottleneck.
Monolith to Node microservices migration without a reliability regression
Context: A fintech-adjacent platform needed to extract pricing logic from a sprawling Node monolith without freezing feature delivery.
Node diagnosis: Boundary ambiguity caused duplicated domain rules and unstable contracts, the classic precursors to distributed failures once traffic splits.
Intervention: Delivered a strangler roadmap with synthetic dual-run comparisons, OpenTelemetry propagation standards across Node services, and incremental traffic shifting guarded by error budgets.
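A minimal sketch of the dual-run step, assuming the legacy monolith path stays authoritative while the extracted pricing service runs in the shadow; the function names (legacyPrice, extractedPrice, recordDivergence) and the Quote shape are placeholders:

```ts
// Shadow comparison: callers always get the legacy answer; divergences are logged, not surfaced.
type Quote = { sku: string; cents: number };

export async function priceWithShadowCompare(
  sku: string,
  legacyPrice: (sku: string) => Promise<Quote>,
  extractedPrice: (sku: string) => Promise<Quote>,
  recordDivergence: (sku: string, legacy: Quote, candidate: Quote | Error) => void
): Promise<Quote> {
  const legacy = await legacyPrice(sku); // authoritative answer, returned to the caller

  // Fire the candidate service without blocking the response; compare offline.
  extractedPrice(sku)
    .then((candidate) => {
      if (candidate.cents !== legacy.cents) recordDivergence(sku, legacy, candidate);
    })
    .catch((err: Error) => recordDivergence(sku, legacy, err));

  return legacy;
}
```

Once recorded divergence rates fall within the agreed error budget, traffic can be shifted incrementally toward the extracted service.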
Outcome: Post-cutover, on-call pages for cascading timeouts dropped sharply relative to the six-month baseline: not zero incidents, but a credible Node topology that executives could reason about.