Symptom

Business pages in production stopped loading. The error log showed a direct-memory OOM (OutOfDirectMemoryError); the full error is below:

2022-03-21 06:00:00.541 [,] [tafprx-asyrecv_9] ERROR r.c.p.Operators - [error,314] - Operator called default onErrorDropped
java.lang.OutOfMemoryError: Direct buffer memory
        at java.base/java.nio.Bits.reserveMemory(Bits.java:175)
        at java.base/java.nio.DirectByteBuffer.<init>(DirectByteBuffer.java:118)
        at java.base/java.nio.ByteBuffer.allocateDirect(ByteBuffer.java:317)
        at io.netty.buffer.PoolArena$DirectArena.allocateDirect(PoolArena.java:632)
        at io.netty.buffer.PoolArena$DirectArena.newChunk(PoolArena.java:607)
        at io.netty.buffer.PoolArena.allocateNormal(PoolArena.java:202)
        at io.netty.buffer.PoolArena.tcacheAllocateSmall(PoolArena.java:172)
        at io.netty.buffer.PoolArena.allocate(PoolArena.java:134)
        at io.netty.buffer.PoolArena.allocate(PoolArena.java:126)
        at io.netty.buffer.PooledByteBufAllocator.newDirectBuffer(PooledByteBufAllocator.java:395)
        at io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:187)
        at io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:178)
        at io.netty.buffer.AbstractByteBufAllocator.buffer(AbstractByteBufAllocator.java:115)
        at org.springframework.core.io.buffer.NettyDataBufferFactory.allocateBuffer(NettyDataBufferFactory.java:71)
        at org.springframework.core.io.buffer.NettyDataBufferFactory.allocateBuffer(NettyDataBufferFactory.java:39)
        at org.springframework.http.codec.json.AbstractJackson2Encoder.encodeValue(AbstractJackson2Encoder.java:236)
        at org.springframework.http.codec.json.AbstractJackson2Encoder.lambda$encode$0(AbstractJackson2Encoder.java:150)
        at reactor.core.publisher.FluxMapFuseable$MapFuseableSubscriber.onNext(FluxMapFuseable.java:113)
        at reactor.core.publisher.Operators$MonoSubscriber.complete(Operators.java:1815)
        at reactor.core.publisher.MonoCompletionStage.lambda$subscribe$0(MonoCompletionStage.java:82)
        at java.base/java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:859)
        at java.base/java.util.concurrent.CompletableFuture.uniWhenCompleteStage(CompletableFuture.java:883)
        at java.base/java.util.concurrent.CompletableFuture.whenComplete(CompletableFuture.java:2251)
        at java.base/java.util.concurrent.CompletableFuture.whenComplete(CompletableFuture.java:143)
        at reactor.core.publisher.MonoCompletionStage.subscribe(MonoCompletionStage.java:57)
        at reactor.core.publisher.InternalMonoOperator.subscribe(InternalMonoOperator.java:64)
        at reactor.core.publisher.MonoFlatMap$FlatMapMain.onNext(MonoFlatMap.java:157)
        at reactor.core.publisher.Operators$MonoSubscriber.complete(Operators.java:1815)
        at reactor.core.publisher.MonoFlatMap$FlatMapInner.onNext(MonoFlatMap.java:249)
        at reactor.core.publisher.FluxOnErrorResume$ResumeSubscriber.onNext(FluxOnErrorResume.java:79)
        at reactor.core.publisher.FluxPeek$PeekSubscriber.onNext(FluxPeek.java:199)
        at reactor.core.publisher.FluxPeek$PeekSubscriber.onNext(FluxPeek.java:199)
        at reactor.core.publisher.MonoIgnoreThen$ThenIgnoreMain.complete(MonoIgnoreThen.java:284)
        at reactor.core.publisher.MonoIgnoreThen$ThenIgnoreMain.onNext(MonoIgnoreThen.java:187)
        at reactor.core.publisher.Operators$MonoSubscriber.complete(Operators.java:1815)
        at reactor.core.publisher.MonoFlatMap$FlatMapMain.onNext(MonoFlatMap.java:151)

Following the stack trace to AbstractJackson2Encoder line 236 and stepping through it in a debugger showed that the framework releases every buffer it allocates at the end, so that part looked fine.

JDK version

openjdk version "11.0.4" 2019-07-16 LTS
OpenJDK Runtime Environment Corretto-11.0.4.11.1 (build 11.0.4+11-LTS)
OpenJDK 64-Bit Server VM Corretto-11.0.4.11.1 (build 11.0.4+11-LTS, mixed mode)

JVM configuration

-Dspring.profiles.active=prod -Dreactor.netty.http.server.accessLogEnabled=true \
-Xms6G -Xmx6G -XX:MaxDirectMemorySize=6G -XX:+HeapDumpOnOutOfMemoryError \
-XX:+UnlockExperimentalVMOptions -XX:+UseZGC

Investigating in production

This was my first off-heap OOM, so I wasn't sure where to start. The first idea was to look at direct memory usage in production, so I added the following JVM flag:

-XX:NativeMemoryTracking=detail

Full of hope, I logged onto the machine and ran the command to inspect native memory:

jcmd 933 VM.native_memory

Reality hit back immediately: permission denied when trying to read the process's memory information. It turned out jcmd can only attach to a JVM running as the same user, and without root we could not switch to that user, so this path was blocked.
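As an aside, the JDK also exposes direct-buffer usage from inside the process via JMX, which needs no shell access at all. Below is a minimal sketch of my own (not the approach used in this post); note that Netty buffers allocated in "noCleaner" mode bypass the JDK's counter, so this view may not include Netty's own usage.

import java.lang.management.BufferPoolMXBean;
import java.lang.management.ManagementFactory;

public class DirectBufferProbe {

    // Print the JDK's view of the direct and mapped buffer pools.
    public static void main(String[] args) {
        for (BufferPoolMXBean pool : ManagementFactory.getPlatformMXBeans(BufferPoolMXBean.class)) {
            System.out.printf("%s: used=%d bytes, capacity=%d bytes, count=%d%n",
                    pool.getName(), pool.getMemoryUsed(), pool.getTotalCapacity(), pool.getCount());
        }
    }
}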

Reproducing locally

Since production was a dead end, the only option was to reproduce the problem locally. The project uses Spring WebFlux with Netty as the transport, and the error message is quite explicit: Netty is shouting that it has run out of direct memory. The trick for a local repro is to shrink the direct-memory limits so the problem shows up quickly.

JVM flags used for the local repro

-Dspring.profiles.active=dev
-Dreactor.netty.http.server.accessLogEnabled=true
-Dlogback.path=/tmp
-XX:NativeMemoryTracking=detail
-Xms1G
-Xmx1G
-XX:MaxDirectMemorySize=10m
-Dio.netty.maxDirectMemory=204800 // keep Netty's direct memory small
--add-opens java.base/jdk.internal.misc=ALL-UNNAMED
-Dio.netty.tryReflectionSetAccessible=true
-XX:+HeapDumpOnOutOfMemoryError
-Dio.netty.leakDetection.level=advanced
-XX:+UnlockExperimentalVMOptions

Sending requests and checking the logs

Memory usage kept growing and did not drop back after responses completed: it started at about 130k, jumped straight to 198k under concurrent requests, and then the OOM hit.

2022-03-30 20:10:29.320 [,] [pool-4-thread-1] INFO  c.y.c.r.DirectMemReporter - [doReport,51] - netty_direct_memory size:133120b,130k, max:204800

io.netty.util.internal.OutOfDirectMemoryError: failed to allocate 2048 byte(s) of direct memory (used: 202816, max: 204800)
    at io.netty.util.internal.PlatformDependent.incrementMemoryCounter(PlatformDependent.java:802)
    at io.netty.util.internal.PlatformDependent.allocateDirectNoCleaner(PlatformDependent.java:731)
    at io.netty.buffer.UnpooledUnsafeNoCleanerDirectByteBuf.allocateDirect(UnpooledUnsafeNoCleanerDirectByteBuf.java:30)
    at io.netty.buffer.UnpooledDirectByteBuf.<init>(UnpooledDirectByteBuf.java:64)
    at io.netty.buffer.UnpooledUnsafeDirectByteBuf.<init>(UnpooledUnsafeDirectByteBuf.java:41)
    at io.netty.buffer.UnpooledUnsafeNoCleanerDirectByteBuf.<init>(UnpooledUnsafeNoCleanerDirectByteBuf.java:25)
    at io.netty.buffer.UnsafeByteBufUtil.newUnsafeDirectByteBuf(UnsafeByteBufUtil.java:632)
    at io.netty.buffer.PooledByteBufAllocator.newDirectBuffer(PooledByteBufAllocator.java:397)
    at io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:188)
    at io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:179)
    at io.netty.channel.unix.PreferredDirectByteBufAllocator.ioBuffer(PreferredDirectByteBufAllocator.java:53)
    at io.netty.channel.DefaultMaxMessagesRecvByteBufAllocator$MaxMessageHandle.allocate(DefaultMaxMessagesRecvByteBufAllocator.java:120)
    at io.netty.channel.kqueue.KQueueRecvByteAllocatorHandle.allocate(KQueueRecvByteAllocatorHandle.java:63)
    at io.netty.channel.kqueue.AbstractKQueueStreamChannel$KQueueStreamUnsafe.readReady(AbstractKQueueStreamChannel.java:529)
    at io.netty.channel.kqueue.AbstractKQueueChannel$AbstractKQueueUnsafe.readEOF(AbstractKQueueChannel.java:485)
    at io.netty.channel.kqueue.KQueueEventLoop.processReady(KQueueEventLoop.java:213)
    at io.netty.channel.kqueue.KQueueEventLoop.run(KQueueEventLoop.java:289)
    at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:986)
    at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
    at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
    at java.base/java.lang.Thread.run(Thread.java:834)
2022-03-30 20:09:09.813 [,] [pool-4-thread-1] INFO  c.y.c.r.DirectMemReporter - [doReport,51] - netty_direct_memory size:202816b,198k, max:204800

Could it be that Netty really isn't releasing the memory?
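Before answering that, it helps to recall the contract: Netty's pooled ByteBufs are reference-counted and only go back to the arena when release() drops the count to zero, so a single missed release keeps that memory "in use" forever. A minimal illustration of the contract (my own example, not application code from this project):

import io.netty.buffer.ByteBuf;
import io.netty.buffer.PooledByteBufAllocator;

public class RefCountDemo {
    public static void main(String[] args) {
        ByteBuf buf = PooledByteBufAllocator.DEFAULT.directBuffer(256);
        try {
            buf.writeBytes("hello".getBytes());
            // ... read / transform the bytes ...
        } finally {
            // Without this release() the pooled direct memory is never returned
            // to the arena, which is exactly what a "leak" looks like here.
            buf.release();
        }
    }
}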

Adding monitoring (two approaches in parallel)

With the problem confirmed, the plan was to get monitoring into production first, so we could spot it in time and respond accordingly (restart? ha).

Reporting off-heap usage from application code

The core idea is to grab the DIRECT_MEMORY_COUNTER field of Netty's PlatformDependent class via reflection; this counter tracks how much direct memory Netty is currently using.

// Fields referenced below (their declarations are implied by the log output):
// the counter handle, the metric name, and a KB divisor.
private static final String BUSINESS_KEY = "netty_direct_memory";
private static final int _1k = 1024;
private AtomicLong directMem;

public DirectMemReporter() {
    // Grab Netty's internal direct-memory counter via reflection.
    Field field = ReflectionUtils.findField(PlatformDependent.class, "DIRECT_MEMORY_COUNTER");
    assert field != null;
    field.setAccessible(true);
    try {
        // The field is static, so the instance argument passed to get() is ignored.
        directMem = ((AtomicLong) field.get(PlatformDependent.class));
    } catch (Exception e) {
        log.error("got PlatformDependent.class DIRECT_MEMORY_COUNTER error", e);
    }
}

private void doReport() {
    try {
        // Log current usage (bytes and KB) against Netty's configured maximum.
        long size = directMem.get();
        log.info("{} size:{}b,{}k, max:{}", BUSINESS_KEY, size, (int) (size / _1k),
                 PlatformDependent.maxDirectMemory());
    } catch (Exception e) {
        log.error("doReport error", e);
    }
}
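The post doesn't show how doReport() is driven; judging by the pool-4-thread-1 thread name in the logs it is probably a scheduled task. A minimal sketch of such wiring, added inside DirectMemReporter (the one-second period and the executor choice are my assumptions):

import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Hypothetical start() method: report Netty's direct-memory counter once per second.
public void start() {
    ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
    scheduler.scheduleAtFixedRate(this::doReport, 1, 1, TimeUnit.SECONDS);
}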

Hooking into Prometheus monitoring

@Override
public void bindTo(final MeterRegistry registry) {

    // Direct (off-heap) memory currently held by Netty's pooled allocator.
    Gauge.builder("netty.pooled.used.memory", this.metric, PooledByteBufAllocatorMetric::usedDirectMemory)
            .tags("type", "direct")
            .description("The used memory")
            .baseUnit(BYTES_UNIT).register(registry);

    // Heap memory held by the same allocator, under the same metric name but a different tag.
    Gauge.builder("netty.pooled.used.memory", this.metric, PooledByteBufAllocatorMetric::usedHeapMemory)
            .tags("type", "heap")
            .description("The used memory")
            .baseUnit(BYTES_UNIT).register(registry);
}

// Register the binder against the Prometheus registry, using the default pooled allocator's metric handle.
new NettyPooledAllocatorMetrics(PooledByteBufAllocator.DEFAULT.metric()).bindTo(prometheusRegistry);
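For context, bindTo comes from Micrometer's MeterBinder interface; the surrounding class isn't shown in the post, but it presumably just holds the allocator's metric handle. A rough sketch of that wrapper follows (everything beyond the names already visible above is an assumption):

import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.binder.MeterBinder;
import io.netty.buffer.PooledByteBufAllocatorMetric;

// Hypothetical skeleton around the bindTo() method shown above.
public class NettyPooledAllocatorMetrics implements MeterBinder {

    private static final String BYTES_UNIT = "bytes";

    private final PooledByteBufAllocatorMetric metric;

    public NettyPooledAllocatorMetrics(PooledByteBufAllocatorMetric metric) {
        this.metric = metric;
    }

    // ... bindTo(MeterRegistry) as above ...
}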

A first look at the metrics

The metrics climbed at a steady rhythm, tracking the business traffic; with higher volume the problem would surface even more frequently.

Attempted fixes that didn't work

Upgrading the Spring Boot version

The first instinct was to upgrade: there is nothing an upgrade can't fix, right? Spring Boot went from 2.5.2 to 2.6.3, but after testing the problem remained.

Making Netty use heap memory

-Dio.netty.noPreferDirect=true

After rolling this config out to one production machine, memory was still climbing with no sign of being reclaimed (aren't those buffers supposed to be weakly referenced, so GC should be able to clean them up?).

A Redis problem?

The project uses Lettuce, which ideally should be driven fully asynchronously. Early on we were short on time and not confident with async programming, so we never tried; now seemed like the moment to rewrite everything in the async style. After the rewrite the memory growth was unchanged, though the code did read more smoothly.
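For reference, the "async rewrite" here means moving from blocking RedisTemplate calls to the reactive API that Lettuce supports natively. A minimal sketch of the reactive style, assuming a ReactiveStringRedisTemplate bean is available (names are illustrative, not the project's actual code):

import org.springframework.data.redis.core.ReactiveStringRedisTemplate;
import reactor.core.publisher.Mono;

// Illustrative only: non-blocking Redis access via Spring Data Redis + Lettuce,
// instead of a synchronous redisTemplate.opsForValue().get(key) call.
public Mono<String> findToken(ReactiveStringRedisTemplate template, String key) {
    return template.opsForValue().get(key);   // returns Mono<String>, nothing blocks
}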

Other dead ends

Trying Netty pooled allocation

No luck: the NettyCustomer couldn't be used here. Figuring out how to customize the allocator properly would mean digging deeper into how Spring wires up Netty.

A Filter problem? Was the custom Filter written the wrong way?

Checked against Spring WebFlux Filters with Controllers and Functional Routers; the Filter handling itself looked fine.

Netty startup error

At startup Netty reports direct buffer constructor: unavailable. This comes from the module system introduced in JDK 9; following "JDK从8升级到11时Netty报错", granting the module access below is enough, and it does not affect Netty's operation.

--add-opens java.base/jdk.internal.misc=ALL-UNNAMED -Dio.netty.tryReflectionSetAccessible=true

The final fix: a bug in our code

After all the attempts above, I went back through the code and found the culprit. Below is a Filter that rewrites the web response: it extracts the response body, prepends a common login field (isLogin), and writes the result back. The dataBuffer is never released after its contents are extracted, which leaks direct memory.

@Override
public Mono<Void> writeWith(Publisher<? extends DataBuffer> body) {

    final Flux<? extends DataBuffer> buffer = Flux.from(body);
    return super.writeWith(buffer
            .flatMap(dataBuffer -> {
                // BUG: the contents are extracted here but dataBuffer is never released.
                String response = extractResponse(dataBuffer);
                String result = response;
                if (response.startsWith("{")) {
                    int isLogin = UserUtil.isLogin(appRequest.getUin()) ? 1 : 0;
                    String loginPrefix = "{\"isLogin\":" + isLogin + ",";
                    result = loginPrefix + response.substring(1);
                    HttpHeaders headers = getDelegate().getHeaders();
                    headers.setContentLength(headers.getContentLength() + loginPrefix.length() - 1);
                }
                setUserCookie();
                return Mono.just(exchange.getResponse().bufferFactory().wrap(result.getBytes(StandardCharsets.UTF_8)));
            })
            .doOnNext(dataBuffer -> {
                try {
                    final RLog rlog = buildRlog();
                    rlog.setResp(extractResponse(dataBuffer));
                    logService.accessLog(rlog);

                } catch (Exception e) {
                    log.error("record access log error", e);
                }
            }));
}
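The extractResponse helper isn't shown in the post; a plausible reconstruction is below, assuming it simply copies the readable bytes into a String (DataBuffer is Spring's org.springframework.core.io.buffer.DataBuffer). Note that it does not release the buffer, which is exactly why the caller above leaks.

// Hypothetical helper, reconstructed for illustration only.
private String extractResponse(DataBuffer dataBuffer) {
    byte[] bytes = new byte[dataBuffer.readableByteCount()];
    dataBuffer.read(bytes);
    return new String(bytes, StandardCharsets.UTF_8);
}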

Applying the fix

With the code problem finally pinned down, a long sigh, and then the fix: add a DataBufferUtils.release(dataBuffer) call to release the buffer.

@Override
public Mono<Void> writeWith(Publisher<? extends DataBuffer> body) {

    final Flux<? extends DataBuffer> buffer = Flux.from(body);
    return super.writeWith(buffer
            .flatMap(dataBuffer -> {
                String response = extractResponse(dataBuffer);
                DataBufferUtils.release(dataBuffer);  // added: release the extracted buffer
                String result = response;
                if (response.startsWith("{")) {
                    int isLogin = UserUtil.isLogin(appRequest.getUin()) ? 1 : 0;
                    String loginPrefix = "{\"isLogin\":" + isLogin + ",";
                    result = loginPrefix + response.substring(1);
                    HttpHeaders headers = getDelegate().getHeaders();
                    headers.setContentLength(headers.getContentLength() + loginPrefix.length() - 1);
                }
                setUserCookie();
                return Mono.just(exchange.getResponse().bufferFactory().wrap(result.getBytes(StandardCharsets.UTF_8)));
            })
            .doOnNext(dataBuffer -> {
                try {
                    final RLog rlog = buildRlog();
                    rlog.setResp(extractResponse(dataBuffer));
                    logService.accessLog(rlog);

                } catch (Exception e) {
                    log.error("record access log error", e);
                }
            }));
}
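A further hardening step worth considering (my addition, not part of the original fix): perform the release in a finally block so the buffer is freed even if extraction or the prefixing logic throws.

// Sketch: make the release exception-safe.
String response;
try {
    response = extractResponse(dataBuffer);
} finally {
    DataBufferUtils.release(dataBuffer);   // always give the buffer back
}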

Monitoring one day after the fix

After a day of observation, memory usage held steady at around 1 GB. Problem solved; a big sigh of relief.

Open questions from the investigation

A few questions came up during the investigation that remain unanswered; noting them here:

  1. With Netty configured to use heap memory and a leak still present, would a JVM Full GC eventually reclaim the leaked buffers?
  2. Lettuce works asynchronously by default, so why does using the synchronous redisTemplate also work?

Summary

This investigation took more than a week and plenty of effort, with a few detours along the way. Still, I learned a lot and realized how thin my fundamentals are, mainly around Netty, Reactor, and Spring WebFlux; those deserve deeper study going forward.

References

multithreading - OutOfDirectMemoryError using Spring-Data-Redis with Lettuce in a multi-threading context - Stack Overflow

命中 SpringCloudGateway 组件 BUG | 彩虹马的博客

Netty堆外内存泄露排查盛宴 - 美团技术团队

netty堆外内存监控_colie_li的博客-CSDN博客_netty内存监控

JDK从8升级到11时Netty报错 - 身处寒夜,把握星光。

Spring WebFlux Filters with Controllers and Functional Routers - amitph