排查记录: Java所有接口卡死
2024-11-07 14:01:56 4
线上接口, 每天凌晨三四点后开始卡死, 重启后恢复
调用后等待两分钟, 然后504
两分钟正好是nginx反代的超时时间
排查过程
日志
把nginx和应用日志都找出来看了下, 没发现什么问题
应用机器状态
- 看了下cpu占用率/内存, 没问题
- 磁盘占用也没问题
应用状态
- 通过
jps
查找到java进程 - 然后
jstack 进程id > result.txt
- 发现
http-nio
线程绝大部分阻塞在java.io.BufferedOutputStream.write(BufferedOutputStream.java:94)
, 都是在等待锁 - 查看代码, 发现controller有aop, 接口处理完后, 会发送相关记录到mq
- 再用
arthas
的thread -b
命令找出阻塞其他线程的线程
"http-nio-8080-exec-10" Id=142 BLOCKED on java.io.BufferedOutputStream@4b194cd3 owned by "http-nio-8080-exec-9" Id=141
at java.io.BufferedOutputStream.write(BufferedOutputStream.java:94)
- blocked on java.io.BufferedOutputStream@4b194cd3
at java.io.DataOutputStream.writeByte(DataOutputStream.java:153)
at com.rabbitmq.client.impl.Frame.writeTo(Frame.java:185)
at com.rabbitmq.client.impl.SocketFrameHandler.writeFrame(SocketFrameHandler.java:171)
- locked java.io.DataOutputStream@19e9fe54 <---- but blocks 199 other threads!
at com.rabbitmq.client.impl.AMQConnection.writeFrame(AMQConnection.java:542)
at com.rabbitmq.client.impl.AMQCommand.transmit(AMQCommand.java:104)
- locked com.rabbitmq.client.impl.CommandAssembler@35ace671
at com.rabbitmq.client.impl.AMQChannel.quiescingTransmit(AMQChannel.java:337)
- locked java.lang.Object@14917eea
at com.rabbitmq.client.impl.AMQChannel.quiescingTransmit(AMQChannel.java:319)
- locked java.lang.Object@14917eea
at com.rabbitmq.client.impl.AMQChannel.quiescingRpc(AMQChannel.java:254)
- locked java.lang.Object@14917eea
at com.rabbitmq.client.impl.AMQChannel.rpc(AMQChannel.java:245)
- locked java.lang.Object@14917eea
at com.rabbitmq.client.impl.AMQChannel.privateRpc(AMQChannel.java:222)
at com.rabbitmq.client.impl.AMQChannel.exnWrappingRpc(AMQChannel.java:117)
at com.rabbitmq.client.impl.ChannelN.open(ChannelN.java:132)
at com.rabbitmq.client.impl.ChannelManager.createChannel(ChannelManager.java:176)
at com.rabbitmq.client.impl.AMQConnection.createChannel(AMQConnection.java:533)
at org.springframework.amqp.rabbit.connection.SimpleConnection.createChannel(SimpleConnection.java:55)
at org.springframework.amqp.rabbit.connection.CachingConnectionFactory$ChannelCachingConnectionProxy.createBareChannel(CachingConnectionFactory.java:1154)
...
at org.apache.coyote.http11.Http11Processor.service(Http11Processor.java:615)
at org.apache.coyote.AbstractProcessorLight.process(AbstractProcessorLight.java:65)
at org.apache.coyote.AbstractProtocol$ConnectionHandler.process(AbstractProtocol.java:818)
at org.apache.tomcat.util.net.NioEndpoint$SocketProcessor.doRun(NioEndpoint.java:1623)
at org.apache.tomcat.util.net.SocketProcessorBase.run(SocketProcessorBase.java:49)
- locked org.apache.tomcat.util.net.NioEndpoint$NioSocketWrapper@52e0d8d8
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at org.apache.tomcat.util.threads.TaskThread$WrappingRunnable.run(TaskThread.java:61)
at java.lang.Thread.run(Thread.java:748)
Number of locked synchronizers = 1
- java.util.concurrent.ThreadPoolExecutor$Worker@cb644e
http-nio-8080-exec-9
这个虽然拿到了锁, 但貌似一直写不进数据, 所以导致阻塞- 然后查看
rabbitmq
的状态. 登录上对应mq服务的控制台, 发现节点磁盘已经满了, 控制台也有这么个警告The disk free space alarm for this node has gone off. It will block incoming network traffic untilthe amount of free space exceeds the limit.
大概意思就是此节点的磁盘不够了, 后续所有传入的数据都会阻塞,直到磁盘有足够空间 - 然后清理磁盘, 服务恢复
tips
rabbitmq发送消息是要拿到锁才能写数据, 因为channel共享了同一个tcp连接
com.rabbitmq.client.impl.SocketFrameHandler.writeFrame
@Override
public void writeFrame(Frame frame) throws IOException {
synchronized (_outputStream) {
frame.writeTo(_outputStream);
}
}