排查记录: Java所有接口卡死
2024-11-07 14:01:56 11
线上接口, 每天凌晨三四点后开始卡死, 重启后恢复
调用后等待两分钟, 然后504
两分钟正好是nginx反代的超时时间
排查过程
日志
把nginx和应用日志都找出来看了下, 没发现什么问题
应用机器状态
- 看了下cpu占用率/内存, 没问题
- 磁盘占用也没问题
应用状态
- 通过
jps查找到java进程 - 然后
jstack 进程id > result.txt - 发现
http-nio线程绝大部分阻塞在java.io.BufferedOutputStream.write(BufferedOutputStream.java:94), 都是在等待锁 - 查看代码, 发现controller有aop, 接口处理完后, 会发送相关记录到mq
- 再用
arthas的thread -b命令找出阻塞其他线程的线程
"http-nio-8080-exec-10" Id=142 BLOCKED on java.io.BufferedOutputStream@4b194cd3 owned by "http-nio-8080-exec-9" Id=141
at java.io.BufferedOutputStream.write(BufferedOutputStream.java:94)
- blocked on java.io.BufferedOutputStream@4b194cd3
at java.io.DataOutputStream.writeByte(DataOutputStream.java:153)
at com.rabbitmq.client.impl.Frame.writeTo(Frame.java:185)
at com.rabbitmq.client.impl.SocketFrameHandler.writeFrame(SocketFrameHandler.java:171)
- locked java.io.DataOutputStream@19e9fe54 <---- but blocks 199 other threads!
at com.rabbitmq.client.impl.AMQConnection.writeFrame(AMQConnection.java:542)
at com.rabbitmq.client.impl.AMQCommand.transmit(AMQCommand.java:104)
- locked com.rabbitmq.client.impl.CommandAssembler@35ace671
at com.rabbitmq.client.impl.AMQChannel.quiescingTransmit(AMQChannel.java:337)
- locked java.lang.Object@14917eea
at com.rabbitmq.client.impl.AMQChannel.quiescingTransmit(AMQChannel.java:319)
- locked java.lang.Object@14917eea
at com.rabbitmq.client.impl.AMQChannel.quiescingRpc(AMQChannel.java:254)
- locked java.lang.Object@14917eea
at com.rabbitmq.client.impl.AMQChannel.rpc(AMQChannel.java:245)
- locked java.lang.Object@14917eea
at com.rabbitmq.client.impl.AMQChannel.privateRpc(AMQChannel.java:222)
at com.rabbitmq.client.impl.AMQChannel.exnWrappingRpc(AMQChannel.java:117)
at com.rabbitmq.client.impl.ChannelN.open(ChannelN.java:132)
at com.rabbitmq.client.impl.ChannelManager.createChannel(ChannelManager.java:176)
at com.rabbitmq.client.impl.AMQConnection.createChannel(AMQConnection.java:533)
at org.springframework.amqp.rabbit.connection.SimpleConnection.createChannel(SimpleConnection.java:55)
at org.springframework.amqp.rabbit.connection.CachingConnectionFactory$ChannelCachingConnectionProxy.createBareChannel(CachingConnectionFactory.java:1154)
...
at org.apache.coyote.http11.Http11Processor.service(Http11Processor.java:615)
at org.apache.coyote.AbstractProcessorLight.process(AbstractProcessorLight.java:65)
at org.apache.coyote.AbstractProtocol$ConnectionHandler.process(AbstractProtocol.java:818)
at org.apache.tomcat.util.net.NioEndpoint$SocketProcessor.doRun(NioEndpoint.java:1623)
at org.apache.tomcat.util.net.SocketProcessorBase.run(SocketProcessorBase.java:49)
- locked org.apache.tomcat.util.net.NioEndpoint$NioSocketWrapper@52e0d8d8
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at org.apache.tomcat.util.threads.TaskThread$WrappingRunnable.run(TaskThread.java:61)
at java.lang.Thread.run(Thread.java:748)
Number of locked synchronizers = 1
- java.util.concurrent.ThreadPoolExecutor$Worker@cb644ehttp-nio-8080-exec-9这个虽然拿到了锁, 但貌似一直写不进数据, 所以导致阻塞- 然后查看
rabbitmq的状态. 登录上对应mq服务的控制台, 发现节点磁盘已经满了, 控制台也有这么个警告The disk free space alarm for this node has gone off. It will block incoming network traffic untilthe amount of free space exceeds the limit.大概意思就是此节点的磁盘不够了, 后续所有传入的数据都会阻塞,直到磁盘有足够空间 - 然后清理磁盘, 服务恢复
tips
rabbitmq发送消息是要拿到锁才能写数据, 因为channel共享了同一个tcp连接
com.rabbitmq.client.impl.SocketFrameHandler.writeFrame
@Override
public void writeFrame(Frame frame) throws IOException {
synchronized (_outputStream) {
frame.writeTo(_outputStream);
}
}





