RaftKeeper is a high-performance distributed consensus service that we run at large scale in production as the metadata service for ClickHouse. This post records a session of analyzing a core dump from the RaftKeeper project with LLDB.
Since there is little material online about analyzing core dump files, I hope this can be of some help.
1. Problem Symptoms
One node of a three-node RaftKeeper cluster hit a segmentation fault and generated a core dump file.
2. Analysis
First, open the core dump file and inspect the stack of the thread that triggered the crash:
lldb -c core-ReqFwdRecv#43-1384962-1009-1720544680 tmp/raftkeeper
(lldb) bt
* thread #1, name = 'raftkeeper', stop reason = signal SIGSEGV
* frame #0: 0x00000000006fe12c raftkeeper`RK::RequestForwarder::runReceive(unsigned long) [inlined] std::__1::shared_ptr<RK::ForwardConnection>::shared_ptr(this=<unavailable>, __r=<unavailable>) at memory:3097:18
frame #1: 0x00000000006fe12c raftkeeper`RK::RequestForwarder::runReceive(unsigned long) [inlined] std::__1::shared_ptr<RK::ForwardConnection>::operator=(this=0x00007fb53f3a9fd0, __r=<unavailable>) at memory:3220:5
frame #2: 0x00000000006fe12c raftkeeper`RK::RequestForwarder::runReceive(this=0x00007fb58d443408, runner_id=7) at RequestForwarder.cpp:167:32
frame #3: 0x000000000063cc3e raftkeeper`ThreadPoolImpl<ThreadFromGlobalPool>::worker(std::__1::__list_iterator<ThreadFromGlobalPool, void*>) [inlined] std::__1::__function::__policy_func<void ()>::operator()(this=0x00007fb53f3aa100) const at functional:2221:16
frame #4: 0x000000000063cc35 raftkeeper`ThreadPoolImpl<ThreadFromGlobalPool>::worker(std::__1::__list_iterator<ThreadFromGlobalPool, void*>) [inlined] std::__1::function<void ()>::operator()(this=0x00007fb53f3aa100) const at functional:2560:12
frame #5: 0x000000000063cc35 raftkeeper`ThreadPoolImpl<ThreadFromGlobalPool>::worker(this=0x00007fb58c853998, thread_it=std::__1::list<ThreadFromGlobalPool, std::__1::allocator<ThreadFromGlobalPool> >::iterator @ 0x00007fb53f3aa0f8) at ThreadPool.cpp:265:17
Then look at the relevant code by selecting frame 2:
(lldb) frame select 2
frame #2: 0x00000000006fe12c raftkeeper`RK::RequestForwarder::runReceive(this=0x00007fb58d443408, runner_id=7) at RequestForwarder.cpp:167:32
164 ptr<ForwardConnection> connection;
165 {
166 std::lock_guard<std::mutex> lock(connections_mutex);
-> 167 connection = connections[leader][runner_id];
168 }
169
170 if (connection && connection->isConnected())
We can see that the crash occurs at line 167, connection = connections[leader][runner_id], where connections is a std::unordered_map whose values are std::vector.
From the method signature in the stack we already know runner_id=7. Next, check the value of leader:
(lldb) frame variable leader
(int32_t) leader = 3
Next, inspect the value of connections. connections is a field of `RequestForwarder`, so we can use the object's memory address 0x00007fb58d443408:
(lldb) expr ((RK::RequestForwarder*)0x00007fb58d443408)->connections
(std::unordered_map<unsigned int, std::vector<std::shared_ptr<RK::ForwardConnection> > >) $0 = size=3 {
[0] = {
first = 3
second = size=0 {}
}
[1] = {
first = 1
second = size=48 {
[0] = std::__1::shared_ptr<RK::ForwardConnection>::element_type @ 0x00007fb58c705798 strong=1 weak=1 {
__ptr_ = 0x00007fb58c705798
}
[1] = std::__1::shared_ptr<RK::ForwardConnection>::element_type @ 0x00007fb58c705958 strong=1 weak=1 {
__ptr_ = 0x00007fb58c705958
}
...
[2] = {
first = 2
second = size=48 {
[0] = std::__1::shared_ptr<RK::ForwardConnection>::element_type @ 0x00007fb58c700398 strong=1 weak=1 {
__ptr_ = 0x00007fb58c700398
}
[1] = std::__1::shared_ptr<RK::ForwardConnection>::element_type @ 0x00007fb58c700558 strong=1 weak=1 {
__ptr_ = 0x00007fb58c700558
}
[2] = std::__1::shared_ptr<RK::ForwardConnection>::element_type @ 0x00007fb58c700718 strong=1 weak=1 {
__ptr_ = 0x00007fb58c700718
}
...
We can see that connections holds 3 entries, but the vector under key 3 is empty, which is the direct cause of the crash.
By RaftKeeper's design, each node establishes connections to the other two nodes; since this is a 3-node cluster, connections should contain only 2 entries.
Next, check the id of the node that crashed:
/// Note that this->server is a std::shared_ptr
(lldb) expr this->server
(std::shared_ptr<RK::KeeperServer>) $9 = std::__1::shared_ptr<RK::KeeperServer>::element_type @ 0x00007fb58c7001d8 strong=4 weak=1 {
__ptr_ = 0x00007fb58c7001d8
}
(lldb) expr ((RK::KeeperServer*)0x00007fb58c7001d8)
(RK::KeeperServer *) $10 = 0x00007fb58c7001d8
(lldb) expr ((RK::KeeperServer*)0x00007fb58c7001d8)->my_id
(int32_t) $11 = 3
By RaftKeeper's design, a node never connects to itself, so there is a logic error here.
Looking closely at the code:
if (!server->isLeader() && server->isLeaderAlive())
{
int32_t leader = server->getLeader();
ptr<ForwardConnection> connection;
{
std::lock_guard<std::mutex> lock(connections_mutex);
connection = connections[leader][runner_id];
}
We can see that server->isLeader() returned false, but by the time int32_t leader = server->getLeader() executed, the current node had become the leader, which in turn triggered the memory error.
The fix is straightforward: validate the leader value when fetching the connection.
3. Summary
This example demonstrates how to use lldb to inspect the stack of the thread that caused the crash, select a frame, view a function's local variables, view the fields of a class, and examine the contents of a shared_ptr object.
For details, see RaftKeeper#334.
This work is licensed under the Creative Commons Attribution 4.0 International License. Please include a link to the original article when reposting.