RaftKeeper is a high-performance distributed consensus service that we run at large scale in production as the metadata service for ClickHouse. This post records a session of analyzing a core dump from the RaftKeeper project with LLDB.
Since there is little material online about analyzing core dump files, I hope this can be of some help.
1. Problem Symptoms
One node of a three-node RaftKeeper cluster hit a segmentation fault and generated a core dump file.
2. Analysis
First, open the core dump file and inspect the stack of the thread that triggered the crash:
lldb -c core-ReqFwdRecv#43-1384962-1009-1720544680 tmp/raftkeeper
(lldb) bt
* thread #1, name = 'raftkeeper', stop reason = signal SIGSEGV
* frame #0: 0x00000000006fe12c raftkeeper`RK::RequestForwarder::runReceive(unsigned long) [inlined] std::__1::shared_ptr<RK::ForwardConnection>::shared_ptr(this=<unavailable>, __r=<unavailable>) at memory:3097:18
frame #1: 0x00000000006fe12c raftkeeper`RK::RequestForwarder::runReceive(unsigned long) [inlined] std::__1::shared_ptr<RK::ForwardConnection>::operator=(this=0x00007fb53f3a9fd0, __r=<unavailable>) at memory:3220:5
frame #2: 0x00000000006fe12c raftkeeper`RK::RequestForwarder::runReceive(this=0x00007fb58d443408, runner_id=7) at RequestForwarder.cpp:167:32
frame #3: 0x000000000063cc3e raftkeeper`ThreadPoolImpl<ThreadFromGlobalPool>::worker(std::__1::__list_iterator<ThreadFromGlobalPool, void*>) [inlined] std::__1::__function::__policy_func<void ()>::operator()(this=0x00007fb53f3aa100) const at functional:2221:16
frame #4: 0x000000000063cc35 raftkeeper`ThreadPoolImpl<ThreadFromGlobalPool>::worker(std::__1::__list_iterator<ThreadFromGlobalPool, void*>) [inlined] std::__1::function<void ()>::operator()(this=0x00007fb53f3aa100) const at functional:2560:12
frame #5: 0x000000000063cc35 raftkeeper`ThreadPoolImpl<ThreadFromGlobalPool>::worker(this=0x00007fb58c853998, thread_it=std::__1::list<ThreadFromGlobalPool, std::__1::allocator<ThreadFromGlobalPool> >::iterator @ 0x00007fb53f3aa0f8) at ThreadPool.cpp:265:17
Then look at the relevant code by selecting frame 2:
(lldb) frame select 2
frame #2: 0x00000000006fe12c raftkeeper`RK::RequestForwarder::runReceive(this=0x00007fb58d443408, runner_id=7) at RequestForwarder.cpp:167:32
164 ptr<ForwardConnection> connection;
165 {
166 std::lock_guard<std::mutex> lock(connections_mutex);
-> 167 connection = connections[leader][runner_id];
168 }
169
170 if (connection && connection->isConnected())
We can see that the crash occurs at line 167, connection = connections[leader][runner_id], where connections is a std::unordered_map whose values are std::vector.
From the method signature in the stack we already know runner_id=7. Next, check the value of leader:
(lldb) frame variable leader
(int32_t) leader = 3
Next, inspect the value of connections. connections is a field of `RequestForwarder`, so we can use the object's memory address 0x00007fb58d443408:
(lldb) expr ((RK::RequestForwarder*)0x00007fb58d443408)->connections
(std::unordered_map<unsigned int, std::vector<std::shared_ptr<RK::ForwardConnection> > >) $0 = size=3 {
[0] = {
first = 3
second = size=0 {}
}
[1] = {
first = 1
second = size=48 {
[0] = std::__1::shared_ptr<RK::ForwardConnection>::element_type @ 0x00007fb58c705798 strong=1 weak=1 {
__ptr_ = 0x00007fb58c705798
}
[1] = std::__1::shared_ptr<RK::ForwardConnection>::element_type @ 0x00007fb58c705958 strong=1 weak=1 {
__ptr_ = 0x00007fb58c705958
}
...
[2] = {
first = 2
second = size=48 {
[0] = std::__1::shared_ptr<RK::ForwardConnection>::element_type @ 0x00007fb58c700398 strong=1 weak=1 {
__ptr_ = 0x00007fb58c700398
}
[1] = std::__1::shared_ptr<RK::ForwardConnection>::element_type @ 0x00007fb58c700558 strong=1 weak=1 {
__ptr_ = 0x00007fb58c700558
}
[2] = std::__1::shared_ptr<RK::ForwardConnection>::element_type @ 0x00007fb58c700718 strong=1 weak=1 {
__ptr_ = 0x00007fb58c700718
}
...
We can see that connections holds 3 entries, but the vector under key 3 is empty, which is the direct cause of the crash.
By RaftKeeper's design, each node establishes connections to the other two nodes; since this is a 3-node cluster, connections should contain only 2 entries.
Next, check the id of the node that crashed:
/// Note that this->server is a std::shared_ptr
(lldb) expr this->server
(std::shared_ptr<RK::KeeperServer>) $9 = std::__1::shared_ptr<RK::KeeperServer>::element_type @ 0x00007fb58c7001d8 strong=4 weak=1 {
__ptr_ = 0x00007fb58c7001d8
}
(lldb) expr ((RK::KeeperServer*)0x00007fb58c7001d8)
(RK::KeeperServer *) $10 = 0x00007fb58c7001d8
(lldb) expr ((RK::KeeperServer*)0x00007fb58c7001d8)->my_id
(int32_t) $11 = 3
By RaftKeeper's design, a node never connects to itself, so there is a logic error here.
Looking closely at the code:
if (!server->isLeader() && server->isLeaderAlive())
{
int32_t leader = server->getLeader();
ptr<ForwardConnection> connection;
{
std::lock_guard<std::mutex> lock(connections_mutex);
connection = connections[leader][runner_id];
}
We can see that server->isLeader() returned false, but by the time int32_t leader = server->getLeader() executed, the current node had become the leader, which in turn triggered the memory error.
The fix is straightforward: validate the leader value when fetching the connection.
3. Summary
This example demonstrates how to use lldb to inspect the stack of the thread that caused the crash, select a frame, view a function's local variables, view the fields of a class, and examine the contents of a shared_ptr object.
For details, see RaftKeeper#334.
This work is licensed under the Creative Commons Attribution 4.0 International License. Please include a link to the original article when reposting.