
Analyzing Core Dump Files with LLDB

RaftKeeper is a high-performance distributed consensus service that we run at scale in production as the metadata service for ClickHouse. This post records one session of analyzing a core dump file from the RaftKeeper project with LLDB.

There is not much material online about analyzing core dump files, so I hope this walkthrough can be of some help.

1. The Problem

One node of a three-node RaftKeeper cluster hit a segmentation fault and produced a core dump file.

2. Analysis

First, open the core dump file and look at the backtrace of the thread that triggered the crash:

lldb -c core-ReqFwdRecv#43-1384962-1009-1720544680 tmp/raftkeeper

(lldb) bt

* thread #1, name = 'raftkeeper', stop reason = signal SIGSEGV
  * frame #0: 0x00000000006fe12c raftkeeper`RK::RequestForwarder::runReceive(unsigned long) [inlined] std::__1::shared_ptr<RK::ForwardConnection>::shared_ptr(this=<unavailable>, __r=<unavailable>) at memory:3097:18
    frame #1: 0x00000000006fe12c raftkeeper`RK::RequestForwarder::runReceive(unsigned long) [inlined] std::__1::shared_ptr<RK::ForwardConnection>::operator=(this=0x00007fb53f3a9fd0, __r=<unavailable>) at memory:3220:5
    frame #2: 0x00000000006fe12c raftkeeper`RK::RequestForwarder::runReceive(this=0x00007fb58d443408, runner_id=7) at RequestForwarder.cpp:167:32
    frame #3: 0x000000000063cc3e raftkeeper`ThreadPoolImpl<ThreadFromGlobalPool>::worker(std::__1::__list_iterator<ThreadFromGlobalPool, void*>) [inlined] std::__1::__function::__policy_func<void ()>::operator()(this=0x00007fb53f3aa100) const at functional:2221:16
    frame #4: 0x000000000063cc35 raftkeeper`ThreadPoolImpl<ThreadFromGlobalPool>::worker(std::__1::__list_iterator<ThreadFromGlobalPool, void*>) [inlined] std::__1::function<void ()>::operator()(this=0x00007fb53f3aa100) const at functional:2560:12
    frame #5: 0x000000000063cc35 raftkeeper`ThreadPoolImpl<ThreadFromGlobalPool>::worker(this=0x00007fb58c853998, thread_it=std::__1::list<ThreadFromGlobalPool, std::__1::allocator<ThreadFromGlobalPool> >::iterator @ 0x00007fb53f3aa0f8) at ThreadPool.cpp:265:17

Then look at the relevant code by selecting frame 2:

(lldb) frame select 2

frame #2: 0x00000000006fe12c raftkeeper`RK::RequestForwarder::runReceive(this=0x00007fb58d443408, runner_id=7) at RequestForwarder.cpp:167:32
   164 	                ptr<ForwardConnection> connection;
   165 	                {
   166 	                    std::lock_guard<std::mutex> lock(connections_mutex);
-> 167 	                    connection = connections[leader][runner_id];
   168 	                }
   169 	
   170 	                if (connection && connection->isConnected())

The crash happens at line 167, connection = connections[leader][runner_id], where connections is a std::unordered_map whose values are std::vector.
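For orientation, the types visible in the backtrace and in the lldb output below suggest the relevant members look roughly like this (a sketch only, inferred from the dump; the actual declaration in RaftKeeper may differ):

#include <memory>
#include <mutex>
#include <unordered_map>
#include <vector>

namespace RK
{
class ForwardConnection;

/// ptr<T> appears to be an alias for std::shared_ptr<T> in the RaftKeeper code base.
template <typename T>
using ptr = std::shared_ptr<T>;

class RequestForwarder
{
    /// Outer key: server id of the leader; inner vector: one connection per runner thread.
    std::unordered_map<unsigned int, std::vector<ptr<ForwardConnection>>> connections;
    std::mutex connections_mutex;
    // ...
};
}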

The method signature in the backtrace shows that runner_id=7. Next, check the value of leader:

(lldb) frame variable leader

(int32_t) leader = 3

Now dig into connections itself. It is a field of RequestForwarder, so we can access it through the object's memory address 0x00007fb58d443408 (the this pointer shown in frame 2):

(lldb) expr ((RK::RequestForwarder*)0x00007fb58d443408)->connections

(std::unordered_map<unsigned int, std::vector<std::shared_ptr<RK::ForwardConnection> > >) $0 = size=3 {
  [0] = {
    first = 3
    second = size=0 {}
  }
  [1] = {
    first = 1
    second = size=48 {
      [0] = std::__1::shared_ptr<RK::ForwardConnection>::element_type @ 0x00007fb58c705798 strong=1 weak=1 {
        __ptr_ = 0x00007fb58c705798
      }
      [1] = std::__1::shared_ptr<RK::ForwardConnection>::element_type @ 0x00007fb58c705958 strong=1 weak=1 {
        __ptr_ = 0x00007fb58c705958
      }
      ...
  [2] = {
    first = 2
    second = size=48 {
      [0] = std::__1::shared_ptr<RK::ForwardConnection>::element_type @ 0x00007fb58c700398 strong=1 weak=1 {
        __ptr_ = 0x00007fb58c700398
      }
      [1] = std::__1::shared_ptr<RK::ForwardConnection>::element_type @ 0x00007fb58c700558 strong=1 weak=1 {
        __ptr_ = 0x00007fb58c700558
      }
      [2] = std::__1::shared_ptr<RK::ForwardConnection>::element_type @ 0x00007fb58c700718 strong=1 weak=1 {
        __ptr_ = 0x00007fb58c700718
      }
      ...

So connections contains three entries, but the vector for key 3 is empty; this is the direct cause of the crash. Most likely, the entry for key 3 was default-inserted at line 167 itself, because std::unordered_map::operator[] creates an empty value for a missing key.
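This failure mode can be reproduced in isolation: std::unordered_map::operator[] default-inserts an empty vector for a missing key, and std::vector::operator[] on that empty vector is an out-of-bounds access. A minimal, self-contained sketch (std::shared_ptr<int> stands in for ForwardConnection):

#include <memory>
#include <unordered_map>
#include <vector>

int main()
{
    std::unordered_map<unsigned int, std::vector<std::shared_ptr<int>>> connections;
    connections[1].resize(48);   // connections to the other two nodes, as in the dump
    connections[2].resize(48);

    unsigned int leader = 3;     // unexpectedly this node's own id
    size_t runner_id = 7;

    // operator[] silently inserts an empty vector for the missing key 3, and
    // element 7 of that empty vector is an out-of-bounds read: undefined
    // behaviour, which the core dump shows as SIGSEGV inside shared_ptr's copy.
    auto connection = connections[leader][runner_id];
    (void)connection;

    return 0;
}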

According to RaftKeeper's logic, each node establishes connections to the other two nodes; since this is a three-node cluster, connections should contain only two entries.

Next, check the id of the node that produced the core dump:

/// Note that this->server is a std::shared_ptr
(lldb) expr this->server
(std::shared_ptr<RK::KeeperServer>) $9 = std::__1::shared_ptr<RK::KeeperServer>::element_type @ 0x00007fb58c7001d8 strong=4 weak=1 {
  __ptr_ = 0x00007fb58c7001d8
}

(lldb) expr ((RK::KeeperServer*)0x00007fb58c7001d8)
(RK::KeeperServer *) $10 = 0x00007fb58c7001d8

(lldb) expr ((RK::KeeperServer*)0x00007fb58c7001d8)->my_id
(int32_t) $11 = 3

According to RaftKeeper's logic, a node does not establish a connection to itself, so something is logically wrong here.

Looking more closely at the code:

if (!server->isLeader() && server->isLeaderAlive())
{
    int32_t leader = server->getLeader();
    ptr<ForwardConnection> connection;
    {
        std::lock_guard<std::mutex> lock(connections_mutex);
        connection = connections[leader][runner_id];
    }

server->isLeader() returned false, but by the time int32_t leader = server->getLeader() was executed, the current node had itself become the leader. getLeader() therefore returned this node's own id (3), for which no connections exist, and that triggered the memory error.

The fix is straightforward: validate the leader value when fetching the connection.
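A minimal sketch of what such a check could look like; this is not the actual patch in RaftKeeper#334, the myId() accessor is a hypothetical wrapper around the my_id field seen above, and the continue statements assume the surrounding processing loop of runReceive:

if (!server->isLeader() && server->isLeaderAlive())
{
    int32_t leader = server->getLeader();

    /// The node may have become leader between the two calls above, in which
    /// case leader is now this node's own id and there is nothing to forward to.
    if (leader == server->myId())   /// hypothetical accessor for my_id
        continue;

    ptr<ForwardConnection> connection;
    {
        std::lock_guard<std::mutex> lock(connections_mutex);
        /// Use find() instead of operator[] so a missing key cannot
        /// default-insert an empty vector, and bounds-check runner_id.
        auto it = connections.find(leader);
        if (it == connections.end() || runner_id >= it->second.size())
            continue;
        connection = it->second[runner_id];
    }

    if (connection && connection->isConnected())
    {
        /// ... forward the request as before ...
    }
}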

3. Summary

This example showed how to use lldb to inspect the backtrace that caused a problem, select a frame, view a function's local variables, view the fields of a class, and view the contents of a std::shared_ptr object.

For details, see: RaftKeeper#334

This work is licensed under the Creative Commons Attribution 4.0 International License. Please include a link to the original article when reposting.
