RaftKeeper is a high-performance distributed consensus service that we use extensively in production for ClickHouse metadata services. This article documents the process of using LLDB to analyze a core dump file from a RaftKeeper process.
Since resources on analyzing core dump files are scarce online, I hope this can be of some help to others.
1. Problem Description
One of the nodes in a three-node RaftKeeper cluster experienced a segmentation fault and generated a Core Dump file.
2. Analysis
First, open the core dump file with LLDB and examine the stack trace of the thread that triggered the crash.
lldb -c core-ReqFwdRecv#43-1384962-1009-1720544680 tmp/raftkeeper
(lldb) bt
* thread #1, name = 'raftkeeper', stop reason = signal SIGSEGV
* frame #0: 0x00000000006fe12c raftkeeper`RK::RequestForwarder::runReceive(unsigned long) [inlined] std::__1::shared_ptr<RK::ForwardConnection>::shared_ptr(this=<unavailable>, __r=<unavailable>) at memory:3097:18
frame #1: 0x00000000006fe12c raftkeeper`RK::RequestForwarder::runReceive(unsigned long) [inlined] std::__1::shared_ptr<RK::ForwardConnection>::operator=(this=0x00007fb53f3a9fd0, __r=<unavailable>) at memory:3220:5
frame #2: 0x00000000006fe12c raftkeeper`RK::RequestForwarder::runReceive(this=0x00007fb58d443408, runner_id=7) at RequestForwarder.cpp:167:32
frame #3: 0x000000000063cc3e raftkeeper`ThreadPoolImpl<ThreadFromGlobalPool>::worker(std::__1::__list_iterator<ThreadFromGlobalPool, void*>) [inlined] std::__1::__function::__policy_func<void ()>::operator()(this=0x00007fb53f3aa100) const at functional:2221:16
frame #4: 0x000000000063cc35 raftkeeper`ThreadPoolImpl<ThreadFromGlobalPool>::worker(std::__1::__list_iterator<ThreadFromGlobalPool, void*>) [inlined] std::__1::function<void ()>::operator()(this=0x00007fb53f3aa100) const at functional:2560:12
frame #5: 0x000000000063cc35 raftkeeper`ThreadPoolImpl<ThreadFromGlobalPool>::worker(this=0x00007fb58c853998, thread_it=std::__1::list<ThreadFromGlobalPool, std::__1::allocator<ThreadFromGlobalPool> >::iterator @ 0x00007fb53f3aa0f8) at ThreadPool.cpp:265:17
Next, inspect the relevant code by selecting frame 2.
(lldb) frame select 2
frame #2: 0x00000000006fe12c raftkeeper`RK::RequestForwarder::runReceive(this=0x00007fb58d443408, runner_id=7) at RequestForwarder.cpp:167:32
164 ptr<ForwardConnection> connection;
165 {
166 std::lock_guard<std::mutex> lock(connections_mutex);
-> 167 connection = connections[leader][runner_id];
168 }
169
170 if (connection && connection->isConnected())
You can see that the issue occurs at line 167: `connection = connections[leader][runner_id]`. Here, `connections` is a `std::unordered_map` keyed by node id, whose values are `std::vector`s of `std::shared_ptr<RK::ForwardConnection>`.
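To see why this line is dangerous, here is a minimal, self-contained sketch of the same access pattern. The map declaration mirrors the type lldb reports for `connections` below; everything else (the placeholder ForwardConnection, the sample values) is illustrative only.
#include <cstdint>
#include <memory>
#include <unordered_map>
#include <vector>

struct ForwardConnection {}; // stand-in for RK::ForwardConnection

int main()
{
    // Same shape as the field lldb shows below: node id -> one connection per runner.
    std::unordered_map<unsigned int, std::vector<std::shared_ptr<ForwardConnection>>> connections;
    connections[1].resize(48); // slots for connections to node 1
    connections[2].resize(48); // slots for connections to node 2

    uint32_t leader = 42;      // a key that was never populated
    size_t runner_id = 7;

    // unordered_map::operator[] default-constructs the value for a missing key,
    // so this silently yields an empty vector instead of failing.
    auto & per_leader = connections[leader];

    // vector::operator[] performs no bounds check: per_leader[runner_id] on an
    // empty vector is undefined behaviour (typically a SIGSEGV), while
    // per_leader.at(runner_id) would throw std::out_of_range instead.
    return static_cast<int>(per_leader.size()); // 0
}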
From the method signature in the stack trace, we know that `runner_id` is 7. Next, let's check the value of `leader`.
(lldb) frame variable leader
(int32_t) leader = 3
Next, let's examine the value of `connections`. `connections` is a field of the `RequestForwarder` class, so we can inspect it through the memory address of `this`, 0x00007fb58d443408.
(lldb) expr ((RK::RequestForwarder*)0x00007fb58d443408)->connections
(std::unordered_map<unsigned int, std::vector<std::shared_ptr<RK::ForwardConnection> > >) $0 = size=3 {
[0] = {
first = 3
second = size=0 {}
}
[1] = {
first = 1
second = size=48 {
[0] = std::__1::shared_ptr<RK::ForwardConnection>::element_type @ 0x00007fb58c705798 strong=1 weak=1 {
__ptr_ = 0x00007fb58c705798
}
[1] = std::__1::shared_ptr<RK::ForwardConnection>::element_type @ 0x00007fb58c705958 strong=1 weak=1 {
__ptr_ = 0x00007fb58c705958
}
...
[2] = {
first = 2
second = size=48 {
[0] = std::__1::shared_ptr<RK::ForwardConnection>::element_type @ 0x00007fb58c700398 strong=1 weak=1 {
__ptr_ = 0x00007fb58c700398
}
[1] = std::__1::shared_ptr<RK::ForwardConnection>::element_type @ 0x00007fb58c700558 strong=1 weak=1 {
__ptr_ = 0x00007fb58c700558
}
[2] = std::__1::shared_ptr<RK::ForwardConnection>::element_type @ 0x00007fb58c700718 strong=1 weak=1 {
__ptr_ = 0x00007fb58c700718
}
...
We can observe that `connections` contains 3 entries, but the vector for key 3 is empty; indexing that empty vector with `runner_id` = 7 is out of bounds, which is the direct cause of the crash. (The entry for key 3 most likely did not exist before the faulting call: `unordered_map::operator[]` default-inserts an empty vector for a missing key.)
According to RaftKeeper's logic, each node only establishes connections with the other two nodes, so in a 3-node cluster `connections` should contain only 2 entries.
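For context, a map like this is normally populated only for the other members of the cluster. A hypothetical sketch of such initialization (the initConnections helper and its parameters are illustrative, not RaftKeeper's actual code):
#include <cstdint>
#include <memory>
#include <unordered_map>
#include <vector>

struct ForwardConnection {}; // stand-in for RK::ForwardConnection

using ConnectionMap =
    std::unordered_map<unsigned int, std::vector<std::shared_ptr<ForwardConnection>>>;

// Create one slot per forwarding runner, but only for *other* cluster members;
// a node never forwards to itself, so a 3-node cluster ends up with 2 entries.
void initConnections(const std::vector<int32_t> & cluster_ids, int32_t my_id,
                     size_t runner_count, ConnectionMap & connections)
{
    for (int32_t id : cluster_ids)
    {
        if (id == my_id)
            continue;
        connections[id].resize(runner_count);
    }
}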
Next, let’s check the ID of the node where the core dump occurred.
/// Note that this->server is a std::shared_ptr
(lldb) expr this->server
(std::shared_ptr<RK::KeeperServer>) $9 = std::__1::shared_ptr<RK::KeeperServer>::element_type @ 0x00007fb58c7001d8 strong=4 weak=1 {
__ptr_ = 0x00007fb58c7001d8
}
(lldb) expr ((RK::KeeperServer*)0x00007fb58c7001d8)
(RK::KeeperServer *) $10 = 0x00007fb58c7001d8
(lldb) expr ((RK::KeeperServer*)0x00007fb58c7001d8)->my_id
(int32_t) $11 = 3
According to RaftKeeper's logic, a node never establishes a connection with itself, yet here `leader` equals the node's own id, so there is a logical error somewhere. Let's carefully review the code.
if (!server->isLeader() && server->isLeaderAlive())
{
int32_t leader = server->getLeader();
ptr<ForwardConnection> connection;
{
std::lock_guard<std::mutex> lock(connections_mutex);
connection = connections[leader][runner_id];
}
We can see that `server->isLeader()` returned false, but by the time `int32_t leader = server->getLeader()` executed, the current node had become the leader (a leader election completed in between). `getLeader()` therefore returned the node's own id, 3, for which no connections exist, and the out-of-bounds access described above triggered the memory error.
The fix is straightforward: validate the leader value when obtaining the connection.
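A minimal sketch of what such a check could look like, reusing the names from the snippet above (the exact fix adopted upstream is in RaftKeeper#334):
if (!server->isLeader() && server->isLeaderAlive())
{
    int32_t leader = server->getLeader();
    ptr<ForwardConnection> connection;
    {
        std::lock_guard<std::mutex> lock(connections_mutex);
        /// Leadership may have changed between isLeader() and getLeader(), so
        /// `leader` can now be this node's own id. Only take an entry that
        /// exists and actually holds a connection for this runner.
        auto it = connections.find(leader);
        if (it != connections.end() && runner_id < it->second.size())
            connection = it->second[runner_id];
    }
}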
3. Summary
This example demonstrates how to use LLDB to examine the stack trace of the thread that caused the issue, select a frame, view a method's local variables, inspect a class's fields, and view the contents of a std::shared_ptr object.
For more details, refer to: RaftKeeper#334.