RaftKeeper is a high-performance distributed consensus service that we use extensively in production for ClickHouse metadata services. This article documents the process of using LLDB to analyze a core dump file from a RaftKeeper process.
Since resources on analyzing core dump files are scarce online, I hope this can be of some help to others.
1. Problem Description
One of the nodes in a three-node RaftKeeper cluster experienced a segmentation fault and generated a Core Dump file.
2. Analysis
First, open the core dump file with LLDB and examine the stack trace of the thread that triggered the crash.
lldb -c core-ReqFwdRecv#43-1384962-1009-1720544680 tmp/raftkeeper
(lldb) bt
* thread #1, name = 'raftkeeper', stop reason = signal SIGSEGV
* frame #0: 0x00000000006fe12c raftkeeper`RK::RequestForwarder::runReceive(unsigned long) [inlined] std::__1::shared_ptr<RK::ForwardConnection>::shared_ptr(this=<unavailable>, __r=<unavailable>) at memory:3097:18
frame #1: 0x00000000006fe12c raftkeeper`RK::RequestForwarder::runReceive(unsigned long) [inlined] std::__1::shared_ptr<RK::ForwardConnection>::operator=(this=0x00007fb53f3a9fd0, __r=<unavailable>) at memory:3220:5
frame #2: 0x00000000006fe12c raftkeeper`RK::RequestForwarder::runReceive(this=0x00007fb58d443408, runner_id=7) at RequestForwarder.cpp:167:32
frame #3: 0x000000000063cc3e raftkeeper`ThreadPoolImpl<ThreadFromGlobalPool>::worker(std::__1::__list_iterator<ThreadFromGlobalPool, void*>) [inlined] std::__1::__function::__policy_func<void ()>::operator()(this=0x00007fb53f3aa100) const at functional:2221:16
frame #4: 0x000000000063cc35 raftkeeper`ThreadPoolImpl<ThreadFromGlobalPool>::worker(std::__1::__list_iterator<ThreadFromGlobalPool, void*>) [inlined] std::__1::function<void ()>::operator()(this=0x00007fb53f3aa100) const at functional:2560:12
frame #5: 0x000000000063cc35 raftkeeper`ThreadPoolImpl<ThreadFromGlobalPool>::worker(this=0x00007fb58c853998, thread_it=std::__1::list<ThreadFromGlobalPool, std::__1::allocator<ThreadFromGlobalPool> >::iterator @ 0x00007fb53f3aa0f8) at ThreadPool.cpp:265:17
Next, inspect the relevant code by selecting frame 2.
(lldb) frame select 2
frame #2: 0x00000000006fe12c raftkeeper`RK::RequestForwarder::runReceive(this=0x00007fb58d443408, runner_id=7) at RequestForwarder.cpp:167:32
164 ptr<ForwardConnection> connection;
165 {
166 std::lock_guard<std::mutex> lock(connections_mutex);
-> 167 connection = connections[leader][runner_id];
168 }
169
170 if (connection && connection->isConnected())
You can see that the issue occurs at line 167: `connection = connections[leader][runner_id]`. Here, `connections` is a `std::unordered_map` keyed by node id, whose values are `std::vector`s of `std::shared_ptr<RK::ForwardConnection>`.
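To see why this line is dangerous, here is a minimal, self-contained sketch of the same access pattern. The map declaration mirrors the type lldb reports for `connections` below; everything else (the placeholder ForwardConnection, the sample values) is illustrative only.
#include <cstdint>
#include <memory>
#include <unordered_map>
#include <vector>

struct ForwardConnection {}; // stand-in for RK::ForwardConnection

int main()
{
    // Same shape as the field lldb shows below: node id -> one connection per runner.
    std::unordered_map<unsigned int, std::vector<std::shared_ptr<ForwardConnection>>> connections;
    connections[1].resize(48); // slots for connections to node 1
    connections[2].resize(48); // slots for connections to node 2

    uint32_t leader = 42;      // a key that was never populated
    size_t runner_id = 7;

    // unordered_map::operator[] default-constructs the value for a missing key,
    // so this silently yields an empty vector instead of failing.
    auto & per_leader = connections[leader];

    // vector::operator[] performs no bounds check: per_leader[runner_id] on an
    // empty vector is undefined behaviour (typically a SIGSEGV), while
    // per_leader.at(runner_id) would throw std::out_of_range instead.
    return static_cast<int>(per_leader.size()); // 0
}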
From the method signature in the stack trace, we know that `runner_id` is 7. Next, let's check the value of `leader`.
(lldb) frame variable leader
(int32_t) leader = 3
Next, let's examine the value of `connections`. `connections` is a field of the `RequestForwarder` class, so we can inspect it through the memory address of `this`, 0x00007fb58d443408.
(lldb) expr ((RK::RequestForwarder*)0x00007fb58d443408)->connections
(std::unordered_map<unsigned int, std::vector<std::shared_ptr<RK::ForwardConnection> > >) $0 = size=3 {
[0] = {
first = 3
second = size=0 {}
}
[1] = {
first = 1
second = size=48 {
[0] = std::__1::shared_ptr<RK::ForwardConnection>::element_type @ 0x00007fb58c705798 strong=1 weak=1 {
__ptr_ = 0x00007fb58c705798
}
[1] = std::__1::shared_ptr<RK::ForwardConnection>::element_type @ 0x00007fb58c705958 strong=1 weak=1 {
__ptr_ = 0x00007fb58c705958
}
...
[2] = {
first = 2
second = size=48 {
[0] = std::__1::shared_ptr<RK::ForwardConnection>::element_type @ 0x00007fb58c700398 strong=1 weak=1 {
__ptr_ = 0x00007fb58c700398
}
[1] = std::__1::shared_ptr<RK::ForwardConnection>::element_type @ 0x00007fb58c700558 strong=1 weak=1 {
__ptr_ = 0x00007fb58c700558
}
[2] = std::__1::shared_ptr<RK::ForwardConnection>::element_type @ 0x00007fb58c700718 strong=1 weak=1 {
__ptr_ = 0x00007fb58c700718
}
...
We can observe that `connections` contains 3 entries, but the vector for key 3 is empty; indexing that empty vector with `runner_id` = 7 is out of bounds, which is the direct cause of the crash. (The entry for key 3 most likely did not exist before the faulting call: `unordered_map::operator[]` default-inserts an empty vector for a missing key.)
According to RaftKeeper's logic, each node only establishes connections with the other two nodes, so in a 3-node cluster `connections` should contain only 2 entries.
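For context, a map like this is normally populated only for the other members of the cluster. A hypothetical sketch of such initialization (the initConnections helper and its parameters are illustrative, not RaftKeeper's actual code):
#include <cstdint>
#include <memory>
#include <unordered_map>
#include <vector>

struct ForwardConnection {}; // stand-in for RK::ForwardConnection

using ConnectionMap =
    std::unordered_map<unsigned int, std::vector<std::shared_ptr<ForwardConnection>>>;

// Create one slot per forwarding runner, but only for *other* cluster members;
// a node never forwards to itself, so a 3-node cluster ends up with 2 entries.
void initConnections(const std::vector<int32_t> & cluster_ids, int32_t my_id,
                     size_t runner_count, ConnectionMap & connections)
{
    for (int32_t id : cluster_ids)
    {
        if (id == my_id)
            continue;
        connections[id].resize(runner_count);
    }
}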
Next, let’s check the ID of the node where the core dump occurred.
/// Note that this->server is a std::shared_ptr
(lldb) expr this->server
(std::shared_ptr<RK::KeeperServer>) $9 = std::__1::shared_ptr<RK::KeeperServer>::element_type @ 0x00007fb58c7001d8 strong=4 weak=1 {
__ptr_ = 0x00007fb58c7001d8
}
(lldb) expr ((RK::KeeperServer*)0x00007fb58c7001d8)
(RK::KeeperServer *) $10 = 0x00007fb58c7001d8
(lldb) expr ((RK::KeeperServer*)0x00007fb58c7001d8)->my_id
(int32_t) $11 = 3
According to RaftKeeper's logic, a node never establishes a connection with itself, yet here `leader` equals the node's own id, so there is a logical error somewhere. Let's carefully review the code.
if (!server->isLeader() && server->isLeaderAlive())
{
int32_t leader = server->getLeader();
ptr<ForwardConnection> connection;
{
std::lock_guard<std::mutex> lock(connections_mutex);
connection = connections[leader][runner_id];
}
We can see that `server->isLeader()` returned false, but by the time `int32_t leader = server->getLeader()` executed, the current node had become the leader (a leader election completed in between). `getLeader()` therefore returned the node's own id, 3, for which no connections exist, and the out-of-bounds access described above triggered the memory error.
The fix is straightforward: validate the leader value when obtaining the connection.
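A minimal sketch of what such a check could look like, reusing the names from the snippet above (the exact fix adopted upstream is in RaftKeeper#334):
if (!server->isLeader() && server->isLeaderAlive())
{
    int32_t leader = server->getLeader();
    ptr<ForwardConnection> connection;
    {
        std::lock_guard<std::mutex> lock(connections_mutex);
        /// Leadership may have changed between isLeader() and getLeader(), so
        /// `leader` can now be this node's own id. Only take an entry that
        /// exists and actually holds a connection for this runner.
        auto it = connections.find(leader);
        if (it != connections.end() && runner_id < it->second.size())
            connection = it->second[runner_id];
    }
}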
3. Summary
This example demonstrates how to use LLDB to examine the stack trace of the thread that caused the issue, select a frame, view a method's local variables, inspect a class's fields, and view the contents of a std::shared_ptr object.
For more details, refer to: RaftKeeper#334.