
Infinite loop with concurrent ssh_connect() on CentOS 6


I'm developing a multi-host ssh tool using libssh 0.6.3.  The tool
establishes connections asynchronously and in parallel.  Intermittently,
the tool will get stuck in a busy loop with a stack trace such as the
following:

Thread 13 (Thread 0x7f65f9b41700 (LWP 13549)):
#0  0x0000003ee3adc613 in poll () from /lib64/libc.so.6
#1  0x0000003ee3b0fe3c in clntudp_call () from /lib64/libc.so.6
#2  0x0000003ee76058bb in do_ypcall () from /lib64/libnsl.so.1
#3  0x0000003ee76060ab in yp_match () from /lib64/libnsl.so.1
#4  0x00007f65f1f14f79 in _nss_nis_getpwuid_r () from /lib64/libnss_nis.so.2
#5  0x0000003ee3aaa4ed in getpwuid_r@@GLIBC_2.2.5 () from /lib64/libc.so.6
#6  0x00007f6601b0152e in ssh_path_expand_tilde () from /opt/hypertable/doug/0.9.8.2/lib/libssh.so.4
#7  0x00007f6601b02bc3 in ssh_options_set () from /opt/hypertable/doug/0.9.8.2/lib/libssh.so.4
#8  0x00007f6601b036fb in ssh_options_apply () from /opt/hypertable/doug/0.9.8.2/lib/libssh.so.4
#9  0x00007f6601af68fe in ssh_connect () from /opt/hypertable/doug/0.9.8.2/lib/libssh.so.4
#10 0x000000000043547a in Hypertable::SshSocketHandler::handle(int, int) ()
#11 0x0000000000484b9a in Hypertable::IOHandlerRaw::handle_event(epoll_event*, long) ()
#12 0x0000000000492ac4 in Hypertable::ReactorRunner::operator()() ()
#13 0x00007f66010f8ce3 in thread_proxy () from /opt/hypertable/doug/0.9.8.2/lib/libboost_thread.so.1.54.0
#14 0x0000003ee42077f1 in start_thread () from /lib64/libpthread.so.0
#15 0x0000003ee3ae5ccd in clone () from /lib64/libc.so.6

I did a little digging around and came across a ticket filed against sssd
<https://fedorahosted.org/sssd/ticket/640> which I believe is the source of
the problem.  It appears that getpwuid_r() is not thread safe under certain
circumstances.
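
For anyone who wants to probe this on their own setup, a stress loop along
the following lines might help. This is only a rough sketch: the helper name
count_getpwuid_errors() is made up for illustration, the buffer size is an
arbitrary choice, and there is no guarantee it triggers the hang. On an
affected configuration the call may simply never return rather than report
an error.

```cpp
#include <pwd.h>
#include <sys/types.h>
#include <unistd.h>

#include <atomic>
#include <thread>
#include <vector>

// Hypothetical stress helper (not part of libssh or Hypertable).
// Each thread repeatedly looks up the current user with getpwuid_r().
// Returns the number of calls that returned a non-zero error code.
int count_getpwuid_errors(int num_threads, int iterations) {
  std::atomic<int> errors(0);
  std::vector<std::thread> workers;
  for (int t = 0; t < num_threads; ++t) {
    workers.emplace_back([&errors, iterations]() {
      // Per-thread buffer; 16 KiB is an assumed "large enough" size.
      std::vector<char> buf(16384);
      for (int i = 0; i < iterations; ++i) {
        struct passwd pw;
        struct passwd *result = nullptr;
        int rc = getpwuid_r(getuid(), &pw, buf.data(), buf.size(), &result);
        if (rc != 0)
          ++errors;
      }
    });
  }
  for (auto &w : workers)
    w.join();
  return errors.load();
}
```

Calling count_getpwuid_errors(16, 100000) from a small driver program, with
NIS or sssd configured as on the affected hosts, would be the obvious thing
to try.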

From what I can tell, ssh_connect() will use ~/.ssh as the ssh directory if
one is not explicitly supplied.  It's during the expansion of the ~
character that getpwuid_r() gets called.  The workaround is to explicitly
set the ssh directory using a path that does not include the ~ character,
for example:

// Build the ssh directory from $HOME so that libssh does not need to
// expand '~' for it.
const char *home = getenv("HOME");
if (home == nullptr)
  error("Environment variable HOME is not set");
string ssh_dir(home);
ssh_dir.append("/.ssh");
ssh_options_set(m_ssh_session, SSH_OPTIONS_SSH_DIR, ssh_dir.c_str());

Attached is a patch to libssh that eliminates the ~ expansion for the
default case (~/.ssh).  In my test environment, the problem is very
intermittent and I don't have a reproducible test case, so I'm not 100%
sure this solution solves the problem.  However, given the evidence, I
think it's a safe bet.
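
On the application side, the workaround above can be wrapped in a small
helper so the path is computed once, before any connection threads start.
The sketch below is one possible shape for it: ssh_dir_from_home() is a
hypothetical name, and returning an empty string for a missing HOME is an
assumption about how the caller wants to handle that case.

```cpp
#include <cstdlib>
#include <string>

// Hypothetical helper (not part of libssh): builds "<home>/.ssh" from the
// given home directory without any '~' expansion. Returns an empty string
// if home is null or empty, so the caller can decide how to report it.
std::string ssh_dir_from_home(const char *home) {
  if (home == nullptr || *home == '\0')
    return std::string();
  std::string dir(home);
  if (dir.back() != '/')
    dir.push_back('/');
  dir.append(".ssh");
  return dir;
}

// Intended use, assuming a valid ssh_session named session:
//   std::string dir = ssh_dir_from_home(std::getenv("HOME"));
//   if (!dir.empty())
//     ssh_options_set(session, SSH_OPTIONS_SSH_DIR, dir.c_str());
```

Computing the string once and reusing it for every session keeps the lookup
off the connection path entirely.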

- Doug

-- 
Doug Judd
www.hypertable.com

Attachment: ssh_dir_tilde_expansion_problem.patch
Description: Binary data

