[Users] memory leak in presence module?

Klaus Darilion klaus.mailinglists at pernau.at
Tue May 8 09:40:28 CEST 2007


Hi again!

This time I started openser with only 1 child and attached gdb to the 
UDP thread, in the hope that gdb would show more info about the signal 
9. But no luck:

# gdb /usr/sbin/openser 2855
...
Reading symbols from /lib/tls/i686/cmov/libnsl.so.1...done.
Loaded symbols for /lib/tls/i686/cmov/libnsl.so.1
Reading symbols from /lib/tls/i686/cmov/libnss_nis.so.2...done.
Loaded symbols for /lib/tls/i686/cmov/libnss_nis.so.2
Reading symbols from /lib/tls/i686/cmov/libnss_files.so.2...done.
Loaded symbols for /lib/tls/i686/cmov/libnss_files.so.2
Failed to read a valid object file image from memory.
0xb7f73410 in ?? ()
(gdb) c
Continuing.

<-- here the thread terminates by signal 9

Couldn't get registers: No such process.
(gdb)
Continuing.
Couldn't get registers: No such process.
(gdb)
Continuing.
Couldn't get registers: No such process.
(gdb)
Continuing.
Couldn't get registers: No such process.
(gdb)

The logfile also shows no hints:

May  8 09:04:24 debian /usr/sbin/openser[2855]: PRESENCE: 
get_subs_dialog:The query for subscribtion for [user]= klaus,[domain]= 
pernau.at for [event]= presence returned no result
May  8 09:04:24 debian /usr/sbin/openser[2855]: 
PRESENCE:query_db_notify: Could not get subs_dialog from database
May  8 09:04:24 debian /usr/sbin/openser[2855]: 
PRESENCE:update_presentity: Could not send Notify
May  8 09:04:24 debian /usr/sbin/openser[2855]: 
e17197948e006b8865b78e750537073b.8e1b///2-2877 at 88.198.53.113 PUBLISH 
detected, handle_publish ... done
May  8 09:04:24 debian /usr/sbin/openser[2854]: child process 2855 
exited by a signal 9
May  8 09:04:24 debian /usr/sbin/openser[2854]: core was not generated
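
For reference, the "exited by a signal 9" line matches what a parent 
process can learn from the wait status of a reaped child. The following 
is a minimal sketch in Python, not openser code: the function name 
reap_killed_child is hypothetical, and the child kills itself only to 
stand in for whatever actually sent the signal. It shows that the wait 
status carries the signal number and the core-dump flag, but nothing 
about the sender.

```python
import os
import signal


def reap_killed_child():
    """Fork a child that dies from SIGKILL and return the parent's view of it."""
    pid = os.fork()
    if pid == 0:
        # Child: stand-in for an external "kill -9". SIGKILL cannot be
        # caught, blocked or ignored, so the child dies right here.
        os.kill(os.getpid(), signal.SIGKILL)
        os._exit(1)  # not reached
    # Parent: block until the child has been reaped, then decode the status.
    _, status = os.waitpid(pid, 0)
    return os.WIFSIGNALED(status), os.WTERMSIG(status), os.WCOREDUMP(status)


if __name__ == "__main__":
    signaled, signo, core = reap_killed_child()
    print(f"signaled={signaled} signo={signo} core={core}")
```

The three returned values correspond to "exited by a signal", the 
signal number, and whether a core was generated.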


I want to track down the signal 9. Who sent the signal 9 to the UDP 
thread? I've searched for tools to monitor signals globally but didn't 
find any. strace only shows which signals are sent/received by the 
traced process - but not who sent the signal.
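
As background on the "who sent it" question: for signals that a process 
can block, the kernel records the sender's PID in the siginfo that 
accompanies the signal. The sketch below is an illustration under 
assumptions, not a fix: the function name is hypothetical, it uses 
SIGTERM and sends the signal to itself, and it assumes a 
single-threaded process on Linux. The same approach cannot be used for 
SIGKILL, which can be neither blocked nor caught, so the receiving 
process never gets a chance to inspect the sender.

```python
import os
import signal


def sender_of_self_sent_sigterm():
    """Send SIGTERM to this process; return (signal number, sender PID)."""
    # Block SIGTERM so that it stays pending instead of killing the process.
    # Assumes a single-threaded process: the mask applies to this thread only.
    old_mask = signal.pthread_sigmask(signal.SIG_BLOCK, {signal.SIGTERM})
    try:
        os.kill(os.getpid(), signal.SIGTERM)
        # Dequeue the pending signal synchronously together with its siginfo.
        info = signal.sigwaitinfo({signal.SIGTERM})
        return info.si_signo, info.si_pid
    finally:
        # Restore the original signal mask.
        signal.pthread_sigmask(signal.SIG_SETMASK, old_mask)


if __name__ == "__main__":
    signo, sender = sender_of_self_sent_sigterm()
    print(f"signal {signo} was sent by pid {sender} (own pid {os.getpid()})")
```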

I have run out of ideas to debug this - thus, please send me your ideas.

regards
klaus




Klaus Darilion wrote:
> Hi Bogdan!
> 
> I've attached with strace to all openser threads and waited for the 
> crash. Here is the strace log of the "attendant" process (ID=0):
> 
> Process 2340 attached - interrupt to quit
> pause()                                 = ? ERESTARTNOHAND (To be 
> restarted)
> --- SIGCHLD (Child exited) @ 0 (0) ---
> sigreturn()                             = ? (mask now [])
> waitpid(-1, [{WIFSIGNALED(s) && WTERMSIG(s) == SIGKILL}], WNOHANG) = 2344
> time([1178548261])                      = 1178548261
> stat64("/etc/localtime", {st_mode=S_IFREG|0644, st_size=801, ...}) = 0
> stat64("/etc/localtime", {st_mode=S_IFREG|0644, st_size=801, ...}) = 0
> stat64("/etc/localtime", {st_mode=S_IFREG|0644, st_size=801, ...}) = 0
> send(3, "<134>May  7 16:31:01 /usr/sbin/o"..., 86, MSG_NOSIGNAL) = 86
> time([1178548262])                      = 1178548262
> stat64("/etc/localtime", {st_mode=S_IFREG|0644, st_size=801, ...}) = 0
> stat64("/etc/localtime", {st_mode=S_IFREG|0644, st_size=801, ...}) = 0
> stat64("/etc/localtime", {st_mode=S_IFREG|0644, st_size=801, ...}) = 0
> send(3, "<134>May  7 16:31:02 /usr/sbin/o"..., 69, MSG_NOSIGNAL) = 69
> waitpid(-1, 0xbfd58ecc, WNOHANG)        = 0
> time([1178548262])                      = 1178548262
> stat64("/etc/localtime", {st_mode=S_IFREG|0644, st_size=801, ...}) = 0
> stat64("/etc/localtime", {st_mode=S_IFREG|0644, st_size=801, ...}) = 0
> stat64("/etc/localtime", {st_mode=S_IFREG|0644, st_size=801, ...}) = 0
> send(3, "<134>May  7 16:31:02 /usr/sbin/o"..., 79, MSG_NOSIGNAL) = 79
> kill(0, SIGTERM)                        = 0
> --- SIGTERM (Terminated) @ 0 (0) ---
> sigreturn()                             = ? (mask now [])
> rt_sigaction(SIGALRM, {0x8067830, [ALRM], SA_RESTART}, {SIG_DFL}, 8) = 0
> alarm(60)                               = 0
> wait4(-1, NULL, 0, NULL)                = 2350
> --- SIGCHLD (Child exited) @ 0 (0) ---
> sigreturn()                             = ? (mask now [])
> wait4(-1, NULL, 0, NULL)                = 2345
> --- SIGCHLD (Child exited) @ 0 (0) ---
> sigreturn()                             = ? (mask now [])
> wait4(-1, NULL, 0, NULL)                = 2349
> --- SIGCHLD (Child exited) @ 0 (0) ---
> sigreturn()                             = ? (mask now [])
> wait4(-1, NULL, 0, NULL)                = 2341
> --- SIGCHLD (Child exited) @ 0 (0) ---
> sigreturn()                             = ? (mask now [])
> wait4(-1, NULL, 0, NULL)                = 2347
> --- SIGCHLD (Child exited) @ 0 (0) ---
> sigreturn()                             = ? (mask now [])
> wait4(-1, NULL, 0, NULL)                = 2346
> --- SIGCHLD (Child exited) @ 0 (0) ---
> sigreturn()                             = ? (mask now [])
> wait4(-1, NULL, 0, NULL)                = 2348
> --- SIGCHLD (Child exited) @ 0 (0) ---
> sigreturn()                             = ? (mask now [])
> wait4(-1, NULL, 0, NULL)                = 2342
> --- SIGCHLD (Child exited) @ 0 (0) ---
> sigreturn()                             = ? (mask now [])
> wait4(-1, NULL, 0, NULL)                = ? ERESTARTSYS (To be restarted)
> --- SIGALRM (Alarm clock) @ 0 (0) ---
> kill(0, SIGKILL)                        = 0
> +++ killed by SIGKILL +++
> Process 2340 detached
> 
> 
> If I read it correctly, the SIGKILL is sent by this process, after sending 
> SIGTERM to all its children. The SIGTERM is sent because a child exited. 
> But which child? And why?
> 
> The openser log says:
> May  7 16:31:02 debian /usr/sbin/openser[2340]: child process 2344 
> exited by a signal 9
> May  7 16:31:02 debian /usr/sbin/openser[2340]: core was not generated
> May  7 16:31:02 debian /usr/sbin/openser[2340]: INFO: terminating due to 
> SIGCHLD
> 
> To me this looks like 2344 (a UDP thread) exited with signal 9. Thus, 
> the main thread receives SIGCHLD and then sends SIGTERM and afterwards 
> SIGKILL to all other threads and itself.
> 
> But why did thread 2344 receive a SIGKILL, and who sent the SIGKILL?
> 
> I need some more debugging tips.
> Bogdan, you mentioned gdb - how can I debug this with gdb?
> 
> regards
> klaus
> 
> Bogdan-Andrei Iancu wrote:
>> Hi Klaus,
>>
>> I applied on SVN the fix for the TM memory leak - it should not happen 
>> anymore now, even if you do not use t_release()...
>>
>> regarding openser ceasing to react - can you attach with gdb to see 
>> what the processes are doing?
>>
>> regards,
>> bogdan
>>
>> Klaus Darilion wrote:
>>> Hi Daniel!
>>>
>>> Summary:
>>> - Without t_release() (no modifications to source code) openser leaks 
>>> memory.
>>> - with t_release() openser does not leak. But after some time there 
>>> is strange behaviour, e.g.:
>>>  -: openser stops reacting for some minutes and afterwards gets
>>>     terminated with signal 9. When openser stops working the load
>>>     increases to > 40. This has happened 3 times now.
>>>  -: openser stops reacting for some minutes and the Linux PC
>>>     where openser is running becomes unresponsive. No login. Open
>>>     SSH sessions are unresponsive. I had to reboot the PC. Happened
>>>     1 time.
>>>
>>> Maybe this is not purely openser related, but a problem with openser 
>>> and Linux (as I had to reboot the server one time).
>>>
>>> Any hints how to debug this?
>>>
>>> regards
>>> klaus
>>>
>>
> 
> _______________________________________________
> Users mailing list
> Users at openser.org
> http://openser.org/cgi-bin/mailman/listinfo/users



