[OpenSIPS-Users] opensips tm timer core dump

Kennard_White at logitech.com Kennard_White at logitech.com
Wed Oct 27 07:21:54 CEST 2010


Hi Bogdan,

I think there is likely another race condition in the tm timer code. The
problem is when the same struct timer_link is used on two or more timer
lists. For example, uac->request.fr_timer is used on both FR_TIMER_LIST and
FR_INV_TIMER_LIST, and the call on t_reply.c:1532 moves it between timer
lists. The problem is that the FR_TIMER_LIST is not locked while the move
takes place. That is, set_timer() locks the list it is moving to, but not
the list it is moving from. The thus call to remove_timer_unsafe() is not
guarded by the appropriate lock.

I found this while searching for the earlier problem. I added check into
set_timer():
    if (new_tl->timer_list==DETACHED_LIST){
                   LM_CRIT("set_timer for %d list called on a \"detached\"
   "
                           "timer -- ignoring: %p\n", list_id, new_tl);
                   goto end;
           }
           /* make sure I'm not already on a list */
           if (new_tl->timer_list!=0 && new_tl->timer_list!=list) {
                   LM_CRIT("set_timer for %d list called while on other
   list %d",  list_id, new_tl->timer_list->id);
                   abort();
           }
           remove_timer_unsafe( new_tl );

           insert_timer_unsafe( list, new_tl, timeout +
                           ((timer_id2type[list_id]==UTIME_TYPE)?get_uticks
   ():get_ticks()));
   end:

And I got:

Oct 26 21:37:37 org-sip01 /var/run/openser/opensips-proxy[9708]:
CRITICAL:tm:set_timer: set_timer for 1 list called while on other list 0

Regards,
Kennard




From:	Bogdan-Andrei Iancu <bogdan at voice-system.ro>
To:	OpenSIPS users mailling list <users at lists.opensips.org>
Date:	10/25/2010 06:35 AM
Subject:	Re: [OpenSIPS-Users] opensips tm timer core dump
Sent by:	users-bounces at lists.opensips.org



Done!

This bug is affecting trunk only.

Regards,
Bogdan

Bogdan-Andrei Iancu wrote:
> Hi Kennard,
>
> You are perfectly right - there was a change  in moving the setT(cell)
> upper in the function (we needed it in the callbacks), and this bug
> looks like a side effect.
>
> I will make the fix asap !
>
> Thanks and regards,
> Bogdan
>
> Kennard_White at logitech.com wrote:
>
>> Hi Bogdan,
>>
>> I believe I found the problem. When sip_msg_cloner() within
>> build_cell() fails due to out-of-mem, and dangling pointer to the cell
>> is left in the global transaction pointer. Later on the post_cb() code
>> attempts to clean this up, and "resurrects" the now-free memory, and
>> in particular puts it on a wait timer. My guess is that later on this
>> memory is allocated into a new transaction which eventually uses the
>> same wait timer link, and the two threads fight it out.
>>
>> Adding a set_t(0) fixes the problem. While looking, I believe I found
>> a memory leak in uac creation code that is also triggered by out of
>> memory conditions. Please see attached patch.
>>
>> The "problem" also manifest itself as a core dump with
>> insert_timer_unsafe. Once I disabled optimization, I saw multiple
>> variations within this function. The most common was that ptr==tl
>> after the search. The sequence of events for this is: new cell
>> created, stored into global T, free'd, added to wait-list by cleanup
>> code from global T, memory re-allocated into new cell, timer link
>> zero'd, and then added again to wait list. At least that is my best
guess.
>>
>> Regards,
>> Kennard
>>
>> /(See attached file: opensips-tm-cell.patch)/
>>
>> Inactive hide details for Bogdan-Andrei Iancu ---10/22/2010 07:02:36
>> AM---Hi Kennard, I suppose the bt is the same ? do you
>> stiBogdan-Andrei Iancu ---10/22/2010 07:02:36 AM---Hi Kennard, I
>> suppose the bt is the same ? do you still have the core file ?
>>
>> From: Bogdan-Andrei Iancu <bogdan at voice-system.ro>
>> To: OpenSIPS users mailling list <users at lists.opensips.org>
>> Date: 10/22/2010 07:02 AM
>> Subject: Re: [OpenSIPS-Users] opensips tm timer core dump
>> Sent by: users-bounces at lists.opensips.org
>>
>> ------------------------------------------------------------------------
>>
>>
>>
>> Hi Kennard,
>>
>> I suppose the bt is the same ? do you still have the core file ?
>>
>> Regards,
>> Bogdan
>>
>> Kennard_White at logitech.com wrote:
>>
>>> Hi Bodgen,
>>>
>>> I replicated the error. Unfortunately the entire insert_timer_unsafe
>>> and been in-lined and little is available:
>>>
>>> Program terminated with signal 11, Segmentation fault.
>>> #0 0x00007f8b8356c2c2 in insert_timer_unsafe (new_tl=0x7f8b7a54e310,
>>> list_id=WT_TIMER_LIST, ext_timeout=<value optimized out>) at
timer.c:731
>>> 731 timer.c: No such file or directory.
>>> in timer.c
>>> (gdb) print tl
>>> $1 = <value optimized out>
>>> (gdb) print *tl
>>> Cannot access memory at address 0x0
>>> (gdb) print ptr
>>> $2 = <value optimized out>
>>> (gdb) print *ptr
>>> Cannot access memory at address 0x0
>>> (gdb) print *new_tl
>>> No symbol "new_tl" in current context.
>>> (gdb) up
>>> #1 set_1timer (new_tl=0x7f8b7a54e310, list_id=WT_TIMER_LIST,
>>> ext_timeout=<value optimized out>) at timer.c:904
>>> 904 in timer.c
>>> (gdb) print *new_tl
>>> $3 = {next_tl = 0x0, prev_tl = 0x0, ld_tl = 0x0, time_out = 0,
>>> timer_list = 0x0, deleted = 0}
>>> (gdb) print list
>>> $4 = <value optimized out>
>>> (gdb) print timeout
>>> $5 = 32
>>> (gdb) print new_tl
>>> $6 = (struct timer_link *) 0x7f8b7a54e310
>>>
>>> I'll keep the core for a while -- please let me know if there is
>>> anything else I can try.
>>>
>>> Thanks,
>>> Kennard
>>>
>>> Inactive hide details for Bogdan-Andrei Iancu ---10/08/2010 04:40:47
>>> AM---Hi Kennard, Ok, keep the core next time :)Bogdan-Andrei Iancu
>>> ---10/08/2010 04:40:47 AM---Hi Kennard, Ok, keep the core next time :)
>>>
>>> From: Bogdan-Andrei Iancu <bogdan at voice-system.ro>
>>> To: OpenSIPS users mailling list <users at lists.opensips.org>
>>> Date: 10/08/2010 04:40 AM
>>> Subject: Re: [OpenSIPS-Users] opensips tm timer core dump
>>> Sent by: users-bounces at lists.opensips.org
>>>
>>>
------------------------------------------------------------------------
>>>
>>>
>>>
>>> Hi Kennard,
>>>
>>> Ok, keep the core next time :)
>>>
>>> Regards,
>>> Bogdan
>>>
>>> Kennard_White at logitech.com wrote:
>>>
>>>> Hi Bogden,
>>>>
>>>> Thanks for explaining the child processes involved -- I misunderstood
>>>> what was happening.
>>>>
>>>> Unfortunately, I don't have the core anymore. My recollection is that
>>>> I couldn't print anything useful due to compiler optimization. That
>>>> said, this should re-create pretty easily, and I'll get more dumps
>>>> next time it happens.
>>>>
>>>> Regards,
>>>> Kennard
>>>>
>>>> Inactive hide details for Bogdan-Andrei Iancu ---10/05/2010 01:41:38
>>>> AM---Hi Kennard, The core was generated by process 22255:Bogdan-Andrei
>>>> Iancu ---10/05/2010 01:41:38 AM---Hi Kennard, The core was generated
>>>> by process 22255:
>>>>
>>>> From: Bogdan-Andrei Iancu <bogdan at voice-system.ro>
>>>> To: OpenSIPS users mailling list <users at lists.opensips.org>
>>>> Date: 10/05/2010 01:41 AM
>>>> Subject: Re: [OpenSIPS-Users] opensips tm timer core dump
>>>> Sent by: users-bounces at lists.opensips.org
>>>>
>>>>
>>>>
>> ------------------------------------------------------------------------
>>
>>>>
>>>> Hi Kennard,
>>>>
>>>> The core was generated by process 22255:
>>>>    [22238]: INFO:core:handle_sigs: child process 22255 exited by a
>>>> signal 11
>>>>
>>>> and this process also reported mem problems:
>>>>    [22255]: ERROR:tm:new_t: out of mem
>>>>
>>>> Can you print the "tl" or "ptr" variables in frame 0?
>>>>
>>>> Regards,
>>>> Bogdan
>>>>
>>>> Kennard_White at logitech.com wrote:
>>>>
>>>>> Running against opensips HEAD, I got a segfault in the tm timer
>>>>>
>> code.
>>
>>>>> I believe this is triggered by running out of shared memory.
>>>>>
>>>>>
>>>>> The stack trace:
>>>>>
>>>>> (gdb) where
>>>>> #0 0x00007fe8f8d96212 in insert_timer_unsafe (new_tl=0x7fe8f66337b0,
>>>>> list_id=WT_TIMER_LIST, ext_timeout=<value optimized out>) at
>>>>>
>>> timer.c:731
>>>
>>>>> #1 set_1timer (new_tl=0x7fe8f66337b0, list_id=WT_TIMER_LIST,
>>>>> ext_timeout=<value optimized out>) at timer.c:904
>>>>> #2 0x00007fe8f8d78ac8 in t_release_transaction
>>>>>
>> (trans=0x7fe8f6633730)
>>
>>>>> at t_funcs.c:122
>>>>> #3 0x00007fe8f8d808e5 in t_unref (p_msg=<value optimized out>)
>>>>> at t_lookup.c:1152
>>>>> #4 0x0000000000483ae5 in exec_post_req_cb ()
>>>>> #5 0x000000000046c1e4 in receive_msg ()
>>>>> #6 0x00000000004bc77c in udp_rcv_loop ()
>>>>> #7 0x000000000042de9c in main ()
>>>>>
>>>>> The offending code (I believe):
>>>>> if (tl->time_out==ptr->time_out) {
>>>>> tl->ld_tl = ptr->ld_tl
>>>>> ptr->ld_tl = 0;
>>>>> tl->ld_tl->ld_tl = tl; <-- SEG FAULT HERE (according to trace)
>>>>> } else {
>>>>> tl->ld_tl = tl;
>>>>> }
>>>>>
>>>>> Unfortunately, due to optimization I cannot dump anything
>>>>>
>> useful, and
>>
>>>>> I'm not convinced the actual fault is on the line indicated.
>>>>>
>> Note that
>>
>>>>> the core dump is not one of the processes that reported out of
>>>>>
>> memory.
>>
>>>>> Maybe one of the other processes left the timer list in a corrupt
>>>>>
>>> state?
>>>
>>>>> The log file:
>>>>> Sep 29 11:43:36 org-sip01 /var/run/openser/opensips-pres[22255]:
>>>>> ERROR:tm:sip_msg_cloner: no more share memory
>>>>> Sep 29 11:43:36 org-sip01 /var/run/openser/opensips-pres[22255]:
>>>>> ERROR:tm:new_t: out of mem
>>>>> Sep 29 11:43:36 org-sip01 /var/run/openser/opensips-pres[22255]:
>>>>> ERROR:tm:t_newtran: new_t failed
>>>>> Sep 29 11:43:36 org-sip01 /var/run/openser/opensips-pres[22254]:
>>>>> WARNING:core:fm_malloc: Not enough free memory, will atempt
>>>>>
>>>> defragmenation
>>>>
>>>>> Sep 29 11:43:36 org-sip01 /var/run/openser/opensips-pres[22254]:
>>>>> ERROR:tm:sip_msg_cloner: no more share memory
>>>>> Sep 29 11:43:36 org-sip01 /var/run/openser/opensips-pres[22254]:
>>>>> ERROR:tm:new_t: out of mem
>>>>> Sep 29 11:43:36 org-sip01 /var/run/openser/opensips-pres[22254]:
>>>>> ERROR:tm:t_newtran: new_t failed
>>>>> Sep 29 11:43:36 org-sip01 /var/run/openser/opensips-pres[22238]:
>>>>> INFO:core:handle_sigs: child process 22255 exited by a signal 11
>>>>> Sep 29 11:43:36 org-sip01 /var/run/openser/opensips-pres[22238]:
>>>>> INFO:core:handle_sigs: core was generated
>>>>> Sep 29 11:43:36 org-sip01 /var/run/openser/opensips-pres[22238]:
>>>>> INFO:core:handle_sigs: terminating due to SIGCHLD
>>>>> Sep 29 11:43:36 org-sip01 /var/run/openser/opensips-pres[22256]:
>>>>> INFO:core:sig_usr: signal 15 received
>>>>>
>>>>>
>>>>>
>>>
------------------------------------------------------------------------
>>>
>>>>> _______________________________________________
>>>>> Users mailing list
>>>>> Users at lists.opensips.org
>>>>> http://lists.opensips.org/cgi-bin/mailman/listinfo/users
>>>>>
>>>>>
>>>> --
>>>> Bogdan-Andrei Iancu
>>>> OpenSIPS Bootcamp
>>>> 15 - 19 November 2010, Edison, New Jersey, USA
>>>> www.voice-system.ro
>>>>
>>>>
>>>> _______________________________________________
>>>> Users mailing list
>>>> Users at lists.opensips.org
>>>> http://lists.opensips.org/cgi-bin/mailman/listinfo/users
>>>>
>>>>
>>>>
>> ------------------------------------------------------------------------
>>
>>>> _______________________________________________
>>>> Users mailing list
>>>> Users at lists.opensips.org
>>>> http://lists.opensips.org/cgi-bin/mailman/listinfo/users
>>>>
>>>>
>>> --
>>> Bogdan-Andrei Iancu
>>> OpenSIPS Bootcamp
>>> 15 - 19 November 2010, Edison, New Jersey, USA
>>> www.voice-system.ro
>>>
>>>
>>> _______________________________________________
>>> Users mailing list
>>> Users at lists.opensips.org
>>> http://lists.opensips.org/cgi-bin/mailman/listinfo/users
>>>
>>>
------------------------------------------------------------------------
>>>
>>> _______________________________________________
>>> Users mailing list
>>> Users at lists.opensips.org
>>> http://lists.opensips.org/cgi-bin/mailman/listinfo/users
>>>
>>>
>> --
>> Bogdan-Andrei Iancu
>> OpenSIPS Bootcamp
>> 15 - 19 November 2010, Edison, New Jersey, USA
>> www.voice-system.ro
>>
>>
>> _______________________________________________
>> Users mailing list
>> Users at lists.opensips.org
>> http://lists.opensips.org/cgi-bin/mailman/listinfo/users
>>
>> ------------------------------------------------------------------------
>>
>> _______________________________________________
>> Users mailing list
>> Users at lists.opensips.org
>> http://lists.opensips.org/cgi-bin/mailman/listinfo/users
>>
>>
>
>
>


--
Bogdan-Andrei Iancu
OpenSIPS Bootcamp
15 - 19 November 2010, Edison, New Jersey, USA
www.voice-system.ro


_______________________________________________
Users mailing list
Users at lists.opensips.org
http://lists.opensips.org/cgi-bin/mailman/listinfo/users
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://lists.opensips.org/pipermail/users/attachments/20101026/9c3340aa/attachment-0001.htm 
-------------- next part --------------
A non-text attachment was scrubbed...
Name: graycol.gif
Type: image/gif
Size: 105 bytes
Desc: not available
Url : http://lists.opensips.org/pipermail/users/attachments/20101026/9c3340aa/attachment-0001.gif 


More information about the Users mailing list