[Users] Re: [Devel] "detached" timer

Jiri Kuthan jiri at iptel.org
Fri Mar 30 15:56:33 CEST 2007


At 14:48 30/03/2007, Bogdan-Andrei Iancu wrote:
>wrong again :)

I wish it were.

Operational experience shows that earlier versions contain race conditions
which cause trouble under hard-to-reproduce conditions. Based on surface
knowledge, it appears that openser inherited those from ser, from before
ser overhauled that code.

>as I mentioned in my previous email, the "detached timer" was more a marker that something else was going wrong - there was no amplification.

Lucky are those who haven't been affected by the race conditions. My point,
though, is that this particular warning correlates with nondeterminism.

>and as TR clearly said, the problem was with DB connectivity and had nothing to do with TM timers.

Well, as a matter of fact, I have witnessed several failures which coincided
with this warning. Studying the code will reveal to you and anyone else
that this warning is actually just a hack which helps to ignore erroneous conditions
and survive them, but doesn't heal the cause of the problem, which may still result
in dysfunctional service.
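
[A minimal sketch in C of the kind of guard discussed here, not the actual
openser or ser source. All names (timer_link, timer_list, DETACHED_LIST,
detach_expired, demo_detached_race) are assumptions made for illustration,
and locking is left out so the losing interleaving can be replayed in a
single thread. List id 1 stands for the final response timer list, as
described further down in this thread.]

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical types; they only mimic the idea of per-list timer links. */
struct timer_list;

struct timer_link {
    struct timer_link *next;
    struct timer_link *prev;
    struct timer_list *owner;   /* NULL = idle, DETACHED_LIST = expiring */
    unsigned long expires;
};

struct timer_list {
    struct timer_link head;     /* circular sentinel node */
    int id;
};

/* Marker address meaning "the timer routine already took this link". */
static struct timer_list detached_marker;
#define DETACHED_LIST (&detached_marker)

void timer_list_init(struct timer_list *list, int id)
{
    list->head.next = list->head.prev = &list->head;
    list->head.owner = list;
    list->head.expires = 0;
    list->id = id;
}

void unlink_timer(struct timer_link *tl)
{
    tl->prev->next = tl->next;
    tl->next->prev = tl->prev;
    tl->next = tl->prev = NULL;
}

/* Arms (or re-arms) a timer. Returns 0 if armed, -1 if the request was
 * ignored because the link is already detached for expiry. */
int set_timer(struct timer_list *list, struct timer_link *tl,
              unsigned long expires)
{
    if (tl->owner == DETACHED_LIST) {
        fprintf(stderr, "set_timer for %d list called on a \"detached\" "
                        "timer -- ignoring\n", list->id);
        return -1;              /* survive, but the re-arm is lost */
    }
    if (tl->owner != NULL)
        unlink_timer(tl);       /* re-arm: leave the current list first */
    tl->expires = expires;
    tl->owner = list;
    tl->prev = list->head.prev; /* append at the tail */
    tl->next = &list->head;
    list->head.prev->next = tl;
    list->head.prev = tl;
    return 0;
}

/* Timer routine: pulls expired links off the list and marks them detached.
 * Returns the number of links detached. */
int detach_expired(struct timer_list *list, unsigned long now)
{
    int n = 0;
    struct timer_link *tl = list->head.next;

    while (tl != &list->head) {
        struct timer_link *next = tl->next;
        if (tl->expires <= now) {
            unlink_timer(tl);
            tl->owner = DETACHED_LIST;
            n++;
        }
        tl = next;
    }
    return n;
}

/* Replays the losing interleaving: expiry runs first, re-arm comes late. */
int demo_detached_race(void)
{
    struct timer_list fr_list;
    struct timer_link tl = { NULL, NULL, NULL, 0 };

    timer_list_init(&fr_list, 1);
    set_timer(&fr_list, &tl, 10);        /* armed normally */
    detach_expired(&fr_list, 10);        /* timer routine wins the race */
    return set_timer(&fr_list, &tl, 20); /* late re-arm is ignored */
}
```

[In this sketch the guard keeps the list structure consistent, but the
caller's request for a new timeout is silently dropped. That is one way a
check like this can let the process survive while the underlying ordering
problem remains.]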

Again -- I don't mean to demonize it; with this ignore-the-problem hack, things
have been running mostly fine.

-jiri


>regards,
>bogdan
>
>Jiri Kuthan wrote:
>>Actually more likely it has been both. The root problem lies in the timer subsystem
>>and may be amplified by other troubles (or amplify those).
>>
>>-jiri
>>
>>At 01:35 30/03/2007, T.R. Missner wrote:
>>  
>>>FYI All
>>>
>>>This turned out to be a database write (acc) that was blocking due to a RAID card problem.
>>>
>>>
>>>
>>>T.R. Missner wrote:
>>>    
>>>>Is it possible the locked state I am seeing with openser leads to the "detached" timer?
>>>>Since the "detached" timer is a race, it would make sense to see the race condition after openser locks up and messages buffer up in the stack.
>>>>When a bunch of messages are processed all at once by multiple threads the race condition would occur.
>>>>Does this make sense?
>>>>
>>>>Maybe I have been focusing on the wrong place.
>>>>
>>>>Ignoring the "detached" timer, what could cause openser to hang for a couple of seconds and then clear, every 5 - 10 minutes?
>>>>
>>>>Ideas?
>>>>
>>>>We are seeing this on 3 different production servers.
>>>>
>>>>Thanks
>>>>
>>>>TR
>>>>
>>>>using openser 1.1.1
>>>>
>>>>
>>>>
>>>>T.R. Missner wrote:
>>>>      
>>>>>Bogdan,
>>>>>
>>>>>I have been chasing this for days and done lots of debugging.
>>>>>using 1.1.1
>>>>>While looking at the network trace at the time of these messages (I usually see at least 5 in a row with differing hex values) I see many incoming packets coming into the box and no response from the proxy for somewhere between 5 - 10 seconds, then a flood of responses from the proxy.
>>>>>I can email you a sample pcap file if you like.
>>>>>As part of my debugging I forced a 100 reply at the very top of my cfg file.
>>>>>The forced 100 was not sent during the locked-up time, leading me to believe openser was not processing incoming packets.
>>>>>I have now seen this on multiple servers in different locations. Likely a particular customer call flow is causing this, but I have not been able to pin it down to the exact customer. These proxies run pretty fast during the day, so finding a pattern leading up to this issue is difficult. What could I add to the log output to identify the offending sip-callid? Is sip-callid or branch tag or anything similar easily accessible in any of the data structs in timer.c?
>>>>>
>>>>>TR
>>>>>
>>>>>Bogdan-Andrei Iancu wrote:
>>>>>        
>>>>>>Hi TR,
>>>>>>
>>>>>>it is a race between the expire event (from the timer) and inserting again on a timer list.
>>>>>>  1 is the final response timer list (fr_timer)
>>>>>>  3 is the wait timer list (wt_timer)
>>>>>>
>>>>>>I would say there is no way this could lead to any kind of lock.
>>>>>>
>>>>>>what version are you using? what makes you say it locks?
>>>>>>
>>>>>>regards,
>>>>>>bogdan
>>>>>>
>>>>>>T.R. Missner wrote:
>>>>>>          
>>>>>>>Does anyone know what causes this?
>>>>>>>
>>>>>>>set_timer for 1 list called on a "detached" timer -- ignoring
>>>>>>>
>>>>>>>I also see
>>>>>>>
>>>>>>>set_timer for 3 list called on a "detached" timer -- ignoring
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>When this happens Openser seems to lock up for 10 seconds or so.
>>>>>>>
>>>>>>>From searching it appears this is caused by a race but I am not sure what the race is or why this results in an unresponsive openser instance for multiple seconds.
>>>>>>>
>>>>>>>Transaction expiration racing reply?
>>>>>>>
>>>>>>>
>>>>>>>Desperately need to understand how this could be triggered so I can get customer to adjust system.
>>>>>>>
>>>>>>>Any way to adjust?
>>>>>>>
>>>>>>>tried tweaking fr_inv_timer but no joy.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>TR
>>>>>>>            
>>>>>>
>>>>>>
>>>>>>
>>>>>>--
>>>>>>Jiri Kuthan            http://iptel.org/~jiri/




