[OpenSIPS-Users] crashing in 2.2.2

Bogdan-Andrei Iancu bogdan at opensips.org
Tue Mar 7 16:10:27 EST 2017


Hi Richard,

Sorry for the slow reply - is this crash happening only under increased 
load? Do you see any errors from OpenSIPS prior to the crash?

I noticed that the backtraces in the corefiles are similar - how easy is 
it for you to reproduce?

Regards,

Bogdan-Andrei Iancu
   OpenSIPS Founder and Developer
   http://www.opensips-solutions.com

OpenSIPS Summit May 2017 Amsterdam
   http://www.opensips.org/events/Summit-2017Amsterdam.html

On 03/07/2017 12:28 PM, Richard Robson wrote:
> Hi,
>
> I've gone over the script and as far as I can see it's working as 
> expected until the traffic ramps up, and then OpenSIPS crashes.
>
> cores:
> http://pastebin.com/CgN0h40K
> http://pastebin.com/ay5TS8zD
> http://pastebin.com/PGn3AqmU
>
> Regards,
>
> Richard
>
> On 06/03/2017 12:14, Richard Robson wrote:
>> Hi,
>>
>> I've tested this on the latest 2.2.3 with the same results.
>>
>> http://pastebin.com/Uixb3v8G
>>
>> There were a few of these in the logs too, just before the crash:
>> Mar  5 22:02:27 gl-sip-03 /usr/sbin/opensips[29875]: 
>> WARNING:core:utimer_ticker: utimer task <tm-utimer> already scheduled 
>> for 204079170 ms (now 204079270 ms), it may overlap..
>> Mar  5 22:02:27 gl-sip-03 /usr/sbin/opensips[29875]: 
>> WARNING:core:utimer_ticker: utimer task <tm-utimer> already scheduled 
>> for 204079170 ms (now 204079360 ms), it may overlap..
>> Mar  5 22:02:27 gl-sip-03 /usr/sbin/opensips[29875]: 
>> WARNING:core:utimer_ticker: utimer task <tm-utimer> already scheduled 
>> for 204079170 ms (now 204079460 ms), it may overlap..
>> Mar  5 22:02:27 gl-sip-03 /usr/sbin/opensips[29875]: 
>> WARNING:core:utimer_ticker: utimer task <tm-utimer> already scheduled 
>> for 204079170 ms (now 204079560 ms), it may overlap..
>> Mar  5 22:02:27 gl-sip-03 /usr/sbin/opensips[29875]: 
>> WARNING:core:utimer_ticker: utimer task <tm-utimer> already scheduled 
>> for 204079170 ms (now 204079660 ms), it may overlap..
>> Mar  5 22:02:28 gl-sip-03 /usr/sbin/opensips[29875]: 
>> WARNING:core:utimer_ticker: utimer task <tm-utimer> already scheduled 
>> for 204079170 ms (now 204079760 ms), it may overlap..
>>
>>
>> Regards,
>>
>> Richard
>>
>>
>>
>> On 03/03/2017 13:15, Richard Robson wrote:
>>> More cores
>>>
>>> http://pastebin.com/MXW2VBhi
>>> http://pastebin.com/T7JFAP2U
>>> http://pastebin.com/u44aaVpW
>>> http://pastebin.com/SFKKcGxE
>>> http://pastebin.com/dwSgMsJi
>>> http://pastebin.com/9HdGLm96
>>>
>>> I've put 2.2.3 on the dev box now and will try to replicate on that 
>>> box, but it's difficult to replicate the traffic artificially. I'll 
>>> try to replicate the fault on the dev box over the weekend. I can't 
>>> do it on the live gateways because it will affect customer traffic.
>>>
>>> Regards,
>>>
>>> Richard
>>>
>>>
>>> On 03/03/2017 11:28, Richard Robson wrote:
>>>> I've revisited the gateway failover mechanism I had developed to 
>>>> re-route calls to the next gateway on 5XX responses caused by 
>>>> capacity limits on the gateways we are using.
>>>>
>>>> We have 3 gateways from one carrier and one from another. The 3 are 
>>>> limited to 4 CPS and will return a 503 or 500 if we breach this. The 
>>>> single gateway from the other carrier has plenty of capacity and 
>>>> should not be a problem, so we want to catch these rejections and 
>>>> route to the next gateway.
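>>>>
>>>> For reference, the per-gateway limits are stored in the drouting 
>>>> gateway attributes as "<max CPS>:<max channels>" and exposed to the 
>>>> script through the gateway attrs AVP. A minimal sketch of that setup 
>>>> (the exact config is not included in this post):
>>>>
>>>> # expose the selected gateway's attributes in $avp(attrs)
>>>> modparam("drouting", "gw_attrs_avp", "$avp(attrs)")
>>>> # in the dr_gateways table, the attrs column holds e.g. "4:30",
>>>> # i.e. 4 CPS and 30 channels, split later with {s.select,N,:}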
>>>>
>>>> We are counting the CPS and channel usage and routing to the next 
>>>> gateway if we exceed the limits set, but there are still occasions 
>>>> where a 5XX is generated, which results in a rejected call.
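>>>>
>>>> The channel counting relies on a dialog profile with values, roughly 
>>>> set up like this (a sketch; the full script has more around it):
>>>>
>>>> modparam("dialog", "profiles_with_value", "outbound")
>>>>
>>>> # on the initial INVITE, count the call against the selected gateway
>>>> if (is_method("INVITE") && !has_totag()) {
>>>>         create_dialog();
>>>>         set_dlg_profile("outbound", "$rd");
>>>> }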
>>>>
>>>>
>>>> We want to stop these rejected calls and therefore want to extend 
>>>> the failover mechanism to the 5XX responses. For 6 months we have 
>>>> been failing over, without a problem, whenever we think the counts 
>>>> are too high on any one gateway. But when I implement a failover on 
>>>> a 5XX response, OpenSIPS starts crashing.
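>>>>
>>>> The 5XX failover itself is armed on the initial relay, along these 
>>>> lines (a sketch of the main route; the real script does more here):
>>>>
>>>> t_on_failure("dr_fo");
>>>> if (!t_relay()) {
>>>>         send_reply("500", "Internal Error");
>>>>         exit;
>>>> }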
>>>>
>>>> It's difficult to generate artificial traffic that mimics the real 
>>>> traffic, but I've not had a problem with the script in testing. Last 
>>>> night I rolled out the new script, but by 09:15 this morning, as the 
>>>> traffic ramped up, OpenSIPS crashed 10 times in 5 minutes. I rolled 
>>>> back the script and it restarted OK and has not crashed since. 
>>>> Therefore the failover mechanism in the script is where the crash is 
>>>> happening.
>>>>
>>>> Core dump: http://pastebin.com/CqnESCm4
>>>>
>>>> I'll add more dumps later
>>>>
>>>> Regards,
>>>>
>>>> Richard
>>>>
>>>>
>>>> This is the failure route catching the 5XX:
>>>>
>>>> failure_route[dr_fo] {
>>>>         xlog(" [dr] Received reply to method $rm From: $fd, $fn, $ft, $fu, $fU, $si, $sp, To: $ru");
>>>>
>>>>         if (t_was_cancelled()) {
>>>>                 xlog("[dr] call cancelled by internal caller");
>>>>                 rtpengine_manage();
>>>>                 do_accounting("db", "cdr|missed");
>>>>                 exit;
>>>>         }
>>>>
>>>>         # fail over to the next gateway on 403/503 or 500;
>>>>         # note: control returns here if relay_failover does not exit
>>>>         if (t_check_status("[54]03")) {
>>>>                 route(relay_failover);
>>>>         }
>>>>         if (t_check_status("500")) {
>>>>                 route(relay_failover);
>>>>         }
>>>>
>>>>         do_accounting("db", "cdr|missed");
>>>>         rtpengine_manage();
>>>>         exit;
>>>> }
>>>>
>>>> This is the route taken on the failure:
>>>>
>>>>
>>>> route[relay_failover] {
>>>>
>>>>         if (use_next_gw()) {
>>>>                 xlog("[relay_failover-route] Selected Gateway is $rd");
>>>>                 # gateway attrs hold "<rate limit>:<channel limit>"
>>>>                 $avp(trunkratelimit) = $(avp(attrs){s.select,0,:});
>>>>                 $avp(trunkchannellimit) = $(avp(attrs){s.select,1,:});
>>>>
>>>>                 ####### check channel limit ######
>>>>                 get_profile_size("outbound", "$rd", "$var(size)");
>>>>                 xlog("[relay_failover-route] Selected Gateway is $rd var(size) = $var(size)");
>>>>                 xlog("[relay_failover-route] Selected Gateway is $rd avp(trunkchannellimit) = $avp(trunkchannellimit)");
>>>>                 xlog("[relay_failover-route] Selected Gateway is $rd result = ( $var(size) > $avp(trunkchannellimit) )");
>>>>                 if ($(var(size){s.int}) > $(avp(trunkchannellimit){s.int})) {
>>>>                         xlog("[relay_failover-route] Trunk $rd exceeded $avp(trunkchannellimit) concurrent calls: $var(size)");
>>>>                         route(relay_failover);
>>>>                 }
>>>>         } else {
>>>>                 send_reply("503", "Gateways Exhausted");
>>>>                 exit;
>>>>         }
>>>>
>>>>         ##### We need to check Rate Limiting #######
>>>>         if (!rl_check("$rd", "$(avp(trunkratelimit){s.int})", "TAILDROP")) { # Check Rate limit $avp needs changing
>>>>                 rl_dec_count("$rd"); # decrement the counter since we've not "used" one
>>>>                 xlog("[ratelimiter-route] [Max CPS: $(avp(trunkratelimit){s.int}) Current CPS: $rl_count($rd)] Call to: $rU from: $fU CPS exceeded, delaying");
>>>>                 $avp(initial_time) = ($Ts * 1000) + ($Tsm / 1000);
>>>>                 async(usleep("200000"), relay_failover_delay);
>>>>                 xlog("Should not get here!!!! after async request");
>>>>         } else {
>>>>                 xlog("[relay_outbound-route] [Max CPS: $avp(trunkratelimit) Current CPS: $rl_count($rd)] Call to: $rU from: $fU not ratelimited");
>>>>         }
>>>>
>>>>         t_on_failure("dr_fo");
>>>>         do_accounting("db", "cdr|missed");
>>>>         rtpengine_manage();
>>>>         if (!t_relay()) {
>>>>                 xlog("[relay-route] ERROR: Unable to relay");
>>>>                 send_reply("500", "Internal Error");
>>>>                 exit;
>>>>         }
>>>> }
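>>>>
>>>> The async resume route isn't shown above; it is essentially the tail 
>>>> of relay_failover run again after the 200 ms sleep, something like 
>>>> this (a guess at its shape, since the real route isn't in this post):
>>>>
>>>> route[relay_failover_delay] {
>>>>         # resumed roughly 200 ms after the usleep() above;
>>>>         # relay the delayed call on the currently selected gateway
>>>>         t_on_failure("dr_fo");
>>>>         do_accounting("db", "cdr|missed");
>>>>         rtpengine_manage();
>>>>         if (!t_relay()) {
>>>>                 xlog("[relay_failover_delay] ERROR: Unable to relay");
>>>>                 send_reply("500", "Internal Error");
>>>>                 exit;
>>>>         }
>>>> }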
>>>>
>>>>
>>>>
>>>>
>>>
>>>
>>
>>
>
>



