[OpenSIPS-Users] Autoscaler in 3.2.x

Bogdan-Andrei Iancu bogdan at opensips.org
Mon Oct 17 06:39:56 UTC 2022


Hi,

So even with auto-scaling disabled, you still get the TCP-related
issues after a while? Do you use TLS in async mode? If yes, try
turning that off.
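
A minimal sketch of turning it off, assuming the TLS sockets are served
by the proto_tls module and that its "tls_async" parameter is available
in the version in use (worth checking against the module docs):

```
# Assumption: proto_tls is the module handling the TLS listeners.
loadmodule "proto_tls.so"
# 0 switches TLS connect/write back to blocking (sync) mode.
modparam("proto_tls", "tls_async", 0)
```

Module parameters are read at startup, so a restart is needed for the
change to take effect.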

Regards,

Bogdan-Andrei Iancu

OpenSIPS Founder and Developer
   https://www.opensips-solutions.com
OpenSIPS Summit 27-30 Sept 2022, Athens
   https://www.opensips.org/events/Summit-2022Athens/

On 10/12/22 1:36 AM, Yury Kirsanov wrote:
> Hi Bogdan,
> Yes, if I enable the autoscaler I immediately run into all sorts of 
> issues with TCP. When it's off I'm just getting this issue from time 
> to time and I have to restart OpenSIPS in that case, even though it's 
> still working - part of the processes lock up and consume 100% CPU, 
> but overall the system continues to service requests.
>
> https://github.com/OpenSIPS/opensips/issues/2921
>
> Best regards,
> Yury.
>
> On Tue, Oct 11, 2022 at 10:59 PM Bogdan-Andrei Iancu
> <bogdan at opensips.org> wrote:
>
>     Hi Yury,
>
>     Is this still an issue?
>
>     Regards,
>
>     Bogdan-Andrei Iancu
>
>     On 9/15/22 5:26 PM, Yury Kirsanov wrote:
>>     Hi Bogdan,
>>     Looks like I'm running into some issues with TCP and autoscaling
>>     again. Now, after a good start, within about 5-10 minutes of an
>>     OpenSIPS restart, even with the rate limiter enabled in iptables,
>>     I'm getting a lot of these errors:
>>
>>     Sep 16 00:20:56 ERROR:core:send_fd: sendmsg would block on 683: Resource temporarily unavailable
>>     Sep 16 00:20:56 ERROR:core:send2worker: send_fd failed
>>     Sep 16 00:20:56 ERROR:core:handle_new_connect: no TCP workers available
>>
>>     And the number of registered users starts to drop.
>>
>>     I've tried to change my autoscaler profile to be a bit more
>>     aggressive:
>>
>>     auto_scaling_profile = PROFILE_TCP
>>          scale up to 32 on 20% for 4 cycles within 5
>>          scale down to 4 on 10% for 10 cycles
>>
>>     But that didn't help. Current TCP settings:
>>
>>     tcp_accept_aliases=0
>>     tcp_keepalive=1
>>     tcp_connect_timeout=1500
>>     tcp_keepinterval = 10
>>     tcp_keepidle = 10
>>     tcp_max_msg_time = 10
>>     tcp_workers = 4 use_auto_scaling_profile PROFILE_TCP
>>     tcp_max_connections = 4096
>>
>>     # Proto TCP
>>     loadmodule "proto_tcp.so"
>>     modparam("proto_tcp", "tcp_async", 1)
>>     modparam("proto_tcp", "tcp_send_timeout", 1000)
>>     modparam("proto_tcp", "tcp_async_local_connect_timeout", 500)
>>     modparam("proto_tcp", "tcp_async_local_write_timeout", 500)
>>     modparam("proto_tcp", "tcp_max_msg_chunks", 16)
>>     modparam("proto_tcp", "tcp_parallel_handling", 1)
>>
>>     I'm also setting TCP persistent flag before mid_register_save
>>     (not sure which one to use - setflag or setbflag so doing both):
>>
>>     modparam("mid_registrar", "tcp_persistent_flag", "TCP_PERSIST_REGISTRATIONS")
>>
>>             if (is_method("REGISTER"))
>>                 if ($socket_in(proto)!="udp")
>>                 {
>>                     setflag("TCP_PERSIST_REGISTRATIONS");
>>                     setbflag("TCP_PERSIST_REGISTRATIONS");
>>                 }
>>
>>     That didn't help. So I had to manually set tcp_workers=32 and now
>>     it works fine. Not sure what's going on here...
>>
>>     Thanks and best regards,
>>     Yury.
>>
>>
>>     On Thu, Sep 15, 2022 at 4:02 PM Bogdan-Andrei Iancu
>>     <bogdan at opensips.org> wrote:
>>
>>         I'm glad it helped. Please keep me posted on whether the
>>         auto-scaling fix holds.
>>
>>         Best regards,
>>
>>         Bogdan-Andrei Iancu
>>
>>         On 9/14/22 10:10 PM, Yury Kirsanov wrote:
>>>         Hi Bogdan,
>>>         Sorry to email directly to you again, but just wanted to say
>>>         a huge thank you for all your great work in supporting
>>>         OpenSIPS and its users!
>>>
>>>         After adjusting TCP parameters my OpenSIPS server can handle
>>>         restarts easily without any issues, even though I'm
>>>         currently dropping all the caches and dialogs and everything
>>>         and not using any rate-limit iptables rules.
>>>
>>>         Also, I've enabled the autoscaler and it seems to work great
>>>         so far. Please see this screenshot: there were 79 processes
>>>         before the restart, and after the restart the number of
>>>         processes immediately dropped to a very low number, even
>>>         though it now keeps some load on the active processes:
>>>
>>>         image.png
>>>
>>>         All the SIP devices were able to reconnect successfully and
>>>         seem to be stable at this stage! No more memory leaks!
>>>         Thanks again!
>>>
>>>         Best regards,
>>>         Yury.
>>>
>>>         On Wed, Sep 14, 2022 at 10:58 PM Bogdan-Andrei Iancu
>>>         <bogdan at opensips.org> wrote:
>>>
>>>             Hi Yury,
>>>
>>>             You need to check the TCP settings and make sure your
>>>             OpenSIPS will (1) not try to perform a TCP connect
>>>             against destinations known to be unable to accept one
>>>             (like TCP/WS endpoints behind NAT) - see
>>>             tcp_no_new_conn_bflag [1] - or (2) not block for a long
>>>             time while attempting a connect - see
>>>             tcp_connect_timeout [2] or consider enabling async [3].
>>>
>>>             [1] https://www.opensips.org/Documentation/Script-CoreParameters-3-2#tcp_no_new_conn_bflag
>>>             [2] https://www.opensips.org/Documentation/Script-CoreParameters-3-2#tcp_connect_timeout
>>>             [3] https://opensips.org/html/docs/modules/3.2.x/proto_tcp.html#idp168992
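>>>
>>>             A minimal sketch of how these three options could look
>>>             in the config, assuming a hypothetical branch flag name
>>>             TCP_NO_CONNECT and illustrative values that need tuning
>>>             for the actual deployment:
>>>
>>>             ```
>>>             # [1] branch flag marking destinations that must only
>>>             # reuse an existing TCP connection (name is an assumption)
>>>             tcp_no_new_conn_bflag = TCP_NO_CONNECT
>>>             # [2] upper bound, in milliseconds, for a blocking connect
>>>             # (illustrative value)
>>>             tcp_connect_timeout = 1500
>>>             # [3] non-blocking connect/write in proto_tcp
>>>             loadmodule "proto_tcp.so"
>>>             modparam("proto_tcp", "tcp_async", 1)
>>>
>>>             # In the routing script, before relaying to an endpoint
>>>             # known to sit behind NAT, the flag would be set with:
>>>             #     setbflag("TCP_NO_CONNECT");
>>>             ```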
>>>
>>>             Regards,
>>>
>>>             Bogdan-Andrei Iancu
>>>
>>>             On 9/13/22 12:01 PM, Yury Kirsanov wrote:
>>>>             Hi Bogdan,
>>>>             Thanks for this update, but it looks like I can't check
>>>>             the autoscaler because of this first issue with the
>>>>             blocking TCP connect. Is there a way to resolve it? Am I
>>>>             doing something wrong, or is it something in the
>>>>             OpenSIPS code? And yes, you're right: as soon as I
>>>>             restart OpenSIPS with a lot of SIP devices trying to
>>>>             connect to it, it goes crazy, starts consuming memory
>>>>             and stops forwarding packets, sitting at 100% load
>>>>             until it runs out of memory and segfaults. Sometimes I
>>>>             can't even restart it into a normal working state; it
>>>>             just loops into the same crash whatever I try.
>>>>
>>>>             I've compiled OpenSIPS 3.3.1 with your patch and was
>>>>             able to start it but not sure, maybe I was just lucky
>>>>             this time.
>>>>
>>>>             What should I do? Thanks!
>>>>
>>>>             Best regards,
>>>>             Yury.
>>>>
>>>>             On Tue, 13 Sept 2022, 18:56 Bogdan-Andrei Iancu,
>>>>             <bogdan at opensips.org> wrote:
>>>>
>>>>                 Hi Yury,
>>>>
>>>>                 It looks like you have multiple issues overlapping
>>>>                 here. The traps you sent have nothing to do
>>>>                 with the auto-scaling, but with a blocking TCP
>>>>                 connect for SIP - most of the procs get blocked
>>>>                 in a sync TCP connect.
>>>>
>>>>                 Regards,
>>>>
>>>>                 Bogdan-Andrei Iancu
>>>>
>>>>                 On 9/12/22 4:39 PM, Yury Kirsanov wrote:
>>>>>                 Hi Bogdan,
>>>>>                 I've applied the patch (had to find where to apply
>>>>>                 it manually for 3.2.8 downloaded from Web page,
>>>>>                 line 1568 instead of 1652) and restarted the
>>>>>                 server with only about 300-350 SIP devices and
>>>>>                 immediately got into same issue. I'm attaching two
>>>>>                 GDB dumps made within several minutes from each
>>>>>                 other. Autoscale was now OFF, please see my
>>>>>                 previous message as currently for some reason I'm
>>>>>                 experiencing lockups even when it's off :(
>>>>>
>>>>>                 Best regards,
>>>>>                 Yury.
>>>>>
>>>>>                 On Mon, Sep 12, 2022 at 7:48 PM Bogdan-Andrei
>>>>>                 Iancu <bogdan at opensips.org> wrote:
>>>>>
>>>>>                     Hi Yury,
>>>>>
>>>>>                     Could you give this patch a try? It should fix
>>>>>                     the blocking you experience (it should apply
>>>>>                     on 3.2 too).
>>>>>
>>>>>                     Best regards,
>>>>>
>>>>>                     Bogdan-Andrei Iancu
>>>>>
>>>>>                     On 9/7/22 2:54 PM, Bogdan-Andrei Iancu wrote:
>>>>>>                     Hi Yury,
>>>>>>
>>>>>>                     Thanks for the detailed info here - let me
>>>>>>                     review some code and run some tests, as at
>>>>>>                     this point I have a good idea of the
>>>>>>                     direction to dig into.
>>>>>>
>>>>>>                     I will update here.
>>>>>>
>>>>>>                     Best regards,
>>>>>>                     Bogdan-Andrei Iancu
>>>>>>
>>>>>>                     On 9/6/22 11:24 AM, Yury Kirsanov wrote:
>>>>>>>                     Hi Bogdan,
>>>>>>>                     Yes, I'm listening on all types of sockets
>>>>>>>                     including UDP, TCP and TLS on the outside
>>>>>>>                     public interface and then forward traffic
>>>>>>>                     into internal LAN via UDP only.
>>>>>>>
>>>>>>>                     Previously it was getting stuck quite
>>>>>>>                     easily, now I had to wait for a while before
>>>>>>>                     this actually happened. I've routed part of
>>>>>>>                     my customers to this server to obtain this
>>>>>>>                     result so I will have to do that again.
>>>>>>>
>>>>>>>                     As soon as I see one of the processes stuck
>>>>>>>                     I'll run the trap command and send you all
>>>>>>>                     the details including processes load, ps
>>>>>>>                     output and so on.
>>>>>>>
>>>>>>>                     For now I had to switch autoscaling off and
>>>>>>>                     just create many listeners. Do I understand
>>>>>>>                     correctly that I need to restart OpenSIPS in
>>>>>>>                     order to apply autoscaling profiles and
>>>>>>>                     reload-routes is not sufficient?
>>>>>>>
>>>>>>>                     Also, do I need separate UDP profiles for
>>>>>>>                     public and private interfaces? And do I need
>>>>>>>                     to apply the autoscaling profile just to a
>>>>>>>                     socket, or do I need to specify udp_workers
>>>>>>>                     or tcp_workers with the autoscaler too?
>>>>>>>
>>>>>>>                     Thanks and best regards,
>>>>>>>                     Yury.
>>>>>>>
>>>>>>>                     On Tue, 6 Sept 2022, 18:18 Bogdan-Andrei
>>>>>>>                     Iancu, <bogdan at opensips.org> wrote:
>>>>>>>
>>>>>>>                         Hi Yury,
>>>>>>>
>>>>>>>                         Thanks for the info. I see that the
>>>>>>>                         stuck process (24) is an auto-scaled
>>>>>>>                         one (based on its id). Do you have SIP
>>>>>>>                         traffic from UDP to TCP, or are you doing
>>>>>>>                         some HEP capturing for SIP? I saw a recent
>>>>>>>                         similar report where a UDP auto-scaled
>>>>>>>                         worker got stuck when trying to do some
>>>>>>>                         communication with the TCP main/manager
>>>>>>>                         process (in order to handle a TCP
>>>>>>>                         operation).
>>>>>>>
>>>>>>>                         BTW, any chance you could run
>>>>>>>                         "opensips-cli -x trap" when you have that
>>>>>>>                         stuck process, just to see where it is
>>>>>>>                         stuck? And is it hard to reproduce? I may
>>>>>>>                         ask you to extract some information from
>>>>>>>                         the running process.
>>>>>>>
>>>>>>>                         Regards,
>>>>>>>
>>>>>>>                         Bogdan-Andrei Iancu
>>>>>>>
>>>>>>>                         On 9/3/22 6:54 PM, Yury Kirsanov wrote:

-------------- next part --------------
A non-text attachment was scrubbed...
Name: image.png
Type: image/png
Size: 103945 bytes
Desc: not available
URL: <http://lists.opensips.org/pipermail/users/attachments/20221017/c2345388/attachment-0001.png>

