[OpenSIPS-Users] mi_fifo lock on reply after a period of time in 3.1.3

Andrew Yager andrew at rwts.com.au
Mon Oct 18 09:51:25 EST 2021


For those following along, this appears to have been an issue with Auto
Scaling. Since disabling it, the problem has not recurred. For us this is
a perfectly acceptable way forward, but maybe at some stage we'll come
back and work out the root cause.
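
For reference, the change amounted to dropping the auto-scaling profile
from the worker definitions in opensips.cfg. A rough sketch (the profile
name and thresholds here are illustrative, not our exact values):

    # before: workers scaled up and down at runtime
    #auto_scaling_profile = PROF_MAIN
    #     scale up to 16 on 70% for 4 cycles within 5
    #     scale down to 4 on 18% for 10 cycles
    #udp_workers = 8 use_auto_scaling_profile PROF_MAIN

    # after: a fixed worker count, no runtime scaling
    udp_workers = 8    # value illustrative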

Andrew


On Sat, 9 Oct 2021 at 01:39, Andrew Yager <andrew at rwts.com.au> wrote:

> Process 24103 (the blocked proc) last logged at 19:02:01, and in doing
> so, it had:
>
> Oct  8 19:02:01 hvprxy osips[24103]: DBG:tm:t_lookup_request: start
> searching: hash=34811, isACK=0
> Oct  8 19:02:01 hvprxy osips[24103]: DBG:tm:matching_3261: RFC3261
> transaction matching failed
> Oct  8 19:02:01 hvprxy osips[24103]: DBG:tm:t_lookup_request: no
> transaction found
> Oct  8 19:02:01 hvprxy osips[24103]: DBG:statistics:w_update_stat:
> needed statistic is <cpbx_presence_notify_rcv>
> Oct  8 19:02:01 hvprxy osips[24103]: DBG:statistics:parse_groupname:
> group: '', name: 'cpbx_presence_notify_rcv'
> Oct  8 19:02:01 hvprxy osips[24103]: DBG:core:select_contacts: ct:
> sip:[snip]@[snip]:55087;transport=tls
> Oct  8 19:02:01 hvprxy osips[24103]: DBG:core:push_branch: setting msg
> R-URI <sip:[snip]@[snip]:55087;transport=tls>
> Oct  8 19:02:01 hvprxy osips[24103]: [RELAY] Performing mid-registrar
> lookup on sip:[snip]:5060
> Oct  8 19:02:01 hvprxy osips[24103]: DBG:core:lookup: '' Not found in
> usrloc
> Oct  8 19:02:01 hvprxy osips[24103]: DBG:statistics:w_update_stat:
> needed statistic is <cpbx_contact_lookup_failed>
> Oct  8 19:02:01 hvprxy osips[24103]: DBG:statistics:parse_groupname:
> group: '', name: 'cpbx_contact_lookup_failed'
> Oct  8 19:02:01 hvprxy osips[24103]: DBG:core:select_contacts: ct:
> sip:[snip]@[snip]:55087;transport=tls
> Oct  8 19:02:01 hvprxy osips[24103]: DBG:core:push_branch: setting msg
> R-URI <sip:xxx@xxx:55087;transport=tls>
> Oct  8 19:02:01 hvprxy osips[24103]: [RELAY] Performed
> mid_registrar_lookup: sip:xxx@xxx:55087;transport=tls
> Oct  8 19:02:01 hvprxy osips[24103]: DBG:tm:t_newtran: transaction on
> entrance=(nil)
> Oct  8 19:02:01 hvprxy osips[24103]: DBG:core:parse_headers:
> flags=ffffffffffffffff
> Oct  8 19:02:01 hvprxy osips[24103]: DBG:core:parse_headers: flags=78
> Oct  8 19:02:01 hvprxy osips[24103]: DBG:tm:t_lookup_request: start
> searching: hash=34811, isACK=0
> Oct  8 19:02:01 hvprxy osips[24103]: DBG:tm:matching_3261: RFC3261
> transaction matching failed
> Oct  8 19:02:01 hvprxy osips[24103]: DBG:tm:t_lookup_request: no
> transaction found
> Oct  8 19:02:01 hvprxy osips[24103]: DBG:tm:run_reqin_callbacks:
> trans=0x7fafa71d8150, callback type 1, id 0 entered
> Oct  8 19:02:01 hvprxy osips[24103]: DBG:core:mk_proxy: doing DNS lookup...
> Oct  8 19:02:01 hvprxy osips[24103]: DBG:core:parse_headers: flags=2000
> Oct  8 19:02:01 hvprxy osips[24103]: DBG:core:parse_headers:
> flags=ffffffffffffffff
> Oct  8 19:02:01 hvprxy osips[24103]: DBG:core:tcp_conn_get: con found in
> state 0
> Oct  8 19:02:01 hvprxy osips[24103]: DBG:core:tcp_conn_get: tcp
> connection found (0x7fafa84997e8), acquiring fd
> Oct  8 19:02:01 hvprxy osips[24103]: DBG:core:tcp_conn_get: c=
> 0x7fafa84997e8, n=16, Usock=647
> < no more >
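>
> (If anyone wants to check the same thing on their own system: a
> tcp_conn_get "acquiring fd" line with no matching "after receive_fd"
> is the tell. We confirmed with gdb, roughly:
>
>     gdb -p 24103 -batch -ex 'bt'
>
> which should show the worker parked waiting on the fd handoff from
> the TCP main process.)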
>
> Proc 11468 (the TCP main process) logged:
>
> Oct  8 19:02:01 hvprxy osips[11468]: DBG:core:probe_max_sock_buff:
> getsockopt: snd is initially 425984
> Oct  8 19:02:01 hvprxy osips[11468]: DBG:core:probe_max_sock_buff:
> setting snd buf to 851968 had no effect
> Oct  8 19:02:01 hvprxy osips[11468]: DBG:core:probe_max_sock_buff:
> setting snd buf to 428032 had no effect
> Oct  8 19:02:01 hvprxy osips[11468]: DBG:core:probe_max_sock_buff:
> using snd buffer of 416 kb
> Oct  8 19:02:01 hvprxy osips[11468]: DBG:core:init_sock_keepalive: TCP
> keepalive enabled on socket 636
> Oct  8 19:02:01 hvprxy osips[11468]: DBG:core:print_ip: tcpconn_new:
> new tcp connection to: xxx
> Oct  8 19:02:01 hvprxy osips[11468]: DBG:core:tcpconn_new: on port
> 11934, proto 3
> Oct  8 19:02:01 hvprxy osips[11468]: DBG:core:tcpconn_add: hashes: 1004,
> 320
> Oct  8 19:02:01 hvprxy osips[11468]: DBG:core:handle_new_connect: new
> connection: 0x7fafa7cc0c80 636 flags: 001c
> Oct  8 19:02:01 hvprxy osips[11468]: DBG:core:send2worker: no free tcp
> receiver, connection passed to the least busy one (proc #1, 3 con)
> Oct  8 19:02:01 hvprxy osips[11468]: DBG:core:send2worker: to tcp
> worker 1 (0), 0x7fafa7cc0c80 rw 1
> Oct  8 19:02:01 hvprxy osips[11468]: DBG:core:ipc_handle_job: received
> job type 0[RPC] from process 1
> Oct  8 19:02:01 hvprxy osips[11468]: DBG:core:probe_max_sock_buff:
> getsockopt: snd is initially 425984
> Oct  8 19:02:01 hvprxy osips[11468]: DBG:core:probe_max_sock_buff:
> setting snd buf to 851968 had no effect
> Oct  8 19:02:01 hvprxy osips[11468]: DBG:core:probe_max_sock_buff:
> setting snd buf to 428032 had no effect
> Oct  8 19:02:01 hvprxy osips[11468]: DBG:core:probe_max_sock_buff:
> using snd buffer of 416 kb
> Oct  8 19:02:01 hvprxy osips[11468]: DBG:core:init_sock_keepalive: TCP
> keepalive enabled on socket 637
> Oct  8 19:02:01 hvprxy osips[11468]: DBG:core:print_ip: tcpconn_new:
> new tcp connection to: xxx
> Oct  8 19:02:01 hvprxy osips[11468]: DBG:core:tcpconn_new: on port
> 11850, proto 3
> Oct  8 19:02:01 hvprxy osips[11468]: DBG:core:tcpconn_add: hashes: 824, 321
> Oct  8 19:02:01 hvprxy osips[11468]: DBG:core:handle_new_connect: new
> connection: 0x7fafa7e76170 637 flags: 001c
> Oct  8 19:02:01 hvprxy osips[11468]: DBG:core:send2worker: no free tcp
> receiver, connection passed to the least busy one (proc #5, 3 con)
> Oct  8 19:02:01 hvprxy osips[11468]: DBG:core:send2worker: to tcp
> worker 5 (0), 0x7fafa7e76170 rw 1
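>
> (Side note: the "no free tcp receiver" lines just mean every TCP
> worker was busy at that instant, so TCP main handed the connection to
> the least loaded one. The worker count comes from opensips.cfg, along
> the lines of:
>
>     tcp_workers = 8    # number illustrative, not our actual setting
>
> )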
>
> Looking for maybe other interesting things…
>
> Oct  8 19:01:55 hvprxy osips[11397]: DBG:core:tcp_conn_get: tcp
> connection found (0x7fafa84997e8), acquiring fd
> Oct  8 19:01:55 hvprxy osips[11397]: DBG:core:tcp_conn_get: c=
> 0x7fafa84997e8, n=16, Usock=555
> Oct  8 19:01:55 hvprxy osips[11397]: DBG:core:tcp_conn_get: after
> receive_fd: c= 0x7fafa84997e8 n=8 fd=5
> Oct  8 19:01:55 hvprxy osips[11397]: DBG:proto_tls:proto_tls_send:
> after write: c=0x7fafa84997e8 n=483 fd=5
> Oct  8 19:02:01 hvprxy osips[24103]: DBG:core:tcp_conn_get: tcp
> connection found (0x7fafa84997e8), acquiring fd
> Oct  8 19:02:01 hvprxy osips[24103]: DBG:core:tcp_conn_get: c=
> 0x7fafa84997e8, n=16, Usock=647
> Oct  8 19:02:13 hvprxy osips[11398]: DBG:core:tcp_conn_get: tcp
> connection found (0x7fafa84997e8), acquiring fd
> Oct  8 19:02:13 hvprxy osips[11398]: DBG:core:tcp_conn_get: c=
> 0x7fafa84997e8, n=16, Usock=557
> Oct  8 19:02:13 hvprxy osips[11398]: DBG:core:tcp_conn_get: after
> receive_fd: c= 0x7fafa84997e8 n=8 fd=5
> Oct  8 19:02:13 hvprxy osips[11398]: DBG:proto_tls:proto_tls_send:
> after write: c=0x7fafa84997e8 n=360 fd=5
> Oct  8 19:02:13 hvprxy osips[11460]: DBG:proto_tls:tcp_handle_req:
> Nothing more to read on TCP conn 0x7fafa84997e8, currently in state 0
> Oct  8 19:02:43 hvprxy osips[11398]: DBG:core:tcp_conn_get: tcp
> connection found (0x7fafa84997e8), acquiring fd
> Oct  8 19:02:43 hvprxy osips[11398]: DBG:core:tcp_conn_get: c=
> 0x7fafa84997e8, n=16, Usock=557
> Oct  8 19:02:43 hvprxy osips[11398]: DBG:core:tcp_conn_get: after
> receive_fd: c= 0x7fafa84997e8 n=8 fd=5
> Oct  8 19:02:43 hvprxy osips[11398]: DBG:proto_tls:proto_tls_send:
> after write: c=0x7fafa84997e8 n=360 fd=5
> Oct  8 19:02:43 hvprxy osips[11460]: DBG:proto_tls:tcp_handle_req:
> Nothing more to read on TCP conn 0x7fafa84997e8, currently in state 0
> Oct  8 19:02:44 hvprxy osips[11460]: DBG:proto_tls:tcp_handle_req:
> Nothing more to read on TCP conn 0x7fafa84997e8, currently in state 0
> Oct  8 19:02:44 hvprxy osips[24113]: DBG:core:tcp_conn_get: tcp
> connection found (0x7fafa84997e8), acquiring fd
> Oct  8 19:02:44 hvprxy osips[24113]: DBG:core:tcp_conn_get: c=
> 0x7fafa84997e8, n=16, Usock=649
> Oct  8 19:03:13 hvprxy osips[11458]: DBG:core:tcp_conn_get: tcp
> connection found (0x7fafa84997e8), acquiring fd
> Oct  8 19:03:13 hvprxy osips[11458]: DBG:core:tcp_conn_get: c=
> 0x7fafa84997e8, n=16, Usock=625
> Oct  8 19:03:13 hvprxy osips[11458]: DBG:core:tcp_conn_get: after
> receive_fd: c= 0x7fafa84997e8 n=8 fd=950
> Oct  8 19:03:13 hvprxy osips[11458]: DBG:proto_tls:proto_tls_send:
> after write: c=0x7fafa84997e8 n=360 fd=950
> Oct  8 19:03:13 hvprxy osips[11460]: DBG:proto_tls:tcp_handle_req:
> Nothing more to read on TCP conn 0x7fafa84997e8, currently in state 0
>
> All this said, I can't say I'm totally sure what I should be looking
> for. Nothing stands out to me in the logs, but there are a few gigs
> of them, so it's hard to know.
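>
> (For what it's worth, this is roughly how I've been narrowing things
> down -- the log path is from our setup, so adjust as needed:
>
>     grep -E 'CRITICAL|ERROR' /var/log/opensips.log | grep 'Oct  8 19:0'
>     grep 'osips\[24103\]' /var/log/opensips.log | tail -n 50
>
> So far nothing stands out in either.)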
>
> Andrew
>
>
> On Sat, 9 Oct 2021 at 00:19, Bogdan-Andrei Iancu <bogdan at opensips.org>
> wrote:
> >
> > Hi Andrew,
> >
> > The second blocked process (doing the TLS/TCP stuff) surprisingly got
> > stuck while waiting for a TCP fd from the TCP Main process.
> >
> > You mentioned that the logs of the UDP worker (doing the TCP send)
> > suddenly stopped - around that time, do you see any errors from that
> > process or from the TCP main process?
> >
> > Regards,
> >
> > Bogdan-Andrei Iancu
> >
> > OpenSIPS Founder and Developer
> >    https://www.opensips-solutions.com
> > OpenSIPS eBootcamp 2021
> >    https://opensips.org/training/OpenSIPS_eBootcamp_2021/
> >
> > On 10/8/21 2:43 PM, Andrew Yager wrote:
> > > Hi Bogdan-Andrei,
> > >
> > > Have restarted since the last bt, but have recreated again and
> > > attached. Earlier today we did also get another bt full on the second
> > > blocked pid, but I didn't save it. In that case it was a UDP reply
> > > from one of our upstream servers that had gone through mid_registrar
> > > and was being relayed to a TCP endpoint. The TCP endpoint did have an
> > > open file descriptor we could see, and it had sent and was blocked on
> > > receive at the same point (I'm getting better at reading backtraces!
> > > :D).
> > >
> > > The one pattern I do note is that every example I have is a UDP
> > > message received from an upstream server and relayed to a client
> > > over a TCP/TLS connection by a UDP worker.
> > >
> > > While we are using WolfSSL on this box, the other box where we see
> > > the same behaviour (but where I haven't taken backtraces yet) runs
> > > OpenSSL, also on 3.1.3; so this isn't specific to the SSL library.
> > >
> > > I'm going to see if I can get a backtrace from the 3.1.3 box shortly.
> > >
> > > Andrew
> > >
> > > On Fri, 8 Oct 2021 at 17:13, Bogdan-Andrei Iancu <bogdan at opensips.org>
> wrote:
> > >> Hi Andrew,
> > >>
> > >> OK, interesting progress here. So, the FIFO process blocks as it is
> > >> trying to send an IPC JOB to a UDP process which looks like it is
> > >> also blocked.
> > >>
> > >> Could you attach with GDB to that blocked UDP process too ? (you
> > >> have its PID in the printing of pt[x] in the first gdb)
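> > >>
> > >> i.e. something along these lines (the pt[] index is whatever the
> > >> blocked IPC job pointed at):
> > >>
> > >>     (gdb) print pt[4].pid
> > >>     (gdb) detach
> > >>     gdb -p <that pid>
> > >>     (gdb) bt full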
> > >>
> > >> Regards,
> > >>
> > >> Bogdan-Andrei Iancu
> > >>
> > >> OpenSIPS Founder and Developer
> > >>     https://www.opensips-solutions.com
> > >> OpenSIPS eBootcamp 2021
> > >>     https://opensips.org/training/OpenSIPS_eBootcamp_2021/
> > >>
> > >> On 10/8/21 1:43 AM, Andrew Yager wrote:
> > >>> Interestingly, where I usually see a continual stream of messages
> > >>> from a process in the debug log, they appear to stop for this PID
> > >>> at 3:47am, and that process seems blocked on a TCP/TLS send:
> > >>>
> > >>>
> >
>

