[OpenSIPS-Users] WARNING:core:utimer_ticker: utimer task <tm-utimer> already schedualed for

Fri Jul 20 02:02:08 EDT 2018

Hi again.
Guys I need some hints from you, at least personally, so please take a look
at my first message if possible.
Thanks in advance and have a nice day.

2018-07-16 10:34 GMT+03:00 Donat Zenichev <donat.zenichev at gmail.com>:

> Hi community, recently I've got the problems with sudden termination of
> the daemon.
>
> This bunch of similar rows seems to be some timer job that takes longer
> than it was supposed to be:
> Jul 14 02:13:10 some-machine opensips[25060]: message repeated 3 times: [
> WARNING:core:timer_ticker: timer task <nh-timer> already schedualed for
> 10040133060 ms (now 10040134050 ms), it may overlap..]
> Jul 14 08:45:29 some-machine opensips[25060]: WARNING:core:utimer_ticker:
> utimer task <tm-utimer> already schedualed for 10063404100 ms (now
> 10063404200 ms), it may overlap..
> Jul 14 08:45:29 some-machine opensips[25060]: WARNING:core:utimer_ticker:
> utimer task <tm-utimer> already schedualed for 10063404200 ms (now
> 10063404300 ms), it may overlap..
> Jul 14 08:45:29 some-machine opensips[25060]: WARNING:core:utimer_ticker:
> utimer task <tm-utimer> already schedualed for 10063404300 ms (now
> 10063404400 ms), it may overlap..
> Jul 14 08:45:30 some-machine opensips[25060]: WARNING:core:utimer_ticker:
> utimer task <tm-utimer> already schedualed for 10063404400 ms (now
> 10063404500 ms), it may overlap..
> Jul 14 08:45:30 some-machine opensips[25060]: WARNING:core:utimer_ticker:
> utimer task <tm-utimer> already schedualed for 10063404500 ms (now
> 10063404600 ms), it may overlap..
> Jul 14 08:45:30 some-machine opensips[25060]: WARNING:core:utimer_ticker:
> utimer task <tm-utimer> already schedualed for 10063404600 ms (now
> 10063404700 ms), it may overlap..
> Jul 14 08:45:30 some-machine opensips[25060]: WARNING:core:utimer_ticker:
> utimer task <tm-utimer> already schedualed for 10063404700 ms (now
> 10063404800 ms), it may overlap..
> Jul 14 08:45:30 some-machine opensips[25060]: WARNING:core:utimer_ticker:
> utimer task <tm-utimer> already schedualed for 10063404800 ms (now
> 10063404900 ms), it may overlap..
> Jul 14 08:45:30 some-machine opensips[25060]: WARNING:core:utimer_ticker:
> utimer task <tm-utimer> already schedualed for 10063404900 ms (now
> 10063405010 ms), it may overlap..
> Jul 14 08:45:30 some-machine opensips[25060]: WARNING:core:utimer_ticker:
> utimer task <tm-utimer> already schedualed for 10063405010 ms (now
> 10063405110 ms), it may overlap..
> Jul 14 08:45:30 some-machine opensips[25077]: CRITICAL:core:receive_fd:
> EOF on 41
> Jul 14 08:45:30 some-machine opensips[25060]: WARNING:core:utimer_ticker:
> utimer task <tm-utimer> already schedualed for 10063405110 ms (now
> 10063405210 ms), it may overlap..
> Jul 14 08:45:30 some-machine opensips[25060]: WARNING:core:utimer_ticker:
> utimer task <tm-utimer> already schedualed for 10063405210 ms (now
> 10063405310 ms), it may overlap..
> Jul 14 08:45:30 some-machine opensips[25060]: WARNING:core:utimer_ticker:
> utimer task <tm-utimer> already schedualed for 10063405310 ms (now
> 10063405410 ms), it may overlap..
> Jul 14 08:45:31 some-machine opensips[25060]: WARNING:core:utimer_ticker:
> utimer task <tm-utimer> already schedualed for 10063405410 ms (now
> 10063405510 ms), it may overlap..
>
> Then we have the following row, that means that child process 25075 was
> killed by SIGSEGV signal (probably this process tried to appeal to
> non-existent memory cell?).
> Jul 14 08:45:31 some-machine opensips[25054]: INFO:core:handle_sigs: child
> process 25075 exited by a signal 11
>
> I can see in the logs lots of rows about the process 25075 that was
> treating the nathelper module:
> Jul 14 08:45:28 some-machine opensips[25075]: ERROR:nathelper:msg_send:
> send() for proto 2 failed
> Jul 14 08:45:28 some-machine opensips[25075]: ERROR:nathelper:nh_timer:
> sip msg_send failed
> Jul 14 08:45:28 some-machine opensips[25075]: ERROR:nathelper:msg_send:
> send() for proto 2 failed
> Jul 14 08:45:28 some-machine opensips[25075]: ERROR:nathelper:nh_timer:
> sip msg_send failed
>
> Then we have a core dump generated
> Jul 14 08:45:31 some-machine opensips[25054]: INFO:core:handle_sigs: core
> was generated
>
> Then we have a SIGCHLD
> Jul 14 08:45:31 some-machine opensips[25054]: INFO:core:handle_sigs:
> terminating due to SIGCHLD
>
> I'm not sure how this is considered by opensips daemon (afaik a root
> process can ignore such requests) but seems that it caused a termination of
> the root process as well.
> Because just after that we see, that a SIGTERM (killing) for all the
> forked processes had started:
> Jul 14 08:45:31 some-machine opensips[25054]: INFO:core:handle_sigs:
> terminating due to SIGCHLD
> Jul 14 08:45:31 some-machine opensips[25074]: INFO:core:sig_usr: signal 15
> received
> Jul 14 08:45:31 some-machine opensips[25077]: INFO:core:sig_usr: signal 15
> received
> Jul 14 08:45:31 some-machine opensips[25070]: INFO:core:sig_usr: signal 15
> received
> Jul 14 08:45:31 some-machine opensips[25064]: INFO:core:sig_usr: signal 15
> received
> Jul 14 08:45:31 some-machine opensips[25072]: INFO:core:sig_usr: signal 15
> received
> Jul 14 08:45:31 some-machine opensips[25062]: INFO:core:sig_usr: signal 15
> received
> Jul 14 08:45:31 some-machine opensips[25060]: INFO:core:sig_usr: signal 15
> received
> Jul 14 08:45:31 some-machine opensips[25056]: INFO:core:sig_usr: signal 15
> received
> Jul 14 08:45:31 some-machine opensips[25067]: INFO:core:sig_usr: signal 15
> received
> Jul 14 08:45:31 some-machine opensips[25065]: INFO:core:sig_usr: signal 15
> received
> Jul 14 08:45:31 some-machine opensips[25061]: INFO:core:sig_usr: signal 15
> received
> Jul 14 08:45:31 some-machine opensips[25063]: INFO:core:sig_usr: signal 15
> received
> Jul 14 08:45:31 some-machine opensips[25066]: INFO:core:sig_usr: signal 15
> received
> Jul 14 08:45:31 some-machine opensips[25068]: INFO:core:sig_usr: signal 15
> received
> Jul 14 08:45:31 some-machine opensips[25076]: INFO:core:sig_usr: signal 15
> received
> Jul 14 08:45:31 some-machine opensips[25069]: INFO:core:sig_usr: signal 15
> received
> Jul 14 08:45:31 some-machine opensips[25071]: INFO:core:sig_usr: signal 15
> received
> Jul 14 08:45:31 some-machine opensips[25073]: INFO:core:sig_usr: signal 15
> received
> Jul 14 08:45:31 some-machine opensips[25058]: INFO:core:sig_usr: signal 15
> received
> Jul 14 08:45:31 some-machine opensips[25059]: INFO:core:sig_usr: signal 15
> received
> Jul 14 08:45:31 some-machine opensips[25057]: INFO:core:sig_usr: signal 15
> received
> Jul 14 08:45:31 some-machine opensips[25054]: INFO:core:cleanup: cleanup
>
> And it's bizarre, since just one killed sub-process triggered such general
> termination.
>
> Since we see that, the process 25075 was related to nathelper module, the
> problem perhaps is around that.
> I've read several bug articles on the similar issues, and one of the
> answers (from Razvan Crainea) seems to be relevant to our story:
>
> "OpenSIPS pings all users in a single timer run. If there are a lot of
> users, this might take a lot, especially if users are using TCP, and
> some of them have a stale connection.
> Increasing the natping_partitions partitions will basically split all
> the users in smaller sets, and will spread them among multiple job runs,
> thus multiple processes. Therefore increasing this parameter will
> definitely help you.
> The range is not a magic value . It depends on the amount of users,
> bandwidth, etc. It makes sense to set this parameter from 1 to the
> maximum number of processes provisioned - anything else will probably
> make timer jobs overlap. "
>
> For now we do have this parameter set to:
> modparam("nathelper", "natping_partitions", 4)
>
> So it performs nat pinging separately, each part has 1/4 part of users.
>
> Actually it seems pretty enough to work properly, but seems that nathelper
> accumulates the "dead" users (that are not available anymore) and tries to
> ping them. But it's just my assumption, because we have to many rows with
> the same error (not only this time and with this child process, they are
> spread all a log), I mean like that:
> Jul 14 08:45:28 some-machine opensips[25075]: ERROR:nathelper:msg_send:
> send() for proto 2 failed
> Jul 14 08:45:28 some-machine opensips[25075]: ERROR:nathelper:nh_timer:
> sip msg_send failed
>
> And parameters that can have an impact on this situation:
> modparam("nathelper", "natping_interval", 30)
> modparam("nathelper", "natping_partitions", 4)
>
> My idea is to increase the partitions, probably to value "8" ?
> And probably decrease the interval to 10 seconds.
>
> What would you advice?
>
> --
> --
> BR, Donat Zenichev
>
>

-- 
-- 
BR, Donat Zenichev
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.opensips.org/pipermail/users/attachments/20180720/6d98744a/attachment-0001.html>