[OpenSIPS-Users] usrloc restart persistency on seed node

Liviu Chircu liviu at opensips.org
Thu Jan 3 05:33:50 EST 2019

Happy New Year John, Alexey and everyone else!

I just finished catching up with this thread, and I must admit that I now
concur with John's distaste for the asymmetric nature of cluster node restarts.

Although it is correct and gets the job done, the 2.4 "seed" mechanism requires
the admin to conditionally add an "opensipsctl fifo ul_cluster_sync" command
into the startup script of all "seed" nodes.  I think we can do better :)
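
To make it concrete, the workaround looks roughly like the snippet below
(purely illustrative, not part of OpenSIPS; the SEED_NODE flag is a made-up,
deployment-specific marker):

#!/usr/bin/env python
# Illustration only: roughly the kind of post-start hook an admin has to add
# today.  SEED_NODE is a hypothetical, deployment-specific flag, not an
# existing OpenSIPS setting.
import os
import subprocess

if os.environ.get("SEED_NODE") == "yes":
    # ask the freshly restarted seed node to pull usrloc data from the cluster
    subprocess.run(["opensipsctl", "fifo", "ul_cluster_sync"], check=True)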

What if we kept the "seed" concept, but tweaked it such that instead of 

"following a restart, always start in 'synced' state, with an empty dataset"

... it would now mean:

"following a restart or cluster sync command, fall back to a 'synced' state,
with an empty dataset if and only if we are unable to find a suitable sync
candidate within X seconds"

This solution seems to fit all requirements that I've seen posted so 
far.  It is:

* correct (a cluster with at least 1 "seed" node will still never deadlock)
* symmetric (with the exception of cluster bootstrapping, all node restarts
  are identical)
* autonomous (users need not even know about "ul_cluster_sync" anymore!
  Not saying this is necessarily good, but it brings down the learning curve)

The only downside could be that any cluster bootstrap will now last at least
X seconds.  But that seems such a rare event (in production, at least) that we
need not worry about it.  Furthermore, the X seconds will be configurable.
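
To make the proposal more concrete, here is a rough Python sketch of the
behaviour I have in mind (the Node class, its methods and the SYNC_TIMEOUT
value are all invented for illustration; none of this is actual OpenSIPS code):

import time

SYNC_TIMEOUT = 10.0   # the configurable "X seconds"

class Node:
    """Minimal stand-in for a cluster node, just enough to show the idea."""

    def __init__(self, is_seed):
        self.is_seed = is_seed
        self.state = "not synced"
        self.dataset = {}            # in-memory usrloc contacts, simplified

    def find_sync_candidate(self):
        """Return another node already in 'synced' state, or None.
        In reality this would ask the clusterer layer."""
        return None                  # stubbed out for the sketch

    def pull_dataset_from(self, donor):
        self.dataset = dict(donor.dataset)

    def startup_sync(self):
        """Proposed behaviour after a restart or a cluster sync command."""
        deadline = time.time() + SYNC_TIMEOUT
        while time.time() < deadline:
            donor = self.find_sync_candidate()
            if donor is not None:    # normal path, same as any other node
                self.pull_dataset_from(donor)
                self.state = "synced"
                return
            time.sleep(0.5)
        if self.is_seed:
            # only when no donor shows up within X seconds does a seed node
            # fall back to "synced" with an empty dataset (the old behaviour)
            self.dataset.clear()
            self.state = "synced"
        # a non-seed node simply stays "not synced" and keeps waiting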

What do you think?

PS: by "cluster bootstrap" I mean (re)starting all nodes simultaneously.

Best regards,

Liviu Chircu
OpenSIPS Developer

On 02.01.2019 12:24, John Quick wrote:
> Alexey,
> Thanks for your feedback. I acknowledge that, in theory, a situation may
> arise where a node is brought online and all the previously running nodes
> were not fully synchronised, so it is then a problem for the newly started
> node to know which data set to pull. In addition to the example you give -
> lost interconnection - I can also foresee difficulties when several nodes
> all start at the same time. However, I do not see how arbitrarily setting
> one node as "seed" will help to resolve either of these situations unless
> the seed node has more (or better) information than the others.
> I am trying to design a multi-node solution that is scalable. I want to be
> able to add and remove nodes according to current load. Also, to be able to
> take one node offline, do some maintenance, then bring it back online. For
> my scenario, the probability of any node being taken offline for maintenance
> during the year is 99.9% whereas I would say the probability of partial loss
> of LAN connectivity (causing the split-brain issue) is less than 0.01%.
> If possible, I would really like to see an option added to the usrloc module
> to override the "seed" node concept. Something that allows any node
> (including seed) to attempt to pull registration details from another node
> on startup. In my scenario, a newly started node with no usrloc data is a
> major problem - it could take it 40 minutes to get close to having a full
> set of registration data. I would prefer to take the risk of it pulling data
> from the wrong node rather than it not attempting to synchronise at all.
> Happy New Year to all.
> John Quick
> Smartvox Limited
>> Hi John,
>> The following is just my opinion, and I haven't explored the OpenSIPS
>> source code for data syncing.
>> The problem is a little bit deeper. As we have a cluster, we potentially
>> have split-brain.
>> We could disable the seed node altogether and just let the nodes work after
>> a disaster/restart. But that means we can't guarantee data consistency, so
>> the nodes must show this with a <Not in sync> state.
>> Usually clusters use a quorum as the trust point, but for OpenSIPS I think
>> this approach is too expensive. And of course, a quorum needs a minimum of
>> 3 hosts.
>> With 2 hosts, after losing/restoring the interconnection it is impossible
>> to say which host has consistent data. That's why OpenSIPS uses the seed
>> node as an artificial trust point. I think the <seed> node doesn't solve
>> the syncing problems, but it simplifies the overall work.
>> Let's imagine 3 nodes A, B, C. A is Active. A and B lose their
>> interconnection. C is down. Then C comes up and has 2 hosts to sync from.
>> But A already has 200 phones re-registered for some reason, so we have 200
>> conflicts (on node B the same phones are still in memory). Where to sync
>> from? The <seed> host will answer this question in 2 of the cases (A or B).
>> Of course, if C is the <seed>, it will just be happy from the start. And I
>> actually don't know what happens if we now run <ul_cluster_sync> on C. Will
>> it get all the contacts from A and B or not?
>> We operate with specific data which is temporary, so the syncing policy can
>> be more relaxed. Maybe it's a good idea to somehow tie the <seed> node to
>> the Active role in the cluster. But again, if the Active node restarts and
>> is still Active, we will have a problem.
>> -----
>> Alexey Vasilyev
