batman goes looping...

Message ID 200908031821.22497.lindner_marek@yahoo.de (mailing list archive)
State RFC, archived
Headers

Commit Message

Marek Lindner Aug. 3, 2009, 10:21 a.m. UTC
  On Thursday 30 July 2009 22:32:04 Yang Su wrote:
> 1.  Similar to what Marek proposed. But push it to the extreme: switch to a
> neighbor only when it has the best tq AND it has the newest seqno.  This
> fix along seems to already solve the looping problem in the test cases. It
> reduces the rerouting time from more than 1 minutes to less than 15
> seconds.  Marek: I also tried the patch you sent. It didn't help in this
> setup.

Changing the route based on the (fastest) sequence number has some drawbacks 
which we experienced before (BATMAN III). It tends to discriminate longer but 
better routes and favors short paths. In some asymetric link (worst case) 
scenarios it would route against all odds, simply because receiving a fast 
packet (newest seqno) does not mean we can send the same way back. If possible 
I'd like to avoid this strict seqno check.

I attached another patch which will conduct a route switch only if the TQ of 
the sending neighbor is better than our current best route (no negative 
switching anymore). I tested the patch here and it works so far but my 
environment is less controlled than yours. Could you perform the same test on 
your setup ?

If you intend to experiment with other ideas watch out for the sequence number 
as it will overflow. A simple "greater than" check might lead to strange 
results. Also, updates_routes() checks for changed HNA messages even if the 
next hop does not change.


> 2. Relaxed echo cancellation. This is based on  the following observation:
> the TQ value that a node puts into OGM is completely decoupled with "from
> which neighbor this OGM is received".   As a result, the TQ value contained
> in the echoed OGM represent the real TQ value at the neighbor which echoed
> this OGM. The current echo cancellation implementation just drops all the
> echoed OGM. This may prevent the node from updating the information towards
> the neighbor that echos the OGM. In the extreme case, the information
> towards that neighbor may becomes completely stale (similar to what happens
> in case 2).  The change I  made:  Always check the TQ contained in the
> echoed OGMs. When it is worse than the avg TQ towards that neighbor, we use
> this TQ reading to update the avg TQ towards that neighbor. This change
> didn't show any effect during the chain tests. However, I still include
> this change in the patch to bring up the discussion.

Right, every node will emit his currently best TQ value but I did not 
understand how we can use that. If we send him a better TQ he will send back 
that number. If we send a bad TQ he will send his good number. Furthermore, 
each hop will apply some asymetric / hop / wifi penalty that we pull into our 
routing database ?

Regards,
Marek
  

Comments

nick farrow Aug. 4, 2009, 3:59 p.m. UTC | #1
Hi,

I have 3 linksys wrt54 flashed with openwrt 7.09, webif'ed and installed BATMAN 0.2-rv478.1, all OK I think.

I have arranged the 3 nodes such that node 1 can only see 2 , and  2 only  3, so no direct contact is available from 1 to 3 (except via the mesh)

I have not broken the br-lan configuration of these devices and assigned 172.16.1.1/24 to node 1 172.16.1.2/24 to node 2 172.16.1.3/24 to node 3, so I think that means all packets recvd on wl0 or eh0 are bridged.

I have a pc hooked into the lan port of node 3 From here I can ssh into node3 (via the ethernet) and into node2 via the wifi (but not .1)  Netstat -rn on node  3 indicates that 172.16.1.1 (node1) is via 172.16.1.2 and this is confirmed by the batman web if that shows .1 is reached via .2 - all great.

From the ssh prompt on .3 I can ping .2 with no loss and ping .1 with  ~ 6% loss. However, if I try to ping .1 from the pc connected on the ethernet port of .3 I get no route to host. Putting wireshark on the ethernet ifc shows the arp going out but nothing coming back. Interestingly enough if I ssh into .3 and ping .1, with wireshark running on the ethernet of .3 Is see the outgoing ICMP but not the reply (which is there because the ping succeeds) . It seems as if the local host traffic is not truly bridged.

This is the first attempt so I might have some incorrect/missing config or done something stupid, any ideas would be welcome!

Thanks very much

nick
  
Yang Su Aug. 6, 2009, 8:44 a.m. UTC | #2
|------------>
| From:      |
|------------>
  >--------------------------------------------------------------------------------------------------------------------------------------------------|
  |Marek Lindner <lindner_marek@yahoo.de>                                                                                                            |
  >--------------------------------------------------------------------------------------------------------------------------------------------------|
|------------>
| To:        |
|------------>
  >--------------------------------------------------------------------------------------------------------------------------------------------------|
  |"The list for a Better Approach To Mobile Ad-hoc Networking" <b.a.t.m.a.n@lists.open-mesh.net>                                                    |
  >--------------------------------------------------------------------------------------------------------------------------------------------------|
|------------>
| Date:      |
|------------>
  >--------------------------------------------------------------------------------------------------------------------------------------------------|
  |03.08.2009 12:23                                                                                                                                  |
  >--------------------------------------------------------------------------------------------------------------------------------------------------|
|------------>
| Subject:   |
|------------>
  >--------------------------------------------------------------------------------------------------------------------------------------------------|
  |Re: [B.A.T.M.A.N.] batman goes looping...                                                                                                         |
  >--------------------------------------------------------------------------------------------------------------------------------------------------|
|------------>
| Sent by:   |
|------------>
  >--------------------------------------------------------------------------------------------------------------------------------------------------|
  |b.a.t.m.a.n-bounces@lists.open-mesh.net                                                                                                           |
  >--------------------------------------------------------------------------------------------------------------------------------------------------|








On Thursday 30 July 2009 22:32:04 Yang Su wrote:
> 1.  Similar to what Marek proposed. But push it to the extreme: switch to
a
> neighbor only when it has the best tq AND it has the newest seqno.  This
> fix along seems to already solve the looping problem in the test cases.
It
> reduces the rerouting time from more than 1 minutes to less than 15
> seconds.  Marek: I also tried the patch you sent. It didn't help in this
> setup.

Changing the route based on the (fastest) sequence number has some
drawbacks
which we experienced before (BATMAN III). It tends to discriminate longer
but
better routes and favors short paths. In some asymetric link (worst case)
scenarios it would route against all odds, simply because receiving a fast
packet (newest seqno) does not mean we can send the same way back. If
possible
I'd like to avoid this strict seqno check.

- You are right. Under certain circumstance, this approach may lead to less
optimal route selection. Just to understand batman algorithm better, I try
to make an example (please correct me if something's wrong here): There are
two direct neighbors via which a sender communicates with a receiver. Via
the good neighbor, their is a single good route with long delay. Between
the bad neighbor and the receiver there are many parallel bad routes. Each
bad route has high packet loss rate but very short delay. The receiver's
OGMs arrive at the sender first via the bad neighbor. Although there is
substantial OGM loss on each individual bad routes, by aggregation over
those parallel bad routes, the bad neighbor is still able to relay most of
sender's OGM to the receiver. As a result, if the sender already chooses
the bad neighbor as the next hop, it will stick to it for quite long time
and does not switch to the good neighbor.

I attached another patch which will conduct a route switch only if the TQ
of the sending neighbor is better than our current best route (no negative
switching anymore). I tested the patch here and it works so far but my
environment is less controlled than yours. Could you perform the same test
on your setup ?

- This approach achieves the similar goal as the seqno-based one: switch to
a neighbor only when this neighbor is better and has more up-to-date
information than the current best route. But this approach is less
constrained and seems not to suffer from the above problem for the
seqno-based approach. I have tested the patch on our test environment. It
works pretty well on the test cases.

> 2. Relaxed echo cancellation. This is based on  the following
observation:
> the TQ value that a node puts into OGM is completely decoupled with "from
> which neighbor this OGM is received".   As a result, the TQ value
contained
> in the echoed OGM represent the real TQ value at the neighbor which
echoed
> this OGM. The current echo cancellation implementation just drops all the
> echoed OGM. This may prevent the node from updating the information
towards
> the neighbor that echos the OGM. In the extreme case, the information
> towards that neighbor may becomes completely stale (similar to what
happens
> in case 2).  The change I  made:  Always check the TQ contained in the
> echoed OGMs. When it is worse than the avg TQ towards that neighbor, we
use
> this TQ reading to update the avg TQ towards that neighbor. This change
> didn't show any effect during the chain tests. However, I still include
> this change in the patch to bring up the discussion.

Right, every node will emit his currently best TQ value but I did not
understand how we can use that. If we send him a better TQ he will send
back
that number. If we send a bad TQ he will send his good number. Furthermore,

each hop will apply some asymetric / hop / wifi penalty that we pull into
our
routing database ?

- The problem might happen when the neighbor thinks that I am his best
neighbor (his good number is actually from me). In this case, my estimation
of that neighbor's TQ is actually decided by my own TQ (my TQ minus one hop
penalty). When my TQ drops down, the correct behavior should be that my
estimation of that neighbor's TQ also drops down accordingly. However, when
echo cancellation steps in, my estimation of that neighbor's TQ will not be
updated and remains to be the stale value. This might cause problems (e.g.
looping) in corner cases even with the new patch.  Does this analysis make
sense or I miss some details of the echo cancellation?

Regards,
Marek

[attachment "switch-route-with-better-tq.patch" deleted by Yang
Su/NSUYAN/CH/Ascom] _______________________________________________
B.A.T.M.A.N mailing list
B.A.T.M.A.N@lists.open-mesh.net
https://lists.open-mesh.net/mm/listinfo/b.a.t.m.a.n
  
Marek Lindner Aug. 6, 2009, 2:15 p.m. UTC | #3
On Thursday 06 August 2009 16:44:28 Yang Su wrote:
> - You are right. Under certain circumstance, this approach may lead to less
> optimal route selection. Just to understand batman algorithm better, I try
> to make an example (please correct me if something's wrong here): There are
> two direct neighbors via which a sender communicates with a receiver. Via
> the good neighbor, their is a single good route with long delay. Between
> the bad neighbor and the receiver there are many parallel bad routes. Each
> bad route has high packet loss rate but very short delay. The receiver's
> OGMs arrive at the sender first via the bad neighbor. Although there is
> substantial OGM loss on each individual bad routes, by aggregation over
> those parallel bad routes, the bad neighbor is still able to relay most of
> sender's OGM to the receiver. As a result, if the sender already chooses
> the bad neighbor as the next hop, it will stick to it for quite long time
> and does not switch to the good neighbor.

It is possble to simplify your example more to clearly illustrate the problem. 
Scenario: Host A wants to send data to host B. Consider the beautiful graphic 
I attached. ;-)
The link between A and B is quite asymetric. No packet loss from B to A but 
heavy packet loss the other way round. Now, B periodically sends its OGMs and 
all OGMs will arrive at A directly and via the neighbor whereas the OGMs via 
the neighbor always be slower than the directly received ones. A should choose 
the longer path via the neighbor because it has no packet loss but the 
"fastest sequence number wins" algorithm won't allow that. 


> - This approach achieves the similar goal as the seqno-based one: switch to
> a neighbor only when this neighbor is better and has more up-to-date
> information than the current best route. But this approach is less
> constrained and seems not to suffer from the above problem for the
> seqno-based approach. I have tested the patch on our test environment. It
> works pretty well on the test cases.

Great! I comitted the patch immediately. Thanks again for your thorough 
analysis and ideas. :-)


> - The problem might happen when the neighbor thinks that I am his best
> neighbor (his good number is actually from me). In this case, my estimation
> of that neighbor's TQ is actually decided by my own TQ (my TQ minus one hop
> penalty). When my TQ drops down, the correct behavior should be that my
> estimation of that neighbor's TQ also drops down accordingly. However, when
> echo cancellation steps in, my estimation of that neighbor's TQ will not be
> updated and remains to be the stale value. This might cause problems (e.g.
> looping) in corner cases even with the new patch.  Does this analysis make
> sense or I miss some details of the echo cancellation?

Unless you found another bug the echo cancellation should not hinder the 
propagation of our own TQ value. Once we (uml2) received a packet we will 
rebroadcast it. Our neighbor (uml1) will receive it (unless the packet got 
lost) and update its routing tables accordingly. Now, he will rebroadcast the 
packet as well which will be killed by the echo cancellation on uml1 as this 
packet does not contain new information. 
But even if the TQ received via uml2 is worse uml1 might still have a good TQ 
from uml3 in its routing table which might get rebroadcasted.

Regards,
Marek
  
Yang Su Aug. 7, 2009, 3:13 p.m. UTC | #4
> - This approach achieves the similar goal as the seqno-based one: switch
to
> a neighbor only when this neighbor is better and has more up-to-date
> information than the current best route. But this approach is less
> constrained and seems not to suffer from the above problem for the
> seqno-based approach. I have tested the patch on our test environment. It
> works pretty well on the test cases.

Great! I comitted the patch immediately. Thanks again for your thorough
analysis and ideas. :-)

- Some updates: I did more thorough testing. It turns out that this patch
does really solve the problem.  This method seems not able to completely
eliminate all possible forms of looping which happen in the test cases.
This result in an special pattern: the rerouting is fast (~15 seconds),
then (after 3~ 5 seconds) the network enters the looping state and it
normally takes long time to recover.  Last time when I did tests, I stop
the tests right after I observed the successful rerouting. I didn't look
into the cause of this problem yet. Any inputs are welcome.

> - The problem might happen when the neighbor thinks that I am his best
> neighbor (his good number is actually from me). In this case, my
estimation
> of that neighbor's TQ is actually decided by my own TQ (my TQ minus one
hop
> penalty). When my TQ drops down, the correct behavior should be that my
> estimation of that neighbor's TQ also drops down accordingly. However,
when
> echo cancellation steps in, my estimation of that neighbor's TQ will not
be
> updated and remains to be the stale value. This might cause problems
(e.g.
> looping) in corner cases even with the new patch.  Does this analysis
make
> sense or I miss some details of the echo cancellation?

Unless you found another bug the echo cancellation should not hinder the
propagation of our own TQ value. Once we (uml2) received a packet we will
rebroadcast it. Our neighbor (uml1) will receive it (unless the packet got
lost) and update its routing tables accordingly. Now, he will rebroadcast
the
packet as well which will be killed by the echo cancellation on uml1 as
this
packet does not contain new information.
But even if the TQ received via uml2 is worse uml1 might still have a good
TQ
from uml3 in its routing table which might get rebroadcasted.

- In the following example (also appended in the attachment :) ), if I
understand the current echo cancellation implementation correctly, batman
will enter permanent looping between A and B. In this example, A send to F,
all the links are perfect and have the same delay. Only exception is link
A-E. It is an asymmetric link.

Regards,
Yang
(See attached file: looping.pdf)
  
Marek Lindner Aug. 7, 2009, 3:58 p.m. UTC | #5
On Friday 07 August 2009 23:13:46 Yang Su wrote:
> - Some updates: I did more thorough testing. It turns out that this patch
> does really solve the problem.  This method seems not able to completely
> eliminate all possible forms of looping which happen in the test cases.
> This result in an special pattern: the rerouting is fast (~15 seconds),
> then (after 3~ 5 seconds) the network enters the looping state and it
> normally takes long time to recover.  Last time when I did tests, I stop
> the tests right after I observed the successful rerouting. I didn't look
> into the cause of this problem yet. Any inputs are welcome.

Can you provide logs from the nodes involved ? You also can send them off-list 
to avoid sending too big attachments.


> - In the following example (also appended in the attachment :) ), if I
> understand the current echo cancellation implementation correctly, batman
> will enter permanent looping between A and B. In this example, A send to F,
> all the links are perfect and have the same delay. Only exception is link
> A-E. It is an asymmetric link.

Why do you think it would loop permanently and why do you think it is the 
fault of the echo cancellation ?
Actually, this example should[tm] be pretty easy. Every time the OGM from E or 
F arrive at A via the asymetric link A will apply a severe asymetric link 
penalty. All nodes will route via the longer path. Did you try to run this 
scenario in your testbed ?

Regards,
Marek
  
Yang Su Aug. 10, 2009, 7:20 a.m. UTC | #6
For the echo cancellation example: the problem happens with the interaction
between A and B. Because of the asym link, A chooses B as its next neighbor
towards F. Because the asym path is short, B always receives F's OGM from A
first  (before B receives the OGM with the same seqno from C). As a result,
from A's point of view, F's OGMs rebroadcasted by B are always echoes and A
will never update the TQ information towards F via B.

When the quality of any links on the path B to F drops down, B might choose
A as its next neighbor towards F and a loop forms between A and B.

Regards,
Yang





On Friday 07 August 2009 23:13:46 Yang Su wrote:
> - Some updates: I did more thorough testing. It turns out that this patch
> does really solve the problem.  This method seems not able to completely
> eliminate all possible forms of looping which happen in the test cases.
> This result in an special pattern: the rerouting is fast (~15 seconds),
> then (after 3~ 5 seconds) the network enters the looping state and it
> normally takes long time to recover.  Last time when I did tests, I stop
> the tests right after I observed the successful rerouting. I didn't look
> into the cause of this problem yet. Any inputs are welcome.

Can you provide logs from the nodes involved ? You also can send them
off-list
to avoid sending too big attachments.


> - In the following example (also appended in the attachment :) ), if I
> understand the current echo cancellation implementation correctly, batman
> will enter permanent looping between A and B. In this example, A send to
F,
> all the links are perfect and have the same delay. Only exception is link
> A-E. It is an asymmetric link.

Why do you think it would loop permanently and why do you think it is the
fault of the echo cancellation ?
Actually, this example should[tm] be pretty easy. Every time the OGM from E
or
F arrive at A via the asymetric link A will apply a severe asymetric link
penalty. All nodes will route via the longer path. Did you try to run this
scenario in your testbed ?

Regards,
Marek
  
Marek Lindner Aug. 10, 2009, 7:48 a.m. UTC | #7
On Monday 10 August 2009 15:20:28 Yang Su wrote:
> For the echo cancellation example: the problem happens with the interaction
> between A and B. Because of the asym link, A chooses B as its next neighbor
> towards F. Because the asym path is short, B always receives F's OGM from A
> first  (before B receives the OGM with the same seqno from C). As a result,
> from A's point of view, F's OGMs rebroadcasted by B are always echoes and A
> will never update the TQ information towards F via B.
>
> When the quality of any links on the path B to F drops down, B might choose
> A as its next neighbor towards F and a loop forms between A and B.

Ah, here we have the misunderstanding. Let me briefly outline how the echo 
cancellation works (I refer to your PDF-example):

1. Node C emits a packet.
2. Node B receives it and writes the address of Node C in the previous sender 
field because it received that packet via C. Now, B rebroadcasts the OGM.
3. Node C receives the packet it sent before and discards it as the packet 
contains its own address in the previous sender field.
4. Node A happily receives the packet (and goes back to step 2).

The echo cancellation will not kill the packets coming via the longer path. 

Does this help ? 

Regards,
Marek
  
Yang Su Aug. 10, 2009, 7:57 a.m. UTC | #8
|------------>
| From:      |
|------------>
  >--------------------------------------------------------------------------------------------------------------------------------------------------|
  |Marek Lindner <lindner_marek@yahoo.de>                                                                                                            |
  >--------------------------------------------------------------------------------------------------------------------------------------------------|
|------------>
| To:        |
|------------>
  >--------------------------------------------------------------------------------------------------------------------------------------------------|
  |"The list for a Better Approach To Mobile Ad-hoc Networking" <b.a.t.m.a.n@lists.open-mesh.net>                                                    |
  >--------------------------------------------------------------------------------------------------------------------------------------------------|
|------------>
| Date:      |
|------------>
  >--------------------------------------------------------------------------------------------------------------------------------------------------|
  |10.08.2009 09:50                                                                                                                                  |
  >--------------------------------------------------------------------------------------------------------------------------------------------------|
|------------>
| Subject:   |
|------------>
  >--------------------------------------------------------------------------------------------------------------------------------------------------|
  |Re: [B.A.T.M.A.N.] batman goes looping...                                                                                                         |
  >--------------------------------------------------------------------------------------------------------------------------------------------------|
|------------>
| Sent by:   |
|------------>
  >--------------------------------------------------------------------------------------------------------------------------------------------------|
  |b.a.t.m.a.n-bounces@lists.open-mesh.net                                                                                                           |
  >--------------------------------------------------------------------------------------------------------------------------------------------------|








On Monday 10 August 2009 15:20:28 Yang Su wrote:
> For the echo cancellation example: the problem happens with the
interaction
> between A and B. Because of the asym link, A chooses B as its next
neighbor
> towards F. Because the asym path is short, B always receives F's OGM from
A
> first  (before B receives the OGM with the same seqno from C). As a
result,
> from A's point of view, F's OGMs rebroadcasted by B are always echoes and
A
> will never update the TQ information towards F via B.
>
> When the quality of any links on the path B to F drops down, B might
choose
> A as its next neighbor towards F and a loop forms between A and B.

Ah, here we have the misunderstanding. Let me briefly outline how the echo
cancellation works (I refer to your PDF-example):

1. Node C emits a packet.
2. Node B receives it and writes the address of Node C in the previous
sender
field because it received that packet via C. Now, B rebroadcasts the OGM.

-- B has already rebroadcasted an earlier OGM with the same sequence number
(received from A). Will B rebroadcast the OGM with the same sequence number
(received from C) again?

3. Node C receives the packet it sent before and discards it as the packet
contains its own address in the previous sender field.
4. Node A happily receives the packet (and goes back to step 2).


Cheers,
Yang
  
Marek Lindner Aug. 10, 2009, 8:16 a.m. UTC | #9
On Monday 10 August 2009 15:57:29 Yang Su wrote:
> -- B has already rebroadcasted an earlier OGM with the same sequence number
> (received from A). Will B rebroadcast the OGM with the same sequence number
> (received from C) again?

Excellent question - it should. I will check the code later to verify that 
this is the case. 
However, this is not related to the echo cancellation anymore.

Regards,
Marek
  
Marek Lindner Aug. 11, 2009, 4:42 p.m. UTC | #10
On Monday 10 August 2009 16:16:49 Marek Lindner wrote:
> Excellent question - it should. I will check the code later to verify that
> this is the case.
> However, this is not related to the echo cancellation anymore.

I inspected the code for behaviour in the case you outlined. I think our 
dropping policy might be a bit too strict. I suggest the following behaviour:

* New OGMs are processed and rebroadcasted as usual.
* Known OGMs (the sequence number is not new but we did not hear this sequence 
number via that neighbor) are processed and rebroadcasted as well.
* Duplicates (known sequence number via a neighbor that sent us this sequence 
number before are dropped.

Objections / ideas ?

Regards,
Marek
  
Yang Su Aug. 12, 2009, 2:34 p.m. UTC | #11
Given the DV nature of Batman, current strict dropping policy seems to make
sense, because sending known OGMs (as suggested) does introduce significant
overhead but does not propagate so much new information.

In addition, loosing the dropping policy may introduce new problems which
might not be easily predictable at this moment, e.g., stability, looping
(propagating stale information within the network might cause even more
nasty looping problems).

Another way (and maybe simpler ) to fix the problem: instead of loosing the
OGM dropping policy,  loosing the echo cancellation policy.

It is a very simple fix: in addition to the usual echo cancellation
process, I also check  the TQ value contained in the echoed OGM.  This TQ
is announced by the neighbor that broadcasts this echo. If this TQ is worse
than my current TQ estimation towards that neighbor, I update the
estimation with this TQ.

Does that make sense?

Regards,
Yang



|------------>
| From:      |
|------------>
  >--------------------------------------------------------------------------------------------------------------------------------------------------|
  |Marek Lindner <lindner_marek@yahoo.de>                                                                                                            |
  >--------------------------------------------------------------------------------------------------------------------------------------------------|
|------------>
| To:        |
|------------>
  >--------------------------------------------------------------------------------------------------------------------------------------------------|
  |"The list for a Better Approach To Mobile Ad-hoc Networking" <b.a.t.m.a.n@lists.open-mesh.net>                                                    |
  >--------------------------------------------------------------------------------------------------------------------------------------------------|
|------------>
| Date:      |
|------------>
  >--------------------------------------------------------------------------------------------------------------------------------------------------|
  |11.08.2009 18:44                                                                                                                                  |
  >--------------------------------------------------------------------------------------------------------------------------------------------------|
|------------>
| Subject:   |
|------------>
  >--------------------------------------------------------------------------------------------------------------------------------------------------|
  |Re: [B.A.T.M.A.N.] batman goes looping...                                                                                                         |
  >--------------------------------------------------------------------------------------------------------------------------------------------------|
|------------>
| Sent by:   |
|------------>
  >--------------------------------------------------------------------------------------------------------------------------------------------------|
  |b.a.t.m.a.n-bounces@lists.open-mesh.net                                                                                                           |
  >--------------------------------------------------------------------------------------------------------------------------------------------------|





On Monday 10 August 2009 16:16:49 Marek Lindner wrote:
> Excellent question - it should. I will check the code later to verify
that
> this is the case.
> However, this is not related to the echo cancellation anymore.

I inspected the code for behaviour in the case you outlined. I think our
dropping policy might be a bit too strict. I suggest the following
behaviour:

* New OGMs are processed and rebroadcasted as usual.
* Known OGMs (the sequence number is not new but we did not hear this
sequence
number via that neighbor) are processed and rebroadcasted as well.
* Duplicates (known sequence number via a neighbor that sent us this
sequence
number before are dropped.

Objections / ideas ?

Regards,
Marek
  
Marek Lindner Aug. 13, 2009, 9:56 a.m. UTC | #12
On Wednesday 12 August 2009 22:34:17 Yang Su wrote:
> Given the DV nature of Batman, current strict dropping policy seems to make
> sense, because sending known OGMs (as suggested) does introduce significant
> overhead but does not propagate so much new information.
>
> In addition, loosing the dropping policy may introduce new problems which
> might not be easily predictable at this moment, e.g., stability, looping
> (propagating stale information within the network might cause even more
> nasty looping problems).

I agree that this might introduce new problems which is the reason for posting 
it here. I definitely see the overhead created by that methog. Every better 
idea will be accepted immediately.  :-)


> Another way (and maybe simpler ) to fix the problem: instead of loosing the
> OGM dropping policy,  loosing the echo cancellation policy.
>
> It is a very simple fix: in addition to the usual echo cancellation
> process, I also check  the TQ value contained in the echoed OGM.  This TQ
> is announced by the neighbor that broadcasts this echo. If this TQ is worse
> than my current TQ estimation towards that neighbor, I update the
> estimation with this TQ.

In your testbed it fixed the issue ?
Looking at the example you gave I would expect that the problem is only solved 
partially. You argued the OGMs from E via A would always arrive faster at B 
compared to the longer path. Using your method would probably avoid the loop 
between A and B in the event of a packet loss on the longer path but it would 
not let route A via the longer path. Am I mistaken ?

Regards,
Marek
  
Yang Su Aug. 13, 2009, 3:07 p.m. UTC | #13
Maybe I miss some thing here. What do you think would prevent A from
choosing via the long route (via B) ?

Cheers,
Yang


|------------>
| From:      |
|------------>
  >--------------------------------------------------------------------------------------------------------------------------------------------------|
  |Marek Lindner <lindner_marek@yahoo.de>                                                                                                            |
  >--------------------------------------------------------------------------------------------------------------------------------------------------|
|------------>
| To:        |
|------------>
  >--------------------------------------------------------------------------------------------------------------------------------------------------|
  |"The list for a Better Approach To Mobile Ad-hoc Networking" <b.a.t.m.a.n@lists.open-mesh.net>                                                    |
  >--------------------------------------------------------------------------------------------------------------------------------------------------|
|------------>
| Date:      |
|------------>
  >--------------------------------------------------------------------------------------------------------------------------------------------------|
  |13.08.2009 11:59                                                                                                                                  |
  >--------------------------------------------------------------------------------------------------------------------------------------------------|
|------------>
| Subject:   |
|------------>
  >--------------------------------------------------------------------------------------------------------------------------------------------------|
  |Re: [B.A.T.M.A.N.] batman goes looping...                                                                                                         |
  >--------------------------------------------------------------------------------------------------------------------------------------------------|
|------------>
| Sent by:   |
|------------>
  >--------------------------------------------------------------------------------------------------------------------------------------------------|
  |b.a.t.m.a.n-bounces@lists.open-mesh.net                                                                                                           |
  >--------------------------------------------------------------------------------------------------------------------------------------------------|





On Wednesday 12 August 2009 22:34:17 Yang Su wrote:
> Given the DV nature of Batman, current strict dropping policy seems to
make
> sense, because sending known OGMs (as suggested) does introduce
significant
> overhead but does not propagate so much new information.
>
> In addition, loosing the dropping policy may introduce new problems which
> might not be easily predictable at this moment, e.g., stability, looping
> (propagating stale information within the network might cause even more
> nasty looping problems).

I agree that this might introduce new problems which is the reason for
posting
it here. I definitely see the overhead created by that methog. Every better

idea will be accepted immediately.  :-)


> Another way (and maybe simpler ) to fix the problem: instead of loosing
the
> OGM dropping policy,  loosing the echo cancellation policy.
>
> It is a very simple fix: in addition to the usual echo cancellation
> process, I also check  the TQ value contained in the echoed OGM.  This TQ
> is announced by the neighbor that broadcasts this echo. If this TQ is
worse
> than my current TQ estimation towards that neighbor, I update the
> estimation with this TQ.

In your testbed it fixed the issue ?
Looking at the example you gave I would expect that the problem is only
solved
partially. You argued the OGMs from E via A would always arrive faster at B

compared to the longer path. Using your method would probably avoid the
loop
between A and B in the event of a packet loss on the longer path but it
would
not let route A via the longer path. Am I mistaken ?

Regards,
Marek
  
Marek Lindner Aug. 13, 2009, 4 p.m. UTC | #14
On Thursday 13 August 2009 23:07:06 Yang Su wrote:
> Maybe I miss some thing here. What do you think would prevent A from
> choosing via the long route (via B) ?

The starting point (still using your example):
* A points to E as it never learned about the long path.
* B points to C because it learned about the long path but does not 
rebroadcast the packets from C because they are slower than the 
packets via A.
* C uses the longer path.

Your idea: Use the echo cancellation's answer to retrieve more information, 
especially when the TQ which is coming back is worse than my own I update my 
routing information.

The situation now:
* A still points to E because the echos coming from B do not contain TQ values 
that are worse than its own.
* B will probably never route via A because the TQ values are much worse.
* C is not affected.

Did I overlook something ?

Regards,
Marek
  

Patch

 batman-adv-kernelland/routing.c |   44 ++++++++++++++++----------------------
 1 files changed, 19 insertions(+), 25 deletions(-)

diff --git a/batman-adv-kernelland/routing.c b/batman-adv-kernelland/routing.c
index b576f8c..204f6a3 100644
--- a/batman-adv-kernelland/routing.c
+++ b/batman-adv-kernelland/routing.c
@@ -270,31 +270,22 @@  static int isBidirectionalNeigh(struct orig_node *orig_node, struct orig_node *o
 
 static void update_orig(struct orig_node *orig_node, struct ethhdr *ethhdr, struct batman_packet *batman_packet, struct batman_if *if_incoming, unsigned char *hna_buff, int hna_buff_len, char is_duplicate)
 {
-	struct neigh_node *neigh_node = NULL, *tmp_neigh_node = NULL, *best_neigh_node = NULL;
-	unsigned char max_tq = 0, max_bcast_own = 0;
+	struct neigh_node *neigh_node = NULL, *tmp_neigh_node = NULL;
 	int tmp_hna_buff_len;
 
 	debug_log(LOG_TYPE_BATMAN, "update_originator(): Searching and updating originator entry of received packet \n");
 
 	list_for_each_entry(tmp_neigh_node, &orig_node->neigh_list, list) {
-
 		if (compare_orig(tmp_neigh_node->addr, ethhdr->h_source) && (tmp_neigh_node->if_incoming == if_incoming)) {
 			neigh_node = tmp_neigh_node;
-		} else {
-
-			if (!is_duplicate) {
-				ring_buffer_set(tmp_neigh_node->tq_recv, &tmp_neigh_node->tq_index, 0);
-				tmp_neigh_node->tq_avg = ring_buffer_avg(tmp_neigh_node->tq_recv);
-			}
+			continue;
+		}
 
-			/* if we got have a better tq value via this neighbour or same tq value if it is currently our best neighbour (to avoid route flipping) */
-			if ((tmp_neigh_node->tq_avg > max_tq) || ((tmp_neigh_node->tq_avg == max_tq) && (tmp_neigh_node->orig_node->bcast_own_sum[if_incoming->if_num] > max_bcast_own)) || ((orig_node->router == tmp_neigh_node) && (tmp_neigh_node->tq_avg == max_tq))) {
+		if (is_duplicate)
+			continue;
 
-				max_tq = tmp_neigh_node->tq_avg;
-				max_bcast_own = tmp_neigh_node->orig_node->bcast_own_sum[if_incoming->if_num];
-				best_neigh_node = tmp_neigh_node;
-			}
-		}
+		ring_buffer_set(tmp_neigh_node->tq_recv, &tmp_neigh_node->tq_index, 0);
+		tmp_neigh_node->tq_avg = ring_buffer_avg(tmp_neigh_node->tq_recv);
 	}
 
 	if (neigh_node == NULL)
@@ -313,17 +304,20 @@  static void update_orig(struct orig_node *orig_node, struct ethhdr *ethhdr, stru
 		neigh_node->last_ttl = batman_packet->ttl;
 	}
 
-	if ((neigh_node->tq_avg > max_tq) || ((neigh_node->tq_avg == max_tq) && (neigh_node->orig_node->bcast_own_sum[if_incoming->if_num] > max_bcast_own)) || ((orig_node->router == neigh_node) && (neigh_node->tq_avg == max_tq))) {
-
-		max_tq = neigh_node->tq_avg;
-		max_bcast_own = neigh_node->orig_node->bcast_own_sum[if_incoming->if_num];
-		best_neigh_node = neigh_node;
-
-	}
-
 	tmp_hna_buff_len = (hna_buff_len > batman_packet->num_hna * ETH_ALEN ? batman_packet->num_hna * ETH_ALEN : hna_buff_len);
 
-	update_routes(orig_node, best_neigh_node, hna_buff, tmp_hna_buff_len);
+	/**
+	 * if we got have a better tq value via this neighbour or
+	 * same tq value but the link is more symetric change the next hop
+	 * router
+	 */
+	if ((orig_node->router != neigh_node) && ((!orig_node->router) ||
+	    (neigh_node->tq_avg > orig_node->router->tq_avg) ||
+	    ((neigh_node->tq_avg == orig_node->router->tq_avg) &&
+	     (neigh_node->orig_node->bcast_own_sum[if_incoming->if_num] > orig_node->router->orig_node->bcast_own_sum[if_incoming->if_num]))))
+		update_routes(orig_node, neigh_node, hna_buff, tmp_hna_buff_len);
+	else
+		update_routes(orig_node, orig_node->router, hna_buff, tmp_hna_buff_len);
 }
 
 static char count_real_packets(struct ethhdr *ethhdr, struct batman_packet *batman_packet, struct batman_if *if_incoming)