batman goes looping...

Message ID 200907280020.06026.lindner_marek@yahoo.de (mailing list archive)
State RFC, archived
Headers

Commit Message

Marek Lindner July 27, 2009, 4:20 p.m. UTC
  Hi,

> Looking at the logs of uml1, uml1 is always routing to uml9 via uml2.
> The problem here i think is to do with the asymetric links algorithms.
> When sending out an OGM, the node uses the TQ for its best link to the
> originator, not the link the OGM came in on. If the OGM from uml1
> origionally from UML3 reported the TQ via that route, the TQ would
> very likely be lower. uml2 would then not of choosen to swap to
> uml1. However, uml1 reports its best route, which is via uml2. uml2
> does not know this, decides to use uml1, and we have a loop.
>
> Does this all hang together correctly? I'm i interpreting this all
> right...
>
> How would you suggest fix this?

I would tend to say it is a bug in the routing selection code. UML2 switches 
the route because it compare thes "negative" TQ message of seqno N with a TQ 
of seqno N - x. I suggest the following fix:

If we receive a TQ value via a neighbor that is smaller than the previous TQ 
that we received via that neighbor we don't change the route to a neighbor 
which did not send us the same or newer seqno.

That way your scenario should not happen because uml2 would not switch. On the 
other hand if uml1 really has a better route uml2 would switch as soon as the 
packet with a new seqno via uml1 arrives. What do you think ?

I attached a patch that should do exactly that. As I'm travelling right now 
I'm not able to test it before next week. If you find the time to do so, please 
let me know about your findings. This patch is not as strict as it could be. It 
might be necessary to rework it as soon as the concept has been proven.

Thanks again for this thorough analysis. Let us know if you find more.  :-)

Regards,
Marek
  

Comments

Yang Su July 30, 2009, 2:32 p.m. UTC | #1
-  More information about the UML based test setup: a chain consisting of 9
nodes (each node is an uml instatnce, uml1, uml2 ... uml9). Direct
neighbors (distance=1 hop) can talk to each other with 0% link loss rate.
Nodes that are 2 hop away can talk to each other with 20% link loss rate.
Nodes that are more than 2 hops away cannot communicate to each other. We
shut down the node in the middle (uml5) and measure (with ping) the time it
takes for batman to re-setup the end-to-end routes (from uml1 to uml9).

-  We observe extensive tmp routing loops during the rerouting events. It
often takes batman more than 1 minutes to get any packet through after the
rerouting and it takes even longer (several minutes) before batman
stabilizes.

-  The cause for the loops:

Let's study the interaction between uml1 and uml2:

OGMs from uml9 may arrive at uml1 and uml2 at roughly the same time
(because both of them has a shortest path of 6-hop from uml9).  uml1 and
uml2 will forward those OGMs to each other.  Depending on the timing
relations between uml1 and uml2, there are four possible combinations:

Case 1: first uml1 forwards the OGM received from uml3 to uml2,  then  uml
2 forwards the OGM received from uml3 to uml1.

Case 2: first uml1 forwards the OGM received from uml3 to uml2, then uml 2
forwards the OGM received from uml1 back to uml1.

Case 3: first uml2 forwards the OGM received from uml3 to uml1, then uml1
forwards the OGM received from uml3 to uml2.

Case 4: first uml2 forwards the OGM received from uml3 to uml1, then uml1
forwards the OGM received from uml2 back to uml2.

When the quality of the path from uml1 to uml9 drops down dramatically
(e.g., in our case, path loss rate increases from 0% to 20% after we remove
uml5), 3 out of 4 cases (case 1,2,4) may cause looping between uml1 and
uml2. With the interaction of echo cancellation mechanism, case 2 may even
lead batman into very deep looping.

- To fix the problem:

I explored two ideas:

1.  Similar to what Marek proposed. But push it to the extreme: switch to a
neighbor only when it has the best tq AND it has the newest seqno.  This
fix along seems to already solve the looping problem in the test cases. It
reduces the rerouting time from more than 1 minutes to less than 15
seconds.  Marek: I also tried the patch you sent. It didn't help in this
setup.

2. Relaxed echo cancellation. This is based on  the following observation:
the TQ value that a node puts into OGM is completely decoupled with "from
which neighbor this OGM is received".   As a result, the TQ value contained
in the echoed OGM represent the real TQ value at the neighbor which echoed
this OGM. The current echo cancellation implementation just drops all the
echoed OGM. This may prevent the node from updating the information towards
the neighbor that echos the OGM. In the extreme case, the information
towards that neighbor may becomes completely stale (similar to what happens
in case 2).  The change I  made:  Always check the TQ contained in the
echoed OGMs. When it is worse than the avg TQ towards that neighbor, we use
this TQ reading to update the avg TQ towards that neighbor. This change
didn't show any effect during the chain tests. However, I still include
this change in the patch to bring up the discussion.

In the attachment, I append a patch which contains a preliminary
implementation for both ideas.



Cheers,
Yang
(See attached file: routing-loops.patch)



|------------>
| From:      |
|------------>
  >--------------------------------------------------------------------------------------------------------------------------------------------------|
  |Marek Lindner <lindner_marek@yahoo.de>                                                                                                            |
  >--------------------------------------------------------------------------------------------------------------------------------------------------|
|------------>
| To:        |
|------------>
  >--------------------------------------------------------------------------------------------------------------------------------------------------|
  |b.a.t.m.a.n@lists.open-mesh.net                                                                                                                   |
  >--------------------------------------------------------------------------------------------------------------------------------------------------|
|------------>
| Cc:        |
|------------>
  >--------------------------------------------------------------------------------------------------------------------------------------------------|
  |Andrew Lunn <andrew.lunn@ascom.ch>, Yang Su <yang.su@ascom.ch>                                                                                    |
  >--------------------------------------------------------------------------------------------------------------------------------------------------|
|------------>
| Date:      |
|------------>
  >--------------------------------------------------------------------------------------------------------------------------------------------------|
  |27.07.2009 18:21                                                                                                                                  |
  >--------------------------------------------------------------------------------------------------------------------------------------------------|
|------------>
| Subject:   |
|------------>
  >--------------------------------------------------------------------------------------------------------------------------------------------------|
  |Re: [B.A.T.M.A.N.] batman goes looping...                                                                                                         |
  >--------------------------------------------------------------------------------------------------------------------------------------------------|






Hi,

> Looking at the logs of uml1, uml1 is always routing to uml9 via uml2.
> The problem here i think is to do with the asymetric links algorithms.
> When sending out an OGM, the node uses the TQ for its best link to the
> originator, not the link the OGM came in on. If the OGM from uml1
> origionally from UML3 reported the TQ via that route, the TQ would
> very likely be lower. uml2 would then not of choosen to swap to
> uml1. However, uml1 reports its best route, which is via uml2. uml2
> does not know this, decides to use uml1, and we have a loop.
>
> Does this all hang together correctly? I'm i interpreting this all
> right...
>
> How would you suggest fix this?

I would tend to say it is a bug in the routing selection code. UML2
switches
the route because it compare thes "negative" TQ message of seqno N with a
TQ
of seqno N - x. I suggest the following fix:

If we receive a TQ value via a neighbor that is smaller than the previous
TQ
that we received via that neighbor we don't change the route to a neighbor
which did not send us the same or newer seqno.

That way your scenario should not happen because uml2 would not switch. On
the
other hand if uml1 really has a better route uml2 would switch as soon as
the
packet with a new seqno via uml1 arrives. What do you think ?

I attached a patch that should do exactly that. As I'm travelling right now

I'm not able to test it before next week. If you find the time to do so,
please
let me know about your findings. This patch is not as strict as it could
be. It
might be necessary to rework it as soon as the concept has been proven.

Thanks again for this thorough analysis. Let us know if you find more.  :-)

Regards,
Marek
  
nick farrow Aug. 4, 2009, 3:59 p.m. UTC | #2
Hi,

I have 3 linksys wrt54 flashed with openwrt 7.09, webif'ed and installed BATMAN 0.2-rv478.1, all OK I think.

I have arranged the 3 nodes such that node 1 can only see 2 , and  2 only  3, so no direct contact is available from 1 to 3 (except via the mesh)

I have not broken the br-lan configuration of these devices and assigned 172.16.1.1/24 to node 1 172.16.1.2/24 to node 2 172.16.1.3/24 to node 3, so I think that means all packets recvd on wl0 or eh0 are bridged.

I have a pc hooked into the lan port of node 3 From here I can ssh into node3 (via the ethernet) and into node2 via the wifi (but not .1)  Netstat -rn on node  3 indicates that 172.16.1.1 (node1) is via 172.16.1.2 and this is confirmed by the batman web if that shows .1 is reached via .2 - all great.

From the ssh prompt on .3 I can ping .2 with no loss and ping .1 with  ~ 6% loss. However, if I try to ping .1 from the pc connected on the ethernet port of .3 I get no route to host. Putting wireshark on the ethernet ifc shows the arp going out but nothing coming back. Interestingly enough if I ssh into .3 and ping .1, with wireshark running on the ethernet of .3 Is see the outgoing ICMP but not the reply (which is there because the ping succeeds) . It seems as if the local host traffic is not truly bridged.

This is the first attempt so I might have some incorrect/missing config or done something stupid, any ideas would be welcome!

Thanks very much

nick
  

Patch

 batman-adv-kernelland/routing.c |   19 ++++++++++++++++++-
 1 files changed, 18 insertions(+), 1 deletions(-)

diff --git a/batman-adv-kernelland/routing.c b/batman-adv-kernelland/routing.c
index b576f8c..cb9f0ea 100644
--- a/batman-adv-kernelland/routing.c
+++ b/batman-adv-kernelland/routing.c
@@ -271,7 +271,7 @@  static int isBidirectionalNeigh(struct orig_node *orig_node, struct orig_node *o
 static void update_orig(struct orig_node *orig_node, struct ethhdr *ethhdr, struct batman_packet *batman_packet, struct batman_if *if_incoming, unsigned char *hna_buff, int hna_buff_len, char is_duplicate)
 {
 	struct neigh_node *neigh_node = NULL, *tmp_neigh_node = NULL, *best_neigh_node = NULL;
-	unsigned char max_tq = 0, max_bcast_own = 0;
+	unsigned char max_tq = 0, max_bcast_own = 0, tq_avg_old;
 	int tmp_hna_buff_len;
 
 	debug_log(LOG_TYPE_BATMAN, "update_originator(): Searching and updating originator entry of received packet \n");
@@ -306,6 +306,7 @@  static void update_orig(struct orig_node *orig_node, struct ethhdr *ethhdr, stru
 	neigh_node->last_valid = jiffies;
 
 	ring_buffer_set(neigh_node->tq_recv, &neigh_node->tq_index, batman_packet->tq);
+	tq_avg_old = neigh_node->tq_avg;
 	neigh_node->tq_avg = ring_buffer_avg(neigh_node->tq_recv);
 
 	if (!is_duplicate) {
@@ -323,6 +324,22 @@  static void update_orig(struct orig_node *orig_node, struct ethhdr *ethhdr, stru
 
 	tmp_hna_buff_len = (hna_buff_len > batman_packet->num_hna * ETH_ALEN ? batman_packet->num_hna * ETH_ALEN : hna_buff_len);
 
+	/**
+	 * if the neighbor that sent us this packet is our current best next
+	 * hop but delivers a TQ that is worse than the previous one we have
+	 * have to make sure that the alternative route already knows about the
+	 * changed TQ otherwise we risk a (temporary) loop
+	 * in case our alternative route does not know about his change we
+	 * stick with our current route
+	 */
+	if ((orig_node->router == neigh_node) &&
+	    (neigh_node != best_neigh_node) &&
+	    (tq_avg_old > neigh_node->tq_avg) &&
+	    (!get_bit_status(best_neigh_node->real_bits,
+	                     orig_node->last_real_seqno,
+	                     batman_packet->seqno)))
+		best_neigh_node = neigh_node;
+
 	update_routes(orig_node, best_neigh_node, hna_buff, tmp_hna_buff_len);
 }