batman-adv: encourage batman to take shorter routes by changing the default hop penalty

Message ID 1327677116-22645-1-git-send-email-lindner_marek@yahoo.de (mailing list archive)
State Accepted, archived
Commit 6a12de1939281dd7fa62a6e22dc2d2c38f82734f

Commit Message

Marek Lindner Jan. 27, 2012, 3:11 p.m. UTC
  Signed-off-by: Marek Lindner <lindner_marek@yahoo.de>
---
 soft-interface.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)
  

Comments

Andrew Lunn Jan. 27, 2012, 3:19 p.m. UTC | #1
On Fri, Jan 27, 2012 at 11:11:55PM +0800, Marek Lindner wrote:
> Signed-off-by: Marek Lindner <lindner_marek@yahoo.de>
> ---
>  soft-interface.c |    2 +-
>  1 files changed, 1 insertions(+), 1 deletions(-)
> 
> diff --git a/soft-interface.c b/soft-interface.c
> index 2ffdc74..7548762 100644
> --- a/soft-interface.c
> +++ b/soft-interface.c
> @@ -836,7 +836,7 @@ struct net_device *softif_create(const char *name)
>  	atomic_set(&bat_priv->gw_sel_class, 20);
>  	atomic_set(&bat_priv->gw_bandwidth, 41);
>  	atomic_set(&bat_priv->orig_interval, 1000);
> -	atomic_set(&bat_priv->hop_penalty, 10);
> +	atomic_set(&bat_priv->hop_penalty, 30);

Hi Marek

Do you have any performance analysis to show this is really helpful
and not harmful?

I've seen indoor results where I had to reduce the hop penalty, because
otherwise BATMAN was taking a short path which worked badly. By reducing
the hop penalty, and so encouraging it to take more hops, I got usable
routes.

I see a danger here that this could break working networks, so maybe it
needs justification?

Thanks
	Andrew
  
Marek Lindner Jan. 27, 2012, 3:54 p.m. UTC | #2
Hi Andrew,

> Do you have any performance analysis to show this is really helpful
> and not harmful?
> 
> I've seen indoor results where i had to reduce the hop penalty,
> otherwise BATMAN was taking a short path which worked badly. By
> reducing the hop penalty, so encouraging it to take more hops, i got
> usable routes.
> 
> I see the danger here this could break working networks, so maybe it
> needs justification?

As a matter of fact, I do believe it is helpful. In various networks (more than
a dozen) I have seen that batman would largely favor multi-hop routes, thus
reducing the overall throughput. By setting the hop penalty to a higher value I
regained some of that performance. The networks are still up & running - I can
show them to you if you are interested.

So, you had to reduce the default value of 10 to something even smaller? A
hop penalty of 10 results in a penalty of 4% per hop - roughly the equivalent
of 2 lost packets (62/64). That does not sound like very much to me. Can you
explain your test setup a little more?
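
Spelled out, assuming the usual 0..255 TQ scale and the 64-packet local
window (that is where the 62/64 above comes from):

    10 / 255 ~= 3.9%   per-hop TQ reduction
    62 / 64  ~= 96.9%  i.e. roughly the TQ left after losing 2 of 64 probes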

Nevertheless, this patch was intended to get a discussion going. The main 
problem I have been seeing in the last weeks is that OGM broadcasts have a 
hard time estimating the link quality / throughput on 11n devices. I'll also 
try to hack a proof of concept for an RSSI influence on the routing and see if 
that has a better effect.

Regards,
Marek
  
Daniele Furlan Jan. 27, 2012, 6:17 p.m. UTC | #3
Hi all,

2012/1/27 Marek Lindner <lindner_marek@yahoo.de>:
>
> Hi Andrew,
>
>> Do you have any performance analysis to show this is really helpful
>> and not harmful?
>>
>> I've seen indoor results where i had to reduce the hop penalty,
>> otherwise BATMAN was taking a short path which worked badly. By
>> reducing the hop penalty, so encouraging it to take more hops, i got
>> usable routes.
>>
>> I see the danger here this could break working networks, so maybe it
>> needs justification?

I have experienced the same situation in some tests, and I agree with Andrew
when he says that some form of justification is necessary.
>
> as a matter of fact I do believe it is helpful. In various networks (more than
> a dozen) I have seen that batman would largely favor multi-hop routes, thus
> reducing the overall throughput. By setting it to a higher value I regained
> some of its performance. The networks are still up & running - I can show them
> to you if you are interested.
>
> So, you had to reduce the default value of 10 to something even smaller ? A
> hop penalty of 10 results in a penalty of 4% per hop. A rough equivalent of 2
> lost packets (62/64). Does not sound very much to me. Can you explain your
> test setup a little more ?
>
> Nevertheless, this patch was intended to get a discussion going. The main
> problem I have been seeing in the last weeks is that OGM broadcasts have a
> hard time estimating the link quality / throughput on 11n devices. I'll also
> try to hack a proof of concept for an rssi influence on the routing and see if
> that has a better effect.

The problems with TQ emerge when device rates increase, because especially
in mixed b/g/n networks TQ does not distinguish between fast and slow links.
We all know that broadcast losses say almost nothing about link speed or
load.

The only way to improve the TQ metric is a cross-layer implementation, as I
have already experimented with (considering only bandwidth) in my tests.
Obviously this means breaking the "universal" compatibility with network
interfaces; in my opinion, relying on mac80211 and cfg80211 can limit this
problem in any case.

>
> Regards,
> Marek

Regards,
Daniele
  
Andrew Lunn Jan. 27, 2012, 7:13 p.m. UTC | #4
> So, you had to reduce the default value of 10 to something even smaller ? A 
> hop penalty of 10 results in a penalty of 4% per hop. A rough equivalent of 2 
> lost packets (62/64). Does not sound very much to me. Can you explain your 
> test setup a little more ?

These observations come from a research project made together with
Hochschule Luzern. There is some flyer like documentation in:

www.hslu.ch/t-spawn-project-description_en.pdf

It is a deployable indoor network. The tests I made were with a mesh
of 6 nodes, deployed in a chain. The deployment is intelligent, made
independently of BATMAN. It uses packet probing at the lowest coding
rate to ensure there is always a link to two nodes upstream in the
chain. So you walk along with 5 nodes in your hand. When the algorithm
determines that the link upstream to the two nodes has reached a
threshold, it tells you to deploy the next mesh node. We kept doing
this, along the corridor, down the steps, along another corridor,
through a fire door, etc., until we were out of nodes.

iperf was then used to measure the traffic from one end of the chain
to the other. With the default hop penalty we got poor performance.
With the traceroute facility of batctl, we could see it was route
flipping between 3 hops and 4 hops. When it used 3 hops, the packet
loss was too high and we got poor bandwidth. When it went up to 4
hops, the packet loss was lower, so we got more bandwidth.
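
For reference, the flipping shows up when the route check is simply
re-run while iperf is going, roughly like this (the destination is a
placeholder; the exact output depends on the batctl version):

    batctl traceroute <mac-or-bat-hosts-name>
    batctl originators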
 
This was repeatable, with each deploy we made.

Then we tried with a lower hop penalty. I think it was 5, but I don't
remember exactly. BATMAN then used 5 hops and there was no route flipping.
We also got the best iperf bandwidth from end to end of the chain.

The fact that BATMAN was route flipping with a hop penalty of 10 suggests
to me the links had similar TQ. So OGMs are getting through at the
lowest coding rate. But data packets are having trouble, maybe because
they are full MTU, or because the wifi driver is using the wrong
coding rate.

I suspect the TQ measurements as determined by OGMs are more
optimistic than what actual data packets experience. Linus played with
different NDP packet sizes, and I think he ended up with big packets
in order to give more realistic TQ measurements.

Unfortunately, this project is now finished. I do have access to the
hardware, but no time allocated to play with it :-(

> Nevertheless, this patch was intended to get a discussion going. 

Well, I'm happy to take part in the discussion. I've no idea if our
use case is typical or an edge case, so comments and results from
other people's networks would be useful.

If this change is meant to help 11n, maybe some more intelligence would
be better: ask the wireless stack whether the interface is abg or n, and
from that determine what hop penalty should be used?

     Andrew
  
Antonio Quartulli Jan. 28, 2012, 2:12 p.m. UTC | #5
Hi all,

Very nice setup Andrew :)

On Fri, Jan 27, 2012 at 08:13:34PM +0100, Andrew Lunn wrote:
> > So, you had to reduce the default value of 10 to something even smaller ? A 
> > hop penalty of 10 results in a penalty of 4% per hop. A rough equivalent of 2 
> > lost packets (62/64). Does not sound very much to me. Can you explain your 
> > test setup a little more ?
> 
> I suspect the TQ measurements as determined by OGMs are more
> optimistic than actual data packets. Linus's played with different NDP
> packet sizes, and i think he ended up with big packets in order to
> give more realistic TQ measurements.
> 
> > Nevertheless, this patch was intended to get a discussion going. 
> 
> Well, i'm happy to take part in the discussion. I've no idea if our
> use case is typical, or an edge case. So comments, and results from
> other peoples networks would be useful.
> 
> If this change is to help 11n, maybe some more intelligence would be
> better, to ask the wireless stack is the interface abg or n, and from
> that determine what hop penalty should be used?


In my honest opinion we are mixing two different issues:
1) the current hop penalty value is not really significant
2) OGM link quality measurements do not reflect the metric we'd like them to be


Problem 2 is not going to be solved by hacking the hop penalty. It needs
further investigation/research, and NDP is probably a good starting point
towards a possible solution (I think we all agree on this).


As far as the hop penalty is concerned, as I understand it, it is in charge of
making batman prefer a shorter route in case of equal TQs over the traversed
links. Instead of hacking the value... what about redesigning the way the hop
penalty affects the TQ value of forwarded OGMs? Maybe using a different
function (a polynomial of degree > 1, or an exponential) instead of a simple
linear decrease? Might this help all the scenarios we mentioned?
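
Just to make the idea concrete, something along these lines - purely
hypothetical, not an existing batman-adv function, and the hop count would
have to come from somewhere (e.g. derived from the OGM's TTL):

#define TQ_MAX_VALUE 255

/* hypothetical: scale the penalty by the number of hops the OGM has
 * already travelled, so that long paths are punished more than linearly */
static unsigned int scaled_hop_penalty(unsigned int tq,
				       unsigned int hop_penalty,
				       unsigned int hops_so_far)
{
	unsigned int penalty = hop_penalty * hops_so_far;

	if (penalty >= TQ_MAX_VALUE)
		return 0;

	return (tq * (TQ_MAX_VALUE - penalty)) / TQ_MAX_VALUE;
}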


Cheers,
  
Simon Wunderlich Jan. 28, 2012, 3:35 p.m. UTC | #6
On Fri, Jan 27, 2012 at 11:54:25PM +0800, Marek Lindner wrote:
> 
> Hi Andrew,
> 
> > Do you have any performance analysis to show this is really helpful
> > and not harmful?
> > 
> > I've seen indoor results where i had to reduce the hop penalty,
> > otherwise BATMAN was taking a short path which worked badly. By
> > reducing the hop penalty, so encouraging it to take more hops, i got
> > usable routes.
> > 
> > I see the danger here this could break working networks, so maybe it
> > needs justification?
> 
> as a matter of fact I do believe it is helpful. In various networks (more than 
> a dozen) I have seen that batman would largely favor multi-hop routes, thus 
> reducing the overall throughput. By setting it to a higher value I regained 
> some of its performance. The networks are still up & running - I can show them 
> to you if you are interested.

I have seen similar results in my test setups. One simple scenario where I have
seen route flapping with hop penalty 10 in multiple setups is: if some nodes are
at the same place (e.g. a few netbooks on the same table), they often don't use
the direct route but change to a two-hop route to reach their destination - even
if the direct link is nearly perfect. There doesn't even have to be payload
traffic involved; the routes just flap because of the small TQ oscillations
caused by a few lost packets.

In these tests I have also changed the hop penalty to 30 (or sometimes even 50),
and these problems are gone.

The TQ metric has limited informative value in terms of available bandwidth or
chosen rate. The default wifi broadcast/multicast rate of 1 Mbit/s may lead to
preferring low-rate one-hop links over high-rate two-hop links. However, this
can often be fixed by increasing the mcast rate (mac80211 and madwifi support
this). We should consider including rate information in future metrics.
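
For example, with mac80211 drivers a reasonably recent iw lets you pin the
multicast rate when (re)joining the IBSS; interface name, SSID, frequency
and rate below are just placeholders:

    iw dev wlan0 ibss join my-mesh 2412 mcast-rate 18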

Anyway, for now and for our current TQ metric, I strongly agree with increasing
the hop penalty too.

Cheers,
	Simon

> 
> So, you had to reduce the default value of 10 to something even smaller ? A 
> hop penalty of 10 results in a penalty of 4% per hop. A rough equivalent of 2 
> lost packets (62/64). Does not sound very much to me. Can you explain your 
> test setup a little more ?
> 
> Nevertheless, this patch was intended to get a discussion going. The main 
> problem I have been seeing in the last weeks is that OGM broadcasts have a 
> hard time estimating the link quality / throughput on 11n devices. I'll also 
> try to hack a proof of concept for an rssi influence on the routing and see if 
> that has a better effect.
  
Marek Lindner Jan. 28, 2012, 8:49 p.m. UTC | #7
Hi,

> In my honest opinion we are mixing two different issues:
> 1) current hop penalty value not really significant
> 2) OGM link quality measurements do not reflect the metric we'd like it to
> be
> 
> 
> problem 2 is not going to be solved by hacking the hop penalty. It needs
> further investigation/research and NDP is probably a good starting point
> towards a possible solution (I think we all agree on this).

you are right - these are 2 different issues.


> As far as the hop penalty is concerned, as I understand it, it is in charge
> of making batman prefer a shorter route in case of equal TQs over the
> traversed links. Instead of hacking the value... what about redesigning the
> way the hop penalty affects the TQ value of forwarded OGMs? Maybe using a
> different function (a polynomial of degree > 1, or an exponential) instead
> of a simple linear decrease? Might this help all the scenarios we mentioned?

The hop penalty is not as linear as you think. The formula is:
tq = (tq * (TQ_MAX_VALUE - hop_penalty)) / TQ_MAX_VALUE

With a hop penalty of 10 you get the following results:
tq = 255, penalty = 10, resulting tq = 245
tq = 200, penalty = 8, resulting tq = 192
tq = 150, penalty = 6, resulting tq = 144
tq = 100, penalty = 4, resulting tq = 96
tq = 50, penalty = 2, resulting tq = 48

As you can see, the lower the TQ gets, the less influence the hop penalty
has.
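
For reference, here is a small standalone sketch (plain user-space C, not
the exact in-tree helper) that reproduces the table above:

#include <stdio.h>

#define TQ_MAX_VALUE 255

/* apply the per-hop penalty to a TQ value before it is re-broadcast */
static unsigned int apply_hop_penalty(unsigned int tq, unsigned int hop_penalty)
{
	return (tq * (TQ_MAX_VALUE - hop_penalty)) / TQ_MAX_VALUE;
}

int main(void)
{
	unsigned int tqs[] = { 255, 200, 150, 100, 50 };
	unsigned int i;

	for (i = 0; i < sizeof(tqs) / sizeof(tqs[0]); i++)
		printf("tq = %u, penalty = %u, resulting tq = %u\n",
		       tqs[i], tqs[i] - apply_hop_penalty(tqs[i], 10),
		       apply_hop_penalty(tqs[i], 10));

	return 0;
}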

Regards,
Marek
  
Marek Lindner Jan. 28, 2012, 8:57 p.m. UTC | #8
On Saturday, January 28, 2012 03:13:34 Andrew Lunn wrote:
> When iperf was used to measure the traffic from one end of the chain
> to the other. With the default hop penalty we got poor
> performance. With the traceroute facility of batctl, we could see it
> was route flipping between 3 hops and 4 hops. When it used 3 hops, the
> packet loss was too high and we got poor bandwidth. Then it went up to
> 4 hops, the packet loss was lower, so we got more bandwidth.
> 
> This was repeatable, with each deploy we made.
> 
> Then we tried with a lower hop penalty. I think it was 5, but i don't
> remember. BATMAN then used 5 hops and there was no route flipping. We
> also got the best iperf bandwidth for end to end of the chain.

I have a hard time understanding this because the hop penalty has less 
influence on bad links. As you can see in my previous mail, below a TQ of 100 
the default penalty of 10 makes less than a 4 TQ point difference.

Did you try setting a higher multicast rate? This also tends to eliminate 
flaky direct connections.


> The fact BATMAN was route flipping with a hop penalty of 10 suggests
> to me the links had similar TQ. So OGMs are getting through at the
> lowest coding rate. But data packets are having trouble, maybe because
> they are full MTU, or because the wifi driver is using the wrong
> coding rate.

Actually, in most of my setups the connection between all neighboring nodes is 
perfect. Maybe that is another corner case?  :-)


> I suspect the TQ measurements as determined by OGMs are more
> optimistic than actual data packets. Linus's played with different NDP
> packet sizes, and i think he ended up with big packets in order to
> give more realistic TQ measurements.
> 
> Unfortunately, this project is now finished. I do have access to the
> hardware, but no time allocated to play with it :-(

It was a good idea and is not forgotten. Hopefully I will have the code ready 
by the time of WBMv5. Then we can play a bit with it.


> > Nevertheless, this patch was intended to get a discussion going.
> 
> Well, i'm happy to take part in the discussion. I've no idea if our
> use case is typical, or an edge case. So comments, and results from
> other peoples networks would be useful.
> 
> If this change is to help 11n, maybe some more intelligence would be
> better, to ask the wireless stack is the interface abg or n, and from
> that determine what hop penalty should be used?

It is not directly related to 11n. The pain level grows with 11n as the gap 
between packet loss and throughput grows. This setting is more intended for 
setups in which all nodes have rather good connections to all other nodes. 
Then the direct TQs and the "hop" TQs are too similar and batman starts using 
multi-hop connections.

Regards,
Marek
  
Marek Lindner Jan. 28, 2012, 9:03 p.m. UTC | #9
Hi,

> I have experienced the same situation in some tests, and I agree with
> Andrew when he says that some form of justification is necessary.

So you have also seen that a hop penalty of 10 is too high? Can you explain
your setup a bit more?


> The problems with TQ emerge when device rates increase, because especially
> in mixed b/g/n networks TQ does not distinguish between fast and slow links.
> We all know that broadcast losses say almost nothing about link speed or
> load.
> 
> The only way to improve the TQ metric is a cross-layer implementation, as I
> have already experimented with (considering only bandwidth) in my tests.
> Obviously this means breaking the "universal" compatibility with network
> interfaces; in my opinion, relying on mac80211 and cfg80211 can limit this
> problem in any case.

I am certain that you have great ideas and that you spend a lot of time working 
on batman / meshing. However, it is somewhat difficult to review / discuss / 
adapt your work since we have a hard time understanding your concepts without 
proper explanations / documentation. Would it be possible for you to talk/write 
a bit more about your work?

The WBMv5 is a good opportunity to chat because you get all of us in one 
place.  ;-)

Cheers,
Marek
  
Andrew Lunn Jan. 30, 2012, 8:06 a.m. UTC | #10
On Sat, Jan 28, 2012 at 03:12:57PM +0100, Antonio Quartulli wrote:
> Hi all,
> 
> Very nice setup Andrew :)

I cannot take much credit for it. I helped write the project proposal,
but then was not allowed to take part in the project because of other
higher priority projects. The credit goes to Hochschule Luzern, Linus
and others.

> In my honest opinion we are mixing two different issues:
> 1) current hop penalty value not really significant
> 2) OGM link quality measurements do not reflect the metric we'd like it to be

Yes, I agree. However, in the scenarios we have seen in this project,
they are related. When the OGM-based TQ gives us overly optimistic values,
a higher hop penalty makes this even worse.

However, comments so far suggest I'm in a corner case, and that for
others a higher hop penalty does help. So for the moment, maybe
increasing the hop penalty is the right thing to do, but remember that
once we have a better TQ measurement the hop penalty should be examined
again.

   Andrew
  
Marek Lindner Feb. 5, 2012, 6:52 p.m. UTC | #11
On Friday, January 27, 2012 23:11:55 Marek Lindner wrote:
> Signed-off-by: Marek Lindner <lindner_marek@yahoo.de>
> ---
>  soft-interface.c |    2 +-
>  1 files changed, 1 insertions(+), 1 deletions(-)

Thanks for all the feedback and comments. I applied the patch in revision 
6a12de1. Let's see how it goes.
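
For anyone who prefers the old behaviour, the new value is only the default
used when the soft interface is created; it can still be changed at runtime
through sysfs (assuming the usual bat0 soft interface name):

    cat /sys/class/net/bat0/mesh/hop_penalty
    echo 10 > /sys/class/net/bat0/mesh/hop_penalty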

Regards,
Marek
  

Patch

diff --git a/soft-interface.c b/soft-interface.c
index 2ffdc74..7548762 100644
--- a/soft-interface.c
+++ b/soft-interface.c
@@ -836,7 +836,7 @@  struct net_device *softif_create(const char *name)
 	atomic_set(&bat_priv->gw_sel_class, 20);
 	atomic_set(&bat_priv->gw_bandwidth, 41);
 	atomic_set(&bat_priv->orig_interval, 1000);
-	atomic_set(&bat_priv->hop_penalty, 10);
+	atomic_set(&bat_priv->hop_penalty, 30);
 	atomic_set(&bat_priv->log_level, 0);
 	atomic_set(&bat_priv->fragmentation, 1);
 	atomic_set(&bat_priv->bcast_queue_left, BCAST_QUEUE_LEN);