Patch which fixes a uint32 overflow; fix the algorithm which was unintentionally changed in the 2014 update

Message ID 20150516221537.2a6fad9a@i3.local
State Superseded, archived

Commit Message

Ruben Wisniewski May 16, 2015, 8:15 p.m. UTC
  Changelog

* fix overflow of uint32 value while multiplying
* restore the algorithm to the original calculation, which expects kbit/s values

Signed-off-by: Ruben Wisniewski <ruben@vfn-nrw.de>
---
 net/batman-adv/gateway_client.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)
  

Comments

Sven Eckelmann May 16, 2015, 10:11 p.m. UTC | #1
On Saturday 16 May 2015 22:15:37 Ruben Wisniewski wrote:
> Changelog
> 
> * fix overflow of uint32-value while multiplying
> * restore algorithm to origin calculation which expect kbit/s values
> 
> Signed-off-by: Ruben Wisniewsi <ruben@vfn-nrw.de>
> ---

Please read
 http://www.open-mesh.org/projects/open-mesh/wiki/Contribute
before submitting patches.

And please also include the commit which caused the regression. This makes
it easier to find out which kernel versions must be patched in case the
problem is critical. Here is an example of how this information should be
included:

 http://git.open-mesh.org/batman-adv.git/commit/210a3cb5a5e09f0067386fd07b3a7a843fcf66c5

Kind regards,
	Sven
  
Sven Eckelmann May 17, 2015, 8:26 a.m. UTC | #2
On Sunday 17 May 2015 00:11:52 Sven Eckelmann wrote: 
> And please also include the commit which caused the regression. This makes
> it easier to find out which kernel versions must be patched in case the
> problem is critical. Here is an example how this information should be
> included:

You most likely want to fix the regression introduced in 0853ec7 ("batman-
adv: tvlv - gateway download/upload bandwidth container"). The gateway client 
code for fast node selection was previously using batadv_gw_bandwidth_to_kbit 
to convert the input parameters to Kibit/s. After the TVLV restructuring it 
should have multiplied the TVLV value by 100 to get the exact same result 
because the batadv_tvlv_gateway_data bandwidth_down/bandwidth_up stores the 
throughput with the base unit of 100 Kibit/s.

The problem here would be to decide if it changes the result of the algorithm. 
The code goes through all gateways. For each gateway it calculates the formula 
(gw_down is in 100 Kibit/s; gw_divisor is constant):

    gw_factor = |_ (tq_avg ** 2 * gw_down * 10 ** 4) / gw_divisor _|

You want to change it to

    gw_factor = |_ (tq_avg ** 2 * gw_down * 10 ** 6) / gw_divisor _|

The rest of the algorithm is just comparing the gw_factor of each gateway. The 
question which should have been answered by the commit message would be:

   Why is this constant of 100 affecting your results? Is it rounding related?
   How much does it affect the algorithm?

tq_avg is a value <= 255. A realistic value for bandwidth_down is maybe 
something like 20. gw_divisor is 2 ** 18.

So each difference in "tq_avg ** 2 * gw_down * 10 ** 4" for each gateway must 
be larger than 2 ** 18 to always be distinguishable by the algorithm. For a 
fixed tq_avg value of 255 the algorithm could currently compare two gateways 
with a gw_down difference of (1. / 2480) and still distinguish them. It would 
still detect a difference when the tq_avg is 6 and the gw_down difference is 
1.

For a fixed gw_down value of 1, the current algorithm would have problems 
distinguishing tq_avg values < 14 when the tq_avg difference is small.

Your change would make it 100x more precise for gw_down. But I am currently 
not seeing the actual impact. Maybe you could explain it in more detail in the 
commit message.

Kind regards,
	Sven
  
Sven Eckelmann May 17, 2015, 9:29 a.m. UTC | #3
Please don't drop the mailing list when replying (unless it is actually 
private conversation).

On Sunday 17 May 2015 11:21:10 Ruben Wisniewski wrote:
> > should be included:
> Since the overflow bug seems to have been there since epoch, the patch has
> to be applied to all versions.

I am talking about the part which you described yourself as a regression. But I 
agree that the overflow bug has been in there since the beginning of the 
gateway code.

> But I think batman-adv 2013.x is so old that
> nobody is using it anymore, or kernels with that version. So we just
> have to fix all 2014.x versions.

There is a stable branch for Linux 2.6.32.x. Linux 2.6.32 came out in 2009. So it 
is possible that David Miller decides to submit the overflow patch to stable 
for 2.6.32. This is not in our jurisdiction. The same may be the case for the 
regression - but then somebody (David Miller/stable team) must know how severe 
the problem is and when it was introduced.

Kind regards,
	Sven
  

Patch

diff --git a/net/batman-adv/gateway_client.c b/net/batman-adv/gateway_client.c
index bb01586..6f00584 100644
--- a/net/batman-adv/gateway_client.c
+++ b/net/batman-adv/gateway_client.c
@@ -153,7 +153,7 @@ batadv_gw_get_best_gw_node(struct batadv_priv *bat_priv)
 	struct batadv_neigh_node *router;
 	struct batadv_neigh_ifinfo *router_ifinfo;
 	struct batadv_gw_node *gw_node, *curr_gw = NULL;
-	uint32_t max_gw_factor = 0, tmp_gw_factor = 0;
+	uint64_t max_gw_factor = 0, tmp_gw_factor = 0;
 	uint32_t gw_divisor;
 	uint8_t max_tq = 0;
 	uint8_t tq_avg;
@@ -185,7 +185,7 @@ batadv_gw_get_best_gw_node(struct batadv_priv *bat_priv)
 		switch (atomic_read(&bat_priv->gw_sel_class)) {
 		case 1: /* fast connection */
 			tmp_gw_factor = tq_avg * tq_avg;
-			tmp_gw_factor *= gw_node->bandwidth_down;
+			tmp_gw_factor *= gw_node->bandwidth_down * 100;
 			tmp_gw_factor *= 100 * 100;
 			tmp_gw_factor /= gw_divisor;