batman-adv: Use safer default config for optional features

Message ID 1423153373-17033-1-git-send-email-sven@narfation.org (mailing list archive)
State Superseded, archived
Headers

Commit Message

Sven Eckelmann Feb. 5, 2015, 4:22 p.m. UTC
  The current default settings for optional features in batman-adv seems to be
based around the idea that the user only compiles what he requires. They will
automatically enabled when they are compiled in. For example the network coding
part of batman-adv is by default disabled in the out-of-tree module but will be
enabled when the code is compiled during the module build.

But distributions like Debian just enable all features of the batman-adv kernel
module and hope that more experimental features or features with possible
negative effects have to be enabled using some runtime configuration interface.

The network_coding feature can help in specific setups but also has drawbacks
and is not disabled by default in the out-of-tree module. Disabling by default
in the runtime config seems to be also quite sane.

The distributed_arp_table is in theory a good solution to reduce connection
problems in large networks caused by ARP packet loss. Unfortunatelly, it seems
to also break ARP resolution in simple mesh setups. The only solution which
seems to be used by AP firmwares seems to be the deactivation of this feature.
Disabling this feature by default until the problem was understood and fixed
may help new deployments to create a working mesh. Tuning of the mesh can still
be done by them in case DAT works in their setup.

The bridge_loop_avoidance is the only feature which is disabled by default but
may be necessary even in simple setups. Packet loops may even be created
during the initial node setup when this is not enabled. This is different than
STP on bridges because mesh is usually used on Adhoc WiFi. Having two nodes
(by accident) in the same LAN segment and in the same mesh network is rather
common in this situation.

Signed-off-by: Sven Eckelmann <sven@narfation.org>
--
This patchset is based on my comments in
 https://lists.open-mesh.org/pipermail/b.a.t.m.a.n/2014-November/012543.html

The patch doesn't touch the multicast feature because I have no idea how well
it works and what the current multicast mode is actual doing.
---
 network-coding.c | 2 +-
 soft-interface.c | 4 ++--
 2 files changed, 3 insertions(+), 3 deletions(-)
  

Comments

Antonio Quartulli Feb. 6, 2015, 7:26 a.m. UTC | #1
Hi Sven,
thanks for this patch!

On 05/02/15 17:22, Sven Eckelmann wrote:
> The distributed_arp_table is in theory a good solution to reduce connection
> problems in large networks caused by ARP packet loss. Unfortunatelly, it seems
> to also break ARP resolution in simple mesh setups. The only solution which
> seems to be used by AP firmwares seems to be the deactivation of this feature.

Could you please be a bit more specific about the issue you are talking
about?

So far the only situation where I personally switched DAT off did not
really get fixed and continued to show the issue. Therefore I am not
really sure that DAT can be blamed for this.

So far I don't know anybody else who complained about DAT.

Cheers,
  
Sven Eckelmann Feb. 6, 2015, 12:38 p.m. UTC | #2
On Friday 06 February 2015 08:26:04 Antonio Quartulli wrote:
> Could you please be a bit more specific about the issue you are talking
> about?

I can only talk about the simple two node installation I had when I was forced 
to write the wireshark-batman-adv dissector for v15. I just placed them next 
to each other and tried to ping the other one (not sure anymore if it was the 
node directly or two other devices plugged in the ethernet port of the 
devices). And I think there was a reboot of one node and a ARP cache flush 
involved. I've expected to see the 4addr packets when capturing on the 
wireless and then having both nodes talking to each other. This was not 
working. I could only communicate after disabling DAT... but of course the 
communication which I wanted to capture wasn't happening.

Either I got a dump from you or I've tried again until it worked.
 
> So far the only situation where I personally switched DAT off did not
> really get fixed and continued to show the issue. Therefore I am not
> really sure that DAT can be blamed for this.
> 
> So far I don't know anybody else who complained about DAT.

You have to ask Marek why DAT is disabled in many installations for which he 
is more or less responsible. Maybe he has more to contribute than me.

Kind regards,
	Sven
  
Antonio Quartulli Feb. 6, 2015, 1:20 p.m. UTC | #3
On 06/02/15 13:38, Sven Eckelmann wrote:
> On Friday 06 February 2015 08:26:04 Antonio Quartulli wrote:
>> Could you please be a bit more specific about the issue you are talking
>> about?
> 
> I can only talk about the simple two node installation I had when I was forced 
> to write the wireshark-batman-adv dissector for v15. I just placed them next 
> to each other and tried to ping the other one (not sure anymore if it was the 
> node directly or two other devices plugged in the ethernet port of the 
> devices). And I think there was a reboot of one node and a ARP cache flush 
> involved. I've expected to see the 4addr packets when capturing on the 
> wireless and then having both nodes talking to each other. This was not 
> working. I could only communicate after disabling DAT... but of course the 
> communication which I wanted to capture wasn't happening.

Thanks for the explanation Sven.

To me this looks like the caching effect of DAT: bat0 is likely to
change MAC address everytime you recreate the interface (unless you
statically assign the MAC) therefore I presume that the other node (the
one which did not reboot) was still caching the old bat0 MAC address
(each entry requires some minutes before being invalidated/refreshed).

This means that any communication willing to contact the rebooted node
was targetting the old address and therefore the two were not be able
talk until the cache was refreshed.

Can this be the case?



Cheers,
  
Sven Eckelmann Feb. 7, 2015, 7:27 a.m. UTC | #4
On Friday 06 February 2015 14:20:28 Antonio Quartulli wrote:
> To me this looks like the caching effect of DAT: bat0 is likely to
> change MAC address everytime you recreate the interface (unless you
> statically assign the MAC) therefore I presume that the other node (the
> one which did not reboot) was still caching the old bat0 MAC address
> (each entry requires some minutes before being invalidated/refreshed).

Thanks for the explanation.

So you are basically saying that DAT cannot detect when some nodes found a 
conflict in their ARP table (when receiving IP packets with a different MAC 
for an IP or similar things)? And there is also the problem when no local 
information is stored and only a conflict with some data data on another 
unrelated node is happened.

The first conflict (the conflict in the local ARP table) is not detected 
because David wanted that batman-adv isn't accessing the ARP table?

> This means that any communication willing to contact the rebooted node
> was targetting the old address and therefore the two were not be able
> talk until the cache was refreshed.
> 
> Can this be the case?

Yes, unfortunately I hadn't the time to analyze it further and have no logs of 
any similar problem. Thats why I cannot check if this is contradicted by 
anything I've done (or not done). So it is a very plausible scenario which 
you've described.

Maybe Marek can tell me more why he chose to disable it.

Kind regards,
	Sven
  
Antonio Quartulli Feb. 7, 2015, 8:58 a.m. UTC | #5
On 07/02/15 08:27, Sven Eckelmann wrote:
> On Friday 06 February 2015 14:20:28 Antonio Quartulli wrote:
>> To me this looks like the caching effect of DAT: bat0 is likely to
>> change MAC address everytime you recreate the interface (unless you
>> statically assign the MAC) therefore I presume that the other node (the
>> one which did not reboot) was still caching the old bat0 MAC address
>> (each entry requires some minutes before being invalidated/refreshed).
> 
> Thanks for the explanation.
> 
> So you are basically saying that DAT cannot detect when some nodes found a 
> conflict in their ARP table (when receiving IP packets with a different MAC 
> for an IP or similar things)? 

Right.
Well, the scenario "incoming packets altering the local ARP table and
generating a conflict" was never really considered.

> And there is also the problem when no local 
> information is stored and only a conflict with some data data on another 
> unrelated node is happened.

I haven't really understood this issue.

> 
> The first conflict (the conflict in the local ARP table) is not detected 
> because David wanted that batman-adv isn't accessing the ARP table?

Even without reading the the ARP table we could "detect the conflict" by
checking incoming packets and by matching their IP/MAC couple with what
we have in DAT...At the moment we only assume that the IP/MAC couple is
stable enough and that in the worst case the user will wait for the
cache to get invalidated.


Still, I think this is an optimization that we may want to implement,
but not a real issue that pushes us to disable DAT by default.

> 
>> This means that any communication willing to contact the rebooted node
>> was targetting the old address and therefore the two were not be able
>> talk until the cache was refreshed.
>>
>> Can this be the case?
> 
> Yes, unfortunately I hadn't the time to analyze it further and have no logs of 
> any similar problem. Thats why I cannot check if this is contradicted by 
> anything I've done (or not done). So it is a very plausible scenario which 
> you've described.

No problem, I just wanted to get your feeling/feedback about that :)


Thanks a lot.

Cheers,
  
Marek Lindner Feb. 18, 2015, 7:33 a.m. UTC | #6
On Thursday, February 05, 2015 17:22:53 Sven Eckelmann wrote:
> The current default settings for optional features in batman-adv seems to be
> based around the idea that the user only compiles what he requires. They
> will automatically enabled when they are compiled in. For example the
> network coding part of batman-adv is by default disabled in the out-of-tree
> module but will be enabled when the code is compiled during the module
> build.
> 
> But distributions like Debian just enable all features of the batman-adv
> kernel module and hope that more experimental features or features with
> possible negative effects have to be enabled using some runtime
> configuration interface.

Interesting point. Based on what you are saying we definitely should review our 
policy and agree on sane defaults.


> The network_coding feature can help in specific setups but also has
> drawbacks and is not disabled by default in the out-of-tree module.
> Disabling by default in the runtime config seems to be also quite sane.

This feature requires the wifi driver to support promisc mode. We should keep 
it disabled.


> The distributed_arp_table is in theory a good solution to reduce connection
> problems in large networks caused by ARP packet loss. Unfortunatelly, it
> seems to also break ARP resolution in simple mesh setups. The only solution
> which seems to be used by AP firmwares seems to be the deactivation of this
> feature. Disabling this feature by default until the problem was understood
> and fixed may help new deployments to create a working mesh. Tuning of the
> mesh can still be done by them in case DAT works in their setup.

I vote for keeping DAT enabled. 


> The bridge_loop_avoidance is the only feature which is disabled by default
> but may be necessary even in simple setups. Packet loops may even be
> created during the initial node setup when this is not enabled. This is
> different than STP on bridges because mesh is usually used on Adhoc WiFi.
> Having two nodes (by accident) in the same LAN segment and in the same mesh
> network is rather common in this situation.

Agreed.


Cheers,
Marek
  
Martin Hundebøll Feb. 18, 2015, 5:18 p.m. UTC | #7
Hi Sven, Marek,

On 2015-02-18 08:33, Marek Lindner wrote:
> On Thursday, February 05, 2015 17:22:53 Sven Eckelmann wrote:
>> The current default settings for optional features in batman-adv seems to be
>> based around the idea that the user only compiles what he requires. They
>> will automatically enabled when they are compiled in. For example the
>> network coding part of batman-adv is by default disabled in the out-of-tree
>> module but will be enabled when the code is compiled during the module
>> build.
>>
>> But distributions like Debian just enable all features of the batman-adv
>> kernel module and hope that more experimental features or features with
>> possible negative effects have to be enabled using some runtime
>> configuration interface.
>
> Interesting point. Based on what you are saying we definitely should review our
> policy and agree on sane defaults.

Sane defaults have always been one of the great values of batman-adv...

>> The network_coding feature can help in specific setups but also has
>> drawbacks and is not disabled by default in the out-of-tree module.
>> Disabling by default in the runtime config seems to be also quite sane.
>
> This feature requires the wifi driver to support promisc mode. We should keep
> it disabled.

I think the problem Sven mentions, is that some distro's build their 
kernels with all_yes, and so network coding is always built in. Since NC 
is enabled by default (runtime) when compiled in, it may run even when 
promisc mode is not enabled/supported.

So I suggest NC being disabled runtime by default.

>> The distributed_arp_table is in theory a good solution to reduce connection
>> problems in large networks caused by ARP packet loss. Unfortunatelly, it
>> seems to also break ARP resolution in simple mesh setups. The only solution
>> which seems to be used by AP firmwares seems to be the deactivation of this
>> feature. Disabling this feature by default until the problem was understood
>> and fixed may help new deployments to create a working mesh. Tuning of the
>> mesh can still be done by them in case DAT works in their setup.
>
> I vote for keeping DAT enabled.

++

>> The bridge_loop_avoidance is the only feature which is disabled by default
>> but may be necessary even in simple setups. Packet loops may even be
>> created during the initial node setup when this is not enabled. This is
>> different than STP on bridges because mesh is usually used on Adhoc WiFi.
>> Having two nodes (by accident) in the same LAN segment and in the same mesh
>> network is rather common in this situation.
>
> Agreed.

++
  

Patch

diff --git a/network-coding.c b/network-coding.c
index 127cc4d..be005b2 100644
--- a/network-coding.c
+++ b/network-coding.c
@@ -155,7 +155,7 @@  err:
  */
 void batadv_nc_init_bat_priv(struct batadv_priv *bat_priv)
 {
-	atomic_set(&bat_priv->network_coding, 1);
+	atomic_set(&bat_priv->network_coding, 0);
 	bat_priv->nc.min_tq = 200;
 	bat_priv->nc.max_fwd_delay = 10;
 	bat_priv->nc.max_buffer_time = 200;
diff --git a/soft-interface.c b/soft-interface.c
index 8748987..ae96cee 100644
--- a/soft-interface.c
+++ b/soft-interface.c
@@ -738,10 +738,10 @@  static int batadv_softif_init_late(struct net_device *dev)
 	atomic_set(&bat_priv->aggregated_ogms, 1);
 	atomic_set(&bat_priv->bonding, 0);
 #ifdef CONFIG_BATMAN_ADV_BLA
-	atomic_set(&bat_priv->bridge_loop_avoidance, 0);
+	atomic_set(&bat_priv->bridge_loop_avoidance, 1);
 #endif
 #ifdef CONFIG_BATMAN_ADV_DAT
-	atomic_set(&bat_priv->distributed_arp_table, 1);
+	atomic_set(&bat_priv->distributed_arp_table, 0);
 #endif
 #ifdef CONFIG_BATMAN_ADV_MCAST
 	bat_priv->mcast.flags = BATADV_NO_FLAGS;