[RFC] batman-adv: mitigate issue when empty vlan is received

Message ID 20180506195559.32602-1-me@irrelefant.net (mailing list archive)
State RFC
Delegated to: Antonio Quartulli
Headers
Series [RFC] batman-adv: mitigate issue when empty vlan is received |

Commit Message

Leonardo Mörlein May 6, 2018, 7:55 p.m. UTC
  This patch is not a fix for the issue itself, but a fix for the
other nodes, which are also influenced by the issue.

- Some (affected) nodes send TT announcements for an empty vlan (for
  now only seen with vlan_id = 0).
- This behaviour is a bug! Batman-adv nodes must not send TT
  announcements for empty vlans.
- The receiving batman-adv can not handle incoming TT announcements
  with empty vlans. (The crc check routine batadv_tt_global_check_crc()
  fails.)
- As the receiving node thinks, the crc is broken, it does a full
  table request then.
- The announcing node sends a TT full table to the receiving node
  then, which also contains the empty vlan, so
  *batadv_tt_global_check_crc()* fails again on the receiver side.
- This causes a lot of unneeded TT traffic. In Freifunk Hannover we
  decreased the amount of from ~20kpp/s to ~4kpp/s on our supernodes.
  We have ~800 nodes, which are connected via vpn to one of six
  supernodes. Those supernodes are connected with each other in a fully
  connected mesh.

Signed-off-by: Leonardo Mörlein <me@irrelefant.net>
---
 net/batman-adv/translation-table.c | 17 +++++++++++++++--
 1 file changed, 15 insertions(+), 2 deletions(-)
  

Comments

Sven Eckelmann May 7, 2018, 6:32 a.m. UTC | #1
On Sonntag, 6. Mai 2018 21:55:59 CEST Leonardo Mörlein wrote:
> This patch is not a fix for the issue itself, but a fix for the
> other nodes, which are also influenced by the issue.
> 
> - Some (affected) nodes send TT announcements for an empty vlan (for
>   now only seen with vlan_id = 0).
> - This behaviour is a bug! Batman-adv nodes must not send TT
>   announcements for empty vlans.
> - The receiving batman-adv can not handle incoming TT announcements
>   with empty vlans. (The crc check routine batadv_tt_global_check_crc()
>   fails.)
> - As the receiving node thinks, the crc is broken, it does a full
>   table request then.
> - The announcing node sends a TT full table to the receiving node
>   then, which also contains the empty vlan, so
>   *batadv_tt_global_check_crc()* fails again on the receiver side.
> - This causes a lot of unneeded TT traffic. In Freifunk Hannover we
>   decreased the amount of from ~20kpp/s to ~4kpp/s on our supernodes.
>   We have ~800 nodes, which are connected via vpn to one of six
>   supernodes. Those supernodes are connected with each other in a fully
>   connected mesh.

Can you attach a (small) pcap of actual traffic which show a packet which 
triggers this behavior?

Thanks,
	Sven
  
Leonardo Mörlein May 7, 2018, 8:06 a.m. UTC | #2
On 05/07/2018 08:32 AM, Sven Eckelmann wrote:
> Can you attach a (small) pcap of actual traffic which show a packet which 
> triggers this behavior?

For sure. See for the originator 5e:25:f7:13:2d:eb and vid=0x8000 in the
attached pcap. I selected the interested packets (one ogm, one full
table request and one full table response).

Kind regards,
Leonardo Mörlein
  
Sven Eckelmann May 7, 2018, 10:23 a.m. UTC | #3
On Montag, 7. Mai 2018 10:06:20 CEST me wrote:
> On 05/07/2018 08:32 AM, Sven Eckelmann wrote:
> > Can you attach a (small) pcap of actual traffic which show a packet which 
> > triggers this behavior?
> 
> For sure. See for the originator 5e:25:f7:13:2d:eb and vid=0x8000 in the
> attached pcap. I selected the interested packets (one ogm, one full
> table request and one full table response).

Thanks,

These are always from intermediate nodes. I was hoping to see the initial 
packet which "introduced" the incorrect vid=0x8000 without entry in the first 
place to better understand the problem which causes this.

At least I would have hoped that this might help the people at WBM to have a 
better discussion about it. At least to understand whether this is from an 
intermediate node or from the sender.

On Sonntag, 6. Mai 2018 21:55:59 CEST Leonardo Mörlein wrote:
[...]
> - The receiving batman-adv can not handle incoming TT announcements
>   with empty vlans. (The crc check routine batadv_tt_global_check_crc()
>   fails.)

This part is not 100% clear in the commit message.

Right now it just sounds like softif_vlan_list (see 
batadv_tt_prepare_tvlv_local_data) has an "empty" vlan entry (an entry without 
mac addresses). Which brings me to the point that batadv_softif_create_vlan 
initializes new vlan's with the CRC 0x0 before it adds it to the list (see tt/
batadv_vlan_tt in batadv_softif_vlan). We know now that the function 
batadv_tt_local_update_crc is responsible for updating the crc (holding the 
tt.commit_lock) on the sender site (for non-intermediate nodes). And this crc 
is calculated over the bat_priv->tt.local_hash - which (according to the crc 
0x0) doesn't have a single (non-NEW) entry for this VID. The return (new crc) 
is therefore 0x0 and it will be submitted this way.

The same should be true for the receiver. The batadv_tt_global_update_crc goes 
over the list of entries for this vid. And we just received a full table which 
only contains non-vlan entries and no vlan 0 entries. My assumption would 
therefore be that the crc on the receiver should also be 0x0 when 
batadv_tt_global_update_crc was done. But if it is not then it would mean that 
there was still some entry on the receiver which batadv_tt_global_crc used to 
calculate the crc for this vid or the crc was just not calculated for it 
(because it doesn't exist).

The latter would match the content of the full table reply in your pcap, the 
comment in the patch and the info info I got from Antonio: The vid entry will 
only be created in the originator vlan_list when an entry was received (I 
haven't checked this). This would sound to me like the vlan should just be 
created instead of adding this workaround. But maybe you should discuss the 
details here with Antonio because at the end it is just about the different 
ways of handling "!vlan" case from your patch.



Btw. Antonio, wasn't the VLAN 0 always created on each device in the local 
table by the VLAN driver? See the message

   [   29.866038] 8021q: adding VLAN 0 to HW filter on device bat0

when batman-adv attaches itself to the vlan monitoring. But usually, there is 
at least one entry in it (the bat0 mac address). Not sure why this isn't the 
case here. At least batadv_interface_add_vid should add it at the end. Maybe 
there was an error before the entry was created? Would be interesting to know 
where exactly.

Kind regards,
	Sven
  
Sven Eckelmann May 8, 2018, 8:45 a.m. UTC | #4
n Montag, 7. Mai 2018 12:23:57 CEST Sven Eckelmann wrote:
> This would sound to me like the vlan should just be 
> created instead of adding this workaround. But maybe you should discuss the 
> details here with Antonio because at the end it is just about the different 
> ways of handling "!vlan" case from your patch.

Just had a look at the cleanup code for vlans and it seems like each vlan
in vlan_list must contain at least one entry (according to
vlan->tt.num_entries) and otherwise batadv_tt_global_size_mod will drop it.
So it isn't as easy as just adding a simple vlan for each received vlan
entry via TT response.

So your patch (or a variant of it) will most likely be the way to go for
receivers.


Now to your commit message:

On Sonntag, 6. Mai 2018 21:55:59 CEST Leonardo Mörlein wrote:

> batman-adv: mitigate issue when empty vlan is received

Something more specific would be nice. Maybe somebody else has a better
idea but what about:

batman-adv: Mitigate TT loop when receiving empty VLAN


> This patch is not a fix for the issue itself, but a fix for the
> other nodes, which are also influenced by the issue.

I personally don't like to read "this patch" in commit messages. This part
should maybe be reformulated and put at the end (before the Fixes line).
Something like:

While the problem is caused by the sender, the receiver can mitigate the
problem by ignoring (potentially) empty VLANs during the CRC check.

[...]
> - This behaviour is a bug! Batman-adv nodes must not send TT
>   announcements for empty vlans.

Where is this defined as incorrect message? But yes, the current
implementation cannot handle it.

> - The receiving batman-adv can not handle incoming TT announcements
>   with empty vlans. (The crc check routine batadv_tt_global_check_crc()
>   fails.)

This is to vague as discussed earlier. Maybe something more like:

- The receiving batman-adv can not handle incoming TT announcements with
  empty vlans. No new struct batadv_orig_node_vlan entry will be added
  to batadv_orig_node->vlan_list. This will cause the crc check routine
  batadv_tt_global_check_crc() to fail because it cannot find the local
  representation of this empty vlan.

[...]
> - This causes a lot of unneeded TT traffic. In Freifunk Hannover we
>   decreased the amount of from ~20kpp/s to ~4kpp/s on our supernodes.
>   We have ~800 nodes, which are connected via vpn to one of six
>   supernodes. Those supernodes are connected with each other in a fully
>   connected mesh.

This part doesn't seem to fit in the list and might just be a new paragraph
at the end to summarize the impact of the problem.


A "Fixes: " line is missing here. Please check whether this sounds about
right:

Fixes: 7ea7b4a14275 ("batman-adv: make the TT CRC logic VLAN specific")


Kind regards,
	Sven
  
Marek Lindner May 9, 2018, 2:08 p.m. UTC | #5
On Monday, May 7, 2018 3:55:59 AM HKT Leonardo Mörlein wrote:
> -               if (!vlan)
> -                       return false;
> +               if (!vlan) {
> +                       /* Due to a bug, some originators send TT
> +                        * announcements for empty vlans. As the receiving
> +                        * nodes will ignore those empty vlans (they do not
> +                        * add a batadv_orig_node_vlan into the transglobal
> +                        * for the originating node), crc check will fail
> +                        * here. To circumvent this issue, we skip the
> +                        * verification for the vlan if the crc is
> +                        * equal to 0x00000000.
> +                        */
> +                       if (tt_vlan_tmp->crc == 0x00000000)
> +                               continue;
> +                       else
> +                               return false;
> +               }

There are some issues with this approach:

 * As you might be aware, a CRC of 0x00000000 is not an invalid or 
uninitialized CRC. That means it is conceivable the CRC over the translation 
table entries and flags results in a CRC of 0x00000000 without any bugs being 
present. Applying this patch would lead to introducing a bug in these 
conditions.

 * The second concern is about how to deal with 'the bug' (referring to your 
patch comment above).  In your case 'the bug' results in missing TT entries 
for a given VLAN plus a CRC of 0x00000000. What if the CRC was 0x00000001 due 
to some other bug ? Then you're check would totally miss the mark and we're 
back at the beginning.
If we wish to deal with such misbehavior on a scale where we assume the code 
should be able to deal with garbage values in some or all fields a different 
approach is needed. Relying on a magic number (CRC 0x00000000) is as bad as 
the current behavior (assuming each VLAN has an entry).

Shortly, I will send a patch to prevent sending an empty VLAN around. Please 
note that this patch still is more of a band-aid. We still don't know how the 
empty VLAN came into being in the first place. Plus, that patch won't protect 
against garbage packets which certainly requires more effort.

Cheers,
Marek
  

Patch

diff --git a/net/batman-adv/translation-table.c b/net/batman-adv/translation-table.c
index 0225616d..d2a843fd 100644
--- a/net/batman-adv/translation-table.c
+++ b/net/batman-adv/translation-table.c
@@ -2996,8 +2996,21 @@  static bool batadv_tt_global_check_crc(struct batadv_orig_node *orig_node,
 
 		vlan = batadv_orig_node_vlan_get(orig_node,
 						 ntohs(tt_vlan_tmp->vid));
-		if (!vlan)
-			return false;
+		if (!vlan) {
+			/* Due to a bug, some originators send TT
+			 * announcements for empty vlans. As the receiving
+			 * nodes will ignore those empty vlans (they do not
+			 * add a batadv_orig_node_vlan into the transglobal
+			 * for the originating node), crc check will fail
+			 * here. To circumvent this issue, we skip the
+			 * verification for the vlan if the crc is
+			 * equal to 0x00000000.
+			 */
+			if (tt_vlan_tmp->crc == 0x00000000)
+				continue;
+			else
+				return false;
+		}
 
 		crc = vlan->tt.crc;
 		batadv_orig_node_vlan_put(vlan);