[maint] batman-adv: fix packet checksum in receive path

Message ID 6d1b250c047fb4cd3aac72d9e921ed33428fdaf5.1516648598.git.mschiffer@universe-factory.net (mailing list archive)
State Superseded, archived
Delegated to: Simon Wunderlich
Headers
Series [maint] batman-adv: fix packet checksum in receive path |

Commit Message

Matthias Schiffer Jan. 22, 2018, 7:24 p.m. UTC
  skb_postpull_rcsum() is necessary after eth_type_trans() to adjust the
skb checksum, otherwise log spam of the form "bat0: hw csum failure" will
result when packets with CHECKSUM_COMPLETE are received (at least in some
setups, e.g. when stacking batman-adv on top of VXLAN).

Signed-off-by: Matthias Schiffer <mschiffer@universe-factory.net>
---

I don't know what the exact circumstances are that trigger the log spam,
but it seems this was broken forever (I could also reproduce the issue with
our compat-14 legacy branch)... so please ask David to queue this up for
stable :)


 net/batman-adv/soft-interface.c | 8 +-------
 1 file changed, 1 insertion(+), 7 deletions(-)
  

Comments

Sven Eckelmann Jan. 22, 2018, 8:52 p.m. UTC | #1
On Montag, 22. Januar 2018 20:24:50 CET Matthias Schiffer wrote:
> skb_postpull_rcsum() is necessary after eth_type_trans() to adjust the
> skb checksum, otherwise log spam of the form "bat0: hw csum failure" will
> result when packets with CHECKSUM_COMPLETE are received (at least in some
> setups, e.g. when stacking batman-adv on top of VXLAN).

Would be nice to have a better explanation here.

The comment previously assumed that skb_pull_rcsum would be enough. But the 
problem here is that the skb_pull_rcsum only pulls the batman-adv headers. The 
actual pull of the ethernet header (with skb_pull_inline) happens inside 
eth_type_trans. Or did I miss anything?

[...]
> I don't know what the exact circumstances are that trigger the log spam,
> but it seems this was broken forever (I could also reproduce the issue with
> our compat-14 legacy branch)... so please ask David to queue this up for
> stable :)

Yes, this is broken since earliest commits. The most relevant commit in 
batman-adv is:

Fixes: fe28a94c01e1 ("batman-adv: receive packets directly using skbs")

But I would propose to use following in the kernel tree:

Fixes: c6c8fea29769 ("net: Add batman-adv meshing protocol")

The 4.15 release will be soon(tm) and Simon is currently on vacation. So we 
will most likely postpone the submission to David until Simon found a way out 
of the snow and after 4.15 is released...

But it would be nice when some people could test the patch [1] (together with 
vxlan?) on batman-adv or batman-adv-legacy. And please provide a 
"Tested-by: Full Name <email@example.org>" [2] reply when it works.

Thanks,
	Sven

[1] https://patchwork.open-mesh.org/patch/17250/
[2] https://www.kernel.org/doc/html/v4.12/process/submitting-patches.html#using-reported-by-tested-by-reviewed-by-suggested-by-and-fixes
  
Matthias Schiffer Jan. 22, 2018, 9:18 p.m. UTC | #2
On 01/22/2018 09:52 PM, Sven Eckelmann wrote:
> On Montag, 22. Januar 2018 20:24:50 CET Matthias Schiffer wrote:
>> skb_postpull_rcsum() is necessary after eth_type_trans() to adjust the
>> skb checksum, otherwise log spam of the form "bat0: hw csum failure" will
>> result when packets with CHECKSUM_COMPLETE are received (at least in some
>> setups, e.g. when stacking batman-adv on top of VXLAN).
> 
> Would be nice to have a better explanation here.
> 
> The comment previously assumed that skb_pull_rcsum would be enough. But the 
> problem here is that the skb_pull_rcsum only pulls the batman-adv headers. The 
> actual pull of the ethernet header (with skb_pull_inline) happens inside 
> eth_type_trans. Or did I miss anything?

This is correct, eth_type_trans() contains a simple skb_pull(), so the csum
must be adjusted afterwards (grepping the kernel for eth_type_trans will
find a lot of this). I can send a v2 with a better commit message later.

> 
> [...]
>> I don't know what the exact circumstances are that trigger the log spam,
>> but it seems this was broken forever (I could also reproduce the issue with
>> our compat-14 legacy branch)... so please ask David to queue this up for
>> stable :)
> 
> Yes, this is broken since earliest commits. The most relevant commit in 
> batman-adv is:
> 
> Fixes: fe28a94c01e1 ("batman-adv: receive packets directly using skbs")
> 
> But I would propose to use following in the kernel tree:
> 
> Fixes: c6c8fea29769 ("net: Add batman-adv meshing protocol")
> 
> The 4.15 release will be soon(tm) and Simon is currently on vacation. So we 
> will most likely postpone the submission to David until Simon found a way out 
> of the snow and after 4.15 is released...
> 
> But it would be nice when some people could test the patch [1] (together with 
> vxlan?) on batman-adv or batman-adv-legacy. And please provide a 
> "Tested-by: Full Name <email@example.org>" [2] reply when it works.
> 
> Thanks,> 	Sven

I've tested this on Kernel 4.14.14 (everything working correctly now) and
4.4.110 (here, there are still checksum errors; it seems on older kernels,
the checksum handling in VXLAN is broken too? Still debugging this...)

Matthias



> 
> [1] https://patchwork.open-mesh.org/patch/17250/
> [2] https://www.kernel.org/doc/html/v4.12/process/submitting-patches.html#using-reported-by-tested-by-reviewed-by-suggested-by-and-fixes
>
  
Matthias Schiffer Jan. 23, 2018, 9:12 a.m. UTC | #3
On 01/22/2018 10:18 PM, Matthias Schiffer wrote:
> On 01/22/2018 09:52 PM, Sven Eckelmann wrote:
>> On Montag, 22. Januar 2018 20:24:50 CET Matthias Schiffer wrote:
>>> skb_postpull_rcsum() is necessary after eth_type_trans() to adjust the
>>> skb checksum, otherwise log spam of the form "bat0: hw csum failure" will
>>> result when packets with CHECKSUM_COMPLETE are received (at least in some
>>> setups, e.g. when stacking batman-adv on top of VXLAN).
>>
>> Would be nice to have a better explanation here.
>>
>> The comment previously assumed that skb_pull_rcsum would be enough. But the 
>> problem here is that the skb_pull_rcsum only pulls the batman-adv headers. The 
>> actual pull of the ethernet header (with skb_pull_inline) happens inside 
>> eth_type_trans. Or did I miss anything?
> 
> This is correct, eth_type_trans() contains a simple skb_pull(), so the csum
> must be adjusted afterwards (grepping the kernel for eth_type_trans will
> find a lot of this). I can send a v2 with a better commit message later.
> 
>>
>> [...]
>>> I don't know what the exact circumstances are that trigger the log spam,
>>> but it seems this was broken forever (I could also reproduce the issue with
>>> our compat-14 legacy branch)... so please ask David to queue this up for
>>> stable :)
>>
>> Yes, this is broken since earliest commits. The most relevant commit in 
>> batman-adv is:
>>
>> Fixes: fe28a94c01e1 ("batman-adv: receive packets directly using skbs")
>>
>> But I would propose to use following in the kernel tree:
>>
>> Fixes: c6c8fea29769 ("net: Add batman-adv meshing protocol")
>>
>> The 4.15 release will be soon(tm) and Simon is currently on vacation. So we 
>> will most likely postpone the submission to David until Simon found a way out 
>> of the snow and after 4.15 is released...
>>
>> But it would be nice when some people could test the patch [1] (together with 
>> vxlan?) on batman-adv or batman-adv-legacy. And please provide a 
>> "Tested-by: Full Name <email@example.org>" [2] reply when it works.
>>
>> Thanks,> 	Sven
> 
> I've tested this on Kernel 4.14.14 (everything working correctly now) and
> 4.4.110 (here, there are still checksum errors; it seems on older kernels,
> the checksum handling in VXLAN is broken too? Still debugging this...)

I've found the issue of this other checksum problem: batman-adv
fragmentation code doesn't handle the checksum on reassembly at all. I
think the best option here is to simply set ip_summed to CHECKSUM_NONE on
reassembly, I will send another patch for that.

The IP fragmentation code does more fancy things when all fragments have
CHECKSUM_COMPLETE, adding up the checksums of the fragments under certain
circumstances. This only works because IP fragments are guaranteed to be
split at even byte offsets (multiples of 8, actually); as far as I can
tell, batman-adv allows odd fragment sizes, making it impossible to add up
the 16bit checksums in the general case.

Matthias


> 
> 
> 
>>
>> [1] https://patchwork.open-mesh.org/patch/17250/
>> [2] https://www.kernel.org/doc/html/v4.12/process/submitting-patches.html#using-reported-by-tested-by-reviewed-by-suggested-by-and-fixes
>>
> 
>
  
Maximilian Wilhelm Jan. 23, 2018, 2:25 p.m. UTC | #4
Anno domini 2018 Sven Eckelmann scripsit:

> On Montag, 22. Januar 2018 20:24:50 CET Matthias Schiffer wrote:
[...]
> > I don't know what the exact circumstances are that trigger the log spam,
> > but it seems this was broken forever (I could also reproduce the issue with
> > our compat-14 legacy branch)... so please ask David to queue this up for
> > stable :)
> 
> Yes, this is broken since earliest commits. The most relevant commit in 
> batman-adv is:
> 
> Fixes: fe28a94c01e1 ("batman-adv: receive packets directly using skbs")
> 
> But I would propose to use following in the kernel tree:
> 
> Fixes: c6c8fea29769 ("net: Add batman-adv meshing protocol")
> 
> The 4.15 release will be soon(tm) and Simon is currently on vacation. So we 
> will most likely postpone the submission to David until Simon found a way out 
> of the snow and after 4.15 is released...
> 
> But it would be nice when some people could test the patch [1] (together with 
> vxlan?) on batman-adv or batman-adv-legacy. And please provide a 
> "Tested-by: Full Name <email@example.org>" [2] reply when it works.

I took a Debian Kernel package (4.14.13-1~bpo9+1), applied the patch
and deployed the package on a gateway running BATMAN over VTEPs. The
log messages from previous kernels don't show up anymore <3.

Tested-by: Maximilian Wilhelm <max@sdn.clinic>

Thanks!

Best
Max
  
Maximilian Wilhelm Jan. 23, 2018, 9:56 p.m. UTC | #5
Anno domini 2018 Matthias Schiffer scripsit:

Hi,

> On 01/22/2018 10:18 PM, Matthias Schiffer wrote:
> > On 01/22/2018 09:52 PM, Sven Eckelmann wrote:
> >> On Montag, 22. Januar 2018 20:24:50 CET Matthias Schiffer wrote:
> >>> skb_postpull_rcsum() is necessary after eth_type_trans() to adjust the
> >>> skb checksum, otherwise log spam of the form "bat0: hw csum failure" will
> >>> result when packets with CHECKSUM_COMPLETE are received (at least in some
> >>> setups, e.g. when stacking batman-adv on top of VXLAN).
> >>
> >> Would be nice to have a better explanation here.
> >>
> >> The comment previously assumed that skb_pull_rcsum would be enough. But the 
> >> problem here is that the skb_pull_rcsum only pulls the batman-adv headers. The 
> >> actual pull of the ethernet header (with skb_pull_inline) happens inside 
> >> eth_type_trans. Or did I miss anything?
> > 
> > This is correct, eth_type_trans() contains a simple skb_pull(), so the csum
> > must be adjusted afterwards (grepping the kernel for eth_type_trans will
> > find a lot of this). I can send a v2 with a better commit message later.
> > 
> >>
> >> [...]
> >>> I don't know what the exact circumstances are that trigger the log spam,
> >>> but it seems this was broken forever (I could also reproduce the issue with
> >>> our compat-14 legacy branch)... so please ask David to queue this up for
> >>> stable :)
> >>
> >> Yes, this is broken since earliest commits. The most relevant commit in 
> >> batman-adv is:
> >>
> >> Fixes: fe28a94c01e1 ("batman-adv: receive packets directly using skbs")
> >>
> >> But I would propose to use following in the kernel tree:
> >>
> >> Fixes: c6c8fea29769 ("net: Add batman-adv meshing protocol")
> >>
> >> The 4.15 release will be soon(tm) and Simon is currently on vacation. So we 
> >> will most likely postpone the submission to David until Simon found a way out 
> >> of the snow and after 4.15 is released...
> >>
> >> But it would be nice when some people could test the patch [1] (together with 
> >> vxlan?) on batman-adv or batman-adv-legacy. And please provide a 
> >> "Tested-by: Full Name <email@example.org>" [2] reply when it works.
> >>
> >> Thanks,> 	Sven
> > 
> > I've tested this on Kernel 4.14.14 (everything working correctly now) and
> > 4.4.110 (here, there are still checksum errors; it seems on older kernels,
> > the checksum handling in VXLAN is broken too? Still debugging this...)
> 
> I've found the issue of this other checksum problem: batman-adv
> fragmentation code doesn't handle the checksum on reassembly at all. I
> think the best option here is to simply set ip_summed to CHECKSUM_NONE on
> reassembly, I will send another patch for that.
> 
> The IP fragmentation code does more fancy things when all fragments have
> CHECKSUM_COMPLETE, adding up the checksums of the fragments under certain
> circumstances. This only works because IP fragments are guaranteed to be
> split at even byte offsets (multiples of 8, actually); as far as I can
> tell, batman-adv allows odd fragment sizes, making it impossible to add up
> the 16bit checksums in the general case.

And

  Tested-By: Maximilian Wilhelm <max@sdn.clinic>

to the fix for fragmentation.c, too.

Disclaimer: As MTUs are calculated accordingly in our backbone
fragmentation of VXLAN packets isn't an issue and we did not see these
messages before. I can confirm, that I still don't see any now,
meaning the log spam from the previous fix is still fixed and no new
issues have arisen as of now.

Thanks a lot! <3

Best
Max
  

Patch

diff --git a/net/batman-adv/soft-interface.c b/net/batman-adv/soft-interface.c
index c95e2b26..edeffcb9 100644
--- a/net/batman-adv/soft-interface.c
+++ b/net/batman-adv/soft-interface.c
@@ -459,13 +459,7 @@  void batadv_interface_rx(struct net_device *soft_iface,
 
 	/* skb->dev & skb->pkt_type are set here */
 	skb->protocol = eth_type_trans(skb, soft_iface);
-
-	/* should not be necessary anymore as we use skb_pull_rcsum()
-	 * TODO: please verify this and remove this TODO
-	 * -- Dec 21st 2009, Simon Wunderlich
-	 */
-
-	/* skb->ip_summed = CHECKSUM_UNNECESSARY; */
+	skb_postpull_rcsum(skb, eth_hdr(skb), ETH_HLEN);
 
 	batadv_inc_counter(bat_priv, BATADV_CNT_RX);
 	batadv_add_counter(bat_priv, BATADV_CNT_RX_BYTES,