[maint] batman-adv: fix soft-interface MTU computation

Message ID 1390299725-1873-1-git-send-email-antonio@meshcoding.com (mailing list archive)
State Accepted, archived
Commit 2b108ccd0533e1375e44c73ec58c69dde9a71687

Commit Message

Antonio Quartulli Jan. 21, 2014, 10:22 a.m. UTC
The current MTU computation always returns a value
smaller than 1500 bytes even if the real interfaces
have an MTU large enough to compensate for the batman-adv
overhead.

Fix the computation by properly returning the highest
admitted value.

Signed-off-by: Antonio Quartulli <antonio@meshcoding.com>
---

This patch is missing a Reported-by clause because I did not have
"russell"'s email address at hand.

Will be added later before being merged.

Cheers,


 hard-interface.c | 22 ++++++++++++++--------
 1 file changed, 14 insertions(+), 8 deletions(-)
  

Comments

Antonio Quartulli Jan. 21, 2014, 10:31 a.m. UTC | #1
On 21/01/14 11:22, Antonio Quartulli wrote:
> The current MTU computation always returns a value
> smaller than 1500bytes even if the real interfaces
> have an MTU large enough to compensate the batman-adv
> overhead.
> 
> Fix the computation by properly returning the highest
> admitted value.
> 

Introduced by f7f2fe494388fca828094a4ebdab918a7b2d64f8
("batman-adv: limit local translation table max size")

> Signed-off-by: Antonio Quartulli <antonio@meshcoding.com>
  
Russell Senior Jan. 21, 2014, 6:43 p.m. UTC | #2
>>>>> "Antonio" == Antonio Quartulli <antonio@meshcoding.com> writes:

Antonio> The current MTU computation always returns a value smaller
Antonio> than 1500bytes even if the real interfaces have an MTU large
Antonio> enough to compensate the batman-adv overhead.

Antonio> Fix the computation by properly returning the highest
Antonio> admitted value.

Antonio> Signed-off-by: Antonio Quartulli <antonio@meshcoding.com> ---

This seems to fix the bat0-MTU-unnecessarily-small problem I observed
last night and reported on the IRC channel.  I haven't actually passed
any traffic over it yet, but the interface is up with the expected MTU
value with the patch.

Antonio> This patch is missing a Reported-by clause because I did not
Antonio> have "russell"'s email address at hand.

Antonio> Will be added later before being merged.

Reported-by: Russell Senior <russell@personaltelco.net>
  
Antonio Quartulli Jan. 21, 2014, 7 p.m. UTC | #3
On 21/01/14 19:43, Russell Senior wrote:
>>>>>> "Antonio" == Antonio Quartulli <antonio@meshcoding.com> writes:
> 
> Antonio> The current MTU computation always returns a value smaller
> Antonio> than 1500bytes even if the real interfaces have an MTU large
> Antonio> enough to compensate the batman-adv overhead.
> 
> Antonio> Fix the computation by properly returning the highest
> Antonio> admitted value.
> 
> Antonio> Signed-off-by: Antonio Quartulli <antonio@meshcoding.com> ---
> 
> This seems to fix the bat0-MTU-unnecessarily-small problem I observed
> last night and reported on the IRC channel.  I haven't actually passed
> any traffic over it yet, but the interface is up with the expected MTU
> value with the patch.

Just to be sure the fix is not introducing any misbehaviour: have you
tried setting smaller MTUs to your hard interface? In that case have you
seen the bat0 reducing its MTU?

> 
> Antonio> This patch is missing a Reported-by clause because I did not
> Antonio> have "russell"'s email address at hand.
> 
> Antonio> Will be added later before being merged.
> 
> Reported-by: Russell Senior <russell@personaltelco.net>

I'd also add Tested-by ;)

Thanks a lot!
  
Russell Senior Jan. 22, 2014, 6:04 a.m. UTC | #4
>>>>> "Russell" == Russell Senior <russell@personaltelco.net> writes:

>>>>> "Antonio" == Antonio Quartulli <antonio@meshcoding.com> writes:
Antonio> The current MTU computation always returns a value smaller
Antonio> than 1500bytes even if the real interfaces have an MTU large
Antonio> enough to compensate the batman-adv overhead.

Antonio> Fix the computation by properly returning the highest
Antonio> admitted value.

Antonio> Signed-off-by: Antonio Quartulli <antonio@meshcoding.com> ---

Russell> This seems to fix the bat0-MTU-unnecessarily-small problem I
Russell> observed last night and reported on the IRC channel.  I
Russell> haven't actually passed any traffic over it yet, but the
Russell> interface is up with the expected MTU value with the patch.

Antonio> This patch is missing a Reported-by clause because I did not
Antonio> have "russell"'s email address at hand.

Russell> Reported-by: Russell Senior <russell@personaltelco.net>

Followup, as requested, I tried setting a smaller MTU (1400) on the
adhoc0 interface.  When fragmentation was enabled, this resulted in no
change to MTU (still 1500) for bat0.  When I disabled fragmentation,
the bat0 MTU dropped, as expected, to 1368.  Interestingly, the MTU on
the bridge that bat0 was a member of remained 1500 despite the lower
bat0 MTU.  Should that be?

Also, for testing actual traffic over the batman-adv link, I built
OpenWrt r39354 with the patch on a Soekris net4526, so that there were
two nodes with the same revision (different architecture):
ubnt-bullet-m with ath9k; net4826 with ath5k.  I first noticed that I
was losing about 100k of memory every couple seconds and pretty soon
(within 20 minutes) the net4826 started oopsing on out-of-memory.

I removed the patch, rev'd OpenWrt to r39365 and confirmed that the
net4826 build was also leaking at a substantial rate.

I am seeing a similar, though possibly slower, leak on the ubiquiti
bullet m2hp.  Right before rebooting, top shows kworker/u2:$N (where
$N is 0 or 3) chewing up some cpu cycles.

Has anybody else seen this memory leak?  Leads on where it's coming
from?  Not a runaway process, at least not that top shows up.  Just a
gradual disappearance from MemFree that /proc/sys/vm/drop_caches
doesn't fix.  It isn't adhoc mode, and I can associate the two devices
over adhoc and move a bunch of data with no memory lost, but turning
on batman-adv seems to sink it.
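
For anyone wanting to track the same symptom, here is a minimal sketch of pulling the MemFree figure out of /proc/meminfo-style text so that successive samples can be compared. The function name `memfree_kb` is hypothetical and the code assumes the usual "MemFree: <n> kB" line format; it is only an illustration, not a tool used in this thread.

```c
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: return the MemFree value (in kB) found in a
 * buffer holding /proc/meminfo-style text, or -1 if no such line exists.
 */
long memfree_kb(const char *meminfo)
{
	const char *line = meminfo;

	while (line != NULL && *line != '\0') {
		long kb;

		/* only a line starting with "MemFree:" yields a conversion */
		if (sscanf(line, "MemFree: %ld kB", &kb) == 1)
			return kb;

		/* advance to the character after the next newline, if any */
		line = strchr(line, '\n');
		if (line != NULL)
			line++;
	}

	return -1;
}
```

Sampling the file every few seconds and subtracting consecutive values would give a rough leak rate that can be compared with and without a batman-adv neighbor present.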
  
Antonio Quartulli Jan. 22, 2014, 7:03 a.m. UTC | #5
On 22/01/14 07:04, Russell Senior wrote:
>>>>>> "Russell" == Russell Senior <russell@personaltelco.net> writes:
> 
>>>>>> "Antonio" == Antonio Quartulli <antonio@meshcoding.com> writes:
> Antonio> The current MTU computation always returns a value smaller
> Antonio> than 1500bytes even if the real interfaces have an MTU large
> Antonio> enough to compensate the batman-adv overhead.
> 
> Antonio> Fix the computation by properly returning the highest
> Antonio> admitted value.
> 
> Antonio> Signed-off-by: Antonio Quartulli <antonio@meshcoding.com> ---
> 
> Russell> This seems to fix the bat0-MTU-unnecessarily-small problem I
> Russell> observed last night and reported on the IRC channel.  I
> Russell> haven't actually passed any traffic over it yet, but the
> Russell> interface is up with the expected MTU value with the patch.
> 
> Antonio> This patch is missing a Reported-by clause because I did not
> Antonio> have "russell"'s email address at hand.
> 
> Russell> Reported-by: Russell Senior <russell@personaltelco.net>
> 
> Followup, as requested, I tried setting a smaller MTU (1400) on the
> adhoc0 interface.  When fragmentation was enabled, this resulted in no
> change to MTU (still 1500) for bat0.  When I disabled fragmentation,
> the bat0 MTU dropped, as expected, to 1368.  Interestingly, the MTU on
> the bridge that bat0 was a member of remained 1500 despite the lower
> bat0 MTU.  Should that be?
> 

I don't really know how the bridge code behaves. As far as I remember it
should adapt to the smallest MTU.

But thanks for testing! This shows that the patch is working fine ;)
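
The numbers reported above (1400 on adhoc0 giving 1368 on bat0 with fragmentation disabled) imply a 32-byte overhead. A minimal user-space sketch of that arithmetic follows, assuming the soft-interface MTU is the smallest hard-interface MTU minus the overhead, capped at 1500. The names `soft_iface_mtu`, `SKETCH_MAX_MTU` and the `overhead` parameter are hypothetical; this is not the actual hard-interface.c code, and the fragmentation-enabled path is not modelled.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed upper bound for the soft-interface MTU (standard Ethernet payload). */
#define SKETCH_MAX_MTU 1500

/*
 * Hypothetical helper: soft-interface MTU with fragmentation disabled.
 * hard_mtus holds the MTUs of the n attached hard interfaces and
 * overhead is the per-packet batman-adv overhead in bytes.
 */
int soft_iface_mtu(const int *hard_mtus, size_t n, int overhead)
{
	int min_mtu;
	size_t i;

	assert(hard_mtus != NULL && n > 0);

	/* the smallest hard-interface MTU limits what can go over the wire */
	min_mtu = hard_mtus[0];
	for (i = 1; i < n; i++)
		if (hard_mtus[i] < min_mtu)
			min_mtu = hard_mtus[i];

	/* remove the overhead, then cap at the assumed maximum */
	min_mtu -= overhead;
	return min_mtu < SKETCH_MAX_MTU ? min_mtu : SKETCH_MAX_MTU;
}
```

With a single hard interface at 1400 and an overhead of 32 this yields 1368; with a hard interface at 1560 the result is capped at 1500 instead of falling below it.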
  
Antonio Quartulli Jan. 22, 2014, 7:04 a.m. UTC | #6
On 22/01/14 07:04, Russell Senior wrote:
> Also, for testing actual traffic over the batman-adv link, I build
> OpenWrt r39354 with the patch on a Soekris net4526, so that there were
> two nodes with the same revision (different architecture):
> ubnt-bullet-m with ath9k; net4826 with ath5k.  I first noticed that I
> was losing about 100k of memory every couple seconds and pretty soon
> (with 20 minutes) the net4826 started oopsing on out-of-memory.
> 

mh..does this happen with or without fragmentation enabled? Does this
happen even if you don't generate traffic on the interface?

> I removed the patch, rev'd OpenWrt to r39365 and confirmed that the
> net4826 build was also leaking at a substantial rate.
> 
> I am seeing a similar, though possibly slower, leak on the ubiquiti
> bullet m2hp.  Right before rebooting, top shows kworker/u2:$N (where
> $N is 0 or 3) chewing up some cpu cycles.
> 
> Has anybody else seen this memory leak?  Leads on where it's coming
> from?  Not a runaway process, at least not that top shows up.  Just a
> gradual disappearance from MemFree that /proc/sys/vm/drop_caches
> doesn't fix.  It isn't adhoc mode, and I can associate the two devices
> over adhoc and move a bunch of data with no memory lost, but turning
> on batman-adv seems to sink it.
> 
> 

Thanks for reporting!
  
Daniel Golle Jan. 22, 2014, 7:37 a.m. UTC | #7
On 01/22/2014 07:04 AM, Russell Senior wrote:
> Has anybody else seen this memory leak?  Leads on where it's coming
> from?  Not a runaway process, at least not that top shows up.  Just a
> gradual disappearance from MemFree that /proc/sys/vm/drop_caches
> doesn't fix.  It isn't adhoc mode, and I can associate the two devices
> over adhoc and move a bunch of data with no memory lost, but turning
> on batman-adv seems to sink it.
Yes, and I tested (compile-time selected) with and without network coding, and
(at run-time) with and without fragmentation (as I also bumped into the MTU
calculation problem later fixed by the patch on this list) -- any 32MB RAM
devices reboots after roughly 30 minutes due to OOM without substantial traffic,
if there is traffic then apparently even faster...
  
Russell Senior Jan. 22, 2014, 5:45 p.m. UTC | #8
>>>>> "Daniel" == Daniel  <daniel@makrotopia.org> writes:

Daniel> On 01/22/2014 07:04 AM, Russell Senior wrote:
>> Has anybody else seen this memory leak?  Leads on where it's coming
>> from?  Not a runaway process, at least not that top shows up.  Just
>> a gradual disappearance from MemFree that /proc/sys/vm/drop_caches
>> doesn't fix.  It isn't adhoc mode, and I can associate the two
>> devices over adhoc and move a bunch of data with no memory lost,
>> but turning on batman-adv seems to sink it.

Daniel> Yes, and I tested (compile-time selected) with and without
Daniel> network coding, and (at run-time) with and without
Daniel> fragmentation (as I also bumped into the MTU calculation
Daniel> problem later fixed by the patch on this list) -- any 32MB RAM
Daniel> devices reboots after roughly 30 minutes due to OOM without
Daniel> substantial traffic, if there is traffic then apparently even
Daniel> faster...

The memory leak I see seems to commence as soon as a batman-adv
neighbor (same version, in this case 15) appears and stops when the
neighbor goes away.

I am going to try enabling kmemleak and see if that tells me anything.
  
Antonio Quartulli Jan. 22, 2014, 5:46 p.m. UTC | #9
On 22/01/14 18:45, Russell Senior wrote:
>>>>>> "Daniel" == Daniel  <daniel@makrotopia.org> writes:
> 
> Daniel> On 01/22/2014 07:04 AM, Russell Senior wrote:
>>> Has anybody else seen this memory leak?  Leads on where it's coming
>>> from?  Not a runaway process, at least not that top shows up.  Just
>>> a gradual disappearance from MemFree that /proc/sys/vm/drop_caches
>>> doesn't fix.  It isn't adhoc mode, and I can associate the two
>>> devices over adhoc and move a bunch of data with no memory lost,
>>> but turning on batman-adv seems to sink it.
> 
> Daniel> Yes, and I tested (compile-time selected) with and without
> Daniel> network coding, and (at run-time) with and without
> Daniel> fragmentation (as I also bumped into the MTU calculation
> Daniel> problem later fixed by the patch on this list) -- any 32MB RAM
> Daniel> devices reboots after roughly 30 minutes due to OOM without
> Daniel> substantial traffic, if there is traffic then apparently even
> Daniel> faster...
> 
> The memory leak I see seems to commence as soon as a batman-adv
> neighbor (same version, in this case 15) appears and stops when the
> neighbor goes away.
> 

Thank you very much for the hint Russell!
Today I tried with one node only, but kmemleak did not report anything...

> I am going to try enabling kmemleak and see of that tells me anything.
> 

Thanks! Keep us informed!

Cheers,
  
Russell Senior Jan. 22, 2014, 7:18 p.m. UTC | #10
>>>>> "Antonio" == Antonio Quartulli <antonio@meshcoding.com> writes:

Russell> Has anybody else seen this memory leak?  Leads on where it's
Russell> coming from?  Not a runaway process, at least not that top
Russell> shows up.  Just a gradual disappearance from MemFree that
Russell> /proc/sys/vm/drop_caches doesn't fix.  It isn't adhoc mode,
Russell> and I can associate the two devices over adhoc and move a
Russell> bunch of data with no memory lost, but turning on batman-adv
Russell> seems to sink it.

Russell> The memory leak I see seems to commence as soon as a
Russell> batman-adv neighbor (same version, in this case 15) appears
Russell> and stops when the neighbor goes away.

Antonio> Thank you very much for the hint Russel!  Today I tried with
Antonio> one node only, but kmemleak did not report anything...

Russell> I am going to try enabling kmemleak and see of that tells me
Russell> anything.

Antonio> Thanks! Keep us informed!

Here is a bootlog in which I spit out a bunch of kmemleak stuff into a
console (captured by /usr/bin/screen, sorry for the extraneous line
feed silliness).

  https://personaltelco.net/~russell/kmemleak-batman-from-boot.log

If I count instances, it looks like batadv_orig_node_vlan_new (and the
things that are calling it) may be implicated.

Hope that helps!
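
The "count instances" step can be sketched as a plain substring count over the captured log text. This is only an illustration: `count_occurrences` is a hypothetical helper, and it counts raw, non-overlapping matches of a symbol name rather than parsing kmemleak's report format.

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: count non-overlapping occurrences of needle in
 * text, e.g. how often a function name shows up in a saved log.
 */
size_t count_occurrences(const char *text, const char *needle)
{
	size_t count = 0;
	size_t len = strlen(needle);

	/* an empty needle would match everywhere; treat it as no match */
	if (len == 0)
		return 0;

	while ((text = strstr(text, needle)) != NULL) {
		count++;
		text += len; /* continue searching after this match */
	}

	return count;
}
```

Running it once per candidate symbol and comparing the totals gives the same kind of rough ranking described above.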
  
cmsv Jan. 22, 2014, 8:49 p.m. UTC | #11
I had the same problem which caused reboots but after the last
batman-adv update i am not seeing it.
all my devices are have mb ram
i am using network coding and 1560 MTU

On 01/22/2014 12:46 PM, Antonio Quartulli wrote:
> On 22/01/14 18:45, Russell Senior wrote:
>>>>>>> "Daniel" == Daniel  <daniel@makrotopia.org> writes:
>>
>> Daniel> On 01/22/2014 07:04 AM, Russell Senior wrote:
>>>> Has anybody else seen this memory leak?  Leads on where it's coming
>>>> from?  Not a runaway process, at least not that top shows up.  Just
>>>> a gradual disappearance from MemFree that /proc/sys/vm/drop_caches
>>>> doesn't fix.  It isn't adhoc mode, and I can associate the two
>>>> devices over adhoc and move a bunch of data with no memory lost,
>>>> but turning on batman-adv seems to sink it.
>>
>> Daniel> Yes, and I tested (compile-time selected) with and without
>> Daniel> network coding, and (at run-time) with and without
>> Daniel> fragmentation (as I also bumped into the MTU calculation
>> Daniel> problem later fixed by the patch on this list) -- any 32MB RAM
>> Daniel> devices reboots after roughly 30 minutes due to OOM without
>> Daniel> substantial traffic, if there is traffic then apparently even
>> Daniel> faster...
>>
>> The memory leak I see seems to commence as soon as a batman-adv
>> neighbor (same version, in this case 15) appears and stops when the
>> neighbor goes away.
>>
> 
> Thank you very much for the hint Russel!
> Today I tried with one node only, but kmemleak did not report anything...
> 
>> I am going to try enabling kmemleak and see of that tells me anything.
>>
> 
> Thanks! Keep us informed!
> 
> Cheers,
>
  
Russell Senior Jan. 22, 2014, 11:57 p.m. UTC | #12
>>>>> "cmsv" == cmsv  <cmsv@wirelesspt.net> writes:

cmsv> I had the same problem which caused reboots but after the last
cmsv> batman-adv update i am not seeing it.  all my devices are have
cmsv> mb ram i am using network coding and 1560 MTU

Which version are you running?
  
cmsv Jan. 23, 2014, 12:10 a.m. UTC | #13
On 01/22/2014 06:57 PM, Russell Senior wrote:
>>>>>> "cmsv" == cmsv  <cmsv@wirelesspt.net> writes:
> 
> cmsv> I had the same problem which caused reboots but after the last
> cmsv> batman-adv update i am not seeing it.  all my devices are have
> cmsv> mb ram i am using network coding and 1560 MTU
> 
> Which version are you running?
> 
> 

Right now with openwrt AA DISTRIB_REVISION="r39154" and batctl 2014.0.0
[batman-adv: 2014.0.0]

Routers are dir 601 dir 615* tl wr703n and it is not happening
I synced the feed less than 48h ago and recompiled.

what version are you using ?
  
Daniel Golle Jan. 23, 2014, 3:35 a.m. UTC | #14
On 01/23/2014 01:10 AM, cmsv wrote:
> 
> 
> On 01/22/2014 06:57 PM, Russell Senior wrote:
>>>>>>> "cmsv" == cmsv  <cmsv@wirelesspt.net> writes:
>>
>> cmsv> I had the same problem which caused reboots but after the last
>> cmsv> batman-adv update i am not seeing it.  all my devices are have
>> cmsv> mb ram i am using network coding and 1560 MTU
>>
>> Which version are you running?
>>
>>
> 
> Righ now with openwrt AA DISTRIB_REVISION="r39154" and batctl 2014.0.0
> [batman-adv: 2014.0.0]
> 
> Routers are dir 601 dir 615* tl wr703n and it is not happening
> I synced the feed less than 48h ago and recompiled.
> 
> what version are you using ?

out-of-memory every 20 minutes or so on OpenWrt trunk/BB r39365 on tl-wr841nd-v8
with batman-adv 2014.0.0 from openwrt's routing feed with "batman-adv: fix
batman-adv header overhead calculation" and "batman-adv: fix soft-interface MTU
computation" on top.
A sample node is (occasionally) reachable via DN42 at 104.61.99.104 (feel free
to ask for ssh or any kind of logs, serial access, remote gdb or whatever)
  
Daniel Golle Jan. 26, 2014, 12:57 p.m. UTC | #15
I now built OpenWrt trunk/BB r39365 with batman-adv 2013.4.0 instead of
2014.0.0, tried with all possible settings, no memory leak whatsoever, happy
uptimes of more than a day by now :)
all other system components and settings are exactly identical to my previous
setup with batman-adv 2014.0.0.

On 01/23/2014 04:35 AM, Daniel wrote:
> On 01/23/2014 01:10 AM, cmsv wrote:
>> what version are you using ?

> out-of-memory every 20 minutes or so on OpenWrt trunk/BB r39365 on tl-wr841nd-v8
> with batman-adv 2014.0.0 from openwrt's routing feed with "batman-adv: fix
> batman-adv header overhead calculation" and "batman-adv: fix soft-interface MTU
> computation" on top.
> A sample node is (occasionally) reachable via DN42 at 104.61.99.104 (feel free
> to ask for ssh or any kind of logs, serial access, remote gdb or whatever)
  
Antonio Quartulli Jan. 26, 2014, 2:21 p.m. UTC | #16
On 26/01/14 13:57, Daniel wrote:
> I now built OpenWrt trunk/BB r39365 with batman-adv 2013.4.0 instead of
> 2014.0.0, tried with all possible settings, no memory leak what-so-over, happy
> uptimes of more than a day by now :)
> all other system components and settings are exactly identical to my previous
> setup with batman-adv 2014.0.0.
> 
> On 01/23/2014 04:35 AM, Daniel wrote:
>> On 01/23/2014 01:10 AM, cmsv wrote:
>>> what version are you using ?
> 
>> out-of-memory every 20 minutes or so on OpenWrt trunk/BB r39365 on tl-wr841nd-v8
>> with batman-adv 2014.0.0 from openwrt's routing feed with "batman-adv: fix
>> batman-adv header overhead calculation" and "batman-adv: fix soft-interface MTU
>> computation" on top.
>> A sample node is (occasionally) reachable via DN42 at 104.61.99.104 (feel free
>> to ask for ssh or any kind of logs, serial access, remote gdb or whatever)
> 

Thanks for testing guys!

I found something wrong in the code and I am going to send a patch soon.
I'd really appreciate if somebody could test it!

Thanks!
  
cmsv Jan. 26, 2014, 4:05 p.m. UTC | #17
Inline:

On 01/26/2014 09:21 AM, Antonio Quartulli wrote:
> On 26/01/14 13:57, Daniel wrote:
>> I now built OpenWrt trunk/BB r39365 with batman-adv 2013.4.0 instead of
>> 2014.0.0, tried with all possible settings, no memory leak what-so-over, happy
>> uptimes of more than a day by now :)
>> all other system components and settings are exactly identical to my previous
>> setup with batman-adv 2014.0.0.
>>
>> On 01/23/2014 04:35 AM, Daniel wrote:
>>> On 01/23/2014 01:10 AM, cmsv wrote:
>>>> what version are you using ?
>>
>>> out-of-memory every 20 minutes or so on OpenWrt trunk/BB r39365 on tl-wr841nd-v8
>>> with batman-adv 2014.0.0 from openwrt's routing feed with "batman-adv: fix
>>> batman-adv header overhead calculation" and "batman-adv: fix soft-interface MTU
>>> computation" on top.
>>> A sample node is (occasionally) reachable via DN42 at 104.61.99.104 (feel free
>>> to ask for ssh or any kind of logs, serial access, remote gdb or whatever)
>>
> 
> Thanks for testing guys!
> 
> I found something wrong in the code and I am going to send a patch soon.
> I'd really appreciate if somebody could test it!

Will this patch also be relevant to attitude adjustment ? The reason why
i ask is because with AA right now i do not experience the reboots.
I should also add that i am using mac80211 r39150 and hostapd r39155 on
top of the latest AA.

Can you explain what exactly your code findings have an impact on ?

> 
> Thanks!
>
  
Antonio Quartulli Jan. 26, 2014, 4:07 p.m. UTC | #18
On 26/01/14 17:05, cmsv wrote:
> 
> Will this patch also be relevant to attitude adjustment ? The reason why
> i was if because with AA right now i do not experience the reboots.
> I should also add that i am using mac80211 r39150 and hostapd r39155 on
> top of the latest AA.
> 
> Can you explain in what exactly your code findings have an impact on ?

This is a patch to fix the memleak we were discussing about.
This bug appeared with and it is meant to be applied on
batman-adv-2014.0.0 (regardless of the openwrt revision).

Cheers,
  
Antonio Quartulli Jan. 26, 2014, 4:13 p.m. UTC | #19
On 26/01/14 17:07, Antonio Quartulli wrote:
> 
> This is a patch to fix the memleak we were discussing about.
> This bug appeared with and it is meant to be applied on
> batman-adv-2014.0.0 (regardless of the openwrt revision).

sorry, bad copy/paste.

The patch is for batman-adv-2014.0.0 (I don't know what version you have
in AA).
It fixes the memleak bug that we were discussing about.
  
Marek Lindner Jan. 27, 2014, 7:57 a.m. UTC | #20
On Tuesday 21 January 2014 11:31:08 Antonio Quartulli wrote:
> On 21/01/14 11:22, Antonio Quartulli wrote:
> > The current MTU computation always returns a value
> > smaller than 1500bytes even if the real interfaces
> > have an MTU large enough to compensate the batman-adv
> > overhead.
> >
> > 
> >
> > Fix the computation by properly returning the highest
> > admitted value.
> >
> > 
> 
> Introduced by f7f2fe494388fca828094a4ebdab918a7b2d64f8
> ("batman-adv: limit local translation table max size")
> 
> > Signed-off-by: Antonio Quartulli <antonio@meshcoding.com>

Applied in revision 2b108cc.

Thanks,
Marek
  
cmsv Jan. 27, 2014, 5:55 p.m. UTC | #21
Here is an update of some tests i ran in the past 24h with the following
build:

routers used:
dlink dir 601a and tplink wr703n in "ng" mode. (atheros)

My current AA DISTRIB_REVISION="r39154"
mac80211 r39150 from openwrt trunk
hostapd r39155 from trunk

From batman-adv i am using the following patches:
ls  feeds/routing/batman-adv/patches/
0001-batman-adv-fix-batman-adv-header-overhead-calculatio.patch

From d72756b97529b3c6afa08933216aaa912bb16ce6 Mon Sep 17 00:00:00 2001
From: Marek Lindner <mareklindner@neomailbox.ch>
Date: Wed, 15 Jan 2014 20:31:18 +0800
Subject: [PATCH] batman-adv: fix batman-adv header overhead calculation


batman-adv/Makefile
# $Id: Makefile 5624 2006-11-23 00:29:07Z nbd $

include $(TOPDIR)/rules.mk

PKG_NAME:=batman-adv

PKG_VERSION:=2014.0.0
BATCTL_VERSION:=2014.0.0
PKG_RELEASE:=1
PKG_MD5SUM:=8d58ecaede17dc05aab1b549dc09fa7d
BATCTL_MD5SUM:=b0bcf29fef80ddcc33769e13f5937d0a


I tried to find any memory leaks that could be causing reboots and i was
unable to find any after having compiled the build with
batman-adv-header-overhead-calculatio.patch. Before this patch i did get
reboots caused by the leak.


I keep monitoring memory usage with top, htop, ps and /proc/meminfo
since i was not able to install valgrind due to lack of available flash
memory given the size of the valgrind package.

Got some tips from here:
http://blog.thewebsitepeople.org/2011/03/linux-memory-leak-detection

Additionally i ran iperf tests on both routers against each other to force
them under heavy load during 24h: iperf -c <ip> -t 99999 -i 5

The mtu is 1560 for the adhoc.

After 24h i still had 6 mb of ram and above on both routers. Once i
stopped the tests; the ram increased.

dmesg and logread show nothing wrong and no errors.

No reboots happened during this time which leads me to conclude that the
problem might not be all from batman-adv side or maybe not even at all
or maybe only happens when in use with something very specific.

I would like to run a few more tests to be more sure about possible
leaks but are there any other tools that someone might recommend ?

@ daniel
What did you use to find the leak and or how did you troubleshoot it ?



On 01/26/2014 11:13 AM, Antonio Quartulli wrote:
> On 26/01/14 17:07, Antonio Quartulli wrote:
>>
>> This is a patch to fix the memleak we were discussing about.
>> This bug appeared with and it is meant to be applied on
>> batman-adv-2014.0.0 (regardless of the openwrt revision).
> 
> sorry, bad copy/paste.
> 
> The patch is for batman-adv-2014.0.0 (I don't know what version you have
> in AA).
> It fixes the memleak bug that we were discussing about.
> 
>
  
Russell Senior Jan. 28, 2014, 1:21 a.m. UTC | #22
>>>>> "cmsv" == cmsv  <cmsv@wirelesspt.net> writes:

cmsv> Here is an update of some tests i ran in the past 24h with the
cmsv> following build:

cmsv> routers used: dlink dir 601a and tplink wr703n in "ng"
cmsv> mode. (atheros)

cmsv> My current AA DISTRIB_REVISION="r39154" mac80211 r39150 from
cmsv> openwrt trunk hostapd r39155 from trunk

I just went to try to set up an AA build environment from:

  git://git.openwrt.org/12.09/openwrt.git

in order to replicate.  The default feeds.conf from that tree seems to
point at a 'for-12.09.x' branch of the routing feed, and the
batman-adv Makefile there seems to use 2013.4.0, not 2014.0.0.  

Can you paste your feeds.conf file?
  
cmsv Jan. 28, 2014, 1:30 a.m. UTC | #23
inline:

On 01/27/2014 08:21 PM, Russell Senior wrote:
>>>>>> "cmsv" == cmsv  <cmsv@wirelesspt.net> writes:
> 
> cmsv> Here is an update of some tests i ran in the past 24h with the
> cmsv> following build:
> 
> cmsv> routers used: dlink dir 601a and tplink wr703n in "ng"
> cmsv> mode. (atheros)
> 
> cmsv> My current AA DISTRIB_REVISION="r39154" mac80211 r39150 from
> cmsv> openwrt trunk hostapd r39155 from trunk
> 
> I just went to try to set up an AA build environment from:
> 
>   git://git.openwrt.org/12.09/openwrt.git
> 
> in order to replicate.  The default feeds.conf from that tree seems to
> point at a 'for-12.09.x' branch of the routing feed, and the
> batman-adv Makefile there seems to use 2013.4.0, not 2014.0.0.  
> 
> Can you paste your feeds.conf file?

Of course:


for AA and batman-adv 2014.0.0 in feeds.default.conf

src-svn packages svn://svn.openwrt.org/openwrt/branches/packages_12.09
src-git routing git://github.com/openwrt-routing/packages.git


For the hostapd and mentioned mac80211 you will need to clone
git clone git://git.openwrt.org/12.09/openwrt.git

Then obtain the specific revisions and replace  the original hostapd and
mac80211 from AA.
  
Russell Senior Jan. 29, 2014, 8:10 a.m. UTC | #24
>>>>> "cmsv" == cmsv  <cmsv@wirelesspt.net> writes:


>> Can you paste your feeds.conf file?

cmsv> Of course:


cmsv> for AA and batman-adv 2014.0.0 in feeds.default.conf

cmsv> src-svn packages svn://svn.openwrt.org/openwrt/branches/packages_12.09 
cmsv> src-git routing git://github.com/openwrt-routing/packages.git


cmsv> For the hostapd and mentioned mac80211 you will need to clone
cmsv> git clone git://git.openwrt.org/12.09/openwrt.git

cmsv> Then obtain the specific revisions and replace the original
cmsv> hostapd and mac80211 from AA.

I am not following exactly.  Do you know which change in particular
makes the memory leak come and go?  AA implies an older kernel, 3.3.8
or something.

Also, obtain specific revisions from trunk? and then copy
them into the AA tree?  

package/kernel/mac80211 r39150 = commit 886b3c876b71122ed9523834488f373908224663
package/network/services/hostapd r39155 = commit 64820db4b264472e03acb9ea6b5536fa7633a8ca

Is that right?  Do those mac80211/hostapd revisions come from
bisection (i.e. the last "good" rev) or happenstance?

Thanks for clarification!
  
cmsv Jan. 29, 2014, 9:48 p.m. UTC | #25
inline reply:

On 01/29/2014 03:10 AM, Russell Senior wrote:
>>>>>> "cmsv" == cmsv  <cmsv@wirelesspt.net> writes:
> 
> 
>>> Can you paste your feeds.conf file?
> 
> cmsv> Of course:
> 
> 
> cmsv> for AA and batman-adv 2014.0.0 in feeds.default.conf
> 
> cmsv> src-svn packages svn://svn.openwrt.org/openwrt/branches/packages_12.09 
> cmsv> src-git routing git://github.com/openwrt-routing/packages.git
> 
> 
> cmsv> For the hostapd and mentioned mac80211 you will need to clone
> cmsv> git clone git://git.openwrt.org/12.09/openwrt.git
> 
> cmsv> Then obtain the specific revisions and replace the original
> cmsv> hostapd and mac80211 from AA.
> 
> I am not following exactly.  Do you know which change in particular
> makes the memory leak come and go?  
I do not know exactly what causes the leak because i don't have the leak
in my builds and have not found a better way than the ones mentioned
before to try to find what may cause it.


> AA implies an older kernel, 3.3.8
> or something.
Yes 3.3.8

> 
> Also, obtain specific revisions from trunk? and then copy
> them into the AA tree?  

Not from trunk.  I posted the wrong git before.
git clone git://nbd.name/aa-mac80211.git

> package/kernel/mac80211 r39150 = commit 886b3c876b71122ed9523834488f373908224663
> package/network/services/hostapd r39155 = commit 64820db4b264472e03acb9ea6b5536fa7633a8ca
> 
> Is that right?  Do those mac80211/hostapd revisions come from
> bisection (i.e. the last "good" rev) or happenstance?
You have to ask the maintainer. To me they are in between AA and trunk
in terms of stability.



> Thanks for clarification!
> 
>
  
cmsv Feb. 8, 2014, 3:08 a.m. UTC | #26
I have an update in regards to this matter and i have CC'ed Felix
Fietkau from openwrt (athk) here too since i am using
nbd.name/aa-mac80211.git

I decided to compile new images with the latest batman-adv stable
patches and in the process of testing the new image as well as the old
one i thought to be stable i got the routers to reboot.
This time i tested this with more routers in the mesh and was able to
replicate it.

It happens that the routers reboot when the gateway disappears either by
doing batctl gw client/off or rebooting the gw router.
This then causes the others to reboot  with Kernel panic - not syncing:
Fatal exception in interrupt.

Rebooting the gw router while maintaining gw off did not seem to reboot
the other routers. With me the problem is easy to replicate when the
router gateway which is providing gateway to the clients disappears. Its
disappearance causes the clients to reboot.

Here is the reboot log:

[  239.410000] CPU 0 Unable to handle kernel paging request at virtual
address 0000000c, epc == 80ea7914, ra == 80ea7910
[  239.420000] Oops[#1]:
[  239.420000] Cpu 0
[  239.420000] $ 0   : 00000000 00000001 00000000 00000000
[  239.420000] $ 4   : 81b12380 80f7fb00 00000000 00000000
[  239.420000] $ 8   : 00000037 00000000 00000000 00000000
[  239.420000] $12   : 00000000 0000015f 80e82540 00000000
[  239.420000] $16   : 81adbc00 00000000 81b12380 80f3e802
[  239.420000] $20   : 80f7fb00 00000000 00000189 00000000
[  239.420000] $24   : 00000002 80e365f0
[  239.420000] $28   : 80fe6000 80fe7ae8 00000043 80ea7910
[  239.420000] Hi    : 000001d5
[  239.420000] Lo    : 0011e189
[  239.420000] epc   : 80ea7914 0x80ea7914
[  239.420000]     Tainted: G           O
[  239.420000] ra    : 80ea7910 0x80ea7910
[  239.420000] Status: 1000f403    KERNEL EXL IE
[  239.420000] Cause : 00800008
[  239.420000] BadVA : 0000000c
[  239.420000] PrId  : 00019374 (MIPS 24Kc)
[  239.420000] Modules linked in: ath79_wdt batman_adv(O) nf_nat_irc
nf_conntrack_irc nf_nat_ftp nf_conntrack_ftp ipt_MASQUERADE iptable_nat
nf_nat xt_conntrack xt_CT xt_NOTRACK iptable_raw xt_state
nf_conntrack_ipv4 nf_defrag_ipv4 nf_conntrack ipt_REJECT xt_TCPMSS
ipt_LOG xt_comment xt_multiport xt_mac xt_limit iptable_mangle
iptable_filter ip_tables xt_tcpudp x_tables ath9k(O) ath9k_common(O)
ath9k_hw(O) ath(O) mac80211(O) libcrc32c crc16 cfg80211(O) compat(O)
arc4 aes_generic crc32c crypto_hash crypto_algapi gpio_button_hotplug(O)
[  239.420000] Process udhcpc (pid: 1267, threadinfo=80fe6000,
task=81af8850, tls=77929440)
[  239.420000] Stack : 00000000 00000000 00000000 00000000 0000002a
81adbc00 00000000 81adbc00
[  239.420000]         81b12000 80f3e802 81b12380 00000000 00000189
80eb1fbc 81b12000 00000000
[  239.420000]         80e8bd00 80eb86c0 00000000 00000000 00000000
801e98ac 81adbc00 00000000
[  239.420000]         81b12000 00000000 80e8bd00 80eb86c0 00000000
801ec874 00000000 80dae000
[  239.420000]         00000000 00000014 80fb7ca8 0200bc00 00000001
00000001 802e0000 81adbc00
[  239.420000]         ...
[  239.420000] Call Trace:[<80eb1fbc>] 0x80eb1fbc
[  239.420000] [<801e98ac>] 0x801e98ac
[  239.420000] [<801ec874>] 0x801ec874
[  239.420000] [<801ecd5c>] 0x801ecd5c
[  239.420000] [<8026a388>] 0x8026a388
[  239.420000] [<80218750>] 0x80218750
[  239.420000] [<802689a4>] 0x802689a4
[  239.420000] [<801dbf88>] 0x801dbf88
[  239.420000] [<80218750>] 0x80218750
[  239.420000] [<801ec874>] 0x801ec874
[  239.420000] [<80216c50>] 0x80216c50
[  239.420000] [<80218750>] 0x80218750
[  239.420000] [<801ecd5c>] 0x801ecd5c
[  239.420000] [<80216c50>] 0x80216c50
[  239.420000] [<802689b4>] 0x802689b4
[  239.420000] [<80219eb0>] 0x80219eb0
[  239.420000] [<80237bb8>] 0x80237bb8
[  239.420000] [<80239734>] 0x80239734
[  239.420000] [<8024f668>] 0x8024f668
[  239.420000] [<801101d4>] 0x801101d4
[  239.420000] [<8020e3dc>] 0x8020e3dc
[  239.420000] [<801fd38c>] 0x801fd38c
[  239.420000] [<802179f8>] 0x802179f8
[  239.420000] [<8020ff04>] 0x8020ff04
[  239.420000] [<801d8154>] 0x801d8154
[  239.420000] [<80211184>] 0x80211184
[  239.420000] [<800d8890>] 0x800d8890
[  239.420000] [<800ec6f0>] 0x800ec6f0
[  239.420000] [<801d9f58>] 0x801d9f58
[  239.420000] [<801d93dc>] 0x801d93dc
[  239.420000] [<800d9114>] 0x800d9114
[  239.420000] [<800d93dc>] 0x800d93dc
[  239.420000] [<801d9a70>] 0x801d9a70
[  239.420000] [<8006a284>] 0x8006a284
[  239.420000]
[  239.420000]
[  239.420000] Code: 0c3a9ac3  00402821  0040a821 <8c42000c> 54400052
00008021  8e050054  10a00005  8fb10010
[  239.730000] ---[ end trace 7d873dc004108502 ]---
[  239.740000] Kernel panic - not syncing: Fatal exception in interrupt
[  239.740000] Rebooting in 3 seconds..
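
A hedged reading aid for the oops above: the word in angle brackets on the "Code:" line is the instruction at epc. The sketch below decodes it as a MIPS I-type instruction (opcode, base register, target register, 16-bit signed offset). The struct and function names are made up for illustration and come from neither the kernel nor batman-adv.

```c
#include <stdint.h>

/* Fields of a MIPS32 I-type instruction (names are illustrative only). */
struct mips_itype {
	unsigned int opcode; /* bits 31..26 */
	unsigned int rs;     /* bits 25..21, base register for loads */
	unsigned int rt;     /* bits 20..16, destination register for loads */
	int16_t imm;         /* bits 15..0, signed byte offset */
};

/* Split a raw 32-bit instruction word into its I-type fields. */
struct mips_itype mips_decode_itype(uint32_t insn)
{
	struct mips_itype d;

	d.opcode = insn >> 26;
	d.rs = (insn >> 21) & 0x1f;
	d.rt = (insn >> 16) & 0x1f;
	d.imm = (int16_t)(insn & 0xffff);
	return d;
}

/* Address a load/store would touch, given the value held in register rs. */
uint32_t mips_effective_address(struct mips_itype d, uint32_t base_value)
{
	return base_value + (uint32_t)(int32_t)d.imm;
}
```

For 0x8c42000c this gives opcode 0x23 (lw), rs = 2, rt = 2 and imm = 12, i.e. lw $2, 12($2). The register dump shows $2 = 00000000, so the load targets 0 + 12 = 0x0000000c, which matches BadVA. That is consistent with reading a field at offset 12 through a NULL pointer, although without symbol names it does not say which function did it.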


Routers used:
dir 601a & 615c1
tplink wr703n

aa: DISTRIB_REVISION="r39154"
hostapd and mac80211 from git://nbd.name/aa-mac80211.git

hostapd: sync with trunk (as of r39155)
mac80211: sync with openwrt trunk (as of r39150)

I am able to confirm that this problem does not happen with
[batman-adv: 2013.4.0] but it does happen with 2014.0.0 and it is easy
to replicate.
Currently my batman-adv 2014.0.0 package has the following patches:

$ ls feeds/routing/batman-adv/patches/
0001-batman-adv-fix-batman-adv-header-overhead-calculatio.patch
0003-batman-adv-fix-soft-interface-MTU-computation.patch
0005-batman-adv-release-vlan-object-after-checking-the-CR.patch
0002-batman-adv-fix-potential-kernel-paging-error-for-uni.patch
0004-batman-adv-fix-TT-TVLV-parsing-on-OGM-reception.patch
0007-batman-adv-use-vlan_-eth_hdr-instead-of-skb-data-in-.patch



On 01/29/2014 04:48 PM, cmsv wrote:
> inline reply:
> 
> On 01/29/2014 03:10 AM, Russell Senior wrote:
>>>>>>> "cmsv" == cmsv  <cmsv@wirelesspt.net> writes:
>>
>>
>>>> Can you paste your feeds.conf file?
>>
>> cmsv> Of course:
>>
>>
>> cmsv> for AA and batman-adv 2014.0.0 in feeds.default.conf
>>
>> cmsv> src-svn packages svn://svn.openwrt.org/openwrt/branches/packages_12.09 
>> cmsv> src-git routing git://github.com/openwrt-routing/packages.git
>>
>>
>> cmsv> For the hostapd and mentioned mac80211 you will need to clone
>> cmsv> git clone  git://nbd.name/aa-mac80211.git
>>
>> cmsv> Then obtain the specific revisions and replace the original
>> cmsv> hostapd and mac80211 from AA.
>>
>> I am not following exactly.  Do you know which change in particular
>> makes the memory leak come and go?  
> I do not know exactly what causes the leak because i don't have the leak
> in my builds and have not found a better way than the ones mentioned
> before to try to find what may cause it.
> 
> 
>> AA implies an older kernel, 3.3.8
>> or something.
> Yes 3.3.8
> 
>>
>> Also, obtain specific revisions from trunk? and then copy
>> them into the AA tree?  
> 
> Not from trunk.  I posted the wrong git before.
> git clone git://nbd.name/aa-mac80211.git
> 
>> package/kernel/mac80211 r39150 = commit 886b3c876b71122ed9523834488f373908224663
>> package/network/services/hostapd r39155 = commit 64820db4b264472e03acb9ea6b5536fa7633a8ca
>>
>> Is that right?  Do those mac80211/hostapd revisions come from
>> bisection (i.e. the last "good" rev) or happenstance?
> You have to ask the maintainer. To me they are in between AA and trunk
> in terms of stability.
> 
> 
> 
>> Thanks for clarification!
>>
>>
>
  
Felix Fietkau Feb. 8, 2014, 10:53 a.m. UTC | #27
On 2014-02-08 04:08, cmsv wrote:
> [  239.410000] CPU 0 Unable to handle kernel paging request at virtual
> address 0000000c, epc == 80ea7914, ra == 80ea7910
> [  239.420000] Oops[#1]:
> [  239.420000] Cpu 0
> [  239.420000] $ 0   : 00000000 00000001 00000000 00000000
> [  239.420000] $ 4   : 81b12380 80f7fb00 00000000 00000000
> [  239.420000] $ 8   : 00000037 00000000 00000000 00000000
> [  239.420000] $12   : 00000000 0000015f 80e82540 00000000
> [  239.420000] $16   : 81adbc00 00000000 81b12380 80f3e802
> [  239.420000] $20   : 80f7fb00 00000000 00000189 00000000
> [  239.420000] $24   : 00000002 80e365f0
> [  239.420000] $28   : 80fe6000 80fe7ae8 00000043 80ea7910
> [  239.420000] Hi    : 000001d5
> [  239.420000] Lo    : 0011e189
> [  239.420000] epc   : 80ea7914 0x80ea7914
> [  239.420000]     Tainted: G           O
> [  239.420000] ra    : 80ea7910 0x80ea7910
> [  239.420000] Status: 1000f403    KERNEL EXL IE
> [  239.420000] Cause : 00800008
> [  239.420000] BadVA : 0000000c
> [  239.420000] PrId  : 00019374 (MIPS 24Kc)
> [  239.420000] Modules linked in: ath79_wdt batman_adv(O) nf_nat_irc
> nf_conntrack_irc nf_nat_ftp nf_conntrack_ftp ipt_MASQUERADE iptable_nat
> nf_nat xt_conntrack xt_CT xt_NOTRACK iptable_raw xt_state
> nf_conntrack_ipv4 nf_defrag_ipv4 nf_conntrack ipt_REJECT xt_TCPMSS
> ipt_LOG xt_comment xt_multiport xt_mac xt_limit iptable_mangle
> iptable_filter ip_tables xt_tcpudp x_tables ath9k(O) ath9k_common(O)
> ath9k_hw(O) ath(O) mac80211(O) libcrc32c crc16 cfg80211(O) compat(O)
> arc4 aes_generic crc32c crypto_hash crypto_algapi gpio_button_hotplug(O)
> [  239.420000] Process udhcpc (pid: 1267, threadinfo=80fe6000,
> task=81af8850, tls=77929440)
> [  239.420000] Stack : 00000000 00000000 00000000 00000000 0000002a
> 81adbc00 00000000 81adbc00
> [  239.420000]         81b12000 80f3e802 81b12380 00000000 00000189
> 80eb1fbc 81b12000 00000000
> [  239.420000]         80e8bd00 80eb86c0 00000000 00000000 00000000
> 801e98ac 81adbc00 00000000
> [  239.420000]         81b12000 00000000 80e8bd00 80eb86c0 00000000
> 801ec874 00000000 80dae000
> [  239.420000]         00000000 00000014 80fb7ca8 0200bc00 00000001
> 00000001 802e0000 81adbc00
> [  239.420000]         ...
> [  239.420000] Call Trace:[<80eb1fbc>] 0x80eb1fbc
> [  239.420000] [<801e98ac>] 0x801e98ac
> [  239.420000] [<801ec874>] 0x801ec874
[...]
Just a quick note about logs like this: They're completely worthless
unless you enable CONFIG_KERNEL_KALLSYMS in your .config. Without that
option, the kernel does not resolve function names, and the addresses
shown with a custom build usually do not match the addresses of other
builds.

- Felix
  
Antonio Quartulli Feb. 12, 2014, 7:23 a.m. UTC | #28
On 08/02/14 04:08, cmsv wrote:
> [  239.420000] [<8020ff04>] 0x8020ff04
> [  239.420000] [<801d8154>] 0x801d8154
> [  239.420000] [<80211184>] 0x80211184
> [  239.420000] [<800d8890>] 0x800d8890
> [  239.420000] [<800ec6f0>] 0x800ec6f0
> [  239.420000] [<801d9f58>] 0x801d9f58
> [  239.420000] [<801d93dc>] 0x801d93dc
> [  239.420000] [<800d9114>] 0x800d9114
> [  239.420000] [<800d93dc>] 0x800d93dc
> [  239.420000] [<801d9a70>] 0x801d9a70
> [  239.420000] [<8006a284>] 0x8006a284
> [  239.420000]
> [  239.420000]
> [  239.420000] Code: 0c3a9ac3  00402821  0040a821 <8c42000c> 54400052
> 00008021  8e050054  10a00005  8fb10010
> [  239.730000] ---[ end trace 7d873dc004108502 ]---
> [  239.740000] Kernel panic - not syncing: Fatal exception in interrupt
> [  239.740000] Rebooting in 3 seconds..
> 


Hi!

Have you been able to run a test with kernel symbols enabled??
That would be a great help ;)

Cheers,
  
cmsv Feb. 12, 2014, 10:40 a.m. UTC | #29
inline

On 02/12/2014 02:23 AM, Antonio Quartulli wrote:
> On 08/02/14 04:08, cmsv wrote:
>> [  239.420000] [<8020ff04>] 0x8020ff04
>> [  239.420000] [<801d8154>] 0x801d8154
>> [  239.420000] [<80211184>] 0x80211184
>> [  239.420000] [<800d8890>] 0x800d8890
>> [  239.420000] [<800ec6f0>] 0x800ec6f0
>> [  239.420000] [<801d9f58>] 0x801d9f58
>> [  239.420000] [<801d93dc>] 0x801d93dc
>> [  239.420000] [<800d9114>] 0x800d9114
>> [  239.420000] [<800d93dc>] 0x800d93dc
>> [  239.420000] [<801d9a70>] 0x801d9a70
>> [  239.420000] [<8006a284>] 0x8006a284
>> [  239.420000]
>> [  239.420000]
>> [  239.420000] Code: 0c3a9ac3  00402821  0040a821 <8c42000c> 54400052
>> 00008021  8e050054  10a00005  8fb10010
>> [  239.730000] ---[ end trace 7d873dc004108502 ]---
>> [  239.740000] Kernel panic - not syncing: Fatal exception in interrupt
>> [  239.740000] Rebooting in 3 seconds..
>>
> 
> 
> Hi!
> 
> Have you been able to run a test with kernel symbols enabled??
> That would be a great help ;)

I have tried to compile images with kernel symbols enabled, but no
matter how much I strip the build down to essential features, I
am unable to create images that fit in the 4 MB flash of the routers I
have, which are mostly D-Link routers.
Since I am also short of time at the moment, I will have to
postpone this testing and stick with batman-adv 2013.4.0 for
now, since 2014 is not giving me the same stability.

Last night I tried 2014 again with a different router as the
gateway and noticed that the reboot was only happening on 1
router instead of 2.
Replicating is easy as long as I make the gateway disappear in some way.

> 
> Cheers,
>
  
Antonio Quartulli Feb. 12, 2014, 11:41 a.m. UTC | #30
On 12/02/14 11:40, cmsv wrote:
>>
>>
>> Hi!
>>
>> Have you been able to run a test with kernel symbols enabled??
>> That would be a great help ;)
> 
> I have tried to compile images with kernel symbols enabled, but no
> matter how much I strip the build down to essential features, I
> am unable to create images that fit in the 4 MB flash of the routers I
> have, which are mostly D-Link routers.
> Since I am also short of time at the moment, I will have to
> postpone this testing and stick with batman-adv 2013.4.0 for
> now, since 2014 is not giving me the same stability.
> 
> Last night I tried 2014 again with a different router as the
> gateway and noticed that the reboot was only happening on 1
> router instead of 2.
> Replicating is easy as long as I make the gateway disappear in some way.

You should perform the same test now with the new patches that I just
sent to the ml.

Maybe your problem was merely a consequence of the bug we just fixed.

Cheers,
  
cmsv Feb. 13, 2014, 12:55 a.m. UTC | #31
I have noticed a few patches being sent, but unless I am missing something
they are all for the development branch.
Next week I will be deploying new firmware and creating new access points,
and I cannot afford "testing" in a production environment.
I will return to the 2014 branch after my trip and
try to debug the issue once and for all, at which point I will report my
findings.

On 02/12/2014 06:41 AM, Antonio Quartulli wrote:
> On 12/02/14 11:40, cmsv wrote:
>>>
>>>
>>> Hi!
>>>
>>> Have you been able to run a test with kernel symbols enabled??
>>> That would be a great help ;)
>>
>> I have tried to compile images with kernel symbols enabled, but no
>> matter how much I strip the build down to essential features, I
>> am unable to create images that fit in the 4 MB flash of the routers I
>> have, which are mostly D-Link routers.
>> Since I am also short of time at the moment, I will have to
>> postpone this testing and stick with batman-adv 2013.4.0 for
>> now, since 2014 is not giving me the same stability.
>>
>> Last night I tried 2014 again with a different router as the
>> gateway and noticed that the reboot was only happening on 1
>> router instead of 2.
>> Replicating is easy as long as I make the gateway disappear in some way.
> 
> You should perform the same test now with the new patches that I just
> sent to the ml.
> 
> Maybe your problem was merely a consequence of the bug we just fixed.
> 
> Cheers,
>
  
Antonio Quartulli Feb. 13, 2014, 7:23 a.m. UTC | #32
On 13/02/14 01:55, cmsv wrote:
> I have noticed a few patches being sent, but unless I am missing something
> they are all for the development branch.

No, most of them are for the maint branch (thus the 2014.0.0 branch).

> Next week I will be deploying new firmware and creating new access points,
> and I cannot afford "testing" in a production environment.

I understand, but if you could give it a try before leaving it would be
nice! :)


Thanks a lot anyway!
  

Patch

diff --git a/hard-interface.c b/hard-interface.c
index 6792e03..0eb0b3b 100644
--- a/hard-interface.c
+++ b/hard-interface.c
@@ -244,7 +244,7 @@  int batadv_hardif_min_mtu(struct net_device *soft_iface)
 {
 	struct batadv_priv *bat_priv = netdev_priv(soft_iface);
 	const struct batadv_hard_iface *hard_iface;
-	int min_mtu = ETH_DATA_LEN;
+	int min_mtu = INT_MAX;
 
 	rcu_read_lock();
 	list_for_each_entry_rcu(hard_iface, &batadv_hardif_list, list) {
@@ -259,8 +259,6 @@  int batadv_hardif_min_mtu(struct net_device *soft_iface)
 	}
 	rcu_read_unlock();
 
-	atomic_set(&bat_priv->packet_size_max, min_mtu);
-
 	if (atomic_read(&bat_priv->fragmentation) == 0)
 		goto out;
 
@@ -271,13 +269,21 @@  int batadv_hardif_min_mtu(struct net_device *soft_iface)
 	min_mtu = min_t(int, min_mtu, BATADV_FRAG_MAX_FRAG_SIZE);
 	min_mtu -= sizeof(struct batadv_frag_packet);
 	min_mtu *= BATADV_FRAG_MAX_FRAGMENTS;
-	atomic_set(&bat_priv->packet_size_max, min_mtu);
-
-	/* with fragmentation enabled we can fragment external packets easily */
-	min_mtu = min_t(int, min_mtu, ETH_DATA_LEN);
 
 out:
-	return min_mtu - batadv_max_header_len();
+	/* report to the other components the maximum amount of bytes that
+	 * batman-adv can send over the wire (without considering the payload
+	 * overhead). For example, this value is used by TT to compute the
+	 * maximum local table size
+	 */
+	atomic_set(&bat_priv->packet_size_max, min_mtu);
+
+	/* the real soft-interface MTU is computed by removing the payload
+	 * overhead from the maximum amount of bytes that was just computed.
+	 *
+	 * However batman-adv does not support MTUs bigger than ETH_DATA_LEN
+	 */
+	return min_t(int, min_mtu - batadv_max_header_len(), ETH_DATA_LEN);
 }
 
 /* adjusts the MTU if a new interface with a smaller MTU appeared. */
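
The effect of the change can be sketched outside the kernel. The following is a simplified model of the path with fragmentation disabled only, not the batman-adv code itself: the function names are hypothetical, the hard-interface MTUs are passed in as a plain array, and the value returned by batadv_max_header_len() is replaced by a header_len parameter whose real value depends on the batman-adv version.

```c
#include <limits.h>
#include <stddef.h>

#define ETH_DATA_LEN 1500 /* standard Ethernet payload size in bytes */

static int min_int(int a, int b)
{
	return a < b ? a : b;
}

/* Smallest MTU among the hard interfaces, starting from 'start'. */
static int lowest_hardif_mtu(const int *mtus, size_t n, int start)
{
	int min_mtu = start;
	size_t i;

	for (i = 0; i < n; i++)
		min_mtu = min_int(min_mtu, mtus[i]);
	return min_mtu;
}

/* Before the patch: the search starts at ETH_DATA_LEN, so the result can
 * never exceed ETH_DATA_LEN - header_len, however large the real MTUs are.
 */
int soft_mtu_before(const int *mtus, size_t n, int header_len)
{
	int min_mtu = lowest_hardif_mtu(mtus, n, ETH_DATA_LEN);

	return min_mtu - header_len;
}

/* After the patch: the search starts at INT_MAX, the overhead is removed
 * first, and only the final value is capped at ETH_DATA_LEN.
 */
int soft_mtu_after(const int *mtus, size_t n, int header_len)
{
	int min_mtu = lowest_hardif_mtu(mtus, n, INT_MAX);

	return min_int(min_mtu - header_len, ETH_DATA_LEN);
}
```

With an illustrative header_len of 32 and hard interfaces at MTU 1532 and 1600, the old computation yields 1468 while the new one yields 1500. With a single hard interface at MTU 1500 both yield 1468, since the overhead cannot be compensated.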