[maint] batman-adv: fix potential TT client + orig-node memory leak

Message ID 1418509935-11849-1-git-send-email-linus.luessing@c0d3.blue (mailing list archive)
State Accepted, archived
Commit 21fa2647ad4547a48da1166a8fa2f951cd7163e2
Headers

Commit Message

Linus Lüssing Dec. 13, 2014, 10:32 p.m. UTC
  This patch fixes a potential memory leak which can occur once an
originator times out. On timeout the according global translation table
entry might not get purged correctly. Furthermore, the non purged TT
entry will cause its orig-node to leak, too. Which additionally can lead
to the new multicast optimization feature not kicking in because of a
therefore bogus counter.

In the wild with larger mesh networks we saw this leak quite regularly,
resulting in routers to reboot or killed processes. This was because
of a combination of two bugs: The bug fixed by commit
"batman-adv: fix delayed foreign originator recognition" (8a2ad5204674)
amplified this memory leak heavily. Since that commit I'd expect
it to happen rarely, probably only in paused and resumed VMs and
devices previously in stand-by.

The issue this patch fixes is caused by batadv_orig_node_free_rcu()
never being called because of not yet released references to the
orig-node. References which were supposed to be released through
batadv_orig_node_free_rcu()->batadv_tt_global_del_orig().

Fixing the issue by moving batadv_tt_global_del_orig() out of the rcu
callback.

Signed-off-by: Linus Lüssing <linus.luessing@c0d3.blue>
---
 originator.c |    6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)
  

Comments

Marek Lindner Dec. 29, 2014, 3:52 a.m. UTC | #1
On Saturday 13 December 2014 23:32:15 Linus Lüssing wrote:
> This patch fixes a potential memory leak which can occur once an
> originator times out. On timeout the according global translation table
> entry might not get purged correctly. Furthermore, the non purged TT
> entry will cause its orig-node to leak, too. Which additionally can lead
> to the new multicast optimization feature not kicking in because of a
> therefore bogus counter.

So far, I am with you ..


> In the wild with larger mesh networks we saw this leak quite regularly,
> resulting in routers to reboot or killed processes. This was because
> of a combination of two bugs: The bug fixed by commit
> "batman-adv: fix delayed foreign originator recognition" (8a2ad5204674)
> amplified this memory leak heavily. Since that commit I'd expect
> it to happen rarely, probably only in paused and resumed VMs and
> devices previously in stand-by.

This section shouldn't be part of the official commit message. It is hardly 
relevant to the reviewer how often a memleak occurs and whether or not you 
need a VM to trigger it. The provided commit id isn't valid in the Linux tree.


> The issue this patch fixes is caused by batadv_orig_node_free_rcu()
> never being called because of not yet released references to the
> orig-node. References which were supposed to be released through
> batadv_orig_node_free_rcu()->batadv_tt_global_del_orig().

Could you please provide addition insight as to which references are still 
held ? I did look around but nothing obvious jumped at me. 

Generally, it wouldn't be bad if the commit message went into deeper detail 
describing the nature of the bug instead of the middle section above to make 
it easy to understand what is being fixed.


Cheers,
Marek
  
Linus Lüssing Dec. 29, 2014, 2:32 p.m. UTC | #2
On Mon, Dec 29, 2014 at 11:52:53AM +0800, Marek Lindner wrote:
> > The issue this patch fixes is caused by batadv_orig_node_free_rcu()
> > never being called because of not yet released references to the
> > orig-node. References which were supposed to be released through
> > batadv_orig_node_free_rcu()->batadv_tt_global_del_orig().
> 
> Could you please provide addition insight as to which references are still 
> held ? I did look around but nothing obvious jumped at me. 

The batadv_tt_global_entry->orig_list holds the reference to the
orig-node. Usually this reference is released after
BATADV_PURGE_TIMEOUT through: _batadv_purge_orig()->
batadv_purge_orig_node()->batadv_update_route()->_batadv_update_route()->
batadv_tt_global_del_orig() which purges this global tt entry and
releases the reference to the orig-node.

However, if between two batadv_purge_orig_node() calls the orig-node
timeout grew to 2*BATADV_PURGE_TIMEOUT then this call path isn't
reached (*). Instead the according orig-node is removed from the
originator hash in _batadv_purge_orig(), the batadv_update_route()
part is skipped and won't be reached anymore.

It seems that in that case batadv_orig_node_free_rcu()->batadv_tt_global_del_orig()
is supposed to purge the global tt entry and to release the orig-node
reference but it's not called because batadv_orig_node_free_rcu() is
only called once all references are freed: A chicken 'n' egg
situation.

> 
> Generally, it wouldn't be bad if the commit message went into deeper detail 
> describing the nature of the bug instead of the middle section above to make 
> it easy to understand what is being fixed.

Hm, hm, my intention was to somehow/somewhere document how these
two, small and rare bugs together created this severe bug.
The issue fixed by 8a2ad5204674 was the main trigger for (*) in
previous releases.

But we can also skip that middle section if you want. Then I'll
just add a note to the ticket on redmine.

> 
> 
> Cheers,
> Marek

Cheers, Linus
  
Antonio Quartulli Jan. 4, 2015, 4:05 p.m. UTC | #3
On 29/12/14 04:52, Marek Lindner wrote:
> On Saturday 13 December 2014 23:32:15 Linus Lüssing wrote:
>> This patch fixes a potential memory leak which can occur once an
>> originator times out. On timeout the according global translation table
>> entry might not get purged correctly. Furthermore, the non purged TT
>> entry will cause its orig-node to leak, too. Which additionally can lead
>> to the new multicast optimization feature not kicking in because of a
>> therefore bogus counter.
> 
> So far, I am with you ..
> 
> 
>> In the wild with larger mesh networks we saw this leak quite regularly,
>> resulting in routers to reboot or killed processes. This was because
>> of a combination of two bugs: The bug fixed by commit
>> "batman-adv: fix delayed foreign originator recognition" (8a2ad5204674)
>> amplified this memory leak heavily. Since that commit I'd expect
>> it to happen rarely, probably only in paused and resumed VMs and
>> devices previously in stand-by.
> 
> This section shouldn't be part of the official commit message. It is hardly 
> relevant to the reviewer how often a memleak occurs and whether or not you 
> need a VM to trigger it. The provided commit id isn't valid in the Linux tree.
> 
> 
>> The issue this patch fixes is caused by batadv_orig_node_free_rcu()
>> never being called because of not yet released references to the
>> orig-node. References which were supposed to be released through
>> batadv_orig_node_free_rcu()->batadv_tt_global_del_orig().
> 
> Could you please provide addition insight as to which references are still 
> held ? I did look around but nothing obvious jumped at me. 
> 
> Generally, it wouldn't be bad if the commit message went into deeper detail 
> describing the nature of the bug instead of the middle section above to make 
> it easy to understand what is being fixed.
> 


Hi Linus and thanks for fixing this TT bug!

I agree with Marek about extending the commit message with a better
explanation of the actual bug, but at the same time I think it is good
to keep the commit message and ID of the other involved patches.

Moreover, please keep the commit ID of our batman-adv.git tree - I'll
then take care of converting them when sending the patches upstream.

Cheers,
  
Antonio Quartulli Jan. 4, 2015, 4:11 p.m. UTC | #4
On 04/01/15 17:05, Antonio Quartulli wrote:
> 
> Hi Linus and thanks for fixing this TT bug!
> 
> I agree with Marek about extending the commit message with a better
> explanation of the actual bug, but at the same time I think it is good
> to keep the commit message and ID of the other involved patches.
> 
> Moreover, please keep the commit ID of our batman-adv.git tree - I'll
> then take care of converting them when sending the patches upstream.

I forgot..for the patch:

Acked-by: Antonio Quartulli <antonio@meshcoding.com>
  
Marek Lindner Jan. 5, 2015, 5:22 p.m. UTC | #5
On Sunday 04 January 2015 17:11:43 Antonio Quartulli wrote:
> On 04/01/15 17:05, Antonio Quartulli wrote:
> > Hi Linus and thanks for fixing this TT bug!
> > 
> > I agree with Marek about extending the commit message with a better
> > explanation of the actual bug, but at the same time I think it is good
> > to keep the commit message and ID of the other involved patches.
> > 
> > Moreover, please keep the commit ID of our batman-adv.git tree - I'll
> > then take care of converting them when sending the patches upstream.
> 
> I forgot..for the patch:
> 
> Acked-by: Antonio Quartulli <antonio@meshcoding.com>

Applied in revision 21fa264 with commit message beautifications.

Thanks,
Marek
  

Patch

diff --git a/originator.c b/originator.c
index 648bdba..bea8198 100644
--- a/originator.c
+++ b/originator.c
@@ -570,9 +570,6 @@  static void batadv_orig_node_free_rcu(struct rcu_head *rcu)
 
 	batadv_frag_purge_orig(orig_node, NULL);
 
-	batadv_tt_global_del_orig(orig_node->bat_priv, orig_node, -1,
-				  "originator timed out");
-
 	if (orig_node->bat_priv->bat_algo_ops->bat_orig_free)
 		orig_node->bat_priv->bat_algo_ops->bat_orig_free(orig_node);
 
@@ -978,6 +975,9 @@  static void _batadv_purge_orig(struct batadv_priv *bat_priv)
 			if (batadv_purge_orig_node(bat_priv, orig_node)) {
 				batadv_gw_node_delete(bat_priv, orig_node);
 				hlist_del_rcu(&orig_node->hash_entry);
+				batadv_tt_global_del_orig(orig_node->bat_priv,
+							  orig_node, -1,
+							  "originator timed out");
 				batadv_orig_node_free_ref(orig_node);
 				continue;
 			}