batman-adv: Fix crash on module shutdown with multiple ifaces

Message ID 1302891882-11246-1-git-send-email-linus.luessing@web.de (mailing list archive)
State Superseded, archived
Headers

Commit Message

Linus Lüssing April 15, 2011, 6:24 p.m. UTC
  hardif_remove_interfaces() removes all hard interfaces from the
hardif_list before freeing and cleaning up any device. However the clean
up procedures in orig_hash_del_if()
(hardif_remove_interface()->hardif_disable_interface()->
orig_hash_del_if())
need the other interfaces still to be present in the hardif_list.
Otherwise it won't renumber any preceding interfaces, which leads to an
unhandled kernel paging request in orig_node_del_if()'s "/* copy second
part */" due to wrong hard_if->if_num's.

With this commit the interface removal on module shutdown will be down
in the same way as removing single interfaces from batman only: One
interface will be removed and cleaned at a time.

Signed-off-by: Linus Lüssing <linus.luessing@web.de>
---
 hard-interface.c |   15 ++++-----------
 1 files changed, 4 insertions(+), 11 deletions(-)
  

Comments

Sven Eckelmann April 16, 2011, 7:54 a.m. UTC | #1
Linus Lüssing wrote:
> hardif_remove_interfaces() removes all hard interfaces from the
> hardif_list before freeing and cleaning up any device. However the clean
> up procedures in orig_hash_del_if()
> (hardif_remove_interface()->hardif_disable_interface()->
> orig_hash_del_if())
> need the other interfaces still to be present in the hardif_list.
> Otherwise it won't renumber any preceding interfaces, which leads to an
> unhandled kernel paging request in orig_node_del_if()'s "/* copy second
> part */" due to wrong hard_if->if_num's.
> 
> With this commit the interface removal on module shutdown will be down
> in the same way as removing single interfaces from batman only: One
> interface will be removed and cleaned at a time.
> 
> Signed-off-by: Linus Lüssing <linus.luessing@web.de>


Please use --patience as requested in
 http://www.open-mesh.org/wiki/open-mesh/Contribute

Please show us (as part of the commit message) why the information in
 http://www.open-mesh.org/projects/batman-adv/repository/revisions/132b776c22c9b71962a3ed9a33e5b6f9218dae1b
isn't valid anymore and explain why it is save to use the spin_lock only
inside the loop (but it would have to protect the loop in normal situations).

Kind regards,
	Sven
  
Sven Eckelmann April 16, 2011, 8:25 a.m. UTC | #2
Sven Eckelmann wrote:
> Linus Lüssing wrote:
> > hardif_remove_interfaces() removes all hard interfaces from the
> > hardif_list before freeing and cleaning up any device. However the clean
> > up procedures in orig_hash_del_if()
> > (hardif_remove_interface()->hardif_disable_interface()->
> > orig_hash_del_if())
> > need the other interfaces still to be present in the hardif_list.
> > Otherwise it won't renumber any preceding interfaces, which leads to an
> > unhandled kernel paging request in orig_node_del_if()'s "/* copy second
> > part */" due to wrong hard_if->if_num's.
> > 
> > With this commit the interface removal on module shutdown will be down
> > in the same way as removing single interfaces from batman only: One
> > interface will be removed and cleaned at a time.
> > 
> > Signed-off-by: Linus Lüssing <linus.luessing@web.de>
> 
> Please use --patience as requested in
>  http://www.open-mesh.org/wiki/open-mesh/Contribute
> 
> Please show us (as part of the commit message) why the information in
>  http://www.open-mesh.org/projects/batman-adv/repository/revisions/132b776c
> 22c9b71962a3ed9a33e5b6f9218dae1b isn't valid anymore and explain why it is
> save to use the spin_lock only inside the loop (but it would have to
> protect the loop in normal situations).

Sry, this was not the correct commit - The commit which fixed a problematic 
locking behaviour was 5d4b5a4d - but I didn't gave a lockdep output there.

The other question must still be answered.

Btw. what is the status of the sysfs_addrm_finish vs. rtnl_lock patch?

Kind regards,
 	Sven
  
Linus Lüssing April 27, 2011, 4:02 p.m. UTC | #3
On Sat, Apr 16, 2011 at 09:54:48AM +0200, Sven Eckelmann wrote:
> Linus Lüssing wrote:
> > hardif_remove_interfaces() removes all hard interfaces from the
> > hardif_list before freeing and cleaning up any device. However the clean
> > up procedures in orig_hash_del_if()
> > (hardif_remove_interface()->hardif_disable_interface()->
> > orig_hash_del_if())
> > need the other interfaces still to be present in the hardif_list.
> > Otherwise it won't renumber any preceding interfaces, which leads to an
> > unhandled kernel paging request in orig_node_del_if()'s "/* copy second
> > part */" due to wrong hard_if->if_num's.
> > 
> > With this commit the interface removal on module shutdown will be down
> > in the same way as removing single interfaces from batman only: One
> > interface will be removed and cleaned at a time.
> > 
> > Signed-off-by: Linus Lüssing <linus.luessing@web.de>
> 
> 
> Please use --patience as requested in
>  http://www.open-mesh.org/wiki/open-mesh/Contribute
> 
> Please show us (as part of the commit message) why the information in
>  http://www.open-mesh.org/projects/batman-adv/repository/revisions/132b776c22c9b71962a3ed9a33e5b6f9218dae1b
> isn't valid anymore and explain why it is save to use the spin_lock only
> inside the loop (but it would have to protect the loop in normal situations).
> 
> Kind regards,
> 	Sven

Hi Sven,

Ah, oki doki, didn't know about commit 5d4b5a4d and yes, a revert
of that commit looks kind of similar to my patch.

Commit 5d4b5a4d together with your statement confuse me a little. The
commit message does not say anything about a locking dependancy
issue, but seems to be a performance patch (which does not seem as
such a severe problem to me, as removing/adding interfaces /
removing the batman-adv module does not happen that frequently in
common setups ;) ). Could you explain a little further which
combinations of locks could introduce a problem?

Hmm, for the "and explain why it is save to use the spin_lock
only" part, aggreed, I think it was initially a mistake of mine.
And usually this would not protect us from a new interface being
added or an interface being removed from batman during a
NETDEV_REGISTER/UNREGISTER event while we are trying to flush the
if_list.
However, just before calling hardif_remove_interfaces(), we are
calling unregister_netdevice_notifier(&hard_if_notifier).
So as far as I know, no hardif_add_interface() or
hardif_remove_interface() and according list_add/del_rcu for the
if_list should be called anymore.

Cheers, Linus

PS: And it's actually not an "unhandled kernel paging request" but
a "Null pointer dereference". Also see this ticket:
http://www.open-mesh.org/issues/147

I'm also wondering why we'd actually need the rtnl_lock() in
hardif_remove_interfaces() then with that reasoning. What
operation in hardif_remove_interface() (without the 's') needs to
be protected with the rtnl_lock(), could be place the rtnl_lock a
little tighter instead to also fix the issue from here?
http://www.open-mesh.org/issues/145
  
Sven Eckelmann April 29, 2011, 7:58 p.m. UTC | #4
Linus Lüssing wrote:

> Ah, oki doki, didn't know about commit 5d4b5a4d and yes, a revert
> of that commit looks kind of similar to my patch.
> 
> Commit 5d4b5a4d together with your statement confuse me a little. The
> commit message does not say anything about a locking dependancy
> issue, but seems to be a performance patch (which does not seem as
> such a severe problem to me, as removing/adding interfaces /
> removing the batman-adv module does not happen that frequently in
> common setups ;) ). Could you explain a little further which
> combinations of locks could introduce a problem?

No

> Hmm, for the "and explain why it is save to use the spin_lock
> only" part, aggreed, I think it was initially a mistake of mine.
> And usually this would not protect us from a new interface being
> added or an interface being removed from batman during a
> NETDEV_REGISTER/UNREGISTER event while we are trying to flush the
> if_list.
> However, just before calling hardif_remove_interfaces(), we are
> calling unregister_netdevice_notifier(&hard_if_notifier).
> So as far as I know, no hardif_add_interface() or
> hardif_remove_interface() and according list_add/del_rcu for the
> if_list should be called anymore.

Interesting assumption, but how did you ensure that everything is in a 
synchronous state? The network core also uses rcu - and it doesn't use the 
atomic notifier functions.

> Cheers, Linus
> 
> PS: And it's actually not an "unhandled kernel paging request" but
> a "Null pointer dereference". Also see this ticket:
> http://www.open-mesh.org/issues/147
> 
> I'm also wondering why we'd actually need the rtnl_lock() in
> hardif_remove_interfaces() then with that reasoning. What
> operation in hardif_remove_interface() (without the 's') needs to
> be protected with the rtnl_lock(), could be place the rtnl_lock a
> little tighter instead to also fix the issue from here?
> http://www.open-mesh.org/issues/145

See 132b776c22c9b71962a3ed9a33e5b6f9218dae1b

I will propose two different patches.

Regards,
	Sven
  

Patch

diff --git a/hard-interface.c b/hard-interface.c
index b3058e4..f039a3d 100644
--- a/hard-interface.c
+++ b/hard-interface.c
@@ -490,20 +490,13 @@  static void hardif_remove_interface(struct hard_iface *hard_iface)
 void hardif_remove_interfaces(void)
 {
 	struct hard_iface *hard_iface, *hard_iface_tmp;
-	struct list_head if_queue;
 
-	INIT_LIST_HEAD(&if_queue);
-
-	spin_lock(&hardif_list_lock);
-	list_for_each_entry_safe(hard_iface, hard_iface_tmp,
-				 &hardif_list, list) {
+	rtnl_lock();
+	list_for_each_entry_safe(hard_iface, hard_iface_tmp, &hardif_list, list) {
+		spin_lock(&hardif_list_lock);
 		list_del_rcu(&hard_iface->list);
-		list_add_tail(&hard_iface->list, &if_queue);
-	}
-	spin_unlock(&hardif_list_lock);
+		spin_unlock(&hardif_list_lock);
 
-	rtnl_lock();
-	list_for_each_entry_safe(hard_iface, hard_iface_tmp, &if_queue, list) {
 		hardif_remove_interface(hard_iface);
 	}
 	rtnl_unlock();