batman-adv: Always synchronize rcu's on module shutdown

Message ID 1283729393-13842-1-git-send-email-linus.luessing@web.de (mailing list archive)
State Superseded, archived
Commit f3c1780e4d8b203131fa9b758e497e9a19cb4d0c
Headers

Commit Message

Linus Lüssing Sept. 5, 2010, 11:29 p.m. UTC
  During the module shutdown procedure in batman_exit(), a rcu callback is
being scheduled (batman_exit -> hardif_remove_interfaces ->
hardif_remove_interfae -> call_rcu). However, when the kernel unloads
the module, the rcu callback might not have been executed yet, resulting
in a "unable to handle kernel paging request" in __rcu_process_callback
afterwards, causing the kernel to freeze.
Therefore, we should always flush all rcu callback functions scheduled
during the shutdown procedure.

Signed-off-by: Linus Lüssing <linus.luessing@web.de>
---
 main.c |    5 ++---
 1 files changed, 2 insertions(+), 3 deletions(-)
  

Comments

Sven Eckelmann Sept. 6, 2010, 7:30 a.m. UTC | #1
On Mon, Sep 06, 2010 at 01:29:53AM +0200, Linus Lüssing wrote:
> During the module shutdown procedure in batman_exit(), a rcu callback is
> being scheduled (batman_exit -> hardif_remove_interfaces ->
> hardif_remove_interfae -> call_rcu). However, when the kernel unloads
> the module, the rcu callback might not have been executed yet, resulting
> in a "unable to handle kernel paging request" in __rcu_process_callback
> afterwards, causing the kernel to freeze.
> Therefore, we should always flush all rcu callback functions scheduled
> during the shutdown procedure.

I am really irritated by your patch. I would have expected that you add a
synchronyze_rcu in batman_exit and that was it. Instead I see a synchronize_net
added and a synchronize_net/-_rcu removed from mesh_free. This doesn't seem to
match at all. Could you please explain further why it is implemented that way?

thanks,
	Sven
  
Linus Lüssing Sept. 6, 2010, 10:09 a.m. UTC | #2
Hi Sven,

synchronize_net already contains a synchronize_rcu at its end, so
the synchronize_rcu in the batman code there has always been
redundant.

I've removed the synchronize_rcu instead of the synchronize_net to
be on the safe side. I guess usually no more packets should arrive
anyway as the batman packet type is not registered anymore. But I
wasn't sure if the might_sleep() of synchronize_net() might be
needed for something, so I didn't dare to remove synchronize_net.

If someone says it'd be ok to remove synchronize_net() instead,
I could make a new patch, no problem.

Cheers, Linus

On Mon, Sep 06, 2010 at 09:30:46AM +0200, Sven Eckelmann wrote:
> On Mon, Sep 06, 2010 at 01:29:53AM +0200, Linus Lüssing wrote:
> > During the module shutdown procedure in batman_exit(), a rcu callback is
> > being scheduled (batman_exit -> hardif_remove_interfaces ->
> > hardif_remove_interfae -> call_rcu). However, when the kernel unloads
> > the module, the rcu callback might not have been executed yet, resulting
> > in a "unable to handle kernel paging request" in __rcu_process_callback
> > afterwards, causing the kernel to freeze.
> > Therefore, we should always flush all rcu callback functions scheduled
> > during the shutdown procedure.
> 
> I am really irritated by your patch. I would have expected that you add a
> synchronyze_rcu in batman_exit and that was it. Instead I see a synchronize_net
> added and a synchronize_net/-_rcu removed from mesh_free. This doesn't seem to
> match at all. Could you please explain further why it is implemented that way?
> 
> thanks,
> 	Sven
  
Sven Eckelmann Sept. 6, 2010, 12:07 p.m. UTC | #3
Linus Lüssing wrote:
> Hi Sven,
> 
> synchronize_net already contains a synchronize_rcu at its end, so
> the synchronize_rcu in the batman code there has always been
> redundant.
> 
> I've removed the synchronize_rcu instead of the synchronize_net to
> be on the safe side. I guess usually no more packets should arrive
> anyway as the batman packet type is not registered anymore. But I
> wasn't sure if the might_sleep() of synchronize_net() might be
> needed for something, so I didn't dare to remove synchronize_net.
> 
> If someone says it'd be ok to remove synchronize_net() instead,
> I could make a new patch, no problem.

Ok, it would have been nice to state such things in the commit message 
(otherwise the stable@kernel.org will drop such a patch quite easily). Marek 
and I have ausgekaspert why it only happens in 1765 and also in 1766. So it 
will not be a patch for stable.

And the might_sleep is only for debugging purposes. But yes, it makes sense to 
use synchronize_net here (for example due to the usage of dev_remove_pack 
before).

That means that technically the patch seems to be ok, but didn't liked the 
explanation with the problem that we might have to justify it to the 
stable@kernel.org guys that way.

So I would ack the patch with a minor change in the commit message. So instead 
of

> During the module shutdown procedure in batman_exit(), a rcu callback is
> being scheduled (batman_exit -> hardif_remove_interfaces ->
> hardif_remove_interfae -> call_rcu). However, when the kernel unloads
> the module, the rcu callback might not have been executed yet, resulting
> in a "unable to handle kernel paging request" in __rcu_process_callback
> afterwards, causing the kernel to freeze.
> Therefore, we should always flush all rcu callback functions scheduled
> during the shutdown procedure.

something like

> During the module shutdown procedure in batman_exit(), a rcu callback is                                                                                                                                  
> being scheduled (batman_exit -> hardif_remove_interfaces ->                                                                                                                                               
> hardif_remove_interfae -> call_rcu). However, when the kernel unloads                                                                                                                                     
> the module, the rcu callback might not have been executed yet, resulting                                                                                                                                  
> in a "unable to handle kernel paging request" in __rcu_process_callback                                                                                                                                   
> afterwards, causing the kernel to freeze.                                                                                                                                                                 
                                                                                                                                                                                                          
> The synchronize_net and synchronize_rcu in mesh_free are currently                                                                                                                                        
> called before the call_rcu in hardif_remove_interface and have no real                                                                                                                                    
> effect on it.                                                                                                                                                                                             
                                                                                                                                                                                                          
> Therefore, we should always flush all rcu callback functions scheduled                                                                                                                                    
> during the shutdown procedure using synchronize_net. The call to                                                                                                                                          
> synchronize_rcu can be omitted because synchronize_net already calls it.

thanks,
	Sven
  
Linus Lüssing Sept. 6, 2010, 12:37 p.m. UTC | #4
> So I would ack the patch with a minor change in the commit message. So instead 
> of
> 
> > During the module shutdown procedure in batman_exit(), a rcu callback is
> > being scheduled (batman_exit -> hardif_remove_interfaces ->
> > hardif_remove_interfae -> call_rcu). However, when the kernel unloads
> > the module, the rcu callback might not have been executed yet, resulting
> > in a "unable to handle kernel paging request" in __rcu_process_callback
> > afterwards, causing the kernel to freeze.
> > Therefore, we should always flush all rcu callback functions scheduled
> > during the shutdown procedure.
> 
> something like
> 
> > During the module shutdown procedure in batman_exit(), a rcu callback is                                                                                                                                  
> > being scheduled (batman_exit -> hardif_remove_interfaces ->                                                                                                                                               
> > hardif_remove_interfae -> call_rcu). However, when the kernel unloads                                                                                                                                     
> > the module, the rcu callback might not have been executed yet, resulting                                                                                                                                  
> > in a "unable to handle kernel paging request" in __rcu_process_callback                                                                                                                                   
> > afterwards, causing the kernel to freeze.                                                                                                                                                                 
>                                                                                                                                                                                                           
> > The synchronize_net and synchronize_rcu in mesh_free are currently                                                                                                                                        
> > called before the call_rcu in hardif_remove_interface and have no real                                                                                                                                    
> > effect on it.                                                                                                                                                                                             
>                                                                                                                                                                                                           
> > Therefore, we should always flush all rcu callback functions scheduled                                                                                                                                    
> > during the shutdown procedure using synchronize_net. The call to                                                                                                                                          
> > synchronize_rcu can be omitted because synchronize_net already calls it.

Yep, sounds good :). Thanks for reviewing and the info about
synchronize_net.

Cheers, Linus

> 
> thanks,
> 	Sven
  

Patch

diff --git a/main.c b/main.c
index 209a46b..e8acb46 100644
--- a/main.c
+++ b/main.c
@@ -73,6 +73,8 @@  static void __exit batman_exit(void)
 	flush_workqueue(bat_event_workqueue);
 	destroy_workqueue(bat_event_workqueue);
 	bat_event_workqueue = NULL;
+
+	synchronize_net();
 }
 
 int mesh_init(struct net_device *soft_iface)
@@ -135,9 +137,6 @@  void mesh_free(struct net_device *soft_iface)
 	hna_local_free(bat_priv);
 	hna_global_free(bat_priv);
 
-	synchronize_net();
-
-	synchronize_rcu();
 	atomic_set(&bat_priv->mesh_state, MESH_INACTIVE);
 }