Debug Malloc Problem

Message ID 4313f3060905261150o6c9fe82at6dd40d26456ac03a@mail.gmail.com (mailing list archive)
State Not Applicable, archived
Headers

Commit Message

Nathan Wharton May 26, 2009, 6:50 p.m. UTC
  Looks like this one didn't go to all either.  Sorry for the mess.

On Tue, May 26, 2009 at 10:47 AM, Nathan Wharton <naw@greptar.com> wrote:
> On Fri, May 22, 2009 at 11:31 PM, Marek Lindner <lindner_marek@yahoo.de> wrote:
>>> I am getting set up to do remote debugging.  I'll see what I can find
>>> doing that.
>>
>> Yeah, that would be another option.
>
> ok, here is what I found on the debug malloc problem I am having:
>
> In the function debugMalloc:
> =================================
>        chunkHeader = (struct chunkHeader *)memory;
>        chunk = memory + sizeof(struct chunkHeader);
>        chunkTrailer = (struct chunkTrailer *)(memory + sizeof(struct
> chunkHeader) + length);
>
>        chunkHeader->length = length;
>        chunkHeader->tag = tag;
>        chunkHeader->magicNumber = MAGIC_NUMBER;
>
>        chunkTrailer->magicNumber = MAGIC_NUMBER;
>
> =>      pthread_mutex_lock(&chunk_mutex);
> =================================
> I get the following results:
>
> (gdb) p chunkHeader
> $26 = (struct chunkHeader *) 0x308c8
> (gdb) p chunk
> $27 = (unsigned char *) 0x308d8 "\020\322\n"
> (gdb) p chunkTrailer
> $28 = (struct chunkTrailer *) 0x308dd
> (gdb) p *chunkHeader
> $29 = {next = 0x40096c34, length = 5, tag = 15, magicNumber = 305419896}
> (gdb) p *chunkTrailer
> $30 = {magicNumber = 878082050}
> (gdb) p length
> $31 = 5
>
> I think the magic number is not getting into the chunk trailer
> correctly because it is not aligned.
> chunkHeader is aligned because it was returned by malloc
> chunk is aligned because the size of the header is 16
> chunkTrailer is not aligned because it is chunk + 5
>
> Hope this helps.  If not, and that shouldn't be a problem, I'll see
> what else I can find.
>

I added the following patch:
chunkTrailer)); */

=========================================
And I don't get any debug malloc problems anymore.

I do, however, still get the kernel crashes when using batgat.
  

Comments

Sven Eckelmann May 26, 2009, 10:13 p.m. UTC | #1
On Tuesday 26 May 2009 20:50:56 Nathan Wharton wrote:
> I added the following patch:
> =========================================
> --- batmand-r1269/batman/allocate.c     2009-05-20 13:54:18.000000000 -0500
> +++ batmand-r1269.mod/batman/allocate.c 2009-05-26 12:25:07.000000000 -0500
> @@ -206,6 +206,10 @@
>        struct chunkHeader *chunkHeader;
>        struct chunkTrailer *chunkTrailer;
>        unsigned char *chunk;
> +       uint32_t remainder = length % 4;
> +
> +       if (remainder > 0)
> +         length += 4 - remainder;
>
>  /*     printf("sizeof(struct chunkHeader) = %u, sizeof (struct
> chunkTrailer) = %u\n", sizeof (struct chunkHeader), sizeof (struct
> chunkTrailer)); */
Please don't introduce more magic numbers. If you need an alignment on integer 
granularity then please use sizeof(int) instead of a simple constant.
The rest looks quite good and logical. I assumed that the only architecture 
without automatic catching of alignment problems is mips with load operators 
to the fp registers (never had such problems on simple arm machine nor could I 
produce that problem on them). Is there any other documentation about such 
problems on xscale or can you proof this with an test program (malloc 6 byte, 
write uint32_t to first 4 byte, write another uint32_t to byte 2-5, compare 
bytes with input - don't forget to set it volatile)? Your current proof if 
concept test only shows that the "overriden" bytes will not get overriden 
after moving them some bytes higher.

=========================================
#include <stdio.h>
#include <stdlib.h>

#define x_moved(pos) ((int*)&x[pos])[0]
static const int value1 = 0x01234567;
static const int value2 = 0x89abcdef;

int main() {
	volatile char *x = (char*)malloc(sizeof(int) * 2);
	x_moved(0) = value1;
	x_moved(2) = value2;

	if (x_moved(2) != value2) {
		printf("ERROR Value: ");
	} else {
		printf("Value: ");
	}
	printf("%x %x\n", x_moved(0) , x_moved(2));

	free(x);
	return 0;
}
=========================================


Best regards,
	Sven
  
Nathan Wharton May 27, 2009, 12:06 a.m. UTC | #2
On Tue, May 26, 2009 at 5:13 PM, Sven Eckelmann <sven.eckelmann@gmx.de> wrote:
> On Tuesday 26 May 2009 20:50:56 Nathan Wharton wrote:
>> I added the following patch:
> Please don't introduce more magic numbers. If you need an alignment on integer
> granularity then please use sizeof(int) instead of a simple constant.
> The rest looks quite good and logical. I assumed that the only architecture
> without automatic catching of alignment problems is mips with load operators
> to the fp registers (never had such problems on simple arm machine nor could I
> produce that problem on them). Is there any other documentation about such
> problems on xscale or can you proof this with an test program (malloc 6 byte,
> write uint32_t to first 4 byte, write another uint32_t to byte 2-5, compare
> bytes with input - don't forget to set it volatile)? Your current proof if
> concept test only shows that the "overriden" bytes will not get overriden
> after moving them some bytes higher.
>

I didn't really mean for that to be an accepted patch, just a quick
test.  Maybe the trailing magic number could just be a byte so you
don't have to allocate more memory.  I'm sure there are other things
to do too.

I haven't looked at any documentation for the xscale for this problem.
 I can do a test program.  I have seen this type of problem before,
but it would throw a bus error instead of going along with an
incorrect result.


> =========================================
> #include <stdio.h>
> #include <stdlib.h>
>
> #define x_moved(pos) ((int*)&x[pos])[0]
> static const int value1 = 0x01234567;
> static const int value2 = 0x89abcdef;
>
> int main() {
>        volatile char *x = (char*)malloc(sizeof(int) * 2);
>        x_moved(0) = value1;
>        x_moved(2) = value2;
>
>        if (x_moved(2) != value2) {
>                printf("ERROR Value: ");
>        } else {
>                printf("Value: ");
>        }
>        printf("%x %x\n", x_moved(0) , x_moved(2));
>
>        free(x);
>        return 0;
> }
> =========================================
>
>
> Best regards,
>        Sven
>

Thanks for your help.
  
Marek Lindner May 27, 2009, 9:28 a.m. UTC | #3
On Wednesday 27 May 2009 08:06:16 Nathan Wharton wrote:
> I haven't looked at any documentation for the xscale for this problem.
>  I can do a test program.  I have seen this type of problem before,
> but it would throw a bus error instead of going along with an
> incorrect result.

Yes, I'd expect some kind of interrupt error that the kernel handles. Please 
try to run the code that Sven sent and let us know about the results.

If this test is negative I would make a debug patch for you that calls 
checkIntegrity() during the different stages of the startup phase to narrow 
down the issue.

Regards,
Marek
  
Nathan Wharton May 27, 2009, 2:20 p.m. UTC | #4
On Tue, May 26, 2009 at 5:13 PM, Sven Eckelmann <sven.eckelmann@gmx.de> wrote:
> =========================================
> #include <stdio.h>
> #include <stdlib.h>
>
> #define x_moved(pos) ((int*)&x[pos])[0]
> static const int value1 = 0x01234567;
> static const int value2 = 0x89abcdef;
>
> int main() {
>        volatile char *x = (char*)malloc(sizeof(int) * 2);
>        x_moved(0) = value1;
>        x_moved(2) = value2;
>
>        if (x_moved(2) != value2) {
>                printf("ERROR Value: ");
>        } else {
>                printf("Value: ");
>        }
>        printf("%x %x\n", x_moved(0) , x_moved(2));
>
>        free(x);
>        return 0;
> }
> =========================================

This results in:
ERROR Value: 89abcdef cdef89ab
  
Nathan Wharton May 27, 2009, 4:07 p.m. UTC | #5
On Wed, May 27, 2009 at 9:20 AM, Nathan Wharton <naw@greptar.com> wrote:
> On Tue, May 26, 2009 at 5:13 PM, Sven Eckelmann <sven.eckelmann@gmx.de> wrote:
>> =========================================
>> #include <stdio.h>
>> #include <stdlib.h>
>>
>> #define x_moved(pos) ((int*)&x[pos])[0]
>> static const int value1 = 0x01234567;
>> static const int value2 = 0x89abcdef;
>>
>> int main() {
>>        volatile char *x = (char*)malloc(sizeof(int) * 2);
>>        x_moved(0) = value1;
>>        x_moved(2) = value2;
>>
>>        if (x_moved(2) != value2) {
>>                printf("ERROR Value: ");
>>        } else {
>>                printf("Value: ");
>>        }
>>        printf("%x %x\n", x_moved(0) , x_moved(2));
>>
>>        free(x);
>>        return 0;
>> }
>> =========================================
>
> This results in:
> ERROR Value: 89abcdef cdef89ab
>

Some good info is here:
http://lecs.cs.ucla.edu/wiki/index.php/XScale_alignment
  
Sven Eckelmann May 27, 2009, 4:16 p.m. UTC | #6
On Wednesday 27 May 2009 16:20:16 Nathan Wharton wrote:
> On Tue, May 26, 2009 at 5:13 PM, Sven Eckelmann <sven.eckelmann@gmx.de> 
wrote:
> > =========================================
> > #include <stdio.h>
> > #include <stdlib.h>
> >
> > #define x_moved(pos) ((int*)&x[pos])[0]
> > static const int value1 = 0x01234567;
> > static const int value2 = 0x89abcdef;
> >
> > int main() {
> >        volatile char *x = (char*)malloc(sizeof(int) * 2);
> >        x_moved(0) = value1;
> >        x_moved(2) = value2;
> >
> >        if (x_moved(2) != value2) {
> >                printf("ERROR Value: ");
> >        } else {
> >                printf("Value: ");
> >        }
> >        printf("%x %x\n", x_moved(0) , x_moved(2));
> >
> >        free(x);
> >        return 0;
> > }
> > =========================================
>
> This results in:
> ERROR Value: 89abcdef cdef89ab
Thanks. This is completely unexpecting for a kernel with alignment trap. The 
expected result is  "Value: 12389ab 89abcdef". Can you check the value of 
/proc/cpu/alignment ? I will look further at it this evening.

Regards,
	Sven
  
Nathan Wharton May 27, 2009, 4:21 p.m. UTC | #7
On Wed, May 27, 2009 at 11:16 AM, Sven Eckelmann <sven.eckelmann@gmx.de> wrote:
> Thanks. This is completely unexpecting for a kernel with alignment trap. The
> expected result is  "Value: 12389ab 89abcdef". Can you check the value of
> /proc/cpu/alignment ? I will look further at it this evening.
>
> Regards,
>        Sven
>

User:  0
System:  0
Skipped: 0
Half:  0
Word:  0
DWord:  0
Multi:  0
User faults: 0 (ignored)
  
Nathan Wharton May 27, 2009, 4:35 p.m. UTC | #8
On Wed, May 27, 2009 at 11:21 AM, Nathan Wharton <naw@greptar.com> wrote:
> On Wed, May 27, 2009 at 11:16 AM, Sven Eckelmann <sven.eckelmann@gmx.de> wrote:
>> Thanks. This is completely unexpecting for a kernel with alignment trap. The
>> expected result is  "Value: 12389ab 89abcdef". Can you check the value of
>> /proc/cpu/alignment ? I will look further at it this evening.
>>
>> Regards,
>>        Sven
>>
>
> User:  0
> System:  0
> Skipped: 0
> Half:  0
> Word:  0
> DWord:  0
> Multi:  0
> User faults: 0 (ignored)
>

I do get the correct result after echo 2 > /proc/cpu/alignment
  
Sven Eckelmann May 27, 2009, 5:13 p.m. UTC | #9
On Wednesday 27 May 2009 18:35:42 Nathan Wharton wrote:
> On Wed, May 27, 2009 at 11:21 AM, Nathan Wharton <naw@greptar.com> wrote:
> > On Wed, May 27, 2009 at 11:16 AM, Sven Eckelmann <sven.eckelmann@gmx.de> 
wrote:
> >> Thanks. This is completely unexpecting for a kernel with alignment trap.
> >> The expected result is  "Value: 12389ab 89abcdef". Can you check the
> >> value of /proc/cpu/alignment ? I will look further at it this evening.
> >>
> >> Regards,
> >>        Sven
> >
> > User:  0
> > System:  0
> > Skipped: 0
> > Half:  0
> > Word:  0
> > DWord:  0
> > Multi:  0
> > User faults: 0 (ignored)
Thanks.

> I do get the correct result after echo 2 > /proc/cpu/alignment
Ok, so 100% sure that it is a alignment problem. The kernel fixup is always 
activated and shouldn't create such problems as the page faults you mentioned 
in your first post. Maybe some communication between kernelland and userland 
has gone wrong. Can you test batmand with batgat and activated alignment 
fixup?

Best regards,
	Sven

(Thanks for the link - it is a real nice overview).
  
Nathan Wharton May 27, 2009, 5:21 p.m. UTC | #10
On Wed, May 27, 2009 at 12:13 PM, Sven Eckelmann <sven.eckelmann@gmx.de> wrote:
> Ok, so 100% sure that it is a alignment problem. The kernel fixup is always
> activated and shouldn't create such problems as the page faults you mentioned
> in your first post. Maybe some communication between kernelland and userland
> has gone wrong. Can you test batmand with batgat and activated alignment
> fixup?
>
> Best regards,
>        Sven
>
> (Thanks for the link - it is a real nice overview).
>

Changing the mode from -r 2 to -g 12000 still causes the kernel to
crash with alignment fixup.

On the console I am doing a logread when it happens.  The last thing
printed out before the crash is:
May 27 17:12:23 OpenWrt user.debug kernel: batgat: [batgat_ioctl:238]
disconnect daemon
  

Patch

=========================================
--- batmand-r1269/batman/allocate.c     2009-05-20 13:54:18.000000000 -0500
+++ batmand-r1269.mod/batman/allocate.c 2009-05-26 12:25:07.000000000 -0500
@@ -206,6 +206,10 @@ 
       struct chunkHeader *chunkHeader;
       struct chunkTrailer *chunkTrailer;
       unsigned char *chunk;
+       uint32_t remainder = length % 4;
+
+       if (remainder > 0)
+         length += 4 - remainder;

 /*     printf("sizeof(struct chunkHeader) = %u, sizeof (struct
chunkTrailer) = %u\n", sizeof (struct chunkHeader), sizeof (struct