Syntactical twists of C (*p != *p) | Cloud, Computing, Chaos

[Moved an old post from 2006 to my new blog]

A few days back one of my colleagues asked me to debug a problem. She had written a program and it was crashing in strcpy. I looked at the code and it looked just fine to me. I thought, let's debug it and see what's going on. I started the debug session: the variables were pointing to the right data, the stack was fine, and she was copying a fixed string to a big enough buffer. I stepped over strcpy and bammm… access violation. Weird, huh? For a second I wondered how simple code like this could crash. It was time to dig into the disassembly to see what exactly was going on. But before we do that, let's take a look at the two C functions below:

void
foo()
{
    char buffer[16];
    char *p = buffer;
    *p = 1; /// Interesting code
    p[1] = 1; /// Interesting code
    p[2] = 2; /// Interesting code
}

void
hoo()
{
    char p[16];
    *p = 1; /// Interesting code
    p[1] = 1; /// Interesting code
    p[2] = 2; /// Interesting code
}

Look at the lines marked “Interesting code” above. They are exactly the same in both functions. They also perform the same task, i.e. *p = 1; modifies the first element of the array in both functions. Similarly, p[1] = 1; modifies the second element of the array in both functions, and so on. But trust not what you see. Even though the code in “foo” and “hoo” looks exactly the same and performs the same task, it generates different sets of machine instructions. The disassembly of the two functions (32-bit x86) is shown below:

; Disassembly of foo
; void 
; foo() 
    ; { 
    push        ebp  
    mov         ebp,esp 

    ; char buffer[16]; 
    sub         esp,14h 
    ; <-- 14h = 20 bytes: 16 for buffer plus 4 for the pointer p 

    ; char *p = buffer; 
    lea         eax,[buffer] 
    mov         dword ptr [p],eax 
  
    ; *p = 1; // Interesting code
    mov         ecx,dword ptr [p]
    ; <-- Load the address stored in p into ecx

    mov         byte ptr [ecx],1
    ; <-- Store 1 at the address held in ecx

    ; p[1] = 1; // Interesting code
    mov         edx,dword ptr [p]
    ; <-- Load the address stored in p into edx

    mov         byte ptr [edx+1],1
    ; <-- Store 1 at address edx+1

    ; p[2] = 2; // Interesting code
    mov         eax,dword ptr [p]
    ; <-- Load the address stored in p into eax

    mov         byte ptr [eax+2],2
    ; <-- Store 2 at address eax+2
                                  
    ; }
    mov         esp,ebp 
    pop         ebp
    ret
  
  
; Disassembly of hoo 
; void 
; hoo() 
    ; { 
    push        ebp  
    mov         ebp,esp 

    ; char p[16]; 
    sub         esp,10h 
    ; <-- allocate space for p[16] 
  
    ; *p = 1; // Interesting code 
    mov         byte ptr [p],1 
    ; <-- Store 1 directly into the first byte of the array p 
    ; <-- [p] is actually [ebp-10h] here; no pointer is loaded first 
                                  
    ; p[1] = 1; // Interesting code 
    mov         byte ptr [ebp-0Fh],1 
    ; <-- p starts at ebp-10h, so p[1] is at 
    ; <-- p+1 => ebp-10h+1 => ebp-0Fh. Array elements sit at 
    ; <-- increasing addresses even though the stack grows 
    ; <-- downward. The statement above stores 1 in p[1] 
                                  
    ; p[2] = 2; // Interesting code 
    mov         byte ptr [ebp-0Eh],2 
    ; <-- similarly move 2 in the address 
    ; <-- pointed by p[2] 

    ; } 
    mov         esp,ebp 
    pop         ebp  
    ret

As you can see from the disassembly, in function “foo” the compiler generates code that first loads the address stored in the pointer p and then writes through it, so every access to the char array “buffer” is made by dereferencing p. In function “hoo”, on the other hand, p is the array itself: its address is a fixed offset from ebp, and every access is made directly, without dereferencing any other variable.
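The same difference can be seen from C itself, without looking at the disassembly. Below is a minimal sketch; the function names are made up for illustration and are not part of the original program. A pointer is a separate variable that holds an address, while an array name simply is the storage.

```c
#include <stddef.h>

/* Returns 1 when the pointer variable lives at a different address
   than the buffer it points to (the "foo" situation). */
int pointer_has_own_storage(void)
{
    char buffer[16];
    char *p = buffer;
    /* &p is where the pointer variable lives; p is the address stored in it. */
    return (void *)&p != (void *)p;
}

/* Returns 1 when the array name and the array's address are the same
   location (the "hoo" situation). */
int array_is_its_own_storage(void)
{
    char p[16];
    /* p decays to &p[0], which is the same address as &p. */
    return (void *)&p == (void *)p;
}

/* sizeof sees the difference too: pointer size versus array size. */
size_t size_via_pointer(void)
{
    char buffer[16];
    char *p = buffer;
    return sizeof p;
}

size_t size_via_array(void)
{
    char p[16];
    return sizeof p;
}
```

Both address checks return 1. size_via_array() returns 16, while size_via_pointer() returns the size of a pointer on the target machine.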

Now back to the problem I was debugging… my colleague had made a genuine mistake, one I am sure anyone could have made. She had a global variable defined as char gBuffer[MAX_PATH]; in one file but declared it as extern char *gBuffer; in another. In that second file she was using it in a call like strcpy(gBuffer, "test");. There the compiler treated gBuffer as a pointer, so instead of passing the address of the array it loaded the first 4 bytes of the array (8 bytes on a 64-bit machine) and passed that value to strcpy as the destination address. Since an uninitialized global array is zero-filled, those bytes were 0x00000000, and the call effectively became strcpy(NULL, "test"). No wonder it crashed. I changed extern char *gBuffer; to extern char gBuffer[];. The compiler got the hint that it is an array of characters and passed the correct address to strcpy.
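The real bug needs two source files and is undefined behavior, so here is a single-file sketch that only imitates what the mismatched declaration made the compiler do. BUF_LEN is an assumed stand-in for MAX_PATH, and both helper functions are hypothetical. The sketch also assumes that an all-zero bit pattern reads back as a null pointer, which holds on common platforms such as x86.

```c
#include <string.h>

#define BUF_LEN 260  /* assumed stand-in for MAX_PATH */

/* The real definition: an array. File-scope objects start zero-filled. */
char gBuffer[BUF_LEN];

/* Imitates "extern char *gBuffer": the first sizeof(char *) bytes of
   the array are read and treated as a pointer value. */
char *value_seen_as_pointer(void)
{
    char *bogus;
    memcpy(&bogus, gBuffer, sizeof bogus);
    return bogus;
}

/* Matches "extern char gBuffer[]": the array name yields the address
   of the array itself. */
char *value_seen_as_array(void)
{
    return gBuffer;
}
```

While the buffer is still zero-filled, value_seen_as_pointer() returns a null pointer, which is what strcpy received in the crashing program. value_seen_as_array() returns the address of gBuffer, so strcpy(value_seen_as_array(), "test") writes into the buffer as intended.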

These syntactic twists make C/C++ a bit harder to learn and more prone to mistakes. However, they also force programmers to understand what the compiler is doing behind their backs. I guess a lot of programmers out there live for C/C++ because of the power it provides. It is a double-edged sword: you can use it properly to do amazing things, or you can cut yourself if you are careless (even a little bit).

This posting is provided “AS IS” with no warranties and confers no rights.