
Understanding the Execution of a C Program

1: Introduction

The execution of a C program is often treated as a mystery. Unravelling it requires an in-depth understanding of how a C executable is loaded into memory, how the command-line parameters are passed to main(), and how the operating system starts execution at main().

In this article, we use a very small program (Listing 1) as a case study to illustrate the execution mechanism of a C program in detail. The discussion uses the Linux environment. Some previous articles have already dissected all the steps involved in processing a C program on Linux; please refer to those articles for a detailed explanation of how source code is converted into an executable.

Note: Although we use the Linux environment, the discussion applies to other environments (e.g. DOS, Windows) with some minor changes.

/* simple.c */
int main(void)
{
    return 0;
}

(Listing 1: simple.c)

Build the above source code file using the following command:

gcc -o simple simple.c

This generates an executable file named simple.

 

2: Understanding the Structure of Executable File

To understand the nature of the resulting executable, we will examine its contents with the objdump tool (Listing 2). objdump displays various kinds of information about object files; for instance, it can act as a disassembler to show an executable in assembly form. It is part of GNU binutils, which give fine-grained control over executables and other binary data. Its options control which information is displayed; the -f option displays summary information from the overall file header. This information is mostly useful to programmers working on compilation tools, as opposed to programmers who just want their program to compile and work.

objdump -f simple

simple:     file format elf32-i386
architecture: i386, flags 0x00000112:
EXEC_P, HAS_SYMS, D_PAGED
start address 0x080482d0

(Listing 2: objdump Output)

The objdump output (Listing 2) tells us that the file is a 32-bit ELF executable for the i386 architecture and that its start address is 0x080482d0.

3: Understanding the Executable File Format

The resulting executable is in the ELF file format. ELF stands for Executable and Linking Format, one of several object and executable file formats used on Linux (and Unix) systems. For our discussion, the interesting part of ELF is its header. Every ELF executable file begins with an ELF header, laid out as follows (Listing 3):

typedef struct
{
    unsigned char    e_ident[EI_NIDENT];  /* Magic number and other info */
    Elf32_Half       e_type;              /* Object file type */
    Elf32_Half       e_machine;           /* Architecture */
    Elf32_Word       e_version;           /* Object file version */
    Elf32_Addr       e_entry;             /* Entry point virtual address */
    Elf32_Off        e_phoff;             /* Program header table file offset */
    Elf32_Off        e_shoff;             /* Section header table file offset */
    Elf32_Word       e_flags;             /* Processor-specific flags */
    Elf32_Half       e_ehsize;            /* ELF header size in bytes */
    Elf32_Half       e_phentsize;         /* Program header table entry size */
    Elf32_Half       e_phnum;             /* Program header table entry count */
    Elf32_Half       e_shentsize;         /* Section header table entry size */
    Elf32_Half       e_shnum;             /* Section header table entry count */
    Elf32_Half       e_shstrndx;          /* Section header string table index */
} Elf32_Ehdr;

(Listing 3: Structure of ELF Header)

Note: The start address of the executable is stored in the e_entry field of the ELF header.

4: Starting Address of an Executable File

Now we will disassemble the executable file simple to look at the start address. Several tools can disassemble an executable; we will use objdump.

objdump --disassemble simple

Because the full output is long, Listing 4 shows only an excerpt. Our intention is to see what is at address 0x080482d0.

080482d0 <_start>:
80482d0:       31 ed                   xor    %ebp,%ebp
80482d2:       5e                      pop    %esi
80482d3:       89 e1                   mov    %esp,%ecx
80482d5:       83 e4 f0                and    $0xfffffff0,%esp
80482d8:       50                      push   %eax
80482d9:       54                      push   %esp
80482da:       52                      push   %edx
80482db:       68 20 84 04 08          push   $0x8048420
80482e0:       68 74 82 04 08          push   $0x8048274
80482e5:       51                      push   %ecx
80482e6:       56                      push   %esi
80482e7:       68 d0 83 04 08          push   $0x80483d0
80482ec:       e8 cb ff ff ff          call   80482bc <_init+0x48>
80482f1:       f4                      hlt   
80482f2:       89 f6                   mov    %esi,%esi

(Listing 4: Snapshot of the disassembled output from objdump)

The disassembled output (Listing 4) shows that a startup routine called _start is located at address 0x080482d0. Its assembly code does the following:

  1. Clear the ebp register (at address 80482d0)
  2. Pop one value off the stack into esi, save the stack pointer in ecx and align the stack pointer (at addresses 80482d2 to 80482d5)
  3. Push some values onto the stack (at addresses 80482d8 to 80482e7)
  4. Call a function (at address 80482ec)

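According to these instructions, just before the call the stack should look like the following, listed from the top of the stack downwards (Figure 1):

0x80483d0
esi
ecx
0x8048274
0x8048420
edx
esp
eax

(Figure 1: Stack contents before the call at 80482ec)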


Now let us understand the meaning of the hex values stored in the stack frame and the contents of address 80482bc, which is called by _start.

 
4.1: The HEX Values

To understand the meaning of the hex values, we need to look carefully at the disassembled output from objdump.

0x80483d0: Address of our main() function.
0x8048274: Address of _init function.
0x8048420: Address of _fini function.

Note: _init and _fini are initialization and finalization functions provided by GCC.

In other words, all these hex values are pointers to the functions listed above.

4.2: Address 80482bc

The objdump output displays the following contents at address 80482bc:

80482bc:    ff 25 48 95 04 08       jmp    *0x8049548

Here *0x8049548 denotes an indirect jump: the processor reads the address stored at memory location 0x8049548 and jumps to it.

 

5: Understanding the Dynamic Linking

With ELF, we can build an executable that is dynamically linked to its libraries. Dynamically linked means that the actual linking happens at runtime. Otherwise we would have to build a huge executable containing all the required libraries (a statically linked executable).
Now we will look at all the libraries dynamically linked with our executable file simple.
We will use the ldd tool for this purpose (Listing 5).

ldd simple

libc.so.6 => /lib/i686/libc.so.6 (0x42000000)
/lib/ld-linux.so.2 => /lib/ld-linux.so.2 (0x40000000)

(Listing 5: Output of ldd)

All dynamically linked data and functions have dynamic relocation entries. The dynamic linking mechanism works as follows:

  1. The actual address of a dynamic symbol is not known at link time; it becomes known only at runtime.
  2. So, for each dynamic symbol, a memory location is reserved to hold the actual address. The loader fills this location with the actual address of the symbol at runtime.
  3. The executable reaches the dynamic symbol indirectly through that memory location, using a kind of pointer operation. In our case, address 80482bc holds just a simple jump instruction, and the jump target is stored at address 0x8049548 by the loader at runtime. We can see all dynamic link entries with the objdump tool (Listing 6).

 
objdump -R simple

simple:     file format elf32-i386

DYNAMIC RELOCATION RECORDS
OFFSET   TYPE              VALUE
0804954c R_386_GLOB_DAT    __gmon_start__
08049540 R_386_JUMP_SLOT   __register_frame_info
08049544 R_386_JUMP_SLOT   __deregister_frame_info
08049548 R_386_JUMP_SLOT   __libc_start_main

(Listing 6: Dynamic Link Entries)

The address 0x8049548 is called a jump slot, which matches its role: it holds the address to jump to. According to the table, the function we actually want to call is __libc_start_main.

6: __libc_start_main Function

Jumping to __libc_start_main takes us into the realm of the standard C library (libc on Linux). __libc_start_main is a function in libc.so.6. In the glibc source code, its prototype looks like the following (Listing 7):

extern int BP_SYM (__libc_start_main) (int (*main) (int, char **, char **),
                                       int argc,
                                       char *__unbounded *__unbounded ubp_av,
                                       void (*init) (void),
                                       void (*fini) (void),
                                       void (*rtld_fini) (void),
                                       void *__unbounded stack_end)
     __attribute__ ((noreturn));

(Listing 7: Prototype of __libc_start_main function)

All that the assembly instructions before this jump do is set up the argument stack and call __libc_start_main. This function then sets up and initializes some data structures and the environment before calling our main() function.

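Now let us match the values that _start pushes (Listing 4) against the parameters of this function prototype (Figure 2):

main        0x80483d0
argc        esi
ubp_av      ecx
init        0x8048274
fini        0x8048420
rtld_fini   edx
stack_end   esp

(Figure 2: Pushed values matched to the parameters of __libc_start_main)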



According to this stack frame, the esi, ecx, edx and esp registers must hold appropriate values before __libc_start_main() is executed (eax is pushed as well, but the prototype has no parameter for it). The startup instructions discussed earlier (Listing 4) do not compute these values themselves: esi and ecx are loaded from the stack, and edx is used as it is. The operating system kernel prepares that stack and those registers before _start runs.
 

7: Understanding the Role of Operating System Kernel

Let us now understand what happens when we run an executable by entering its file name at the shell prompt. On Linux, it involves the following steps:

  1. The shell invokes the execve system call with the program path, argv and the environment variables.
  2. The kernel system call handler gets control and starts handling the system call. In kernel code, the handler is sys_execve. On x86, the user-mode application passes all required parameters to the kernel in the following registers:
       ebx: pointer to the program name string
       ecx: argv array pointer
       edx: environment variable array pointer
  3. The generic execve handler, do_execve, is called. It sets up a data structure, copies some data from user space to kernel space and finally calls search_binary_handler(). Linux can support more than one executable file format, such as a.out and ELF, at the same time. For this purpose there is a data structure, struct linux_binfmt, which holds a function pointer for each binary format loader. search_binary_handler() looks up the appropriate handler and calls it. In our case study the handler is load_elf_binary(). This function first sets up the kernel data structures for file operations in order to read in the ELF executable image. Then it sets up kernel data structures such as the code size, the data segment start and the stack segment start. Next it allocates user-mode pages for the process and copies argv and the environment variables to those pages. Finally, create_elf_tables() pushes argc, the argv pointers and the environment variable pointers onto the user-mode stack, and start_thread() starts the process running.

Note: A lot happens here. For the sake of brevity, we give only an overview; the details are left for another article.

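When the _start routine gets control, the stack looks like the following, listed from the top of the stack downwards (Figure 3):

argc
argv pointers, followed by a NULL entry
environment variable pointers, followed by a NULL entry

(Figure 3: Stack contents on entry to _start)

The assembly instructions fetch this information from the stack as follows:

pop %esi               <--- get argc
mov %esp, %ecx         <--- get argv

After the pop, the stack pointer points at the first argv entry, so the argv address is simply the current stack pointer. At this stage everything is ready to start executing the actual program code.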

8: Involvement of the Processor Registers

The esp register supplies the stack end for the application program. After popping all the necessary information, the _start routine simply aligns the stack pointer (esp) by clearing its lower 4 bits. For our main function this is the end of the stack. The edx register supplies rtld_fini (a kind of application destructor); the kernel just sets it to 0 with the following macro (Listing 8):

#define ELF_PLAT_INIT(_r)      do { \
        _r->ebx = 0; _r->ecx = 0; _r->edx = 0; \
        _r->esi = 0; _r->edi = 0; _r->ebp = 0; \
        _r->eax = 0; \
} while (0)

(Listing 8: ELF_PLAT_INIT Macro)

Note: The 0 means we do not use that functionality on x86 Linux.

9: Location of the Supporting Code

The supporting code for the functions discussed above comes from GCC and the C library. You can usually find the object files at /usr/lib/gcc-lib/i386-redhat-linux/XXX and /usr/lib, where XXX is the gcc version. The paths are similar on other Linux distributions. These routines are stored mainly in three files, namely crtbegin.o, crtend.o and crt1.o (gcrt1.o takes the place of crt1.o when profiling with -pg is requested); we have discussed these files in one of the previous articles.

10: Conclusion

The entire discussion above about the execution of a C program can be summarized as follows:

  1. GCC builds your program with crtbegin.o, crtend.o and crt1.o. The other default libraries (e.g. the standard C library libc) are dynamically linked by default. The start address of the executable is set to that of the _start function.
  2. The kernel loads the executable and sets up the text, data and stack. It also allocates pages for the arguments and environment variables and pushes all the necessary information onto the stack.
  3. Control then passes to _start, which gets all its information from the stack set up by the kernel. It sets up the argument stack for __libc_start_main and calls it.
  4. __libc_start_main initializes the necessary structures, especially those of the standard C library (for functions like malloc) and the thread environment. Then it calls our main.
  5. Our main is called as main(argc, argv). One interesting point here is the signature of main(): __libc_start_main treats it as main(int, char **, char **). The following program illustrates this prototype (Listing 9).

#include <stdio.h>

int main(int argc, char** argv, char** env)
{
    int i = 0;

    while (env[i] != 0)
    {
        printf("%s\n", env[i++]);
    }
    return 0;
}

(Listing 9: The prototype of main() function)

Thus an executable file created from C source is executed through the cooperation of the language processing toolkit (GCC on Linux), the standard C library (libc on Linux) and the corresponding binary loader.

About Author
The author works as a Director at Sinhgad Institute of Management and Computer Application (SIMCA) in Pune. He can be reached at sachin_a_kadam@rediffmail.com








Added on January 6, 2012
