C/C++ compilation and linking model analysis

C and C++ both rely on separate compilation to build a program out of multiple source files, but why the languages work this way and how it is implemented deserve a closer look. This article is not about the syntax rules that give names different linkage in C and C++; it looks at how C/C++ compilers implement the compilation and linking model.

Before going further, let's look at how the C standard defines the translation environment:

[ISO/IEC 9899:1999] A C program need not all be translated at the same time. The text of the program is kept in units called source files, (or preprocessing files) in this International Standard. A source file together with all the headers and source files included via the preprocessing directive #include is known as a preprocessing translation unit. After preprocessing, a preprocessing translation unit is called a translation unit.
Previously translated translation units may be preserved individually or in libraries. The separate translation units of a program communicate by (for example) calls to functions whose identifiers have external linkage, manipulation of objects whose identifiers have external linkage, or manipulation of data files. Translation units may be separately translated and then later linked to produce an executable program.

The key concept here is the translation unit: the text the compiler actually sees after the preprocessor has processed a source file, that is, after all macros have been expanded, conditional-compilation blocks (#if/#ifndef/#endif) have been resolved, and the files named by #include have been pasted in.
For example:

```c
// main.c
#include <stdio.h>
int main(void){
    printf("helloworld!\n");
    return 0;
}
```

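After preprocessing, the #include <stdio.h> line is replaced by the entire contents of that header (typically hundreds of lines of declarations), followed by the unchanged definition of main. Running `gcc -E main.c` prints the resulting translation unit.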
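
To make the preprocessing step more concrete, here is a minimal sketch of a file that uses the three preprocessor features mentioned above. The file name, the macros `GREETING`, `SQUARE`, and `SCALE`, and the function names are all hypothetical and exist only for illustration; the comments describe what each body looks like once the preprocessor is done.

```cpp
// preprocess_demo.cpp (hypothetical file, not part of the original example)
#include <cassert>
#include <string>

#define GREETING "helloworld!"   // object-like macro
#define SQUARE(x) ((x) * (x))    // function-like macro

#ifndef SCALE                    // conditional compilation: SCALE may be
#define SCALE 2                  // overridden with -DSCALE=3 on the command line
#endif

// After preprocessing, this body reads: return "helloworld!";
std::string greeting() { return GREETING; }

// After preprocessing (default SCALE), this body reads: return ((n) * (n)) * 2;
int scaled_square(int n) { return SQUARE(n) * SCALE; }

// Usage sketch: the expected values assume the default SCALE of 2.
void check_preprocess_demo() {
    assert(greeting() == "helloworld!");
    assert(scaled_square(3) == 18);
}
```

Running `g++ -E preprocess_demo.cpp` on such a file should show the expanded forms from the comments at the very end of the output, after the contents of the included headers.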


### Separate Compilation

In C and C++, source code is not turned into an **executable object file** in a single step. Instead, the build uses **separate compilation**: each source file is compiled on its own into a separate module (a program consists of multiple translation units), and the modules are then combined (**linkage**) into one executable binary. (The reasons for separating declaration and definition are discussed later.)

C++ inherited this compilation model from C, mainly because being **compatible with C** was one of its design goals, alongside **zero overhead** (language features you do not use should cost nothing) and **value semantics** (a copy is independent of the object it was copied from). For more details, see [The Design and Evolution of C++](https://book.douban.com/subject/1456860/) by C++'s creator, Bjarne Stroustrup.

> The meaning of "compatible with C" is rich; it not only pertains to syntactical compatibility but also, more importantly, to the compatibility of C's compilation model and runtime model — that is, being able to directly use C language files and libraries. — Chen Shuo, Linux Multithreaded Server Programming: Using the muduo C++ Network Library

In practice, the C++ compilation model is considerably more complex than C's. This article covers only the concept and mechanics of separate compilation; the finer details are left to future articles.

Early C compilers were not single monolithic programs (many of today's compilers are likewise toolchains made up of separate programs). **Dennis Ritchie**'s C compiler, written for the **PDP-11**, consisted of seven executable files: **cc**/**cpp**/**as**/**ld**/**c0**/**c1**/**c2**.

### Why Do This?

The main reason is that early computers were severely resource-constrained. The [**PDP-11**](https://en.wikipedia.org/wiki/PDP-11) that **Dennis Ritchie** initially used had only 24KB of memory, of which 16KB ran the operating system and 8KB ran user code. With so little memory, the compiler could not hold the abstract syntax tree of even a single source file in memory, let alone run the entire compiler in memory at once. C therefore compiles each source file independently into an object file and then combines the object files in a separate step (linking), so that a complete executable can be built within the limited memory.

### GCC Compiler

Modern compilers (GCC/Clang) are typically composed of four parts: **preprocessor**, **compiler**, **assembler**, and **linker**. Relevant compilation parameters and their meanings can be seen in GCC's `gcc --help`:

|         Parameter       |                    Meaning                    |
| :---------------------: | :------------------------------------------: |
|         -E              | Preprocess only; do not compile, assemble or link. |
|         -S              |  Compile only; do not assemble or link.    |
|         -c              |  Compile and assemble, but do not link.    |
|      -o &lt;file&gt;    |   Place the output into &lt;file&gt;.      |
|        -pie             | Create a position independent executable.   |
|       -shared           |         Create a shared library.            |
| -x &lt;language&gt;     | Specify the language of the following input files.<br/>Permissible languages include: c c++ assembler none, where 'none' means it will revert to default behavior of guessing the language based on the file's extension. |
| -Wl,&lt;options&gt;     | Pass comma-separated &lt;options&gt; to the linker.<br/> When generating a dynamic link library, parameters can be sent to the linker to generate an import library. |

**ld** (linker) parameters (`ld --help`):

|        Parameter         | Meaning |
| :-----------------------: | :------: |
| --out-implib &lt;file&gt; |   Generate an import library for the shared library (DLL)   |

> Note: A dynamic link library (DLL) is usually accompanied by an import library so that a program can link against the DLL at build time. Without one, you have to load the DLL file yourself with LoadLibrary and use GetProcAddress to obtain each function pointer.
>
> With the import library, you just need to link it during the code compilation, and you can directly call the functions of the dynamic link library in the code after including the header file.

**ldd**: Outputs the shared libraries that a program depends on, usage:

```bash
$ ldd main.exe
  ntdll.dll => /c/WINDOWS/SYSTEM32/ntdll.dll (0x7ffa14b80000)
  KERNEL32.DLL => /c/WINDOWS/System32/KERNEL32.DLL (0x7ffa12b00000)
  KERNELBASE.dll => /c/WINDOWS/System32/KERNELBASE.dll (0x7ffa11d90000)
  msvcrt.dll => /c/WINDOWS/System32/msvcrt.dll (0x7ffa128a0000)
  dynamicLib.dll => /c/Users/visionsmile/Desktop/a/dynamicLib.dll (0x64740000)
  libstdc++-6.dll => /mingw64/bin/libstdc++-6.dll (0x6fc40000)
  USER32.dll => /c/WINDOWS/System32/USER32.dll (0x7ffa14060000)
  win32u.dll => /c/WINDOWS/System32/win32u.dll (0x7ffa11730000)
  GDI32.dll => /c/WINDOWS/System32/GDI32.dll (0x7ffa12bc0000)
  gdi32full.dll => /c/WINDOWS/System32/gdi32full.dll (0x7ffa117b0000)
  msvcp_win.dll => /c/WINDOWS/System32/msvcp_win.dll (0x7ffa11950000)
  libwinpthread-1.dll => /mingw64/bin/libwinpthread-1.dll (0x64940000)
  ucrtbase.dll => /c/WINDOWS/System32/ucrtbase.dll (0x7ffa11c90000)
  libgcc_s_seh-1.dll => /mingw64/bin/libgcc_s_seh-1.dll (0x61440000)
```

In simple terms, compiling a source file with gcc takes four steps:

1. Preprocess (-E)
2. Compile (-S)
3. Assemble into an object file (-c)
4. Link (no option needed)

The material above may look convoluted, but the idea behind separate compilation is simply a modular mechanism for programs made of several source files: the current source file can use code from other source files, so the whole program does not have to live in a single file. It is like a cross-reference in a book: I point to the chapter and section of another book where a concept is defined, and you consult that book to find out what it means.

TCPL (The C Programming Language) describes internal and external linkage as follows:

Within a translation unit, all declarations of the same object or function identifier with internal linkage refer to the same thing, and the object or function is unique to that translation unit. All declarations for the same object or function identifier with external linkage refer to the same thing, and the object or function is shared by the entire program.

Note that in the C language, a function declaration (that does not specify linkage) has implicit external linkage.

[ISO/IEC 9899:1999] If no prior declaration is visible, or if the prior declaration specifies no linkage, then the identifier has external linkage.

This means:

```c
extern int max(int,int);
int max(int,int);
```

Both have the same meaning.
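
As a small illustration of that equivalence, the sketch below declares one function both ways and then defines it once. It is written in C++, where the same rule holds for functions at namespace scope; the name `max_of` and the check function are hypothetical.

```cpp
// linkage_demo.cpp (hypothetical file name)
#include <cassert>

extern int max_of(int, int);  // explicit extern
int max_of(int, int);         // no storage-class specifier: same function, same linkage

// Both declarations above refer to this single definition.
int max_of(int x, int y) { return x >= y ? x : y; }

// Usage sketch.
void check_linkage_demo() {
    assert(max_of(11, 12) == 12);
    assert(max_of(7, 7) == 7);
}
```

The compiler accepts the repeated declaration without complaint, because both lines declare the same entity with external linkage.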

To give an identifier internal linkage explicitly, declare it static.

[ISO/IEC 9899:1999] A function declaration can contain the storage-class specifier static only if it is at file scope;
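
The following sketch contrasts the two kinds of linkage inside one translation unit. The function names are made up for this example, and the remark about nm describes typical GNU toolchain behavior rather than a guarantee.

```cpp
// static_demo.cpp (hypothetical file name)
#include <cassert>

// Internal linkage: visible only inside this translation unit.
// nm typically lists it with a lowercase 't', if it is emitted at all.
static int clamp_to_byte(int v) { return v < 0 ? 0 : (v > 255 ? 255 : v); }

// External linkage: other translation units may declare and call this.
int brightness(int raw) { return clamp_to_byte(raw * 2); }

// Usage sketch.
void check_static_demo() {
    assert(brightness(100) == 200);
    assert(brightness(200) == 255);
    assert(brightness(-5) == 0);
}
```

Another source file could define its own `clamp_to_byte` without any conflict, because the static one never becomes visible to the linker as a global symbol.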

### Compilation and Linking Example

In C, the extern keyword is used to specify that a name has external linkage:

```c
// main.c
extern int max(int,int);
int main(void){
    max(11,12);
    return 0;
}
```

The function int max(int,int) is defined in another file:

```c
// maxDefine.c
int max(int x,int y){
    return x>=y?x:y;
}
```

Using the four steps mentioned earlier, let’s manually compile and link these two source files using separate compilation.

First, preprocess one of the source files (#include/conditional compilation/macro expansion):

```bash
# Preprocessed file goes to main.i
$ gcc -E main.c -o main.i
```

Next, execute compilation (generate assembly code from the preprocessed file):

```bash
# Compiled result goes to main.s
$ gcc -S main.i -o main.s
```

Use -c to generate an object file (from assembly code):

```bash
# Saving object file as main.o
$ gcc -c main.s -o main.o
```

We can see that three files: main.i, main.s, and main.o have been generated.

Then we also perform the same process for maxDefine.c, which will similarly generate maxDefine.o. Next, let’s try linking:

```bash
# By default, gcc/g++ links the object files it is given
$ gcc main.o maxDefine.o
```

As expected, an a.exe will be generated in the current directory, which is the executable program we need.

### What is the Linking Behavior?

However, what happens in between? Why does main.c correctly find the definition of max in maxDefine.c without defining max?

I illustrate with a simple diagram:

Using extern int max(int,int) in main.c effectively creates a slot named max in main.o (an undefined symbol), and the definition of max in maxDefine.c is the key that fits that slot. Filling such slots is the linker's job, which is why many errors in C/C++ programming are link errors rather than compile errors.

If we link main.o without specifying maxDefine.o, the slot stays empty and the linker reports an undefined reference:

```bash
$ gcc main.o
# main.o:main.c:(.text+0x1f): undefined reference to `max'
```

This is akin to trying to open a door without a key, resulting in an error.

Note that in C, a prototype in one translation unit and the definition in another do not have to match for the link to succeed. If the definition of max in maxDefine.c is changed to int max(int x,double y), the two object files still link, because a function symbol in a C object file consists only of the function's identifier and carries no parameter types: both object files refer to the symbol max, so the match succeeds. The linker cannot see the mismatch, however. main.c still passes two int arguments according to the prototype it was compiled with, so no conversion to double takes place, and the call has undefined behavior even if it happens to appear to work on some platforms. This reflects C's trust in the programmer, which is often cited as one of the reasons C is considered freer and less safe than C++.

In C++ the same mismatch is a link error. C++ object files encode both the function name and its parameter types in the symbol (name mangling), so int max(int,int), which main declares and calls, and int max(int,double), which is what is actually defined, become two distinct symbols in the object files, and the linker reports an undefined symbol.
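
To illustrate how parameter types take part in C++ symbol names, here is a small sketch with two overloads and one `extern "C"` function. The mangled names in the comments assume the Itanium C++ ABI used by GCC and Clang on most platforms; `c_max` and the check function are hypothetical names.

```cpp
// mangling_demo.cpp (hypothetical file name)
#include <cassert>

// Same identifier, different parameter types: two distinct symbols.
int max(int x, int y)    { return x >= y ? x : y; }                    // _Z3maxii
int max(int x, double y) { return x >= y ? x : static_cast<int>(y); }  // _Z3maxid

// extern "C" switches mangling off: the symbol is plain c_max, as in C,
// which is also why extern "C" functions cannot be overloaded.
extern "C" int c_max(int x, int y) { return x >= y ? x : y; }

// Usage sketch: overload resolution picks the symbol at compile time.
void check_mangling_demo() {
    assert(max(11, 12) == 12);  // calls max(int, int)
    assert(max(5, 2.5) == 5);   // calls max(int, double)
    assert(max(1, 7.9) == 7);   // calls max(int, double), result truncated to int
    assert(c_max(3, 2) == 3);
}
```

Compiling this file with `g++ -c` and running `nm` on the result should list the two mangled names next to the unmangled `c_max`; `c++filt` turns the mangled names back into readable signatures.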

### GCC Toolchain Compilation and Linking Parameters

```bash
# Preprocessing: #include, macro expansion, and conditional compilation
$ cpp main.cc main.i
$ g++ -E main.cc -o main.i
# Compile to assembly code
$ g++ -S main.i -o main.s
# Generate an object file from assembly code
$ g++ -c main.s -o main.o
# Generate an object file directly from a source file
$ g++ -c main.cc
# View the symbols in a .o file
$ nm main.o
# Convert a mangled symbol back to its original name
$ c++filt _Z3maxii

# Build a static library from a source file
# First generate the object file
$ g++ -c slib.cc -o slib.o
# Then use ar to create the archive
$ ar rcs libslib.a slib.o
# The resulting libslib.a is the static library built from our source file
# Usage:
# -L gives the directory to search for libraries (here the current directory);
# -lxxx makes the linker look for a library named libxxx.a
# For example, -L. -lslib links the libslib.a file in the current directory
$ g++ -o main.exe main.cc -L. -lslib

# Build a dynamic library from a source file
# Generate the object file with -fPIC/-fpic, which produces position-independent
# code so that the library's code segment can be shared
$ gcc -c -fpic dylib.c -o dylib.o
# Use -shared to generate the dynamic library (.so)
$ gcc -shared -fpic dylib.o -o dylib.so
# Generate a DLL and an import library; symbols in the DLL can be imported
# through this import library at link time
$ g++ dylib.cpp -shared -o dylib.dll -Wl,--out-implib,impdylib.lib

# nm can also show symbol information in static libraries, dynamic libraries, and executables
$ nm libslib.a
$ nm dylib.so
$ nm main.exe

# ldd shows the shared libraries that an executable depends on
$ ldd main.exe
ntdll.dll => /c/Windows/SYSTEM32/ntdll.dll (0x7ffb24f80000)
KERNEL32.DLL => /c/Windows/System32/KERNEL32.DLL (0x7ffb22d40000)
KERNELBASE.dll => /c/Windows/System32/KERNELBASE.dll (0x7ffb218a0000)
msvcrt.dll => /c/Windows/System32/msvcrt.dll (0x7ffb22e50000)
```

Common Symbol Types in Object Files:

- A: The symbol's value is absolute and will not change in further linking.
- B: The symbol is in the BSS segment, typically an uninitialized global variable.
- D: The symbol is in the data segment, typically an initialized global variable.
- T: The symbol is in the code segment, typically a global non-static function.
- U: The symbol is undefined and has to be resolved from another object file at link time.
- W: A weak symbol. If another linked object file provides a normal definition, that definition is used; otherwise the weak definition is used, or, for an undefined weak symbol, a system-specific default value.
- R: The symbol is in the read-only data area, such as a const int with file scope in C (which differs from C++).
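
As a rough guide to where these letters come from, the sketch below annotates a few declarations with the symbol type that GNU nm would typically report for them on an ELF target. The exact letters depend on the platform, compiler, and optimization level, and every name in the file is hypothetical.

```cpp
// symbols_demo.cpp (hypothetical file name)
#include <cassert>

int counter;                      // B: zero-initialized global, placed in .bss
int limit = 10;                   // D: initialized global, placed in .data
extern const int table_size;      // prior extern declaration: external linkage
const int table_size = 4;         // R: read-only data
static int hidden() { return 1; } // t: lowercase marks a local (internal) symbol
int visible() { return hidden() + limit; }  // T: global function in .text
int elsewhere(int);               // would show up as U only if this file called it

// Usage sketch.
void check_symbols_demo() {
    assert(counter == 0);
    assert(visible() == 11);
    assert(table_size == 4);
}
```

Compiling with `g++ -c symbols_demo.cpp` and running `nm symbols_demo.o` is a quick way to compare these expectations with what a particular toolchain actually emits.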

For more information about the symbol format in object files, refer to this article: nm Object File Format Analysis.

### Why Separate Declaration and Definition?

The previous sections covered what separate compilation is for and how it works. But what is the point of the "separation of declaration and implementation" that is so common in C/C++? Its main purpose is to prevent the multiple definition errors that arise when several source files include the same file containing a definition.
Suppose we have three source files customMax.cpp, libMin.cpp, and main.cpp:

```cpp
// customMax.cpp

#ifndef __CUSTOM_MAX_H__
#define __CUSTOM_MAX_H__
int customMax(int x,int y){
    return x>=y?x:y;
}
#endif
```

Another file, libMin.cpp, includes customMax.cpp:

```cpp
// libMin.cpp
#ifndef __LIB_MIN_H__
#define __LIB_MIN_H__
#include "customMax.cpp"

extern int customMax(int,int);
int libMin(int x,int y){
    return customMax(x,y)==x?y:x; // Just an example
}
#endif
```

In the third file main.cpp, we include customMax.cpp, but specify the symbol libMin from libMin.cpp via external linkage:

```cpp
// main.cpp
#include "customMax.cpp"

extern int libMin(int,int);

int main(void){
    customMax(11,12);
    libMin(11,12);
    return 0;
}
```

When linking the main.cpp code with libMin.cpp, a multiple definition error will occur:

```bash
$ g++ -o main main.cpp libMin.cpp
...multiple definition of `customMax(int, int)'
```

This occurs because the declaration and definition of int customMax(int,int) are combined in one file, and customMax.cpp is included by both libMin.cpp and main.cpp, which are two separate translation units. Each object file therefore contains its own definition. The command above is equivalent to:

```bash
$ g++ -c main.cpp -o main.o
$ g++ -c libMin.cpp -o libMin.o
$ g++ main.o libMin.o
...multiple definition of `customMax(int, int)'
```

Let’s check the symbol information in the two object files, main.o and libMin.o:

In the nm output of both files, customMax is listed with type T, meaning the symbol is defined in both object files, hence linking them yields a multiple definition error.

To resolve this issue, the solution is to maintain a single implementation and only include the declaration when used. Therefore, we employ a declaration and implementation separation mechanism.

```cpp
// customMax.h
#ifndef __CUSTOM_MAX_H__
#define __CUSTOM_MAX_H__
int customMax(int,int); // return type must match the definition
#endif
```

The definition lives in a separate source file:

```cpp
// customMax.cpp
#ifndef __CUSTOM_MAX_D__
#define __CUSTOM_MAX_D__
#include "customMax.h"
int customMax(int x,int y){
    return x>=y?x:y;
}
#endif
```

Then, in other source files, only include customMax.h:

```cpp
// main.cpp
#include "customMax.h"

extern int libMin(int,int);

int main(void){
    customMax(11,12);
    libMin(11,12);
    return 0;
}
```

And in libMin.cpp, also just include customMax.h:

```cpp
// libMin.cpp
#ifndef __LIB_MIN_H__
#define __LIB_MIN_H__
#include "customMax.h"

extern int customMax(int,int);
int libMin(int x,int y){
    return customMax(x,y)==x?y:x; // Just an example
}
#endif
```

When we compile these two translation units separately and check their symbols:

We can see that customMax is now an undefined symbol in both object files. The definition is supplied once, by compiling customMax.cpp into its own object file, and is shared by both translation units at link time:

```bash
# Generate the object file for customMax
$ g++ -c customMax.cpp -o customMax.o
```

Then, we link main.o, libMin.o, and customMax.o together:

```bash
$ g++ main.o libMin.o customMax.o -o main.exe
```

Of course, running each step separately is cumbersome (it is done here only for demonstration), so all of these operations can be executed in one line:

```bash
# Each translation unit includes only declarations, so no linking error occurs
$ g++ main.cpp libMin.cpp customMax.cpp -o main.exe
```

In summary: the separation of declaration and definition allows multiple translation units to use the same symbol without generating multiple definition errors.
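
The rule behind this can be sketched in a single file: a function may be declared any number of times, but defined only once. The section comments mark which of the files above each part stands for; the check function is a hypothetical addition.

```cpp
// one_definition_demo.cpp (hypothetical single-file sketch)
#include <cassert>

// ---- what customMax.h contributes: a declaration only ----
int customMax(int, int);
// Seeing the same declaration again (for example through a second include)
// is harmless, because repeating a declaration is allowed.
int customMax(int, int);

// ---- what libMin.cpp contributes ----
int libMin(int x, int y) { return customMax(x, y) == x ? y : x; }

// ---- what customMax.cpp contributes: the one and only definition ----
int customMax(int x, int y) { return x >= y ? x : y; }

// Usage sketch.
void check_one_definition_demo() {
    assert(customMax(11, 12) == 12);
    assert(libMin(11, 12) == 11);
    assert(libMin(12, 11) == 11);
}
```

Adding a second definition of `customMax` to this file would be rejected by the compiler; spread across two translation units, the same mistake surfaces only at link time, as shown earlier.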

Note: In C++, the definition of a template normally has to be visible wherever the template is used, so it is kept together with its declaration; this pitfall is covered in the template section below.

### C++ Template Linking

Because C++ templates are instantiated at compile time, the compiler must see a template's full definition wherever the template is used, so template code has to be distributed in source form; a library that shares templates must expose that code. From a source code perspective, template code and inline code are similar (even though templates do not need to be declared inline): the complete definition must be visible to the client code that uses it. This is known as the inclusion model, because in practice the whole template definition is placed in the header file.
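
A minimal sketch of the inclusion model follows. The first part stands for what a header would contain, the second for client code that includes it; `max_of` and the other function names are hypothetical.

```cpp
// inclusion_model_demo.cpp (hypothetical single-file sketch)
#include <cassert>
#include <string>

// ---- what a header would contain: the full template definition ----
template <typename T>
T max_of(const T& a, const T& b) { return a > b ? a : b; }

// ---- client code: each use instantiates the template for one type ----
int         largest_int()    { return max_of(123, 456); }  // max_of<int>
double      largest_double() { return max_of(1.5, 0.5); }  // max_of<double>
std::string largest_word()   { return max_of<std::string>("apple", "pear"); }

// Usage sketch.
void check_inclusion_model_demo() {
    assert(largest_int() == 456);
    assert(largest_double() == 1.5);
    assert(largest_word() == "pear");
}
```

If only a declaration of `max_of` were visible here, the compiler could not generate `max_of<int>` or `max_of<double>` in this translation unit, and the calls would end up as undefined symbols unless some other translation unit happened to instantiate them.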

Let’s look at an example where multiple translation units include the same (symbol) implementation:

```cpp
// max.cpp
#ifndef MAX_HPP_
#define MAX_HPP_
int max(const int& a,const int& b){
    return a>b?a:b;
}
#endif
```

```cpp
// main.cpp
#include "max.cpp"
extern int delegateMax(const int&,const int&);
int main()
{
    int imax=max(123,456);
    int imaxCopy=delegateMax(123, 456);
}
```

```cpp
// delegateMax.cpp
#ifndef DELEGATE_MAX_HPP_
#define DELEGATE_MAX_HPP_
#include "max.cpp"

int delegateMax(const int& a,const int& b){
    return max(a,b);
}
#endif
```

The primary issue arises with the two translation units main.cpp and delegateMax.cpp. Let’s compile and link them:

```bash
$ clang++ -c main.cpp -o main.o
$ clang++ -c delegateMax.cpp -o delegateMax.o
$ llvm-nm main.o
-------- U _Z11delegateMaxRKiS0_
00000000 T _Z3maxRKiS0_
-------- U __main
00000050 T main
$ llvm-nm delegateMax.o
00000050 T _Z11delegateMaxRKiS0_
00000000 T _Z3maxRKiS0_
$ clang++ main.o delegateMax.o -o main.exe
delegateMax.o:(.text+0x0): multiple definition of `max(int const&, int const&)`
main.o:(.text+0x0): first defined here
clang++.exe: error: linker command failed with exit code 1 (use -v to see invocation)
```

This generates a symbol max(const int&,const int&) redefinition error. However, if we modify the implementation of max(const int&,const int&) to be a template or inline implementation:

```cpp
#ifndef MAX_HPP_
#define MAX_HPP_
template<typename T>
T max(const T& a,const T& b){
    return a>b?a:b;
}
#endif
```

Compiling and linking again:

```bash
$ clang++ -c main.cpp -o main.o
$ clang++ -c delegateMax.cpp -o delegateMax.o
$ llvm-nm main.o
-------- U _Z11delegateMaxRKiS0_
00000000 T _Z3maxIiET_RKS0_S2_
-------- U __main
00000000 T main
$ llvm-nm delegateMax.o
00000000 T _Z11delegateMaxRKiS0_
00000000 T _Z3maxIiET_RKS0_S2_
$ clang++ main.o delegateMax.o -o main.exe
```

This time the link succeeds and no multiple definition error occurs.

Comparing the symbol information between the two attempts, the symbol for max is listed as T in both main.o and delegateMax.o each time. Why does the link fail in one case and succeed in the other?

The answer lies in how C++ treats definitions of ordinary functions compared with definitions that come from templates. The C++ standard describes linkage for names declared at namespace scope:

[ISO/IEC 14882:2011] A name declared in a namespace scope without a storage-class-specifier has external linkage unless it has internal linkage because of a previous declaration and provided it is not declared const. Objects declared const and not explicitly declared extern have internal linkage.

So the original max(const int&,const int&) has external linkage. It is an ordinary (non-inline) function, and the one-definition rule allows such a function to be defined only once in the whole program. Since both main.o and delegateMax.o contain a definition of the same external symbol, the linker reports a multiple definition error.

For templates, the standard says:

[ISO/IEC 14882:2014] A template name has linkage (3.5). A non-member function template can have internal linkage; any other template name shall have external linkage. Specializations (explicit or implicit) of a template that has internal linkage are distinct from all specializations in other translation units.

Note the wording: a non-member function template can have internal linkage (for example, when it is declared static), but the max template above is not declared that way, so it has external linkage just like the ordinary function. What differs is the one-definition rule: a function template, like an inline function, may be defined in every translation unit that uses it, as long as the definitions are identical. To support this, the compiler emits each implicit instantiation in a form that lets the linker keep one copy and discard the duplicates: as a weak symbol on ELF targets (shown as W by nm), or in a COMDAT section on COFF targets such as the Windows toolchain used here, where nm still prints T. That is why max<int> can appear in both object files and the link still succeeds.

This is also why template code is normally distributed in source form: the compiler can only instantiate a template for a given type when the template's definition is visible in the translation unit that uses it.
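
The sketch below puts the different cases side by side. It is a single file, so it cannot reproduce the link error itself; the comments state what would happen if each definition sat in a header included by two source files. All names are hypothetical.

```cpp
// odr_demo.cpp (hypothetical single-file sketch)
#include <cassert>

// Ordinary function: only one definition is allowed in the whole program.
// In a header included by two .cpp files, this would fail at link time.
int max_plain(const int& a, const int& b) { return a > b ? a : b; }

// inline function: may be defined in every translation unit that uses it.
inline int max_inline(const int& a, const int& b) { return a > b ? a : b; }

// Function template: may likewise be defined in every translation unit;
// the linker folds the duplicate instantiations into one.
template <typename T>
T max_tmpl(const T& a, const T& b) { return a > b ? a : b; }

// static function template: this one does have internal linkage, so each
// translation unit would get its own private instantiations.
template <typename T>
static T max_private(const T& a, const T& b) { return a > b ? a : b; }

// Usage sketch.
void check_odr_demo() {
    assert(max_plain(123, 456) == 456);
    assert(max_inline(123, 456) == 456);
    assert(max_tmpl(123, 456) == 456);
    assert(max_private(123, 456) == 456);
}
```

All four behave identically at run time; the difference shows up only in how many object files may carry a definition and in how nm reports the resulting symbols.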

### Update Log

2017.04.11

Optimized phrasing and corrected punctuation errors.
Added more examples.

2017.04.14

Added GCC toolchain compilation and linking parameters.

2017.04.16

Optimized phrasing and added more examples.
Referenced the C99 standard in the concept of Translation environment.
Added “Why should declaration be separated from definition?”

2017.04.17

Further elaborated on translation unit and the preprocessor.
Corrected potentially ambiguous descriptions and provided more examples for clarity.

2017.07.01

Added the section on C++ template linking.