C and C++ both use separate compilation to support modular programs spread across multiple source files, but why this is done and how it is achieved is a topic worth exploring. This article is not about the syntax rules that give names different linkage in C and C++; rather, it analyzes how C/C++ compilers implement the compilation and linking model.
Before going further, let’s first understand the concept of the translation environment:
[ISO/IEC 9899:1999] A C program need not all be translated at the same time. The text of the program is kept in units called source files, (or preprocessing files) in this International Standard. A source file together with all the headers and source files included via the preprocessing directive #include is known as a preprocessing translation unit. After preprocessing, a preprocessing translation unit is called a translation unit.
Previously translated translation units may be preserved individually or in libraries. The separate translation units of a program communicate by (for example) calls to functions whose identifiers have external linkage, manipulation of objects whose identifiers have external linkage, or manipulation of data files. Translation units may be separately translated and then later linked to produce an executable program.
The key concept to understand is the translation unit, which is the code that results after a source file has been processed by the preprocessor: every #define macro is expanded, conditional-compilation blocks such as #ifndef/#endif are resolved, and the files named by #include are pasted in.
For example:
// main.c
After preprocessing:
In simple terms, building a program with gcc goes through four stages:
- Preprocess (-E)
- Compile to assembly (-S)
- Assemble into an object file (-c)
- Link (no option needed)
All of this may seem convoluted, but the idea behind separate compilation is simple: it provides a modular mechanism spanning multiple source files. I can use code from other source files in my current source file without having to place all the code in a single file. It is akin to cross-referencing in books: I cite the chapter and section of another book that defines a concept, and you consult that book to find out what it means.
TCPL (The C Programming Language) introduces internal and external linkage as follows:
Within a translation unit, all declarations of the same object or function identifier with internal linkage refer to the same thing, and the object or function is unique to that translation unit. All declarations for the same object or function identifier with external linkage refer to the same thing, and the object or function is shared by the entire program.
Note that in C, a function declaration that has no storage-class specifier implicitly has external linkage.
[ISO/IEC 9899:1999] If no prior declaration is visible, or if the prior declaration specifies no linkage, then the identifier has external linkage.
This means:
extern int max(int,int);
This declaration means the same with or without the extern keyword.
If you want to explicitly give an identifier internal linkage, you can declare it static.
[ISO/IEC 9899:1999] A function declaration can contain the storage-class specifier static only if it is at file scope;
Compilation and Linking Example
In C, the extern keyword is used to specify that a name has external linkage:
// main.c
The function int max(int,int) is defined in another file:
// maxDefine.c
Using the four steps mentioned earlier, let’s manually compile and link these two source files using separate compilation.
First, preprocess one of the source files (#include/conditional compilation/macro expansion):
# Preprocessed file goes to main.i
Next, execute compilation (generate assembly code from the preprocessed file):
# Compiled result goes to main.s
Use -c to generate an object file (from assembly code):
# Saving object file as main.o
We can see that three files have been generated: main.i, main.s, and main.o.
Then we perform the same process for maxDefine.c, which similarly generates maxDefine.o. Next, let’s try linking:
# Default gcc/g++ performs linking for object files
As expected, an a.exe (a.out on Linux) is generated in the current directory, which is the executable program we need.
What is the Linking Behavior?
However, what happens in between? Why does main.c correctly find the definition of max in maxDefine.c without defining max itself?
I illustrate with a simple diagram:
Using extern int max(int,int) in main.c effectively creates a slot called max in its object file (an undefined symbol), while the definition of max in maxDefine.c acts as the key that fits that slot. Indeed, many errors in C/C++ programming are not compile errors but link errors.
If we link main.o without specifying maxDefine.o, an undefined reference error occurs:
$ gcc main.o
This is akin to trying to open a door without a key, resulting in an error.
Note that in C the function prototype and the definition in different translation units do not need to match for linking to succeed. For instance, if the definition of max in maxDefine.c is changed to int max(int x,double y), the program still links. This is because function symbols in C object files do not encode parameter types, only the function identifier. Hence, despite the differing types in main.c and maxDefine.c, the symbol for max in both object files is simply max, and the linker matches them. Whether the program then runs correctly is another matter: main.c still passes two int arguments according to the prototype it sees, no conversion to double takes place across translation units, and the C standard makes calling a function through an incompatible declaration undefined behavior, so any apparently correct result is accidental. This reliance on the programmer to keep declarations consistent is often cited as one of the reasons why C is considered freer and less safe than C++.
In C++, however, this mismatch is a link error, because main.c declares max as int max(int,int). C++ object-file symbols are produced by name mangling, which encodes both the function name and its parameter types, so int max(int,int) and int max(int,double) become two distinct symbols in the object files, leading to an undefined symbol error.
GCC Toolchain Compilation and Linking Parameters
# Preprocessing: #include and macro definitions and conditional compilation
Common Symbol Types in Object Files:
- A The symbol’s value is absolute and will not change in further linking;
- B This symbol is in the BSS segment, typically representing uninitialized global variables;
- D This symbol is in the data segment, typically representing initialized global variables;
- T This symbol is in the code segment, typically representing global non-static functions;
- U This symbol is undefined and needs to be linked from other object files;
- W A weak symbol that has not been specifically tagged as a weak object; if another linked object file provides a normal definition, that one is used; otherwise the weak definition is used, and an undefined weak symbol gets a system-defined default value;
- R The symbol is located in the read-only data area, such as a const int with file scope in C (this differs from C++, where such a const has internal linkage by default).
For more information about the symbol format in object files, refer to this article: nm Object File Format Analysis.
Why Separate Declaration and Definition?
The role and implementation of separate compilation have been discussed above, but what is the purpose of the common C/C++ practice of separating declaration from implementation? It mainly prevents the multiple definitions that arise when several source files include the same source file.
Suppose we have three source files: customMax.cpp, libMin.cpp, and main.cpp:
// customMax.cpp
Another file, libMin.cpp, pulls in customMax.cpp with #include:
// libMin.cpp
The third file, main.cpp, also includes customMax.cpp, and declares the symbol libMin from libMin.cpp with external linkage:
// main.cpp
When main.cpp is compiled and linked together with libMin.cpp, a multiple definition error occurs:
$ g++ -o main main.cpp libMin.cpp
This occurs because the declaration and implementation of int customMax(int,int) are combined in customMax.cpp, which is included in both libMin.cpp and main.cpp, and main.cpp and libMin.cpp are two separate translation units. The one-line build is equivalent to:
$ g++ -c main.cpp -o main.o
Let’s check the symbol information in the two object files, main.o and libMin.o:
In the symbol listing, the T next to customMax shows that the symbol is defined in both object files, hence linking yields a multiple definition error.
To resolve this issue, keep a single implementation and include only the declaration wherever the function is used. This is the mechanism of separating declaration from implementation.
// customMax.h
The definition lives elsewhere:
// customMax.cpp
Then, in other source files, only include customMax.h:
// main.cpp
And in libMin.cpp, also just include customMax.h:
When we compile these two translation units separately and check their symbols:
We can see that both object files now have an undefined symbol for customMax. This lets us compile the definition of customMax once and have both translation units link against it:
# Generate the object file for customMax
Then, we link main.o, libMin.o, and customMax.o together:
$ g++ main.o libMin.o customMax.o -o main.exe
Of course, the separate steps above are quite cumbersome (they are just for demonstration), so we can perform all of them in one line:
# In multiple translation units, only include declarations, and this will not yield linking errors
In summary: the separation of declaration and definition was introduced so that multiple translation units can use the same symbol without generating multiple definition errors.
Note: In C++, the declaration and definition of a template normally have to be kept together; this is another pitfall that will not be discussed for now.
C++ Template Linking
Because C++ templates are instantiated at compile time, for the types each client actually uses, template code must be distributed in source form; a shared implementation has to expose this part of the code. From a source code perspective, template code and inline code are similar (even though templates do not need to be declared inline): all the code for a template must be fully visible to the client code that uses it. This is known as the inclusion model, as we effectively place all the template definition code in the template’s header file.
Let’s look at an example where multiple translation units include the same (symbol) implementation:
// max.cpp
// main.cpp
// delegateMax.cpp
The issue arises with the two translation units main.cpp and delegateMax.cpp. Let’s compile and link them:
$ clang++ -c main.cpp -o main.o
This produces a redefinition error for the symbol max(const int&,const int&). However, suppose we change the implementation of max(const int&,const int&) to a template or an inline function:
Compiling and linking again:
$ clang++ -c main.cpp -o main.o
Now linking succeeds, with no redefinition error.
In both attempts, main.o and delegateMax.o each typically contain a definition of max. Why, then, do the linking outcomes differ?
The answer lies in how C++ treats multiple definitions of templates and inline functions, rather than in a change of linkage.
The C++ standard describes linkage for names declared at global scope:
[ISO/IEC 14882:2011] A name declared in a namespace scope without a storage-class-specifier has external linkage unless it has internal linkage because of a previous declaration and provided it is not declared const. Objects declared const and not explicitly declared extern have internal linkage.
So the original max(const int&,const int&) has external linkage, and because it is an ordinary non-inline function, the one-definition rule allows only one definition of it in the whole program. Since both main.o and delegateMax.o include its implementation, they contain the same strong symbol (T in nm) for max(const int&,const int&), leading to a redefinition error during linking.
The standard also has this to say about the linkage of templates:
[ISO/IEC 14882:2014] A template name has linkage (3.5). A non-member function template can have internal linkage; any other template name shall have external linkage. Specializations (explicit or implicit) of a template that has internal linkage are distinct from all specializations in other translation units.
Note the wording: a non-member function template can have internal linkage (for example, when it is declared static), but by default max(const T&,const T&) has external linkage just like a normal function. What differs is the one-definition rule: inline functions and function templates may be defined in more than one translation unit, provided that every definition is identical. Compilers such as GCC and Clang implement this by emitting each inline function or implicit instantiation as a weak symbol (W in nm) rather than a strong one, and the linker keeps one copy and discards the rest, which is why linking succeeds.
The reason template code needs to be distributed in source form is a separate one: the compiler must see the template definition at the point of use in order to instantiate it for the requested types.
External References
- Linkers and Loaders
- ISO/IEC 9899:1999
- Computer Systems: A Programmer’s Perspective, 2e
- Linux Multithreaded Server Programming: Using the muduo C++ Network Library
Update Log
2017.04.11
- Optimized phrasing and corrected punctuation errors.
- Added more examples.
2017.04.14
- Added GCC toolchain compilation and linking parameters.
2017.04.16
- Optimized phrasing and added more examples.
- Referenced the C99 standard in the concept of Translation environment.
- Added “Why should declaration be separated from definition?”
2017.04.17
- Further elaborated on translation unit and the preprocessor.
- Corrected potentially ambiguous descriptions and provided more examples for clarity.
2017.07.01
- Added the section on C++ template linking.