Tag: c

C preprocessor tricks

The C preprocessor is a relic from an ancient age (the 1970s). Modern languages provide better ways to do most things the C preprocessor was (and is) used for. The D programming language, for example, has removed the need for a preprocessor with “normal” import statements, conditional compilation using static if statements, and so on. In C you’re stuck with the preprocessor, but it isn’t that bad and, as a matter of fact, it can do some rather neat things.

Remember that even though the C preprocessor has many valid uses, it can be abused in even more ways, leading to weird errors or code that is hard to understand. The preprocessor has its uses, but often you should simply avoid it.

#include can include any kind of text, not just source code

The #include statement includes any kind of text, not just source code. For instance, you can use #include to include values for an array, assuming they are formatted as valid C code within a text file, e.g. 1, 2, 3, 4, 5. If this was stored in a file called values.txt, we could declare an array with these values simply by typing in the following in our C program:

int arr[]={
#include "values.txt"
};

For larger sets of data, this could come in very handy.

Put more complex #define values inside parentheses

This is not okay:

#define TWO 1+1

First of all, you already have a constant for the value 2; it looks like this: 2. And if you did want a macro for it, you should define it simply as 2, not as 1+1. But disregarding those flaws, there is yet another kind of problem. If we try to multiply TWO by, say, 3, we get 4 as the answer, even though 2 times 3 is 6. Why? Because the preprocessor changes 3*TWO to 3*1+1, and multiplication has higher precedence than addition. This is solved simply by adding parentheses around the value in the definition: #define TWO (1+1).

This is an example of the funny bugs that careless use of the preprocessor can cause.
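The difference can be sketched in a few lines. The names TWO_BAD, TWO_GOOD, tripleBad and tripleGood are made up for this illustration:

```c
/* Without parentheses, the expansion is pasted into the expression as-is. */
#define TWO_BAD  1+1
/* With parentheses, the expansion is evaluated as one unit. */
#define TWO_GOOD (1+1)

/* Expands to 3*1+1, which the compiler reads as (3*1)+1. */
int tripleBad(void){
  return 3*TWO_BAD;
}

/* Expands to 3*(1+1). */
int tripleGood(void){
  return 3*TWO_GOOD;
}
```

The same reasoning applies to macro parameters: wrapping each use of a parameter in parentheses protects it from the operators around it.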

Retrieving the name of a variable

By putting a hash (#) in front of a macro parameter, you can add the argument, exactly as it was written, as a string literal into your code. The following is a macro that prints the name of a variable coupled with its value, which could be useful while debugging:

#define PRINT_VAL(val) printf("%s=%d\n", #val, val)

Note that the above example will only work with integers because of how printf() is called.
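If you would rather build the text than print it right away, the same # trick works with snprintf(). This is only a sketch; FORMAT_VAL is a hypothetical name, and it shares the integers-only limitation:

```c
#include <stdio.h>

/* #val turns the argument's source text into a string literal, while
 * (val) is evaluated as usual. Only meant for int arguments because of
 * the %d conversion. */
#define FORMAT_VAL(buf, size, val) snprintf((buf), (size), "%s=%d", #val, (val))
```

With int answer=42; and a char line[64]; buffer, FORMAT_VAL(line, sizeof line, answer) leaves the text answer=42 in line.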

#error and #warning

You can use the #error and #warning directives to create your own compiler errors and warnings. For instance, if you would like to raise an error if someone tries to compile your program on Windows, you could add this little snippet of code:

#ifdef _WIN32
#error "Disturbing, your choice of operating system is."
#endif

You should only do this if you have a valid reason (for example, refusing to compile on Linux if you use DirectX, since Linux doesn’t support that), even though the example above could arguably be a valid reason, depending on whom you ask.

#warning is used in the same way, but it only makes the compiler print a warning and compilation continues, whereas #error halts it. Note that #warning started out as a compiler extension (GCC supports it) and only became part of the C standard with C23. You could, for instance, print a warning whenever you are compiling with debug mode on, to avoid accidentally shipping the debugging version as production software.

Debugging preprocessor macros

Most compilers are able to do only the preprocessing step of the build, allowing you to see how preprocessor macros are expanded in your code, greatly easing the hunt for bugs related to the use of the preprocessor in the code. For GCC, the option to toggle this is -E.

Compiler specific preprocessor features with #pragma

The C standard includes the #pragma preprocessor statement for compilers to define their own preprocessor features. To see which pragma directives are provided by your compiler of choice, you should consult its documentation.

Most modern compilers support the #pragma once statement. This tells the compiler to include the file only once, removing the need for the (much more verbose) header guards commonly used. Still, traditional header guards have been used for so long that programmers probably won’t stop using them any time soon, even if this alternative is more convenient (some legacy compilers may not support it, though).
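For comparison, this is roughly what a traditional header guard looks like. The file name example.h, the macro EXAMPLE_H and the struct are placeholders chosen for this sketch:

```c
/* example.h: a hypothetical header protected by a traditional guard. */
#ifndef EXAMPLE_H
#define EXAMPLE_H

/* A second #include of this file finds EXAMPLE_H already defined and
 * skips everything up to the #endif, so the struct is declared once. */
struct Point{
  int x;
  int y;
};

#endif /* EXAMPLE_H */
```

With #pragma once, the #ifndef, #define and #endif lines would be replaced by a single #pragma once at the top of the file.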

__FILE__ and __LINE__

__FILE__ will insert the name of the current file, and __LINE__ will insert the line number of the current line of the source code. These can be very useful for generating debug information, especially if you are unable to use a debugger (think kernel programming; technically, it is possible to use a debugger for OS kernels, but it isn’t as easy and practical as it is for user-space programming).
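A small sketch of how the two can be combined; the macro name WHERE is an assumption made for this example, not something the standard provides:

```c
#include <stdio.h>

/* Writes "file:line" for the place where the macro is used into buf.
 * Because this is a macro, __FILE__ and __LINE__ expand at the point
 * of use, not at the definition. Evaluates to what snprintf() returns. */
#define WHERE(buf, size) snprintf((buf), (size), "%s:%d", __FILE__, __LINE__)
```

Wrapping the two in a macro rather than a function is the point here: inside a function they would always report the function's own file and line.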

__DATE__ and __TIME__

__DATE__ and __TIME__ can be used to insert the current date and time (at the time of preprocessing) into the code, respectively. You could for instance show this information in the version information for your program:

printf("%s v%s\nCompiled on %s at %s\n", PRG_NAME, PRG_VERSION, __DATE__, __TIME__);

Void pointers in C

Void pointers are pointers pointing to some data of no specific type.

A void pointer is defined like a pointer of any other type, except that void* is used for the type:

void *pt;

You can’t directly dereference a void pointer; you must cast it to a pointer with a specific type first, for instance, to a pointer of type int*:

*(int*)pt;

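Thus, to store a value through a void pointer, the pointer must first point at an object of the matching type (here an int i and a float f that are declared elsewhere), and you then assign through a cast:

pt=&i;
*(int*)pt=42;
pt=&f;
*(float*)pt=3.14f; /* The same pointer can point at data of any type */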

Void pointers are mainly used to allow for generic types. You can create data structures that can hold generic values, or you can have functions that take arguments of no specific type. If you wanted a linked list allowing for generic values, you would define your list node like this:

struct ListNode{
  struct ListNode *next;
  void *data;
};

A generic function for doubling a value might look like the following:

#define TYPE_INT 0
#define TYPE_FLOAT 1
void doubleVal(int type, void *var){
  if(type==TYPE_INT){
    *(int*)var*=2;
  } else if(type==TYPE_FLOAT){
    *(float*)var*=2;
  }
}

Called, rather obviously, like the following:

doubleVal(TYPE_INT, &integer);
doubleVal(TYPE_FLOAT, &floatingPoint);

You can’t perform pointer arithmetic on void pointers in standard C, since the compiler doesn’t know the size of the data being pointed to. You may, however, cast a void pointer to a pointer of some other type and perform pointer arithmetic on that.
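As a sketch of that last point, a generic helper can step through an array of any element type by casting to unsigned char* and counting in bytes. The function name elementAt is hypothetical:

```c
#include <stddef.h>

/* Returns a pointer to element number index in an array whose elements
 * are elemSize bytes each. The void pointer is cast to unsigned char*
 * first, so the arithmetic is done in bytes. */
void *elementAt(void *base, size_t index, size_t elemSize){
  return (unsigned char*)base + index*elemSize;
}
```

The caller supplies the element size, typically with sizeof, and casts the result back to the real pointer type before dereferencing it.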

The most useful GCC options and extensions

This post contains information about some of the most useful GCC options. It is meant for people new to GCC, but you should already know how to compile and link with GCC and do the other basic things; this is no introduction. If you want an introductory tutorial to GCC, look for one before reading on.

This article attempts to cover, among a few other things:

  • Optimization
  • Compiler warning options
  • GCC-specific extensions to C
  • Standards compliance
  • Options for debugging
  • Runtime checks (e.g. stack overflow protection)

For more information on GCC, the freely available An Introduction to GCC book is pretty good. A manual with over 700 pages is available as well (it’s a reference, not a tutorial, though) from the GCC website. The manpages for GCC (man gcc) can also be useful.

Basic compiler warning and error options

An option many programmers always use while compiling C programs is -Wall. It enables several compiler warnings not enabled by default, such as -Wformat, which warns about incorrect format strings. To enable even more warnings, use the -Wextra option. All warnings can be turned off with -w. More warnings make it easier to catch potential bugs, but may also raise the number of false positives. The exact implications of these different options, as well as the individual warnings, can be found in the GCC documentation.

To treat compiler warnings as errors, use the -Werror option. To stop compilation when the first error occurs, use -Wfatal-errors.

Standards compliance

By default, GCC may compile C code that is not necessarily standards-compliant, and it might even reject code that complies with the C standard (“the C standard” here means either C89 or C99). Some C standard features are disabled by default, such as trigraphs (which can be enabled with -trigraphs), and several GCC extensions (discussed later in this article) will work even though they aren’t part of the official C standard.

The -ansi option can be used to make GCC correctly compile any valid C89 program (if it doesn’t, that is due to a compiler bug). It will still accept some GCC extensions (those that aren’t incompatible with the standard); use the -pedantic option to make GCC a pedant when it comes to standards compliance. The -std= option can be used to set the specific standard. There’s a bunch of supported standards (and most standards have several valid names), but the important ones to know are c89 (equal to the earlier -ansi option), c99, gnu89 (C89 with GCC extensions, the default in older GCC releases; newer releases default to a later GNU dialect) and gnu99 (C99 with GCC extensions). You can also use c11 for the C11 standard (accepted as c1x while the standard was still a draft), or gnu11 for the same with GCC extensions.

Code optimization levels

You can set the code optimization level for GCC, which decides how aggressively GCC will optimize the code. By default, GCC tries to compile fast, so no optimizations are made. With an optimization level set, GCC spends more time compiling, and the code might be harder to debug, but the result is better optimized, possibly giving a faster program and/or a smaller binary. Because of the longer compile time and the harder debugging, it can be a good idea not to optimize during development and to save that for building the production binary.

The default optimization(less) level is set with the -O0 option, or by giving no optimization option at all.

Some of the most common optimization forms can be activated by using the -O1 (or simply just -O) option. This option tries to produce smaller and faster binaries, and in many cases it can compile faster than -O0 because some optimizations will simplify the program for the compiler as well.

The next level is, perhaps unsurprisingly, -O2. It tries to improve the speed of programs even more than -O1 does, without increasing the size. Compiling can take considerably longer. This level is recommended for production releases as it optimizes well for speed without sacrificing space.

The -O3 level does some of the heaviest, most time consuming optimizations. It may also increase the size of the binary. In some cases, the optimizations may backfire and actually produce a slower binary.

If you want a small binary (most of all), you should use the -Os option.

Just pre-process, compile or assemble

When you ask GCC to compile a C program, the following steps are usually taken:

  1. Pre-processing
  2. Compilation
  3. Assembling
  4. Linking

For different reasons, you may want to stop at some of the steps. You might just want to pre-process, for instance, to find an error you suspect comes from a faulty pre-processor directive. If you do so, you will see the output from the pre-processor instead of getting the complete finished binary. Likewise, you may wish not to link because you are going to link manually later on, or maybe you just want to get the assembly output, modify that in some way and then manually assemble and link it. Why you would want to do that isn’t the important thing here, though; how to do it is.

To only pre-process, you should use the -E option. To stop after compilation, use -S. To do all steps but linking, use -c.

Controlling assembly output

Normally GCC produces AT&T syntax assembly output, but if you want to use Intel’s syntax (which is, in my opinion, much more readable), you should set the assembly dialect with the -masm= option, with intel as the value (-masm=intel). Note that this won’t work on Mac OS X.

A useful option for making the assembly code more readable is -fverbose-asm, which adds comments to the assembly output.

Adding debug information

If you are going to debug your program later, and don’t want to debug the assembly version, the -g option is absolutely essential. It adds debugging information, so that you can do source-level debugging on the binary later. The -g option produces debug information in the platform’s native format, which GDB understands; the GDB-specific extras you can request with -ggdb will not necessarily work with other debuggers.

You can set the level of debug information to generate. The default level is 2. With -g1 you can inform GCC to produce minimal debugging information, and with -g3 you can tell GCC that you want even more debug information than what you get by default.

Adding runtime checks

GCC can add different runtime checks to C programs at compilation, making debugging easier and avoiding some of the most common security vulnerabilities in C programs (as long as vulnerabilities/bugs don’t exist in the checking…). Note that runtime checks can degrade performance of programs.

There is an incredibly useful GCC option, -fmudflap, which instruments pointer operations so that many pointer-related errors are caught at run time. (Mudflap was removed in GCC 4.9; newer releases offer -fsanitize=address for a similar purpose.)

Stack overflow protection can be enabled by using -fstack-check.

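For C programs, the -ftrapv option makes signed integer overflow in addition, subtraction and multiplication trap at run time (the similar -gnato option only applies to Ada code compiled with GNAT).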

GCC extensions to the C language

GCC provides several extensions to the C programming language that aren’t actually parts of the C standard. You should always be careful while using non-standard features as that would, in most cases, make your code incompatible with other compilers. Anyway, I will cover some of the most useful extensions GCC provides, and you decide if you use them or not.

All extensions can be found in the GCC documentation.

Likely and unlikely cases

One GCC extension frequently used in the Linux kernel is the __builtin_expect built-in function, commonly wrapped in the likely() and unlikely() macros. The Linux kernel uses something like the following to tell GCC which if statements are likely and unlikely to execute, so that GCC can arrange the generated code to favor the common path:

/*This is the likely case which will occur most of the time*/
if(likely(x>0)){
  return 1;
}
/*This is the unlikely case which will occur much more
 *seldom than the earlier case*/
if(unlikely(x<=0)){
  return 0;
}

The likely() and unlikely() macros are defined in the Linux kernel as:

#define likely(x)       __builtin_expect(!!(x),1)
#define unlikely(x)     __builtin_expect(!!(x),0)

If you want to use this outside the Linux kernel, you could always type __builtin_expect(condition, 1) for likely cases and __builtin_expect(condition, 0) for unlikely cases, but it would be much easier to use the same macros as the Linux kernel uses.
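Outside the kernel, a guarded definition keeps the code compiling on compilers that lack the built-in. The following is a sketch under that assumption; safeDivide is a hypothetical function used only to show where a hint might go:

```c
/* Fall back to plain expressions on compilers without __builtin_expect. */
#if defined(__GNUC__)
#define likely(x)   __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)
#else
#define likely(x)   (x)
#define unlikely(x) (x)
#endif

/* Divides a by b and stores the result in *out. Returns 0 on success
 * and -1 if b is zero, which is marked as the unlikely path. The hint
 * only affects how the code is laid out, never the result. */
int safeDivide(int a, int b, int *out){
  if(unlikely(b==0)){
    return -1;
  }
  *out=a/b;
  return 0;
}
```

Hints like these are only worth adding on paths where you actually know which outcome dominates; a wrong hint can make things slower.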

Additional datatypes

GCC provides some additional datatypes to the C programming language that are not defined by the standard, such as 80- and 128-bit floating point types. Note that these are not supported on all architectures (they are supported on common x86 and x86_64 systems, though). On ARM platforms, half-precision (16-bit) floating point values are supported.

Ranges in switch/case

A GCC extension provides the support for ranges in switch/case statements, so you can have a case for values between 10 and 1000, for instance. A range is defined as x ... y, where x is the lower-bound and y is the upper-bound. You may not leave the spaces before and after the dots out. An example switch statement using cases with ranges:

switch(x){
  case 0 ... 9:
    puts("One digit"); break;
  case 10 ... 99:
    puts("Two digits"); break;
  case 100 ... 999:
    puts("Three digits"); break;
  default:
    puts("I sense a disturbance in the force");
}

This is more convenient than writing 1000 different cases (but you wouldn’t solve the problem like that, would you?).

Binary literals

GCC supports binary literals in C programs using the 0b prefix, pretty much like you would use 0x for hexadecimal literals. In the following example, we initialize an integer using a binary literal for its value:

int integer=0b10111001;

Book recommendations for C programmers

The following is a list of recommendations on good reads for programmers in the C programming language:

The C Programming Language, aka the K&R

The classic book, describing all of ANSI C in roughly 200 pages. Written by Dennis Ritchie, who created C, and Brian W. Kernighan. Definitely a book that every C programmer should read and have in their library.

C Programming: A Modern Approach

This is the book I learned C from, and it is definitely one of the best technical books I have ever read. It is around 900 pages long, and very comprehensive. It covers both C89 and C99, and it also tells you about coding best practices and warns you about common gotchas.
It uses graphics to explain many concepts and in general, it is easy to read and understand, while it doesn’t skip details because the author felt like they were “unnecessary”. It has many high-quality exercises and programming projects, and each chapter ends with a question-and-answer part where common questions are answered. If anyone asked me about a book to learn C from, this is what I would suggest.

The C Puzzle Book

The C Puzzle Book engages the reader in some C puzzles, where knowledge of C’s darker corners might be necessary. It is an entertaining read, and you will learn a lot about C from it.

The Standard C Library

This rather old jewel explains all of the C standard library and how to use it, but it doesn’t stop there: it shows full sample source code for all of the standard library as well! If you want to understand C’s standard library, this would be the book to get.

Expert C Programming

Explains how to code like a C expert (as far as a book is able to explain that; the rest is about practice, practice, practice…). Tells you about the secrets that make a programmer an expert at C.

C in a Nutshell

This is the book that I use as my C reference and the book which I look into when I need documentation (and don’t have a working Internet connection and manpages aren’t sufficient).

C Traps and Pitfalls

C isn’t a language that is going to play nice with you. It has many hidden traps that are hazardous if you are not aware of them. This book dissects those, making you a more confident C programmer (hopefully). Maybe C will stop blowing up in your face as well ;-)

Don’t forget…

These were all books on C, but neither the programming language nor the knowledge of it makes the programmer. You will need practice, and knowledge of many other topics, such as algorithms, data structures and program design, but these books will give you a solid foundation in the C programming language.

Bytes and bitwise operators in C

Bitwise operations have many uses. I asked a question a few months ago at programmers.stackexchange.com, where I was taught that. The answer which I accepted contained the following list of uses (credit goes to user whatsisname):

* Juggling blocks of bytes around that don’t fit in the programming languages data types
* Switching encoding back and forth from big to little endian.
* Packing 4 6bit pieces of data into 3 bytes for some serial or usb connection
* Many image formats have differing amounts of bits assigned to each color channel.
* Anything involving IO pins in embedded applications
* Data compression, which often does not have data fit nice 8-bit boundaries.
* Hashing algorithms, CRC or other data integrity checks.
* Encryption
* Pseudorandom number generation
* Raid 5 uses bitwise XOR between volumes to compute parity.
* Tons more

I could myself further add the following:

  • You can improve performance
  • You can decrease system memory usage

Since I asked the question, I have played around with them, and additionally, I have started programming at a much lower level than before. I will now present an introduction to bitwise operators in C and a refresher about binary numbers. In the end, I will show and explain an implementation of a boolean datatype, with 8 booleans per byte, that is, 100% of the bits will get used. Programmers who know about the bitwise operators can skip ahead to that.

Bits and bytes

As you (should) know, everything in most modern computers is stored as sequences of bits, often represented as 0’s and 1’s (though really it is about differences in electrical charges, but 1’s and 0’s are handier representations). A byte consists of 8 bits, for instance 1001 1101. The first bit, read from the right, represents a 1, the second a 2, the third a 4, the fourth an 8 etc. In decimal, 1001 1101 would be (1*128)+(0*64)+(0*32)+(1*16)+(1*8)+(1*4)+(0*2)+(1*1), which is 128+16+8+4+1, or 157. Typing something like 1001 1101 can get a bit tedious after some time. Programmers don’t use base 10 to simplify this though (10 is not a power of 2), but base 16 (hexadecimal), which, as a power of 2, has a pretty neat relation to base 2 (binary): each group of 4 bits corresponds to exactly one hexadecimal digit.

You can represent every combination one byte might have using 2 digits in hexadecimal. There are 16 digits in hexadecimal (including 0), and each of the two groups of 4 bits which we divided the byte into earlier can be represented by one digit. The first digit, coming from 1001, would be 9 and the second, coming from 1101, would be D (13; hexadecimal uses the letters A-F for values higher than 9). Let’s check this: 0x9D (programmers often prefix numbers represented in base 16 with 0x) would be (9*16)+(13*1)=144+13=157. Yes, that is correct.
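The place-value arithmetic above can also be written as a small function. This is just a sketch for checking such conversions; parseBinary is a name made up here:

```c
/* Converts a string of '0' and '1' characters to its value. Any other
 * character, such as a space between groups of four, is skipped. */
unsigned int parseBinary(const char *bits){
  unsigned int value=0;
  for(; *bits!='\0'; bits++){
    if(*bits=='0' || *bits=='1'){
      /* Every new digit doubles what we have so far, then adds the bit. */
      value=value*2+(unsigned int)(*bits-'0');
    }
  }
  return value;
}
```

Calling parseBinary("1001 1101") gives 157, the same value as the hexadecimal literal 0x9D.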

Now, how do you represent things other than numbers? Well, RGB color values, for instance, are represented using 3 bytes, one for red, one for green and one for blue. Each has 256 different combinations (a value of 0 means none of that color and 255 means all of that color), and thus RGB can be used to represent 256^3=16,777,216 combinations, roughly 16.8 million. An example, some sort of grey, would be 0x888888.

Another example is characters. Characters are most of the time represented in ASCII or Unicode (Unicode is basically a superset of ASCII containing characters for many international alphabets etc.). ASCII has 128 characters (it uses 7 bits). Different characters have different values, for instance, an upper-case A is 65 (0x41). In C, strings are null-terminated; strings (should) end with a null character (ASCII character 0). The string ABC would have a 0x41 byte, followed by a 0x42 byte and a 0x43 byte, ended with a 0x00 byte (the null character). In memory, you could have a chunk of 4 bytes representing that string, like 01000001 01000010 01000011 00000000. Note that there is nothing that says that these are four characters that should be used as a string; you could use them as one 32-bit integer, or something else. Memory has no meaning at the byte level.

Big and little endian

There are two byte orders commonly used to store multi-byte values in memory on modern machines, called big endian and little endian. In big endian, the most significant byte comes first, and in little endian, the least significant byte comes first. Big endian is what we humans use most of the time; for instance, in 113, one hundred is the most significant digit, and it comes first. One hundred thirteen would have been 311 if it was written in little endian form. Intel (and compatible) processors use little endian. Note that bits in bytes are stored the same way in both formats (the most significant bit is first); only the order in which whole bytes come in collections of bytes is altered by this. A 32-bit integer with the value 303153 would be stored as 00 04 A0 31 in big endian form and as 31 A0 04 00 in little endian form.

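C deals with endianness for you as long as you work with values rather than with their individual bytes in memory, so most of the time you won’t have to worry about it. It does matter when you exchange binary data with other machines (files, network protocols), and Assembly programmers, IT security folks and reverse engineers have to keep it in mind as well.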
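When the byte order does matter, for example when writing a number to a file in a fixed format, the bytes can be picked out one at a time with the bitshift operator covered further down. A minimal sketch; the function names are assumptions made for this example:

```c
#include <stdint.h>

/* Stores value in out[] with the most significant byte first. */
void toBigEndian(uint32_t value, unsigned char out[4]){
  out[0]=(unsigned char)(value >> 24);
  out[1]=(unsigned char)(value >> 16);
  out[2]=(unsigned char)(value >> 8);
  out[3]=(unsigned char)value;
}

/* Stores value in out[] with the least significant byte first. */
void toLittleEndian(uint32_t value, unsigned char out[4]){
  out[0]=(unsigned char)value;
  out[1]=(unsigned char)(value >> 8);
  out[2]=(unsigned char)(value >> 16);
  out[3]=(unsigned char)(value >> 24);
}
```

Because the shifts work on the value and not on its layout in memory, both functions give the same bytes no matter which byte order the machine itself uses. For 303153 they produce 00 04 A0 31 and 31 A0 04 00, respectively.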

Integer literals in C

C provides several options which you can use when writing down integer literals. We will only concern ourselves with the ability to use different bases. If you type out an integer “like you normally would”, C will interpret it as a decimal value (e.g. 73). If it has a leading 0, C will think that it is a base 8 number (e.g. 0777). We are not concerned with base 8 numbers here, but some older computers used them. Hexadecimal literals start with 0x, followed by a value in hex, with no spaces in between (e.g. 0xF7).

Sadly, standard C (before C23) has no way to write binary literals, although some compilers, GCC among them, accept a 0b prefix as an extension.

The bitwise operators

A bitwise operation is an operation that works on the individual bits of a byte (or a composition of bytes). We will now look at the bitwise operators which are available to C programmers.

Bitshift

C has two bitshift operators, left- and right bitshift, or << and >>, respectively.

Both work the same way: they shift the bits n steps, either to the left or to the right (I will let you guess yourself which one shifts in which direction). For non-negative values, and as long as no set bits are shifted out, a bitshift left by one step is the same as integer multiplication by 2, and a bitshift one step to the right is the same as integer division by 2. The following (bitshift 3 steps to the left) is equal to integer multiplication by 8:

0x2A << 3 /*0x2A=00000000 00101010, (0x2A << 3)=00000001 01010000*/

Note that you set how many steps you want to do your bitshift on the right side of the operator. You must always set the amount of steps.

We can create a simple function for calculating the n:th power of 2 using this:

int powerOf2(int n){
  return (1 << n);
}
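One caveat: shifting by a negative amount, or by at least as many bits as the type has, is undefined in C. A more defensive sketch could use an unsigned type and check the range first; powerOf2Checked is a hypothetical name, and returning 0 for "does not fit" is just one possible convention:

```c
#include <limits.h>

/* Returns 2 to the power of n, or 0 if the result does not fit in an
 * unsigned int (shifting by the full width or more is undefined). */
unsigned int powerOf2Checked(unsigned int n){
  if(n >= sizeof(unsigned int)*CHAR_BIT){
    return 0;
  }
  return 1u << n;
}
```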

Bitwise AND

Bitwise AND takes two numbers and returns a new one, where 1's are located at and only at places where both values had 1's:

10011011 AND        11001100 AND
10110101            00011011
--------            --------
10010001            00001000

The operator is an ampersand (&). So we can for instance run the following:

19 & 18

We can break this down to:

19 & 18=   00010011 &
           00010010
           --------
           00010010   (=18)

So we can use this to find common 1's for two bytes, but what is that good for? We can for instance check if a number is odd by doing:

XXXXXXXX
00000001
--------
0000000?

The result is non-zero only when the lowest bit is "on", which is exactly the case when the number is odd. The following is this as a C function (it could also have been implemented using the modulo operator, as it is normally done):

int isOdd(int x){
  return(1 & x);
}

Bitwise OR

Bitwise OR, compared to bitwise AND, is happy if just one of the numbers has a 1 at a location (both may have it though). Some examples:

10011011 OR         11001100 OR
10110101            00011011
--------            --------
10111111            11011111

The operator for bitwise OR is a pipe (|). We can run something like the following in C:

35 | 92

Broken down:

35 | 92=   00100011 |
           01011100
           --------
           01111111   (=127)
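A common practical use of OR together with AND is a set of flags, where each option gets its own bit. The following is only a sketch; the PERM_ constants and both function names are invented for this example:

```c
/* Hypothetical permission flags, one bit each. */
#define PERM_READ    0x1u /* 00000001 */
#define PERM_WRITE   0x2u /* 00000010 */
#define PERM_EXECUTE 0x4u /* 00000100 */

/* Returns perms with the bits of flag switched on. */
unsigned int grant(unsigned int perms, unsigned int flag){
  return perms | flag;
}

/* Returns 1 if every bit of flag is set in perms, otherwise 0. */
int hasPermission(unsigned int perms, unsigned int flag){
  return (perms & flag)==flag;
}
```

OR switches bits on without disturbing the others, and AND with the same mask checks whether they are on.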

Bitwise XOR

The bitwise XOR (exclusive or) is like bitwise OR, but exactly one of the numbers must have a 1 at a given position in order to return a 1 at that position. Examples:

10011011 XOR        11001100 XOR
10110101            00011011
--------            --------
00101110            11010111

A hat (^) is the bitwise XOR operator in C.
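XOR has the handy property that applying the same value twice gives back the original, which is the idea behind the parity use mentioned in the list at the top. A toy sketch with single bytes instead of whole disks; the function names are made up here:

```c
/* Computes a parity byte for two data bytes. */
unsigned char parityOf(unsigned char a, unsigned char b){
  return (unsigned char)(a ^ b);
}

/* Rebuilds a lost byte from the parity and the byte that survived.
 * This works because (a ^ b) ^ b is a again. */
unsigned char recover(unsigned char parity, unsigned char survivor){
  return (unsigned char)(parity ^ survivor);
}
```

Using the first example above, the parity of 10011011 and 10110101 is 00101110, and XORing that parity with either byte gives back the other one.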

Bitwise NOT

Bitwise NOT basically flips the value of each position (0 becomes 1 and vice versa). Examples:

11010111 NOT = 00101000
01011001 NOT = 10100110

It is the only unary bitwise operator in C and it is a tilde (~), which you should place in front of the value to "invert".
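One thing to watch out for in C is that small types such as char are promoted to int before ~ is applied, so the result has more than 8 bits. A sketch of inverting a single byte, with a made-up function name:

```c
/* Flips all 8 bits of a byte. The operand of ~ is promoted to int
 * first, so the result is masked back down to its lowest 8 bits. */
unsigned char invertByte(unsigned char value){
  return (unsigned char)(~value & 0xFF);
}
```

With this, 11010111 (0xD7) becomes 00101000 (0x28), matching the first example above.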

Booleans that use memory efficiently

In C, it is customary to use integers to store boolean values (C89 has no "real" boolean datatype, and the _Bool type added in C99 still takes up at least a whole byte), where 0 is false and everything else is true. So, as an example, 00000000 00000000 00000000 00000000 is false, and e.g. 00000000 00000000 00000000 00000001 is true. What do we do with the other 31 bits, or the other 96.875% of the 4 bytes? Umm... nothing, actually. Using a char would be a bit better (24 bits better, actually), wasting "only" 7 out of 8 bits (87.5%). That is still not very good; one char has enough room for 8, not 1, boolean values. While this doesn't matter that much on modern home computers, on some embedded computers it can be the difference between success and failure. What we are going to do is to have a char variable, and then we will add functions allowing us to easily use that single variable as if it was 8 different boolean values.

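We will use a typedef to create our boolean variable (which is just an unsigned char in disguise; unsigned, so that the highest bit behaves predictably when shifted):

typedef unsigned char bool;

You can change this to a wider unsigned type to allow for more than 8 values. Note that the name bool clashes with <stdbool.h>, so don't include that header alongside this typedef.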

We will also define macros for TRUE and FALSE:

#define TRUE 1
#define FALSE 0

Our boolean will act like an array, to some extent. A simple function to get one individual value would be the following, getBool, which takes 2 arguments; the variable and an "index":

bool getBool(bool boolean, int index){
  return ((boolean >> index) & 1);
}

What this does is basically to shift the value we are interested in to the rightmost position, and then it uses AND to clear all bits but the last. If anything is left, we will have a 1, which is true, and otherwise a 0, false.
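To try the shift-and-mask idea on its own, here is a self-contained sketch. It uses the names PackedBools and getFlag, which are made up here so that the example does not depend on the typedef above:

```c
/* Eight flags stored in one byte. */
typedef unsigned char PackedBools;

/* Returns 1 if the flag at index (0 to 7) is set, otherwise 0. */
int getFlag(PackedBools flags, int index){
  return (flags >> index) & 1;
}
```

For a byte holding 00000101, indexes 0 and 2 give 1 and every other index gives 0.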

We will also have a function called setBool, with 3 arguments, a pointer to the boolean, the "index" of the boolean we want to access and the value we want to set at that position:

void setBool(bool *boolean, int index, bool value){
  if(value==0){
    *boolean=*boolean & ~(1<<index);
  } else {
    *boolean=*boolean | (1<<index);
  }
}

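Line 3 clears the bit at index. 1<<index is a value with only that one bit set, ~ inverts it into a mask with every bit set except that one, and the AND keeps all other bits of *boolean unchanged:

*boolean=*boolean & ~(1<<index);

Line 5 sets the bit at index. ORing with 1<<index switches that one bit on and leaves the others as they were:

*boolean=*boolean | (1<<index);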

I apologise if my solution is suboptimal; I'm new to low-level programming. And that was it.