Bytes and bitwise operators in C

Bitwise operations have many uses. I asked a question a few months ago at programmers.stackexchange.com, where I was taught that. The answer which I accepted contained the following list of uses (credit goes to user whatsisname):

* Juggling blocks of bytes around that don’t fit in the programming languages data types
* Switching encoding back and forth from big to little endian.
* Packing 4 6bit pieces of data into 3 bytes for some serial or usb connection
* Many image formats have differing amounts of bits assigned to each color channel.
* Anything involving IO pins in embedded applications
* Data compression, which often does not have data fit nice 8-bit boundaries.\
* Hashing algorithms, CRC or other data integrity checks.
* Encryption
* Psuedorandom number generation
* Raid 5 uses bitwise XOR between volumes to compute parity.
* Tons more

I could myself further add the following:

  • You can improve performance
  • You can decrease system memory usage

Since I asked the question, I have played around with them, and additionally, I have started programming at a much lower level than before. I will now present an introduction to bitwise operators in C and a refresher about binary numbers. In the end, I will show and explain an implementation of a boolean datatype, with 8 booleans per byte, that is, 100% of the bits will get used. Programmers who know about the bitwise operators can skip ahead to that.

Bits and bytes

As you (should) know, everything in most modern computers is stored as sequences of bits, often represented as 0′s and 1′s (though really it is about differences in electrical charges, but 1′s and 0′s are more handy representations). A byte consists of 8 bits, for instance 1001 1101. The first bit, read from the right, represents a 1, the second a 2, the third a 4, the fourth an 8 etc. In decimal, 1001 1101 would be (1*128)+(0*64)+(0*32)+(1*16)+(1*8)+(1*4)+(0*2)+(1*1), which is 128+16+8+4+1, or 157. Typing something like 1001 1101 can get a bit tedious after some time. Programmers don’t use base 10 to simplify this though (10 is not a power of 2), but base 16 (hexadecimal), which, as a power of 2, has a pretty neat relation between base 2 (binary).

You can represent every combination one byte might have using 2 numbers in hexadecimal. There are 16 numbers in hexadecimal (including 0), and each of the two groups which we divided the byte into earlier can be represented by one number. The first number, coming from 1001 would be 9 and the second, coming from 1101 would be D (13, hexadecimal uses the letters A-F when using numbers higher than 9). Let’s check this, 0x9D (programmers often prefix numbers represented in base 16 with 0x) would be (9*16)+(13*1)=144+13=157. Yes, that is correct.

Now, how do you represent things other than numbers? Well, RGB color values, for instance, are represented using 3 bytes, one for red, one for green and one for blue. Each has 256 different
combinations (a value of 0 means none of that color and 255 means all of that color), and thus RGB can be used to represent 256^3=16 777 216 combinations, roughly 16,8 millions. An example, some sort of grey, would be 0x888888.

Another example is characters. Characters are most of the time represented in ASCII or Unicode (Unicode is basically a superset of ASCII containing characters for many international alphabets etc.). ASCII has 128 characters (uses 7-bits). Different characters have different values, for instance, an upper-case A is 65 (0x41). In C, strings are null-terminated; strings (should) end with a null-character (ASCII character 0). The string ABC would have a 0x41 byte, followed by a 0x42 byte, 0x43 byte, ended with a 0x00 byte (the null character). In memory, you could have a chunk of 4 bytes representing that string, like 01000001 01000010 01000011 00000000. Note that there is nothing that says that this is four characters that should be used as a string, you could use it like one 32-bit integer, or something else. Memory has no meaning at the byte level.

Big- and little endians

There are two orders in which you store bytes in memory that are common in modern machines, these called big- and little endian. In big endian, the most significiant byte comes first, and in little endian, the least significant number comes first. Big endian is what we humans use most of the time, for instance, in 113, one hundred is the most significant number, and it comes first. One hundred thirteen would have been 311 if it was written in little endian form. Intel (and compatible) processors use little endian. Note that bits in bytes are stored the same way in both formats (the most significiant bit is first), only the order in which whole bytes come in collections of bytes is altered by this. A 32-bit integer with the value 303153 would be stored as 00 04 A0 31 in big endian form and as 31 A0 04 00 in little endian form.

C deals with endianness for you and you won’t have to worry about it. Only Assembly programmers do have to worry about this in general (and IT security folks and reverse engineers and some more).

Integer literals in C

C provides several options which you can use when writing down integer literals. We will only concern us with the ability to use different bases. If you type out an integer “like you normally would”, C will interpret it as a decimal value (e.g. 73). If it has a leading 0, C will think that it is a base 8 number (e.g. 0777). We are not concerned with base 8 numbers here, but some older computers used them. Hexadecimal literals start with 0x, followed by a value in hex, with no spaces in between (e.g. 0xF7).

Sadly, C has no way to define binary literals.

The bitwise operators

A bitwise operation is an operation that works on the individual bits of a byte (or a composition of bytes). We will now look at the bitwise operators which are available to C programmers.

Bitshift

C has two bitshift operators, left- and right bitshift, or << and >>, respectively.

The action of both of these is common. They shift n bits, either to the left or to the right (I will let you guess yourself which one shifts to which direction). Bitshift left by one step is the same as integer multiplication by 2, and bitshift one step to the right is the same as integer division by 2. The following (bitshift 3 steps to the left) is equal to integer multiplication by 8:

0x2A << 3 /*0x2A=0000000 00101010, (0x2A << 3)=00000001 01010000*/

Note that you set how many steps you want to do your bitshift on the right side of the operator. You must always set the amount of steps.

We can create a simple function for calculating the n:th power of 2 using this:

int powerOf2(int n){
  return (1 << n);
}

Bitwise AND

Bitwise AND takes two numbers and returns a new one, where 1's are located at and only at places where both values had 1's:

10011011 AND        11001100 AND
10110101            00011011
--------            --------
10010001            00001000

The operator is an ampersand (&). So we can for instance run the following:

19 & 18

We can break this doẃn to:

19 & 18=   00010011 &
           00010010
           --------
           00010010   (=18)

So we can use this to find common 1's for two bytes, but what is that good for? We can for instance check if a number is odd by doing:

XXXXXXXX
00000001
--------
0000000?

The only case when you get a non-zero number is when the 1 is "on", and that is the only case a number is odd. The following is this as a C function (it could also have been implemented using the modulo operator, as it is normally done):

int isOdd(int x){
  return(1 & x);
}

Bitwise OR

Bitwise OR, compared to bitwise AND, is happy if just one of the numbers has a 1 at a location (both may have it though). Some examples:

10011011 OR         11001100 OR
10110101            00011011
--------            --------
10111111            11011111

The operator for bitwise OR is a pipe (|). We can run something like the following in C:

35 | 92

Broken down:

35 | 92=   00100011 |
           01011100
           --------
           01111111   (=127)

Bitwise XOR

The bitwise XOR (exclusive or) is like bitwise OR, but only one of the numbers can have a 1 at a given position in order to return a 1 at that position. Examples:

10011011 XOR        11001100 XOR
10110101            00011011
--------            --------
00101110            11010111

A hat (^) is the bitwise XOR operator in C.

Bitwise NOT

Bitwise NOT basically flips the value of each position (0 becomes 1 and vice verse). Examples:

11010111 NOT = 00101000
01011001 NOT = 10100110

It is the only unary bitwise operator in C and it is a tilde (~), which you should place in front of the value to "invert".

Booleans that use memory efficiently

In C, it is customary to use integers to store boolean values (no "real" boolean datatype exists), where 0 is false and everything else is true. So, as an example, 00000000 00000000 00000000 00000000 is false, and e.g. 00000000 00000000 00000000 00000001 is true. What do we do with the other 31 bits or the other 96.875% of the 4 bytes? Umm... nothing, actually. Using a char would be a bit better (24 bits better, actually), wasting "only" 7 out of 8 bits (87.5%). That is still not very good, one char has enough room for 8, not 1, boolean values. While this doesn't matter that much on modern home computers, on some embedded computers this is the difference between success and failure; a considerable difference. What we are going to do is to have a char variable, and then we will add functions allowing us to easily use that single variable as if it was 8 different boolean values.

We will use a typedef to create our boolean variable (which is just a char in disguise):

typedef char bool;

You can change this to something else than a char to allow for more values than 8.

We will also define macros for TRUE and FALSE:

#define TRUE 1
#define FALSE 0

Our boolean will act like an array, to some extent. A simple function to get one individual value would be the following, getBool, which takes 2 arguments; the variable and an "index":

bool getBool(bool boolean, int index){
  return ((boolean >> index) & 1);
}

What this does is basically to shift the value we are interested to the right, and then it uses AND to clear all values but the last, and if any value is left, we will have a 1, which is true, and else, a 0, false.

We will also have a function called setBool, with 3 arguments, a pointer to the boolean, the "index" of the boolean we want to access and the value we want to set at that position:

void setBool(bool *boolean, int index, bool value){
  if(value==0){
    *boolean=*boolean & ~(1<
          

The following graphic explains line 3:

*boolean=*boolean & ~(1<
          

The following graphic explains line 5:

*boolean=*boolean | (1<
          

I apologise if my solution is suboptimal; I'm new to low-level programming. And that was it.

man is your friend as a programmer

The GNU manpages are useful for programmers; first of all, you can lookup commands that you might use while programming, in case you have forgotten an option or something like that, but additionally, they contain documentation for many C functions, so you can for instance run man printf to get information on C’s formatted output function. Also, GNU/Linux kernel calls are documented.

You can also run man ascii to get a manpage containing every ASCII character in the set:

man ascii

There is no need to open up Firefox or Chrome/Chromium (but dare not use Internet Explorer) each time you need some of that information, just use the manpages (especially useful if you happen to be without an Internet connection).

Pythonic Coding: Techniques for idiomatic Python code

It is possible to write Java, C etc. in Python; that is, you use Python as your programming language, but you write code as if it wasn’t Python but another programming language and all you do is adjust the syntax to avoid (syntax) errors and get the program to run. Preferably, your Python code should be like Python code, not just be written in Python. Code that is written like that is called pythonic code; good Python code can be called pythonic. In this post, we will look at various techniques and idioms of the Python programming language that are typical in pythonic code.

Many techniques unique to Python, or at least not very common in other languages, are presented here. They are however not explained in very great detail and in such cases, links will be provided to more detailed documentation.

Use ‘in’ properly

A highly unpythonic way to iterate and print values in a list would be:

l=[8, 32, -53, 4]
for i in range(len(l)):
  print l[i]

This might be acceptable in C or similar languages because of the lack of any better way to do it, but it is bad in Python, since a superior way is provided. The pythonic and encouraged way would be:

l=[8, 32, -53, 4]
for i in l:
  print i

This is much simpler and clearer (these two are usually related). The important thing on the second line in the pythonic code example is that the in keyword is used correctly. When the in keyword is used with the for statement, it will go through every one item of an iterable object (one per iteration of the for loop), and i (or whatever variable you use) will will contain the value for the current object in the iterable. Because of this, we can also change the third line from print l[i] to print i.

It is good to know that the range() builtin returns an iterable object with values from something (default is 0) to the integer given as an argument. In the unpythonic code, we basically iterate through 0, 1, 2 and 3 and use those numbers as indexes for different items in our list. In the pythonic code, we “directly” iterate through the list.

Avoid parenthesis around conditional statements and loops

The following is correct Python code:

if(True==False):

It is possible to leave the parenthesis out, and you should do so. The parenthesis are considered a code bloat, and the pythonic way to write that if statement would be:

if True==False:

While C, Java and many other languages do require the parenthesis, Python doesn’t, so don’t use them (the reason the former is allowed is because the expression will just be treated as a parenthesized expression).

Checking for one out of multiple conditions

One way to check if, say, the variable x is equal to 8, 0 or 6 using just one single if statement is to write:

if x==8 or x==0 or x==6:

That code is, among other things, ugly and harder to maintain than how it should be. The in keyword will help you here. Using a list with in, you could have the following code, meaning exactly the same as the above, but with several advantages (pythonic and easier to read and maintain):

if x in [8, 0, 6]:

It iterates through every one of the list items, and in case one (or more) of them match, the statement evaluates to True. The pythonic way here can only replace or, not and or any other operators bearing similar roles.

Use properties

Don’t write this kind of class in Python:

class Clock(object):
  def __init__(self):
    self.__hour=1
  def setHour(self, hour):
    if 25>hour>0:
      self.__hour=hour
    else:
      raise BadHourException
  def getHour(self):
    return self.__hour

These traditional getters and setters should be replaced with properties in Python. To replace hour with a property, we can use the property() builtin. property() takes up to 4 arguments, the first being a function which is used as a getter, the second a function which is used as a setter, the third a deletion function and the fourth a docstring to be used for the property.

To upgrade our last example to use properties, we can use the following:

class Clock(object):
  def __init__(self):
    self.__hour=1
  def __setHour(self, hour):
    if 25>hour>0:
      self.__hour=hour
    else:
      raise BadHourException
  def __getHour(self):
    return self.__hour
  hour=property(__getHour, __setHour)

Note how the old getters and setters are now hidden. Before, we would have used something like the following to manage the hour variable:

time=Clock()
time.setHour(9)
print time.getHour()

This is what Java code could (and even should) look like, but Python isn’t Java and Java isn’t very pythonic. Our later implementation, which uses properties, would work like the following:

time=Clock()
time.hour=9
print time.hour

When we assign to time.hour at the second line, Python actually calls our __setHour() method, and the value we are giving it becomes the first (second if you include self) argument of that method. Likewise, when we ask for the value at line 3, we are given the return value of the __getHour() method, not necessarily the actual value of __hour.

Note that properties work only on new-style classes (classes that extend object).

List comprehensions

List comprehensions is one of my favorite Python features. If you were tasked to write a program that returns all odd values from a list, you might type something like the following if you were unaware of this feature:

l=[4, 2, 1, 9, 14, 25]
new=[]
for i in l:
  if i%2==1:
    new.append(i)

Using list comprehensions, you could type this on one line if we exclude the definition of the list:

new=[i for i in l if i%2==1]

The syntax might seem scary at first, but it isn’t, actually. Let’s break it down a bit. If you type the following in a Python interpreter, you will get the whole list:

[i for i in l]

This iterates through l, and for each iteration, it adds the value of i to a new list, and finally, when the iteration has finished, the new list is returned. We can add a statement which specifies a condition that each item must satisfy in order to be included in the new list. This is the if i%2==1 part.

That was the most basic form of a list comprehension. List comprehensions are extremely powerful, and you can do much more complex stuff than this. See the following for examples of complex list comprehensions.

Use enumerate()

You have a list of 5 names:

names=["Peter", "Jack", "Lucy", "Lucas", "Linus"]

You want to print them as a numbered list like the following:

1. Peter
2. Jack
3. Lucy
4. Lucas
5. Linus

The problem is that you need to know the index of the name you are printing in the list, but by using for i in names, you won’t know that, and using range() for this purpose is not just pythonic. Don’t fear, enumerate() solves this problem in a pythonic fashion:

for i, name in enumerate(names, 1):
  print "{0}. {1}".format(i, name)

The first argument which we give to enumerate is an iterable. The second argument is optional (it defaults to 0), and it sets the value at which to start the enumeration (we want to start from 1, thus we used that).

Generators

Generators in Python is an advanced and powerful feature that, when correctly used, can ease the coding of many tasks. Let’s implement a custom range function (it will have less features than Python’s builtin, note that while is used instead of a for i in range(n) as we want to implement our own range-like function without using the builtin):

def myRange(length):
  n=0
  result=[]
  while n
          

The 3rd line is highlighted; we don’t want that line; we don’t want a temporary list. This is where generators are useful. Instead, we could have used the following code for our range implementation:

def myRange(length):
  n=0
  while n
          

The nice trick here is the yield keyword; it adds a value to a list, which is later on returned as a generator object (note the lack of an explicit return statement, the generator is returned when the function has finished its execution). Generator objects can, for instance, become iterators, now allowing us to use our own range implementation instead of the one built into Python:

for i in myRange(7):
  print i

Some final things

To swap two values, don’t use a temporary variable, instead use:

a, b=b, a

You don’t have to use the readlines() method of files to iterate through files, instead you can directly iterate with a for loop:

for line in file:
  #do something with line

If possible, you should call the open() function at the same line as the for statement:

for line in open("file.txt"):
  #do something with line

Use tuple unpacking to return multiple values from a function:

def doubleUs(a, b)
  return a*2, b*2
x, y=doubleUs(4, 9)

The idiom for an infinite loop in Python is a while loop with the expression 1 (not True):

while 1:
  #Do things

Conclusion

Python is a language with many great features. These features are the things that make Python Python and not “just another programming language”. These features and idioms are pythonic and define the Python way of doing things. This was just an introduction to these, and there is much more to the word pythonic than these features.

New Site

So I completely changed AntoArts.Com today. This was a much needed change, mostly because I do very different things from what I did when I originally created the site. The site was meant, in the beginning, to be a place to store my Flash games, something that I don’t even work on any more. I have moved to desktop application programming, and now at the moment I am moving to do more low-level, systems programming; that is just remotely related to Flash development.

The blog will be the main part of the site, and I will post things I am working on, as well as thoughts, mainly on programming, here. Otherwise, this site, as of now, contains links to the old flash games, and a photo gallery (photography is actually something I still do, and I have no plans at all to stop doing that). Things are a bit unfinished here at the moment, but browse and enjoy (NB there is not much to browse, or to enjoy). All pages but the frontpage of the old site are available from their old URLs. When this new site has matured a bit, I will redirect some of the old pages to the newer ones.

I think I was 12 when I started this site, now I am 15. This site haven’t been updated in roughly a year, so something was due to happen. This was it.