Weasel Program - C Version

I've just started playing around with my Python version of the Weasel Program (see my last post). I can already see that if I want to automate scenarios (for a range of parameters: target genome length, mutation probability, and generation size) that I will need to address performance. So I think I'll just bite the bullet now.

Here is a version I just wrote in C. I have a vague notion I can refactor to call C from Python if my investigation of the Weasel gets that far. And I am under no illusions that even C is up to the task of handling simulated genomes anywhere near the size found in real life, but the principle that faster is better still holds.

Just in passing, I've been using C over 30 years and during that time I've also used a lot of other languages. But if I had to choose to program in one computer language exclusively -- C would be my choice. For example, first and foremost, I think of C++ as a better C. I'm not using C++ here because the OO design style, verbosity, and memory overhead are not needed for such a small program. But I digress.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <sys/time.h>

/* Weasel Program -- evolutionay simulation demonstration

A simple computer simulation demonstrating 
that the process that drives natural selection
(based on random variation combined with non-random selection)
is qualitatively different from results based on pure random selection.

See Wikipedia for more info on Weasel Program.

Given:
    Genome as string representing list of genes modelled as chars.
    Target genome is a fixed string of chars.
    Fixed mutation probability.
    Fixed number of children per generation.

Basic algorithm:
    Generate children from single parent keeping most fit as new parent
    for next generation.

    Stop when child genome = target genome

    Note: to save space, child is discarded immediately if not most fit
    in generation so far.

*/

/*
Note, used following command lines for speed:
    gcc -O3 weasel.c -o weasel
    ./weasel
or debugging:
    gcc -Wall -g weasel.c -o weasel.dbg
    gdb weasel.dbg
*/

const char TARGET_GENOME[] = "METHINKS IT IS LIKE A WEASEL!";
const char GENES[] = " ABCDEFGHIJKLMNOPQRSTUVWXYZ!";
const float MUTATION_PROBABILITY = 0.05;
const int NUM_CHILDREN_PER_GENERATION = 100;

/*
 * Also convenient to define globally.
 */

// minus null terminator
const size_t GENOME_LEN = sizeof(TARGET_GENOME) - 1;
const int PERFECT_FITNESS_SCORE = sizeof(TARGET_GENOME) - 1;
const size_t NUM_GENES = sizeof(GENES) - 1;

/* Mechanism of evolution. */

int calc_fitness(const char * child_genome) {
    // How many positions match the DNA of the perfect child?
    int i, sum = 0;
    for (i = 0; i < GENOME_LEN; i++) {
        sum += (child_genome[i] == TARGET_GENOME[i]);
    }
    return sum;
}

/* 
Usually, a child is different from its parent. 
That is, allow chance of mutation.
*/
void create_child(char * parent_genome, char * child_genome) {
    int i;
    char parent_gene;
    float random_num;
    char random_gene;
    for (i = 0; i < GENOME_LEN; i++) {
        parent_gene = parent_genome[i];
        random_num = (float)rand() / (float)RAND_MAX;  // 0.0 to 1.0
        random_gene = GENES[rand() % NUM_GENES];
        child_genome[i] = 
            random_num <= MUTATION_PROBABILITY ? 
                random_gene 
            : 
                parent_gene;
    }
    child_genome[i] = 0;  // if print as string
}

/*
Starting with a parent, 
generate children that have chance of mutated genes.
Most fit (only) child survives to become next generation parent.
(A parent never survives from one generation to the next.)
*/
int main() {

    struct timeval start_time, end_time;
    gettimeofday(&start_time, NULL);

    char parent_genome[GENOME_LEN+1];   // string terminators
    char child_genome[GENOME_LEN+1];
    char surviving_child_genome[GENOME_LEN+1];

    int i, child_fitness, max_fitness = 0;

    // Not nowadays considered a great random number generator, 
    // but good enough for our purposes. */
    srand(time(NULL));

    // Initially, just a random string genome of character genes.
    // Fitness will no doubt be quite low.
    int generation = 0;
    for (i = 0; i < GENOME_LEN; i++) {
        parent_genome[i] = GENES[rand() % NUM_GENES];
    }
    parent_genome[i] = 0;  // null terminator
    max_fitness = calc_fitness(parent_genome);
    printf("%3d %3d %s\n", generation, max_fitness, parent_genome);

    // Iterate until a child reaches perfection.
    while (max_fitness < PERFECT_FITNESS_SCORE) {
        generation++;
        max_fitness = 0;  // yes, generation can go downhill!
        for (i = 0; i < NUM_CHILDREN_PER_GENERATION; i++) {
            create_child(parent_genome, child_genome);
            child_fitness = calc_fitness(child_genome);
            if (child_fitness >= max_fitness) {
                max_fitness = child_fitness;
                strcpy(surviving_child_genome, child_genome);
            }
        }
        strcpy(parent_genome, surviving_child_genome);
    }
    printf("%3d %3d %s\n", generation, max_fitness, parent_genome);

    gettimeofday(&end_time, NULL);
    long microseconds = 
        (end_time.tv_sec * 1000000 + end_time.tv_usec) - 
        (start_time.tv_sec * 1000000 + start_time.tv_usec);
    printf("Elapsed time: %7.2fms\n", microseconds / 1000.);

    return 0;
}

The C version is about twice the number of lines of code as the Python version. IMHO, the C code is more complicated (even though I managed to avoid dynamic memory allocation). I did not keep count, but the defect rate per line of code (all minor defects for both programs for such a simple simulation) seemed about the same for both programs.

The performance reward? Execution is about seven times faster. Of course, increase the size of the simulation parameters and this number would grow.

No comments:

Post a Comment