Weasel Program - Go Version

One more implementation of the Weasel Program. (See previous two posts.) Then I'll get into the simulation issues.

I really like the ideas behind Go. Based on the arguments for creating it (still largely theoretical, since there isn't much data yet), I hope it becomes popular for the type of programming problems it's designed for. It's mostly very C-like, which I think is good. But unfortunately, this little simulation doesn't let it show off any of the advantages it has over C.

I don't have very much experience in Go and so am taking every opportunity I can to see what it can do.

My bug rate per line of code has always been pretty consistent regardless of language, and Go seems no different. Performance-wise, for this particular problem, Go split the difference in execution time between the Python version and the C version.

/* Weasel Program -- evolutionary simulation demonstration

A simple computer simulation demonstrating 
that the process that drives natural selection
(based on random variation combined with non-random selection)
is qualitatively different from results based on pure random selection.

See Wikipedia for more info on Weasel Program.

Given:
    Genome as string representing list of genes modelled as chars.
    Target genome is a fixed string of chars.
    Fixed mutation probability.
    Fixed number of children per generation.

Basic algorithm:
    Generate children from single parent keeping most fit as new parent
    for next generation.

    Stop when child genome = target genome

    Note: to save space, child is discarded immediately if not most fit
    in generation so far.

*/
package main

import (
    "fmt"
    "math/rand"
    "time"
)

// Adjust the following to see how they affect the number of generations and
// time to convergence. (Convenient to define here: visible and easy to
// change.)
var targetGenome = []byte("METHINKS IT IS LIKE A WEASEL!")
var genes = []byte(" ABCDEFGHIJKLMNOPQRSTUVWXYZ!")
var mutationProbability float32 = 0.05
var numChildrenPerGeneration int = 100

// Useful to define globally
var genomeLen, perfectFitnessScore, numGenes int
var parentGenome, childGenome, survivingChildGenome []byte

func init() {
    rand.Seed(time.Now().UTC().UnixNano())

    genomeLen = len(targetGenome)
    perfectFitnessScore = len(targetGenome)
    numGenes = len(genes)

    // First generation parent is just a random set of genes.
    parentGenome = make([]byte, genomeLen)
    for i := range parentGenome {
        parentGenome[i] = genes[rand.Intn(numGenes)]
    }

    childGenome = make([]byte, genomeLen)
    survivingChildGenome = make([]byte, genomeLen)
}

func calcFitness(genome []byte) (fitness int) {
    for i, targetGene := range targetGenome {
        if genome[i] == targetGene {
            fitness++
        }
    }
    return
}

func createChild() {
    for i, parentGene := range parentGenome {
        if rand.Float32() < mutationProbability {
            childGenome[i] = genes[rand.Intn(numGenes)]
        } else {
            childGenome[i] = parentGene
        }
    }
}

func main() {

    var i, generation, fitness, maxFitness int

    startTime := time.Now()
    // Let user know how we started.
    maxFitness = calcFitness(parentGenome)
    fmt.Printf("%3d %3d %s\n", generation, maxFitness, string(parentGenome))

    // Evolve.  Loop each generation.
    for maxFitness < perfectFitnessScore {
        generation++
        maxFitness = 0
        for i = 0; i < numChildrenPerGeneration; i++ {
            createChild()
            fitness = calcFitness(childGenome)
            if fitness >= maxFitness {
                maxFitness = fitness
                copy(survivingChildGenome, childGenome)
            }
        }
        copy(parentGenome, survivingChildGenome)
    }
    fmt.Printf("%3d %3d %s\n", generation, maxFitness, string(parentGenome))
    elapsedTime := time.Since(startTime)
    fmt.Printf("Elapsed time: %v\n", elapsedTime)
}

Weasel Program - C Version

I've just started playing around with my Python version of the Weasel Program (see my last post). I can already see that if I want to automate scenarios (over a range of parameters: target genome length, mutation probability, and generation size) I will need to address performance. So I think I'll just bite the bullet now.
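As a rough sketch of the kind of automation I have in mind, a Python loop like the one below could drive the runs over a grid of parameters. (The run_weasel helper and the particular parameter values are hypothetical; the simulation would first have to be wrapped in such a function.)

import itertools

# Hypothetical parameter grid; adjust to taste.
target_lengths = [10, 30, 100]
mutation_probabilities = [0.01, 0.05, 0.10]
children_per_generation = [50, 100, 500]

for length, prob, children in itertools.product(
        target_lengths, mutation_probabilities, children_per_generation):
    # run_weasel() is a hypothetical wrapper around the simulation that
    # returns the number of generations it took to reach the target.
    generations = run_weasel(length, prob, children)
    print length, prob, children, generations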

Here is a version I just wrote in C. I have a vague notion I can refactor to call C from Python if my investigation of the Weasel gets that far. And I am under no illusions that even C is up to the task of handling simulated genomes anywhere near the size found in real life, but the principle that faster is better still holds.
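If it comes to that, the glue code might look something like this sketch (the weasel.so library name and the run_weasel signature are assumptions; the C program below would first have to be refactored into such a function and built as a shared library, e.g. gcc -O3 -shared -fPIC weasel.c -o weasel.so):

import ctypes

# Hypothetical: load the refactored C simulation as a shared library and
# declare the signature of its (assumed) entry point:
#     int run_weasel(double mutation_probability, int num_children);
weasel = ctypes.CDLL("./weasel.so")
weasel.run_weasel.argtypes = [ctypes.c_double, ctypes.c_int]
weasel.run_weasel.restype = ctypes.c_int

generations = weasel.run_weasel(0.05, 100)
print "Converged after", generations, "generations"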

Just in passing: I've been using C for over 30 years, and during that time I've also used a lot of other languages. But if I had to choose one computer language to program in exclusively, C would be my choice. (I think of C++ first and foremost as a better C; I'm not using it here because the OO design style, verbosity, and memory overhead are not needed for such a small program.) But I digress.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <sys/time.h>

/* Weasel Program -- evolutionary simulation demonstration

A simple computer simulation demonstrating 
that the process that drives natural selection
(based on random variation combined with non-random selection)
is qualitatively different from results based on pure random selection.

See Wikipedia for more info on Weasel Program.

Given:
    Genome as string representing list of genes modelled as chars.
    Target genome is a fixed string of chars.
    Fixed mutation probability.
    Fixed number of children per generation.

Basic algorithm:
    Generate children from single parent keeping most fit as new parent
    for next generation.

    Stop when child genome = target genome

    Note: to save space, child is discarded immediately if not most fit
    in generation so far.

*/

/*
Note, used following command lines for speed:
    gcc -O3 weasel.c -o weasel
    ./weasel
or debugging:
    gcc -Wall -g weasel.c -o weasel.dbg
    gdb weasel.dbg
*/

const char TARGET_GENOME[] = "METHINKS IT IS LIKE A WEASEL!";
const char GENES[] = " ABCDEFGHIJKLMNOPQRSTUVWXYZ!";
const float MUTATION_PROBABILITY = 0.05;
const int NUM_CHILDREN_PER_GENERATION = 100;

/*
 * Also convenient to define globally.
 */

// minus null terminator
const size_t GENOME_LEN = sizeof(TARGET_GENOME) - 1;
const int PERFECT_FITNESS_SCORE = sizeof(TARGET_GENOME) - 1;
const size_t NUM_GENES = sizeof(GENES) - 1;

/* Mechanism of evolution. */

int calc_fitness(const char * child_genome) {
    // How many positions match the DNA of the perfect child?
    int i, sum = 0;
    for (i = 0; i < GENOME_LEN; i++) {
        sum += (child_genome[i] == TARGET_GENOME[i]);
    }
    return sum;
}

/* 
Usually, a child is different from its parent. 
That is, allow chance of mutation.
*/
void create_child(char * parent_genome, char * child_genome) {
    int i;
    char parent_gene;
    float random_num;
    char random_gene;
    for (i = 0; i < GENOME_LEN; i++) {
        parent_gene = parent_genome[i];
        random_num = (float)rand() / (float)RAND_MAX;  // 0.0 to 1.0
        random_gene = GENES[rand() % NUM_GENES];
        child_genome[i] = 
            random_num <= MUTATION_PROBABILITY ? 
                random_gene 
            : 
                parent_gene;
    }
    child_genome[i] = 0;  // null terminator, in case we print it as a string
}

/*
Starting with a parent, 
generate children that have chance of mutated genes.
Most fit (only) child survives to become next generation parent.
(A parent never survives from one generation to the next.)
*/
int main() {

    struct timeval start_time, end_time;
    gettimeofday(&start_time, NULL);

    char parent_genome[GENOME_LEN+1];   // string terminators
    char child_genome[GENOME_LEN+1];
    char surviving_child_genome[GENOME_LEN+1];

    int i, child_fitness, max_fitness = 0;

    // Not nowadays considered a great random number generator,
    // but good enough for our purposes.
    srand(time(NULL));

    // Initially, just a random string genome of character genes.
    // Fitness will no doubt be quite low.
    int generation = 0;
    for (i = 0; i < GENOME_LEN; i++) {
        parent_genome[i] = GENES[rand() % NUM_GENES];
    }
    parent_genome[i] = 0;  // null terminator
    max_fitness = calc_fitness(parent_genome);
    printf("%3d %3d %s\n", generation, max_fitness, parent_genome);

    // Iterate until a child reaches perfection.
    while (max_fitness < PERFECT_FITNESS_SCORE) {
        generation++;
        max_fitness = 0;  // yes, generation can go downhill!
        for (i = 0; i < NUM_CHILDREN_PER_GENERATION; i++) {
            create_child(parent_genome, child_genome);
            child_fitness = calc_fitness(child_genome);
            if (child_fitness >= max_fitness) {
                max_fitness = child_fitness;
                strcpy(surviving_child_genome, child_genome);
            }
        }
        strcpy(parent_genome, surviving_child_genome);
    }
    printf("%3d %3d %s\n", generation, max_fitness, parent_genome);

    gettimeofday(&end_time, NULL);
    long microseconds = 
        (end_time.tv_sec * 1000000 + end_time.tv_usec) - 
        (start_time.tv_sec * 1000000 + start_time.tv_usec);
    printf("Elapsed time: %7.2fms\n", microseconds / 1000.);

    return 0;
}

The C version is about twice as many lines of code as the Python version. IMHO, the C code is also more complicated (even though I managed to avoid dynamic memory allocation). I did not keep count, but the defect rate per line of code (all minor defects, for such a simple simulation) seemed about the same for both programs.

The performance reward? Execution is about seven times faster. Of course, increase the size of the simulation parameters and this number would grow.

Weasel Program -- Python Version

I recently stumbled across a type of thought experiment that is both very simple to simulate on a computer and yet very interesting to study and think about. That's too good to pass up!

It's called The Weasel Program. Its "aim is to demonstrate that the process that drives evolutionary systems -- random variation combined with non-random cumulative selection -- is different from pure chance." It's a very basic example of a genetic algorithm.

(I've had a lot of fun in the past with another class of computer optimization techniques based on simulated annealing, but I have to save that story for another day.)

It also turns out that Dawkins's Weasel has created a bit of a controversy between certain teleological-argument believers, particularly Intelligent Design types, and certain outspoken scientists. See The War of the Weasels. I see the Weasel controversy as a very simplified (and so perhaps instructive) version of the difficulties encountered in the Global Climate Model verification and validation debate.

But before getting into that and possibly going off into the deep end of the pool, I know I need to start off carefully and come up to speed by actually implementing a version (or two or three) of The Weasel Program. And then playing around until I get a feel for the numbers.

Of course, there are many examples of Weasel programs in various computer languages on the web. I could just download something. But to really understand code, I have to write code.

I've been coding a lot lately in Java and PHP. Practical languages -- but personally not very exciting or pleasing to write in. So as a chance to keep fresh in other languages I know and like to code in a lot more, I decided to implement a Weasel in Python. Here it is:

'''Weasel Program -- evolutionary simulation demonstration

A simple computer simulation demonstrating that the process that drives natural selection
(based on random variation combined with non-random selection)
is qualitatively different from results based on pure random selection.

See Wikipedia for more info on Weasel Program.

Given:
    Genome modelled as string representing list of genes modelled as chars.
    Target genome is a fixed string of chars.
    Fixed mutation probability.
    Fixed number of children per generation.
    
Basic algorithm:
    Generate children from single parent keeping most fit as new parent
    for next generation.
        
    Stop when child genome = target genome
    
    Note: to save space, child is discarded immediately if not most fit
    in generation so far.

'''
import random
import time

# Also implementing this in other languages, so keep track of performance.
start_time = time.time()

# Adjust the following to see how they affect 
# number of generations and time to convergence:

TARGET_GENOME = list("METHINKS IT IS LIKE A WEASEL!")
GENES = " ABCDEFGHIJKLMNOPQRSTUVWXYZ!"  # each char is a gene
MUTATION_PROBABILITY = 0.05
NUM_CHILDREN_PER_GENERATION = 100

# Convenient to define here:

GENOME_LENGTH = len(TARGET_GENOME)  # how many genes can match?
PERFECT_FITNESS_SCORE = GENOME_LENGTH  # all genes match target

# The mechanism of evolutionary selection.

def calc_fitness(child):
    '''How many positions match the target genome?'''
    fitness = 0
    for i in range(GENOME_LENGTH):
        if child[i] == TARGET_GENOME[i]:
            fitness += 1
    return fitness

def create_child(parent):
    '''Allow chance of mutation.'''
    return [(random.choice(GENES) if random.random() < MUTATION_PROBABILITY 
               else gene) for gene in parent]

# Initialization (the first generation is just a random string of genes)
generation = 0
surviving_child = [random.choice(GENES) for _ in range(GENOME_LENGTH)]
max_fitness = calc_fitness(surviving_child)
print "%3d" % generation, "%3d" % max_fitness, "".join(surviving_child)

# Then evolution proceeds as series of successive generations,
# using random mutations and survival of the fittest.
while max_fitness < PERFECT_FITNESS_SCORE:
    parent = surviving_child[:]
    generation += 1
    max_fitness = 0
    for _ in range(NUM_CHILDREN_PER_GENERATION):
        child = create_child(parent)
        child_fitness = calc_fitness(child)
        if child_fitness > max_fitness:
            max_fitness = child_fitness
            surviving_child = child

print "%3d" % generation, "%3d" % max_fitness, "".join(surviving_child)

print "Elapsed time: %7.2fms" % ((time.time() - start_time) * 1000.)
 
The program runs in about 100 ms (+/- 50%) on my desktop machine, depending on how lucky the mutations are for that particular run.

One of the things I like about Python is how easy it is to read, even for someone who is not that familiar with programming in general or Python in particular. But there are a couple of places in the above program where the syntax might seem confusing. One is where list comprehensions are used, for example:

surviving_child = [random.choice(GENES) for _ in range(GENOME_LENGTH)]

surviving_child is a list of random genes of length GENOME_LENGTH. The underscore (_) is just a common way of indicating that the loop variable is not actually used. And the comprehension's syntax can seem strange since it puts the body of the loop before the for keyword.
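Written out as an explicit loop, that comprehension is equivalent to:

# The same initialization written as an ordinary for loop.
surviving_child = []
for _ in range(GENOME_LENGTH):
    surviving_child.append(random.choice(GENES))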

Another place where Python may be confusing is its form of what C-like languages call the ternary operator (Python calls it a conditional expression).

random.choice(GENES) if random.random() < MUTATION_PROBABILITY else gene

In a C-like language, this expression would be:

random.random() < MUTATION_PROBABILITY ? random.choice(GENES) : gene
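Spelled out as an ordinary if/else statement (using a throwaway new_gene variable just for illustration; in the comprehension the value is used directly), the Python version reads:

# Per-gene choice from create_child(), written as a plain if/else.
if random.random() < MUTATION_PROBABILITY:
    new_gene = random.choice(GENES)   # mutation: pick a random gene
else:
    new_gene = gene                   # no mutation: copy the parent's gene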