Weasel Program - Python + C Version

This is a continuation of recent posts on the Weasel Program. There are various claims about what the program demonstrates about the processes that drive evolution. And I am trying to sort it out for myself.

Computer models are great for studying the relative importance of the factors that affect the system being modeled. One aspect of this is sensitivity analysis. For a sensitivity study of the parameters of the Weasel Program, it would be convenient to use Python, which would let me use tools such as matplotlib and numpy. However, Python is relatively slow, and C is much faster. In this post I show how to use SWIG (the Simplified Wrapper and Interface Generator) to get a combination that is just about the best of both languages. I've used SWIG since the '90s and it remains popular today. It has been much improved over the years, and it works not only with C but with C++ and about 20 other languages as well. So it's a tool made for polyglots.

The first thing I did was put all the source code in the same folder (weasel.c and weasel-driver.py). Then I refactored the C program I originally wrote. Now that I understand the problem better, I improved its design a bit. (All software designs evolve, no pun intended.) But most importantly, I decided which of the model's defining parameters and functions need to be externally available. (I used the Go convention that names meant to be publicly available begin with a capital letter.) Actual implementation details remain hidden. Here is the refactored weasel.c file.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <sys/time.h>

/* Weasel Program -- evolutionary simulation demonstration

A simple computer simulation demonstrating
that the process that drives natural selection
(based on random variation combined with non-random selection)
is qualitatively different from results based on pure random selection.

See Wikipedia for more info on Weasel Program.

Program has been made more general by allowing a gene to be any integer.

Program allows selection process to be modified.
That is, does not have to be number of matching genes.

Basic algorithm:
    Generate children from single parent keeping most fit as next parent.
    Single parent = no crossover

    Target genome does not drift (change, shrink, enlarge).
    Target genome is array of integers.
    So a gene is just an integer value (from 0 to RAND_MAX).

    Stop when child genome fitness >= target fitness

    Note: to save space, child is discarded immediately if not most fit
    in generation so far.

Note, used following command lines for speed:
    gcc -O3 weasel.c -o weasel
    ./weasel
or debugging:
    gcc -Wall -g weasel.c -o weasel.dbg
    gdb weasel.dbg
*/

// Avoids any malloc/free issues:
#define MAX_GENOME_LEN 10000
#define MAX_NUM_GENES 10000
int PARENT[MAX_GENOME_LEN] = {0};
int CHILD[MAX_GENOME_LEN] = {0};
int FITTEST[MAX_GENOME_LEN] = {0};

// Define these to control simulation:
// (Give some defaults to get started with a sensitivity analysis.)

// Represents simulation parameters.
int Genome_len = 100;  // <= MAX_GENOME_LEN
int Num_genes = 30;  // <= MAX_NUM_GENES
int Generation_size = 100;
double Mutation_probability = 0.02;
double Fitness_err = 0.01;
double Target_fitness = 30;

// Control simulation.
int Max_generations = 10000;

// Represents evolutionary model.
double calc_child_fitness() {
    // CHANGE MODEL AS NEEDED.
    // For this particular model:
    // Fitness model is completely 'internalized'.
    // Our strategy is centered only on the genome itself.
    // No external world or creature to simulate.
    // No effect on child creation mechanism.
    // Actual fixed target gene value does not matter (so set = 0).
    // Only need match particular value at particular position
    // on the target genome.
    const int target[MAX_GENOME_LEN] = {0};
    int i;
    double fitness = 0.;
    for (i = 0; i < Genome_len; i++) {
        fitness += (double)(CHILD[i] == target[i]);
    }
    return fitness;
}

// Represents reproduction and gene semantics.
void create_child() {
    int i;
    for (i = 0; i < Genome_len; i++) {
        // drand48() = 48 bit uniform 0.0 to 1.0
        if (drand48() <= Mutation_probability) {
            CHILD[i] = rand() % Num_genes;
        } else {
            CHILD[i] = PARENT[i];
        }
    }
}
void create_random_parent() {
    // Setup first generation.
    int i;
    for (i = 0; i < Genome_len; i++) {
        PARENT[i] = rand() % Num_genes;
    }
}

/*
Starting with a parent,
generate children that have chance of mutated genes.
Most fit (only) child survives to become next generation's parent.
(A parent never survives from one generation to the next.)
*/
int Evolve() {  // generations needed

    int i, j;
    double fitness, max_fitness;

    // Assume first generation parent fitness is low.
    int generation = 0;
    create_random_parent();
    max_fitness = 0.0;

    // Iterate until a child reaches target fitness or too many generations.
    for (generation = 1; generation < Max_generations; generation++) {
        if (max_fitness >= Target_fitness - Fitness_err) break;
        max_fitness = 0;  // yes, a generation can go downhill!
        for (i = 0; i < Generation_size; i++) {
            create_child();
            fitness = calc_child_fitness();
            if (fitness >= max_fitness) {
                max_fitness = fitness;
                for (j = 0; j < Genome_len; j++) {
                    FITTEST[j] = CHILD[j];
                }
            }
        }
        for (j = 0; j < Genome_len; j++) {
            PARENT[j] = FITTEST[j];
        }
    }
    return generation;
}

int main() {

    struct timeval start_time, end_time;

    Genome_len = 100;
    Num_genes = 30;
    Generation_size = 100;
    Mutation_probability = 0.02;
    Fitness_err = .01;
    Target_fitness = Genome_len;

    Max_generations = 10000;
    
    srand(time(NULL));

    printf("LEN: %4d, NUM: %4d, SIZE: %4d, MUTATION: %7.3f\n",
           Genome_len, Num_genes, Generation_size, Mutation_probability);

    int i;
    int num_trials = 10;
    double total_time = 0.0;
    int total_generations = 0;
    for (i = 0; i < num_trials; i++) {
        gettimeofday(&start_time, NULL);

        total_generations += Evolve();

        gettimeofday(&end_time, NULL);
        long microseconds =
            (end_time.tv_sec * 1000000 + end_time.tv_usec) -
            (start_time.tv_sec * 1000000 + start_time.tv_usec);
        total_time += microseconds / 1000000.;
    }

    double time_per_trial = total_time / num_trials;
    int generations_per_trial = total_generations / num_trials;

    printf("TRIALS: %4d, AVE. GEN: %4d/trial, AVE. TIME: %7.3f sec/trial\n",
           num_trials, generations_per_trial, time_per_trial);

    return 0;
}

Some things to notice. I went from characters to integers so I could use any number of genes I wanted. I increased the size of the static arrays so I wouldn't need to worry about accidental overflow. I defined some defaults for the model parameters. And I made corresponding changes to the main function so I could test all the other changes. (One of the issues with SWIG is that it can make testing harder, and it is a general best practice anyway to do as much testing as practical before integration.) There was no need to remove main before turning this into a Python-callable module using SWIG.

Here is the corresponding SWIG interface file.

/* SWIG Interface File: weasel.i */
%module weasel
%{
extern int Genome_len;
extern int Num_genes;
extern int Generation_size;
extern double Mutation_probability;
extern double Fitness_err;
extern double Target_fitness;
extern int Max_generations;

extern int Evolve();
%}

extern int Genome_len;
extern int Num_genes;
extern int Generation_size;
extern double Mutation_probability;
extern double Fitness_err;
extern double Target_fitness;
extern int Max_generations;

extern int Evolve();

The thing to note here is how this makes it more difficult to add or change things. It's like defining a class, something I usually try to avoid when I'm still trying to understand a new idea or prototype an unfamiliar algorithm. But it is evidence for the argument that I should have used C++ from the start rather than C.

And these are the commands I used to create the Python callable weasel module. (I'm using Ubuntu 12.04, and installed SWIG at ~/apps/swig. So your commands might need to be a little different.)

#!/usr/bin/env bash
~/apps/swig/bin/swig -python weasel.i
gcc -c -fpic weasel.c weasel_wrap.c -I/usr/include/python2.7
gcc -shared weasel.o weasel_wrap.o -o _weasel.so

Finally, here is how I refactored the Python version I originally posted.

'''Weasel Program -- evolutionary simulation demonstration

A simple computer simulation demonstrating
that the process that drives natural selection
(based on random variation combined with non-random selection)
is qualitatively different from results based on pure random selection.

See Wikipedia for more info on Weasel Program.

This is a driver that accesses, via SWIG, a C module that does the evolutionary work.
This was done for performance. See the weasel.i interface file for what's accessible.

'''
import random
import time
import weasel as w  # this is the C module accessible via SWIG

random.seed()

# Adjust the following to see how they affect number of generations
# needed to reach fitness goal and time to convergence.

# Set simulation parameters:
wv = w.cvar
wv.Genome_len = 30
wv.Num_genes = 30
wv.Generation_size = 100
wv.Mutation_probability = 0.02
wv.Fitness_err = 0.01
wv.Target_fitness = wv.Genome_len

# Control the simulation:
wv.Max_generations = 10000

fmt_str = "LEN: %4d, NUM: %4d, SIZE: %4d, MUTATION: %7.3f\n"
fmt_str_vars = (wv.Genome_len, wv.Num_genes,
    wv.Generation_size, wv.Mutation_probability)
print fmt_str % fmt_str_vars

# To keep track of performance.

num_trials = 1
total_time = 0.
total_generations = 0
for i in range(num_trials):

    start_time = time.time()
    generations = w.Evolve()
    dt = time.time() - start_time

    total_generations += generations
    total_time += dt

    print "TRIAL: %3d" % (i+1), "GEN: %4d" % generations, "TIME: %7.2f sec" % dt

fmt_str = "   AVERAGE GEN: %4d TIME: %7.2f sec/trial"
print fmt_str % (total_generations / num_trials, total_time / num_trials)

I had to change its name to weasel-driver.py because SWIG generates its own weasel.py file. The result is a Python program that is only slightly slower than if it were written in pure C.
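To close the loop on the sensitivity-analysis goal that motivated all this, here is a minimal sketch of the kind of parameter sweep this setup makes cheap. It assumes the _weasel module built above is importable from the same folder; the trial count and the mutation probabilities swept are arbitrary choices for illustration, and the averages it prints could just as easily be collected into numpy arrays and plotted with matplotlib. (One caveat: weasel.c seeds rand() only inside its own main(), so a fresh Python process always starts the C random number generator from the same default state; successive Evolve() calls within one run still differ.)

'''Sketch: sweep Mutation_probability and report average generations to converge.'''
import weasel as w  # the SWIG-built C module from above

wv = w.cvar
wv.Genome_len = 100
wv.Num_genes = 30
wv.Generation_size = 100
wv.Fitness_err = 0.01
wv.Target_fitness = wv.Genome_len
wv.Max_generations = 10000

num_trials = 5  # arbitrary; more trials smooth out the run-to-run randomness
for p in (0.005, 0.01, 0.02, 0.05, 0.10):  # arbitrary sweep values
    wv.Mutation_probability = p
    total = sum(w.Evolve() for _ in range(num_trials))
    print "MUTATION: %5.3f  AVE. GEN: %7.1f" % (p, total / float(num_trials))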

Weasel Program - Go Version

One more implementation of the Weasel Program. (See previous two posts.) Then I'll get into the simulation issues.

I really like the ideas behind Go. Based on the arguments for its creation (still largely theoretical, since there isn't much data yet), I hope it becomes popular for the kinds of programming problems it's designed for. It's mostly very C-like, which I think is good. But unfortunately, this little simulation doesn't let it show off any of the advantages it has over C.

I don't have very much experience in Go and so am taking every opportunity I can to see what it can do.

My bug rate per line of code has always been pretty consistent regardless of language. Go seems no different. Performance-wise, for this particular problem, Go split the difference in execution time between the Python version and the C version.

/* Weasel Program -- evolutionary simulation demonstration

A simple computer simulation demonstrating 
that the process that drives natural selection
(based on random variation combined with non-random selection)
is qualitatively different from results based on pure random selection.

See Wikipedia for more info on Weasel Program.

Given:
    Genome as string representing list of genes modelled as chars.
    Target genome is a fixed string of chars.
    Fixed mutation probability.
    Fixed number of children per generation.

Basic algorithm:
    Generate children from single parent keeping most fit as new parent
    for next generation.

    Stop when child genome = target genome

    Note: to save space, child is discarded immediately if not most fit
    in generation so far.

*/
package main

import (
    "fmt"
    "math/rand"
    "time"
)

// Adjust the following to see how they affect number of generations and
// time to convergence. (Convenient to define here, visible and easy to
// change.)
var targetGenome = []byte("METHINKS IT IS LIKE A WEASEL!")
var genes = []byte(" ABCDEFGHIJKLMNOPQRSTUVWXYZ!")
var mutationProbability float32 = 0.05
var numChildrenPerGeneration int = 100

// Useful to define globally
var genomeLen, perfectFitnessScore, numGenes int
var parentGenome, childGenome, survivingChildGenome []byte

func init() {
    rand.Seed(time.Now().UTC().UnixNano())

    genomeLen = len(targetGenome)
    perfectFitnessScore = len(targetGenome)
    numGenes = len(genes)

    // First generation parent is just a random set of genes.
    parentGenome = make([]byte, genomeLen)
    for i := range parentGenome {
        parentGenome[i] = genes[rand.Intn(numGenes)]
    }

    childGenome = make([]byte, genomeLen)
    survivingChildGenome = make([]byte, genomeLen)
}

func calcFitness(genome []byte) (fitness int) {
    for i, targetGene := range targetGenome {
        if genome[i] == targetGene {
            fitness++
        }
    }
    return
}

func createChild() {
    for i, parentGene := range parentGenome {
        if rand.Float32() < mutationProbability {
            childGenome[i] = genes[rand.Intn(numGenes)]
        } else {
            childGenome[i] = parentGene
        }
    }
}

func main() {

    var i, generation, fitness, maxFitness int

    startTime := time.Now()
    // Let user know how we started.
    maxFitness = calcFitness(parentGenome)
    fmt.Printf("%3d %3d %s\n", generation, maxFitness, string(parentGenome))

    // Evolve.  Loop each generation.
    for maxFitness < perfectFitnessScore {
        generation++
        maxFitness = 0
        for i = 0; i < numChildrenPerGeneration; i++ {
            createChild()
            fitness = calcFitness(childGenome)
            if fitness >= maxFitness {
                maxFitness = fitness
                copy(survivingChildGenome, childGenome)
            }
        }
        copy(parentGenome, survivingChildGenome)
    }
    fmt.Printf("%3d %3d %s\n", generation, maxFitness, string(parentGenome))
    elapsedTime := time.Since(startTime)
    fmt.Printf("Elapsed time: %v\n", elapsedTime)
}

Weasel Program - C Version

I've just started playing around with my Python version of the Weasel Program (see my last post). I can already see that if I want to automate scenarios over a range of parameters (target genome length, mutation probability, and generation size), I will need to address performance. So I think I'll just bite the bullet now.

Here is a version I just wrote in C. I have a vague notion I can refactor to call C from Python if my investigation of the Weasel gets that far. And I am under no illusions that even C is up to the task of handling simulated genomes anywhere near the size found in real life, but the principle that faster is better still holds.

Just in passing, I've been using C for over 30 years, and during that time I've also used a lot of other languages. But if I had to program in one computer language exclusively, C would be my choice. (For what it's worth, I think of C++ first and foremost as a better C.) I'm not using C++ here because the OO design style, verbosity, and memory overhead are not needed for such a small program. But I digress.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <sys/time.h>

/* Weasel Program -- evolutionary simulation demonstration

A simple computer simulation demonstrating 
that the process that drives natural selection
(based on random variation combined with non-random selection)
is qualitatively different from results based on pure random selection.

See Wikipedia for more info on Weasel Program.

Given:
    Genome as string representing list of genes modelled as chars.
    Target genome is a fixed string of chars.
    Fixed mutation probability.
    Fixed number of children per generation.

Basic algorithm:
    Generate children from single parent keeping most fit as new parent
    for next generation.

    Stop when child genome = target genome

    Note: to save space, child is discarded immediately if not most fit
    in generation so far.

*/

/*
Note, used following command lines for speed:
    gcc -O3 weasel.c -o weasel
    ./weasel
or debugging:
    gcc -Wall -g weasel.c -o weasel.dbg
    gdb weasel.dbg
*/

const char TARGET_GENOME[] = "METHINKS IT IS LIKE A WEASEL!";
const char GENES[] = " ABCDEFGHIJKLMNOPQRSTUVWXYZ!";
const float MUTATION_PROBABILITY = 0.05;
const int NUM_CHILDREN_PER_GENERATION = 100;

/*
 * Also convenient to define globally.
 */

// minus null terminator
const size_t GENOME_LEN = sizeof(TARGET_GENOME) - 1;
const int PERFECT_FITNESS_SCORE = sizeof(TARGET_GENOME) - 1;
const size_t NUM_GENES = sizeof(GENES) - 1;

/* Mechanism of evolution. */

int calc_fitness(const char * child_genome) {
    // How many positions match the DNA of the perfect child?
    int i, sum = 0;
    for (i = 0; i < GENOME_LEN; i++) {
        sum += (child_genome[i] == TARGET_GENOME[i]);
    }
    return sum;
}

/* 
Usually, a child is different from its parent. 
That is, allow chance of mutation.
*/
void create_child(char * parent_genome, char * child_genome) {
    int i;
    char parent_gene;
    float random_num;
    char random_gene;
    for (i = 0; i < GENOME_LEN; i++) {
        parent_gene = parent_genome[i];
        random_num = (float)rand() / (float)RAND_MAX;  // 0.0 to 1.0
        random_gene = GENES[rand() % NUM_GENES];
        child_genome[i] = 
            random_num <= MUTATION_PROBABILITY ? 
                random_gene 
            : 
                parent_gene;
    }
    child_genome[i] = 0;  // if print as string
}

/*
Starting with a parent, 
generate children that have chance of mutated genes.
Most fit (only) child survives to become next generation parent.
(A parent never survives from one generation to the next.)
*/
int main() {

    struct timeval start_time, end_time;
    gettimeofday(&start_time, NULL);

    char parent_genome[GENOME_LEN+1];   // string terminators
    char child_genome[GENOME_LEN+1];
    char surviving_child_genome[GENOME_LEN+1];

    int i, child_fitness, max_fitness = 0;

    // Not nowadays considered a great random number generator,
    // but good enough for our purposes.
    srand(time(NULL));

    // Initially, just a random string genome of character genes.
    // Fitness will no doubt be quite low.
    int generation = 0;
    for (i = 0; i < GENOME_LEN; i++) {
        parent_genome[i] = GENES[rand() % NUM_GENES];
    }
    parent_genome[i] = 0;  // null terminator
    max_fitness = calc_fitness(parent_genome);
    printf("%3d %3d %s\n", generation, max_fitness, parent_genome);

    // Iterate until a child reaches perfection.
    while (max_fitness < PERFECT_FITNESS_SCORE) {
        generation++;
        max_fitness = 0;  // yes, generation can go downhill!
        for (i = 0; i < NUM_CHILDREN_PER_GENERATION; i++) {
            create_child(parent_genome, child_genome);
            child_fitness = calc_fitness(child_genome);
            if (child_fitness >= max_fitness) {
                max_fitness = child_fitness;
                strcpy(surviving_child_genome, child_genome);
            }
        }
        strcpy(parent_genome, surviving_child_genome);
    }
    printf("%3d %3d %s\n", generation, max_fitness, parent_genome);

    gettimeofday(&end_time, NULL);
    long microseconds = 
        (end_time.tv_sec * 1000000 + end_time.tv_usec) - 
        (start_time.tv_sec * 1000000 + start_time.tv_usec);
    printf("Elapsed time: %7.2fms\n", microseconds / 1000.);

    return 0;
}

The C version is about twice the number of lines of code as the Python version. IMHO, the C code is more complicated (even though I managed to avoid dynamic memory allocation). I did not keep count, but the defect rate per line of code (all minor defects for both programs for such a simple simulation) seemed about the same for both programs.

The performance reward? Execution is about seven times faster. Of course, increase the size of the simulation parameters and that factor would only grow.

Weasel Program -- Python Version

I recently stumbled across a type of thought experiment that is both very simple to simulate on a computer and yet very interesting to study and think about. That's too good to pass up!

It's called The Weasel Program. Its "aim is to demonstrate that the process that drives evolutionary systems -- random variation combined with non-random cumulative selection -- is different from pure chance." It's a very basic example of a genetic algorithm.

(I've had a lot of fun in the past with another class of computer optimization techniques based on simulated annealing, but I have to save that story for another day.)

It also turns out that Dawkins's Weasel has created a bit of a controversy between certain teleological argument believers, particularly Intelligent Design types, and certain outspoken scientists. See The War of the Weasels. I see the Weasel controversy as a very simplified (and so perhaps instructive) version of the difficulties encountered in the debate over Global Climate Model verification and validation.

But before getting into that and possibly going off into the deep end of the pool, I know I need to start off carefully and come up to speed by actually implementing a version (or two or three) of The Weasel Program. And then playing around until I get a feel for the numbers.

Of course, there are many examples of Weasel programs in various computer languages on the web. I could just download something. But to really understand code, I have to write code.

I've been coding a lot lately in Java and PHP. Practical languages -- but personally not very exciting or pleasing to write in. So as a chance to keep fresh in other languages I know and like to code in a lot more, I decided to implement a Weasel in Python. Here it is:

'''Weasel Program -- evolutionary simulation demonstration

A simple computer simulation demonstrating that the process that drives natural selection
(based on random variation combined with non-random selection)
is qualitatively different from results based on pure random selection.

See Wikipedia for more info on Weasel Program.

Given:
    Genome modelled as string representing list of genes modelled as chars.
    Target genome is a fixed string of chars.
    Fixed mutation probability.
    Fixed number of children per generation.
    
Basic algorithm:
    Generate children from single parent keeping most fit as new parent
    for next generation.
        
    Stop when child genome = target genome
    
    Note: to save space, child is discarded immediately if not most fit
    in generation so far.

'''
import random
import time

# Also implementing this in other languages, so keep track of performance.
start_time = time.time()

# Adjust the following to see how they affect 
# number of generations and time to convergence:

TARGET_GENOME = list("METHINKS IT IS LIKE A WEASEL!")
GENES = " ABCDEFGHIJKLMNOPQRSTUVWXYZ!"  # each char is a gene
MUTATION_PROBABILITY = 0.05
NUM_CHILDREN_PER_GENERATION = 100

# Convenient to define here:

GENOME_LENGTH = len(TARGET_GENOME)  # how many genes can match?
PERFECT_FITNESS_SCORE = GENOME_LENGTH  # all genes match target

# The mechanism of evolutionary selection.

def calc_fitness(child):
    '''How many positions match the target genome?'''
    fitness = 0
    for i in range(GENOME_LENGTH):
        if child[i] == TARGET_GENOME[i]:
            fitness += 1
    return fitness

def create_child(parent):
    '''Allow chance of mutation.'''
    return [(random.choice(GENES) if random.random() < MUTATION_PROBABILITY 
               else gene) for gene in parent]

# Initialization (the first generation is just a random string of genes)
generation = 0
surviving_child = [random.choice(GENES) for _ in range(GENOME_LENGTH)]
max_fitness = calc_fitness(surviving_child)
print "%3d" % generation, "%3d" % max_fitness, "".join(surviving_child)

# Then evolution proceeds as series of successive generations,
# using random mutations and survival of the fittest.
while max_fitness < PERFECT_FITNESS_SCORE:
    parent = surviving_child[:]
    generation += 1
    max_fitness = 0
    for _ in range(NUM_CHILDREN_PER_GENERATION):
        child = create_child(parent)
        child_fitness = calc_fitness(child)
        if child_fitness > max_fitness:
            max_fitness = child_fitness
            surviving_child = child

print "%3d" % generation, "%3d" % max_fitness, "".join(surviving_child)

print "Elapsed time: %7.2fms" % ((time.time() - start_time) * 1000.)
 
The program runs in about 100ms +/-50% on my desktop machine depending on how lucky the mutations are for that particular run.

One of the things I like about Python is how easy it is to read, even for someone who is not that familiar with programming in general or Python in particular. But there are a couple of places in the above program where the syntax might seem confusing. One is where list comprehensions are used, for example:

surviving_child = [random.choice(GENES) for _ in range(GENOME_LENGTH)]

surviving_child is a list of random genes of length GENOME_LENGTH. The underscore (_) is just a common way of indicating that the loop variable is not being used. And the comprehension's syntax can seem strange, since it puts the body of the loop before the for keyword.
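For anyone still finding that inside-out order confusing, the comprehension is just shorthand for an explicit loop. Here is the same initialization spelled out long-hand, purely for illustration:

# Equivalent explicit loop -- builds the same list as the comprehension above.
surviving_child = []
for _ in range(GENOME_LENGTH):
    surviving_child.append(random.choice(GENES))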

Another place where Python may be confusing is its form of what C-like languages call the ternary operator.

random.choice(GENES) if random.random() < MUTATION_PROBABILITY else gene

In a C-like language, this expression would be:

random.random() < MUTATION_PROBABILITY ? random.choice(GENES) : gene
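To make the correspondence concrete, here is a tiny standalone example of the Python form (the values are arbitrary):

x = 7
label = "big" if x > 5 else "small"  # Python's conditional expression
print label  # prints: big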

Everyone Should Learn About Software

Can we trust climate models? This was the title of a recent blog post by Tamsin Edwards. She says:
Ever since I started scientific research, I’ve been fascinated by uncertainty. By the limits of our knowledge. It’s not something we’re exposed to much at school, where Science is often equated with Facts and not, as it should be, with Doubt and Questioning. My job as a climate scientist is to judge these limits, our confidence, in predictions of climate change and its impacts, and to communicate them clearly.
I liked this article. However, I do not think the key public issue about the climate models is a lack of appreciation of the uncertainty inherent in science. Rather, it is ignorance about the difficulties and limitations inherent in computer modelling of natural phenomena. Especially utterly complex systems like the climate. BTW, I think this ignorance about computer coding extends to some scientists as well.

This is an aspect of a more general problem: the need for everyone in modern society to understand what software is all about. And the fact that schools are not teaching this.

For example, there was an article in the Wall Street Journal over the weekend, critical of the US education system, titled "Sorry, College Grads, I Probably Won't Hire You". The author, Kirk McDonald, argues that "States should provide additional resources to train and employ teachers of science, technology, engineering and math, as well as increase access to the latest hardware and software for elementary and high-school students." His advice is that "If you want to survive in this economy, you'd be well-advised to learn how to speak computer code".

Many programmers don't think this is an issue. Typical Programmer says:
This is another article beating the "everyone must learn to code" drum. Employers can’t find enough people with programming skills, schools aren’t turning out enough engineers, jobs at these cool companies are left unfilled because American students are too lazy or short-sighted to spend a summer learning "basic computer language." If only it was that simple. I have some complaints about this "everyone must code" movement, and Mr. McDonald’s article gives me a starting point because he touched on so many of them.
I recommend reading the entire post. But, in the end, I am not convinced by Typical Programmer Greg Jorgensen's points. Simply put, the reason to learn a computer language is not to be able to program, but to have a foundation from which to understand the nature of software. As an engineer, I have taken a lot of courses in science and math as well as engineering. But I am not a scientist or a mathematician; I am an engineer. However, it is necessary for me to understand science and math in order to be a competent engineer. By analogy, many people now think it takes an understanding of software to be a productive member of modern society. I think that too. And the best way to learn is by doing.

Older and Wiser... Up to a Point



Having had a lot of fun, and having gained a lot of software knowledge and wisdom over the years, I do not look forward to the day when, like an old pro athlete, I have to retire from being a programmer because I physically can't do it anymore.

So, I was pleased to stumble across an IEEE Spectrum Tech Talk article that suggested I don't have much to worry about along that line for quite a while. Here is an excerpt:
Prejudice against older programmers is wrong, but new research suggests it's also inaccurate. A dandy natural experiment to test the technical chops of the old against the young has been conducted—or discovered—by two computer scientists at North Carolina State University, in Raleigh. Professor Emerson Murphy-Hill and Ph.D. student Patrick Morrison went to Stack Overflow, a Web site where programmers answer questions and get rated by the audience. It turned out that ratings rose with contributors' age, at least into the 40s (beyond that the data were sparse). The range of topics handled also rose with age (though, strangely, after dipping in the period from 15 to 30). Finally, the old were at least as well versed as the young in the newer technologies.

Of course, such a natural experiment can't account for all possible confounding factors.
This is in line with what I already knew about chess. I played a lot of chess when I was a kid. I was told then that a person's chess rating didn't drop all that much during the player's lifetime.

The article also addresses chess, but a bit ambiguously. Here is a figure the article presents:

It was hard for me to interpret the vertical axis. I am not sure whether "SD Units" means standard deviation units. I remember that for I.Q. one standard deviation is about 15 points, but I don't know if there is an appropriate analogy that can be drawn here.

So I looked up Aging in Sports and Chess, which presents some experimental results documented in a 2007 paper by Ray C. Fair. It says that if you are at 100% of your maximum rating at age 35, then you are at 99% at age 40, 97% at 50, 95% at 60, 93% at 70, 90% at 80, and 80% at 90. So it looks like nothing more than a slow, steady decline until people get into their 80s.

As with the Stack Overflow data, it is hard to judge the relationship of data about chess skill to programming skill. But all this data suggests I don't have too much to worry about for a long time.

What is Wrong with Risk Matrices? -- Part 2

This is a continuation of my last post.

Risk is probability times consequence for a given scenario. Wikipedia gives an example of a risk matrix:


Risk has been granularized to reflect uncertainty, lack of data, and subjectivity. The vertical axis is probability. The horizontal axis is consequence. As can be seen from the figures in my previous post, this matrix is logically inconsistent with respect to iso-risk curves (lines of constant risk) whether the axis scales are linear or log-log. For example, if the scales are linear, it is mathematically impossible for the first row or the first column to contain any value other than "Low"; and if the axes are log-log, no diagonal can contain different values.

However, IMHO, the above example does not illustrate what is actually wrong with risk matrices. What is actually wrong is that these table values are being interpreted as risk values in the first place. Instead, risk matrix table values should be interpreted as risk categories. They measure our level of concern about a particular category of risk, not its actual value. The axes measure the "worst case scenario" values of probability and consequence for a given risk.

A risk value for any expected scenario at, say, a nuclear power plant could never be anything other than "Low". The very idea of an acceptable "Extreme" risk of "Certain" probability and "Catastrophic" consequences is ridiculous. No stakeholder would sign off on that. To be of practical utility, the table values cannot be interpreted as risk values.

A risk category is defined as the management process used to manage risks that are at a certain level of concern. There should be stakeholder consensus on risk categories. Risk categories can be ordered by the amount of process grading and tailoring they allow for ordered levels of concern.

For example, for the most "Extreme" risk category, it may be required that each and every risk, regardless of its probability or consequence in a worst case scenario, be formally documented, analyzed to an extreme level of detail and rigor, prioritized, mitigated to an extremely low level, tracked throughout its lifetime, and signed-off on periodically by all stakeholders. Risk categories of lower concern (High, Moderate, and Low) will allow grading and tailoring of these Extreme requirements to lower levels.
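To make the idea of grading and tailoring a little more concrete, here is a toy sketch. The category names are the ones used above; the specific requirements attached to each category are invented for illustration and would in practice come from stakeholder consensus:

# Toy mapping from risk category to (hypothetical) process requirements.
REQUIREMENTS = {
    "Extreme":  ["formal documentation", "rigorous analysis", "prioritization",
                 "mitigation to an extremely low level", "lifetime tracking",
                 "periodic sign-off by all stakeholders"],
    "High":     ["formal documentation", "analysis", "mitigation", "tracking"],
    "Moderate": ["documentation", "mitigation where practical"],
    "Low":      ["documentation only"],
}

for category in ("Extreme", "High", "Moderate", "Low"):
    print "%-8s: %s" % (category, ", ".join(REQUIREMENTS[category]))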

What's Wrong with Risk Matrices?

Risk is equal to probability times consequence. The Wikipedia entry for risk matrix has a reference to an article by Tony Cox titled What's Wrong with Risk Matrices? Unfortunately, the article is behind a paywall. So as far as I am concerned, it doesn't exist. But I was able to google a blog posting about the article.

What I found interesting was how Cox constructed a risk matrix and then drew lines of constant risk on it. Something like this:
The lines are hyperbolas with the axes as asymptotes.

According to the blog posting referenced above: "Cox shows that the qualitative risk ranking provided by a risk matrix will agree with the quantitative risk ranking only if the matrix is constructed according to certain general principles."

Before examining these general principles (and let me be clear up front that I fundamentally disagree with them), first things first.

I have always viewed risk matrices as having log-log scales. (And I'm not the only one looking at risk matrices this way.) Something like this:

Notice that the lines of constant risk are all straight lines with a slope of minus one (since log(consequence) = log(risk) - log(probability)). And note that the risk contours in this example are separated by an order of magnitude, not just a doubling of risk value as in the previous figure. This means a log-log scale makes it easier to represent a wide range of probability and consequence scenarios (measurable in dollar amounts).

But the most important reason I think it is better to use a log-log scale is that risk categorization is subjective. And I believe that, where possible, subjective judgments like risk category should be measured in decibels. As I have written about in a previous post: The decibel is a log scale that simplifies overall power gain/loss calculations and is convenient for ratios of numbers that differ by orders of magnitude. Additionally, log scales have been useful in describing the potential of certain physical phenomena (e.g., earthquakes) and human perceptions (e.g., hearing). Thus, log scales can be useful when performing risk assessments and other related quality assurance activities.
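As a concrete (and entirely made-up) illustration of both points, the sketch below evaluates a few scenarios whose probabilities and dollar consequences span several orders of magnitude; expressing each risk in decibels keeps the numbers comparable at a glance, and the comment shows why constant risk plots as a slope of minus one on log-log axes:

import math

# Hypothetical scenarios: (description, probability per year, consequence in dollars).
scenarios = [
    ("frequent, minor",    0.5,    1.0e4),
    ("occasional, major",  0.01,   1.0e6),
    ("rare, catastrophic", 1.0e-4, 1.0e9),
]

for name, p, c in scenarios:
    risk = p * c
    # log10(c) = log10(risk) - log10(p): a straight line of slope -1 on log-log axes.
    db = 10.0 * math.log10(risk)  # risk in decibels, relative to $1 per year
    print "%-20s risk = $%10.0f/yr  = %5.1f dB" % (name, risk, db)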

Next time, a rather loud (pun intended) criticism of Cox's general risk matrix principles. :-)

Failures During Runtime

Researchers recently published online a PDF entitled: A Characteristic Study on Failures of Production Distributed Data-Parallel Programs. The data they used was provided by Microsoft. The programs were all MapReduce-like in structure, composed of "declarative SQL-like queries and imperative C# user-defined functions."

Interesting to me was that the authors collected some actual statistics about the failures encountered during runtime. I love real-world numbers. However, I found their results a bit hard to follow. Here is what I got out of the paper.

They ignored operating system and hardware failures. Of the run-time errors considered: 15% were "logic" errors, 23% were "table" errors and 62% were "row" errors.

They gave examples of logic errors such as: cannot find DLLs or scripts for execution, accessing array elements with an out-of-range index, accessing dictionary items with a non-existent key, or other data-unrelated failures.

Table errors such as: accessing a column with an incorrect name or index, a mismatch between data schema and table schema, or other table-level failures.

And row errors such as: corrupt rows with exceptional data, using illegal arguments, dereferencing null column values, exceptions in user-defined functions, allowing out-of-memory errors, or other row-level failures.

Also interesting was that, according to the authors, 8% of runtime errors could not have been caught by programmers during testing using their current debugging toolset.

It seems possible to create check-in and checkout documentation processes for a development organization's SCM system that could automatically generate statistics similar to the above. I think this would have a positive effect on software quality. For example, the researchers suggest that many failures have a common fix pattern that could be automated. Whether the cost would be worth the effort--I don't know. But it does seem obvious that SCM should be a prime source of quality-related data.
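As a rough sketch of what I have in mind, suppose each check-in comment carried a failure-category tag drawn from a commit template (the tag names and the tiny log below are made up). The kind of breakdown the researchers report could then be tallied mechanically:

from collections import Counter

# Hypothetical check-in log: (change id, failure category from the commit template).
checkins = [
    ("r101", "row"), ("r102", "table"), ("r103", "row"),
    ("r104", "logic"), ("r105", "row"), ("r106", "table"),
]

counts = Counter(tag for _, tag in checkins)
total = float(len(checkins))
for tag, n in counts.most_common():
    print "%-6s %3d  (%2.0f%%)" % (tag, n, 100.0 * n / total)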

Programmers Must Consider Risk

There is a thoughtful programmer-oriented blog called The Codist written by Andrew Wulf. In a recent posting he starts off:
I need to step outside my usual persona of writing about programming to comment on the happenings of the past few days.
In Boston two brothers decided to blow up the Marathon, and an hour from my house half the city of West, Texas was blown to pieces in a massive explosion.
However, he goes on to discuss these events in a way that I don't think is actually outside the realm of programming. Why? He talks about risk.

As I have written about before, there is always the risk that software may not perform correctly. The general risks, both benefits and consequences, will be different for different stakeholders. (I wrote that software should be independently verified and validated.) The software development effort must understand, communicate and help deal with the risks associated with the software for all its stakeholders.

In this post I simply note that software is an integral part of modern society. Thus, risk is an integral part of software engineering. In fact, risk is a fundamental concept in engineering generally.

Software may have played an important part in both of the events Andrew mentions. For the Boston Marathon, I am pretty sure that data mining software such as face recognition algorithms were used in identifying the suspects. No details about the Texas event are publicly available yet, but with SCADA systems being common in plants nowadays, I can easily imagine software being important there too.