IMHO, the predictive skill of the climate models has not been formally and empirically demonstrated (as in independent verification and validation, IV&V).
This is the same position I had back in March, when I posted a note on the current state of the climate model software.
Jon Pipitone has performed a study of the quality of software in climate modeling. I mention Pipitone's work because it was brought to my attention that Steve Easterbrook linked to it in a statement he made in a blog post yesterday:
Our research shows that earth system models, the workhorses of climate science, appear to have very few bugs...

Does a statement like this AMHO (affect my humble opinion)? Unless I take it out of context -- no. What is the context?
In a blog post describing Jon Pipitone's work, Easterbrook writes:
I think there are two key results of Jon’s work:
1. The initial results on defect density bear up. Although not quite as startlingly low as my back-of-the-envelope calculation, Jon’s assessment of three major GCMs indicates they all fall in the range commonly regarded as good quality software by industry standards.
2. There are a whole bunch of reasons why result #1 may well be meaningless, because the metrics for measuring software quality don’t really apply well to large scale scientific simulation models. [Emphasis added.]
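For readers who have not met the metric: defect density is simply the number of confirmed defects divided by the size of the code base, conventionally reported per thousand lines of code (KLOC), and figures around or below one defect per KLOC are the sort often quoted as "good quality" in industry. The sketch below is purely illustrative -- the defect count and code size are invented, not taken from Pipitone's thesis or Easterbrook's post.

```c
#include <stdio.h>

/* Illustrative only: defect density reported as defects per thousand
 * lines of code (KLOC). The numbers below are invented for the example
 * and are not figures from Pipitone's thesis. */
static double defect_density(int defects, long lines_of_code)
{
    return (double)defects / ((double)lines_of_code / 1000.0);
}

int main(void)
{
    /* A hypothetical model: 30 confirmed defects in 500,000 lines of code. */
    printf("%.3f defects/KLOC\n", defect_density(30, 500000L)); /* prints 0.060 */
    return 0;
}
```

The catch, and the substance of result #2, is that a number like this only means something if defects are found, reported, and counted in a way comparable to the industry projects the benchmark comes from -- which is precisely what is in question for climate models.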
And in the Conclusion of his thesis Pipitone writes:
The results of our defect density analysis of three leading climate models show that they each have a very low defect density, across several releases. A low defect density suggests that the models are of high software quality, but we have only looked at one of many possible quality metrics. As we have discussed, knowing which metrics are relevant to climate modelling software quality, and understanding precisely how they correspond to the climate modellers' notions of software quality (as well as our own) is the next challenge to take on in order to achieve a more thorough assessment of climate model software quality. [Emphasis added.]
We found a variety of code faults from our static analysis. The majority of faults common to each of the models are due to unused code or implicit type manipulation. From what we know of the construction of the models, there is good reason to believe that many of these faults are the result of acknowledged design choices -- most significantly, those that allow for the flexible configuration of the models. Of course, without additional study, it is not known whether the faults we have uncovered point to weaknesses in the code that result in operational failures, or, more generally, what the impact of these faults is on model development and use. [Emphasis added.]
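To give a concrete sense of the fault categories Pipitone mentions, the snippet below shows the kind of finding a static analyzer typically reports as "unused code" or an "implicit type manipulation." It is written in C rather than the Fortran the models are built in, and it is not drawn from any climate model; it is only meant to show what such fault counts are counting, not to suggest these particular constructs appear in the models.

```c
#include <stdio.h>

/* Illustrative only -- not taken from any climate model. Two classic
 * static-analysis findings: an unused variable (dead code) and an
 * implicit narrowing conversion (implicit type manipulation). */
static double column_mean(const double *values, int n)
{
    int debug_level = 0;                /* flagged: declared but never used */
    double sum = 0.0;
    for (int i = 0; i < n; i++)
        sum += values[i];
    return sum / n;
}

int main(void)
{
    double temps[3] = {271.4, 272.1, 273.0};
    float mean = column_mean(temps, 3); /* flagged: implicit double-to-float narrowing */
    printf("%f\n", mean);
    return 0;
}
```

As the conclusion above notes, findings like these may reflect deliberate design choices rather than weaknesses that cause operational failures; whether they matter in practice is exactly what remains unstudied.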
And in describing possible threats to the validity of his thesis, Jon writes:
We do not yet understand enough about the different types of climate modelling organisations to hope to make any principled sampling of climate models that would have any power to generalize. [Emphasis added.] Nevertheless, since we used convenience and snowball sampling to find modelling centres to participate in our study, we are particularly open to several biases. For example:
* Modelling centres willing to participate in a study on software quality may be more concerned with software quality themselves;
* Modelling centres which openly publish their climate model code and project artifacts may also be more concerned with software quality;
In addition, our selection of comparator projects was equally undisciplined. We simply chose projects that were open-source, and that were large enough and well-known enough to provide an intuitive, but by no means rigorous, check against the analysis of the climate models. We have also chosen to include a model component, from centre C1, amongst the GCMs from the other centres we analysed. Even though this particular model component is developed as an independent project, it is not clear to what extent it is comparable to a full GCM.
Our choice to use defect density and static analysis as quality indicators was made largely because we had existing publications to compare our results with, not because we felt these techniques are necessarily good indicators. Furthermore, whilst gauging software quality is known to be tricky and subjective, most sources suggest that it can only accurately be done by considering a wide range of quality indicators [21, 3, 1, 17]. Thus, at best, our study can only hope to present a very limited view of software quality. [Emphasis added.]
Thus, "there are a bunch of reasons" why Easterbrook's statement "may well be meaningless".