A Note on the Quality of the Climate Model Software

In response to a comment on my last post, I wrote that:
IMHO, the predictive skill of the climate models has not been formally and empirically demonstrated (as in IV&V).

This is the same position I had back in March, when I posted a note on the current state of the climate model software.

Jon Pipitone has performed a study of the quality of software in climate modeling. I mention Pipitone's work because it was brought to my attention that Steve Easterbrook links to it in support of a statement he made in a blog post yesterday:
Our research shows that earth system models, the workhorses of climate science, appear to have very few bugs...
Does not a statement like this AMHO (affect my humble opinion)? Unless I take it out of context -- no. What is the context?

In a blog post describing Jon Pipitone's work, Easterbrook writes:
I think there are two key results of Jon’s work:

1. The initial results on defect density bear up. Although not quite as startlingly low as my back-of-the-envelope calculation, Jon’s assessment of three major GCMs indicates they all fall in the range commonly regarded as good quality software by industry standards.

2. There are a whole bunch of reasons why result #1 may well be meaningless, because the metrics for measuring software quality don’t really apply well to large scale scientific simulation models. [Emphasis added.]
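
To make result #1 concrete: defect density is normally reported as defects per thousand source lines of code (KSLOC). Here is a minimal sketch of the arithmetic; the defect counts and line totals in it are hypothetical placeholders, not figures from Pipitone's thesis.

    # Defect density as (reported defects) / KSLOC -- a minimal sketch.
    # The release data below is made up for illustration only.

    def defect_density(defect_count: int, sloc: int) -> float:
        """Defects per thousand source lines of code (KSLOC)."""
        return defect_count / (sloc / 1000.0)

    # Hypothetical release history for one model
    releases = {
        "model_v1.0": {"defects": 25, "sloc": 400_000},
        "model_v1.1": {"defects": 18, "sloc": 420_000},
    }

    for name, rel in releases.items():
        print(f"{name}: {defect_density(rel['defects'], rel['sloc']):.3f} defects/KSLOC")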

And in the Conclusion of his thesis Pipitone writes:
The results of our defect density analysis of three leading climate models show that they each have a very low defect density, across several releases. A low defect density suggests that the models are of high software quality, but we have only looked at one of many possible quality metrics. As we have discussed, knowing which metrics are relevant to climate modelling software quality, and understanding precisely how they correspond to the climate modellers' notions of software quality (as well as our own), is the next challenge to take on in order to achieve a more thorough assessment of climate model software quality. [Emphasis added.]

We found a variety of code faults from our static analysis. The majority of faults common to each of the models are due to unused code or implicit type manipulation. From what we know of the construction of the models, there is good reason to believe that many of these faults are the result of acknowledged design choices -- most significantly, those that allow for the flexible configuration of the models. Of course, without additional study, it is not known whether the faults we have uncovered point to weaknesses in the code that result in operational failures, or, more generally, what the impact of these faults is on model development and use. [Emphasis added.]
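
As an aside on what an "unused code" fault is: a static analyser walks the source and reports names that are defined but never read. The models are written largely in Fortran and I do not know the exact tools used; the fragment below is only a Python analogue of that class of check, written for illustration.

    # A toy unused-assignment check using Python's ast module -- illustrative
    # of the class of fault a static analyser reports, not the tool used in
    # the thesis.
    import ast

    def unused_assignments(source: str) -> set:
        """Return names assigned at module level but never read."""
        assigned, used = set(), set()
        for node in ast.walk(ast.parse(source)):
            if isinstance(node, ast.Name):
                if isinstance(node.ctx, ast.Store):
                    assigned.add(node.id)
                elif isinstance(node.ctx, ast.Load):
                    used.add(node.id)
        return assigned - used

    example = (
        "scale = 1.0e-3\n"       # read below
        "debug_flag = True\n"    # never read: would be flagged
        "result = scale * 42\n"  # assigned but never read: also flagged
    )
    print(unused_assignments(example))  # {'debug_flag', 'result'}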

And in describing possible threats to the validity of his thesis, Jon writes:
We do not yet understand enough about the different types of climate modelling organisations to hope to make any principled sampling of climate models that would have any power to generalize. [Emphasis added.] Nevertheless, since we used convenience and snowball sampling to find modelling centres to participate in our study, we are particularly open to several biases [10]. For example:

* Modelling centres willing to participate in a study on software quality may be more concerned with software quality themselves;

* Modelling centres which openly publish their climate model code and project artifacts may also be more concerned with software quality;

In addition, our selection of comparator projects was equally undisciplined. We simply chose projects that were open-source, and that were large enough and well-known enough to provide an intuitive, but by no means rigorous, check against the analysis of the climate models. We have also chosen to include a model component, from centre C1, amongst the GCMs from the other centres we analysed. Even though this particular model component is developed as an independent project, it is not clear to what extent it is comparable to a full GCM.

Our choice to use defect density and static analysis as quality indicators was made largely because we had existing publications to compare our results with, not because we felt these techniques are necessarily good indicators. Furthermore, whilst gauging software quality is known to be tricky and subjective, most sources suggest that it can only accurately be done by considering a wide range of quality indicators [21, 3, 1, 17]. Thus, at best, our study can only hope to present a very limited view of software quality. [Emphasis added.]

Thus, "there are a bunch of reasons" why Easterbrook's statement "may well be meaningless".

10 comments:

  1. George - you're playing word games. "appear to have very few bugs" is exactly the right description of Jon's study. You try doing better than that in six words. There are always caveats and limitations on every scientific study, which is why scientists value deep knowledge of the literature.

    So, you're pointing out that the evidence I supplied was tentative. You haven't pointed out any evidence that contradicts it. Which means you're playing the common denialist game of equating tentative evidence with knowing nothing.

    And did you really understand the conclusions of Jon's work? He's saying that conventional measures of software quality might not apply to this type of software. Yeah, you heard that: the kind of conventional measures that IV&V typically applies. The implication, if you follow it through, is that calling for conventional IV&V on scientific simulation models is likely to be meaningless.

  2. If I may ask George and Steve,

    What are the specific characteristics of 'earth system models' that suggest, out of hand, that 'conventional IV&V' is meaningless? How do these compare/contrast with software for which there are thousands of proven, successful applications of these conventional processes and procedures?

    Especially, what are the specific characteristics that prevent successful application of proven Verification processes and procedures? How do these compare / contrast with those of software for which these have been successfully applied?

    Thanks

  3. You will just have to put up with the "word games," Steve. Words are all we have to work with. Just like it was pointed out to me that I have to put up with the interjection of the word "denialist" in this discussion about software IV&V. Do what I intend to do -- ignore it.

    Science is by definition experimental. I would prefer not to base important, perhaps even future-of-all-mankind critical, policy decisions on experimental software. I want the science to be settled. How I like to phrase it: we need to turn the scientific climate model codes into engineering climate model codes, like those described in the following paragraph. (I don't want to take a trip in an experimental aircraft. A production Boeing jet is just fine.)

    The National Nuclear Security Administration (NNSA) has an Advanced Simulation and Computing (ASC) program for stewardship of our nuclear stockpile. Since it is no longer legally possible to conduct experiments to confirm the safety, performance, and reliability of our nuclear weapons stockpile, the ASC must rely on software codes to ensure confidence in our nuclear deterrent. IMHO, the ASC V&V process is in no way meaningless. Why can't the climate modelers adopt a strategy similar to that of the ASC?

  4. George, I was chief scientist at NASA's IV&V facility in the '90s. I know IV&V inside out. It sometimes adds value and sometimes doesn't, depending on who's doing the IV&V and what relationship they form with the development contractor. It's very useful for when the software development is contracted out, because it gives the customer a second expert opinion in place of the developer's reassurances that everything is just fine. However, IV&V fails completely if the IV&V team don't have at least as much domain expertise as the development team. Preferably a lot more.

    Climate science doesn't need it because they have a much better approach. They have 20-30 other labs around the world building software to do the same thing, and a regular process of model intercomparison projects that test the models on a whole series of benchmarks. It's like having 20-30 IV&V contractors all at once. And of course, these are the best domain experts in the world. If there are problems in the models, the folks from the other labs find them real fast.
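
    In miniature, an intercomparison looks something like this: several independently built models run the same benchmark and are scored against a common reference, and an outlier stands out immediately. The lab names and numbers below are invented for illustration only.

        # Miniature model intercomparison: score independent runs of the same
        # benchmark against one reference series. All names and numbers are
        # hypothetical.
        import math

        reference = [14.1, 14.3, 14.2, 14.5, 14.6]    # e.g. observed annual means

        model_runs = {
            "lab_A": [14.0, 14.2, 14.3, 14.4, 14.7],
            "lab_B": [14.2, 14.4, 14.1, 14.6, 14.5],
            "lab_C": [13.5, 13.6, 13.8, 13.9, 14.0],  # the outlier the other labs would query
        }

        def rmse(pred, obs):
            return math.sqrt(sum((p - o) ** 2 for p, o in zip(pred, obs)) / len(obs))

        for lab, run in sorted(model_runs.items(), key=lambda kv: rmse(kv[1], reference)):
            print(f"{lab}: RMSE = {rmse(run, reference):.3f}")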

    It really bothers me when people come to this field with their favourite hammers in their hands, wanting to hit everything whether it's a nail or not. If standard IV&V practices were relevant to this field I would be calling for them left, right, and center. After all, I was part of the team that pushed NASA into expanding IV&V onto all of its spacecraft development projects, rather than just the shuttle program.

  5. It really, really bothers me when people come to this field and all they have to offer is an Appeal to Authority. An appeal that is nothing more than the blackest of black boxes.

    I'm awaiting an answer to my questions above. The answer that I'm seeing now is, Because we don't want to.

    In which of the hundreds of official specifications of IV&V processes and procedures can we find a description of the approach proposed here?

    Where can we find the reports that summarize the results of the benchmarking exercises that have been conducted by any of the 20-30 labs on any of the earth system models?

    The dangers of the approach proposed here have been identified and known for decades. Here's an example:

    http://ccs.chem.ucl.ac.uk/~shantenu/vol58no1p35_41.pdf

    written over six years ago.

  6. Because of the inherently and irreducibly subjective nature of IV&V (no way to avoid those Bayesian priors) a subjective Appeal to Authority argument is not inappropriate. I can respect that.

    That being said, subjectivity cuts both ways. I can clearly understand edhaimichat's point. Putting it my own way, IV&V is a domain of expertise unto itself and growing. And when the climate community decides NOT to follow the consensus IV&V practices of these other IV&V experts, it cannot justify doing so simply based on an appeal to its own authority. There exists greater IV&V authority.

    For example, virtually all SQA processes allow for tailoring and grading of requirements traceability based on the nature and use of the software. Climate IV&V MUST produce this documentation.

    edhaimichat isn't keeping up with the literature. The Post and Votta paper he links to is a great paper (it's a couple of years since I last read it, so it's nice to get a chance to re-read it). But it's old, and mostly irrelevant to the current discussion. The six case studies it draws its conclusions from were all done in the late '90s. And none of them are in climate science. Post and Votta went on to do several further case studies, one of which was a numerical weather forecasting code (not quite a climate model, but it was as close as they got). The paper is here:
    http://www.computer.org/portal/web/csdl/doi/10.1109/MS.2008.86
    In the study, they report that the V&V practices are reasonably good, and appropriate to the context. But one of the key observations they make is that for this type of code, agile practices are much more suitable. Agile practices don't include writing requirements specifications, nor do they involve requirements traceability. The whole point is that the requirements cannot be known up front, so trying to specify them is a waste of time. You build a numerical model incrementally, incorporating new approaches and new bits of physics whenever they have proven to be mature enough (through repeated experimentation). You test the hell out of every minor change to the software, running the code through an automated test suite every night. You test it on a set of standard benchmarks. You do the convergence tests for the dynamics. You test for conservation. You measure the rmse against observations. You investigate anomalies, and run process studies to explore weaknesses in the model.
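
    To give the flavour of one such check: a conservation test asserts that a conserved quantity (mass, energy, tracer totals) stays fixed to within round-off across many timesteps. The model step below is a toy stand-in written for this sketch, not code from any real model.

        # Sketch of a conservation test. step_model() here is a toy stand-in
        # for one model timestep; real test suites run the actual model.
        import numpy as np

        def step_model(state):
            """Toy timestep: moves 'mass' between cells without creating any."""
            flux = 0.1 * (np.roll(state, -1) - state)   # diffusion-like transport
            return state + flux - np.roll(flux, 1)

        def test_mass_conservation():
            state = np.random.default_rng(0).uniform(1.0, 2.0, size=100)
            total_before = state.sum()
            for _ in range(1000):
                state = step_model(state)
            # conserved to within floating-point round-off
            assert abs(state.sum() - total_before) <= 1e-9 * total_before

        test_mass_conservation()
        print("mass conservation test passed")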

    But you don't bother with requirements traceability because it makes no sense in this domain.

    So it's no good sitting there with your fingers in your ears saying "they gotta do this, they gotta do that". Go and see how sophisticated their testing and validation processes are. Learn a bit about what it actually takes to write and test code in this domain. And put that hammer down!

  8. You did not seem to notice that I was talking about the metarequirements. The SQA processes. The requirements for the requirements. The requirements that form the basis we use to decide between, say, agile practices and, say, waterfall practices. The requirements applicable for, say, systematic grid convergence studies. Some are more physics oriented, some more algorithmic/methodological.

    BTW, the nearest thing to a hammer I know of is the 2009 edition of Roache's book: Fundamentals of Verification and Validation [in Computational Science and Engineering]. But it is more physics oriented, and more heat and CFD at that. So I agree with you. No hammer.

    But, finger in my ear? If you really knew me you would have said nose :-)

    BTW, Sorry for the delay in posting your comment. Google recently introduced a new automatic spam detector for Blogger and your comment was in the spam bin till I noticed it. Can't imagine why. No trigger words that I could see. Unless "edhaimichat" or "votta" is something offensive in some language.

    Thanks for the comments Steve. My understanding is that you suggest we "Go and see how sophisticated their testing and validation processes are." And stop insisting on using the "hammer" of conventional IV&V. Your suggestion is based on an argument from authority, and on an argument that spending time with climate scientists and building up domain expertise will make their approach to IV&V, let us say, less opaque.

    But my view is, and remains, different. Note Feynman's "almost" definition of science -- the sole test of knowledge is experiment. IMHO, the predictive skill of the climate models has not been formally and empirically demonstrated (as in IV&V).

    Roger Pielke Sr., in an interview linked to in yesterday's Die Klimazwiebel, states:

    Models are powerful tools with which to understand how the climate system works on multi-decadal time scales, as long as there are observations to compare reality with the model simulations. However, when they are used for predictions of environmental and societal impacts decades from now, in which there is no data to validate them, such as the IPCC predictions decades into the future, they present a level of forecast skill to policymakers that does not exist. These predictions are, in reality, model sensitivity studies, and as such this major limitation in their use as predictions needs to be emphasized. Unless accompanied by an adequate recognition of this large uncertainty, they imply a confidence in the skill of the results that is not present.


    The climate models cannot be IV&V'ed because there is no data with which to validate them.

    This is a very different view of the models from Easterbrook's view of the climate model software as "scientific instruments".
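
    To be concrete about what "formally and empirically demonstrated" skill would look like, in miniature: take an archived forecast, wait for the observations, and score it against a naive baseline such as climatology. All the numbers below are invented; only the shape of the test matters.

        # Sketch of a forecast skill score against a climatology baseline.
        # Data is hypothetical; skill = 1 is perfect, 0 is no better than the
        # baseline, negative is worse.
        observations = [0.32, 0.41, 0.38, 0.47, 0.55]    # values observed after the fact
        hindcast     = [0.30, 0.38, 0.42, 0.50, 0.52]    # what the model predicted
        climatology  = [0.35] * len(observations)        # naive baseline

        def mse(pred, obs):
            return sum((p - o) ** 2 for p, o in zip(pred, obs)) / len(obs)

        skill = 1.0 - mse(hindcast, observations) / mse(climatology, observations)
        print(f"skill score vs climatology: {skill:.2f}")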

  10. Here are some slides by a NASA researcher, titled Dimensions of Credibility in Models and Simulations. Slides 29 through 31 detail why software engineering requirements are insufficient to capture the needed credibility-ensuring/demonstrating activities. They make the point that models are sometimes implemented in hardware, in which case software engineering expertise is of little use, but VV&UQ is still critical to proper decision support.

    Steve, at your "IV&V facility", were you doing software that "performs a task within a system" or software that "provides a representation of a system" (slide 30)?
