Can Scientific/Engineering Code Be Validated? Part 2

This is a continuation of my previous post. There I noted that I interpret software validation more broadly than Roache does: I believe it can be applied to embodied code as well as to documented theory. Here I continue to present how Roache's interpretation of validation may differ somewhat from my own.

In Appendix B, Roache starts off with a commonly used definition for validation:
Validation: The process of determining the degree to which a model {and its associated data} is an accurate representation of the real world from the perspective of the intended uses of the model.
Roache sees three main issues with people properly interpreting this definition. One issue is with the phrase "intended uses." His recommendation is:
Intended use, at least in its specific sense, is not required for validation. The common validation definition could be salvaged by re-defining intended use to include very general intentions, but frankly this appears to be a hollow exercise. The fact is that a useful validation exercise does not necessarily require an intended use, specific or general.
This recommendation is developed using arguments such as:
Clearly, much of the confusion is the result of trying to use the same word for different needs. Project oriented engineers are more concerned with specific applications, and naturally tend to rank acceptability within validation (which term is used more often than accreditation or certification). Research engineers and scientists tend to take a broader view, and often would prefer to use validation to encompass only the assessment of accuracy level, rather than to make decisions about whether that level is adequate for unspecified future uses. It is also significant to recognize that these project-specific requirements on accuracy are often ephemeral, so it is difficult to see a rationale for a priori rigid specifications of validation requirements [5,11] when the criteria so often can be re-negotiated if the initial evaluation fails narrowly.
The requirement for "intended use" sounds good at first, but it fails upon closer thought. Did D. C. Wilcox [13] need to have an "intended use" in mind when he evaluated the k-w RANS turbulence models for adverse pressure gradient flows? He may very well have had uses in mind, but does a modeler need to have the same use in mind two decades later? If not, must the validation comparison be repeated? Certainly not.
But who would want to repeat it?

Validation is subjective. (As Roache puts it -- ephemeral.) So it logically must be performed from some perspective. Whose perspective? The software's stakeholders. But unless usage is predefined, not all of the software's potential stakeholders have been identified. How can their potentially differing priors be ignored?

Roache evidently believes validation can be made objective: that acceptability, accreditation, and certification can be separated out from validation, and that the degree to which a model is an accurate representation of the real world can be decided upon by some abstract, objective algorithm. No Bayesian priors required.

But I could not disagree more. So I highly recommend reading Roache for a viewpoint different than my own.

Can Scientific/Engineering Code Be Validated?

I am starting to read Patrick J. Roache's book, The Fundamentals of Verification and Validation. I thought I knew the fundamentals of IV&V for scientific and engineering software already, but reading Roache seems to have me feeling a bit ignorant.

Be that as it may, I think I must disagree with the limited scope of Roache's definition of validation.

In Appendix B of the book, "Validation -- What Does it Mean?", Roache writes:
Before examining the definition of validation, we need to make a small distinction on what it is we are claiming to validate, i.e., between code and model. A model is incorporated into a code, and the same model (e.g. some RANS model) can exist in many codes. Strictly speaking, it is the model that is to be validated, whereas the codes need to be verified. But for a model to be validated, it must be embodied in a code before it can be run. It is thus common to speak loosely of "validating a code" when one means "validating the model in the code".

I think this distinction is too limiting. The embodied code must be validated too.

IMHO, Roache is using the word verification in the sense of formal verification. That's fine, except scientific and engineering software can rarely be heroically tested. Formal proof of such software's correctness is a practical impossibility. Does Roache really think verification is impossible?

Suppose I found a DVD case on the sidewalk and inside was a DVD with a label that said: "Models the Earth's Climate." I put the DVD in my computer and, sure enough, on the DVD is what appears to be a complete and sophisticated complex climate model. How would I go about verifying such a software's outputs? What amount of testing would be sufficient? What verification processes would I choose to use?

On the other hand, suppose I obtain funding to develop, from scratch, a new and complete Earth climate modeling software. What methodologies would I choose to develop and test the software? Would I think it important to verify the processes completed at various stages of the software's development?

And here is the rub. Suppose that it turns out each software actually has the same physics model. Nevertheless, would I need to validate that the different processes I used to verify the software on the DVD and to verify the software built from scratch were appropriate for each? Yes! The verification processes for each software would be different. These differences must be validated as appropriate and effective.

So if I feel a bit ignorant under a limited definition of validation, I now feel even more so under an expanded definition.

The I in IV&V is Important

It was pointed out in the last post that software verification and validation (V&V) are not purely exercises in deductive logic. A comment to the post explicitly noted that essential components are based on probabilistic reasoning. The basic point of the post was that the result of V&V is not a proof of certainty.

Rather, software V&V is a measure of the acceptability of the risk that the software may fail to perform properly and thus not provide the desired benefits, or that the consequences of using the software may even be negative. (Risk is defined as probability times consequence.)

And so here I make a quick note that the point of the previous post is not the only common and important philosophical misunderstanding about V&V. There is often a failure to realize that software V&V must be independent verification and validation (IV&V).

There is a general consensus that the process by which software is developed can add to or subtract from the quality of the final software product. But the degree to which this occurs is a subjective judgment. Different software stakeholders will have different opinions.

Also, the potential consequences of using software are different for different stakeholders. Just as the climate affects different groups of people differently, an error in the global climate models could potentially be misused to affect different people to differing degrees.

The bottom line is that the estimated risk associated with any software can vary greatly (even in sign) depending on which stakeholders are being used as the reference. Thus, software V&V must not be restricted to an activity that is performed by a single software stakeholder. That would not be fair. Software V&V must be IV&V such that all stakeholders are considered fairly.
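To make the "even in sign" point concrete, here is a minimal sketch that applies the probability-times-consequence definition to several stakeholders. The stakeholder names, the failure probability, and the consequence values are all hypothetical, chosen only to show how the same failure can score very differently depending on who is used as the reference.

```python
def risk(probability, consequence):
    """Risk as defined in this post: probability times consequence."""
    return probability * consequence

# Hypothetical numbers, for illustration only. A positive consequence is a
# loss to that stakeholder; a negative one is a net gain.
P_FAILURE = 0.01
consequences = {
    "developer": 100.0,
    "end user": 10000.0,
    "competitor": -500.0,
}

# The same failure probability gives a different risk for each stakeholder.
risks = {who: risk(P_FAILURE, cost) for who, cost in consequences.items()}
for who in sorted(risks):
    print("%-10s risk = %8.2f" % (who, risks[who]))
```

A V&V effort that only consulted the developer in this toy setup would see a risk two orders of magnitude smaller than the one the end user faces.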

You would think this concept would be obvious for all risk analyses (software IV&V or whatever) and far from a potential problem. Unfortunately, this is not the case. For example, how worried should we be about driving a Toyota? According to popular NYT blogger Robert Wright:
My back-of-the-envelope calculations (explained in a footnote below) suggest that if you drive one of the Toyotas recalled for acceleration problems and don’t bother to comply with the recall, your chances of being involved in a fatal accident over the next two years because of the unfixed problem are a bit worse than one in a million — 2.8 in a million, to be more exact. Meanwhile, your chances of being killed in a car accident during the next two years just by virtue of being an American are one in 5,244.

So driving one of these suspect Toyotas raises your chances of dying in a car crash over the next two years from .01907 percent (that’s 19 one-thousandths of 1 percent, when rounded off) to .01935 percent (also 19 one-thousandths of one percent).
Wright does not think these numbers are of much concern. But IMHO, he fails to understand that one stakeholder in the issue (Toyota) should not decide the risk for another (the public). For he writes:
But lots of Americans seem to disagree with me. Why? I think one reason is that not all deaths are created equal. A fatal brake failure is scary, but not as scary as your car seizing control of itself and taking you on a harrowing death ride. It’s almost as if the car is a living, malicious being.
IMHO, it's not that all deaths are not created equal -- it's that not all risk analyses are.

This was also noted in Chance News #62, where we have the following questions being asked about Wright's discussion of these numbers:
  1. People seem to make a distinction between risks that they place upon themselves (e.g., talking on a cell phone while driving) and risks that are imposed upon them by an outsider (e.g., accidents caused by faulty manufacturing). Is this fair?
  2. Contrast the absolute change in risk (.01935-.01907=.00028) with the relative change in risk (.01935/.01907=1.0147). Which way seems to better reflect the change in risk?
  3. Examine the assumptions that Robert Wright uses. Do these seem reasonable?
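The arithmetic behind the second question can be checked in a few lines. This sketch uses only the two figures quoted from Wright (one in 5,244 and 2.8 in a million) and, like his column, simply adds the two probabilities; the variable names are my own.

```python
# Figures quoted from Wright's column (two-year horizon).
p_baseline = 1.0 / 5244   # killed in a car accident over the next two years
p_recall = 2.8e-6         # fatal accident due to the unfixed recall problem

# Wright's comparison simply adds the extra chance to the baseline.
p_total = p_baseline + p_recall

baseline_pct = 100.0 * p_baseline
total_pct = 100.0 * p_total
absolute_change_pct = total_pct - baseline_pct   # in percentage points
relative_risk = p_total / p_baseline             # dimensionless ratio

print("baseline risk:   %.5f %%" % baseline_pct)
print("with recall:     %.5f %%" % total_pct)
print("absolute change: %.5f percentage points" % absolute_change_pct)
print("relative risk:   %.4f" % relative_risk)
```

The same two inputs read as "a change of 0.00028 percentage points" or as "about a 1.5 percent increase in risk", which is exactly the framing choice the Chance News question is probing.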

IV&V is not Impossible

There is a very important reason why I have devoted a couple of posts to the scientific method. The posts lay the groundwork for addressing an issue concerning the independent verification and validation (IV&V) of science and engineering software.

The very important issue? Many people feel IV&V is impossible.

In an article in the Feb. 4, 1994 issue of Science Magazine, Oreskes et al. make the following argument:
Verification and validation of numerical models of natural systems is impossible. This is because natural systems are never closed and because model results are always nonunique. Models can be confirmed by the demonstration of agreement between observation and prediction, but confirmation is inherently partial. Complete confirmation is logically precluded by the fallacy of affirming the consequent and by incomplete access to natural phenomena. Models can only be evaluated in relative terms, and their predictive value is always open to question. The primary value of models is heuristic.
This argument should be taken seriously. After all, Science is a peer reviewed publication that tries to represent the best of quality science. Additionally, there does not seem to be much in the way of direct, forceful rebuttal of this argument easily and freely available on the WWW. AFAIK, most of what is available is either dismissive of the argument or in basic agreement with it.

For example, Patrick J. Roache is rather dismissive and writes in a paper on the quantification of uncertainty in computational fluid dynamics:
In a widely quoted paper that has been recently described as brilliant in an otherwise excellent Scientific American article (Horgan 1995), Oreskes et al (1994) think that we can find the real meaning of a technical term by inquiring about its common meaning. They make much of supposed intrinsic meaning in the words verify and validate and, as in a Greek morality play, agonize over truth. They come to the remarkable conclusion that it is impossible to verify or validate a numerical model of a natural system. Now most of their concern is with groundwater flow codes, and indeed, in geophysics problems, validation is very difficult. But they extend this to all physical sciences. They clearly have no intuitive concept of error tolerance, or of range of applicability, or of common sense. My impression is that they, like most lay readers, actually think Newton’s law of gravity was proven wrong by Einstein, rather than that Einstein defined the limits of applicability of Newton. But Oreskes et al (1994) go much further, quoting with approval (in their footnote 36) various modern philosophers who question not only whether we can prove any hypothesis true, but also “whether we can in fact prove a hypothesis false.” They are talking about physical laws -- not just codes but any physical law. Specifically, we can neither validate nor invalidate Newton’s Law of Gravity. (What shall we do? No hazardous waste disposals, no bridges, no airplanes, no ...) See also Konikow & Bredehoeft (1992) and a rebuttal discussion by Leijnse & Hassanizadeh (1994). Clearly, we are not interested in such worthless semantics and effete philosophizing, but in practical definitions, applied in the context of engineering and science accuracy.
Ahmed E. Hassan, on the other hand, seems in basic agreement with Oreskes and writes in a fairly recent review paper on the validation of numerical ground water models:
Many sites of ground water contamination rely heavily on complex numerical models of flow and transport to develop closure plans. This complexity has created a need for tools and approaches that can build confidence in model predictions and provide evidence that these predictions are sufficient for decision making. Confidence building is a long-term, iterative process and the author believes that this process should be termed model validation. Model validation is a process, not an end result. That is, the process of model validation cannot ensure acceptable prediction or quality of the model. Rather, it provides an important safeguard against faulty models or inadequately developed and tested models. If model results become the basis for decision making, then the validation process provides evidence that the model is valid for making decisions (not necessarily a true representation of reality). Validation, verification, and confirmation are concepts associated with ground water numerical models that not only do not represent established and generally accepted practices, but there is not even widespread agreement on the meaning of the terms as applied to models.
Let me also mention that the Oreskes article briefly and indirectly alludes to another logical fallacy, the appeal to authority:
In contrast to the term verification, the term validation does not necessarily denote an establishment of truth (although truth is not precluded). Rather, it denotes the establishment of legitimacy, typically given in terms of contracts, arguments, and methods (27).

There are a lot of things I think would be interesting to discuss about Oreskes' article. However, this post is already getting too long. So I will only state what I feel is the strongest counter-argument and fill in the details in later posts. I do not agree with Oreskes because the scientific method, of which IV&V is a part, is not an exercise in logic. As I have already pointed out in an earlier post:
Note that even this most bare form of the scientific method contains two logical fallacies. The first is the use of abduction (affirming the consequent). The second is the partial reliance on IV&V for error management (appeal to authority). The use of abduction eliminates logical certainty from the scientific method and introduces the possibility of error. The logical shortcoming of IV&V means that finding and eliminating error is never certain.
The basic problem with Oreskes' argument is that it runs counter to the very foundations of the scientific method. The scientific method does not require logical certainty in order for it to work. The value of models is not only that they can be heuristic; it is that they can be scientific. To be anti-model is to be anti-science. Good luck with that.

Modelers HATE Python!?

I recently ran across the following by a person involved in mesoscale weather modeling and graduating meteorology majors:
Fortran is the language of choice and the reason has nothing to do with legacy code. Nearly all modelers that I know are fluent not only in Fortran, but C, C++, and Perl as well. Fortran is the language used because it allows you to express the mathematics and physics in a very clear succinct fashion. The idea here is that [while] a craftsman has many tools in his tool chest, the amateur believes everything is a nail. The only common feature in terms of programming tools amongst modelers is a universal HATRED of object-oriented programming languages, particularly Python.
Object-oriented programming is the answer to a question that nobody has ever felt the need to ask. Programming in an object-oriented language is like kicking a dead whale down the beach.

I have no doubt that this is a sincere and knowledgeable comment. And I am not saying that just because this blog observes proper decorum and thus always assumes the Principle of Charity. I think I know why such an attitude may be prevalent. (But not universal. I have modeled things and I do not hate Python.) Let me explain by way of a toy example.

Principles of Planetary Climate is an online working draft of a book on the physics of climate. The author is Raymond T. Pierrehumbert who does research and teaches at the University of Chicago. Dr. Pierrehumbert also frequently posts (as "raypierre") on the popular blog RealClimate.

There is a computational skills web page that accompanies the book. On this page is a tutorial with links to basic and advanced Python scripts for solving a simple example ordinary differential equation with one dependent variable. There is also a script that uses a numerical/graphical library called ClimateUtilities.

The basic Python script implements three different ODE integration methods (Euler, midpoint, and Runge-Kutta) for the example differential equation and then compares their error to the exact analytical solution. (Let me note, since this blog is very concerned about IV&V issues, that the only discussion of validation and verification is a brief reference to a numerical stability problem with the midpoint method. The opportunity to discuss convergence and performance issues is also missed. As is the practicality of using multiple algorithms to solve the same problem as a verification technique. Obviously the author felt such IV&V issues to be of less than fundamental importance.)
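As a rough illustration of the kind of convergence check that could accompany such a tutorial, the sketch below runs three integrators on the test equation used in my script later in this post (dy/dt = -t * y, y(0) = 1, integrated to t = 5) and estimates each method's observed order of accuracy by halving the step size. This is my own code, not the tutorial's. In particular, the "midpoint" method here is the explicit midpoint (second-order Runge-Kutta) rule, which may not be the variant the tutorial uses, and the function names and step counts are arbitrary choices.

```python
import math

def slope(y, t):
    """Right-hand side of the test ODE dy/dt = -t * y."""
    return -t * y

def euler_step(f, y, t, dt):
    return y + dt * f(y, t)

def midpoint_step(f, y, t, dt):
    # Explicit midpoint (RK2) rule; an assumption, see the text above.
    y_half = y + 0.5 * dt * f(y, t)
    return y + dt * f(y_half, t + 0.5 * dt)

def rk4_step(f, y, t, dt):
    k1 = f(y, t)
    k2 = f(y + 0.5 * dt * k1, t + 0.5 * dt)
    k3 = f(y + 0.5 * dt * k2, t + 0.5 * dt)
    k4 = f(y + dt * k3, t + dt)
    return y + dt * (k1 + 2.0 * k2 + 2.0 * k3 + k4) / 6.0

def integrate(step, f, y0, t0, t1, n_steps):
    """March from t0 to t1 in exactly n_steps equal steps."""
    dt = (t1 - t0) / n_steps
    y = y0
    for i in range(n_steps):
        y = step(f, y, t0 + i * dt, dt)
    return y

def observed_order(step, n_steps=1000):
    """Estimate the order p from err(dt) / err(dt / 2) ~ 2**p."""
    y_exact = math.exp(-5.0 ** 2 / 2.0)
    err_coarse = abs(integrate(step, slope, 1.0, 0.0, 5.0, n_steps) - y_exact)
    err_fine = abs(integrate(step, slope, 1.0, 0.0, 5.0, 2 * n_steps) - y_exact)
    return math.log(err_coarse / err_fine, 2)

for name, step in [("Euler", euler_step),
                   ("midpoint", midpoint_step),
                   ("RK4", rk4_step)]:
    print("%-8s observed order = %.2f" % (name, observed_order(step)))
```

If the observed orders come out near the theoretical 1, 2, and 4, that is modest evidence that each method is coded correctly. Agreement among three independently coded methods is the "multiple algorithms" check mentioned above.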

The advanced Python script implements the three methods using an object oriented approach. IMHO, this script clearly demonstrates the unsuitability of an "everything is a nail" object oriented approach to numerical programming. IMHO, it perfectly illustrates the point of the comment I quoted above. The "advanced" script has many more lines of code, is much more complex in design, and I am certain would execute much more slowly than the "basic" implementation.

But a forceful reply to the comment quoted above is that neither the "advanced" nor the "basic" approach is appropriate. Sure, pure Python is numerically slow, but Python comes with batteries included. Every Pythonista knows that a key feature of Python is its "extensive standard libraries and third party modules for virtually every task". Python is a great way to glue libraries and third party modules together.

But here I found out that some libraries are better than others. I was unable to successfully install the ClimateUtilities library on my version of Ubuntu Linux (9.10). So I wrote a script that uses the SciPy library instead (as well as my own version of Runge-Kutta), as shown below. Note how short and straightforward the implementation is and, if you run it yourself, how much faster it is to use a numerical library. It is practically as fast as any compiled language implementation, Fortran or whatever. And don't forget, in Python, everything is an object. (E.g., do a dir(1) in Python. Even the number one is an object!)

(I ran into some interesting numerical features. See the dt value I used below for RK4. Maybe a subject for a later post?)

# Numerically solve an ODE using RK4 or scipy's odeint.
# See gmcrewsBlog post for details.
# ODE: dy/dt = slope(y, t)
# Where: slope(y, t) = -t * y
# And: y(0.) = 1.0
# Stopping at: y(5.)
# Note that analytical solution is: y(t) = y(0) * exp(-t**2 / 2)
# So error at y(5.) may be calculated.
import math
import time
from scipy.integrate import odeint

def slope(y, t):
    '''Function to use for testing the numerical methods.'''
    return -t * y

# Parameters:
t_start = 0.
y_start = 1.
t_end = 5.

# Analytical solution:
y_exact = y_start * math.exp(-t_end**2 / 2.)
# print() with a single formatted string runs under both Python 2 and 3.
print("ODE: dy/dt = -t * y")
print("Initial condition: y(%g) = %g" % (t_start, y_start))
print("Analytical solution: y(t) = y(0) * exp(-t**2 / 2)")
print("Analytical solution at y(%g) = %g" % (t_end, y_exact))

# Do a Runge-Kutta (RK4) march and see how good a job it does:
dt = 0.000044 # chosen so that approx. same error as scipy
# However: try dt = .04 which gives even lower error!
dt = .04
runtime_start = time.time() # keep track of computer's run time
t = t_start
y = y_start
h = dt / 2.
# Note: when dt does not divide (t_end - t_start) evenly, the last step
# of this loop lands past t_end.
while t < t_end:
    k1 = dt * slope(y, t)
    th = t + h
    k2 = dt * slope(y + k1 / 2., th)
    k3 = dt * slope(y + k2 / 2., th)
    t = t + dt
    k4 = dt * slope(y + k3, t)
    y = y + (k1 + (2. * k2) + (2. * k3) + k4) / 6.
runtime = time.time() - runtime_start
err = (y - y_exact) / y_exact * 100. # percent
print("RK4 Results:")
print("dt = %g" % dt)
print("Runtime = %g seconds" % runtime)
print("Solution: y(%g) = %g" % (t_end, y))
print("Error: %g %%" % err)

# What does scipy's ode solver return?
runtime_start = time.time() # keep track of computer's run time
results = odeint(slope, y_start, [t_start, t_end])
runtime = time.time() - runtime_start
y = results[1][0] # row 1 is the solution at t_end
err = (y - y_exact) / y_exact * 100. # percent
print("scipy.integrate.odeint Results:")
print("Runtime = %g seconds" % runtime)
print("Solution: y(%g) = %s" % (t_end, y))
print("Error: %g %%" % err)
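One thing that may contribute to the odd dependence on dt noted above (a guess on my part, not a diagnosis) is the stopping test. A `while t < t_end` loop keeps stepping until t reaches or passes t_end, so whenever dt does not divide the interval evenly the final value is really y at some t slightly beyond 5. The sketch below compares that loop with one that takes a fixed number of equal steps and so ends at t_end by construction. The helper names and the dt = 0.03 example are mine, picked only to make the overshoot easy to see.

```python
import math

def slope(y, t):
    return -t * y

def rk4_step(y, t, dt):
    """One classical fourth-order Runge-Kutta step."""
    k1 = dt * slope(y, t)
    k2 = dt * slope(y + k1 / 2.0, t + dt / 2.0)
    k3 = dt * slope(y + k2 / 2.0, t + dt / 2.0)
    k4 = dt * slope(y + k3, t + dt)
    return y + (k1 + 2.0 * k2 + 2.0 * k3 + k4) / 6.0

def march_while(dt, t_start=0.0, t_end=5.0, y_start=1.0):
    """Step until t >= t_end; the last step may land past t_end."""
    t, y = t_start, y_start
    while t < t_end:
        y = rk4_step(y, t, dt)
        t = t + dt
    return t, y

def march_fixed(n_steps, t_start=0.0, t_end=5.0, y_start=1.0):
    """Take exactly n_steps equal steps, ending at t_end by construction."""
    dt = (t_end - t_start) / n_steps
    y = y_start
    for i in range(n_steps):
        y = rk4_step(y, t_start + i * dt, dt)
    return t_end, y

def percent_error(y, t_end=5.0):
    """Error relative to the analytical solution at t_end, in percent."""
    y_exact = math.exp(-t_end ** 2 / 2.0)
    return (y - y_exact) / y_exact * 100.0

t_w, y_w = march_while(0.03)   # 0.03 does not divide 5 evenly
t_f, y_f = march_fixed(167)    # nearly the same step size, exact endpoint
print("while loop: stopped at t = %g, error = %g %%" % (t_w, percent_error(y_w)))
print("fixed steps: stopped at t = %g, error = %g %%" % (t_f, percent_error(y_f)))
```

Because the solution falls off steeply near t = 5, even a small overshoot in t shows up as a sizable percentage error in y, which can swamp the truncation error of RK4 itself.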

My Opinion About Programming Languages

There are several computer languages that I have had significant experience with. Over time, I think programming languages have gotten much better.

It is possible and sometimes entertaining to analogize these most basic of programming tools by viewing them as personal weapons. In chronological order of experience, I have the following subjective opinion:

  1. Fortran == bow and arrow. An ancient weapon, I learned how to use it over 40 years ago. Yet, in its modern form, seems a perfectly usable weapon for certain specialized applications. And still fun to use.
  2. Assembly language == toothpick. Hard to use and I never quite believed I could actually kill anything substantial with it.
  3. Applesoft Basic and Turbo Pascal == decided these weren't actually weapons. More like a cocktail fork and a butter knife.
  4. C == Battle sword. Found out I could kill anything with it. But required considerable courage and expertise for large jobs. And oh, I often cut myself with the "pointy" end (pointers!). (My fellow programming warriors used to accidentally stab me with their weapons' pointy ends too, no matter how careful they tried to be.)
  5. C++ == Klingon Bat'leth. Looked like a very formidable weapon and knowledge about it was a formal requirement for the honor of being known as a true programming warrior. But somehow, I never did figure out how to use the thing exactly right. I couldn't ever kill any problem much better than just using C. At first, I tried to wield it like a battle sword. Then I tried to adopt various styles, but never really felt graceful. Now I mostly wield it like a battle sword again. Screw it. The problems get killed.
  6. Java == Catapult. Seemed like an "infernal contraption" and took a team to use it right. Most practical only for certain types of big jobs.
  7. PHP == Hammer. Nothing fancy, but a handy little tool for building sites.
  8. JavaScript == Sledgehammer. Took a lot of effort for the problems to be solved and the results looked really messy. But did seem to get the job done.
  9. Python == Star Wars lightsaber. Currently my favorite. For an old programming Jedi like me, I feel like I can elegantly kill any problem with this tool.


'''The scientific method expressed in Python syntax.

There is a study that suggests "programmers use pseudo-code and pen and paper to
reduce the cognitive complexity of the programming task." (Rachel Bellamy,
article behind ACM-paywall.) And if you do a Google search on "pseudo-code," you
will find a lot of hits that echo this sentiment.

I agree with this sentiment. In fact, as a generalist, with knowledge in many 
areas of math, science, engineering, and programming, I would like to have a 
"common language" that I can use to express myself in any technical area. Is 
this possible? 

If it is, IMHO, Python may come closest to fitting the bill. It is an expressive
language at multiple levels.

(Of course this would not be a true statement for someone unfamiliar with 
Python. They would have the added cognitive complexity of figuring out the 
language's tokens, grammar, syntax, and idiom. And what is the purpose of a
"common language" if nobody else can understand you? It seems I may have the
burden of helping to make Python popular for such a usage.)

There are other approaches like MathCad, that try to preserve the two
dimensional nature of usual mathematical notation and various common symbology.
But I guess I am not tied to tradition just for its own sake.

In programming design, the key issue is not so much to reduce complexity -- but
to contain it. The ability of object-oriented languages to contain complexity
behind an interface IMHO explains the popularity of object oriented languages.
Python's object model is a very simple one and so would seem ideal to serve
as the basis of a general-purpose pseudo-code.

Another issue almost as important is elegance. A pseudo-language has to be 
usable -- to allow complexity at a high and abstract level to be expressed in a
simple and efficient manner. 

Elegance can be styled by defining clear paradigm shifts at object interfaces.
Sometimes the pseudo-language itself has elegant ways of expressing commonly
encountered complexities. For example, NumPy's handling of arrays seems easy
to understand and simple to use. So once again Python suggests itself as a good
technical pseudo-language candidate. 

(BTW, these issues are the main reasons I have never found flow-charts or UML 
diagrams very useful for software design. Documenting the design maybe -- but 
not for creating it. Every UML document I have ever produced has always come
AFTER I have decided upon the software's design and fundamental algorithms. The
only other benefit to UML I have experienced is to brainstorm with peers at a 
whiteboard. And there I usually just start making up notation and being sloppy 
just to speed the creative process along.)

So as an example of using Python as a high-level, all-round technical pseudo-
code, consider the scientific method. My personal philosophical viewpoint is 
that the scientific method is not so much a search for objective truth about 
Nature as it is an iterative exercise in predictive computation about Nature. 
Can I express this very abstract notion simply and unambiguously in pseudo-code
using Python syntax?
'''

# Everything always has a context. Here we presuppose the current level of
# scientific knowledge.
from ScientificKnowledge import Theory, Experiment

# One would think that the scientific method would just be part of
# ScientificKnowledge. But let's pretend it's not.
def scientific_method(theory_id, lab_id):
    '''Perform the scientific method.
    theory_id = theory name or identifier
    lab_id = identifier of place and people performing the method.
    '''
    # Every "lab" has their own view of any particular scientific theory:
    my_theories = Theory(lab_id)
    # Each lab has their own experimental capabilities:
    my_experiments = Experiment(lab_id)
    # Iteratively perform the method as long as relevant to increasing our
    # overall state of scientific knowledge and practical.
    while my_theories.relevant(theory_id):
        # What was my belief in the theory before defining and performing
        # the experiment?
        prior_belief = my_theories.belief_intensity(theory_id)
        # What experiment will test the theory optimally?
        # What will be the predicted result?
        experiment, prediction = my_theories.generate(
                theory_id, my_experiments)
        if experiment is None:
            return # testing theory no longer practical
        # Perform the experiment.    
        result = experiment.perform()
        # Determine if the results of the experiment were significant.
        # Considering all possible theories, how plausible was this result?
        plausibility = my_theories.plausibility(experiment, result)
        # How likely was this result?
        likelihood = prediction.belief_intensity / plausibility
        # How does this experiment change my beliefs in this theory?
        posterior_belief = prior_belief * likelihood
        # Update my theories to reflect this new knowledge:
        my_theories.abduction(experiment, result, posterior_belief)
# Note how the shortcomings become glaring. There is no IV&V. (I guess this
# could be remedied with a try statement. Unlike most other languages, Python
# style is to use exceptions for non-ideal workflow as well as "extreme"
# exceptions.) Also, there is no mechanism for publishing experimental results
# to others. All this is good since how to improve the description is obvious.
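The belief-update step inside the loop can be made runnable on its own. The following is a minimal sketch, assuming only two competing explanations (the theory is true or it is false) so that the plausibility of a result can be computed directly. The function name, the starting belief, and the probabilities are hypothetical and only illustrate the arithmetic of prior times likelihood.

```python
def update_belief(prior, p_result_if_true, p_result_if_false):
    """One pass of the belief update used in the pseudo-code above.

    prior             -- belief in the theory before the experiment
    p_result_if_true  -- how strongly the theory predicted the observed result
    p_result_if_false -- how probable the result is if the theory is wrong
    """
    # Plausibility of the result, considering both possibilities.
    plausibility = prior * p_result_if_true + (1.0 - prior) * p_result_if_false
    # "Likelihood" in the sense used above: prediction strength / plausibility.
    likelihood = p_result_if_true / plausibility
    return prior * likelihood

# Hypothetical numbers: start undecided, then see three results that the
# theory predicted with probability 0.9 and that would otherwise have had
# probability 0.3.
belief = 0.5
history = [belief]
for _ in range(3):
    belief = update_belief(belief, 0.9, 0.3)
    history.append(belief)

print(history)
```

Each confirming result raises the belief but never takes it all the way to 1, which matches the point that the method yields growing confidence rather than logical certainty.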