The Measure of Software Quality Improvement

It seems there are many definitions for the term "software quality". On the bright side, this can be liberating. I can add my own definition:

Software quality is the degree of belief that the code will actually execute its intended functions, in the intended operating environment, for the intended life-cycle, without significant error or failure.

Notice that quality is subjective. It is a degree of belief, a Bayesian probability: "a measure of a state of knowledge." This is important for IV&V, since it means there is a big difference between "software quality" and "consensus software quality."

But that's another topic. The topic here is the proper measure of software quality improvement. (Engineers like quantitative measures and making things better.) So here I would like to note that improving software quality requires updating a Bayesian prior, and that, in turn, requires new evidence. The more new evidence presented, and the stronger that evidence, the more software quality improves.
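The update described above can be sketched with Bayes' rule. The likelihoods below are hypothetical, chosen only to illustrate how repeated passing tests strengthen the degree of belief; they are not measurements of any real test suite:

```python
def update_quality_belief(prior, p_pass_given_good, p_pass_given_bad, passed):
    """Condition a prior degree of belief that the software is 'good'
    on a single test outcome, via Bayes' rule."""
    if passed:
        like_good, like_bad = p_pass_given_good, p_pass_given_bad
    else:
        like_good, like_bad = 1.0 - p_pass_given_good, 1.0 - p_pass_given_bad
    numerator = like_good * prior
    return numerator / (numerator + like_bad * (1.0 - prior))

# Hypothetical likelihoods: a good build passes this test 99% of the
# time, a buggy build only 50% of the time.  Ten consecutive passes:
belief = 0.5
for _ in range(10):
    belief = update_quality_belief(belief, 0.99, 0.50, passed=True)
# belief is now well above 0.99
```

Note that a test a buggy build passes almost as often as a good build barely moves the belief at all; it is the likelihood ratio, not the pass itself, that carries the evidence.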

This means that the sole test of software quality is, not surprisingly, testing. Testing provides an objective foundation that (hopefully) keeps quality's subjectivity from drifting too far from reality.

But that's another topic too. Here I am noting that, IMHO, the strength, power, or intensity of new evidence is best measured in decibels. Bayesian probability ratios are often expressed in decibels. For example, see How Loud is the Evidence?

The decibel is a log scale that simplifies overall power gain/loss calculations and is convenient for ratios of numbers that differ by orders of magnitude. Additionally, log scales have been useful in describing the potential of certain physical phenomena (e.g., earthquakes) and human perceptions (e.g., hearing). Thus, log scales can be useful when performing risk assessments and other related quality assurance activities.
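A minimal sketch of evidence in decibels in the sense used here (ten times the base-10 log of a likelihood ratio). The likelihood ratios below are made up purely for illustration:

```python
import math

def evidence_db(likelihood_ratio):
    """Evidence in decibels:
    e = 10 * log10( P(data | H) / P(data | not-H) )."""
    return 10.0 * math.log10(likelihood_ratio)

# Made-up likelihood ratios for two independent tests:
weak_test   = evidence_db(1.98)    # ~3 dB: roughly doubles the odds
strong_test = evidence_db(100.0)   # 20 dB: a hundredfold odds shift
total = weak_test + strong_test    # independent evidence simply adds in dB

# Converting back: posterior odds = prior odds * 10**(total / 10)
```

The additivity is the convenience: combining independent pieces of evidence becomes a sum rather than a product of ratios.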

For more information on evidence measured in decibels, see Chapter 4 of Jaynes's Probability Theory: The Logic of Science.

Finally, an analogy. If evidence of software quality is measured in decibels, it suggests software quality assurance can be thought of as singing the quality song about the software. Consensus software quality then is where we all sing the same song, or at least sing our part in a symphony. :-)

(One of my complaints about the state of quality of say, climate science, is that every participant feels she/he must carry the melody.)


  1. I like the idea of getting explicit about what the terms we throw around actually mean in a rigorous (well-defined and measurable) way.

    So, you are suggesting that the measure of software quality is a probability of significant failure / error conditional on the intended functions, operating environment and life-cycle. Did I get that right?

    I like this idea, but one concern I'd have is that when we go to a binary measure (pass / fail) rather than some continuous measure, we lose a lot of information, and with it the power to detect small things. Maybe that's ok, and we should just learn to stop worrying and not sweat the small stuff...

    From the list on the wiki page the things I tend to care about are correctness and reliability, but that's probably too narrow a view.

    Your post made me think, "how would I map the results of an MMS-based verification test to a probability of error?", and I don't know that I have a good answer to that yet. This example is an interesting case, because MMS showed me that I had a bug that prevented the design order of convergence (3rd): I still got first-order convergence. So the governing equations were still being solved correctly / consistently, just not as accurately as they were supposed to be; significant error or no? Is the result of a test like that useful for updating our prior on failure?

    You've definitely got me thinking, thanks.
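
    The observed order in an MMS case like that can be estimated from the errors on two successive grid levels; a minimal sketch, with made-up error values showing a nominally 3rd-order scheme degraded to 1st order:

    ```python
    import math

    def observed_order(err_coarse, err_fine, refinement_ratio=2.0):
        """Observed order of convergence from errors on two grid levels:
        p = log(E_coarse / E_fine) / log(r)."""
        return math.log(err_coarse / err_fine) / math.log(refinement_ratio)

    # Made-up errors: halving h only halves the error, so the observed
    # order is ~1.0 rather than the design order of 3.
    p = observed_order(1.0e-2, 5.0e-3)
    ```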

  2. The software product quality list you mention shows that there are many components to software quality. Each of these components, even if based on objective parameters such as bug counts, must be subjectively evaluated.

    AFAIK, the state of the evaluation art, such as it is, is multi-criteria decision analysis; specifically multi-attribute global inference of quality. In other words -- weighted attribute analyses.

    The point of my post was a suggestion that attributes measured in decibels are easier to understand and to combine using some form of the above procedure. I am not very hopeful for this suggestion. How many times have you ever heard of a stock market signal or the significance of some survey expressed in decibels?

    What is driving me is my experience that when the quality of a software product is important to a diverse range of stakeholders, reaching a quality consensus can be very difficult. I am searching for the best general way to reach such consensus. My fear is that the current state of the SQA art may render politicized high-consequence software, such as the climate models, medical software, or nuclear power plant SCADA systems, technically unverifiable/unvalidatable. As pointed out so clearly in Dan Hughes' post today, an appeal to trust in authority and their current closed quality processes is, let us politely say, sub-optimal.

    Let me think about your MMS posts some more. But, a question. In an application usage area where I am not sure what a correct output would actually be, could I not use the method to manufacture a prior? (Create an initial prior, not update one I already have.) In other words, not MMS, but MMP.

  3. In an application usage area where I am not sure what a correct output would actually be, could I not use the method to manufacture a prior?
    I don't think I'm following; you mean like trying to verify an implementation of a stochastic simulation? If so, Andrew Gelman has done some work on that (they call it validation, but they mean software correctness, what I'd call verification).

    Maybe the answer for the MMS stuff is that it just becomes another one of the criteria you track.

  4. This report, Software Failure Probability Quantification for System Risk Assessment, is about a way of using Bayes networks to combine V&V and testing into an overall picture of software failure probability; the V&V activities are used to provide a measure of 'quality' of the evidence.