How to Evaluate the Effectiveness of a Program Using Statistical and Practical Significance

Funders are putting increasing pressure on nonprofits to demonstrate their impact, but many organizations have limited evaluation capacity. Some do not have access to statistical software, and fewer still have the assistance of an external evaluator. Evaluation efforts, however, are critical for determining whether services are making a difference for clients and for guiding program improvements.

Given this mounting pressure, it’s easy to understand why many program administrators are seeking guidance on how to evaluate the effectiveness of their programs. The good news is that developing an understanding of the concepts of statistical significance and practical significance can help with efforts to assess the validity of outcomes and their impact on clients.

Measuring program outcomes

Evaluating programs is easier when using objective indicators like graduation rates. This is particularly true if we have a target goal for that indicator. If we set as a goal that 80% of our participants will graduate from high school, it is fairly easy for us to determine whether we did or did not achieve that target.

However, much of the outcome data collected by nonprofits comes from pre-post measures of either internal client characteristics or targeted behaviors.

Internal client characteristics include things like:

  • Knowledge
  • Attitudes
  • Beliefs
  • Skills
  • Self-perception traits such as self-esteem or self-efficacy

Targeted behaviors tend to be qualities that are more outwardly visible, such as:

  • Increased school attendance
  • Changes in diet
  • Decreases in the use of illegal substances

These two outcome types are often hypothesized to relate to each other in a program’s theory of change or logic model; that is, a change in some inner quality such as an attitude or belief is designed to bring about specific behavior changes. This allows us to assess each outcome individually, and also explore whether there is an association between variables consistent with the hypothesized functioning of our program model.
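For a first look at that association, a simple correlation between each client's pre-to-post change on an attitude measure and their change in a targeted behavior is often enough. Here is a minimal sketch in Python, assuming the scipy library is available; the variable names and values are hypothetical and purely illustrative:

```python
# A minimal sketch (illustrative values, not real program data): is a larger
# pre-to-post attitude gain associated with a larger change in a targeted behavior?
from scipy import stats

attitude_gain = [2, 5, 1, 3, 4, 0, 6, 2]     # hypothetical change on an attitude scale
attendance_gain = [3, 8, 1, 4, 6, 1, 9, 2]   # hypothetical change in days attended per month

r, p_value = stats.pearsonr(attitude_gain, attendance_gain)
print(f"r = {r:.2f}, p = {p_value:.3f}")
```

A positive, statistically significant correlation here would be consistent with (though not proof of) the hypothesized link in the logic model.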

However, pre-post or between-group changes on these kinds of indicators require more careful examination than objectively measured items because of the statistical tests that are typically used. Observed differences in clients' internal characteristics or behaviors could simply be the product of chance rather than an effect of the program.

So, how do we test whether a program is effective when using these less objective indicators?

Statistical significance can help determine how likely it is that observed changes in inner qualities or behaviors are real rather than due to chance. An assessment of practical significance can then help determine the degree to which these changes are meaningful in the lives of clients.

Statistical Significance: can we reject the conclusion that we had no impact on our clients?

One important use of statistical tests is to determine whether numeric differences (for example, between pretest and posttest scores, or between two or more independent groups) are likely to be real or simply due to chance. That is, even if we observe a seemingly positive change from pre to post with a group of clients, there may in fact be no real difference once we account for the variability in the scores. When running a test of mean differences, we start with the assumption that the true difference is zero (referred to as the null hypothesis), and the test (run in a statistical software package or even Excel) gives us a probability that tells us how likely we would be to see a difference this large if the true difference really were zero. As a general convention, we are comfortable rejecting the conclusion that there is no difference in scores if this probability is 5% or smaller; that is, when only 5 (or fewer) times out of 100 would we conclude there is a difference when actually there is not.
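To make this concrete, here is a minimal sketch of such a test in Python using scipy (the same paired test can be run in a statistical package or in Excel, as noted above); the scores below are hypothetical:

```python
# A minimal sketch of a paired pre/post mean-difference test (scores are
# hypothetical). The null hypothesis is that the mean pre-post difference is zero.
from scipy import stats

pre_scores = [12, 15, 11, 14, 13, 10, 16, 12, 14, 11]    # hypothetical pretest scores
post_scores = [14, 18, 13, 15, 16, 12, 19, 13, 17, 12]   # hypothetical posttest scores

t_stat, p_value = stats.ttest_rel(post_scores, pre_scores)
print(f"t = {t_stat:.2f}, p = {p_value:.3f}")

# By the usual convention, a p-value of 0.05 or smaller lets us reject the
# null hypothesis of no difference.
```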

Now, there are a great number of statistical analysis techniques, but this is not a blog about statistics. The point is simply that, whenever possible, the changes we observe in our client populations should be tested so that we can determine how confident we are that those changes are real. This also means we should not assume differences are real simply because our data appear to be moving in the right direction.

Importantly, statistical tests are sensitive to sample size: with a large enough sample, even a very small measured difference can reach statistical significance. This is a technical matter, and consultation with an evaluator or statistician can help you think through sample size issues. The short simulation below illustrates the point, and it also provides a nice transition to our next topic: Practical Significance.
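This is a small illustrative simulation, assuming Python with numpy and scipy: the same tiny true gain is tested with a small sample and with a very large one, and only the large sample will typically cross the conventional 5% threshold.

```python
# An illustrative simulation: the same small true gain (0.5 points) tested with
# a small sample and a very large one. All numbers are hypothetical.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

for n in (30, 30000):
    pre = rng.normal(loc=50.0, scale=10.0, size=n)
    post = pre + rng.normal(loc=0.5, scale=10.0, size=n)   # true average gain of only 0.5
    t_stat, p_value = stats.ttest_rel(post, pre)
    print(f"n = {n:>6}: mean gain = {np.mean(post - pre):.2f}, p = {p_value:.4f}")

# With n = 30 the tiny gain will usually not reach p <= 0.05; with n = 30,000
# it almost always will, even though the gain itself is just as small.
```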

Practical Significance: can we conclude that our impact made a difference?

Statistical significance says nothing about the size or potential meaning of an observed effect; it simply tells us whether we should reject the idea that there is no difference. As mentioned above, this is easily accomplished as the size of our sample grows, even when differences are very small. Thus, we need to make some determination as to the practical meaning of the result we’ve achieved. For example, would we deem the difference large enough to believe that it had material impact on the lives of our clients? Would we make further investments in our program based on the magnitude of this difference? Or, conversely, would we deem the difference so small, despite being statistically significant, that we might question whether our program is achieving the effects we aspire to?

While there are statistical procedures for calculating the size of an effect (a small sketch of one common measure appears after the list below), it is often the case that we base at least part of this decision on our own experience and judgment, or on that of an external professional. We might, for example, have extensive expertise in our program area and have previously determined what constitutes a meaningful difference (the important words here being “previously determined”). Or, we might turn to the research literature to see whether studies of programs like ours, focused on the same types of outcomes, offer external guidance on how large a change we should expect. Finally, many measurement tools provide guidance, based on prior research, as to what constitutes a meaningful change on their scales. Each of these can help us consider what practical significance means within our program. However, as suggested in this section:

  1. We should predetermine what a meaningful change is, and
  2. We should seek out information that helps to explain why or in what ways this constitutes a meaningful change.
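As one example of the effect size procedures mentioned above, the sketch below computes Cohen's d for hypothetical paired pre/post scores; it is a rough guide under these assumptions, not a substitute for a predetermined, program-specific definition of meaningful change.

```python
# A minimal sketch of one common effect size measure, Cohen's d, for paired
# pre/post scores (hypothetical values). d expresses the average gain in
# standard deviation units.
import numpy as np

pre_scores = np.array([12, 15, 11, 14, 13, 10, 16, 12, 14, 11])
post_scores = np.array([14, 18, 13, 15, 16, 12, 19, 13, 17, 12])

differences = post_scores - pre_scores
cohens_d = differences.mean() / differences.std(ddof=1)

print(f"Cohen's d = {cohens_d:.2f}")
# Rough conventional benchmarks: about 0.2 = small, 0.5 = medium, 0.8 = large,
# but a program-specific benchmark set in advance is always preferable.
```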

A significant caveat

It is important to note that while statistical and practical significance provide us with preliminary evidence of our program’s impacts, they do not, in and of themselves, establish causality, that is, that the program caused these changes. For that stronger claim, we would need a more robust evaluation design, for example one that includes a control group. Despite this significant limitation, using these two concepts when interpreting data helps put us on more solid ground when reporting findings to our funders, adjusting program efforts, or using our findings in strategic ways to market and expand services.