# Chapter 10: Critical consumption of statistical evaluation data

Nothing in life is to be feared.

It is only to be understood.

-Marie Curie

Statistics. This word usually sends my students into orbit with anxiety. But take heed of the words of scientist Marie Curie, above. The focus of this chapter is helping you to interpret statistical evaluation data, and numerical information more generally, with as little pain as possible.

Statistical evaluation data generally come in three forms: univariate, bivariate and multivariate. I am guessing that these terms sound like “evaluation as a second language,” so let’s break them down. After breaking them down, we will go over the basic steps involved in interpreting a statistical table.

## Interpreting univariate data

The term “univariate” refers to looking at one measure at a time, usually for one group. When you read the word “univariate,” focus in on “uni,” which is derived from the Latin word for one (“unus”). In practice evaluation, we see univariate statistics used to describe the characteristics of a group, whether demographic, clinical, or other. For example, a report on citizens’ level of community pride at one point in time is considered a piece of univariate data.

Looking at Table 10.1, we see a visual example of how univariate evaluation statistics are reported in a one group table. In this example, we are thinking back to our social worker doing community organizing work focused on fostering the improvement of community pride.

To read a table, slow yourself down by first reading the title of the table, to ground yourself in what you are about to read. Then, figure out what is presented in each of the columns – in this case a list of the measures, or variables in the evaluation, and a summary of the data taken from citizens participating in the project. Under the latter, you will notice the notation “N=331.” This is the total number of people participating in the evaluation.

Table 10.1: Community pride evaluation data

| Measures | Citizen participants N=331 |
| --- | --- |
| | % (n) or M (SD) |
| Intention to remain in community (yes/no) | 61.9% (205) |
| Community pride score | 27.0 (13.6) |

At this point, you can start going row by row to critically consume the data. Looking at the third row, we see what may at first appear to be some gibberish, “% (n) or M (SD).” Actually, this row is telling us the format that data will be presented in, in the column below. Let’s take these symbols one at a time.

The symbol “%” refers to the percentage of people who answered the question (listed in a given row) in the affirmative. The letter “n” refers to the number of people who answered the question (listed in a given row) in the affirmative. For example, under the “measures” column, you can see a row labelled “intention to remain in the community (yes/no).” This measure is known as a nominal variable. The words in quotations describe the measure, or variable, for which data are being presented in the data row. As this is a percentage being reported, the “yes/no” notation indicates that the number listed in the table is the percent of people indicating yes versus no. In this case, 61.9% of people indicated that they did intend to stay in the community versus move away. This is good news for the evaluator!

The number “205” listed in a parenthesis next to the percentage indicates the number of people who said yes out of the total number of people participating in the evaluation. For that number, see the “citizen participants” column showing N=331, the total number of people participating in the evaluation.
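The link between the “n” and the “%” can be checked with a line of arithmetic. The sketch below is only an illustration; the function name `percent_yes` is made up for this example, and the two numbers come from Table 10.1.

```python
def percent_yes(n_yes: int, n_total: int) -> float:
    """Percentage of participants who answered yes, rounded to one decimal place."""
    return round(100 * n_yes / n_total, 1)


# n = 205 people said yes out of N = 331 participants (Table 10.1)
print(percent_yes(205, 331))  # 61.9
```

If the percentage and the n in a table do not agree with the N in the column header, that is a cue to read the report more cautiously.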

Moving on, the capital letter “M” refers to the mean, or average, a “measure of central tendency.” This is one of the most commonly used univariate evaluation statistics, and it is reported for continuous variables such as the community pride score. The problem with the mean, however, is that it can sometimes be misleading. For example, if I told you the mean fee-for-service rate United States social work clinicians charge per hour was \$356, you might be surprised, thinking that the rate sounds high. The truth is, a mean is calculated from a number of values along a spectrum, and on its own it tells you nothing about how widely those values are spread. And that is where the standard deviation comes in.

On the table, the letters “SD” refer to the standard deviation, which can loosely be thought of as a unit of measurement above and below the mean, similar to a margin of error. You can be one standard deviation above or below the mean, or two, or three. The standard deviation is always measured in the unit of the measure in question. Let’s go through an example to make this real.

Back to our example about fee-for-service rates. The mean was \$356, but if we also knew that the standard deviation was \$151, this would give us a better sense of the spectrum of rates clinicians charge around the country: one standard deviation on either side of the mean runs from \$205 to \$507. Variations might relate to insurance type or use of a private pay system. As you can see, the standard deviation allows for the truth to come forward. Therefore, the standard deviation should *always* be reported with the mean as a way to give critical consumers the bigger picture around the mean.

Now that we have mastered the concepts of the mean and standard deviation, let us interpret the community pride scale from Table 10.1. On average, citizen participants scored 27.0 (SD=13.6) on the community pride scale. The community pride instrument used a 1-100 measurement with 1 coding as not proud of the community at all and 100 coding as extremely proud of the community. This does not bode well for our evaluator as 40.6 was the highest community pride score within one standard deviation of the mean.

We get that number by taking the average score, 27.0, and adding one standard deviation (13.6) to get 40.6. To get a better sense of how the mean and standard deviation work in this example, see Figure 10.1, below.

Figure 10.1 Mean and standard deviation
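The same arithmetic can be written as a short Python sketch. The helper name `one_sd_band` and the five raw scores are hypothetical and serve only to illustrate; the mean of 27.0 and the standard deviation of 13.6 come from Table 10.1.

```python
from statistics import mean, stdev


def one_sd_band(m: float, sd: float) -> tuple[float, float]:
    """Return the scores one standard deviation below and above the mean."""
    return (round(m - sd, 1), round(m + sd, 1))


# Community pride score from Table 10.1: M = 27.0, SD = 13.6
print(one_sd_band(27.0, 13.6))  # (13.4, 40.6)

# Hypothetical raw scores, only to show where M and SD come from
scores = [10, 20, 25, 30, 50]
print(mean(scores))              # 27
print(round(stdev(scores), 1))   # sample standard deviation of the made-up scores
```

Note that `statistics.stdev` computes the sample standard deviation; a report may or may not say which version was used.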

The percentage, mean and standard deviation are core concepts in evaluation statistics and are vital to understand when critically consuming bivariate data, which we will consider next.

## Interpreting bivariate analyses

When you hear the term “bivariate,” focus in on the part of the word that says “bi.” This word originates from Latin as well and refers to a condition in which two things are present (e.g., binocular, bicameral). In practice evaluation, we see bivariate statistics used to compare groups against one another, or to compare one group of people before and after a social work intervention.

For example, if we think back to our child welfare social worker that is running parenting education groups, a report on people’s child abuse potential score before and after a parenting course is considered a piece of bivariate data. We look at the group’s average score before the intervention as it is compared to their average score after the parenting intervention.

With bivariate statistics, you see the inclusion of information about statistical significance. Statistical significance helps us to tell whether there is a real mathematical difference between the before and after scores. Another way of saying that is that statistical significance tells us the percent chance any given finding is due to chance. But before we go there for more explanation, let’s start by orienting ourselves to what is in Table 10.2 step by step.

Table 10.2: Parenting program evaluation data

| Measures | Before intervention N=20 | One month after intervention N=15 | Test |
| --- | --- | --- | --- |
| | % (n) or M (SD) | % (n) or M (SD) | |
| Child abuse potential score | 56.0 (4.2) | 38.0 (2.3) | t=3.22*** |
| Meeting all reunification goals | 35% (7) | 73% (11) | X2=8.54* |

*p< .05; **p< .01; ***p< .001; NS = No statistically significant difference between groups

We can see from the top row that this table is reporting on parenting education evaluation data. In the left-hand column, we see a list of two measures used by the evaluators, the child abuse potential score (from a standardized instrument) and whether parents are meeting their family reunification goals (a yes/no question). The second and third columns indicate where pre data (before the intervention) and post data (after the intervention) will be reported. The final column indicates where statistical test data are reported.

Let’s start with the child abuse potential score, which can range anywhere between 1 (no child abuse potential at all) and 100 (high chance of child abuse potential). Before the intervention, as a group, parents scored on average 56.0 plus or minus 4.2 points. 56.0 is the mean and 4.2 is the standard deviation. However, after the parenting education intervention, as a group parents scored on average 38.0 plus or minus 2.3 points. In this example, 38.0 is the mean and 2.3 is the standard deviation.

On the face of it, this is about a 20-point drop in child abuse potential score. However, we don’t know if this is a statistically significant or mathematically meaningful difference. In order to understand whether this is a real difference, we need statistical data, which we have in the fourth column, where it is noted as “t=3.22***”.

Let’s take this one step at a time, beginning with the “t.” Every statistical test produces a result that is coded with a letter. The “t” (always lower case) refers to the Student’s t test, a calculation that compares two values in a before and after situation (such as ours) or in a group comparison situation (such as comparing the results of two different therapeutic groups). When you see the “t,” this means that the measure you are looking at is continuous. A continuous variable is one for which you can calculate an average, which is what we are looking at. It is important to note that you can’t calculate an average on a yes/no question such as “completed treatment/didn’t complete treatment.” As a critical consumer of research, it is important to check whether the correct statistical test was used with any given measure in order to determine whether the results are legitimate or not. Believe it or not, I’ve seen the wrong test used more often than I care to relate!

Next, we have the 3.22 number listed next to the “t.” This number is not interpreted. It is a relic of a time before computer calculations when we had to look up values to determine whether our findings were statistically significant or not. So, we can skip right over this number and get to the “***” part of the report. Asterisks, or stars, are commonly used to indicate the presence of statistical significance.

Generally, if you see three stars, this means that there is a 99.9% chance that the finding you are looking at is not due to chance. The presence of two stars means that there is a 99% chance that the finding you are looking at is not due to chance. The presence of one star means that there is a 95% chance that the finding you are looking at is not due to chance. Classically, evaluators only accept results as statistically significant if they find between a 95% and a 99.9% chance in their statistical data.
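The star convention is simply a set of cutoffs applied to the p value that statistical software reports. A minimal sketch follows, assuming the cutoffs shown in the footnote of Table 10.2; the function name `significance_stars` is hypothetical.

```python
def significance_stars(p: float) -> str:
    """Translate an exact p value into the star notation used in the tables."""
    if p < 0.001:
        return "***"
    if p < 0.01:
        return "**"
    if p < 0.05:
        return "*"
    return "NS"  # no statistically significant difference


for p in (0.0004, 0.008, 0.03, 0.20):
    print(p, significance_stars(p))
```

A reader can run the logic in reverse: three stars in a table mean the reported p value was below .001, two stars below .01, and one star below .05.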

However, more recently, statisticians have questioned this rule of thumb, suggesting that lower percentages might be considered as well. In order to understand this, it helps to look at the entire range of possible percentages; see Table 10.3.

In Table 10.3, statistical significance levels are reported as probability values, or p values. A p value of <.99 translates into 1%, for example, indicating there is roughly a one percent chance the finding is not due to chance. To get from .99 to 1%, you have to remember some fourth-grade math: subtract .99 from 1 to get .01, then read that as a percentage. The three groupings presented in Table 10.3 reflect classical thinking about statistical significance but are not a hard and fast rule.

Table 10.3: Range of statistical significance levels

| Not statistically significant | Approaching significance | Statistically significant |
| --- | --- | --- |
| p<.99, p<.80, p<.70, p<.60, p<.50, p<.40, p<.30, p<.20, p<.10 | p<.09, p<.08, p<.07, p<.06 | p<.05*, p<.04*, p<.03*, p<.02*, p<.01**, p<.001*** |
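The three groupings in Table 10.3 can be expressed as a small lookup. This is a sketch under stated assumptions: the input is the cutoff printed after “p<” in a report (for example 0.86 for “p<.86”), the function name `significance_group` is hypothetical, and the boundaries are read directly off the table, which the chapter describes as a convention rather than a hard and fast rule.

```python
def significance_group(p_cutoff: float) -> str:
    """Place a reported 'p<' cutoff into one of the three groupings of Table 10.3."""
    if p_cutoff <= 0.05:
        return "statistically significant"
    if p_cutoff <= 0.09:
        return "approaching significance"
    return "not statistically significant"


for cutoff in (0.001, 0.03, 0.07, 0.86):
    print(f"p<{cutoff:g}: {significance_group(cutoff)}")
```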

So, we’ve talked a lot about statistical significance, but there is another factor we have to consider, clinical significance (also called clinical meaningfulness sometimes). To do this, let’s go back to our example. As you will recall from Table 10.2, before the intervention, on average, parents’ child abuse potential score was 56.0 (4.2) whereas it was 38.0 (2.3) one month after the intervention ended.

While statistical significance tells us the chance a finding is due to chance, clinical significance is what we, individually, as social workers think about the difference between the numbers being compared (in a comparison context). These are two different concepts. In order to think about clinical significance, we have to start with statistical significance. If something is found to be statistically significantly different, then we have to consider clinical significance. If something is found *not* to be statistically significantly different, we stop there with the understanding that mathematically, the scores are equal vis-a-vis the comparison of scores. We would then just consider the range the two scores are in on whatever spectrum of scores we are considering, rather than comparing the two scores to one another.

To use our statistically significant example from Table 10.2, we have to begin to consider whether 56 and 38 are meaningfully different from one another. This involves us going back to the range of scores on the child abuse potential scale, 1-100. I’m willing to bet that most people would consider a roughly 20-point difference, with statistical significance, a meaningful difference.

However, what if you had a statistically significant difference, but the score difference was only 10 points? Or 5 points? Would you feel the finding was clinically significant then? What would it mean, for example, if the pre-test score was 56, but the post-test score was only 51? Would you think that the intervention assisted parents in lowering their child abuse potential score? In this situation, even though there was a mathematical difference (per statistical significance), there was not a clinically meaningful change in score, suggesting that the intervention might not be effective.

Now, we’ve been spending a great deal of energy to interpret the child abuse potential score, but what about the other measure in Table 10.2, “meeting all reunification goals?” As you will recall, our child welfare social worker is trying to assist all parents on her caseload to reunify with their children who are living in foster care. In this situation, each parent works with a child welfare social work team to establish goals they need to accomplish in order to reunify with their children.

One of the measures in our child welfare social worker’s evaluation of her parenting education intervention is whether group participants are meeting all of their reunification goals. The hope is that participation in the parenting education intervention will help parents to achieve all of their goals.

As we can see in Table 10.2, only 35 percent of parents (a total of 7 people) had achieved their reunification goals before the intervention started. This is contrasted by the fact that 73 percent of the intervention participants (11 people) had achieved their reunification goals one month after the end of the parenting education intervention. On the face of it, that is a big difference, but is it a statistically significant difference? In order to determine that, we look in the fourth column for the statistical data, and we see X2=8.54*.

Let’s take this one step at a time, beginning with the “X2.” Every statistical test produces a result that is coded with a letter. The “X2” (pronounced “Kai squared”) refers to the Chi-square test, a calculation that compares two values in a before and after situation (such as ours) or in a group comparison situation (such as comparing the results of two different therapeutic groups). When you see the “X2,” this means that the measure you are looking at is nominal. A nominal variable is one for which you *cannot* calculate an average, which is what we are looking at. Nominal variables are often structured as a yes/no question such as “achieved reunification goals/didn’t achieve reunification goals.” Remember, as a critical consumer of research, it is important to check whether the correct statistical test was used with any given measure in order to determine whether the results are legitimate or not.

Next, we have the 8.54 number listed next to the “X2.” As with the Student’s t test, this number is not interpreted. It is a relic of a time before computer calculations when we had to look up values to determine whether our findings were statistically significant or not. So, we can skip right over this number and get to the “*” part of the report. As you will recall, a single star tells us that the finding is statistically significant, and that there is a 95% chance that our test result is not due to chance.

So, we’ve talked a lot about statistical significance, but now we have to consider clinical significance. To do this, let’s go back to our example. As you will recall from Table 10.2, before the intervention, 35 percent of parents had achieved their reunification goals whereas 73 percent (almost three quarters of the group) had achieved their reunification goals one month after the intervention ended. This roughly forty-point difference between the pre and post tests would be considered clinically significant by most people.

However, if there had been no statistically significant difference between pre and post, we would have to consider the scores statistically equal, in which case clinical meaningfulness would take on a different meaning. Specifically, the meaningfulness would not be about the difference in scores, as the statistical tests showed the scores to be equal, suggesting that the intervention was not effective.

We’ve taken our time going through Table 10.2. Now, let’s ask the question “How is this evaluation information helpful?” From the data we have reviewed, our evaluation tells us that parents participating in the parenting education intervention are making progress, and we have evidence to support this statement, evidence backed up by results from a Student’s t-test and a Chi-squared test.

So far, you have learned how to interpret a Student’s t-test and a Chi-squared test (two of the most common evaluation statistics), but we also have to talk about the public health and medical statistic that is increasingly used in social work, the odds ratio. An odds ratio compares two groups, such as a treatment group and a control group. You always need to know which group you are comparing to the other in order to correctly interpret the data that the test gives you. The data that you interpret is about the group you are focused on, and it is reported in comparison to the other group, known as the “referent.” Usually the treatment group is the group you are focused on and the control group is the referent. Let’s say we are comparing the treatment group and the control group on their likelihood of maintaining sobriety for 90 days post treatment. You will have data about what percentage of each group maintained sobriety for that timeframe, but you will be wanting to know if there is a statistically significant difference. Odds ratio scores also help you to know if there is a clinically meaningful difference. More on that in a minute.

While we don’t interpret the scores that are given to us from t-tests and chi-squares, we do interpret the number that is given to us with an odds ratio. If we have an odds ratio score of exactly 1.0 it means that the treatment and control group are exactly equal. If we have an odds ratio score of 2.3 (p<.001) it means that the treatment group is 2.3 times more likely to maintain sobriety than the control group. Now let’s say we have an odds ratio score of 2.3 (p<.99). While odds ratios are still reported when there is no statistical significance, we don’t interpret them, because the two groups are statistically equally likely to have the outcome.

Now, let’s say that our odds ratio had been 0.23 instead of 2.3. In this situation, we subtract 0.23 from 1, and get .77. We interpret this as a percent. This would tell us that the treatment group was 77% less likely to maintain sobriety for 90 days (meaning there’s a problem with our program). When our odds ratios are above 1, we talk about “times more likely,” and when they are below 1 (meaning they start with a zero), we talk about “percent less likely.” An odds ratio itself is never a negative number. So, an odds ratio of 0 point anything is always about lower likelihood.
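The reading rules for odds ratios can be sketched in Python. This is a minimal illustration rather than the output of any statistics package: the function names `odds_ratio` and `describe_odds_ratio` and the counts in the usage lines are hypothetical. Strictly speaking, an odds ratio compares odds rather than probabilities; the sketch follows the chapter’s plain-language reading.

```python
def odds_ratio(focus_yes: int, focus_no: int, ref_yes: int, ref_no: int) -> float:
    """Odds of the outcome in the focus group divided by the odds in the referent group."""
    return (focus_yes / focus_no) / (ref_yes / ref_no)


def describe_odds_ratio(value: float) -> str:
    """Turn an odds ratio into the chapter's plain-language reading."""
    if value == 1:
        return "the two groups are equally likely to have the outcome"
    if value > 1:
        return f"{value:g} times more likely"
    return f"{round((1 - value) * 100)}% less likely"


print(describe_odds_ratio(2.3))   # 2.3 times more likely
print(describe_odds_ratio(0.23))  # 77% less likely

# Hypothetical counts: 30 of 50 treatment clients and 15 of 40 control clients stayed sober
print(round(odds_ratio(30, 20, 15, 25), 2))  # 2.5
```

Remember that this reading only applies when the odds ratio is statistically significant; otherwise the number is reported but not interpreted.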

Let’s talk about the way that odds ratio scores help us to determine clinical meaningfulness. We only start paying attention to odds ratios as meaningful at a certain cutoff point. In research as a second language we talk about this as an “effect size.” As Chen, Cohen, and Chen (2010) note, “the odds ratio (OR) is probably the most widely used index of effect size in epidemiological studies” (p. 860). Further, these authors suggest that odds ratios of 1.68, 3.47, and 6.71 are equivalent to Cohen’s d effect sizes of 0.2 (small), 0.5 (medium), and 0.8 (large), respectively (p. 860). So unless your odds ratio is above 1.68, you shouldn’t really consider it to be clinically meaningful. That’s a good rule of thumb.
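The cutoffs attributed to Chen, Cohen, and Chen (2010) can be written as a simple lookup. The function name `odds_ratio_effect_size` is hypothetical, and the sketch only covers odds ratios of 1 or higher, since the chapter gives no cutoffs for odds ratios below 1; rejecting those values is a design choice of this example, not a rule from the source.

```python
def odds_ratio_effect_size(value: float) -> str:
    """Label an odds ratio of 1 or higher using the 1.68 / 3.47 / 6.71 cutoffs."""
    if value < 1:
        raise ValueError("cutoffs in this sketch only cover odds ratios of 1 or higher")
    if value < 1.68:
        return "below the small cutoff"
    if value < 3.47:
        return "small"
    if value < 6.71:
        return "medium"
    return "large"


for value in (1.2, 2.3, 4.0, 10.1):
    print(value, odds_ratio_effect_size(value))
```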

It is important to note that depending on the structure of the outcome measure/variable used in an evaluation, a t-test, chi-square test or odds ratio test could be used in a pre-post test research design, as discussed in chapter 4.

Now that you have learned all of the classic statistics for comparing two groups, there is one more statistical test that evaluators use all the time, the ANOVA, or analysis of variance. Don’t let the evaluation-as-a-second-language deter you. The ANOVA test can compare two or more points in time (or groups) but is classically used to compare three points in time (or groups). ANOVA tests compare continuous variables (reported as means, with standard deviations) across timeframes or groups. As discussed in chapter 4, the ANOVA statistical test is often used in evaluation to look at pretest data compared to post-test data and then aftercare data, for example.

Let’s look at some examples, starting with Table 10.4, which reports on data gathered by our child welfare social worker before the parenting education intervention, one month after the intervention and one year after the intervention. We have already reviewed a bivariate analysis that compared the average score before the intervention and one month after the intervention. Now, we are adding on a third point in time, the one-year mark after the intervention. We can see that on average, the child abuse potential score was at its lowest one month after the intervention, and that it crept up a slight bit at the one-year mark.

As with all statistical tests, the ANOVA presents a letter along with test results; in this case it is a capital F. The “F statistic” reads as F=5.23**. We do not interpret the number, as with the Student’s t-test and Chi-square, but we do interpret the stars to indicate the presence of statistical significance, in this case suggesting a 99% chance the different scores across timeframes are not due to chance.

Table 10.4: Parenting education evaluation data across three timeframes

| Measures | Before intervention N=20 | One month after intervention N=15 | One-year post test N=15 | ANOVA or X2 |
| --- | --- | --- | --- | --- |
| | M (SD) or % (n) | M (SD) or % (n) | M (SD) or % (n) | |
| Child abuse potential score | 56.0 (4.2) | 38.0 (2.3) | 41.2 (3.7) | F=5.23** |

*p< .05; **p< .01; ***p< .001; NS = No statistically significant difference between groups

In order to determine where the specific differences lie between the three timeframes, a post hoc test is conducted. All that means is that a test is done after the ANOVA is completed in order to compare the pre-test to each post-test individually, as well as comparing the post tests to one another. This test provides p values across time frames to see where statistically significant differences lie.

In Table 10.5, if we start by finding the p value linking the “before intervention” timeframe and the “one-year post-test” timeframe, we see a value of p<.01, indicating a 99% chance that there is a real difference between scores across these timeframes.

Table 10.5: ANOVA post-hoc test results for parenting education intervention evaluation

| Timeframes: | Before intervention | One month after intervention | One-year post test |
| --- | --- | --- | --- |
| Before intervention | — | p<.03 | p<.01 |
| One month after intervention | p<.03 | — | p<.86 |
| One-year post test | p<.01 | p<.86 | — |

Moving on, we might be curious about whether there is a statistically significant difference between scores at the “one month after intervention” timeframe and the “one-year post-test” timeframe. Looking at the score that links the two, we see p<.86. As we can see on Table 10.3, this is not a statistically significant value. This score means that there is a 14% chance that this finding is not due to chance.
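Reading a post hoc table amounts to checking each pair of timeframes against the .05 cutoff. The sketch below copies the reported “p<” cutoffs from Table 10.5; the names `POST_HOC` and `significant_pairs` are hypothetical.

```python
# Reported "p<" cutoffs from Table 10.5, one entry per pair of timeframes
POST_HOC = {
    ("Before intervention", "One month after intervention"): 0.03,
    ("Before intervention", "One-year post test"): 0.01,
    ("One month after intervention", "One-year post test"): 0.86,
}


def significant_pairs(p_cutoffs: dict, alpha: float = 0.05) -> list:
    """Return the pairs of timeframes whose reported cutoff is at or below alpha."""
    return [pair for pair, p in p_cutoffs.items() if p <= alpha]


for first, second in significant_pairs(POST_HOC):
    print(f"{first} vs. {second}: statistically significant difference")
```

Here the two comparisons involving the “before intervention” timeframe are flagged, while the comparison between the two post-tests is not.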

As we mentioned above, the ANOVA test can also be used to compare groups, something that is also relevant for practice evaluation. For example, let’s say our child welfare social worker was running three intervention groups, and we wanted to compare the outcomes. See Table 10.6.

Table 10.6: Parenting education evaluation data across groups at one month after intervention

| Measure | Group 1 N=15 | Group 2 N=22 | Group 3 N=17 | ANOVA |
| --- | --- | --- | --- | --- |
| | M (SD) | M (SD) | M (SD) | |
| Child abuse potential score | 38.0 (2.3) | 42.6 (5.1) | 33.5 (1.6) | F=7.45 |

*p< .05; **p< .01; ***p< .001; NS = No statistically significant difference between groups

Taking our time to orient to the table, we would start by grounding ourselves in the title of the table. Looking at the “measure” column we would see that the child abuse potential score is being reported. The three columns in the middle show us information about each of the three intervention groups.

The last column shows us which statistical test is used, an ANOVA, and reports the F statistic. As we learned above, we skip over the “F statistic” in order to see if there is a statistically significant difference between the three groups – or not. In this case, there is no star or asterisk present, indicating that there is no statistically significant difference between these groups with respect to the outcome measure of child abuse potential. As a reminder, if there had been a statistically significant difference and we wanted to know *exactly* where it was between groups – namely between groups 1 and 2, 2 and 3, or 1 and 3 – we would do a post-hoc analysis. In a post hoc analysis, a table of p values is reported for each of those comparisons; see Table 10.5 for an example.

How is this evaluation information helpful? We can take a look at these data several ways. First, we know that there is no statistically significant difference between the groups, which means that mathematically, they are doing equally well. This suggests that there may not be a difference in who is in each group or in how each group is conducted.

Second, if we look at the score range, we can see that the child abuse potential scores are between 33 and 43 (if we round up). Given that the child abuse potential score range is between 1 and 100, parents in the groups are at the lower end of the range. In terms of clinical significance, we would focus on how the data look all together – no concerning differences between groups, all scores on the lower end of the range. We would not compare the scores in each group for clinical significance because there was no statistical significance, indicating that the groups are mathematically the same.

## Interpreting multivariate analysis

The last frontier of critically consuming evaluation data relates to the use of a statistical test referred to as regression analysis. Saying this term out loud often elicits anxiety amongst my students, but in truth, regression analysis is all about predictions.

Specifically, regression analysis allows us to understand whether groups of factors predict a certain outcome. There are two main types of regression analysis used in evaluation, logit regression and ordinary least squares regression. Logit regression is for nominal outcomes and ordinary least squares is for continuous outcomes. Before your head begins to spin at the terminology, let’s focus on the basics of regression analysis.

Let’s go right to the concrete, an example related to our child welfare social worker’s evaluation of her parenting education intervention. All of the parents on her caseload have the same goal, reunifying with their children. Therefore, the outcome of interest in this evaluation is whether or not family reunification took place (yes or no). Note that because family reunification is a yes/no measure, it is classified as a nominal variable. Therefore, logit regression is the appropriate test to choose because it works with an outcome measure that is nominal.

In order to plan her future work with parents involved in the child welfare system, she may be curious about what factors are related to the family reunification outcome. Factors that might be related to this outcome could include a parent’s child abuse potential score and whether they accomplished all of their reunification goals, among other factors.

As we have measures for both of these factors, we can conduct a regression analysis to see how much these factors, taken together, predict or explain a positive family reunification outcome. Each of these measures is called an independent measure or an independent variable.

Let’s start to understand the utility of logit regression analysis by interpreting what’s in Table 10.7. After grounding ourselves with the title, so we know what we are focusing on, we move on to the second row. We can see that there is a column where the independent measures – or factors we are interested in – are listed.

The next column has what appears to be gibberish, Exp(B). This is statistical language, but what it is important to know is that this column gives us what is called an “odds ratio” to interpret that links the independent measure to the outcome measure. We’ll interpret that in just a bit. The third column gives us “confidence intervals” which are akin to a margin of error, telling us how low and how high the odds ratio could be. Finally, the p value tells us whether there is a statistically significant relationship between the independent measure and the outcome measure.

The last thing we need to pay attention to is the last row, where we see the term “Nagelkerke R2.” This is the name of a statistical test that tells us how much the set of independent measures (together) explains the variation in the outcome. You can think of this as a quality measure.

Table 10.7: Logit regression predicting family reunification

| Independent measures | Exp(B) | Confidence intervals | P value |
| --- | --- | --- | --- |
| Has low child abuse potential score at end of intervention (yes/no) | 2.30 | 1.75-2.46 | p<.001 |
| Achieved all reunification goals within one year (yes/no) | 10.1 | 9.2-11.6 | p<.05 |
| Nagelkerke R2=0.54 | | | |

*p< .05; **p< .01; ***p< .001

There are two ways to interpret logistic regression analyses in evaluation. The first approach looks at how good the set of independent measures is at predicting the outcome. The second approach looks at how each independent measure relates to the outcome measure.

Interpretation 1: In this interpretation we are using logistic regression to predict family reunification (our outcome measure) among child welfare-involved parents, using a two-measure “model.” In regression analysis, a “model” is a set of independent measures that are thought to relate to the outcome measure of interest. The goal is to test the model to see how much of the variation in the outcome, family reunification in this case, is explained.

This interpretation relies on that Nagelkerke R2 that we mentioned above. This statistical test should be read as a percentage. Remember that in fourth grade math, 1.0 is equal to 100%, .99 is equal to 99% and so on. When we interpret the percentage, we can see that 54% of the outcome was explained by the combination of independent measures we chose to include in our regression analysis.

This tells us that while our two measures or “model” predicts over half of the outcome, there are some measures that are missing. In other words, there are other factors that predict the family reunification outcome that are not included in this model. Something to consider for our next evaluation. This interpretation helps us to see the spectrum of factors that we should work on with clients that are geared towards a positive outcome.
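
The percentage reading described above is simple arithmetic, and it can be sketched in a few lines of Python. The function name `describe_r_squared` is made up for this illustration; it simply multiplies the R2 value by 100 and reports what is left over as the share not explained by the model.

```python
def describe_r_squared(r_squared: float) -> str:
    """Turn an R2 value (between 0 and 1) into a plain-language sentence."""
    if not 0.0 <= r_squared <= 1.0:
        raise ValueError("R2 must be between 0 and 1")
    explained = round(r_squared * 100)   # e.g. 0.54 becomes 54
    unexplained = 100 - explained        # the share the model leaves out
    return (
        f"The model explains about {explained}% of the variation in the outcome; "
        f"about {unexplained}% is left to factors not in the model."
    )


# Nagelkerke R2 from Table 10.7
print(describe_r_squared(0.54))
```

The leftover share (46% here) is the part of the story told by measures that were not included in the model.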

Interpretation 2: In this interpretation we are still using logistic regression to predict family reunification (our outcome measure) among child welfare-involved parents, and we are looking at the same two-measure “model.” In this approach to interpretation, we look at one independent measure’s relationship to the outcome measure at a time. One of our measures tells us whether child welfare involved parents had a low or high child abuse potential score (yes/no), so let’s start there.

Our interpretation focuses on parents with low child abuse potential scores as compared to parents with high child abuse potential scores. To interpret the odds ratio along this line of the table, we look at the number in the child abuse potential score row and the Exp(B) column. We see the number 2.30. This tells us parents with low child abuse potential scores were 2.3 times more likely to be reunified with their children, “controlling for” (taking into consideration) the other independent measure in the model, which is about achieving reunification goals. Our p level tells us that this is a statistically significant finding, meaning it is unlikely to be due to chance.

Let’s say that our odds ratio had been 0.23 instead of 2.3. In this situation, we subtract 0.23 from 1, and get .77. We interpret this as a percent. This would tell us that parents with low child abuse potential scores were 77% less likely to reunify with their children (which doesn’t make a whole lot of common sense, but this is just an example). When our odds ratios are above 1, we talk about “times more likely” and when they are below 1, we talk about “percent less likely.” (An odds ratio is never a negative number.) So, an odds ratio of 0 point anything is always about lower likelihood.
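
The rule of thumb above can be written out as a short Python sketch. The function name `describe_odds_ratio` is invented for this example, and the wording it returns follows this chapter's phrasing ("times more likely," "percent less likely"); strictly speaking an odds ratio compares odds, so treat the output as a reading aid rather than a formal statement.

```python
def describe_odds_ratio(odds_ratio: float) -> str:
    """Phrase an odds ratio the way this chapter reads it."""
    if odds_ratio <= 0:
        raise ValueError("An odds ratio is always greater than 0")
    if odds_ratio > 1:
        # Above 1: read the number directly as "times more likely"
        return f"{odds_ratio:g} times more likely"
    if odds_ratio < 1:
        # Below 1: subtract from 1 and read the result as a percent
        percent_lower = round((1 - odds_ratio) * 100)
        return f"{percent_lower}% less likely"
    return "no difference in likelihood"


for value in (2.3, 0.23):
    print(value, "->", describe_odds_ratio(value))
```

An odds ratio of exactly 1 sits between the two cases and signals no difference between the groups being compared.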

But back to our real finding, that parents with lower scores on the child abuse potential scale may have better family reunification outcomes. This incentivizes us to work with our parents who have higher scores on this measure so that they can do better at managing the tasks and challenges of parenting.

If we move to a focus on the other independent measure, we are focusing on parents who achieved their reunification goals versus those who did not with respect to whether their family was reunified. Our odds ratio tells us that parents who achieved their reunification goals were 10.1 times more likely to reunify with their children, controlling for the child abuse potential score measure. This is also a statistically significant difference.

This finding tells us that it is not only important to craft the right goals (that parents buy into) but also to facilitate a pathway to goal completion by our clients. It also tells us that we can’t look at goal completion separately from child abuse potential scores, which are also an important factor in family reunification outcomes.
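
As a quick cross-check on the two findings above, the rows of Table 10.7 can be typed into Python and each confidence interval compared with 1, the odds ratio that means "no difference." This is only a sketch: the names `OddsRatioRow`, `TABLE_10_7` and `interval_excludes_one` are made up for the illustration, and the table does not state the confidence level of the intervals.

```python
from typing import NamedTuple


class OddsRatioRow(NamedTuple):
    measure: str
    odds_ratio: float
    ci_low: float
    ci_high: float


# Values copied from Table 10.7
TABLE_10_7 = [
    OddsRatioRow("Low child abuse potential score", 2.30, 1.75, 2.46),
    OddsRatioRow("Achieved all reunification goals", 10.1, 9.2, 11.6),
]


def interval_excludes_one(row: OddsRatioRow) -> bool:
    """True when the whole interval sits above 1 or below 1."""
    return row.ci_low > 1.0 or row.ci_high < 1.0


for row in TABLE_10_7:
    verdict = "excludes" if interval_excludes_one(row) else "includes"
    print(
        f"{row.measure}: odds ratio {row.odds_ratio:g}, "
        f"interval {row.ci_low:g} to {row.ci_high:g} {verdict} 1"
    )
```

When an interval stays entirely on one side of 1, even the lowest plausible odds ratio still points in the same direction, which lines up with a significant p value.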

OK, so you’ve worked through the basics of logistic regression for evaluation. Now, let’s turn our attention to the interpretation of an ordinary least squares (OLS) regression, which is also focused on prediction. This time, instead of focusing on the prediction of a yes/no outcome, we are looking at a prediction of an outcome measure that is a continuous variable, giving a score. In this case, we are trying to predict what increases the outcome measure of child abuse potential, measured on the 1-100 scale we discussed above.

Table 10.8: OLS regression analysis predicting child abuse potential score

| Independent measures | Beta | P value |
|---|---|---|
| Parental age | 0.24 | NS |
| Days of child welfare involvement | 1.08 | p<.05 |
| R2=0.72 | | |

*p< .05; **p< .01; ***p< .001; NS=no significance

Let’s start to understand OLS regression analysis by interpreting what’s in Table 10.8. After grounding ourselves with the title, so we know what we are focusing on, we move on to the second row. We can see the terms “beta” and “p value.” We know that the p value tells us about statistical significance, but what about this beta thing? The beta score tells us about the relationship between the independent measure and the outcome measure. We’ll interpret that in just a bit.

Moving to the first column, we see that the “model” (or set of independent measures) includes parental age and the number of days a family has been involved in the child welfare system. The latter measure could be considered a proxy or stand-in measure for the complexity of the child welfare case. As with logit regression, there are two ways to interpret an OLS regression.

Interpretation 1: In this interpretation we are using OLS regression to predict child abuse potential scores (our outcome measure) among child welfare-involved parents, and we have a two-measure “model.” We set out to predict change in child abuse potential score. In this interpretation we focus on the R2 percentage (note that it is not the Nagelkerke R2 but the regular plain old R2). In this evaluation, our model predicted 72% of the variation in the outcome, the child abuse potential score.

Interpretation 2: In this interpretation we are still using OLS regression to predict child abuse potential score (our outcome measure) among child welfare-involved parents, and we are looking at the same two-measure “model.” In this approach to interpretation, we look at one independent measure’s relationship to the outcome measure at a time. One of our measures tells us how parental age is (or is not) related to child abuse potential score, so let’s start there.

Right off the bat, we may notice that the finding is not significant – this means that a parent’s age is not related to child abuse potential scores when controlling for days of child welfare involvement. If this had been statistically significant, we would have said that for every additional year of a parent’s age, the child abuse potential score went up by 0.24 points, controlling for the other measure in the model. This would have told us that as parents get older, their child abuse potential goes up a little bit.

Now, we need to see what the relationship between days of child welfare involvement and child abuse potential score is. Looking at the beta score, we see that for every additional day a family is child welfare-involved, their child abuse potential score goes up by 1.08 points, controlling for parental age. This suggests that something about child welfare involvement is not conducive to being a good parent, which is counterintuitive. This is something to be considered carefully by the evaluation team in order to try to change practice in this area.
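
To make the "for every additional day" reading concrete, here is a small Python sketch that multiplies the beta from Table 10.8 by a number of extra days. It assumes, as this chapter's reading does, that the beta is the change in the outcome score per one-unit change in the independent measure, with parental age held constant. The names `predicted_score_change` and `BETA_DAYS` are invented for the example.

```python
def predicted_score_change(beta: float, change_in_measure: float) -> float:
    """Predicted change in the outcome for a given change in one measure."""
    return beta * change_in_measure


BETA_DAYS = 1.08  # beta for days of child welfare involvement, Table 10.8

for extra_days in (1, 10, 30):
    change = predicted_score_change(BETA_DAYS, extra_days)
    print(f"{extra_days} more day(s): about {change:.2f} points higher")
```

Seeing the change add up over several days is one way to judge whether a small-looking beta matters in practice.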

In summary, univariate, bivariate and multivariate statistical tests are used to analyze evaluation data. This requires social workers to be able to interpret findings on the most basic level, so that they can inform their practice. Univariate information looks at summary data about a whole group. Bivariate data allows for the statistical comparison of groups or groups across time. There are different types of bivariate tests for measures that are continuous or nominal variables. Multivariate data analysis allows us to consider how sets of factors work together to explain outcomes.

Once social workers embrace their ethical duty to be critical consumers of research for evidence-based practice, the work of interpreting statistics becomes more of a priority. Hopefully this chapter has given you the basics you need to engage in practice evaluation data consumption with or without a team!

Now that we have reviewed the three types of statistical tests used in evaluation, let’s sum up how to interpret a statistical table that is the product of an evaluation analysis. Many people feel anxious when they see a statistical table and are concerned that they will not be able to understand it.

You may not understand ALL of what is on a statistical table, but you will be able to pick out what is important for application to social work practice. In order to do this work, you need to start with a DEEP BREATH and an OPEN MIND before following these simple steps:

• ORIENTATION: Read the title of the table in order to orient yourself. Many people skip this step due to their anxiety. The title will tell you a lot that can help you to decipher the rest of the table. Some of the questions you can ask of the title are:
  • What was the purpose of the table? Summarization? Comparison? Correlation? Prediction?
  • Which statistical test was used? Mean & standard deviation? Percentage? Student’s t test? Odds ratio? Chi-square? ANOVA? OLS regression? Logit regression? Go to your stats cheat sheet to remember what these are about.
  • How many groups are reported on? In one-group situations we will most often be looking at mean & standard deviation as well as percentages and sometimes OLS regression or logit regression. In some OLS regression and logit regression tables, we will be comparing groups.
• INTERPRETATION OF SCORES/PERCENTAGES: Now take a look at the numbers, slowly and methodically in order to increase understanding and reduce anxiety.
  • If one group is reported on: What are the scores or percentages for that group? Once you have interpreted these for the group in question, you are done.
  • If multiple groups are reported on: What are the scores or percentages for each group? If you are dealing with a multiple group comparison, you will need to interpret statistical and clinical significance (see below).
• INTERPRETATION OF STATISTICAL SIGNIFICANCE: While we can see that there may be numerical differences between groups, reading those numbers alone will not tell us if there is a statistical or mathematical difference between them. Statistical significance tells us whether mathematically, there is a real difference between the groups or whether the difference noted is likely due to chance.
• INTERPRETATION OF CLINICAL SIGNIFICANCE: Statistical significance doesn’t tell us everything. We also have to use our clinical minds in order to think about the meaningfulness of the data. Something may have a statistically significant difference, but not a clinically significant difference. For example, two therapy groups may score within 10 points of one another on an outcome measure, with a statistically significant difference noted. Even though one group is a little higher than the other on the outcome measure, this may not be clinically meaningful if we are using a scale of 1-100. If there is no statistical significance, however, we cannot detect a clinically significant difference between groups, because the values are statistically equal. There is still clinical meaningfulness, though, in where the overall scores of both groups fall on the range of scores being considered.
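
The last two steps can be sketched as a small decision rule in Python. This is only an illustration: the function name `interpret_difference` is made up, the 0.05 cutoff mirrors the single-star level used in this chapter's tables, and `meaningful_difference` is a hypothetical threshold that the practitioner would have to set from clinical judgment, not something a statistical table provides.

```python
def interpret_difference(
    p_value: float,
    difference: float,
    meaningful_difference: float,
    alpha: float = 0.05,
) -> str:
    """Combine statistical and clinical significance for a two-group comparison."""
    if p_value >= alpha:
        # No statistical significance: the groups are treated as equal
        return "No statistically significant difference between groups"
    if abs(difference) >= meaningful_difference:
        return "Statistically significant and clinically meaningful"
    return "Statistically significant, but may not be clinically meaningful"


# Two therapy groups 10 points apart on a 1-100 scale, with a
# hypothetical clinical threshold of 15 points
print(interpret_difference(p_value=0.03, difference=10, meaningful_difference=15))
```

The order of the checks follows the steps above: statistical significance is read first, and clinical meaningfulness of a between-group difference is only weighed once a statistically significant difference has been found.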

## Discussion questions for chapter 10

• Thinking of your current internship or work placement, how could you use univariate, bivariate and multivariate statistics to inform your work?
• What are the differences between univariate, bivariate and multivariate statistics?
• What is statistical significance as compared to clinical significance?
• If there is no statistically significant difference in evaluation outcome measures between two groups, is there clinical significance?

Chen, H., Cohen, P., & Chen, S. (2010). How big is a big odds ratio? Interpreting the magnitudes of odds ratios in epidemiological studies. Communications in Statistics - Simulation and Computation, 39(4), 860–864.