Evaluating Research in Academic Journals: A Practical Guide to Realistic Evaluation

  • November 2018
  • Edition: 7th
  • Publisher: Routledge (Taylor & Francis)
  • ISBN: 978-0815365662
  • Maria Tcherni-Buzzeo, University of New Haven

Evaluating Research Articles


Imagine for a moment that you are trying to answer a clinical (PICO) question regarding one of your patients/clients. Do you know how to determine if a research study is of high quality? Can you tell if it is applicable to your question? In evidence-based practice, there are many things to look for in an article that will reveal its quality and relevance. This guide is a collection of resources and activities that will help you learn how to evaluate articles efficiently and accurately.

Is health research new to you? Or perhaps you're a little out of practice with reading it? The following questions will help illuminate an article's strengths or shortcomings. Ask them of yourself as you are reading an article:

  • Is the article peer reviewed?
  • Are there any conflicts of interest based on the author's affiliation or the funding source of the research?
  • Are the research questions or objectives clearly defined?
  • Is the study a systematic review or meta-analysis?
  • Is the study design appropriate for the research question?
  • Is the sample size justified? Do the authors explain how it is representative of the wider population?
  • Do the researchers describe the setting of data collection?
  • Does the paper clearly describe the measurements used?
  • Did the researchers use appropriate statistical measures?
  • Are the research questions or objectives answered?
  • Did the researchers account for confounding factors?
  • Have the researchers only drawn conclusions about the groups represented in the research?
  • Have the authors declared any conflicts of interest?

If the answers to these questions are mostly YESes, then it's likely that the article is of decent quality. If the answers are mostly NOs, then it may be a good idea to move on to another article. If the YESes and NOs are roughly even, you'll have to decide for yourself whether the article is of good enough quality for your purposes. Some factors, like a poor literature review, are not as important as the researchers neglecting to describe the measurements they used. As you read more research, you'll find it easier to distinguish well-conducted studies from poorly conducted ones.
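Purely as an illustrative sketch (not part of the original guide), the weighting idea above can be expressed as a simple tally in which some questions count more than others. The questions, weights, and thresholds in the snippet below are assumptions chosen for demonstration, not an established scoring rule.

```python
# Illustrative appraisal tally: weight the critical questions more heavily.
# Questions, weights, and decision thresholds are assumptions for demonstration only.

CHECKLIST = {
    "peer_reviewed": 1.0,
    "conflicts_of_interest_declared": 1.0,
    "research_question_clear": 1.5,
    "design_appropriate": 2.0,
    "sample_size_justified": 1.5,
    "measurements_described": 2.0,      # weighted heavily, per the guide's note
    "statistics_appropriate": 2.0,
    "confounders_addressed": 1.5,
    "conclusions_limited_to_sample": 1.0,
    "literature_review_thorough": 0.5,  # less critical, per the guide's note
}

def appraise(answers: dict) -> str:
    """answers maps each checklist key to True (yes) or False (no)."""
    total = sum(CHECKLIST.values())
    score = sum(w for q, w in CHECKLIST.items() if answers.get(q, False))
    if score >= 0.7 * total:
        return "likely decent quality"
    if score <= 0.4 * total:
        return "consider another article"
    return "borderline: use your own judgement"

# Example: strong on methods, weak only on the literature review.
example = {q: True for q in CHECKLIST}
example["literature_review_thorough"] = False
print(appraise(example))  # -> likely decent quality
```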


Determining whether a research study has used appropriate statistical measures is one of the most critical and difficult steps in evaluating an article. The following links are quick, useful resources for better understanding how statistics are used and interpreted in health research.


  • How to read a paper: Statistics for the non-statistician. II: “Significant” relations and their pitfalls. This article continues the checklist of questions that will help you to appraise the statistical validity of a paper. Greenhalgh T. How to read a paper: Statistics for the non-statistician. II: “Significant” relations and their pitfalls. BMJ 1997;315:422. *On the PMC PDF, you need to scroll past the first article to get to this one.*
  • A consumer's guide to subgroup analysis The extent to which a clinician should believe and act on the results of subgroup analyses of data from randomized trials or meta-analyses is controversial. Guidelines are provided in this paper for making these decisions.

Statistical Versus Clinical Significance

When appraising studies, it's important to consider both the clinical and statistical significance of the research. This video offers a quick explanation of why.

If you have a little more time, this video explores statistical and clinical significance in more detail, including examples of how to calculate an effect size.

  • Statistical vs. Clinical Significance Transcript Transcript document for the Statistical vs. Clinical Significance video.
  • Effect Size Transcript Transcript document for the Effect Size video.
  • P Values, Statistical Significance & Clinical Significance This handout also explains clinical and statistical significance.
  • Absolute versus relative risk – making sense of media stories. Understanding the difference between relative and absolute risk is essential to interpreting the statistics commonly reported in research articles (a short worked sketch follows this list).
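To make the ideas above concrete, here is a small worked sketch (not drawn from the linked resources) that computes Cohen's d, one common effect-size measure, and contrasts absolute with relative risk; all figures are invented for demonstration.

```python
# Illustrative calculations only; the numbers are invented for demonstration.
from math import sqrt

def cohens_d(mean1, sd1, n1, mean2, sd2, n2):
    """Cohen's d with a pooled standard deviation: one common effect-size measure."""
    pooled_sd = sqrt(((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / (n1 + n2 - 2))
    return (mean1 - mean2) / pooled_sd

# Hypothetical trial: treatment lowers a symptom score from 50 to 46 (SD 10, n = 100 per arm).
d = cohens_d(46, 10, 100, 50, 10, 100)
print(f"Effect size (Cohen's d): {d:.2f}")   # -0.40, a small-to-moderate effect

# Absolute vs. relative risk with hypothetical event rates.
risk_control, risk_treatment = 0.02, 0.01    # 2% vs. 1% of patients experience the event
absolute_risk_reduction = risk_control - risk_treatment           # 0.01 -> 1 percentage point
relative_risk_reduction = absolute_risk_reduction / risk_control  # 0.50 -> "risk halved"
print(f"Absolute risk reduction: {absolute_risk_reduction:.2%}")
print(f"Relative risk reduction: {relative_risk_reduction:.0%}")
# A "50% lower risk" headline can describe a change of only one percentage point.
```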

Critical appraisal is the process of systematically evaluating research using established and transparent methods. In critical appraisal, health professionals use validated checklists/worksheets as tools to guide their assessment of the research. It is a more advanced way of evaluating research than the more basic method explained above. To learn more about critical appraisal or to access critical appraisal tools, visit the websites below.


  • Last Updated: Jun 11, 2024 10:26 AM
  • URL: https://libguides.massgeneral.org/evaluatingarticles


Criteria for Good Qualitative Research: A Comprehensive Review

  • Regular Article
  • Open access
  • Published: 18 September 2021
  • Volume 31 , pages 679–689, ( 2022 )


  • Drishti Yadav, ORCID: orcid.org/0000-0002-2974-0323


This review aims to synthesize a published set of evaluative criteria for good qualitative research. The aim is to shed light on existing standards for assessing the rigor of qualitative research encompassing a range of epistemological and ontological standpoints. Using a systematic search strategy, published journal articles that deliberate criteria for rigorous research were identified. Then, the references of relevant articles were surveyed to find noteworthy, distinct, and well-defined pointers to good qualitative research. This review presents an investigative assessment of the pivotal features of qualitative research that permit readers to pass judgment on its quality and to recognize it as good research when these features are objectively and adequately applied. Overall, this review underlines the crux of qualitative research and accentuates the necessity of evaluating such research by the very tenets of its being. It also offers some prospects and recommendations to improve the quality of qualitative research. Based on the findings of this review, it is concluded that quality criteria are the product of socio-institutional practices and existing paradigmatic standpoints. Owing to the paradigmatic diversity of qualitative research, a single and specific set of quality criteria is neither feasible nor anticipated. Since qualitative research is not a cohesive discipline, researchers need to educate and familiarize themselves with the applicable norms and decisive factors for evaluating qualitative research from within its theoretical and methodological framework of origin.


Introduction

“… It is important to regularly dialogue about what makes for good qualitative research” (Tracy, 2010 , p. 837)

What constitutes good qualitative research is highly debatable, since qualitative research encompasses numerous methods grounded in diverse philosophical perspectives. Bryman et al. (2008, p. 262) suggest that “It is widely assumed that whereas quality criteria for quantitative research are well‐known and widely agreed, this is not the case for qualitative research.” Hence, the question “how to evaluate the quality of qualitative research” has been continuously debated. These debates on the assessment of qualitative research have taken place across many areas of science and technology. Examples include various areas of psychology: general psychology (Madill et al., 2000); counseling psychology (Morrow, 2005); and clinical psychology (Barker & Pistrang, 2005), and other disciplines of the social sciences: social policy (Bryman et al., 2008); health research (Sparkes, 2001); business and management research (Johnson et al., 2006); information systems (Klein & Myers, 1999); and environmental studies (Reid & Gough, 2000). In the literature, these debates are animated by the view that the blanket application of criteria for good qualitative research developed around the positivist paradigm is improper. Such debates are rooted in the wide range of philosophical backgrounds within which qualitative research is conducted (e.g., Sandberg, 2000; Schwandt, 1996). This methodological diversity has led to the formulation of different sets of criteria applicable to qualitative research.

Among qualitative researchers, the dilemma of deciding which measures to use to assess the quality of research is not a new phenomenon, especially when the virtuous triad of objectivity, reliability, and validity (Spencer et al., 2004) is not adequate. Occasionally, the criteria of quantitative research are used to evaluate qualitative research (Cohen & Crabtree, 2008; Lather, 2004). Indeed, Howe (2004) claims that the prevailing paradigm in educational research is scientifically based experimental research. Assumptions about the preeminence of quantitative research can undermine the worth and usefulness of qualitative research by neglecting the importance of matching the research paradigm, the epistemological stance of the researcher, and the choice of methodology to the purpose of the study. Researchers have been cautioned about this in “Paradigmatic controversies, contradictions, and emerging confluences” (Lincoln & Guba, 2000).

In general, qualitative research comes from a very different paradigmatic stance and therefore demands distinctive criteria for evaluating good research and the varieties of research contributions that can be made. This review presents a series of evaluative criteria for qualitative researchers, arguing that the choice of criteria needs to be compatible with the unique nature of the research in question (its methodology, aims, and assumptions). It aims to assist researchers in identifying some of the indispensable features or markers of high-quality qualitative research. In a nutshell, the purpose of this systematic literature review is to analyze the existing knowledge on high-quality qualitative research and to identify research studies dealing with the critical assessment of qualitative research across diverse paradigmatic stances. In contrast to existing reviews, this review also suggests some critical directions to follow to improve the quality of qualitative research from different epistemological and ontological perspectives. It is also intended to provide guidelines that can accelerate future developments and dialogues among qualitative researchers in the context of assessing qualitative research.

The rest of this review article is structured as follows: Sect. Methods describes the method followed for performing this review. Section Criteria for Evaluating Qualitative Studies provides a comprehensive description of the criteria for evaluating qualitative studies. This section is followed by a summary of strategies to improve the quality of qualitative research in Sect. Improving Quality: Strategies. Section How to Assess the Quality of the Research Findings? provides details on how to assess the quality of research findings. After that, some quality checklists (as tools to evaluate quality) are discussed in Sect. Quality Checklists: Tools for Assessing the Quality. Finally, the review ends with the concluding remarks presented in Sect. Conclusions, Future Directions and Outlook, which also presents some prospects for enhancing the quality and usefulness of qualitative research in the social and techno-scientific research community.

Methods

For this review, a comprehensive literature search was performed across several databases using generic search terms such as qualitative research, criteria, etc. The following databases were chosen for the literature search based on the high number of results: IEEE Xplore, ScienceDirect, PubMed, Google Scholar, and Web of Science. The following keywords (and their combinations using the Boolean connectives OR/AND) were adopted for the literature search: qualitative research, criteria, quality, assessment, and validity. Synonyms for these keywords were collected and arranged in a logical structure (see Table 1). All journal and conference publications from 1950 to 2021 were considered for the search. Other articles extracted from the references of the papers identified in the electronic search were also included. A large number of publications on qualitative research were retrieved during the initial screening. Hence, to restrict the search to publications focusing on criteria for good qualitative research, an inclusion criterion was built into the search string.

From the selected databases, the search retrieved a total of 765 publications. Then, the duplicate records were removed. After that, based on the title and abstract, the remaining 426 publications were screened for their relevance by using the following inclusion and exclusion criteria (see Table 2 ). Publications focusing on evaluation criteria for good qualitative research were included, whereas those works which delivered theoretical concepts on qualitative research were excluded. Based on the screening and eligibility, 45 research articles were identified that offered explicit criteria for evaluating the quality of qualitative research and were found to be relevant to this review.
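As a minimal sketch of the workflow described above (not part of the original review), the Boolean query construction and a PRISMA-style tally can be expressed as follows; the synonym groups are abbreviated and the query syntax is a generic assumption, since each database uses its own search grammar.

```python
# Minimal sketch of the search and screening workflow described in the text.
# Synonym groups are illustrative; the review's full keyword table (Table 1) is not reproduced here.

synonym_groups = [
    ["qualitative research", "qualitative study"],
    ["criteria", "standards", "quality", "rigor", "validity", "assessment"],
]

# Combine synonyms with OR inside each group and AND between groups (generic Boolean syntax).
query = " AND ".join("(" + " OR ".join(f'"{term}"' for term in group) + ")"
                     for group in synonym_groups)
print(query)
# ("qualitative research" OR "qualitative study") AND ("criteria" OR "standards" OR ...)

# PRISMA-style tally using the counts reported in the text.
records_retrieved = 765      # total hits across the selected databases
records_after_dedup = 426    # screened on title and abstract
included_studies = 45        # offered explicit criteria for evaluating qualitative research
print(f"Duplicates removed: {records_retrieved - records_after_dedup}")
print(f"Excluded at screening/eligibility: {records_after_dedup - included_studies}")
```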

Figure 1 illustrates the complete review process in the form of a PRISMA flow diagram. PRISMA (Preferred Reporting Items for Systematic Reviews and Meta-Analyses) is employed in systematic reviews to improve the quality of reporting.

Figure 1. PRISMA flow diagram illustrating the search and inclusion process. N represents the number of records.

Criteria for Evaluating Qualitative Studies

Fundamental Criteria: General Research Quality

Various researchers have put forward criteria for evaluating qualitative research, which have been summarized in Table 3 . Also, the criteria outlined in Table 4 effectively deliver the various approaches to evaluate and assess the quality of qualitative work. The entries in Table 4 are based on Tracy’s “Eight big‐tent criteria for excellent qualitative research” (Tracy, 2010 ). Tracy argues that high-quality qualitative work should formulate criteria focusing on the worthiness, relevance, timeliness, significance, morality, and practicality of the research topic, and the ethical stance of the research itself. Researchers have also suggested a series of questions as guiding principles to assess the quality of a qualitative study (Mays & Pope, 2020 ). Nassaji ( 2020 ) argues that good qualitative research should be robust, well informed, and thoroughly documented.

Qualitative Research: Interpretive Paradigms

All qualitative researchers follow highly abstract principles which bring together beliefs about ontology, epistemology, and methodology. These beliefs govern how the researcher perceives and acts. The net which encompasses the researcher’s epistemological, ontological, and methodological premises is referred to as a paradigm, or an interpretive structure, a “basic set of beliefs that guides action” (Guba, 1990). Four major interpretive paradigms structure qualitative research: positivist and postpositivist, constructivist interpretive, critical (Marxist, emancipatory), and feminist poststructural. The complexity of these four abstract paradigms increases at the level of concrete, specific interpretive communities. Table 5 presents these paradigms and their assumptions, including their criteria for evaluating research, and the typical form that an interpretive or theoretical statement assumes in each paradigm. Moreover, for evaluating qualitative research, quantitative conceptualizations of reliability and validity have proven to be incompatible (Horsburgh, 2003). In addition, a series of questions have been put forward in the literature to assist a reviewer (who is proficient in qualitative methods) in the meticulous assessment and endorsement of qualitative research (Morse, 2003). Hammersley (2007) also suggests that guiding principles for qualitative research are advantageous, but that methodological pluralism should not be simply acknowledged for all qualitative approaches. Seale (1999) also points out the significance of methodological cognizance in research studies.

Table 5 reflects that criteria for assessing the quality of qualitative research are the aftermath of socio-institutional practices and existing paradigmatic standpoints. Owing to the paradigmatic diversity of qualitative research, a single set of quality criteria is neither possible nor desirable. Hence, the researchers must be reflexive about the criteria they use in the various roles they play within their research community.

Improving Quality: Strategies

Another critical question is “How can qualitative researchers ensure that the abovementioned quality criteria are met?” Lincoln and Guba (1986) delineated several strategies to strengthen each criterion of trustworthiness. Other researchers (Merriam & Tisdell, 2016; Shenton, 2004) have also presented such strategies. A brief description of these strategies is shown in Table 6.

It is worth mentioning that generalizability is also an integral part of qualitative research (Hays & McKibben, 2021 ). In general, the guiding principle pertaining to generalizability speaks about inducing and comprehending knowledge to synthesize interpretive components of an underlying context. Table 7 summarizes the main metasynthesis steps required to ascertain generalizability in qualitative research.

Figure 2 reflects the crucial components of a conceptual framework and their contribution to decisions regarding research design, implementation, and applications of results to future thinking, study, and practice (Johnson et al., 2020). The synergy and interrelationship of these components signify their role in the different stages of a qualitative research study.

Figure 2. Essential elements of a conceptual framework.

In a nutshell, to assess the rationale of a study, its conceptual framework and research question(s), quality criteria must take account of the following: lucid context for the problem statement in the introduction; well-articulated research problems and questions; precise conceptual framework; distinct research purpose; and clear presentation and investigation of the paradigms. These criteria would expedite the quality of qualitative research.

How to Assess the Quality of the Research Findings?

The inclusion of quotes or similar research data enhances the confirmability of the write-up of the findings. The use of expressions (for instance, “80% of all respondents agreed that” or “only one of the interviewees mentioned that”) may also quantify qualitative findings (Stenfors et al., 2020). On the other hand, a persuasive argument for why such quantification may not strengthen the research has also been provided (Monrouxe & Rees, 2020). Further, the Discussion and Conclusion sections of an article also provide robust markers of high-quality qualitative research, as elucidated in Table 8.

Quality Checklists: Tools for Assessing the Quality

Numerous checklists are available to speed up the assessment of the quality of qualitative research. However, if used uncritically, with no regard for the research context, these checklists may be counterproductive. I suggest that such lists and guiding principles can assist in pinpointing the markers of high-quality qualitative research. However, considering the enormous variation in authors’ theoretical and philosophical contexts, I would emphasize that heavy dependence on such checklists may say little about whether the findings can be applied in your setting. A combination of such checklists might be appropriate for novice researchers. Some of these checklists are listed below:

The most commonly used framework is Consolidated Criteria for Reporting Qualitative Research (COREQ) (Tong et al., 2007 ). This framework is recommended by some journals to be followed by the authors during article submission.

Standards for Reporting Qualitative Research (SRQR) is another checklist that has been created particularly for medical education (O’Brien et al., 2014 ).

Also, Tracy ( 2010 ) and Critical Appraisal Skills Programme (CASP, 2021 ) offer criteria for qualitative research relevant across methods and approaches.

Further, researchers have also outlined different criteria as hallmarks of high-quality qualitative research. For instance, the “Road Trip Checklist” (Epp & Otnes, 2021 ) provides a quick reference to specific questions to address different elements of high-quality qualitative research.

Conclusions, Future Directions, and Outlook

This work presents a broad review of the criteria for good qualitative research. In addition, it offers an exploratory analysis of the essential elements of qualitative research that can enable readers of qualitative work to judge it as good research when objectively and adequately applied. In this review, some of the essential markers that indicate high-quality qualitative research have been highlighted. I scope them narrowly to achieving rigor in qualitative research and note that they do not completely cover the broader considerations necessary for high-quality research.

This review points out that a universal, one-size-fits-all guideline for evaluating the quality of qualitative research does not exist. In other words, it emphasizes the non-existence of a set of common guidelines among qualitative researchers. At the same time, it reinforces that each qualitative approach should be treated on its own terms, on account of its distinctive features and its epistemological and disciplinary position. Because the worth of qualitative research is sensitive to the specific context and the type of paradigmatic stance, researchers should themselves analyze which approaches can and must be tailored to suit the distinct characteristics of the phenomenon under investigation.

Although this article does not claim to offer a magic bullet or a one-stop solution for dealing with dilemmas about how, why, or whether to evaluate the “goodness” of qualitative research, it offers a platform to assist researchers in improving their qualitative studies. It provides an assembly of concerns to reflect on, a series of questions to ask, and multiple sets of criteria to look at when attempting to determine the quality of qualitative research. Overall, this review underlines the crux of qualitative research and accentuates the need to evaluate such research by the very tenets of its being. By bringing together the vital arguments and delineating the requirements that good qualitative research should satisfy, this review strives to equip researchers and reviewers alike to make well-informed judgments about the worth and significance of the qualitative research under scrutiny. In a nutshell, a comprehensive portrayal of the research process (from the context of the research to the research objectives, research questions and design, theoretical foundations, and from approaches to collecting data to analyzing the results and deriving inferences) greatly enhances the quality of a qualitative study.

Prospects: A Road Ahead for Qualitative Research

Irrefutably, qualitative research is a vibrant and evolving discipline wherein different epistemological and disciplinary positions have their own characteristics and importance. In addition, not surprisingly, owing to the evolving and varied features of qualitative research, no consensus has been reached to date. Researchers have voiced various concerns and proposed several recommendations for editors and reviewers on conducting reviews of critical qualitative research (Levitt et al., 2021; McGinley et al., 2021). The following are some prospects and recommendations put forward towards the maturation of qualitative research and its quality evaluation:

In general, most manuscript and grant reviewers are not qualitative experts. Hence, it is more likely that they would prefer to adopt a broad set of criteria. However, researchers and reviewers need to keep in mind that it is inappropriate to apply the same approaches and standards to all qualitative research. Therefore, future work needs to focus on educating researchers and reviewers about the criteria for evaluating qualitative research from within the appropriate theoretical and methodological context.

There is an urgent need to revisit and augment critical assessment of some well-known and widely accepted tools (including checklists such as COREQ and SRQR) to interrogate their applicability to different aspects of research (along with their epistemological ramifications).

Efforts should be made towards creating more space for creativity, experimentation, and a dialogue between the diverse traditions of qualitative research. This would potentially help to avoid the enforcement of one's own set of quality criteria on the work carried out by others.

Moreover, journal reviewers need to be aware of various methodological practices and philosophical debates.

It is pivotal to highlight the expressions and considerations of qualitative researchers and to bring them into a more open and transparent dialogue about assessing qualitative research in techno-scientific, academic, sociocultural, and political arenas.

Frequent debates on the use of evaluative criteria are required to address some potentially unresolved issues (including the applicability of a single set of criteria in multi-disciplinary contexts). Such debates would not only benefit qualitative researchers themselves, but would primarily help augment the well-being and vitality of the entire discipline.

To conclude, I speculate that these criteria, and my perspective, may transfer to other methods, approaches, and contexts. I hope that they spark dialogue and debate about criteria for excellent qualitative research and the underpinnings of the discipline more broadly, and thereby help improve the quality of qualitative studies. Further, I anticipate that this review will assist researchers in reflecting on the quality of their own research and in substantiating their research designs, and will help reviewers to review qualitative research for journals. On a final note, I point to the need to formulate a framework (encompassing the prerequisites of a qualitative study) through the cohesive efforts of qualitative researchers from different disciplines with different theoretic-paradigmatic origins. I believe that tailoring such a framework (of guiding principles) paves the way for qualitative researchers to consolidate the status of qualitative research in the wide-ranging open science debate. Dialogue on this issue across different approaches is crucial for the future prospects of socio-techno-educational research.

Amin, M. E. K., Nørgaard, L. S., Cavaco, A. M., Witry, M. J., Hillman, L., Cernasev, A., & Desselle, S. P. (2020). Establishing trustworthiness and authenticity in qualitative pharmacy research. Research in Social and Administrative Pharmacy, 16 (10), 1472–1482.


Barker, C., & Pistrang, N. (2005). Quality criteria under methodological pluralism: Implications for conducting and evaluating research. American Journal of Community Psychology, 35 (3–4), 201–212.

Bryman, A., Becker, S., & Sempik, J. (2008). Quality criteria for quantitative, qualitative and mixed methods research: A view from social policy. International Journal of Social Research Methodology, 11 (4), 261–276.

Caelli, K., Ray, L., & Mill, J. (2003). ‘Clear as mud’: Toward greater clarity in generic qualitative research. International Journal of Qualitative Methods, 2 (2), 1–13.

CASP (2021). CASP checklists. Retrieved May 2021 from https://casp-uk.net/casp-tools-checklists/

Cohen, D. J., & Crabtree, B. F. (2008). Evaluative criteria for qualitative research in health care: Controversies and recommendations. The Annals of Family Medicine, 6 (4), 331–339.

Denzin, N. K., & Lincoln, Y. S. (2005). Introduction: The discipline and practice of qualitative research. In N. K. Denzin & Y. S. Lincoln (Eds.), The sage handbook of qualitative research (pp. 1–32). Sage Publications Ltd.


Elliott, R., Fischer, C. T., & Rennie, D. L. (1999). Evolving guidelines for publication of qualitative research studies in psychology and related fields. British Journal of Clinical Psychology, 38 (3), 215–229.

Epp, A. M., & Otnes, C. C. (2021). High-quality qualitative research: Getting into gear. Journal of Service Research . https://doi.org/10.1177/1094670520961445

Guba, E. G. (1990). The paradigm dialog. In Alternative paradigms conference, mar, 1989, Indiana u, school of education, San Francisco, ca, us . Sage Publications, Inc.

Hammersley, M. (2007). The issue of quality in qualitative research. International Journal of Research and Method in Education, 30 (3), 287–305.

Haven, T. L., Errington, T. M., Gleditsch, K. S., van Grootel, L., Jacobs, A. M., Kern, F. G., & Mokkink, L. B. (2020). Preregistering qualitative research: A Delphi study. International Journal of Qualitative Methods, 19 , 1609406920976417.

Hays, D. G., & McKibben, W. B. (2021). Promoting rigorous research: Generalizability and qualitative research. Journal of Counseling and Development, 99 (2), 178–188.

Horsburgh, D. (2003). Evaluation of qualitative research. Journal of Clinical Nursing, 12 (2), 307–312.

Howe, K. R. (2004). A critique of experimentalism. Qualitative Inquiry, 10 (1), 42–46.

Johnson, J. L., Adkins, D., & Chauvin, S. (2020). A review of the quality indicators of rigor in qualitative research. American Journal of Pharmaceutical Education, 84 (1), 7120.

Johnson, P., Buehring, A., Cassell, C., & Symon, G. (2006). Evaluating qualitative management research: Towards a contingent criteriology. International Journal of Management Reviews, 8 (3), 131–156.

Klein, H. K., & Myers, M. D. (1999). A set of principles for conducting and evaluating interpretive field studies in information systems. MIS Quarterly, 23 (1), 67–93.

Lather, P. (2004). This is your father’s paradigm: Government intrusion and the case of qualitative research in education. Qualitative Inquiry, 10 (1), 15–34.

Levitt, H. M., Morrill, Z., Collins, K. M., & Rizo, J. L. (2021). The methodological integrity of critical qualitative research: Principles to support design and research review. Journal of Counseling Psychology, 68 (3), 357.

Lincoln, Y. S., & Guba, E. G. (1986). But is it rigorous? Trustworthiness and authenticity in naturalistic evaluation. New Directions for Program Evaluation, 1986 (30), 73–84.

Lincoln, Y. S., & Guba, E. G. (2000). Paradigmatic controversies, contradictions and emerging confluences. In N. K. Denzin & Y. S. Lincoln (Eds.), Handbook of qualitative research (2nd ed., pp. 163–188). Sage Publications.

Madill, A., Jordan, A., & Shirley, C. (2000). Objectivity and reliability in qualitative analysis: Realist, contextualist and radical constructionist epistemologies. British Journal of Psychology, 91 (1), 1–20.

Mays, N., & Pope, C. (2020). Quality in qualitative research. Qualitative Research in Health Care . https://doi.org/10.1002/9781119410867.ch15

McGinley, S., Wei, W., Zhang, L., & Zheng, Y. (2021). The state of qualitative research in hospitality: A 5-year review 2014 to 2019. Cornell Hospitality Quarterly, 62 (1), 8–20.

Merriam, S., & Tisdell, E. (2016). Qualitative research: A guide to design and implementation. San Francisco, US.

Meyer, M., & Dykes, J. (2019). Criteria for rigor in visualization design study. IEEE Transactions on Visualization and Computer Graphics, 26 (1), 87–97.

Monrouxe, L. V., & Rees, C. E. (2020). When I say… quantification in qualitative research. Medical Education, 54 (3), 186–187.

Morrow, S. L. (2005). Quality and trustworthiness in qualitative research in counseling psychology. Journal of Counseling Psychology, 52 (2), 250.

Morse, J. M. (2003). A review committee’s guide for evaluating qualitative proposals. Qualitative Health Research, 13 (6), 833–851.

Nassaji, H. (2020). Good qualitative research. Language Teaching Research, 24 (4), 427–431.

O’Brien, B. C., Harris, I. B., Beckman, T. J., Reed, D. A., & Cook, D. A. (2014). Standards for reporting qualitative research: A synthesis of recommendations. Academic Medicine, 89 (9), 1245–1251.

O’Connor, C., & Joffe, H. (2020). Intercoder reliability in qualitative research: Debates and practical guidelines. International Journal of Qualitative Methods, 19 , 1609406919899220.

Reid, A., & Gough, S. (2000). Guidelines for reporting and evaluating qualitative research: What are the alternatives? Environmental Education Research, 6 (1), 59–91.

Rocco, T. S. (2010). Criteria for evaluating qualitative studies. Human Resource Development International . https://doi.org/10.1080/13678868.2010.501959

Sandberg, J. (2000). Understanding human competence at work: An interpretative approach. Academy of Management Journal, 43 (1), 9–25.

Schwandt, T. A. (1996). Farewell to criteriology. Qualitative Inquiry, 2 (1), 58–72.

Seale, C. (1999). Quality in qualitative research. Qualitative Inquiry, 5 (4), 465–478.

Shenton, A. K. (2004). Strategies for ensuring trustworthiness in qualitative research projects. Education for Information, 22 (2), 63–75.

Sparkes, A. C. (2001). Myth 94: Qualitative health researchers will agree about validity. Qualitative Health Research, 11 (4), 538–552.

Spencer, L., Ritchie, J., Lewis, J., & Dillon, L. (2004). Quality in qualitative evaluation: A framework for assessing research evidence.

Stenfors, T., Kajamaa, A., & Bennett, D. (2020). How to assess the quality of qualitative research. The Clinical Teacher, 17 (6), 596–599.

Taylor, E. W., Beck, J., & Ainsworth, E. (2001). Publishing qualitative adult education research: A peer review perspective. Studies in the Education of Adults, 33 (2), 163–179.

Tong, A., Sainsbury, P., & Craig, J. (2007). Consolidated criteria for reporting qualitative research (COREQ): A 32-item checklist for interviews and focus groups. International Journal for Quality in Health Care, 19 (6), 349–357.

Tracy, S. J. (2010). Qualitative quality: Eight “big-tent” criteria for excellent qualitative research. Qualitative Inquiry, 16 (10), 837–851.


Open access funding provided by TU Wien (TUW).

Author information

Authors and Affiliations

Faculty of Informatics, Technische Universität Wien, 1040, Vienna, Austria

Drishti Yadav


Corresponding author

Correspondence to Drishti Yadav .

Ethics declarations

Conflict of interest.

The author declares no conflict of interest.

Additional information

Publisher's note.

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ .


About this article

Yadav, D. Criteria for Good Qualitative Research: A Comprehensive Review. Asia-Pacific Edu Res 31 , 679–689 (2022). https://doi.org/10.1007/s40299-021-00619-0


Accepted : 28 August 2021

Published : 18 September 2021

Issue Date : December 2022

DOI : https://doi.org/10.1007/s40299-021-00619-0


  • Qualitative research
  • Evaluative criteria

  • Open access
  • Published: 02 March 2021

Research impact evaluation and academic discourse

  • Marta Natalia Wróblewska, ORCID: orcid.org/0000-0001-8575-5215

Humanities and Social Sciences Communications, volume 8, Article number: 58 (2021)


  • Language and linguistics
  • Science, technology and society

The introduction of ‘impact’ as an element of assessment constitutes a major change in the construction of research evaluation systems. While various protocols of impact evaluation exist, the most articulated one was implemented as part of the British Research Excellence Framework (REF). This paper investigates the nature and consequences of the rise of ‘research impact’ as an element of academic evaluation from the perspective of discourse. Drawing from linguistic pragmatics and Foucauldian discourse analysis, the study discusses shifts related to the so-called Impact Agenda in four stages, in chronological order: (1) the ‘problematization’ of the notion of ‘impact’, (2) the establishment of an ‘impact infrastructure’, (3) the consolidation of a new genre of writing, the impact case study, and (4) academics’ positioning practices towards the notion of ‘impact’, theorized here as the triggering of new practices of ‘subjectivation’ of the academic self. The description of the basic functioning of the ‘discourse of impact’ is based on the analysis of two corpora: case studies submitted by a selected group of academics (linguists) to REF2014 (n = 78) and interviews (n = 25) with their authors. Linguistic pragmatics is particularly useful in analyzing linguistic aspects of the data, while Foucault’s theory helps draw together findings from the two datasets in a broader analysis based on a governmentality framework. This approach allows for more general conclusions on the practices of governing (academic) subjects within evaluation contexts.


Introduction

The introduction of ‘research impact’ as an element of evaluation constitutes a major change in the construction of research evaluation systems. ‘Impact’, understood broadly as the influence of academic research beyond the academic sphere, including areas such as business, education, public health, policy, public debate, culture etc., has been progressively implemented in various systems of science evaluation—a trend observable worldwide (Donovan, 2011 ; Grant et al., 2009 ; European Science Foundation, 2012 ). Salient examples of attempts to systematically evaluate research impact include the Australian Research Quality Framework–RQF (Donovan, 2008 ) and the Dutch Standard Evaluation Protocol (VSNU–Association of Universities in the Netherlands, 2016 , see ‘societal relevance’).

The most articulated system of impact evaluation to date was implemented in the British cyclical ex post assessment of academic units, the Research Excellence Framework (REF), as part of a broader governmental policy, the Impact Agenda. REF is the most-studied and probably the most influential impact evaluation system to date. It has been used as a model for analogous evaluations in other countries. These include the Norwegian Humeval exercise for the humanities (Research Council of Norway, 2017, pp. 36–37; Wróblewska, 2019) and ensuing evaluations of other fields (Research Council of Norway, 2018, pp. 32–34; Wróblewska, 2019, pp. 12–16). REF has also directly inspired impact evaluation protocols in Hong Kong (Hong Kong University Grants Committee, 2018) and Poland (Wróblewska, 2017). This study is based on data collected in the context of the British REF2014, but it advances a description of the ‘discourse of impact’ that can be generalized and applied to other national and international contexts.

Although impact evaluation is a new practice, a body of literature has been produced on the topic. This includes policy documents on the first edition of REF in 2014 (HEFCE, 2015 ; Stern, 2016 ) and related reports, be it commissioned (King’s College London and Digital Science, 2015 ; Manville et al., 2014 , 2015 ) or conducted independently (National co-ordinating center for public engagement, 2014 ). There also exists a scholarly literature which reflects on the theoretical underpinnings of impact evaluations (Gunn and Mintrom, 2016 , 2018 ; Watermeyer, 2012 , 2016 ) and the observable consequences of the exercise for academic practice (Chubb and Watermeyer, 2017 ; Chubb et al., 2016 ; Watermeyer, 2014 ). While these reports and studies mainly draw on the methods of philosophy, sociology and management, many of them also allude to changes related to language .

Several publications on impact drew attention to the process of meaning-making around the notion of ‘impact’ in the early stages of its existence. Manville et al. flagged up the necessity for the policy-maker to facilitate the development of a common vocabulary to enable a broader ‘cultural shift’ (2015, pp. 16, 26, 37–38, 69). Power wrote of an emerging ‘performance discourse of impact’ (2015, p. 44), while Derrick (2018) looked at the collective process of defining and delimiting “the ambiguous object” of impact at the stage of panel proceedings. The present paper picks up these observations, bringing them together in a unique discursive perspective.

Drawing from linguistic pragmatics and Foucauldian discourse analysis, the paper presents shifts related to the introduction of ‘impact’ as an element of evaluation in four stages. These are, in chronological order: (1) the ‘problematisation’ of the notion of ‘impact’ in policy and its appropriation on a local level, (2) the creation of an impact infrastructure to orchestrate practices around impact, (3) the consolidation of a new genre of writing, the impact case study, and (4) academics’ uptake of the notion of impact and its progressive inclusion in their professional positioning.

Each of these stages is described using theoretical concepts grounded in empirical data. The first stage has to do with the process of ‘problematization’ of a previously non-regulated area, i.e., the process of casting research impact as a ‘problem’ to be addressed and regulated by a set of policy measures. The second stage took place when in rapid response to government policy, new procedures and practices were created within universities, giving rise to an impact ‘infrastructure’ (or ‘apparatus’ in the Foucauldian sense). The third stage is the emergence of a crucial element of the infrastructure—a new genre of academic writing—impact case study. I argue that engaging with the new genre and learning to write impact case studies was key in incorporating ‘impact’ into scholars’ narratives of ‘academic identity’. Hence, the paper presents new practices of ‘subjectivation’ as the fourth stage of incorporation of ‘impact’ into academic discourse. The four stages of the introduction of ‘impact’ into academic discourse are mutually interlinked—each step paves the way for the next.

Of the four stages described, only stage three focuses on a classical linguistic task: the description of a new genre of text. The remaining three take a broader view informed by sociology and philosophy, focusing on discursive practices, i.e., language used in social context. Other descriptions of the emergence of impact are possible—note for instance Power’s four-fold structure (Power, 2015), at points analogous to this study.

Theoretical framework and data

This study builds on a constructivist approach to social phenomena in assuming that language plays a crucial role in establishing and maintaining social practice. In this approach ‘discourse’ is understood as the production of social meaning—or the negotiation of social, political or cultural order—through the means of text and talk (Fairclough, 1989 , 1992 ; Fairclough et al., 1997 ; Gee, 2015 ).

Linguistic pragmatics and Foucauldian approaches to discourse are used to account for the changes related to the rise of ‘impact’ as an element of evaluation and discourse at the macro and micro scales. In looking at the micro scale of everyday linguistic practices, the analysis makes use of linguistic pragmatics, in particular concepts of positioning (Davies and Harré, 1990), stage (Goffman, 1969; Robinson, 2013), and metaphor (Cameron et al., 2009; Musolff, 2004, 2012), as well as genre analysis (Swales, 1990, 2011). In analyzing the macro scale, i.e., the establishment of the concept of ‘impact’ in policy and the creation of an impact infrastructure, it draws on selected concepts of Foucauldian governmentality theory (crucially ‘problematisation’, ‘apparatus’, ‘subjectivation’) (Foucault, 1980, 1988, 1990; Rose, 1999, pp. ix–xiii).

While the toolbox of linguistic pragmatics is particularly useful in analyzing linguistic aspects of the datasets, Foucault’s governmentality framework helps bring together findings from the two datasets in a broader analysis, allowing for more general conclusions on the practices of governing (academic) subjects within evaluation frameworks. Both pragmatic and Foucauldian traditions of discourse analysis have been productively applied in the study of higher education contexts (e.g., Fairclough, 1993; Gilbert and Mulkay, 1984; Hyland, 2009; Myers, 1985, 1989; for an overview see Wróblewska and Angermuller, 2017).

The analysis builds on an admittedly heterogeneous set of concepts, hailing from different traditions and disciplines. This approach allows for a suitably nuanced description of a broad phenomenon—the discourse of impact—studied here on the basis of two different datasets. To facilitate following the argument, individual theoretical and methodological concepts are defined where they are applied in the analysis.

The studied corpus consists of two datasets: a written and oral one. The written corpus includes 78 impact case studies (CSs) submitted to REF2014 in the discipline of linguistics Footnote 1 . Linguistics was selected as a discipline straddling the social sciences and humanities (SSH). SSH are arguably most challenged by the practice of impact evaluation as they have traditionally resisted subjection to economization and social accountability (Benneworth et al., 2016 ; Bulaitis, 2017 ).

The CSs were downloaded in PDF form from REF’s website: https://www.ref.ac.uk/2014/. The documents have an identical structure, featuring basic information (name of institution, unit of assessment, title of CS) and core content divided into five sections: (1) summary of impact, (2) underpinning research, (3) references to the research, (4) details of impact, and (5) sources to corroborate impact. Each CS is about 4 pages long (~2400 words). The written dataset (with a word count of 173,474) was analyzed qualitatively using MAXQDA software, with a focus on the generic aspects of the documents.

The oral dataset is composed of semi-structured interviews with authors of the studied CSs (n = 20) and other actors involved in the evaluation, including two policy-makers and three academic administrators Footnote 2. In total, the 25 interviews, each around 60 min long, add up to around 25 h of recordings. The interviews were analyzed in two ways. Firstly, they were coded for themes and topics related to the evaluation process—this was useful for the description of the impact infrastructure presented in step 2 of the analysis. Secondly, they were considered as a linguistic performance and coded for discursive devices (irony, distancing, metaphor etc.)—this was the basis for findings related to the presentation of one’s ‘academic self’, which are the object of the fourth step of analysis. The written corpus allows for an analysis of the functioning of the notion of ‘impact’ in the official, administrative discourse of academia, looking at the emergence of an impact infrastructure and the genre created for the description of impact. The oral dataset in turn sheds light on how academics relate to the notion of impact in informal settings, by focusing on metaphors and pragmatic markers of stage.
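Purely as a hypothetical illustration (the author analyzed the corpus qualitatively in MAXQDA, not with a script), the sketch below shows how such a written corpus might be summarized computationally using the five-section CS structure described above; the exact section headings, file format, and folder name are assumptions.

```python
# Hypothetical sketch: summarize a corpus of REF impact case studies by section.
# Section headings follow the structure described in the text; file handling is an assumption.
import re
from pathlib import Path

SECTIONS = [
    "Summary of the impact",
    "Underpinning research",
    "References to the research",
    "Details of the impact",
    "Sources to corroborate the impact",
]

def split_sections(text: str) -> dict:
    """Split one case study's plain text into the five standard REF sections."""
    pattern = "(" + "|".join(re.escape(s) for s in SECTIONS) + ")"
    parts = re.split(pattern, text)
    sections, current = {}, None
    for chunk in parts:
        if chunk in SECTIONS:
            current = chunk
            sections[current] = ""
        elif current is not None:
            sections[current] += chunk
    return sections

def corpus_summary(folder: str) -> None:
    total_words = 0
    for path in Path(folder).glob("*.txt"):  # assumes case studies already converted to text
        sections = split_sections(path.read_text(encoding="utf-8"))
        words = sum(len(body.split()) for body in sections.values())
        total_words += words
        print(f"{path.name}: {words} words across {len(sections)} sections")
    print(f"Corpus total: {total_words} words")

# corpus_summary("ref2014_linguistics_case_studies/")  # ~78 files, ~173,000 words in the study
```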

The discourse of impact

Problematization of impact

The introduction of ‘impact’, a new element of evaluation accounting for 20% of the final result, was seen as a surprise and as a significant change with respect to the previous model of evaluation—the Research Assessment Exercise (Warner, 2015). The outline of an approach to impact evaluation in REF was developed on the government’s recommendation after a review of international practice in impact assessment (Grant et al., 2009). The adopted approach was inspired by the previously created (but never implemented) Australian RQF framework (Donovan, 2008). A pilot evaluation exercise run in 2010 confirmed the viability of the case-study approach to impact evaluation. In July 2011 the Higher Education Funding Council for England (HEFCE) published guidelines regulating the new assessment (HEFCE, 2011). The deadline for submissions was set for November 2013.

In the period between July 2011 and November 2013 HEFCE engaged in broad communication and training activities across universities, with the aim of explaining the concept of ‘impact’ and the rules which would govern its evaluation (Power, 2015 , pp. 43–48). Knowledge on the new element of evaluation was articulated and passed down to particular departments, academic administrative staff and individual researchers in a trickle-down process, as explained by a HEFCE policymaker in an account of the run-up to REF2014:

There was no master blue print! There were some ideas, which indeed largely came to pass. But in order to understand where we [HEFCE] might be doing things that were unhelpful and might have adverse outcomes, we had to listen. I was in way over one hundred meetings and talked to thousands of people! (…) [The Impact Agenda] is something that we are doing to universities. Actually, what we wanted to say is: ‘we are doing it with you, you’ve Footnote 3 got to own it’.
Int20, policymaker, example 1 Footnote 4

Due to the importance attributed to the exercise by managers of academic units and the relatively short time for preparing submissions, institutions were responsive to the policy developments. In fact, they actively contributed to the establishment and refinement of concepts related to impact. Institutional learning occurred to a large degree contemporarily to the consolidation of the policy and the refinement of the concepts and definitions related to impact. The initially open, undefined nature of ‘impact’ (“there was no master blue-print”) is described also in accounts of academics who participated in the many rounds of meetings and consultations. See example 2 below:

At that time, they [HEFCE] had not yet come up with this definition [of impact], not yet pinned it down, but they were trying to give an idea of what it was, to get feedback, to get a grip on it. (…) And we realised (…) they didn’t have any more of an idea of this than we did! It was almost like a fishing expedition. (…) I got a sense very early on of, you know, groping.
Int1, academic, example 2

The “pinning down” of an initially fuzzy concept and defining the rules which would come to govern its evaluation was just one aim of the process. The other one was to engage academics and affirm their active role in the policy-making. From an idea which came from outside of the British academic community (from the the government, the research councils) and originally from outside the UK (the Australian RQF exercise), a concept which was imposed on academics (“it is something that we are doing to universities”) the Impact Agenda was to become an accepted, embedded element of the academic life (“you’ve got to own it”). In this sense, the laboriousness of the process, both for the policy-makers and the academics involved, was a necessary price to be paid for the feeling of “ownership” among the academic community. Attitudes of academics, initially quite negative (Chubb et al., 2016 , Watermeyer, 2016 ), changed progressively, as the concept of impact became familiarized and adapted to the pre-existing realities of academic life, as recounted by many of the interviewees, e.g.,:

I think the resentment died down relatively quickly. There was still some resistance. And that was partly academics recognising that they had to [take part in the exercise], they couldn’t ignore it. Partly, the government and the research council has been willing to tweak, amend and qualify the initial very hard-edged guidelines and adapt them for the humanities. So, it was two-way process, a dialogue.
Int16, academic, example 3

The announcement of the final REF regulations (HEFCE, 2011) was the climax of the long process of making ‘impact’ into a thinkable and manageable entity. The last iteration of the regulations constituted a co-creation of various actors (initial Australian policymakers of the RQF, HEFCE employees, academics, impact professionals, universities, professional organizations) who had contributed to it at different stages (in many rounds of consultations, workshops, talks and sessions across the country). ‘Impact’ as a notion was ‘talked into being’ in a polyphonic process (Angermuller, 2014a, 2014b) of debate, critique, consultation (“listening”, “getting feedback”) and adaptation (“tweaking”, “changing”, “amending hard-edged guidelines”), also in view of the pre-existing conditions of academia such as the friction between the ‘soft’ and ‘hard’ sciences (as mentioned in example 3). In effect, impact was constituted as an object of thought, and an area of academic activity began to emerge around it.

The period of defining ‘impact’ as a new, important notion in academic discourse in the UK, roughly between July 2011 and November 2013, can be conceptualized in terms of the Foucauldian notion of ‘problematization’. This concept describes how spaces, areas of activity, persons, behaviors or practices become targeted by government, separated from others, and cast as ‘problems’ to be addressed with a set of techniques and regulations. ‘Problematisation’ is the moment when a notion “enters into the play of true and false, (…) is constituted as an object of thought (whether in the form of moral reflection, scientific knowledge, political analysis, etc.)” (Foucault, 1988, p. 257), when it “enters into the field of meaning” (Foucault, 1984, pp. 84–86). The problematization of an area triggers the establishment not only of new notions and objects but also of new practices and institutions. In consequence, the areas in question become subject to new forms of (political, administrative, financial) domination. This eventually shapes the way in which social subjects conceive of their world and of themselves. But a ‘problematisation’, however influential, cannot persist on its own. It requires an overarching structure in the form of an ‘apparatus’ which will consolidate and perpetuate it.

Impact infrastructure

Soon after the publication of the evaluation guidelines for REF2014, and still during the phase of ‘problematisation’ of impact, universities started collecting data on ‘impactful’ research conducted in their departments and recruiting authors of potential CSs which could be submitted for evaluation. The winding and iterative nature of the process of problematization of ‘impact’ made it difficult for research managers and researchers to keep track of the emerging knowledge around impact (official HEFCE documentation, results of the pilot evaluation, FAQs, workshops and sessions organized around the country, writings published in print and online). At the stage of collecting drafts of CSs it was still unclear what would ‘count’ as impact and what evidence would be required. Hence, there emerged a need for specific procedures and specialized staff to prepare the REF submissions.

At most institutions, specific posts were created for employees preparing impact submissions for REF2014. These were both secondment positions, such as ‘impact lead’ or ‘impact champion’, and full-time ones, such as impact officer or impact manager. These professionals soon started organizing among themselves at meetings and workshops. Administrative units focused on impact (such as centers for impact and engagement, offices for impact and innovation) were created at many institutions. A body of knowledge on impact evaluation was soon consolidated, along with a specific vocabulary (‘a REF-able piece of research’, ‘pathways to impact’, ‘REF-readiness’ etc.) and sets of resources. Impact evaluation gave rise to a new type of specialized university employee, who in turn contributed to turning the ‘generation of impact’, as well as the collection and presentation of related data, into a veritable field of professional expertise.

In order to ensure timely delivery of CSs to REF2014, institutions established fixed procedures related to the new practice of impact evaluation (periodic monitoring of impact, reporting on impact-related activities), frames (schedules, document templates), forms of knowledge transfer (workshops on impact generation or on writing in the CS genre), data systems and repositories for logging and storing impact-related data, and finally awards and grants for those with achievements (or potential) related to impact. Consultancy companies started offering commercial services focused on research impact, catering to universities and university departments but also to governments and research councils outside the UK looking at solutions for impact evaluation. There is even an online portal with a specific focus on showcasing researchers’ impact (Impact Story).

In consequence, impact became institutionalized as yet another “box to be ticked” on the list of academic achievements, another component of “academic excellence”. Alongside the burdens connected to reporting on impact and following regulations in the area, there came also rewards. The rise of impact as a new (or newly-problematised) area of academic life opened up uncharted areas to be explored and opportunities for those who wished to prove themselves. These included jobs for those who had acquired (or could claim) expertise in the area of impact (Donovan, 2017, p. 3) and research avenues for those studying higher education and evaluation (after all, entirely new evaluation practices rarely emerge, as stressed by Power, 2015, p. 43). While much writing on the Impact Agenda highlights negative attitudes towards the exercise (Chubb et al., 2016; Sayer, 2015), equally worth noting are the opportunities that the establishment of a new element of the exercise opened. It is the energy of all those who engage with the concept (even in a critical way) that contributes to making it visible, real and robust.

The establishment of a specialized vocabulary, of formalized requirements and procedures, the creation of dedicated impact-related positions and departments, etc. contribute to the establishment of what can be described as an ‘impact infrastructure’ (comp. Power, 2015, p. 50) or, in terms of Foucauldian governmentality theory, as an ‘apparatus’ Footnote 5. In Foucault’s terminology, ‘apparatus’ refers to a formation which encompasses the entirety of organizing practices (rituals, mechanisms, technologies) but also assumptions, expectations and values. It is the system of relations established between discursive and non-discursive elements as diverse as “institutions, architectural forms, regulatory decisions, laws, administrative measures, scientific statements, philosophical, moral and philanthropic propositions” (Foucault, 1980, p. 194). An apparatus serves a specific strategic function—responding to an urgent need which arises at a concrete time in history—for instance, regulating the behavior of a population.

There is a crucial discursive dimension to all the elements of the ‘impact apparatus’. While the creation of organizational units and jobs, the establishment of procedures and regulations, and participation in meetings and workshops are no doubt ‘hard facts’ of academic life, they are nevertheless brought about and made real in discursive acts of naming, defining, delimiting and evaluating. The aim of the apparatus was to support the newly-established problematization of impact. It did so by operating on many levels: first of all, and most visibly, newly-established procedures enabled a timely and organized submission to the upcoming REF. Secondly, the apparatus guided the behavior of social actors. It did so not only through directive methods (enforcing impact-related requirements) but also through nurturing attitudes and dispositions which are necessary for the notion of impact to take root in academia (for instance via impact training delivered to early-career scholars).

Interviewed actors involved in implementing the policy in institutions recognized their role in orchestrating collective learning. An interviewed impact officer stated:

My feeling is that ultimately my post should not exist. In ten or fifteen years’ time, impact officers should have embedded the message [about impact] firmly enough that they [researchers] don’t need us anymore.
Int7, impact officer, example 4

A similar vision was evoked by a HEFCE policymaker who was asked if the notion of impact had become embedded in academic institutions:

I hope [after the next edition of REF] we will be able to say that it has become embedded. I think the question then will be “have we done enough in terms of case studies? Do we need something very much lighter-touch?” “Do we need anything at all?”—that’s a question. (…) If [impact] is embedded you don’t need to talk about it.
Int20, policy-maker, example 5

Rather than being an aim in itself, the Impact Agenda is a means of altering academic culture so that institutions and individual researchers become more mindful of the societal impacts of their research. The instillment of a “new impact culture” (see Manville et al., 2014 , pp. 24–29) would ensure that academic subjects consider the question of ‘impact’ even outside of the framework of REF. The “culture shift” is to occur not just within institutions but ultimately within the subjects—it is in them that the notion of ‘impact’ has to become embedded. Hence, the final purpose of the apparatus would be to obscure the origins of the notion of ‘impact’ and the related practices, neutralizing the notion itself, and giving a guise of necessity to an evaluative reality which in fact is new and contingent.

The genre of impact case study as element of infrastructure

In this section two questions are addressed: (1) what are the features of the genre (or what is it like?) and (2) what are the functions of the genre (or what does it do? what vision of research does it instil?). In addressing the first question, I look at narrative patterns, as well as lexical and grammatical features of the genre. This part of the study draws on classical genre analysis (Bhatia, 1993; Swales, 1998) Footnote 6. The second question builds on the recognition, present in discourse studies since the 1970s, that genres are not merely classes of texts with similar properties, but also veritable ‘dispositives of communication’. A genre is a means of articulation of legitimate speech; it does not just represent facts or reflect ideologies, it also acts on and alters the context in which it operates (Maingueneau, 2010, pp. 6–7). This awareness has engendered broader sociological approaches to genre which include their pragmatic functioning in institutional realities (Swales, 1998).

The genre of CS differs from other academic genres in that it did not emerge organically, but was established with a set of guidelines and a document template at a precise moment in time. The genre is partly reproductive, as it recycles existing patterns of academic texts, such as the journal article, grant application and annual review, as well as case study templates applied elsewhere. The studied corpus is strikingly uniform, testifying to an established command of the genre amongst submitting authors. Identical expressions are used to describe impact across the corpus. Only very rarely is non-standard vocabulary used (e.g., “horizontal” and “vertical” impact rather than “reach” and “significance” of impact). This coherence can be contrasted with a much more diversified corpus of impact CSs submitted in Norway to an analogous exercise (Wróblewska, 2019). The rapid consolidation of the genre in British academia can be attributed to the perceived importance of the impact evaluation exercise, which led to the establishment of an impact infrastructure, with dedicated employees tasked with instilling the ‘culture of impact’.

By its nature, the CS is a performative, persuasive genre—its purpose is to convince the ‘ideal readers’ (the evaluators) of the quality of the underpinning research and the ‘breadth and significance’ of the described impact. The main characteristics of the genre stem directly from its persuasive aim. These are discussed below in terms of narrative patterns, and grammatical and lexical features.

Narrative patterns

On the level of narrative, there is an observable reliance on a generic pattern of story-telling frequent in fiction genres, such as myths or legends, namely the Situation-Problem-Response-Evaluation (SPRE) structure (also known as the Problem-Solution pattern, see Hoey, 1994, 2001, pp. 123–124). A well-known example of the pattern: a mountain ruled by a dragon (situation) which threatens the neighboring town (problem) is besieged by a group of heroes (response), leading to a happy ending or a new adventure (evaluation). Compare this to an example of the SPRE pattern in a sample impact narrative from the studied corpus:

Mosetén is an endangered language spoken by approximately 800 indigenous people (…) (SITUATION). Many Mosetén children only learn the majority language, Spanish (PROBLEM). Research at [University] has resulted in the development of language materials for the Mosetenes. (…) (RESPONSE). It has therefore had a direct influence in avoiding linguistic and cultural loss. (EVALUATION).
CS40828 Footnote 7

The SPRE pattern is complemented by patterns of Further Impact and Further Corroboration. The first allows the narrative to be elaborated, e.g., by showing additional (positive) outcomes, so that the impact is presented not as an isolated event but as the beginning of a series of collaborations:

The research was published in [outlet] (…). This led to an invitation from the United Nations Environment Programme for [researcher](FURTHER IMPACT).

Patterns of ‘further impact’ are often built around linking words, such as: “X led to” (n = 78) Footnote 8, “as a result” (n = 31), “leading to” (n = 24), “resulting in” (n = 13), “followed” (“X followed Y”, n = 14). Figure 1 below shows a ‘word tree’ for a frequent linking structure, “led to”. The size of the terms in the diagram represents their frequency in the corpus. Reading the word tree from left to right enables following typical sentence structures built around the ‘led to’ phrase: research led to an impact (fundamental change/development/establishment/production of…); impact “led to” further impact.

Figure 1: Word tree for the string ‘led to’, prepared with MaxQDA software. It visualizes frequent sentence structures in which research led to impact (fundamental change/development/establishment/production of…) or in which impact “led to” further impact.
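The word-tree view can be approximated outside MaxQDA with very little code. The sketch below is a minimal, hypothetical illustration (not the workflow used in this study): it collects the words that follow the phrase “led to” in a folder of plain-text case studies and counts the most frequent continuations; the folder name and file format are assumptions.

```python
# Minimal sketch (not the study's MaxQDA workflow) of approximating a 'led to'
# word tree: gather the words that follow the phrase in each case study text
# and count the most frequent continuations. The folder "case_studies" and the
# plain-text format are assumptions for illustration only.
import re
from collections import Counter
from pathlib import Path

PHRASE = "led to"
WIDTH = 3  # number of words of right context to keep per match


def right_contexts(text, phrase, width):
    """Return up to `width` words following each occurrence of `phrase`."""
    pattern = re.compile(
        re.escape(phrase) + r"\s+((?:[\w'-]+\s*){1," + str(width) + r"})",
        re.IGNORECASE,
    )
    return [" ".join(m.group(1).split()) for m in pattern.finditer(text)]


counts = Counter()
for path in Path("case_studies").glob("*.txt"):  # hypothetical corpus folder
    counts.update(right_contexts(path.read_text(encoding="utf-8"), PHRASE, WIDTH))

# The most frequent continuations correspond to the main branches of the word tree
for continuation, n in counts.most_common(20):
    print(f"led to {continuation}  (n = {n})")
```

Sorting the continuations by frequency reproduces the main branches of Figure 1, though the exact counts would depend on tokenization choices.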

The ‘Further Corroboration’ pattern provides additional information which strengthens the previously provided corroborative material:

(T)he book has been used on the (…) course rated outstanding by Ofsted, at the University [Name](FURTHER CORROBORATION).

Grammatical and lexical features

Both on a grammatical and lexical level, there is a visible focus on numbers and size. In making the point on the breadth and significance of impact, CS authors frequently showcase (high) numbers related to the research audience (numbers of copies sold, audience sizes, downloads but also, increasingly, tweets, likes, Facebook friends and followers). Adjectives used in the CSs appear frequently in the superlative or with modifiers which intensify them: “Professor [name] undertook a major Footnote 9 ESRC funded project”; “[the database] now hosts one of the world’s largest and richest collections (…) of corpora”; “work which meets the highest standards of international lexicographical practice”; “this experience (…) is extremely empowering for local communities”, “Reach: Worldwide and huge”.

Use of ‘positive words’ constitutes part of the same phenomenon. These appear often in the main narrative on research and impact, and even more frequently in quoted testimonials. Research is described in the CSs as being new, unique and important with the use of words such as “innovative” (n = 29), “influential” (n = 16), “outstanding” (n = 12), “novel” (n = 10), “excellent” (n = 8), “ground-breaking” (n = 7), “tremendous” (n = 4), “path-breaking” (n = 2), etc. The same qualities are also rendered descriptively, with the use of words that can be qualified as boosters, e.g., “[the research] has enabled a complete rethink of the relationship between [areas]”; “vitally important [research]”.

Novelty of research is also frequently highlighted with the adjective “first” appearing in the corpus 70 times Footnote 10 . While in itself “first” is not positive or negative, it carries a big charge in the academic world where primacy of discovery is key. Authors often boast about having for the first time produced a type of research—“this was the first handbook of discourse studies written”…, studied a particular area—“This is the first text-oriented discourse analytic study”…, compiled a type of data—“[We] provid[ed] for the first time reliable nationwide data”; “[the] project created the first on-line database of…”, or proven a thesis: “this research was the first to show that”…

Another striking lexical characteristic of the CSs is the presence of fixed expressions in the narrative on research impact. I refer to these as ‘impact speak’. There are several collocations with ‘impact’, the most frequent being “impact on” (n = 103) followed by the ‘type’ of impact achieved (impact on knowledge), area/topic (impact on curricula) or audience (Impact on Professional Interpreters). This collocation often includes qualifiers of impact such as “significant”, “wide”, “primary”, “secondary”, “broader”, “key”, and boosters: great, positive, wide, notable, substantial, worldwide, major, fundamental, immense, etc. Impact featured in the corpus also as a transitive verb (n = 22) in the forms “impacted” and “impacting”—e.g., “[research] has (…) impacted on public values and discourse”. This is interesting, as use of ‘impact’ as a verb is still often considered colloquial. Verb collocations with ‘impact’ are connected to achieving influence (“lead to…”, “maximize…”, “deliver impact”) and proving the existence and quality of impact (“to claim”, “to corroborate” impact, “to vouch for” impact, “to confirm” impact, to “give evidence” for impact). Another salient collocation is “pathways to impact” (n = 14), an expression describing channels of interacting with the public, in the corpus occasionally shortened to just “pathways”, e.g., “The pathways have been primarily via consultancy”. This phrase has most likely made its way into the genre of CS from the Research Councils UK ‘Pathways to Impact’ format introduced as part of grant applications in 2009 (discontinued in early 2020).
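The collocation figures reported above rest on the same kind of simple pattern counting. The following sketch is again a hypothetical illustration rather than the study’s MaxQDA procedure: it tallies a few ‘impact speak’ patterns (“impact on”, “pathways to impact”, verbal uses of “impact”) with regular expressions. The two example strings are invented stand-ins, not quotations from the corpus.

```python
# Hedged sketch of counting 'impact speak' collocations; not the study's
# MaxQDA procedure. The `documents` list holds invented toy strings,
# not actual REF case-study text.
import re
from collections import Counter

documents = [
    "The project had a significant impact on curricula and impacted on policy.",
    "The main pathways to impact have been consultancy and public workshops.",
]

patterns = {
    "impact on": r"\bimpact\s+on\b",
    "pathways to impact": r"\bpathways?\s+to\s+impact\b",
    "impact as verb": r"\bimpact(?:ed|ing)\b",
}

counts = Counter()
for text in documents:
    for label, regex in patterns.items():
        counts[label] += len(re.findall(regex, text, flags=re.IGNORECASE))

for label, n in counts.most_common():
    print(f"{label}: n = {n}")
```

Extending the `patterns` dictionary with the qualifiers and boosters listed above would yield the corresponding qualifier counts in the same way.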

On a syntactic level, CSs are rich in parallel constructions of enumeration, for instance: “(t)ranslators, lawyers, schools, colleges and the wider public of Welsh speakers are among (…) users [of research]”; “the research has benefited a broad, international user base including endangered language speakers and community members, language activists, poets and others”; [the users of the research come] “from various countries including India, Turkey, China, South Korea, Venezuela, Uzbekistan, and Japan”. Listing, alongside providing figures, is one of the standard ways of signaling the breadth and significance of impact. Both lists and superlatives support the persuasive function of the genre. In terms of verbal forms, passive verbs are clearly favored and personal pronouns (“I, we”) are avoided: “research was conducted”, “advice was provided”, “contracts were undertaken”.

Vision of research promoted by the genre of CS

The impact CS is a new, influential genre which affects its academic context by celebrating and inviting a particular vision of successful research and impact. It sets a standard for capturing and describing a newly-problematized academic object. This standard will be a point of reference for future authors of CSs. Hence, it is worth taking a look at the vision of research it instills.

The SPRE pattern used in the studied CSs favors a vision of research that is linear: work proceeds from research question to results without interference. The Situation and Problem elements are underplayed in favor of elaborate descriptions of the researchers’ ‘Responses’ (research and outreach/impact activities) and flattering ‘Evaluations’ (descriptions of effects of the research and data supporting these claims). Most narratives are devoid of challenges (the ‘Problem’ element is underplayed, possible drawbacks and failures in the research process are mentioned sporadically). Furthermore, narratives are clearly goal-oriented: impact is shown as included in the research design from the beginning (e.g., impact is frequently mentioned already in section 2, ‘Underpinning research’, rather than in the later section ‘Details of the impact’). Elements of chance, luck and serendipity in the research process are erased—this is reinforced by the presence of the patterns of ‘Further Impact’ and ‘Further Corroboration’. As such, the bulk of studied CSs channel a vision of what is referred to in Science Studies as ‘normal’ (deterministic, linear) science (Kuhn, 1970, pp. 10–42). From a purely literary perspective this makes for rather dull narratives: “fairy-tales of researcher-heroes… but with no dragons to be slain” (Selby, 2016).

The few CSs which do discuss obstacles in the research process or in securing impact stand out as strikingly different from the rest of the corpus. Paradoxically, while apparently ‘weakening’ the argumentation, they render it more engaging and convincing. This effect has also been observed in an analogous corpus of Norwegian CSs, which tend to problematize the pathway from research to impact to a much higher degree (Wróblewska, 2019, pp. 34–35).

The lexical and grammatical features of the CSs—the proliferation of ‘positive words’, including superlatives, and the adjective “first”—contribute to an idealization of the research process. The documents channel a vision of academia where there is no place for simply ‘good’ research—all CSs seem based on ‘excellent’ and ‘ground-breaking’ projects. The quality of research underpinning impact is recognized in CSs in a straightforward, simplistic way (citation counts, peer-reviewed papers, publications in top journals, submission to REF), which contributes to normalizing the view of research quality as easily measurable. Similarly, testimonials related to impact are not all equal. Sources of corroboration cited in CSs were carefully selected to appear prestigious and trustworthy. Testimonials and statements from high-ranking officials (but also ‘celebrities’ such as famous intellectuals or political leaders) were particularly sought-after. The end effect reinforces a solidified vision of a hierarchy of worth and trustworthiness in academia.

The prevalence of impersonal verbal forms suggests a de-personalized vision of the research process (“work was conducted”, “papers were published”, “evidence was given…”), where individual factors such as personal aspirations, constraints or ambitions are effaced. The importance given to numbers contributes to a strengthening of a ‘quantifiable’ idea of impact. This is in line with a trend observed in academic writing in general – the inflation of ‘positive words’ (boosters and superlatives) (Vinkers et al., 2015). This tendency is amplified in the genre of CS, particularly in its British iteration. In the Norwegian corpus, claims to excellence of research and breadth and significance of impact were significantly more modest (Wróblewska, 2019, pp. 28–30).

The genre of impact CS is a core binding component of the impact infrastructure: all the remaining elements of this formation are connected by a common aim – the generation of CSs. While the CS genre, together with the encompassing impact infrastructure, is vested with a seductive/coercive force, the subjects whose work it represents and who produce it take different positions in the face of it.

Academics’ positioning towards the Impact Agenda

Academics position themselves towards the concept of impact in many explicit and implicit ways. ‘Positioning’ is understood here as performance-based claims to identity and subjectivity (Davies and Harré, 1990; Harré and Van Langenhove, 1998). Rejecting the idea of stable “inherent” identities, positioning theorists stress how different roles are invoked and enacted in a continuous game of positioning (oneself) and being positioned (by others). Positioning in academic contexts may take the form of indexing identities such as “professor”, “linguist”, “research manager”, “SSH scholar”, “intellectual”, “maverick” etc. (Angermuller, 2013; Baert, 2012; Hamann, 2016; Hah, 2019, 2020). Many daily interactions which do not include explicit identity claims also involve subject positioning, as they carry value judgments, thereby evoking counter-statements and colliding social contexts (Tirado and Gálvez, 2008, pp. 32–45).

My analysis draws attention to the process of incorporating impact into academic subjectivities. I look firstly at the mechanics of academics’ positioning towards impact: the game of opposite discursive acts of distancing and endorsement. Academics reject the notion of ‘impact’ through ironizing, stage management and the use of metaphors. Conversely, they may actively incorporate impact into their presentation of the academic ‘self’. This discursive engagement with the notion of impact can be described as ‘subjectivation’, i.e., the process whereby subjects (re)establish themselves in relation to the grid of power/knowledge in which they function (in this case the emergent ‘impact infrastructure’).

The relatively high response rate of this study (~50%) and the visible eagerness of respondents to discuss the question of impact suggest an emotional response of academics to the topic of impact evaluation. Yet, respondents visibly struggled with the notion of ‘impact’, often distancing themselves from it through discursive devices, the most salient being ironizing, use of metaphors and stage management.

Ironizing the notion of impact

In many cases, before proceeding to explain their attitude to impact, interviewed academics elaborated on the notion of impact, explaining how the notion applied to their discipline or field and what it meant for them personally. This often meant rejecting the official definition of impact or redefining the concept. In excerpt 6, the interviewee picks up the notion:

Impact… I don’t even like the word! (…) It sounds [like] a very aggressive word, you know, impact, impact ! I don’t want to imp act ! What you want, and what has happened with [my research] really is… more of a dialogue.
Int21, academic, example 6

Another respondent brought up the notion of impact when discussing ethical challenges arising from public dissemination of research.

When you manage to go through that and navigate successfully, and keep producing research, to be honest, that’s impact for me.
Int9, academic, example 7

An analogous distinction was made by a third respondent who discussed the effect of his work on an area of professional activity. While, as he explained, this application of his research has been a source of personal satisfaction, he refused to describe his work in terms of ‘impact’. He stressed that the type of influence he aims for does not lend itself to producing a CS (is not ‘REF-able’):

That’s not impact in the way this government wants it! Cause I have no evidence. I just changed someone’s view. Is that impact? Yes, for me it is. But it is not impact as understood by the bloody REF.
Int3, academic, example 8

These are but three examples of many in the studied corpus where speakers take up the notion of impact to redefine or nuance it, often juxtaposing it with adjacent notions of public engagement, dissemination, outreach, social responsibility, activism etc. A previous section highlighted how the definition of impact was collectively constructed by a community in a process of problematization. The above-cited examples illustrate the reverse of this phenomenon—namely, how individual social actors actively relate to an existing notion in a process of denying, re-defining, and delimiting.

These opposite tendencies of narrowing down and again widening a definition are in line with the theory of the double role of descriptions in discourse. Definitions are both constructions and constructive—while they are effects of discourse, they can also become ‘building blocks’ for ideas, identities and attitudes (Potter, 1996, p. 99). By participating in impact-related workshops academics ‘reify’ the existing, official definition by enacting it within the impact infrastructure. Fragments cited above exemplify the opposite strategy of undermining the adequacy of the description or ‘ironizing’ the notion (ibid., p. 107). The tension between reifying and ironizing points to the winding, conflictual nature of the process of accepting and endorsing the new ‘culture of impact’. A recognition of the multiple meanings given to the notion of ‘impact’ by policy-makers, academic managers and scholars may caution us in relation to studies on attitudes towards impact which take the notion at face value.

Respondents nuanced the notion of impact also through the use of metaphors. In discourse analysis, metaphors are seen not just as stylistic devices but as vehicles for attitudes and values (Musolff, 2004, 2012). Many of the respondents made remarks on the ‘realness’ or ‘seriousness’ of the exercise, emphasizing its conventional, artificial nature. Interviewees admitted that claims made in the CSs tend to be exaggerated. At the same time, they stressed that this was in line with the convention of the genre, the nature of which was clear to authors and panelists alike. The practice of impact evaluation was frequently represented metaphorically as a game. See excerpt 9 below:

To be perfectly honest, I view the REF and all of this sort of regulatory mechanisms as something of a game that everybody has to play. The motivation [to submit to REF] was really: if they are going to make us jump through that hoop, we are clever enough to jump through any hoops that any politician can set.
Int14, academic, example 9

Regarding the relation of the narratives in the CSs to truth, see example 10:

[A CS] is creative stuff. Given that this is anonymous, I can say that it’s just creative fiction. I wouldn’t say we [authors of CSs] lie, because we don’t, but we kind of… spin. We try to show a reality which, by some stretch of imagination is there. (It’s) a truth. I’m not lying. Can it be shown in different ways? Yes, it can, and then it would be possibly less. But I choose, for obvious reasons, to say that my external funding is X million, which is a truth.
Int3, academic, example 10

The metaphors of “playing a game” and “jumping through hoops” suggest a competition which one does not enter voluntarily (“everybody has to play it”), while those of “creative fiction”, “spinning” and presenting “a truth” point to an element of power struggle over defining the rules of the game. Doing well in the exercise can mean outsmarting those who establish the framework (politicians) by “performing” particularly well. This can be achieved by eagerly fulfilling the requirements of the genre of CS while at the same time maintaining a position disengaged from the “regulatory mechanism” of the impact infrastructure.

Stage management

Academics’ positioning towards impact plays out also through the management of the ‘stage’ of discursive performance, often taking the form of frontstage and backstage markers (in the sense of Goffman’s dramaturgy, 1969, pp. 92–122). For instance, references to the confidential nature of the interview (see example 10 above) or the expression “to be perfectly honest” (example 9) are backstage markers. Most of the study’s participants have authored narratives about their work in the strict, formalized genre of CS, thereby performing on the Goffmanian ‘front stage’ for an audience composed of senior management, REF panelists and, ultimately, perhaps “politicians” or “the government”. However, when speaking in the ‘back stage’ context of an anonymous interview, many researchers actively reject the accuracy of the submitted CSs as representations of their work. Many express a nuanced, often critical, view on impact.

Respondents frequently differentiate between the way they perceive ‘impact’ on different ‘levels’, or from the viewpoint of their different ‘roles’ (scholar, research manager, citizen…). One academic can hold different (even contradictory) views on the assessment of impact. Someone who strongly criticizes the Impact Agenda as an administrative practice might be supportive of ‘impact’ on a personal level or vice versa. See the answer of a linguist asked whether ‘impact’ enters into play when he assesses the work of other academics:

When I look at other people’s work work as a linguist, I don’t worry about that stuff. (…) As an administrator, I think that linguistics, like many sciences, has neglected the public. (…) At some point, when we would be talking about promotion (…) I would want to take a look at the impact of their work. (…) And that would come into my thinking in different times.
Int13, academic, example 11

Interestingly, in the studied corpus there is no simple correlation between conducting research which easily ‘lends itself to impact’ and a positive overall attitude to impact evaluation.

Subjectivation

The most interesting data excerpts in this study are perhaps the ones where respondents wittingly or unwittingly expose their hesitations, uncertainties and struggles in positioning themselves towards the concept of impact. In theoretical terms, these can be interpreted as symptoms of an ongoing process of ‘subjectivation’.

‘Subjectivation’ is another concept rooted in Foucauldian governmentality theory. According to Foucault, individuals come to the ‘truth’ about their subjectivity by actively relating to a pre-existent set of codes, patterns, rules and rituals suggested by their culture or social group (Castellani, 1999, pp. 257–258; Foucault, 1988, p. 11). The term ‘subjectivation’ refers to the process in which individuals establish themselves in relation to the grid of power/knowledge in which they function. This includes actions subjects take on their performance, competences, attitudes, self-esteem, desires etc. in order to improve, regulate or reform themselves (Dean, 1999, p. 20; Lemke, 2002; Rose, 1999, p. xii).

Academics often distance themselves from the assessment exercise, as shown in previous sections. And yet, the data hints that taking part in the evaluation and engaging with the impact infrastructure was not without influence on the way they present their research, also in non-official, non-evaluative contexts, such as the research interview. This effect is visible in vocabulary choices—interviewees routinely spoke about ‘pathways to impact’, ‘impact generation’, ‘REF-ability’ etc. ‘Impact speak’ has made its way into everyday, casual academic conversations. Beyond changes to vocabulary, there is a deeper-running process—the discursive work of reframing one’s research in view of the evaluation exercise and in its terms. Many respondents seemed to adjust the presentation of their research, its focus and aims, when the topic of REF surfaced in the exchange. Interestingly, such shifts occurred even in the case of respondents who did not submit to the exercise, for instance because they were already retired, or because they refused to take part in it. For those who did submit CSs to REF, the re-framing of the narrative of their research in this new genre often had a tremendous effect.

Presented below is the example of a scholar who did not initially volunteer to submit a CS, and was reluctant to take part when she was encouraged by a supervisor. During the interview the respondent distanced herself from the exercise and the concept of impact through the discursive devices of ironizing, metaphors, stage management, and humor. The respondent was consistently critical towards impact in the course of the interview. Therefore the researcher expected a firm negative answer to the final question: “did the exercise affect your perception of your work?”. See excerpt 13 below for the respondent’s somewhat surprising answer.

Do you know what? It did, it did, it did. Almost a kind of a massive influence it had. Maybe this is the answer that you didn’t see coming ((laughing)). (…) It did [have an influence] but maybe from a different route as for people who were signed up for [the REF submission] from the outset. (…) When I saw this [CS narrative] being shaped up and people [who gave testimonies] I kind of thought: goodness me! And there were other moving things.
Int21, academic, example 13

Through the preparation of the CS and particularly through familiarizing herself with the underpinning testimonials, the respondent gained greater awareness of an area of practice which was influenced by her research. The interviewee’s attitude changed not only in the course of the evaluation exercise, but also—as if mirroring this process—during the interview. In both cases, elements which were up to that moment implicit (the response of end-users of the work, the researcher’s own emotional response to the exercise and to the written-up narrative of her impact) were made explicit. It is the process of recounting one’s story in a different framework, according to other norms and values (and in a different genre) that triggers the process of subjectivation. This example of a change of attitude in an initially reluctant subject demonstrates the difficulty in opposing the overwhelming force of the impact infrastructure, particularly in view of the (sometimes unexpected) rewards that it offers.

Many respondents found taking part in the REF submission—including the discursive work on the narrative of their research—an exhausting experience. In some cases however, the process of reshaping one’s academic identity triggered by the Agenda was a welcome development. Several interviewees claimed that the exercise valorized their extra-academic involvement which previously went unnoticed at their department. These scholars embraced the genre of CS as an opportunity to present their impact-related activities as an inherent part of their academic work. One academic stated:

At last, I can take my academic identity and my activist identity and roll them up into one.
Int11, academic, example 14

Existing studies have focused on situating academics’ attitudes towards the Impact Agenda on a positive-negative scale (e.g., Chubb et al., 2016), and on studying divergences depending on career stage, disciplinary affiliation etc. (Chikoore, 2016; Chikoore and Probets, 2016; Weinstein et al., 2019). My data shows that there are many dimensions to each academic’s view of impact. Scholars have complex (sometimes even contradictory) views on ‘impact’ and the discursive work of incorporating impact into a coherent academic ‘self’ is ongoing. While an often overwhelming ‘impact infrastructure’ looms over professional discursive positioning practices, academic subjects are by no means passive recipients of governmental new-managerial policies. On the contrary, they are agents actively involved in accepting, rejecting and negotiating them on a local level—both in front-stage and back-stage contexts.

Looking at the front stage, most CSs seem compliant in their eagerness to demonstrate impact in all its breadth and significance. The documents showcase large numbers and data once considered trivial in the academic context (Facebook likes, Twitter followers, endorsement by celebrities…) and faithfully follow the policy documents in adopting ‘impact speak’. Interviews with academics paint a different picture: the respondents may be playing according to the rules of the evaluation “game”, but they are playing consciously, often in an emotionally detached, distanced manner. Other scholars adjust to the regulations not in the name of compliance but in view of an alignment between the goals of the Agenda and their personal ones. Finally, some academics perceive the evaluation of impact as an opportunity to re-position themselves professionally or re-claim areas of activity which were long considered non-essential for an academic career, like public engagement, outreach and activism.

Concluding remarks

The initial, dynamic phases of the introduction of impact to British academia represent, in terms of Foucauldian theory, the phase of ‘emergence’. This notion draws attention to the moment when discursive concepts (‘impact’, ‘impact case study’…) surface and consolidate. It is in these terms that the previously non-regulated area of academic activity would thereafter be described, assessed and evaluated. New notions, definitions and procedures related to impact and the genre of CS will continue to circulate, emerging in other evaluation exercises, at other institutions, in other countries.

The stage of emergence is characterized by a struggle of forces, an often violent conflict between opposing ideas—“it is their eruption, the leap from the wings to centre stage” (Foucault, 1984, p. 84). The shape that an emergent idea will eventually take is the effect of clashes of these forces and does not fully depend on any of them. Importantly, emergence is merely “the entry of forces” (p. 84), and “not the final term of historical development” (p. 83). For Foucault, a concept, in its inception, is essentially an empty word, which addresses the needs of a field that is being problematized and satisfies the powers which target it. A problematization (of an object, practice, area of activity) is a response to particular desires or problems—these constitute an instigation, but do not determine the shape of the problematization. As Foucault urges, “to one single set of difficulties, several responses can be made” (2003, p. 24).

With the emergence of the Impact Agenda, an area of activity which has always existed (the collaboration of academics with the non-academic world) was targeted, delimited and described with new notions in a process of problematization. The notion of ‘impact’, together with the genre created for capturing it, became the core of an administrative machinery—the impact infrastructure. This was a new reality that academics had to quickly come to terms with, positioning themselves towards it in a process of subjectivation.

The run-up to REF2014 was a crucial and defining phase, but it was only the first stage of a longer process—the emergence of the concept of ‘impact’ and the establishment of the basic rules which would govern its generation, documentation and evaluation. Let us recall Foucault’s argument that “rules are empty in themselves, violent and unfinalized; they are impersonal and can be bent to any purpose. The successes of history belong to those who are capable of seizing these rules”… (pp. 85–86). The rules embodied in the REF guidelines, the new genre of CS and the principles of ‘impact speak’ were in the first instance still “empty and unfinalized”. It was up to those subject to the rules to fill them with meaning.

The data analyzed in this study shows that despite dealing with a new powerful problematization and functioning in the framework of a complex infrastructure, academics continue to be active and highly reflective subjects, who discursively negotiate key concepts of the impact infrastructure and their own position within it. It will be fascinating to study the emergence of analogous evaluation systems in other countries and institutions. ‘Impact infrastructure’ and ‘genre’ are two excellent starting points for an analysis of ensuing changes to academic realities and subjectivities.

Data availability

The interview data analyzed in this paper are not publicly available due to their confidential nature. They can be made available by the corresponding author in anonymised form on reasonable request. The cited case studies were sourced from the REF database ( https://www.ref.ac.uk/2014/ ) and may be consulted online. The coded dataset is considered part of the analysis (and hence protected by copyright), but may be made available on reasonable request.

Notes

Most of the studied documents—71 CSs—were submitted to the Unit of Assessment (UoA) 28, Linguistics and Modern Languages; the remaining seven were submitted to five different UoAs but fall under the field of linguistics.

Some interviewees were involved in REF in more than one role. ‘Authors’ of CSs authored the documents to differing degrees; some (n = 5) were also engaged in the evaluation process in managerial roles.

Words underlined in interview excerpts were stressed by the speaker.

When citing interview data I give numbers attributed to individual interviews in the corpus, type of interviewee, and number of cited example.

‘Apparatus’ is one of the existing translations of the French ‘dispositif’, another one is ‘historical construct’ (Sembou, 2015 , p. 38) or ‘grid of intelligibility’ (Dreyfus and Rabinow, 1983 , p. 121). The French original is also sometimes used in English texts. In this paper, I use ‘apparatus’ and ‘infrastructure’, as the notion of ‘infrastructure’ has already become current in referring to resources dedicated to impact generation at universities, both in scholarly literature (Power, 2015 ) and in managerial ‘impact speak’.

A full version of the analysis may be found in Wróblewska, 2018 .

CS numbers are those found in the REF impact case study base: https://impact.ref.ac.uk/casestudies/ . I only provide CS numbers for cited fragments of one sentence or longer; exact sources for cited phrases may be given on request or easily identified in the CS database.

The figures given for appearances of certain elements of the genre in the studied corpus are drawn from the computer-assisted qualitative analysis conducted with MaxQDA software. They serve to illustrate the relative frequency of particular elements for the reader, but since they are not the result of a rigorous corpus-analytical study of a larger body of CSs, no claim of statistical relevance is made.

Words underlined in CS excerpts are emphasized by the author of the analysis.

Number of occurrences of string ‘the first’ in the context of quality of research, excluding phrases like “the first workshop took place…” etc.

References

Angermuller J (2013) How to become an academic philosopher: academic discourse as a multileveled positioning practice. Sociol Hist 2:263–289

Angermuller J (2014a) Poststructuralist discourse analysis. subjectivity in enunciative pragmatics. Palgrave Macmillan, Houndmills/Basingstoke

Angermuller J (2014b) Subject positions in polyphonic discourse. In: Angermuller J, Maingueneau D, Wodak R (eds) The Discourse Studies Reader. Main currents in theory and analysis. John Benjamins Publishing Company, Amsterdam/Philadelphia, pp 176–186

Baert P (2012) Positioning theory and intellectual interventions. J Theory Soc Behav 42(3):304–324

Benneworth P, Gulbrandsen M, Hazelkorn E (2016) The impact and future of arts and humanities research. Palgrave Macmillan, London

Bhatia VK (1993) Analysing genre: language use in professional settings. Longman, London

Bulaitis Z (2017) Measuring impact in the humanities: Learning from accountability and economics in a contemporary history of cultural value. Pal Commun 3(7). https://doi.org/10.1057/s41599-017-0002-7

Cameron L, Maslen R, Todd Z, Maule J, Stratton P, Stanley N (2009) The discourse dynamics approach to metaphor and metaphor-led discourse analysis. Metaphor Symbol 24(2):63–89. https://doi.org/10.1080/10926480902830821

Castellani B (1999) Michel Foucault and symbolic interactionism: the making of a new theory of interaction. Stud Symbolic Interact 22:247–272

Chikoore L (2016) Perceptions, motivations and behaviours towards ‘research impact’: a cross-disciplinary perspective. Loughborough University. Loughborough University Institutional Repository. https://dspace.lboro.ac.uk/2134/22942 . Accessed 30 Dec 2020

Chikoore L, Probets S (2016) How are UK academics engaging the public with their research? a cross-disciplinary perspective. High Educ Q 70(2):145–169. https://doi.org/10.1111/hequ.12088

Chubb J, Watermeyer R, Wakeling P (2016) Fear and loathing in the Academy? The role of emotion in response to an impact agenda in the UK and Australia. High Educ Res Dev 36(3):555–568. https://doi.org/10.1080/07294360.2017.1288709

Chubb J, Watermeyer R (2017) Artifice or integrity in the marketization of research impact? Investigating the moral economy of (pathways to) impact statements within research funding proposals in the UK and Australia. Stud High Educ 42(12):2360–2372

Davies B, Harré R (1990) Positioning: the discursive production of selves. J Theory Soc Behav 20(1):43–63

Dean MM (1999) Governmentality: power and rule in modern society. SAGE Publications, Thousand Oaks, California

Derrick G (2018) The evaluators’ eye: Impact assessment and academic peer review. Palgrave Macmillan, London

Donovan C (2008) The Australian Research Quality Framework: A live experiment in capturing the social, economic, environmental, and cultural returns of publicly funded research. New Dir for Eval 118:47–60. https://doi.org/10.1002/ev.260

Donovan C (2011) State of the art in assessing research impact: introduction to a special issue. Res. Eval. 20(3):175–179. https://doi.org/10.3152/095820211X13118583635918

Donovan C (2017) For ethical ‘impactology’. J Responsible Innov 6(1):78–83. https://doi.org/10.1080/23299460.2017.1300756

Dreyfus HL, Rabinow P (1983) Michel Foucault: beyond structuralism and hermeneutics. University of Chicago Press, Chicago

European Science Foundation (2012) The Challenges of Impact Assessment. Working Group 2: Impact Assessment. ESF Archives. http://archives.esf.org/index.php?eID=tx_nawsecuredl&u=0&g=0&t=1609373495&hash=08da8bb115e95209bcea2af78de6e84c0052f3c8&file=/fileadmin/be_user/CEO_Unit/MO_FORA/MOFORUM_Eval_PFR__II_/Publications/WG2_new.pdf . Accessed 30 Dec 2020

Fairclough N (1989) Language and power. Longman, London/New York

Fairclough N (1992) Discourse and social change. Polity Press, Cambridge, UK/Cambridge

Fairclough N (1993) Critical discourse analysis and the marketization of public discourse: The Universities. Discourse Soc 4(2):133–168

Fairclough N, Mulderrig J, Wodak R (1997) Critical discourse analysis. In: Van Dijk TA (ed) Discourse studies: a multidisciplinary introduction. SAGE Publications Ltd, New York, pp. 258–284

Foucault M (1980) The confession of the flesh. In: Gordon C (ed) Power/knowledge: selected interviews and other writings 1972–1977. Vintage Books, New York

Foucault M (1984) Nietzsche, genealogy, history. In: Rabinow P (ed) The Foucault Reader. Pantheon Books, New York

Foucault M (1988) Politics, philosophy, culture: Interviews and other writings, 1977–1984. Routledge, New York

Foucault M (1990) The use of pleasure. The history of sexuality, vol. 2. Vintage Books, New York

Gee J (2015) Social linguistics and literacies ideology in discourses. Taylor and Francis, Florence

Gilbert GN, Mulkay M (1984) Opening Pandora’s Box: a sociological analysis of scientists’ discourse. Cambridge University Press, Cambridge

Goffman E (1969) The presentation of self in everyday life. Allen Lane The Penguin Press, London

Grant J, Brutscher PB, Kirk S, Butler L, Wooding S (2009) Capturing Research Impacts. A review of international practice. Rand Corporation. RAND Europe. https://www.rand.org/content/dam/rand/pubs/documented_briefings/2010/RAND_DB578.pdf . Accessed 30 Dec 2020

Gunn A, Mintrom M (2016) Higher education policy change in Europe: academic research funding and the impact agenda. Eur Educ 48(4):241–257. https://doi.org/10.1080/10564934.2016.1237703

Gunn A, Mintrom M (2018) Measuring research impact in Australia. Aust Universit Rev 60(1):9–15

Hah S (2019) Disciplinary positioning struggles: perspectives from early career academics. J Appl Linguist Prof Pract 12(2). https://doi.org/10.1558/jalpp.32820

Hah S (2020) Valuation discourses and disciplinary positioning struggles of academic researchers–a case study of ‘maverick’ academics. Pal Commun 6(1):1–11. https://doi.org/10.1057/s41599-020-0427-2

Hamann J (2016) “Let us salute one of our kind.” How academic obituaries consecrate research biographies. Poetics 56:1–14. https://doi.org/10.1016/j.poetic.2016.02.005

Harré R, Van Langenhove L (1998) Positioning theory: moral contexts of international action. Wiley-Blackwell, Chichester

HEFCE (2015) Research Excellence Framework 2014: Manager’s report. HEFCE. https://www.ref.ac.uk/2014/media/ref/content/pub/REF_managers_report.pdf . Accessed 30 Dec 2020

HEFCE (2011) Assessment framework and guidance on submissions. HEFCE: https://www.ref.ac.uk/2014/media/ref/content/pub/assessmentframeworkandguidanceonsubmissions/GOS%20including%20addendum.pdf . Accessed 30 Dec 2020

Hoey M (1994) Signalling in discourse: A functional analysis of a common discourse pattern in written and spoken English. In: Coulthard M (ed) Advances in written text analysis. Routledge, London

Hoey M (2001) Textual interaction: an introduction to written discourse analysis. Routledge, London

Hong Kong University Grants Committee (2018) Research Assessment Exercise 2020. Draft General Panel Guidelines. UGC. https://www.ugc.edu.hk/doc/eng/ugc/rae/2020/draft_gpg_feb18.pdf Accessed 30 Dec 2020

Hyland K (2009) Academic discourse English in a global context. Continuum, London

King’s College London and Digital Science (2015) The nature, scale and beneficiaries of research impact: an initial analysis of Research Excellence Framework (REF) 2014 impact case studies. Dera: http://dera.ioe.ac.uk/22540/1/Analysis_of_REF_impact.pdf . Accessed 30 Dec 2020

Kuhn TS (1970) The structure of scientific revolutions. University of Chicago Press, Chicago

Lemke T (2002) Foucault, governmentality, and critique. Rethink Marx 14(3):49–64

Maingueneau D (2010) Le discours politique et son « environnement ». Mots. Les langages du politique 94. https://doi.org/10.4000/mots.19868

Manville C, Jones MM, Frearson M, Castle-Clarke S, Henham ML, Gunashekar S, Grant J (2014) Preparing impact submissions for REF 2014: An evaluation. Findings and observations. RAND Corporation: https://www.rand.org/pubs/research_reports/RR726.html . Accessed 30 Dec 2020

Manville C, Guthrie S, Henham ML, Garrod B, Sousa S, Kirtley A, Castle-Clarke S, Ling T (2015) Assessing impact submissions for REF 2014: an evaluation. Rand Corporation. https://www.rand.org/content/dam/rand/pubs/research_reports/RR1000/RR1032/RAND_RR1032.pdf . Accessed 30 Dec 2020

Myers G (1985) Texts as knowledge claims: the social construction of two biology articles. Soc Stud Sci 15(4):593–630

Myers G (1989) The pragmatics of politeness in scientific articles. Appl Linguist 10(1):1–35

Musolff A (2004) Metaphor and political discourse. Analogical reasoning in debates about Europe. Palgrave Macmillan, Basingstoke

Musolff A (2012) The study of metaphor as part of critical discourse analysis. Crit. Discourse Stud. 9(3):301–310. https://doi.org/10.1080/17405904.2012.688300

National Co-ordinating Centre For Public Engagement (2014) After the REF-Taking Stock: summary of feedback. NCCPE. https://www.publicengagement.ac.uk/sites/default/files/publication/nccpe_after_the_ref_write_up_final.pdf . Accessed 30 Dec 2020

Potter J (1996) Representing reality: discourse, rhetoric and social construction. Sage, London

Power M (2015) How accounting begins: object formation and the accretion of infrastructure. Account Org Soc 47:43–55. https://doi.org/10.1016/j.aos.2015.10.005

Research Council of Norway (2017) Evaluation of the Humanities in Norway. Report from the Principal Evaluation Committee. The Research Council of Norway. Evaluation Division for Science. RCN. https://www.forskningsradet.no/siteassets/publikasjoner/1254027749230.pdf . Accessed 30 Dec 2020

Research Council of Norway (2018) Evaluation of the Social Sciences in Norway. Report from the Principal Evaluation Committee. The Research Council of Norway.Division for Science and the Research System RCN. https://www.forskningsradet.no/siteassets/publikasjoner/1254035773885.pdf Accessed 30 Dec 2020

Robinson D (2013) Introducing performative pragmatics. Routledge, London/New York

Rose N (1999) Governing the soul: the shaping of the private self. Free Association Books, Sidmouth

Sayer D (2015) Rank hypocrisies: the insult of the REF. Sage, Thousand Oaks

Selby J (2016) Critical IR and the Impact Agenda. Paper presented at the Pais Impact Conference, Warwick University, Coventry, 22–23 November 2016

Sembou E (2015) Hegel’s Phenomenology and Foucault’s Genealogy. Routledge, New York

Stern N (2016) Building on Success and Learning from Experience. an Independent Review of the Research Excellence Framework. Department for Business, Energy and Industrial Strategy. Assets Publishing Service. https://assets.publishing.service.gov.uk/government/uploads/system/uploads/attachment_data/file/541338/ind-16-9-ref-stern-review.pdf . Accessed 30 Dec 2020

Swales JM (1998) Other floors, other voices: a textography of a small university building. Routledge, London/New York

Swales JM (1990) Genre analysis: English in academic and research settings. Cambridge University Press, Cambridge

Swales JM (2011) Aspects of Article Introductions. University of Michigan Press, Ann Arbor

Tirado F, Gálvez A (2008) Positioning theory and discourse analysis: some tools for social interaction analysis. Historical Social Res 8(2):224–251

Vinkers CH, Tijdink JK, Otte WM (2015) Use of positive and negative words in scientific PubMed abstracts between 1974 and 2014: retrospective analysis. BMJ 351:h6467. https://doi.org/10.1136/bmj.h6467

Article   PubMed   PubMed Central   Google Scholar  

VSNU–Association of Universities in the Netherlands (2016) Standard Evaluation Protocol (SEP). Protocol for Research Assessments in the Netherlands. VSNU. https://vsnu.nl/files/documenten/Domeinen/Onderzoek/SEP2015-2021.pdf . Accessed 30 Dec 2020

Watermeyer R (2012) From engagement to impact? Articulating the public value of academic research. Tertiary Educ Manag 18(2):115–130. https://doi.org/10.1080/13583883.2011.641578

Watermeyer R (2014) Issues in the articulation of ‘impact’: the responses of UK academics to ‘impact’ as a new measure of research assessment. Stud High Educ 39(2):359–377. https://doi.org/10.1080/03075079.2012.709490

Watermeyer R (2016) Impact in the REF: issues and obstacles. Stud High Educ 41(2):199–214. https://doi.org/10.1080/03075079.2014.915303

Warner M (2015) Learning my lesson. London Rev Books 37(6):8–14

Weinstein N, Wilsdon J, Chubb J, Haddock G (2019) The Real-time REF review: a pilot study to examine the feasibility of a longitudinal evaluation of perceptions and attitudes towards REF 2021. SocArXiv: https://osf.io/preprints/socarxiv/78aqu/ . Accessed 30 Dec 2020

Wróblewska MN, Angermuller J (2017) Dyskurs akademicki jako praktyka społeczna. Zwrot dyskursywny i społeczne badania szkolnictwa wyższego. Kultura–Społeczeństwo–Edukacja 12(2):105–128. https://doi.org/10.14746/kse.2017.12.510.14746/kse.2017.12.5

Wróblewska MN (2017) Ewaluacja wpływu społecznego nauki. Przykład REF 2014 a kontekst polski. NaukaiSzkolnicwo Wyższe 49(1):79–104. https://doi.org/10.14746/nisw.2017.1.5

Wróblewska MN (2018) The making of the Impact Agenda. A study in discourse and governmnetality. Unpublished doctoral dissertation. Warwick University

Wróblewska MN (2019) Impact evaluation in Norway and in the UK: A comparative study, based on REF 2014 and Humeval 2015-2017. ENRESSH working paper series 1. University of Twente Research Information. https://ris.utwente.nl/ws/portalfiles/portal/102033214/ENRESSH_01_2019.pdf . Accessed 30 Dec 2020

Download references

Acknowledgements

I wish to thank Prof. Johannes Angermuller, the supervisor of the doctoral dissertation in which many of the ideas discussed in this paper were first presented. Prof. Angermuller’s guidance and support were essential for the development of my understanding of the importance of discourse in evaluative contexts. I also thank the reviewers of the aforementioned thesis, Prof. Jo Angouri and Prof. Srikant Sarangi, for their feedback, which helped me develop and clarify the concepts which I use in my analysis, as well as its presentation. Any errors or omissions are of course my own. The research presented in this paper received funding from the European Research Council (DISCONEX project 313172). The underpinning research was also facilitated by the author’s membership in the EU COST Action “European Network for Research Evaluation in the Social Sciences and the Humanities” (ENRESSH CA15137-E). Particular advice and encouragement received from the late Prof. Paul Benneworth was invaluable.

Author information

Authors and Affiliations

University of Warwick, Coventry, UK

Marta Natalia Wróblewska

National Centre for Research and Development–NCBR, Warsaw, Poland


Corresponding author

Correspondence to Marta Natalia Wróblewska.

Ethics declarations

Competing interests.

The author declares no competing interests.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/ .


About this article

Cite this article.

Wróblewska, M.N. Research impact evaluation and academic discourse. Humanit Soc Sci Commun 8, 58 (2021). https://doi.org/10.1057/s41599-021-00727-8


Received : 12 May 2020

Accepted : 11 January 2021

Published : 02 March 2021

DOI : https://doi.org/10.1057/s41599-021-00727-8


12.7 Evaluation: Effectiveness of Research Paper

Learning Outcomes

By the end of this section, you will be able to:

  • Identify common formats and design features for different kinds of texts.
  • Implement style and language consistent with argumentative research writing while maintaining your own voice.
  • Determine how genre conventions for structure, paragraphing, tone, and mechanics vary.

When drafting, you follow your strongest research interests and try to answer the question on which you have settled. However, sometimes what began as a paper about one thing becomes a paper about something else. Your peer review partner will have helped you identify any such issues and given you some insight regarding revision. Another strategy is to compare and contrast your draft with a grading rubric similar to the one your instructor will use. It is a good idea to consult this rubric frequently throughout the drafting process.

Each score level is described below in terms of three criteria: Critical Language Awareness, Clarity and Coherence, and Rhetorical Choices.

Score 5

  • Critical Language Awareness: The text always adheres to the “Editing Focus” of this chapter: integrating sources and quotations appropriately as discussed in Section 12.6. The text also shows ample evidence of the writer’s intent to consciously meet or challenge conventional expectations in rhetorically effective ways.
  • Clarity and Coherence: The writer’s position or claim on a debatable issue is stated clearly in the thesis and expertly supported with credible researched evidence. Ideas are clearly presented in well-developed paragraphs with clear topic sentences and relate directly to the thesis. Headings and subheadings clarify organization, and appropriate transitions link ideas.
  • Rhetorical Choices: The writer maintains an objective voice in a paper that reflects an admirable balance of source information, analysis, synthesis, and original thought. Quotations function appropriately as support and are thoughtfully edited to reveal their main points. The writer fully addresses counterclaims and is consistently aware of the audience in terms of language use and background information presented.

Score 4

  • Critical Language Awareness: The text usually adheres to the “Editing Focus” of this chapter: integrating sources and quotations appropriately as discussed in Section 12.6. The text also shows some evidence of the writer’s intent to consciously meet or challenge conventional expectations in rhetorically effective ways.
  • Clarity and Coherence: The writer’s position or claim on a debatable issue is stated clearly in the thesis and supported with credible researched evidence. Ideas are clearly presented in well-developed paragraphs with topic sentences and usually relate directly to the thesis. Some headings and subheadings clarify organization, and sufficient transitions link ideas.
  • Rhetorical Choices: The writer maintains an objective voice in a paper that reflects a balance of source information, analysis, synthesis, and original thought. Quotations usually function as support, and most are edited to reveal their main points. The writer usually addresses counterclaims and is aware of the audience in terms of language use and background information presented.

Score 3

  • Critical Language Awareness: The text generally adheres to the “Editing Focus” of this chapter: integrating sources and quotations appropriately as discussed in Section 12.6. The text also shows limited evidence of the writer’s intent to consciously meet or challenge conventional expectations in rhetorically effective ways.
  • Clarity and Coherence: The writer’s position or claim on a debatable issue is stated in the thesis and generally supported with some credible researched evidence. Ideas are presented in moderately developed paragraphs. Most, if not all, have topic sentences and relate to the thesis. Some headings and subheadings may clarify organization, but their use may be inconsistent, inappropriate, or insufficient. More transitions would improve coherence.
  • Rhetorical Choices: The writer generally maintains an objective voice in a paper that reflects some balance of source information, analysis, synthesis, and original thought, although imbalance may well be present. Quotations generally function as support, but some are not edited to reveal their main points. The writer may attempt to address counterclaims but may be inconsistent in awareness of the audience in terms of language use and background information presented.

Score 2

  • Critical Language Awareness: The text occasionally adheres to the “Editing Focus” of this chapter: integrating sources and quotations appropriately as discussed in Section 12.6. The text also shows emerging evidence of the writer’s intent to consciously meet or challenge conventional expectations in rhetorically effective ways.
  • Clarity and Coherence: The writer’s position or claim on a debatable issue is not clearly stated in the thesis, nor is it sufficiently supported with credible researched evidence. Some ideas are presented in paragraphs, but they are unrelated to the thesis. Some headings and subheadings may clarify organization, while others may not; transitions are either inappropriate or insufficient to link ideas.
  • Rhetorical Choices: The writer sometimes maintains an objective voice in a paper that lacks a balance of source information, analysis, synthesis, and original thought. Quotations usually do not function as support, often replacing the writer’s ideas, or are not edited to reveal their main points. Counterclaims are addressed haphazardly or ignored. The writer shows inconsistency in awareness of the audience in terms of language use and background information presented.

Score 1

  • Critical Language Awareness: The text does not adhere to the “Editing Focus” of this chapter: integrating sources and quotations appropriately as discussed in Section 12.6. The text also shows little to no evidence of the writer’s intent to consciously meet or challenge conventional expectations in rhetorically effective ways.
  • Clarity and Coherence: The writer’s position or claim on a debatable issue is neither clearly stated in the thesis nor sufficiently supported with credible researched evidence. Some ideas are presented in paragraphs. Few, if any, have topic sentences, and they barely relate to the thesis. Headings and subheadings are either missing or unhelpful as organizational tools. Transitions generally are missing or inappropriate.
  • Rhetorical Choices: The writer does not maintain an objective voice in a paper that reflects little to no balance of source information, analysis, synthesis, and original thought. Quotations may function as support, but most are not edited to reveal their main points. The writer may attempt to address counterclaims and may be inconsistent in awareness of the audience in terms of language use and background information presented.


Want to cite, share, or modify this book? This book uses the Creative Commons Attribution License and you must attribute OpenStax.

Access for free at https://openstax.org/books/writing-guide/pages/1-unit-introduction
  • Authors: Michelle Bachelor Robinson, Maria Jerskey, featuring Toby Fulwiler
  • Publisher/website: OpenStax
  • Book title: Writing Guide with Handbook
  • Publication date: Dec 21, 2021
  • Location: Houston, Texas
  • Book URL: https://openstax.org/books/writing-guide/pages/1-unit-introduction
  • Section URL: https://openstax.org/books/writing-guide/pages/12-7-evaluation-effectiveness-of-research-paper

© Dec 19, 2023 OpenStax. Textbook content produced by OpenStax is licensed under a Creative Commons Attribution License . The OpenStax name, OpenStax logo, OpenStax book covers, OpenStax CNX name, and OpenStax CNX logo are not subject to the Creative Commons license and may not be reproduced without the prior and express written consent of Rice University.


Research Paper: A step-by-step guide: 7. Evaluating Sources


Evaluation Criteria

It's very important to evaluate the materials you find to make sure they are appropriate for a research paper. It's not enough that the information is relevant; it must also be credible. You will want to find more than enough resources so that you can pick and choose the best for your paper. Here are some helpful criteria you can apply to the information you find (a small scoring sketch follows the list below):

Currency:

  • When was the information published?
  • Is the source out-of-date for the topic? 
  • Are there new discoveries or important events since the publication date?

Relevancy:

  • How is the information related to your argument? 
  • Is the information too advanced or too simple? 
  • Is the audience focus appropriate for a research paper? 
  • Are there better sources elsewhere?

Authority:

  • Who is the author? 
  • What are the author's credentials in the relevant field? 
  • Is the publisher well-known in the field? 
  • Did the information go through the peer-review process or some kind of fact-checking?

Accuracy:

  • Can the information be verified? 
  • Are sources cited? 
  • Is the information factual or opinion based?
  • Is the information biased? 
  • Is the information free of grammatical or spelling errors?
  • What is the motive of providing the information: to inform? to sell? to persuade? to entertain?
  • Does the author or publisher make their intentions clear? Who is the intended audience?
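If you keep notes on many candidate sources, a quick tally can help you compare them at a glance. The following Python sketch is purely illustrative: the criteria names mirror the list above, but the one-point-per-"yes" scoring scheme is an assumption made for demonstration, not a standard instrument.

```python
# Illustrative only: tally yes/no answers to the evaluation criteria above.
# The scoring scheme (1 point per "yes") is an assumption for demonstration.

CRITERIA = ["currency", "relevancy", "authority", "accuracy"]

def score_source(answers: dict) -> int:
    """Count how many criteria a source satisfies (True means 'yes')."""
    return sum(1 for criterion in CRITERIA if answers.get(criterion, False))

# Example: notes on one article found during a literature search (hypothetical).
article_notes = {
    "currency": True,    # recent enough for the topic
    "relevancy": True,   # directly related to the argument
    "authority": True,   # credentialed author, peer-reviewed venue
    "accuracy": False,   # claims could not be verified against cited sources
}

print(f"Criteria met: {score_source(article_notes)} of {len(CRITERIA)}")
```

A tally like this is only a memory aid; the judgment about whether a weak criterion disqualifies a source remains yours.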

Evaluating Web Sources

Most web pages are not fact-checked or peer reviewed, so it's especially important to evaluate information you find on the web. Many articles on websites are perfectly good sources of information; many others are distorted or fabricated. Check out our media evaluation guide for tips on evaluating what you see on social media, news sites, blogs, and so on.

This three-part video series, in which university students, historians, and professional fact-checkers go head-to-head in evaluating online information, is also helpful.

  • Last Updated: Apr 18, 2023 12:12 PM
  • URL: https://butte.libguides.com/ResearchPaper

Evaluation Research: Definition, Methods and Examples


Content Index

  • What is evaluation research?
  • Why do evaluation research?
  • Quantitative methods
  • Qualitative methods
  • Process evaluation research question examples
  • Outcome evaluation research question examples

What is evaluation research?

Evaluation research, also known as program evaluation, refers to a research purpose rather than a specific method: the systematic assessment of the worth or merit of the time, money, effort, and resources spent in order to achieve a goal.

Evaluation research is closely related to, but slightly different from, more conventional social research. It uses many of the same methods, but because it takes place within an organizational context, it requires team, interpersonal, management, and political skills that conventional social research demands to a lesser degree. Evaluation research also requires the researcher to keep the interests of stakeholders in mind.

Evaluation research is a type of applied research, so it is intended to have some real-world effect. Many methods, such as surveys and experiments, can be used for evaluation research. It is a rigorous, systematic process of collecting, analyzing, and reporting data about organizations, processes, projects, services, and/or resources. Evaluation research enhances knowledge and decision-making and leads to practical applications.


Why do evaluation research?

The common goal of most evaluations is to extract meaningful information from the audience and provide valuable insights to evaluators such as sponsors, donors, client groups, administrators, staff, and other relevant constituencies. Most often, feedback is perceived as valuable if it helps in decision-making. However, evaluation research does not always create an impact that can be applied elsewhere; sometimes it fails to influence short-term decisions. It is equally true that an evaluation might initially seem to have no influence but have a delayed impact when the situation is more favorable. In spite of this, there is general agreement that the major goal of evaluation research should be to improve decision-making through the systematic use of measurable feedback.

Below are some of the benefits of evaluation research:

  • Gain insights about a project or program and its operations

Evaluation research lets you understand what works and what doesn't, where you were, where you are, and where you are headed. You can identify areas of improvement as well as strengths, so you can figure out what you need to focus on and whether there are any threats to your business. You can also find out whether there are hidden sectors in the market that are still untapped.

  • Improve practice

It is essential to gauge your past performance and understand what went wrong in order to deliver better services to your customers. Unless communication is two-way, there is no way to improve on what you have to offer. Evaluation research gives your employees and customers an opportunity to express how they feel and whether there is anything they would like to change. It also lets you modify or adapt a practice so that it increases the chances of success.

  • Assess the effects

After evaluating the efforts, you can see how well you are meeting objectives and targets. Evaluations let you measure whether the intended benefits are really reaching the targeted audience and, if so, how effectively.

  • Build capacity

Evaluations help you analyze demand patterns and predict whether you will need more funds, upgraded skills, or more efficient operations. They let you find the gaps in the production-to-delivery chain and possible ways to fill them.

Methods of evaluation research

All market research methods involve collecting and analyzing data, making decisions about the validity of the information, and deriving relevant inferences from it. Evaluation research comprises planning, conducting, and analyzing the results, which includes the use of data collection techniques and the application of statistical methods.

Some popular evaluation methods are input measurement, output or performance measurement, impact or outcomes assessment, quality assessment, process evaluation, benchmarking, standards, cost analysis, organizational effectiveness, program evaluation methods, and LIS-centered methods. A few types of evaluation, such as descriptive studies, formative evaluations, and implementation analysis, do not always result in a meaningful assessment. Evaluation research emphasizes the information-processing and feedback functions of evaluation.

These methods can be broadly classified as quantitative and qualitative methods.

Quantitative methods

Quantitative research methods are used to measure anything tangible and answer questions such as:

  • Who was involved?
  • What were the outcomes?
  • What was the price?

The best way to collect quantitative data is through surveys, questionnaires, and polls. You can also create pre-tests and post-tests, review existing documents and databases, or gather clinical data.

Surveys are used to gather the opinions, feedback, or ideas of your employees or customers and consist of various question types. They can be conducted face-to-face, by telephone, by mail, or online. Online surveys do not require human intervention and are far more efficient and practical. You can view the results on a dashboard and dig deeper using filter criteria based on factors such as age, gender, and location. You can also apply survey logic such as branching, quotas, chaining, and looping to reduce the time it takes both to create and to respond to a survey, and you can generate reports that apply statistical formulas and present data that can be readily absorbed in meetings.
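The sketch below illustrates, in a generic way, the kind of segmentation described above using the pandas library. The column names and data are invented for demonstration and do not reflect any particular survey platform's export format.

```python
# Illustrative sketch: filtering and summarizing exported survey results.
# Column names ("age_group", "gender", "satisfaction") are assumptions about
# a hypothetical CSV export, not a specific platform's format.
import pandas as pd

responses = pd.DataFrame({
    "age_group":    ["18-24", "25-34", "25-34", "35-44", "18-24", "35-44"],
    "gender":       ["F", "M", "F", "F", "M", "M"],
    "satisfaction": [4, 5, 3, 4, 2, 5],  # rating from 1 (low) to 5 (high)
})

# Average rating and response count by age group, mirroring a dashboard filter.
by_age = responses.groupby("age_group")["satisfaction"].agg(["mean", "count"])
print(by_age)

# Drill down further, e.g., by age group and gender.
print(responses.groupby(["age_group", "gender"])["satisfaction"].mean())
```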


Quantitative data measure the depth and breadth of an initiative: for instance, the number of people who participated in a non-profit event or the number of people who enrolled in a new course at a university. Quantitative data collected before and after a program can show its results and impact.
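As a minimal illustration of comparing before-and-after measurements, the following sketch runs a paired t-test on hypothetical pre- and post-program scores for the same participants. The data, and the choice of a paired t-test, are assumptions for demonstration; the appropriate analysis depends on the design and the measurement scale.

```python
# Illustrative sketch: compare hypothetical pre- and post-program scores
# for the same participants with a paired t-test (SciPy).
from scipy import stats

pre_scores  = [52, 60, 45, 70, 58, 63, 49, 55]   # before the program (hypothetical)
post_scores = [61, 66, 50, 74, 65, 70, 48, 62]   # same participants, after

t_stat, p_value = stats.ttest_rel(post_scores, pre_scores)
mean_change = sum(b - a for a, b in zip(pre_scores, post_scores)) / len(pre_scores)

print(f"Mean change: {mean_change:.1f} points, t = {t_stat:.2f}, p = {p_value:.3f}")
```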

The accuracy of quantitative data used for evaluation research depends on how well the sample represents the population, the ease of analysis, and the consistency of measurement. Quantitative methods can fail if the questions are not framed correctly or are not distributed to the right audience. Quantitative data also do not provide an understanding of context and may not be apt for complex issues.


Qualitative methods

Qualitative research methods are used where quantitative methods cannot solve the research problem, i.e., where intangible values must be measured. They answer questions such as:

  • What is the value added?
  • How satisfied are you with our service?
  • How likely are you to recommend us to your friends?
  • What will improve your experience?


Qualitative data are collected through observation, interviews, case studies, and focus groups. The steps in a qualitative study involve examining, comparing and contrasting, and understanding patterns. Analysts draw conclusions after identifying themes, clustering similar data, and reducing them to points that make sense.
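As a toy illustration of the clustering and reducing steps, the sketch below tallies theme codes that an analyst has already assigned to interview excerpts. The excerpts and theme labels are invented, and a frequency count is only a small, mechanical part of real qualitative analysis.

```python
# Illustrative sketch: tally analyst-assigned theme codes across excerpts.
# The theme labels and coded excerpts are invented for demonstration.
from collections import Counter

coded_excerpts = [
    ("The sign-up process took too long", ["usability", "time"]),
    ("Support staff were friendly and quick", ["staff", "time"]),
    ("I did not understand the pricing page", ["usability", "communication"]),
    ("Emails explained each step clearly", ["communication"]),
]

theme_counts = Counter(theme for _, themes in coded_excerpts for theme in themes)

for theme, count in theme_counts.most_common():
    print(f"{theme}: {count}")
```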

Observations may help explain behaviors as well as the social context that is generally not discovered by quantitative methods. Observations of behavior and body language can be made by watching a participant and recording audio or video. Structured interviews can be conducted with people alone or in a group under controlled conditions, or participants may be asked open-ended qualitative research questions. Qualitative research methods are also used to understand a person's perceptions and motivations.


The strength of this method is that group discussion can generate ideas and stimulate memories, with topics cascading as the discussion unfolds. The accuracy of qualitative data depends on how well contextual data explain complex issues and complement quantitative data. Qualitative research helps answer "why" and "how" after "what" has been answered. Its limitations for evaluation research are that it is subjective, time-consuming, costly, and difficult to analyze and interpret.


Survey software can be used for both evaluation research methods. You can use the sample questions below and send a survey in minutes. A survey tool simplifies the process, from creating a survey and importing contacts to distributing the survey and generating reports that aid in research.

Examples of evaluation research

Evaluation research questions lay the foundation of a successful evaluation. They define the topics that will be evaluated. Keeping evaluation questions ready not only saves time and money, but also makes it easier to decide what data to collect, how to analyze it, and how to report it.

Evaluation research questions must be developed and agreed on in the planning stage; however, ready-made research templates can also be used.

Process evaluation research question examples:

  • How often do you use our product in a day?
  • Were approvals taken from all stakeholders?
  • Can you report the issue from the system?
  • Can you submit the feedback from the system?
  • Was each task done as per the standard operating procedure?
  • What were the barriers to the implementation of each task?
  • Were any improvement areas discovered?

Outcome evaluation research question examples:

  • How satisfied are you with our product?
  • Did the program produce intended outcomes?
  • What were the unintended outcomes?
  • Has the program increased the knowledge of participants?
  • Were the participants of the program employable before the course started?
  • Do participants of the program have the skills to find a job after the course ended?
  • Is the knowledge of participants better compared to those who did not participate in the program?



National Research Council (US) Panel on the Evaluation of AIDS Interventions; Coyle SL, Boruch RF, Turner CF, editors. Evaluating AIDS Prevention Programs: Expanded Edition. Washington (DC): National Academies Press (US); 1991.


1 Design and Implementation of Evaluation Research

Evaluation has its roots in the social, behavioral, and statistical sciences, and it relies on their principles and methodologies of research, including experimental design, measurement, statistical tests, and direct observation. What distinguishes evaluation research from other social science is that its subjects are ongoing social action programs that are intended to produce individual or collective change. This setting usually engenders a great need for cooperation between those who conduct the program and those who evaluate it. This need for cooperation can be particularly acute in the case of AIDS prevention programs because those programs have been developed rapidly to meet the urgent demands of a changing and deadly epidemic.

Although the characteristics of AIDS intervention programs place some unique demands on evaluation, the techniques for conducting good program evaluation do not need to be invented. Two decades of evaluation research have provided a basic conceptual framework for undertaking such efforts (see, e.g., Campbell and Stanley [1966] and Cook and Campbell [1979] for discussions of outcome evaluation; see Weiss [1972] and Rossi and Freeman [1982] for process and outcome evaluations); in addition, similar programs, such as the antismoking campaigns, have been subject to evaluation, and they offer examples of the problems that have been encountered.

In this chapter the panel provides an overview of the terminology, types, designs, and management of research evaluation. The following chapter provides an overview of program objectives and the selection and measurement of appropriate outcome variables for judging the effectiveness of AIDS intervention programs. These issues are discussed in detail in the subsequent, program-specific Chapters 3 - 5 .

  • Types of Evaluation

The term evaluation implies a variety of different things to different people. The recent report of the Committee on AIDS Research and the Behavioral, Social, and Statistical Sciences defines the area through a series of questions (Turner, Miller, and Moses, 1989:317-318):

Evaluation is a systematic process that produces a trustworthy account of what was attempted and why; through the examination of results—the outcomes of intervention programs—it answers the questions, "What was done?" "To whom, and how?" and "What outcomes were observed?" Well-designed evaluation permits us to draw inferences from the data and addresses the difficult question: "What do the outcomes mean?"

These questions differ in the degree of difficulty of answering them. An evaluation that tries to determine the outcomes of an intervention and what those outcomes mean is a more complicated endeavor than an evaluation that assesses the process by which the intervention was delivered. Both kinds of evaluation are necessary because they are intimately connected: to establish a project's success, an evaluator must first ask whether the project was implemented as planned and then whether its objective was achieved. Questions about a project's implementation usually fall under the rubric of process evaluation . If the investigation involves rapid feedback to the project staff or sponsors, particularly at the earliest stages of program implementation, the work is called formative evaluation . Questions about effects or effectiveness are often variously called summative evaluation, impact assessment, or outcome evaluation, the term the panel uses.

Formative evaluation is a special type of early evaluation that occurs during and after a program has been designed but before it is broadly implemented. Formative evaluation is used to understand the need for the intervention and to make tentative decisions about how to implement or improve it. During formative evaluation, information is collected and then fed back to program designers and administrators to enhance program development and maximize the success of the intervention. For example, formative evaluation may be carried out through a pilot project before a program is implemented at several sites. A pilot study of a community-based organization (CBO), for example, might be used to gather data on problems involving access to and recruitment of targeted populations and the utilization and implementation of services; the findings of such a study would then be used to modify (if needed) the planned program.

Another example of formative evaluation is the use of a "story board" design of a TV message that has yet to be produced. A story board is a series of text and sketches of camera shots that are to be produced in a commercial. To evaluate the effectiveness of the message and forecast some of the consequences of actually broadcasting it to the general public, an advertising agency convenes small groups of people to react to and comment on the proposed design.

Once an intervention has been implemented, the next stage of evaluation is process evaluation, which addresses two broad questions: "What was done?" and "To whom, and how?" Ordinarily, process evaluation is carried out at some point in the life of a project to determine how and how well the delivery goals of the program are being met. When intervention programs continue over a long period of time (as is the case for some of the major AIDS prevention programs), measurements at several times are warranted to ensure that the components of the intervention continue to be delivered by the right people, to the right people, in the right manner, and at the right time. Process evaluation can also play a role in improving interventions by providing the information necessary to change delivery strategies or program objectives in a changing epidemic.

Research designs for process evaluation include direct observation of projects, surveys of service providers and clients, and the monitoring of administrative records. The panel notes that the Centers for Disease Control (CDC) is already collecting some administrative records on its counseling and testing program and community-based projects. The panel believes that this type of evaluation should be a continuing and expanded component of intervention projects to guarantee the maintenance of the projects' integrity and responsiveness to their constituencies.

The purpose of outcome evaluation is to identify consequences and to establish that consequences are, indeed, attributable to a project. This type of evaluation answers the questions, "What outcomes were observed?" and, perhaps more importantly, "What do the outcomes mean?" Like process evaluation, outcome evaluation can also be conducted at intervals during an ongoing program, and the panel believes that such periodic evaluation should be done to monitor goal achievement.

The panel believes that these stages of evaluation (i.e., formative, process, and outcome) are essential to learning how AIDS prevention programs contribute to containing the epidemic. After a body of findings has been accumulated from such evaluations, it may be fruitful to launch another stage of evaluation: cost-effectiveness analysis (see Weinstein et al., 1989). Like outcome evaluation, cost-effectiveness analysis also measures program effectiveness, but it extends the analysis by adding a measure of program cost. The panel believes that consideration of cost-effectiveness analysis should be postponed until more experience is gained with formative, process, and outcome evaluation of the CDC AIDS prevention programs.
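As a minimal sketch of how cost data extend an outcome evaluation, the following example computes an incremental cost-effectiveness ratio for two hypothetical programs. The figures are invented for illustration and are not drawn from any actual program or from the sources cited above.

```python
# Illustrative sketch: incremental cost-effectiveness ratio (ICER) for two
# hypothetical programs. All figures are invented for demonstration.

cost_enhanced, infections_averted_enhanced = 250_000, 40   # enhanced intervention
cost_standard, infections_averted_standard = 150_000, 25   # standard intervention

icer = (cost_enhanced - cost_standard) / (
    infections_averted_enhanced - infections_averted_standard
)
print(f"Additional cost per additional infection averted: ${icer:,.0f}")
```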

  • Evaluation Research Design

Process and outcome evaluations require different types of research designs, as discussed below. Formative evaluations, which are intended to both assess implementation and forecast effects, use a mix of these designs.

Process Evaluation Designs

To conduct process evaluations on how well services are delivered, data need to be gathered on the content of interventions and on their delivery systems. Suggested methodologies include direct observation, surveys, and record keeping.

Direct observation designs include case studies, in which participant-observers unobtrusively and systematically record encounters within a program setting, and nonparticipant observation, in which long, open-ended (or "focused") interviews are conducted with program participants. 1 For example, "professional customers" at counseling and testing sites can act as project clients to monitor activities unobtrusively; 2 alternatively, nonparticipant observers can interview both staff and clients. Surveys —either censuses (of the whole population of interest) or samples—elicit information through interviews or questionnaires completed by project participants or potential users of a project. For example, surveys within community-based projects can collect basic statistical information on project objectives, what services are provided, to whom, when, how often, for how long, and in what context.

Record keeping consists of administrative or other reporting systems that monitor use of services. Standardized reporting ensures consistency in the scope and depth of data collected. To use the media campaign as an example, the panel suggests using standardized data on the use of the AIDS hotline to monitor public attentiveness to the advertisements broadcast by the media campaign.
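Standardized records of this kind lend themselves to simple, repeatable summaries. The sketch below is a hypothetical illustration of aggregating hotline call records by week with pandas; the record format, topics, and dates are assumptions made for demonstration, not the structure of any actual reporting system.

```python
# Illustrative sketch: aggregate standardized hotline call records by week
# to monitor public attentiveness. The record format is an assumption.
import pandas as pd

calls = pd.DataFrame({
    "call_date": pd.to_datetime(
        ["2024-03-01", "2024-03-02", "2024-03-08", "2024-03-09", "2024-03-15"]
    ),
    "topic": ["testing", "prevention", "testing", "testing", "prevention"],
})

indexed = calls.set_index("call_date")
print(indexed.resample("W").size())                    # weekly call volume
print(indexed.groupby("topic").resample("W").size())   # weekly volume by topic
```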

These designs are simple to understand, but they require expertise to implement. For example, observational studies must be conducted by people who are well trained in how to carry out on-site tasks sensitively and to record their findings uniformly. Observers can either complete narrative accounts of what occurred in a service setting or they can complete some sort of data inventory to ensure that multiple aspects of service delivery are covered. These types of studies are time consuming and benefit from corroboration among several observers. The use of surveys in research is well-understood, although they, too, require expertise to be well implemented. As the program chapters reflect, survey data collection must be carefully designed to reduce problems of validity and reliability and, if samples are used, to design an appropriate sampling scheme. Record keeping or service inventories are probably the easiest research designs to implement, although preparing standardized internal forms requires attention to detail about salient aspects of service delivery.

Outcome Evaluation Designs

Research designs for outcome evaluations are meant to assess principal and relative effects. Ideally, to assess the effect of an intervention on program participants, one would like to know what would have happened to the same participants in the absence of the program. Because it is not possible to make this comparison directly, inference strategies that rely on proxies have to be used. Scientists use three general approaches to construct proxies for use in the comparisons required to evaluate the effects of interventions: (1) nonexperimental methods, (2) quasi-experiments, and (3) randomized experiments. The first two are discussed below, and randomized experiments are discussed in the subsequent section.

Nonexperimental and Quasi-Experimental Designs 3

The most common form of nonexperimental design is a before-and-after study. In this design, pre-intervention measurements are compared with equivalent measurements made after the intervention to detect change in the outcome variables that the intervention was designed to influence.

Although the panel finds that before-and-after studies frequently provide helpful insights, the panel believes that these studies do not provide sufficiently reliable information to be the cornerstone for evaluation research on the effectiveness of AIDS prevention programs. The panel's conclusion follows from the fact that the postintervention changes cannot usually be attributed unambiguously to the intervention. 4 Plausible competing explanations for differences between pre-and postintervention measurements will often be numerous, including not only the possible effects of other AIDS intervention programs, news stories, and local events, but also the effects that may result from the maturation of the participants and the educational or sensitizing effects of repeated measurements, among others.

Quasi-experimental and matched control designs provide a separate comparison group. In these designs, the control group may be selected by matching nonparticipants to participants in the treatment group on the basis of selected characteristics. It is difficult to ensure the comparability of the two groups even when they are matched on many characteristics because other relevant factors may have been overlooked or mismatched or they may be difficult to measure (e.g., the motivation to change behavior). In some situations, it may simply be impossible to measure all of the characteristics of the units (e.g., communities) that may affect outcomes, much less demonstrate their comparability.

Matched control designs require extraordinarily comprehensive scientific knowledge about the phenomenon under investigation in order for evaluators to be confident that all of the relevant determinants of outcomes have been properly accounted for in the matching. Three types of information or knowledge are required: (1) knowledge of intervening variables that also affect the outcome of the intervention and, consequently, need adjustment to make the groups comparable; (2) measurements on all intervening variables for all subjects; and (3) knowledge of how to make the adjustments properly, which in turn requires an understanding of the functional relationship between the intervening variables and the outcome variables. Satisfying each of these information requirements is likely to be more difficult than answering the primary evaluation question, "Does this intervention produce beneficial effects?"

Given the size and the national importance of AIDS intervention programs and given the state of current knowledge about behavior change in general and AIDS prevention, in particular, the panel believes that it would be unwise to rely on matching and adjustment strategies as the primary design for evaluating AIDS intervention programs. With differently constituted groups, inferences about results are hostage to uncertainty about the extent to which the observed outcome actually results from the intervention and is not an artifact of intergroup differences that may not have been removed by matching or adjustment.

Randomized Experiments

A remedy to the inferential uncertainties that afflict nonexperimental designs is provided by randomized experiments . In such experiments, one singly constituted group is established for study. A subset of the group is then randomly chosen to receive the intervention, with the other subset becoming the control. The two groups are not identical, but they are comparable. Because they are two random samples drawn from the same population, they are not systematically different in any respect, which is important for all variables—both known and unknown—that can influence the outcome. Dividing a singly constituted group into two random and therefore comparable subgroups cuts through the tangle of causation and establishes a basis for the valid comparison of respondents who do and do not receive the intervention. Randomized experiments provide for clear causal inference by solving the problem of group comparability, and may be used to answer the evaluation questions "Does the intervention work?" and "What works better?"

Which question is answered depends on whether the controls receive an intervention or not. When the object is to estimate whether a given intervention has any effects, individuals are randomly assigned to the project or to a zero-treatment control group. The control group may be put on a waiting list or simply not get the treatment. This design addresses the question, "Does it work?"

When the object is to compare variations on a project—e.g., individual counseling sessions versus group counseling—then individuals are randomly assigned to these two regimens, and there is no zero-treatment control group. This design addresses the question, "What works better?" In either case, the control groups must be followed up as rigorously as the experimental groups.
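In practice, random assignment of a singly constituted group can be implemented with a seeded shuffle. The sketch below illustrates the idea for a hypothetical roster of participants; the roster, group sizes, and seed are assumptions for demonstration, not a prescription for any particular study.

```python
# Illustrative sketch: randomly assign a roster of consenting participants
# to a treatment and a control condition. The roster is hypothetical.
import random

participants = [f"participant_{i:03d}" for i in range(1, 101)]  # 100 volunteers

rng = random.Random(20240301)  # fixed seed keeps the assignment reproducible and auditable
shuffled = participants[:]
rng.shuffle(shuffled)

midpoint = len(shuffled) // 2
treatment_group = sorted(shuffled[:midpoint])   # receives the intervention
control_group = sorted(shuffled[midpoint:])     # zero-treatment or alternative services

print(len(treatment_group), len(control_group))  # 50 50
```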

A randomized experiment requires that individuals, organizations, or other treatment units be randomly assigned to one of two or more treatments or program variations. Random assignment ensures that the estimated differences between the groups so constituted are statistically unbiased; that is, that any differences in effects measured between them are a result of treatment. The absence of statistical bias in groups constituted in this fashion stems from the fact that random assignment ensures that there are no systematic differences between them, differences that can and usually do affect groups composed in ways that are not random. 5 The panel believes this approach is far superior for outcome evaluations of AIDS interventions than the nonrandom and quasi-experimental approaches. Therefore,

To improve interventions that are already broadly implemented, the panel recommends the use of randomized field experiments of alternative or enhanced interventions.

Under certain conditions, the panel also endorses randomized field experiments with a nontreatment control group to evaluate new interventions. In the context of a deadly epidemic, ethics dictate that treatment not be withheld simply for the purpose of conducting an experiment. Nevertheless, there may be times when a randomized field test of a new treatment with a no-treatment control group is worthwhile. One such time is during the design phase of a major or national intervention.

Before a new intervention is broadly implemented, the panel recommends that it be pilot tested in a randomized field experiment.

The panel considered the use of experiments with delayed rather than no treatment. A delayed-treatment control group strategy might be pursued when resources are too scarce for an intervention to be widely distributed at one time. For example, a project site that is waiting to receive funding for an intervention would be designated as the control group. If it is possible to randomize which projects in the queue receive the intervention, an evaluator could measure and compare outcomes after the experimental group had received the new treatment but before the control group received it. The panel believes that such a design can be applied only in limited circumstances, such as when groups would have access to related services in their communities and that conducting the study was likely to lead to greater access or better services. For example, a study cited in Chapter 4 used a randomized delayed-treatment experiment to measure the effects of a community-based risk reduction program. However, such a strategy may be impractical for several reasons, including:

  • sites waiting for funding for an intervention might seek resources from another source;
  • it might be difficult to enlist the nonfunded site and its clients to participate in the study;
  • there could be an appearance of favoritism toward projects whose funding was not delayed.

Although randomized experiments have many benefits, the approach is not without pitfalls. In the planning stages of evaluation, it is necessary to contemplate certain hazards, such as the Hawthorne effect 6 and differential project dropout rates. Precautions must be taken either to prevent these problems or to measure their effects. Fortunately, there is some evidence suggesting that the Hawthorne effect is usually not very large (Rossi and Freeman, 1982:175-176).

Attrition is potentially more damaging to an evaluation, and it must be limited if the experimental design is to be preserved. If sample attrition is not limited in an experimental design, it becomes necessary to account for the potentially biasing impact of the loss of subjects in the treatment and control conditions of the experiment. The statistical adjustments required to make inferences about treatment effectiveness in such circumstances can introduce uncertainties that are as worrisome as those afflicting nonexperimental and quasi-experimental designs. Thus, the panel's recommendation of the selective use of randomized design carries an implicit caveat: To realize the theoretical advantages offered by randomized experimental designs, substantial efforts will be required to ensure that the designs are not compromised by flawed execution.

Another pitfall to randomization is its appearance of unfairness or unattractiveness to participants and the controversial legal and ethical issues it sometimes raises. Often, what is being criticized is the control of project assignment of participants rather than the use of randomization itself. In deciding whether random assignment is appropriate, it is important to consider the specific context of the evaluation and how participants would be assigned to projects in the absence of randomization. The Federal Judicial Center (1981) offers five threshold conditions for the use of random assignment.

  • Does present practice or policy need improvement?
  • Is there significant uncertainty about the value of the proposed regimen?
  • Are there acceptable alternatives to randomized experiments?
  • Will the results of the experiment be used to improve practice or policy?
  • Is there a reasonable protection against risk for vulnerable groups (i.e., individuals within the justice system)?

The parent committee has argued that these threshold conditions apply in the case of AIDS prevention programs (see Turner, Miller, and Moses, 1989:331-333).

Although randomization may be desirable from an evaluation and ethical standpoint, and acceptable from a legal standpoint, it may be difficult to implement from a practical or political standpoint. Again, the panel emphasizes that questions about the practical or political feasibility of the use of randomization may in fact refer to the control of program allocation rather than to the issues of randomization itself. In fact, when resources are scarce, it is often more ethical and politically palatable to randomize allocation rather than to allocate on grounds that may appear biased.

It is usually easier to defend the use of randomization when the choice has to do with assignment to groups receiving alternative services than when the choice involves assignment to groups receiving no treatment. For example, in comparing a testing and counseling intervention that offered a special "skills training" session in addition to its regular services with a counseling and testing intervention that offered no additional component, random assignment of participants to one group rather than another may be acceptable to program staff and participants because the relative values of the alternative interventions are unknown.

The more difficult issue is the introduction of new interventions that are perceived to be needed and effective in a situation in which there are no services. An argument that is sometimes offered against the use of randomization in this instance is that interventions should be assigned on the basis of need (perhaps as measured by rates of HIV incidence or of high-risk behaviors). But this argument presumes that the intervention will have a positive effect—which is unknown before evaluation—and that relative need can be established, which is a difficult task in itself.

The panel recognizes that community and political opposition to randomization to zero treatments may be strong and that enlisting participation in such experiments may be difficult. This opposition and reluctance could seriously jeopardize the production of reliable results if it is translated into noncompliance with a research design. The feasibility of randomized experiments for AIDS prevention programs has already been demonstrated, however (see the review of selected experiments in Turner, Miller, and Moses, 1989:327-329). The substantial effort involved in mounting randomized field experiments is repaid by the fact that they can provide unbiased evidence of the effects of a program.

Unit of Assignment.

The unit of assignment of an experiment may be an individual person, a clinic (i.e., the clientele of the clinic), or another organizational unit (e.g., the community or city). The treatment unit is selected at the earliest stage of design. Variations of units are illustrated in the following four examples of intervention programs.

(1) Two different pamphlets (A and B) on the same subject (e.g., testing) are distributed in an alternating sequence to individuals calling an AIDS hotline. The outcome to be measured is whether the recipient returns a card asking for more information.

(2) Two instruction curricula (A and B) about AIDS and HIV infections are prepared for use in high school driver education classes. The outcome to be measured is a score on a knowledge test.

(3) Of all clinics for sexually transmitted diseases (STDs) in a large metropolitan area, some are randomly chosen to introduce a change in the fee schedule. The outcome to be measured is the change in patient load.

(4) A coordinated set of community-wide interventions—involving community leaders, social service agencies, the media, community associations and other groups—is implemented in one area of a city. Outcomes are knowledge as assessed by testing at drug treatment centers and STD clinics and condom sales in the community's retail outlets.

In example (1), the treatment unit is an individual person who receives pamphlet A or pamphlet B. If either "treatment" is applied again, it would be applied to a person. In example (2), the high school class is the treatment unit; everyone in a given class experiences either curriculum A or curriculum B. If either treatment is applied again, it would be applied to a class. The treatment unit is the clinic in example (3), and in example (4), the treatment unit is a community .

The consistency of the effects of a particular intervention across repetitions justly carries a heavy weight in appraising the intervention. It is important to remember that repetitions of a treatment or intervention are the number of treatment units to which the intervention is applied. This is a salient principle in the design and execution of intervention programs as well as in the assessment of their results.

The adequacy of the proposed sample size (number of treatment units) has to be considered in advance. Adequacy depends mainly on two factors:

  • How much variation occurs from unit to unit among units receiving a common treatment? If that variation is large, then the number of units needs to be large.
  • What is the minimum size of a possible treatment difference that, if present, would be practically important? That is, how small a treatment difference is it essential to detect if it is present? The smaller this quantity, the larger the number of units that are necessary.

Many formal methods for considering and choosing sample size exist (see, e.g., Cohen, 1988). Practical circumstances occasionally allow choosing between designs that involve units at different levels; thus, a classroom might be the unit if the treatment is applied in one way, but an entire school might be the unit if the treatment is applied in another. When both approaches are feasible, the use of a power analysis for each approach may lead to a reasoned choice.
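As a worked illustration of one such calculation, the sketch below applies the standard normal-approximation formula for comparing two independent proportions. The baseline and target rates, significance level, and power are assumptions chosen only for demonstration; the appropriate inputs depend on the intervention and the unit of assignment.

```python
# Illustrative sketch: units per arm needed to detect a difference between two
# proportions using the normal-approximation formula. All inputs are assumptions.
import math
from scipy.stats import norm

p1, p2 = 0.30, 0.20        # e.g., outcome rates under control vs. intervention
alpha, power = 0.05, 0.80  # two-sided significance level and desired power

z_alpha = norm.ppf(1 - alpha / 2)
z_beta = norm.ppf(power)

n_per_arm = ((z_alpha + z_beta) ** 2 * (p1 * (1 - p1) + p2 * (1 - p2))) / (p1 - p2) ** 2
print(math.ceil(n_per_arm))  # about 291 units per arm under these assumptions
```

Note that the "units" in such a calculation are treatment units in the sense discussed above: if classrooms or clinics are assigned, the calculation concerns classrooms or clinics, not the individuals within them.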

Choice of Methods

There is some controversy about the advantages of randomized experiments in comparison with other evaluative approaches. It is the panel's belief that when a (well-executed) randomized study is feasible, it is superior to alternative kinds of studies in the strength and clarity of whatever conclusions emerge, primarily because the experimental approach avoids selection biases. 7 Other evaluation approaches are sometimes unavoidable, but ordinarily the accumulation of valid information will go more slowly and less securely than in randomized approaches.

Experiments in medical research shed light on the advantages of carefully conducted randomized experiments. The Salk vaccine trials are a successful example of a large, randomized study. In a double-blind test of the polio vaccine, children in various communities were randomly assigned to one of two treatments, either the vaccine or a placebo. 8 By this method, the effectiveness of the Salk vaccine was demonstrated in one summer of research (Meier, 1957).

A sufficient accumulation of relevant, observational information, especially when collected in studies using different procedures and sample populations, may also clearly demonstrate the effectiveness of a treatment or intervention. The process of accumulating such information can be a long one, however. When a (well-executed) randomized study is feasible, it can provide evidence that is subject to less uncertainty in its interpretation, and it can often do so in a more timely fashion. In the midst of an epidemic, the panel believes it proper that randomized experiments be one of the primary strategies for evaluating the effectiveness of AIDS prevention efforts. In making this recommendation, however, the panel also wishes to emphasize that the advantages of the randomized experimental design can be squandered by poor execution (e.g., by compromised assignment of subjects, significant subject attrition rates, etc.). To achieve the advantages of the experimental design, care must be taken to ensure that the integrity of the design is not compromised by poor execution.
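The selection-bias point (see note 7) can be seen in a small simulation, offered here as an editorial illustration rather than material from the report: when motivated people volunteer for a program, a naive volunteer-versus-nonvolunteer comparison overstates the program's effect, while random assignment recovers it.

```python
# Sketch: why randomization matters. "Motivation" raises the outcome on its
# own and also makes self-selection into the program more likely, so a naive
# comparison of volunteers vs. everyone else confounds the two effects;
# random assignment breaks that link.
import random
random.seed(0)

TRUE_EFFECT = 5.0
N = 100_000
motivation = [random.gauss(0, 1) for _ in range(N)]

def outcome(m, treated):
    return 50 + 10 * m + (TRUE_EFFECT if treated else 0.0) + random.gauss(0, 5)

def estimated_effect(assignments):
    treated = [outcome(m, True) for m, t in zip(motivation, assignments) if t]
    control = [outcome(m, False) for m, t in zip(motivation, assignments) if not t]
    return sum(treated) / len(treated) - sum(control) / len(control)

# Self-selection: more motivated people are more likely to enroll.
self_selected = [m > random.gauss(0, 1) for m in motivation]
# Randomization: enrollment is a coin flip, unrelated to motivation.
randomized = [random.random() < 0.5 for _ in range(N)]

print(f"true effect:              {TRUE_EFFECT:.1f}")
print(f"self-selected comparison: {estimated_effect(self_selected):.1f}")  # biased well above 5
print(f"randomized experiment:    {estimated_effect(randomized):.1f}")     # close to the true effect
```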

In proposing that randomized experiments be one of the primary strategies for evaluating the effectiveness of AIDS prevention programs, the panel also recognizes that there are situations in which randomization will be impossible or, for other reasons, cannot be used. In its next report the panel will describe at length appropriate nonexperimental strategies to be considered in situations in which an experiment is not a practical or desirable alternative.

  • The Management of Evaluation

Conscientious evaluation requires a considerable investment of funds, time, and personnel. Because the panel recognizes that resources are not unlimited, it suggests that they be concentrated on the evaluation of a subset of projects to maximize the return on investment and to enhance the likelihood of high-quality results.

Project Selection

Deciding which programs or sites to evaluate is by no means a trivial matter. Selection should be carefully weighed so that projects that are not replicable or that have little chance for success are not subjected to rigorous evaluations.

The panel recommends that any intensive evaluation of an intervention be conducted on a subset of projects selected according to explicit criteria. These criteria should include the replicability of the project, the feasibility of evaluation, and the project's potential effectiveness for prevention of HIV transmission.

If a project is replicable, it means that the particular circumstances of service delivery in that project can be duplicated. In other words, for CBOs and counseling and testing projects, the content and setting of an intervention can be duplicated across sites. Feasibility of evaluation means that, as a practical matter, the research can be done: that is, the research design is adequate to control for rival hypotheses, it is not excessively costly, and the project is acceptable to the community and the sponsor. Potential effectiveness for HIV prevention means that the intervention is at least based on a reasonable theory (or mix of theories) about behavioral change (e.g., social learning theory [Bandura, 1977], the health belief model [Janz and Becker, 1984], etc.), if it has not already been found to be effective in related circumstances.

In addition, since it is important to ensure that the results of evaluations will be broadly applicable,

The panel recommends that evaluation be conducted and replicated across major types of subgroups, programs, and settings. Attention should be paid to geographic areas with low and high AIDS prevalence, as well as to subpopulations at low and high risk for AIDS.

Research Administration

The sponsoring agency interested in evaluating an AIDS intervention should consider the mechanisms through which the research will be carried out as well as the desirability of both independent oversight and agency in-house conduct and monitoring of the research. The appropriate entities and mechanisms for conducting evaluations depend to some extent on the kinds of data being gathered and the evaluation questions being asked.

Oversight and monitoring are important to keep projects fully informed about the other evaluations relevant to their own and to render assistance when needed. Oversight and monitoring are also important because evaluation is often a sensitive issue for project and evaluation staff alike. The panel is aware that evaluation may appear threatening to practitioners and researchers because of the possibility that evaluation research will show that their projects are not as effective as they believe them to be. These needs and vulnerabilities should be taken into account as evaluation research management is developed.

Conducting the Research

To conduct some aspects of a project's evaluation, it may be appropriate to involve project administrators, especially when the data will be used to evaluate delivery systems (e.g., to determine when and which services are being delivered). To evaluate outcomes, the services of an outside evaluator 9 or evaluation team are almost always required because few practitioners have the necessary professional experience or the time and resources necessary to do evaluation. The outside evaluator must have relevant expertise in evaluation research methodology and must also be sensitive to the fears, hopes, and constraints of project administrators.

Several evaluation management schemes are possible. For example, a prospective AIDS prevention project group (the contractor) can bid on a contract for project funding that includes an intensive evaluation component. The actual evaluation can be conducted either by the contractor alone or by the contractor working in concert with an outside independent collaborator. This mechanism has the advantage of involving project practitioners in the work of evaluation as well as building separate but mutually informing communities of experts around the country. Alternatively, a contract can be let with a single evaluator or evaluation team that will collaborate with the subset of sites that is chosen for evaluation. This variation would be managerially less burdensome than awarding separate contracts, but it would require greater dependence on the expertise of a single investigator or investigative team. ( Appendix A discusses contracting options in greater depth.) Both of these approaches accord with the parent committee's recommendation that collaboration between practitioners and evaluation researchers be ensured. Finally, in the more traditional evaluation approach, independent principal investigators or investigative teams may respond to a request for proposal (RFP) issued to evaluate individual projects. Such investigators are frequently university-based or are members of a professional research organization, and they bring to the task a variety of research experiences and perspectives.

Independent Oversight

The panel believes that coordination and oversight of multisite evaluations is critical because of the variability in investigators' expertise and in the results of the projects being evaluated. Oversight can provide quality control for individual investigators and can be used to review and integrate findings across sites for developing policy. The independence of an oversight body is crucial to ensure that project evaluations do not succumb to the pressures for positive findings of effectiveness.

When evaluation is to be conducted by a number of different evaluation teams, the panel recommends establishing an independent scientific committee to oversee project selection and research efforts, corroborate the impartiality and validity of results, conduct cross-site analyses, and prepare reports on the progress of the evaluations.

The composition of such an independent oversight committee will depend on the research design of a given program. For example, the committee ought to include statisticians and other specialists in randomized field tests when that approach is being taken. Specialists in survey research and case studies should be recruited if either of those approaches is to be used. Appendix B offers a model for an independent oversight group that has been successfully implemented in other settings—a project review team, or advisory board.

Agency In-House Team

As the parent committee noted in its report, evaluations of AIDS interventions require skills that may be in short supply for agencies invested in delivering services (Turner, Miller, and Moses, 1989:349). Although this situation can be partly alleviated by recruiting professional outside evaluators and retaining an independent oversight group, the panel believes that an in-house team of professionals within the sponsoring agency is also critical. The in-house experts will interact with the outside evaluators and provide input into the selection of projects, outcome objectives, and appropriate research designs; they will also monitor the progress and costs of evaluation. These functions require not just bureaucratic oversight but appropriate scientific expertise.

This is not intended to preclude the direct involvement of CDC staff in conducting evaluations. However, given the great amount of work to be done, it is likely a considerable portion will have to be contracted out. The quality and usefulness of the evaluations done under contract can be greatly enhanced by ensuring that there are an adequate number of CDC staff trained in evaluation research methods to monitor these contracts.

The panel recommends that CDC recruit and retain behavioral, social, and statistical scientists trained in evaluation methodology to facilitate the implementation of the evaluation research recommended in this report.

Interagency Collaboration

The panel believes that the federal agencies that sponsor the design of basic research, intervention programs, and evaluation strategies would profit from greater interagency collaboration. The evaluation of AIDS intervention programs would benefit from a coherent program of studies that should provide models of efficacious and effective interventions to prevent further HIV transmission, the spread of other STDs, and unwanted pregnancies (especially among adolescents). A marriage could then be made of basic and applied science, from which the best evaluation is born. Exploring the possibility of interagency collaboration and CDC's role in such collaboration is beyond the scope of this panel's task, but it is an important issue that we suggest be addressed in the future.

Costs of Evaluation

In view of the dearth of current evaluation efforts, the panel believes that vigorous evaluation research must be undertaken over the next few years to build up a body of knowledge about what interventions can and cannot do. Dedicating no resources to evaluation will virtually guarantee that high-quality evaluations will be infrequent and the data needed for policy decisions will be sparse or absent. Yet, evaluating every project is not feasible simply because there are not enough resources and, in many cases, evaluating every project is not necessary for good science or good policy.

The panel believes that evaluating only some of a program's sites or projects, selected under the criteria noted in Chapter 4 , is a sensible strategy. Although we recommend that intensive evaluation be conducted on only a subset of carefully chosen projects, we believe that high-quality evaluation will require a significant investment of time, planning, personnel, and financial support. The panel's aim is to be realistic—not discouraging—when it notes that the costs of program evaluation should not be underestimated. Many of the research strategies proposed in this report require investments that are perhaps greater than has been previously contemplated. This is particularly the case for outcome evaluations, which are ordinarily more difficult and expensive to conduct than formative or process evaluations. And those costs will be additive with each type of evaluation that is conducted.

Panel members have found that the cost of an outcome evaluation sometimes equals or even exceeds the cost of actual program delivery. For example, it was reported to the panel that randomized studies used to evaluate recent manpower training projects cost as much as the projects themselves (see Cottingham and Rodriguez, 1987). In another case, the principal investigator of an ongoing AIDS prevention project told the panel that the cost of randomized experimentation was approximately three times higher than the cost of delivering the intervention (albeit the study was quite small, involving only 104 participants) (Kelly et al., 1989). Fortunately, only a fraction of a program's projects or sites need to be intensively evaluated to produce high-quality information, and not all will require randomized studies.

Because of the variability in kinds of evaluation that will be done as well as in the costs involved, there is no set standard or rule for judging what fraction of a total program budget should be invested in evaluation. Based upon very limited data 10 and assuming that only a small sample of projects would be evaluated, the panel suspects that program managers might reasonably anticipate spending 8 to 12 percent of their intervention budgets to conduct high-quality evaluations (i.e., formative, process, and outcome evaluations). 11 Larger investments seem politically infeasible and unwise in view of the need to put resources into program delivery. Smaller investments in evaluation risk studying an inadequate sample of program types and may also invite compromises in research quality.
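Translated into a budget line, the 8 to 12 percent range implies figures like the following; the program budget used in this sketch is purely hypothetical.

```python
# Illustrative arithmetic only: evaluation set-aside implied by an
# 8-12 percent share of a hypothetical intervention budget.
intervention_budget = 10_000_000  # hypothetical annual program budget, in dollars
low_share, high_share = 0.08, 0.12

print(f"evaluation budget: ${intervention_budget * low_share:,.0f} "
      f"to ${intervention_budget * high_share:,.0f}")
# -> evaluation budget: $800,000 to $1,200,000
```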

The nature of the HIV/AIDS epidemic mandates an unwavering commitment to prevention programs, and the prevention activities require a similar commitment to the evaluation of those programs. The magnitude of what can be learned from doing good evaluations will more than balance the magnitude of the costs required to perform them. Moreover, it should be realized that the costs of shoddy research can be substantial, both in their direct expense and in the lost opportunities to identify effective strategies for AIDS prevention. Once the investment has been made, however, and a reservoir of findings and practical experience has accumulated, subsequent evaluations should be easier and less costly to conduct.

  • Bandura, A. (1977) Self-efficacy: Toward a unifying theory of behavioral change . Psychological Review 34:191-215. [ PubMed : 847061 ]
  • Campbell, D. T., and Stanley, J. C. (1966) Experimental and Quasi-Experimental Design and Analysis . Boston: Houghton-Mifflin.
  • Centers for Disease Control (CDC) (1988) Sourcebook presented at the National Conference on the Prevention of HIV Infection and AIDS Among Racial and Ethnic Minorities in the United States (August).
  • Cohen, J. (1988) Statistical Power Analysis for the Behavioral Sciences . 2nd ed. Hillsdale, NJ.: L. Erlbaum Associates.
  • Cook, T., and Campbell, D. T. (1979) Quasi-Experimentation: Design and Analysis for Field Settings . Boston: Houghton-Mifflin.
  • Federal Judicial Center (1981) Experimentation in the Law . Washington, D.C.: Federal Judicial Center.
  • Janz, N. K., and Becker, M. H. (1984) The health belief model: A decade later . Health Education Quarterly 11 (1):1-47. [ PubMed : 6392204 ]
  • Kelly, J. A., St. Lawrence, J. S., Hood, H. V., and Brasfield, T. L. (1989) Behavioral intervention to reduce AIDS risk activities . Journal of Consulting and Clinical Psychology 57:60-67. [ PubMed : 2925974 ]
  • Meier, P. (1957) Safety testing of poliomyelitis vaccine . Science 125(3257): 1067-1071. [ PubMed : 13432758 ]
  • Roethlisberger, F. J. and Dickson, W. J. (1939) Management and the Worker . Cambridge, Mass.: Harvard University Press.
  • Rossi, P. H., and Freeman, H. E. (1982) Evaluation: A Systematic Approach . 2nd ed. Beverly Hills, Cal.: Sage Publications.
  • Turner, C. F., Miller, H. G., and Moses, L. E., eds. (1989) AIDS, Sexual Behavior, and Intravenous Drug Use . Report of the NRC Committee on AIDS Research and the Behavioral, Social, and Statistical Sciences. Washington, D.C.: National Academy Press. [ PubMed : 25032322 ]
  • Weinstein, M. C., Graham, J. D., Siegel, J. E., and Fineberg, H. V. (1989) Cost-effectiveness analysis of AIDS prevention programs: Concepts, complications, and illustrations . In C. F. Turner, H. G. Miller, and L. E. Moses, eds., AIDS, Sexual Behavior, and Intravenous Drug Use . Report of the NRC Committee on AIDS Research and the Behavioral, Social, and Statistical Sciences. Washington, D.C.: National Academy Press. [ PubMed : 25032322 ]
  • Weiss, C. H. (1972) Evaluation Research . Englewood Cliffs, N.J.: Prentice-Hall, Inc.

1. On occasion, nonparticipants observe behavior during or after an intervention. Chapter 3 introduces this option in the context of formative evaluation.

2. The use of professional customers can raise serious concerns in the eyes of project administrators at counseling and testing sites. The panel believes that site administrators should receive advance notification that professional customers may visit their sites for testing and counseling services and provide their consent before this method of data collection is used.

3. Parts of this section are adapted from Turner, Miller, and Moses (1989:324-326).

4. This weakness has been noted by CDC in a sourcebook provided to its HIV intervention project grantees (CDC, 1988:F-14).

5. The significance tests applied to experimental outcomes calculate the probability that any observed differences between the sample estimates might result from random variations between the groups.

6. Research participants' knowledge that they were being observed had a positive effect on their responses in a series of famous studies conducted at Western Electric's Hawthorne Works near Chicago (Roethlisberger and Dickson, 1939); the phenomenon is referred to as the Hawthorne effect.

7. Participants who self-select into a program are likely to be different from nonrandomized comparison groups in terms of interests, motivations, values, abilities, and other attributes that can bias the outcomes.

8. A double-blind test is one in which neither the person receiving the treatment nor the person administering it knows which treatment (or no treatment) is being given.

9. As discussed under "Agency In-House Team," the outside evaluator might be one of CDC's personnel. However, given the large amount of research to be done, it is likely that non-CDC evaluators will also need to be used.

10. See, for example, Chapter 3, which presents cost estimates for evaluations of media campaigns. Similar estimates are not readily available for other program types.

11. For example, the U.K. Health Education Authority (that country's primary agency for AIDS education and prevention programs) allocates 10 percent of its AIDS budget for research and evaluation of its AIDS programs (D. McVey, Health Education Authority, personal communication, June 1990). This allocation covers both process and outcome evaluation.


Title: A Conceptual Framework for Ethical Evaluation of Machine Learning Systems

Abstract: Research in Responsible AI has developed a range of principles and practices to ensure that machine learning systems are used in a manner that is ethical and aligned with human values. However, a critical yet often neglected aspect of ethical ML is the ethical implications that appear when designing evaluations of ML systems. For instance, teams may have to balance a trade-off between highly informative tests to ensure downstream product safety, with potential fairness harms inherent to the implemented testing procedures. We conceptualize ethics-related concerns in standard ML evaluation techniques. Specifically, we present a utility framework, characterizing the key trade-off in ethical evaluation as balancing information gain against potential ethical harms. The framework is then a tool for characterizing challenges teams face, and systematically disentangling competing considerations that teams seek to balance. Differentiating between different types of issues encountered in evaluation allows us to highlight best practices from analogous domains, such as clinical trials and automotive crash testing, which navigate these issues in ways that can offer inspiration to improve evaluation processes in ML. Our analysis underscores the critical need for development teams to deliberately assess and manage ethical complexities that arise during the evaluation of ML systems, and for the industry to move towards designing institutional policies to support ethical evaluations.
Subjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Software Engineering (cs.SE)
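The abstract does not spell out the framework's mathematics, so the following is only one plausible reading of the stated trade-off (information gain versus potential ethical harms), not the paper's actual formalization; all names and numbers are hypothetical.

```python
# Hypothetical reading of the trade-off named in the abstract -- NOT the
# paper's formalization: score an evaluation design by the information it
# yields minus a weighted sum of the ethical harms it risks.
def evaluation_utility(information_gain: float,
                       harms: dict[str, float],
                       harm_weights: dict[str, float]) -> float:
    penalty = sum(harm_weights.get(kind, 1.0) * severity for kind, severity in harms.items())
    return information_gain - penalty

# Two hypothetical test designs for the same ML system: a highly informative
# but intrusive test vs. a less informative, lower-harm one.
print(evaluation_utility(0.9, {"privacy": 0.3, "fairness": 0.4}, {"fairness": 2.0}))  # about -0.2
print(evaluation_utility(0.7, {"privacy": 0.1, "fairness": 0.1}, {"fairness": 2.0}))  # about 0.4 (preferred)
```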


Meta working on a Self-Taught Evaluator for LLMs

The new approach trains LLMs to create their own training data for evaluation purposes.


Facebook parent Meta’s AI research team is working on developing what it calls a Self-Taught Evaluator for large language models (LLMs) that could help enterprises reduce their time and human resource requirements while developing custom LLMs.  

Earlier this month, the social media giant's AI research team, dubbed Meta FAIR, published a paper on the technology, which claims that these evaluators could help an LLM create its own training data — synthetic data — for evaluation purposes.

Typically, models that are used as evaluators, known as LLM-as-a-Judge, are trained with large amounts of data annotated by humans, which is a costly affair, and the data becomes stale as the model improves, the researchers explained in the paper.

Human annotation is still required, or at least preferred over LLM-generated responses, because models cannot yet reliably resolve challenging tasks such as coding or mathematics problems, the researchers said; this dependency on human-generated data, they added, poses significant challenges for scaling to new tasks or evaluation criteria.

The researchers used only synthetic data generated by an LLM in an iterative manner, without the need for labeling instructions.

“Starting from unlabeled instructions, our iterative self-improvement scheme generates contrasting model outputs and trains an LLM-as-a-Judge to produce reasoning traces and final judgments, repeating this training at each new iteration using the improved predictions,” the researchers wrote.

Explaining further, the researchers said they started with a seed model and used prompt engineering to generate contrasting synthetic preference pairs for a given input, such that one response is designed to be inferior to the other.

After that, the researchers used the model as an LLM-as-a-Judge to generate reasoning traces and judgments for these pairs, which they could label as correct or not given the synthetic preference pair design.

“After training on this labeled data, we obtain a superior LLM-as-a-Judge, from which we can then iterate the whole process in order for it to self-improve,” they wrote.
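Pieced together from that description, the training loop might look roughly like the sketch below. Every function name here is a hypothetical stand-in for a step Meta describes in prose (in the paper, generation and judging are done by the model being trained, and the training step is supervised fine-tuning); none of this is actual Meta code or an actual API.

```python
# Rough sketch of the Self-Taught Evaluator loop as described in the article.
# All helpers are hypothetical stand-ins; "model" is any callable mapping a
# prompt string to a response string.

def generate_contrasting_pair(model, instruction):
    # Prompt for a good response and a deliberately degraded one, so the
    # preferred response is known by construction (no human labels needed).
    good = model("Answer well: " + instruction)
    bad = model("Answer poorly: " + instruction)
    return good, bad

def judge(model, instruction, response_a, response_b):
    # LLM-as-a-Judge: the model produces a reasoning trace and a verdict.
    return model(f"Judge: {instruction}\nA: {response_a}\nB: {response_b}")

def fine_tune(model, examples):
    # Stand-in for supervised fine-tuning on the retained judgment traces.
    return model

def self_taught_evaluator(model, unlabeled_instructions, iterations=3):
    for _ in range(iterations):
        training_examples = []
        for instruction in unlabeled_instructions:
            good, bad = generate_contrasting_pair(model, instruction)
            verdict = judge(model, instruction, good, bad)
            # Keep only judgments that pick the response known to be better;
            # these become the synthetic training data for the next round.
            if verdict.strip().endswith("A"):
                training_examples.append((instruction, good, bad, verdict))
        model = fine_tune(model, training_examples)  # a better judge each iteration
    return model

# Toy run just to show the control flow; a trivial "model" stands in for an LLM.
toy_model = lambda prompt: "A" if prompt.startswith("Judge:") else prompt
self_taught_evaluator(toy_model, ["Summarize the benefits of randomized trials."])
```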

As part of their experiments, the researchers at Meta claimed in the paper that without any labeled preference data, the Self-Taught Evaluator improved Llama3-70B-Instruct’s score on the RewardBench benchmarking tool from 75.4 to 88.3.

The change in score, they said, outperforms commonly used LLM judges such as GPT-4 and matches the performance of the top-performing reward models trained with labeled examples.

However, the researchers also pointed out that the approach has some limitations. They did not test it on smaller models (their test models had 70 billion parameters) and did not consider any computational requirement concerns, only accuracy.

“Generative LLM-as-a-Judge models usually have longer outputs and thus higher inference cost than reward models that simply output a score, as LLM-as-a-Judge typically first generates a reasoning chain,” they wrote.

Additionally, they pointed out that since they used a seed model to generate the first synthetic preferences during their iterative training scheme, they assumed that the model was capable of generating reasonable evaluations.

“Thus, our approach is limited by having a capable instruction fine-tuned model which is already reasonably aligned to human (or legal/policy) preferences,” they explained.


Anirban Ghoshal

Anirban Ghoshal is a senior writer covering enterprise software for CIO.com and databases and cloud and AI infrastructure for InfoWorld.

