
What is a Theoretical Framework? How to Write It (with Examples) 

A theoretical framework1,2 is the structure that supports and describes a theory. A theory is a set of interrelated concepts and definitions that presents a systematic view of a phenomenon by describing the relationships among the variables that explain it. A theory is developed through a long research process and explains why the research problem in a study exists. A theoretical framework guides the research process like a roadmap and helps researchers interpret their findings clearly by providing a structure for organizing data and developing conclusions.

A theoretical framework in research is an important part of a manuscript and should be presented in the first section. It shows an understanding of the theories and concepts relevant to the research and helps limit the scope of the research.  

What is a theoretical framework?

A theoretical framework in research can be defined as a set of concepts, theories, ideas, and assumptions that help you understand a specific phenomenon or problem. It can be considered a blueprint that is borrowed by researchers to develop their own research inquiry. A theoretical framework in research helps researchers design and conduct their research and analyze and interpret their findings. It explains the relationship between variables, identifies gaps in existing knowledge, and guides the development of research questions, hypotheses, and methodologies to address that gap.  


Now that you know the answer to ‘What is a theoretical framework?’, check the following table, which lists the different types of theoretical frameworks in research:3

   
Type of framework  Description 
Conceptual  Defines key concepts and relationships 
Deductive  Starts with a general hypothesis and then uses data to test it; used in quantitative research 
Inductive  Starts with data and then develops a hypothesis; used in qualitative research 
Empirical  Focuses on the collection and analysis of empirical data; used in scientific research 
Normative  Defines a set of norms that guide behavior; used in ethics and the social sciences 
Explanatory  Explains the causes of a particular behavior; used in psychology and the social sciences 

Developing a theoretical framework in research can help in the following situations:4

  • When conducting research on complex phenomena because a theoretical framework helps organize the research questions, hypotheses, and findings  
  • When the research problem requires a deeper understanding of the underlying concepts  
  • When conducting research that seeks to address a specific gap in knowledge  
  • When conducting research that involves the analysis of existing theories  


Importance of a theoretical framework  

The purpose of a theoretical framework is to support you in the following ways during the research process:2

  • Provide a structure for the complete research process  
  • Assist researchers in incorporating formal theories into their study as a guide  
  • Provide a broad guideline to maintain the research focus  
  • Guide the selection of research methods, data collection, and data analysis  
  • Help understand the relationships between different concepts and develop hypotheses and research questions  
  • Address gaps in existing literature  
  • Analyze the collected data, draw meaningful conclusions, and make the findings more generalizable  

Theoretical vs. Conceptual framework  

While a theoretical framework covers the theoretical aspect of your study, that is, the various theories that can guide your research, a conceptual framework defines the variables for your study and shows how they relate to each other. The conceptual framework is developed before data collection. Both frameworks, however, help in understanding the research problem and guide the development, collection, and analysis of research data.

The following table lists some differences between conceptual and theoretical frameworks.5

   
Theoretical framework  Conceptual framework 
Based on existing theories that have been tested and validated by others  Based on concepts that are the main variables in the study 
Used to create a foundation of the theory on which your study will be developed  Visualizes the relationships between the concepts and variables based on the existing literature 
Used to test theories, to predict and control the situations within the context of a research inquiry  Helps the development of a theory that would be useful to practitioners 
Provides a general set of ideas within which a study belongs  Refers to specific ideas that researchers utilize in their study 
Offers a focal point for approaching unknown research in a specific field of inquiry  Shows logically how the research inquiry should be undertaken 
Works deductively  Works inductively 
Used in quantitative studies  Used in qualitative studies 


How to write a theoretical framework  

The following general steps can help those wondering how to write a theoretical framework:2

  • Identify and define the key concepts clearly and organize them into a suitable structure.  
  • Use appropriate terminology and define all key terms to ensure consistency.  
  • Identify the relationships between concepts and provide a logical and coherent structure.  
  • Develop hypotheses that can be tested through data collection and analysis.  
  • Keep it concise and focused with clear and specific aims.  


Examples of a theoretical framework  

Here are two examples of a theoretical framework.6,7

Example 1.

An insurance company is facing a challenge in cross-selling its products. The sales department indicates that most customers hold just one policy, although the company offers more than 10 unique policies. The company wants its customers to purchase more than one policy because most customers are purchasing additional policies from other companies.

Objective: To sell more insurance products to existing customers.  

Problem: Many customers are purchasing additional policies from other companies.  

Research question: How can customer product awareness be improved to increase cross-selling of insurance products?  

Sub-questions: What is the relationship between product awareness and sales? Which factors determine product awareness?  

Since “product awareness” is the main focus in this study, the theoretical framework should analyze this concept and study previous literature on this subject and propose theories that discuss the relationship between product awareness and its improvement in sales of other products.  

Example 2.

A company is facing a continued decline in sales and profitability. The main reason for the decline is poor service, which has resulted in high customer dissatisfaction and, consequently, a decline in customer loyalty. The management is planning to concentrate on customer satisfaction and loyalty.

Objective: To provide better service to customers and increase customer loyalty and satisfaction.  

Problem: Continued decrease in sales and profitability.  

Research question: How can customer satisfaction help in increasing sales and profitability?  

Sub-questions: What is the relationship between customer loyalty and sales? Which factors influence the level of satisfaction gained by customers?  

Since customer satisfaction, loyalty, profitability, and sales are the important topics in this example, the theoretical framework should focus on these concepts.  

Benefits of a theoretical framework  

There are several benefits of a theoretical framework in research:2

  • Provides a structured approach, allowing researchers to organize their thoughts coherently.  
  • Helps identify gaps in knowledge, highlighting areas where further research is needed.  
  • Increases research efficiency by providing a clear direction and focusing efforts on relevant data.  
  • Improves the quality of research by providing a rigorous and systematic approach, which increases the likelihood of producing valid and reliable results.  
  • Provides a basis for comparison by offering a common language and conceptual structure, allowing researchers to compare their findings with other research in the field and facilitating the exchange of ideas and the development of new knowledge.  


Frequently Asked Questions 

Q1. How do I develop a theoretical framework?7

A1. The following steps can be used to develop a theoretical framework:  

  • Identify the research problem and research questions by clearly defining the problem that the research aims to address and identifying the specific questions that the research aims to answer.
  • Review the existing literature to identify the key concepts that have been studied previously. These concepts should be clearly defined and organized into a structure.
  • Develop propositions that describe the relationships between the concepts. These propositions should be based on the existing literature and should be testable.
  • Develop hypotheses that can be tested through data collection and analysis.
  • Test the theoretical framework through data collection and analysis to determine whether the framework is valid and reliable.

Q2. How do I know whether I have developed a good theoretical framework?8

A2. The following checklist could help you answer this question:  

  • Is my theoretical framework clearly seen as emerging from my literature review?  
  • Is it the result of my analysis of the main theories previously studied in my same research field?  
  • Does it represent or is it relevant to the most current state of theoretical knowledge on my topic?  
  • Does the theoretical framework in research present a logical, coherent, and analytical structure that will support my data analysis?  
  • Do the different parts of the theory help analyze the relationships among the variables in my research?  
  • Does the theoretical framework target how I will answer my research questions or test the hypotheses?  
  • Have I documented every source I have used in developing this theoretical framework?  
  • Is my theoretical framework a model, a table, a figure, or a description?  
  • Have I explained why this is the appropriate theoretical framework for my data analysis?  

Q3. Can I use multiple theoretical frameworks in a single study?  

A3. Using multiple theoretical frameworks in a single study is acceptable as long as each theory is clearly defined and related to the study. Each theory should also be discussed individually. This approach may, however, be tedious and effort intensive. Therefore, multiple theoretical frameworks should be used only if absolutely necessary for the study.  

Q4. Is it necessary to include a theoretical framework in every research study?  

A4. A theoretical framework connects researchers to existing knowledge, so including one helps researchers get a clear idea of the research process and structure their study effectively by clearly defining an objective, a research problem, and a research question.

Q5. Can a theoretical framework be developed for qualitative research?  

A5. Yes, a theoretical framework can be developed for qualitative research. However, qualitative research methods may or may not involve a theory developed beforehand. In such studies, a framework can instead emerge during the data analysis phase through inductive reasoning; the outcome of this approach is referred to as an emergent theoretical framework. This method helps researchers develop a theory inductively, explaining a phenomenon without a guiding framework at the outset.


Q6. What is the main difference between a literature review and a theoretical framework?

A6. A literature review explores already existing studies about a specific topic in order to highlight a gap, which becomes the focus of the current research study. A theoretical framework can be considered the next step in the process, in which the researcher plans a specific conceptual and analytical approach to address the identified gap in the research.  

Theoretical frameworks are thus important components of the research process, and researchers should devote ample time to developing a solid theoretical framework so that it can effectively guide their research in a suitable direction. We hope this article has provided good insight into the concept of theoretical frameworks in research and their benefits.

References  

  1. Organizing academic research papers: Theoretical framework. Sacred Heart University Library. Accessed August 4, 2023. https://library.sacredheart.edu/c.php?g=29803&p=185919  
  2. Salomao A. Understanding what is theoretical framework. Mind the Graph website. Accessed August 5, 2023. https://mindthegraph.com/blog/what-is-theoretical-framework/  
  3. Theoretical framework—Types, examples, and writing guide. Research Method website. Accessed August 6, 2023. https://researchmethod.net/theoretical-framework/  
  4. Grant C, Osanloo A. Understanding, selecting, and integrating a theoretical framework in dissertation research: Creating the blueprint for your “house.” Administrative Issues Journal: Connecting Education, Practice, and Research. 2014;4(2):12-26. Accessed August 7, 2023. https://files.eric.ed.gov/fulltext/EJ1058505.pdf  
  5. Difference between conceptual framework and theoretical framework. MIM Learnovate website. Accessed August 7, 2023. https://mimlearnovate.com/difference-between-conceptual-framework-and-theoretical-framework/  
  6. Example of a theoretical framework—Thesis & dissertation. BachelorPrint website. Accessed August 6, 2023. https://www.bachelorprint.com/dissertation/example-of-a-theoretical-framework/  
  7. Sample theoretical framework in dissertation and thesis—Overview and example. Students Assignment Help website. Accessed August 6, 2023. https://www.studentsassignmenthelp.co.uk/blogs/sample-dissertation-theoretical-framework/#Example_of_the_theoretical_framework  
  8. Kivunja C. Distinguishing between theory, theoretical framework, and conceptual framework: A systematic review of lessons from the field. Accessed August 8, 2023. https://files.eric.ed.gov/fulltext/EJ1198682.pdf  



Beginner's Guide to Analytical Frameworks

Introduction

Analytical frameworks are essential tools for understanding complex problems, making informed decisions, and developing effective strategies. This guide introduces beginners to the fundamental concepts of analytical frameworks, explores their various types, and provides insights into how they can be applied in practical scenarios. Whether you're a student, a professional, or someone with a keen interest in analytics, this article will equip you with the knowledge to confidently navigate the world of analytical frameworks.

Key Highlights

  • Importance of analytical frameworks in problem-solving
  • Overview of different types of analytical frameworks
  • Step-by-step guide on applying analytical frameworks
  • Real-world examples of analytical framework application
  • Tips for choosing the right framework for your needs

Understanding Analytical Frameworks

Before we delve into the intricacies of analytical frameworks, let's establish a foundational understanding. These frameworks are not just tools; they are the compasses that guide us through the complex landscapes of data, decisions, and strategic planning. They empower us to dissect and understand multifaceted problems, ensuring our decisions are informed, strategic, and impactful. This section paves the way for a deeper exploration, illuminating the what, the why, and the how of analytical frameworks.

Definition and Purpose of Analytical Frameworks

At their core, analytical frameworks are structured tools used to systematically analyze and interpret information. They serve multiple purposes across various fields, from business strategy and market analysis to social research and policy development. For instance, a business might use a SWOT (Strengths, Weaknesses, Opportunities, Threats) analysis to evaluate its competitive position in the market or to plan its growth strategy.

These frameworks help by providing a clear methodology for breaking down complex issues into manageable components, making it easier to identify the root causes of problems, understand relationships, and predict future trends. By applying an analytical framework like the PESTLE (Political, Economic, Social, Technological, Legal, Environmental) analysis, organizations can assess external factors that could impact their operations, enabling proactive rather than reactive strategies.

Importance in Decision-Making

The power of analytical frameworks lies in their ability to aid in making informed decisions. By systematically breaking down complex problems, these frameworks ensure that every aspect of a situation is considered. For example, Porter’s Five Forces framework allows businesses to analyze their industry's competitive forces to strategize accordingly.

This methodical approach to decision-making not only enhances the quality of the decisions but also increases the decision-makers' confidence. It transforms guesswork into a structured analysis, providing a clear path from problem identification to solution implementation. In essence, analytical frameworks turn the daunting task of decision-making into a clear, manageable process.

Common Misconceptions

Despite their utility, several misconceptions about analytical frameworks persist. One common belief is that they are only useful for large corporations or complex research projects. However, these tools are equally beneficial for small businesses or even individual decision-making processes. They scale to fit the problem at hand, whether it's choosing a new product line or planning a career move.

Another misconception is that using an analytical framework guarantees success. While these frameworks can significantly improve the decision-making process, they are not a magic bullet. Success also depends on the quality of input data and the decision-maker's judgment. It's crucial to approach these frameworks as guides rather than definitive answers to complex problems. By clearing these misunderstandings, beginners can more effectively leverage analytical frameworks to their advantage.

Exploring Analytical Frameworks: A Beginner's Guide

In the vast expanse of business strategy and analysis, understanding and applying the right analytical frameworks can be the beacon that guides organizations through the complexities of decision-making. This section delves into some of the most influential frameworks, offering a doorway for beginners to step into the realm of strategic analysis.

Diving Deep into SWOT Analysis

SWOT Analysis stands as a cornerstone of strategic planning, offering a clear framework to evaluate the Strengths, Weaknesses, Opportunities, and Threats associated with a project or business scenario. Its application spans various sectors, from business strategy to personal career planning.

Practical Application: Imagine a startup in the tech industry. By applying SWOT, they can pinpoint their innovative technology (Strength), limited market presence (Weakness), emerging markets (Opportunity), and established competitors (Threat). This comprehensive analysis aids in crafting strategic plans that leverage strengths, address weaknesses, exploit opportunities, and mitigate threats.

Benefits: The simplicity and versatility of SWOT Analysis make it an essential tool for strategic planning. It provides actionable insights that can guide decision-making processes, from market entry strategies to product development.
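For readers who like to see things concretely, the four SWOT categories map naturally onto a small data structure. The Python sketch below is purely illustrative (the `SWOT` class and its field names are our own, not part of any standard library), populated with the hypothetical tech-startup example above:

```python
from dataclasses import dataclass, field


@dataclass
class SWOT:
    """Container for a simple SWOT analysis (illustrative sketch)."""
    strengths: list = field(default_factory=list)
    weaknesses: list = field(default_factory=list)
    opportunities: list = field(default_factory=list)
    threats: list = field(default_factory=list)

    def summary(self) -> str:
        """Return a short text summary, one line per category."""
        return "\n".join(
            f"{name}: {', '.join(items) or 'none listed'}"
            for name, items in [
                ("Strengths", self.strengths),
                ("Weaknesses", self.weaknesses),
                ("Opportunities", self.opportunities),
                ("Threats", self.threats),
            ]
        )


# The hypothetical tech-startup scenario from the text:
startup = SWOT(
    strengths=["innovative technology"],
    weaknesses=["limited market presence"],
    opportunities=["emerging markets"],
    threats=["established competitors"],
)
print(startup.summary())
```

Capturing the analysis as data rather than free text makes it easy to revisit and update each category as the situation evolves.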

Unpacking PESTLE Analysis

PESTLE Analysis offers a macro-environmental framework, analyzing Political, Economic, Social, Technological, Legal, and Environmental factors impacting businesses. It's particularly valuable for assessing market growth or decline, business position, and direction for operations.

Example in Action: Consider a multinational corporation assessing a new market entry. Through PESTLE, they evaluate the political stability of the country, economic trends, cultural attitudes towards foreign businesses, technological infrastructure, legal barriers to entry, and environmental regulations. This holistic view enables informed strategic decisions, aligning business objectives with external realities.

Real-World Application Examples: By understanding the broader macro-environmental factors, businesses can anticipate changes and adapt strategies accordingly. For instance, a company might shift its supply chain strategy in anticipation of changes in environmental regulations.

Exploring Porter’s Five Forces

Porter’s Five Forces provides a framework for analyzing the competitive forces in an industry, including competition intensity, potential of new entrants, power of suppliers, power of customers, and threat of substitute products. It’s especially useful for evaluating the strategic position of a business within its industry.

Application Insight: An e-commerce business may use Porter’s Five Forces to assess its competitive landscape. By analyzing the bargaining power of suppliers (can suppliers dictate terms?), bargaining power of customers (do customers have choices?), threat of new entrants (is it easy for new players to enter the market?), threat of substitutes (are there alternative products?), and competitive rivalry (how intense is the competition?), the business can identify strategic opportunities and threats.

Relevance and Application: This framework helps businesses understand the dynamics of their industry and shape strategies that exploit competitive advantages or address vulnerabilities. For example, a business might decide to focus on customer loyalty programs to counter strong competition and the threat of substitutes.

Applying Analytical Frameworks Effortlessly

Understanding the theoretical aspects of analytical frameworks is a critical first step. However, the real power of these frameworks is unleashed only when they are applied effectively to real-world scenarios. This section is designed to bridge the gap between theory and practice, offering a concise, step-by-step guide that caters to the application of these frameworks across various situations.

A Step-by-Step Guide to Applying Analytical Frameworks

1. Identify the Objective: Begin by clearly defining what you wish to achieve with the analysis. A clear objective guides the selection of the appropriate framework.

2. Select the Right Framework: Based on the objective, choose a framework that best fits the problem at hand. For instance, SWOT Analysis is ideal for evaluating project feasibility, while PESTLE Analysis can help understand the macro-environmental factors affecting a project.

3. Gather Relevant Data: Collect all necessary information that will feed into the framework. This could range from internal performance metrics to external market research.

4. Apply the Framework: Methodically fill in the sections of the chosen framework with the collected data. This process often reveals insights that were not apparent before.

5. Analyze the Results: Look for patterns, strengths, weaknesses, opportunities, and threats that emerge from the analysis.

6. Make Informed Decisions: Use the insights gained to guide strategic decisions, ensuring they are backed by systematic analysis.
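The steps above can be sketched as a tiny pipeline. The Python sketch below is a loose illustration under our own assumptions (the `apply_framework` helper and its predicate-based sections are invented for this example, not an established API): it performs the "apply the framework" step, methodically filling each section of a chosen framework with collected observations so the grouped result can feed the analysis and decision steps.

```python
def apply_framework(sections, observations):
    """Group raw observations into the chosen framework's sections.

    `sections` maps a section name to a predicate function; any
    observation matching no section is kept aside for review
    rather than silently dropped.
    """
    result = {name: [] for name in sections}
    result["unclassified"] = []
    for obs in observations:
        for name, matches in sections.items():
            if matches(obs):
                result[name].append(obs)
                break
        else:  # no section matched this observation
            result["unclassified"].append(obs)
    return result


# Usage: a toy SWOT-style run for a market-entry objective.
sections = {
    "internal": lambda o: o["origin"] == "internal",
    "external": lambda o: o["origin"] == "external",
}
observations = [
    {"origin": "internal", "note": "strong engineering team"},
    {"origin": "external", "note": "new competitor entering"},
]
grouped = apply_framework(sections, observations)
print(grouped["internal"][0]["note"])  # prints: strong engineering team
```

Keeping unmatched observations visible (rather than discarding them) mirrors the guidance below about not letting the framework obscure data that doesn't fit neatly.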

Real-World Applications: Case Studies on Analytical Frameworks

Analytical frameworks have been successfully applied across industries to drive strategic decisions. For example, a tech startup might use SWOT Analysis to navigate its market entry strategy, identifying key opportunities and threats in the competitive landscape. Similarly, a multinational corporation might employ PESTLE Analysis to assess the viability of entering a new geographical market, taking into consideration political, economic, social, technological, legal, and environmental factors. These real-world applications underscore the versatility and utility of analytical frameworks in providing structured insights that inform strategic decisions.

Avoiding Common Pitfalls in Applying Analytical Frameworks

While analytical frameworks are powerful tools, their effectiveness can be compromised by common mistakes. Here are a few to avoid:

Overlooking Context: Applying a framework without adjusting for the specific context of the problem can lead to misleading conclusions. Always tailor the framework to fit the unique aspects of the scenario.

Data Overload: While thorough data collection is crucial, overwhelming the framework with too much information can obscure key insights. Focus on data that directly impacts the objective.

Ignoring External Inputs: Solely relying on internal perspectives can limit the scope of analysis. Incorporate external viewpoints and data sources to enrich the analysis.

By being mindful of these pitfalls and adopting a structured approach to applying analytical frameworks, organizations can significantly enhance their decision-making processes.

Choosing the Right Analytical Framework

Selecting the most suitable analytical framework from the plethora available can seem like a daunting task. This section is aimed at demystifying this process, providing clear, actionable insights that would guide you in identifying the framework that best matches the unique requirements of your project or problem. Let's dive into the factors that should influence your decision and how to compare different frameworks effectively.

Factors to Consider

Understanding Your Objectives: The first step in choosing the right analytical framework is being clear about what you aim to achieve. Are you looking to understand consumer behavior, or are you more interested in the competitive landscape of your industry? For instance, a SWOT Analysis might be perfect for strategic planning, while Porter’s Five Forces could provide deeper insights into competitive dynamics.

Complexity of the Data: The nature and complexity of the data at hand also play a crucial role. A framework like PESTLE Analysis is fantastic for macro-environmental insights but might be too broad if your data is highly specific or technical. In such cases, a more niche framework, perhaps a statistical model or a custom-built analytical tool, might be more appropriate.

Resource Availability: Consider the resources you have at your disposal, including time, financial resources, and expertise. Some frameworks require more extensive data collection and analysis than others. For example, conducting a comprehensive Market Analysis might be resource-intensive, but essential for entering new markets.

Comparing Frameworks

Comparing different analytical frameworks involves looking at their strengths and weaknesses in various contexts. For example, while SWOT Analysis offers a broad overview of internal and external factors, it might not delve deeply into the specifics of competitive rivalry, where Porter’s Five Forces shines.

Practical Application: Consider how each framework has been applied in real-world scenarios similar to yours. For instance, PESTLE Analysis has been widely used by businesses looking to expand internationally to understand the macro-environmental factors affecting their operation in new markets.

Customization and Flexibility: Some frameworks offer more flexibility and can be customized to better fit specific projects. Understanding the extent to which a framework can be tailored to meet your needs is crucial. A mix-and-match approach, combining elements from different frameworks, could also provide a comprehensive analysis.

In conclusion, the choice of analytical framework is not one-size-fits-all. It requires careful consideration of your project's specific needs, the nature of the data, and the resources available to you. By systematically comparing the options, you can select the most suitable framework to gain valuable insights and drive informed decisions.

Best Practices and Tips for Mastering Analytical Frameworks

As we wrap up our beginner's guide to analytical frameworks, it's crucial to focus on best practices and tips that will empower enthusiasts to not only understand but also effectively implement these frameworks in their projects or analyses. The journey from learning to mastery involves continuous improvement and the effective application of strategies that complement the frameworks' strengths.

Crafting Effective Implementation Strategies

Implementing analytical frameworks effectively requires a blend of strategic planning and practical application. Start by defining clear objectives for your analysis, ensuring that the chosen framework aligns with your goals. For instance, if your project involves evaluating competitive market dynamics, Porter’s Five Forces provides a structured way to assess competitors, potential entrants, and substitute products.

  • Engage stakeholders early in the process to gather diverse insights and foster buy-in. A collaborative approach enhances the relevance and acceptance of the analysis.
  • Iterate and adapt your analysis as you gather more data or as the situation evolves. Flexibility is key to uncovering nuanced insights.

Remember, the value of an analytical framework lies in its application to real-world scenarios. For example, applying SWOT Analysis to a startup might reveal opportunities for innovation that were not initially apparent. Regularly revisiting and refining your analysis ensures that your strategies remain aligned with changing circumstances.

Embracing Continuous Learning and Improvement

The landscape of analytical frameworks is ever-evolving, with new methodologies emerging and existing ones being refined. Staying at the forefront of this evolution requires a commitment to continuous learning.

  • Participate in online forums and communities, such as Reddit’s r/dataisbeautiful or Kaggle, to exchange insights with peers.
  • Attend workshops and webinars that focus on the latest trends in analytical frameworks.
  • Practice regularly by applying frameworks to diverse scenarios, even outside your immediate project needs. This not only enhances your skill but also your adaptability.

For example, exploring how PESTLE Analysis can be applied to understand the impact of global events on local markets can provide valuable insights into external factors affecting businesses.

Exploring Additional Resources for Deeper Insights

Diving deeper into analytical frameworks necessitates access to quality resources. Here is a curated list to kickstart your journey beyond this guide:

  • Books : 'Competitive Strategy' by Michael E. Porter offers in-depth insights into industry analysis.
  • Courses : Platforms like Coursera and edX offer courses on strategic business analytics.
  • Online Resources : Websites like MindTools provide practical guides on numerous frameworks.

Leveraging these resources can significantly enhance your understanding and application of analytical frameworks. Whether you're a student, a professional, or an enthusiast, the journey towards mastering these tools is ongoing and deeply rewarding.

Analytical frameworks are powerful tools that, when understood and applied correctly, can significantly enhance the quality of your decisions and strategies. This guide has aimed to provide a comprehensive overview for beginners, covering everything from the basics to practical application tips. With this knowledge, you are now better equipped to navigate the complex world of analytical frameworks, ready to tackle challenges with a structured and informed approach.

Q: What is an analytical framework?

A: An analytical framework is a structured approach used to dissect complex problems, analyze data, and make decisions. It involves applying specific models or strategies to systematically understand and address challenges.

Q: Why are analytical frameworks important?

A: Analytical frameworks are important because they provide a systematic method for analyzing problems and making informed decisions. They help break down complex issues into manageable parts, facilitating clearer understanding and more effective solutions.

Q: Can you give examples of analytical frameworks?

A: Yes, some common examples include SWOT Analysis (Strengths, Weaknesses, Opportunities, Threats), PESTLE Analysis (Political, Economic, Social, Technological, Legal, Environmental), and Porter’s Five Forces.

Q: How do I choose the right analytical framework?

A: Choosing the right analytical framework depends on your specific needs, the nature of the problem, and the context in which you are operating. Consider factors such as the framework’s relevance to your industry, its complexity, and your own familiarity with it.

Q: What are common pitfalls in applying analytical frameworks?

A: Common pitfalls include overly complicating the analysis, using the wrong framework for your specific situation, or relying too heavily on the framework without considering unique aspects of your problem. Avoid these by being mindful of your choice and application.

Q: How can I effectively implement an analytical framework?

A: To effectively implement an analytical framework, start by clearly defining your problem. Gather and analyze relevant data, apply the chosen framework systematically, and use your findings to inform your decision-making or strategy development.

Q: Are analytical frameworks only useful for businesses?

A: No, analytical frameworks can be applied in various fields beyond business, including public policy, education, healthcare, and personal decision-making. They are versatile tools for problem-solving and analysis in any complex situation.


Analytical Research: What is it, Importance + Examples

Analytical research is a type of research that requires critical thinking skills and the examination of relevant facts and information.

The word “research” loosely translates to “finding knowledge”: a systematic, scientific way of investigating a particular subject. Analytical research is one form of this investigation.

Any kind of research is a way to learn new things. In analytical research, data and other pertinent information about a project are assembled; once the information is gathered and assessed, the sources are used to support an idea or prove a hypothesis.

Using critical thinking abilities (identifying a claim or assumption and determining whether it is accurate or untrue), an individual can draw on minor facts to reach more significant conclusions about the subject matter.

What is analytical research?

This particular kind of research calls for using critical thinking abilities and assessing data and information pertinent to the project at hand.

It determines the causal connections between two or more variables. For example, an analytical study might aim to identify the causes and mechanisms underlying the trade deficit’s movement throughout a given period.

It is used by various professionals, including psychologists, doctors, and students, to identify the most pertinent material during investigations. One learns crucial information from analytical research that helps them contribute fresh concepts to the work they are producing.

Some researchers perform it to uncover information that supports ongoing research to strengthen the validity of their findings. Other scholars engage in analytical research to generate fresh perspectives on the subject.

Various approaches to performing analytical research include literary analysis, gap analysis, general public surveys, clinical trials, and meta-analysis.

Importance of analytical research

The goal of analytical research is to develop new ideas that are more believable by combining numerous minute details.

The analytical investigation is what explains why a claim should be trusted. Finding out why something occurs is complex, and it requires evaluating information critically.

This kind of information aids in proving the validity of a theory or supporting a hypothesis. It assists in recognizing a claim and determining whether it is true.

Analytical research is valuable to many people, including students, psychologists, marketers, and others. It aids in determining which advertising initiatives within a firm perform best. Meanwhile, in medicine, careful research design helps determine how well a particular treatment performs.

Thus, analytical research can help people achieve their goals while saving lives and money.

Methods of Conducting Analytical Research

Analytical research is the process of gathering, analyzing, and interpreting information to make inferences and reach conclusions. Depending on the purpose of the research and the data you have access to, you can conduct analytical research using a variety of methods. Here are a few typical approaches:

Quantitative research

Numerical data are gathered and analyzed using this method. Statistical methods are then used to analyze the information, which is often collected using surveys, experiments, or pre-existing datasets. Results from quantitative research can be measured, compared, and generalized numerically.

Qualitative research

In contrast to quantitative research, qualitative research focuses on collecting non-numerical information. It gathers detailed information using techniques like interviews, focus groups, observations, or content research. Understanding social phenomena, exploring experiences, and revealing underlying meanings and motivations are all goals of qualitative research.

Mixed methods research

This strategy combines quantitative and qualitative methodologies to grasp a research problem thoroughly. Mixed methods research often entails gathering and evaluating both numerical and non-numerical data, integrating the results, and offering a more comprehensive viewpoint on the research issue.

Experimental research

Experimental research is frequently employed in scientific trials and investigations to establish causal links between variables. This approach entails modifying variables in a controlled environment to identify cause-and-effect connections. Researchers randomly divide volunteers into several groups, provide various interventions or treatments, and track the results.

Observational research

With this approach, behaviors or occurrences are observed and methodically recorded without outside interference or manipulation of variables. Observational research can take place in both controlled environments and naturalistic settings. It offers useful insights into real-world behaviors and enables researchers to explore events as they naturally occur.

Case study research

This approach entails thorough research of a single case or a small group of related cases. Case studies frequently draw on a variety of information sources, including observations, records, and interviews. They offer rich, in-depth insights and are particularly helpful for researching complex phenomena in practical settings.

Secondary data analysis

With this approach, researchers examine data that was previously gathered for a different purpose, such as earlier cohort studies, accessible databases, or corporate documents. Examining secondary information is time- and cost-efficient, and it enables researchers to explore new research issues or confirm prior findings.

Content analysis

Content analysis is frequently employed in the social sciences and media studies. This approach systematically examines the content of texts, including media, speeches, and written documents. Researchers identify and categorize themes, patterns, or keywords to make inferences about the content.

Depending on your research objectives, the resources at your disposal, and the type of data you wish to analyze, selecting the most appropriate approach or combination of methodologies is crucial to conducting analytical research.

Examples of analytical research

Analytical research goes beyond taking a single measurement. For example, instead of simply noting a trade imbalance, you would examine its causes and how it has changed over time; detailed statistics and statistical tests help guarantee that the results are significant.

Similarly, an analytical study can look into why the value of the Japanese Yen has decreased, because analytical research considers “how” and “why” questions.

As another example, someone might conduct analytical research to identify a gap in an existing study. It presents a fresh perspective on the data and therefore helps support or refute existing notions.

Descriptive vs analytical research

Here are the key differences between descriptive research and analytical research:

| Aspect | Descriptive Research | Analytical Research |
|---|---|---|
| Objective | Describe and document characteristics or phenomena | Analyze and interpret data to understand relationships or causality |
| Focus | “What” questions | “Why” and “How” questions |
| Data Analysis | Summarizing information | Statistical analysis, hypothesis testing, qualitative analysis |
| Goal | Provide an accurate and comprehensive description | Gain insights, make inferences, provide explanations or predictions |
| Causal Relationships | Not the primary focus | Examines underlying factors, causes, or effects |
| Examples | Surveys, observations, case-control studies, content analysis | Experiments, statistical analysis, qualitative analysis |

The study of cause and effect makes extensive use of analytical research. It benefits numerous academic disciplines, including marketing, health, and psychology, because it offers more conclusive information for addressing research issues.



Analytical Approach and Framework

What is an analytical approach and framework?

An analytical framework is a structure that helps us make sense of data in an organized way. We take an analytical approach by dividing a complex problem into clear, manageable segments and then reintegrating the results into a unified solution.

Below, we will explore how and when to use three types of analytical frameworks:

  • A Framework for Qualitative Research: Translating problems into numbers.
  • Case Study 1: Banner ad strategy.
  • A Framework for Quantitative Research: Putting numbers in context.
  • Case Study 2: Marketing channel metrics.
  • Data Science Methodology: Step-by-step approach to gathering data and drawing conclusions.

Types of Analytical Frameworks

There are three main types of data analytics frameworks, each with its own strengths depending on what it helps us organize.

1 - Qualitative research frameworks: When dealing with categorical questions such as, “are our clients satisfied with our product?”, we need a way to translate that question into numbers in order to create data-based insights. A qualitative research framework does this by transforming “soft” problems into “hard” numbers.

The qualitative research framework also helps us translate abstract concepts into quantifiable data. It’s used for questions like “would investing five more hours per week in research add more value to our product?” In this case, we aim to quantify the concept of value in order to compare different strategies. A qualitative framework eases this process.

2 - Quantitative research frameworks: Let’s say that we are already dealing with well-defined numeric quantities. For example, the “daily active users” our application sees is a metric we have extensively defined and measured. This information helps us know how well the app is currently doing - but doesn’t say much about where to find improvements.

To improve, we need to understand which factors are driving our key metrics; we need to give our metrics context. Quantitative research analytics frameworks help us understand the relationships between different metrics to put our core metrics in context.

3 - Data science methodology: Let’s say we have defined our concepts and put all our metrics in context; even then, we’re just getting started. We still need to gather data to draw conclusions.

Numerous ways exist to do this, some prone to error or inconsistency, so we need a structured process that reduces risk and keeps the work organized. Data science methodology frameworks offer a reliable step-by-step approach to drawing conclusions from data.

Now, let’s examine how each of these analytical frameworks works.

A Framework for Qualitative Research

There are a few qualitative research analytical frameworks we could use depending on the context of the business environment. Specific situations and problems call for different approaches; we want to ensure that we are translating the business challenge into numerical measurements in the right way.

Two examples of these approaches include product metric frameworks for measuring success and diagnosing changes in metrics, as well as evaluating the impact of potential feature changes to our product. Another common business case for translating a problem into hard numbers is the A/B test, which has a framework of its own.

However, each of these specific frameworks follows the same four-step structure outlined below: each begins with a vaguely defined business problem that must be converted into hard numbers before it can be addressed.

The framework to go about finding these solutions has four steps:

  • First, ask clarifying questions. Gather all the context you need to narrow down the scope of the problem and determine what requires further clarification.
  • Second, assess the requirements. Define the problem in terms of precise metrics that can be used to address gaps from the previous step.
  • Third, provide a solution. Each solution will vary depending on the type of problem you’re dealing with.
  • Fourth, validate the solution. Do this against your pre-existing knowledge and available data to minimize the likelihood of making mistakes.

Case Study 1: Banner Ad Strategy

Let’s go through each of those framework steps with a business example: an online media company wants to monetize web traffic by embedding banner ads in its content. Our task is to measure the success of different banner ad strategies and select the best one to scale up.

1 - Clarifying Questions & Assumptions:

Initially, we need to gather context about our monetization method. Will revenue depend on ad impressions, clicks, or the number of users who buy the advertised products?

We also need to identify our audience type. Does it consist of stable (loyal) readers with regular engagement? Or is it primarily composed of click-bait article chasers with low rates of future engagement?

This information is necessary to define each strategy’s success and determine which strategies to test in the future. For example, if we have a click-bait audience, we can observe the revenue for each monetization strategy in the short term and then compare the results.

However, if we have a regular audience, we need to understand the customer lifetime value for each strategy. This is because strategies like filling the page with ad banners could make us more money in the short term - but contribute to the loss of loyal readers, hurting profits in the long term.

2 - Assessing Requirements:

Once we have gathered context and clarified assumptions, we need to define the solution requirements precisely. Let’s say our review reveals that our revenue depends on how many clicks the ads get and that our webpage has a stable user base who reads the webpage regularly.

Now we need to define the metric by which to optimize our banner ad strategy. As discussed above, the average customer lifetime value (CLV), the total revenue the company expects to make from each of its readers, is a good choice. In this case, the average CLV is the average number of clicks per session times the average number of sessions per user, for each banner strategy.

Average CLV = (average clicks per session) × (average sessions per user)

The resulting metric helps us choose between a strategy that generates more clicks in the short term and a strategy that reduces reader churn. We also need to define the set of strategies we’ll evaluate. For simplicity, let’s say we will only vary the number of banners we show to each user.
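To make the trade-off concrete, here is a minimal Python sketch of the CLV comparison; the per-strategy figures are hypothetical, invented to illustrate a short-term gain losing to long-term churn:

```python
def average_clv(clicks_per_session: float, sessions_per_user: float,
                revenue_per_click: float = 1.0) -> float:
    """Average customer lifetime value for a banner strategy:
    clicks per session times sessions per user (times revenue per click)."""
    return clicks_per_session * sessions_per_user * revenue_per_click

# Hypothetical trade-off: more banners mean more clicks per session,
# but loyal readers churn, so each user returns fewer times.
one_banner = average_clv(clicks_per_session=0.10, sessions_per_user=40)
two_banners = average_clv(clicks_per_session=0.15, sessions_per_user=20)

print(one_banner, two_banners)  # the one-banner strategy wins over a lifetime
```

Even though the two-banner strategy earns 50% more clicks per session, the loss of repeat visits makes it the worse strategy under this metric.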

3 - Solution:

At this point, we’ve defined our problem numerically and can create a data-driven solution.

In general, solutions can involve running experiments, deciding on product features, or explaining metric changes. In this case, we’ll design an A/B test to identify the best banner ad strategy.

Based on our requirements, the A/B test should be user-based instead of session-based: we’ll divide users into two groups and show each group a different number of ads during their visits. For example, Bucket A receives one banner ad per webpage, while Bucket B gets two. Over time, we will be able to capture how the number of ads shown impacts engagement.

To avoid confounding effects, we must ensure identical banner content for both groups. If Bucket B sees the same two banners, half of Bucket A should see one of those banners and the other half the other. We should also alternate the order of the banners for Bucket B to avoid interference from the display order.

Lastly, decide on the experiment duration. To account for long-term effects, we should run the experiment for at least three months.
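The bucketing rules above can be sketched in Python. The experiment name and creative ids are hypothetical; the key idea is that hashing the user id makes the assignment user-based and stable across sessions:

```python
import hashlib

def assign_bucket(user_id: str, experiment: str = "banner-count-test") -> str:
    """Deterministically assign a user to bucket A or B by hashing their id,
    so the split is user-based (stable across visits), not session-based."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    return "A" if int(digest, 16) % 2 == 0 else "B"

def banners_for(user_id: str, page_view: int, creatives=("ad_1", "ad_2")):
    """Bucket A sees one banner, Bucket B sees both. Within Bucket A, half
    of the users see each creative; Bucket B alternates the display order
    on each page view to cancel out order effects."""
    if assign_bucket(user_id) == "A":
        # stable per-user choice of which single creative to show
        pick = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 2
        return [creatives[pick]]
    order = creatives if page_view % 2 == 0 else tuple(reversed(creatives))
    return list(order)

# A user keeps the same bucket on every visit:
assert assign_bucket("reader-42") == assign_bucket("reader-42")
```

Deterministic hashing avoids having to store an assignment table, and salting the hash with the experiment name keeps separate experiments independent of each other.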

4 - Validation:

A useful first step is to re-check the numbers and perform a gut instinct check. If results seem odd, we should suspect a problem, investigate the cause, and revise our approach.

In this example, we tested a banner strategy hypothesis. The validation step involves evaluating differences between the test and control groups (users who didn’t receive the treatment over three months) and identifying any confounding factors that might have affected the results. We must also determine if the differences and observations are statistically significant or potentially spurious results.
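One way to check significance here, assuming revenue depends on ad clicks, is a two-proportion z-test comparing click rates between the control and test buckets. This stdlib-only sketch uses made-up counts:

```python
from math import sqrt, erf

def two_proportion_z_test(clicks_a, users_a, clicks_b, users_b):
    """Two-sided z-test for a difference in click rate between the
    control group (A) and the test group (B); returns (z, p-value)."""
    p_a, p_b = clicks_a / users_a, clicks_b / users_b
    pooled = (clicks_a + clicks_b) / (users_a + users_b)
    se = sqrt(pooled * (1 - pooled) * (1 / users_a + 1 / users_b))
    z = (p_b - p_a) / se
    # normal CDF via the error function: Phi(x) = 0.5 * (1 + erf(x / sqrt(2)))
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
    return z, p_value

# Hypothetical results after three months:
z, p = two_proportion_z_test(clicks_a=520, users_a=10_000,
                             clicks_b=610, users_b=10_000)
print(f"z = {z:.2f}, p = {p:.4f}")  # reject the null hypothesis if p < 0.05
```

A small p-value tells us the observed difference is unlikely to be spurious; it does not rule out confounders, which is why the qualitative checks above still matter.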

A Framework for Quantitative Research

The second type of analytical approach comes from the quantitative research framework. After we define our key metrics clearly, this framework helps give them context. With it, teams can deepen their understanding of a key metric, making it easier to track it, understand the factors that drive it, assign responsibilities to team members, and identify opportunities for improvement.

We do this by breaking down the key metric into lower-level metrics. Here’s a step-by-step guide:

Identify the key metric : Determine the main metric you want to focus on (e.g., revenue for a sales team).

Define first-level metrics : Break down the key metric into components that directly relate to it. For a sales team, first-level metrics would be the sales volume and the average selling price because the revenue is the sales volume times the average selling price.

Identify second-level metrics : Further refine your analysis by breaking down the first-level metrics into their underlying factors. For a sales team, second-level metrics could include:

  • Number of leads generated
  • Conversion rate
  • Average order value
  • Discounts and promotions
  • Competitor prices

Assign responsibility and track progress : With a better understanding of first and second-level metrics, allocate responsibility for improving them to different team members. Track their progress to enhance the key metric.
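The sales-team breakdown above can be sketched as a small metric tree in Python; the quarter’s figures are hypothetical:

```python
# Each key metric is computed from its first-level metrics, which in
# turn are driven by second-level metrics further down the tree.
def revenue(sales_volume: float, avg_selling_price: float) -> float:
    """Key metric: revenue = sales volume x average selling price."""
    return sales_volume * avg_selling_price

def sales_volume(leads_generated: int, conversion_rate: float) -> float:
    """First-level metric, driven by two second-level metrics."""
    return leads_generated * conversion_rate

# Hypothetical quarter: 2,000 leads, 5% conversion, $300 average price.
volume = sales_volume(leads_generated=2_000, conversion_rate=0.05)
print(revenue(volume, avg_selling_price=300))  # -> 30000.0
```

Expressing the tree as functions makes the drivers explicit: to grow revenue, a team member can own leads, another conversion rate, and another pricing.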

Case Study 2: Marketing Channel Metrics

Let’s explore an example where we apply the quantitative analytics framework to a company called Mode, which sells B2B analytics dashboards through a SaaS freemium subscription model (users can use the product for free but must pay monthly or annually for advanced features).

Step 1: Identify the key metric

Our key metric is marketing ROI (revenue over expenses) for each of our marketing channels.

Step 2: Define first-level metrics

Two first-level metrics stand out:

  • Revenue: Driven by our average Customer Lifetime Value (CLV) - the total revenue we make over the years for each new customer.
  • Expenses: Driven by our Customer Acquisition Cost (CAC) - the cost of gaining new customers.

Step 3: Identify second-level metrics

Now we need to identify the second-level metrics for each of our first-level metrics.

First-Level Metric: Customer Lifetime Value

CLV is calculated as the Average Revenue Per Customer (ARPC) - the average amount a customer spends each month - divided by the churn rate (CR) - the percentage of users that stop using the platform each month:

CLV = ARPC / CR

So ARPC and CR are the second-level metrics driving CLV.

First-Level Metric: Customer Acquisition Cost

On the other side of our marketing ROI equation, CAC is the average amount spent by the sales team in salaried time and equipment/software value to sign up one new customer.

There are quite a few second-level metrics we could investigate under CAC, mostly from looking at the customer acquisition funnel:

  • Cost per View (CPV): The amount it costs the company for each new person to see our landing page.
  • Free Sign-Ups per Total Number of Views (FSU/TNV): The percentage of landing page visitors who create a free account.
  • Paid Customers per Total Number of Views (PC/TNV): The percentage of landing page visitors who create a premium account directly.
  • Paid Customers per Free Sign-Ups (PC/FSU): The percentage of free account users who upgrade to a premium account.

With this information, we can define our CAC as:

CAC = CPV / (PC/TNV + (FSU/TNV) × (PC/FSU))

So the four metrics we identified serve as our second-level metrics.

Step 4: Assign responsibility and track progress

With a clear understanding of first- and second-level metrics, the sales team can assign responsibilities for improving each metric and track their progress in enhancing the key metric of marketing ROI.
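Putting the pieces together, here is a minimal Python sketch of the marketing-ROI calculation for one channel. The numbers are hypothetical, and the CAC expression (cost per view divided by the share of views that end up paying, either directly or via a free sign-up that later upgrades) is one reasonable way to combine the four funnel metrics:

```python
def clv(arpc: float, churn_rate: float) -> float:
    """Customer lifetime value: average revenue per customer per month
    divided by the monthly churn rate."""
    return arpc / churn_rate

def cac(cpv: float, pc_per_tnv: float, fsu_per_tnv: float,
        pc_per_fsu: float) -> float:
    """Customer acquisition cost from the funnel: cost per view divided
    by the fraction of views that become paying customers."""
    paying_per_view = pc_per_tnv + fsu_per_tnv * pc_per_fsu
    return cpv / paying_per_view

# Hypothetical channel: $50/month ARPC, 4% monthly churn, $0.50 per view,
# 1% direct premium sign-ups, 10% free sign-ups, 5% free-to-paid upgrades.
lifetime_value = clv(arpc=50, churn_rate=0.04)       # 1250.0
acquisition_cost = cac(0.50, 0.01, 0.10, 0.05)       # about 33.3
print(f"marketing ROI: {lifetime_value / acquisition_cost:.1f}x")
```

With these inputs the channel returns about 37.5x its acquisition spend, so the second-level metrics (churn, upgrade rate, cost per view) show exactly where an improvement would move the ROI.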

Data Science Methodology

Let’s say we’ve defined our concepts and metrics. We translated our business problem into hard numbers using a qualitative framework. Then we used the quantitative framework to get an analytical understanding of the metrics involved and their relationships. Now we want to draw conclusions from the data.

To do this, we need a reliable process that minimizes errors and keeps things organized. This is where our third analytical framework comes into use. The data science methodology provides a step-by-step approach for reaching conclusions from data, which is especially useful when questions become increasingly complex:

  • Data Requirements - Figure out the necessary data, formats, and sources to collect.
  • Data Collection - Gather and validate the data, ensuring it’s representative of the problem.
  • Data Processing - Clean and transform the data.
  • Modeling - Build models to predict or describe outcomes.
  • Evaluation - Check if the model meets business requirements and is high-quality.
  • Deployment - Prepare the model for real-world use.
  • Feedback - Refine the model based on its performance and impact.

Imagine you’re working at a company that wants to boost customer retention in its online store. The company collects customer data through website analytics and a customer database. Going through each of the methodology steps would look something like this:

  • Data Requirements: Identify data needed to improve customer retention, such as demographics, purchase history, website engagement, and feedback.
  • Data Collection: Gather data from sources like databases, website analytics, and surveys. Ensure data is accurate, complete, and relevant.
  • Data Processing: Clean the data to remove errors, duplicates, and missing values. Look for patterns and trends you could use for feature engineering .
  • Modeling: Create predictive models to find factors that impact customer retention using machine learning algorithms based on historical data.
  • Evaluation: Compare the model’s predictions to actual customer behavior, checking for accuracy, interpretability, and scalability.
  • Deployment: Implement the model in the online store’s retention strategies. This could include targeted marketing campaigns, personalized recommendations, or loyalty programs based on the model’s predictions.
  • Feedback: Keep an eye on the model’s performance and gather customer feedback to refine it. Update the model’s algorithms or adjust retention strategies based on its predictions. Continuously assess and improve the model to maintain its effectiveness.
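The retention walkthrough above can be condensed into a toy, stdlib-only Python pipeline. The records and the visits-based rule are invented for illustration; they stand in for real data processing, modeling, and evaluation steps:

```python
# Steps 1-3: requirements, collection, processing (dedupe, drop incomplete rows)
raw = [
    {"id": 1, "visits": 12, "purchases": 3, "retained": True},
    {"id": 1, "visits": 12, "purchases": 3, "retained": True},   # duplicate
    {"id": 2, "visits": 2,  "purchases": 0, "retained": False},
    {"id": 3, "visits": None, "purchases": 1, "retained": True}, # missing value
    {"id": 4, "visits": 8,  "purchases": 1, "retained": True},
    {"id": 5, "visits": 6,  "purchases": 0, "retained": False},
]
seen, clean = set(), []
for row in raw:
    if row["id"] in seen or None in row.values():
        continue  # skip duplicates and rows with missing values
    seen.add(row["id"])
    clean.append(row)

# Step 4: a deliberately simple "model": predict retention when visits >= 5
def predict(row):
    return row["visits"] >= 5

# Step 5: evaluation of predictions against actual customer behavior
accuracy = sum(predict(r) == r["retained"] for r in clean) / len(clean)
print(f"rows kept: {len(clean)}, accuracy: {accuracy:.2f}")
```

In the deployment and feedback steps, the misclassified customer (retained despite frequent visits being predicted, or vice versa) is exactly the kind of signal used to refine the model over time.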

Theoretical Framework – Types, Examples and Writing Guide

Theoretical Framework

Definition:

Theoretical framework refers to a set of concepts, theories, ideas, and assumptions that serve as a foundation for understanding a particular phenomenon or problem. It provides a conceptual framework that helps researchers to design and conduct their research, as well as to analyze and interpret their findings.

In research, a theoretical framework explains the relationship between various variables, identifies gaps in existing knowledge, and guides the development of research questions, hypotheses, and methodologies. It also helps to contextualize the research within a broader theoretical perspective, and can be used to guide the interpretation of results and the formulation of recommendations.

Types of Theoretical Framework

The types of theoretical framework are as follows:

Conceptual Framework

This type of framework defines the key concepts and relationships between them. It helps to provide a theoretical foundation for a study or research project.

Deductive Framework

This type of framework starts with a general theory or hypothesis and then uses data to test and refine it. It is often used in quantitative research.

Inductive Framework

This type of framework starts with data and then develops a theory or hypothesis based on the patterns and themes that emerge from the data. It is often used in qualitative research .

Empirical Framework

This type of framework focuses on the collection and analysis of empirical data, such as surveys or experiments. It is often used in scientific research .

Normative Framework

This type of framework defines a set of norms or values that guide behavior or decision-making. It is often used in ethics and social sciences.

Explanatory Framework

This type of framework seeks to explain the underlying mechanisms or causes of a particular phenomenon or behavior. It is often used in psychology and social sciences.

Components of Theoretical Framework

The components of a theoretical framework include:

  • Concepts: The basic building blocks of a theoretical framework. Concepts are abstract ideas or generalizations that represent objects, events, or phenomena.
  • Variables: These are measurable and observable aspects of a concept. In a research context, variables can be manipulated or measured to test hypotheses.
  • Assumptions: These are beliefs or statements that are taken for granted and are not tested in a study. They provide a starting point for developing hypotheses.
  • Propositions: These are statements that explain the relationships between concepts and variables in a theoretical framework.
  • Hypotheses: These are testable predictions that are derived from the theoretical framework. Hypotheses are used to guide data collection and analysis.
  • Constructs: These are abstract concepts that cannot be directly measured but are inferred from observable variables. Constructs provide a way to understand complex phenomena.
  • Models: These are simplified representations of reality that are used to explain, predict, or control a phenomenon.
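One way to see how these components fit together is to model them as a simple data structure. The sketch below is purely illustrative; the example framework, its concepts, and its hypothesis are all invented.

```python
# Illustrative sketch: the components listed above modelled as a
# simple data structure. All names and content are hypothetical.
from dataclasses import dataclass, field

@dataclass
class TheoreticalFramework:
    concepts: list[str]       # abstract ideas the framework is built on
    variables: list[str]      # measurable, observable aspects of concepts
    assumptions: list[str]    # untested starting beliefs
    propositions: list[str]   # stated relationships between concepts
    hypotheses: list[str] = field(default_factory=list)  # testable predictions

framework = TheoreticalFramework(
    concepts=["motivation", "performance"],
    variables=["self-reported motivation score", "task completion rate"],
    assumptions=["participants answer surveys honestly"],
    propositions=["higher motivation is associated with higher performance"],
    hypotheses=["H1: motivation score positively predicts completion rate"],
)
print(framework.hypotheses[0])
```

The structure mirrors the flow of the list above: propositions link concepts, and hypotheses operationalize propositions through measurable variables.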

How to Write Theoretical Framework

A theoretical framework is an essential part of any research study or paper, as it helps to provide a theoretical basis for the research and guide the analysis and interpretation of the data. Here are some steps to help you write a theoretical framework:

  • Identify the key concepts and variables: Start by identifying the main concepts and variables that your research is exploring. These could include things like motivation, behavior, attitudes, or any other relevant concepts.
  • Review relevant literature: Conduct a thorough review of the existing literature in your field to identify key theories and ideas that relate to your research. This will help you to understand the existing knowledge and theories that are relevant to your research and provide a basis for your theoretical framework.
  • Develop a conceptual framework: Based on your literature review, develop a conceptual framework that outlines the key concepts and their relationships. This framework should provide a clear and concise overview of the theoretical perspective that underpins your research.
  • Identify hypotheses and research questions: Based on your conceptual framework, identify the hypotheses and research questions that you want to test or explore in your research.
  • Test your theoretical framework: Once you have developed your theoretical framework, test it by applying it to your research data. This will help you to identify any gaps or weaknesses in your framework and refine it as necessary.
  • Write up your theoretical framework: Finally, write up your theoretical framework in a clear and concise manner, using appropriate terminology and referencing the relevant literature to support your arguments.

Theoretical Framework Examples

Here are some examples of theoretical frameworks:

  • Social Learning Theory: This framework, developed by Albert Bandura, suggests that people learn from their environment, including the behaviors of others, and that behavior is influenced by both external and internal factors.
  • Maslow’s Hierarchy of Needs: Abraham Maslow proposed that human needs are arranged in a hierarchy, with basic physiological needs at the bottom, followed by safety, love and belonging, esteem, and self-actualization at the top. This framework has been used in various fields, including psychology and education.
  • Ecological Systems Theory: This framework, developed by Urie Bronfenbrenner, suggests that a person’s development is influenced by the interaction between the individual and the various environments in which they live, such as family, school, and community.
  • Feminist Theory: This framework examines how gender and power intersect to influence social, cultural, and political issues. It emphasizes the importance of understanding and challenging systems of oppression.
  • Cognitive Behavioral Theory: This framework suggests that our thoughts, beliefs, and attitudes influence our behavior, and that changing our thought patterns can lead to changes in behavior and emotional responses.
  • Attachment Theory: This framework examines the ways in which early relationships with caregivers shape our later relationships and attachment styles.
  • Critical Race Theory: This framework examines how race intersects with other forms of social stratification and oppression to perpetuate inequality and discrimination.

When to Have A Theoretical Framework

A theoretical framework is especially valuable in the following situations:

  • A theoretical framework should be developed when conducting research in any discipline, as it provides a foundation for understanding the research problem and guiding the research process.
  • A theoretical framework is essential when conducting research on complex phenomena, as it helps to organize and structure the research questions, hypotheses, and findings.
  • A theoretical framework should be developed when the research problem requires a deeper understanding of the underlying concepts and principles that govern the phenomenon being studied.
  • A theoretical framework is particularly important when conducting research in social sciences, as it helps to explain the relationships between variables and provides a framework for testing hypotheses.
  • A theoretical framework should be developed when conducting research in applied fields, such as engineering or medicine, as it helps to provide a theoretical basis for the development of new technologies or treatments.
  • A theoretical framework should be developed when conducting research that seeks to address a specific gap in knowledge, as it helps to define the problem and identify potential solutions.
  • A theoretical framework is also important when conducting research that involves the analysis of existing theories or concepts, as it helps to provide a framework for comparing and contrasting different theories and concepts.
  • A theoretical framework should be developed when conducting research that seeks to make predictions or develop generalizations about a particular phenomenon, as it helps to provide a basis for evaluating the accuracy of these predictions or generalizations.
  • Finally, a theoretical framework should be developed when conducting research that seeks to make a contribution to the field, as it helps to situate the research within the broader context of the discipline and identify its significance.

Purpose of Theoretical Framework

The purposes of a theoretical framework include:

  • Providing a conceptual framework for the study: A theoretical framework helps researchers to define and clarify the concepts and variables of interest in their research. It enables researchers to develop a clear and concise definition of the problem, which in turn helps to guide the research process.
  • Guiding the research design: A theoretical framework can guide the selection of research methods, data collection techniques, and data analysis procedures. By outlining the key concepts and assumptions underlying the research questions, the theoretical framework can help researchers to identify the most appropriate research design for their study.
  • Supporting the interpretation of research findings: A theoretical framework provides a framework for interpreting the research findings by helping researchers to make connections between their findings and existing theory. It enables researchers to identify the implications of their findings for theory development and to assess the generalizability of their findings.
  • Enhancing the credibility of the research: A well-developed theoretical framework can enhance the credibility of the research by providing a strong theoretical foundation for the study. It demonstrates that the research is based on a solid understanding of the relevant theory and that the research questions are grounded in a clear conceptual framework.
  • Facilitating communication and collaboration: A theoretical framework provides a common language and conceptual framework for researchers, enabling them to communicate and collaborate more effectively. It helps to ensure that everyone involved in the research is working towards the same goals and is using the same concepts and definitions.

Characteristics of Theoretical Framework

Some of the characteristics of a theoretical framework include:

  • Conceptual clarity: The concepts used in the theoretical framework should be clearly defined and understood by all stakeholders.
  • Logical coherence: The framework should be internally consistent, with each concept and assumption logically connected to the others.
  • Empirical relevance: The framework should be based on empirical evidence and research findings.
  • Parsimony: The framework should be as simple as possible, without sacrificing its ability to explain the phenomenon in question.
  • Flexibility: The framework should be adaptable to new findings and insights.
  • Testability: The framework should be testable through research, with clear hypotheses that can be falsified or supported by data.
  • Applicability: The framework should be useful for practical applications, such as designing interventions or policies.

Advantages of Theoretical Framework

Here are some of the advantages of having a theoretical framework:

  • Provides a clear direction: A theoretical framework helps researchers to identify the key concepts and variables they need to study and the relationships between them. This provides a clear direction for the research and helps researchers to focus their efforts and resources.
  • Increases the validity of the research: A theoretical framework helps to ensure that the research is based on sound theoretical principles and concepts. This increases the validity of the research by ensuring that it is grounded in established knowledge and is not based on arbitrary assumptions.
  • Enables comparisons between studies: A theoretical framework provides a common language and set of concepts that researchers can use to compare and contrast their findings. This helps to build a cumulative body of knowledge and allows researchers to identify patterns and trends across different studies.
  • Helps to generate hypotheses: A theoretical framework provides a basis for generating hypotheses about the relationships between different concepts and variables. This can help to guide the research process and identify areas that require further investigation.
  • Facilitates communication: A theoretical framework provides a common language and set of concepts that researchers can use to communicate their findings to other researchers and to the wider community. This makes it easier for others to understand the research and its implications.

About the author


Muhammad Hassan

Researcher, Academic Writer, Web developer


Creating tables and diagrams to describe theoretical, conceptual, and analytical frameworks

Doctoral supervisors (and often, editors!) will ask you to create a conceptual, theoretical and/or analytical framework for your book, dissertation, chapter, or journal article. This is a good idea. I used to get confused by all the “framework”-associated terms, so I wrote

THIS blog post: Writing theoretical frameworks, analytical frameworks and conceptual frameworks https://t.co/DeAqoV5xcQ This post helped me clarify the differences between TF, AF, and CF. Frequently, a graphic depiction is way, way way more helpful than just words on paper. — Dr Raul Pacheco-Vega (@raulpacheco) August 5, 2020

Like I have done in other blog posts of mine, I am going to show you several graphic and table-based depictions of frameworks that may help you think through how you can visually explain the concepts you are using to analyze what you are analyzing.

Here is the 411:

I find it incredibly useful to draw diagrams (oftentimes mind maps, conceptual diagrams, or even fishbone diagrams) to show how variables are linked with each other and how these factors help explain a phenomenon. You can (and I often do) use tables for this purpose. As with the frameworks, we often link the words “theoretical”, “conceptual” and “analytical” with the word “diagram”.

Around 2015-ish, I published a framework that helps scholars and analysts think about environmental non-governmental organizations…

You can download the Pacheco-Vega 2015 Double Grid Framework chapter here https://t.co/0Hpzp7iVEZ Anyhow, the framework is comprised of three components: 1) Two tables describing the dimensions of domestic and international influence 2) Two grids to showcase case studies. — Dr Raul Pacheco-Vega (@raulpacheco) August 5, 2020
Years later, @AmandaMurdie and I did a quantitative, empirical test of the Double Grid Framework https://t.co/UNiUX4mFzR (happy to email you a PDF if you’re not able to download it). In this article, Amanda and I developed an amended version of the framework, now our own. pic.twitter.com/rgkfVi5k6q — Dr Raul Pacheco-Vega (@raulpacheco) August 5, 2020
As you can see, in our work, we use diagrams and tables to develop more clearly the theoretical constructs underlying our analysis. Now, another example I like, a framework developed by @chris_weible and @tanyaheikkila – The Policy Conflict Framework https://t.co/TkbgvWTjhm — Dr Raul Pacheco-Vega (@raulpacheco) August 5, 2020
You may also find (as Chris and Tanya may have discovered as they wrote this paper) that you need ADDITIONAL diagrams to help explain the entirety of the phenomenon you are trying to investigate. This is also normal. What you’d need to do in your paper, book, chapter, etc… pic.twitter.com/cugEujT2SO — Dr Raul Pacheco-Vega (@raulpacheco) August 5, 2020
… each level of conflict intensity is connected to actors’ political positions and how these interact with other factors. Another one (OBVIOUSLY WE ALL KNOW I WAS GOING TO CITE THIS ONE) whose graphic depiction I love is the Institutional Analysis and Development Framework. pic.twitter.com/ogJFNkgS5L — Dr Raul Pacheco-Vega (@raulpacheco) August 5, 2020

To be perfectly honest, I always looked up to Lin and Vincent Ostrom for how to write good tables and diagrams that depicted theoretical, conceptual and analytical frameworks. There are many other frameworks developed by the Ostroms, and pretty much all of them have tables/diagrams.

In sum, your development of theoretical, conceptual and analytical frameworks is well served by depicting these in table form or in graphic, diagrammatic form. What I usually do is – I read A METRIC TONNE of books and articles to see how other authors develop theirs.

And then, I think through how I want to write my own.

I do hope this blog post is useful to anyone who is trying to develop “a theoretical figure” or a “conceptual table”.


By Raul Pacheco-Vega – August 6, 2020

3 Responses

Thank you for this great post. What software do you use to design nice diagrams like these ones?

I know this is going to sound like I am super low-tech, but… I often use PowerPoint!

Thanks! That’s good to know.


  • Correspondence
  • Open access
  • Published: 18 September 2013

Using the framework method for the analysis of qualitative data in multi-disciplinary health research

Nicola K Gale, Gemma Heath, Elaine Cameron, Sabina Rashid & Sabi Redwood

BMC Medical Research Methodology, volume 13, Article number: 117 (2013)


The Framework Method is becoming an increasingly popular approach to the management and analysis of qualitative data in health research. However, there is confusion about its potential application and limitations.

The article discusses when it is appropriate to adopt the Framework Method and explains the procedure for using it in multi-disciplinary health research teams, or those that involve clinicians, patients and lay people. The stages of the method are illustrated using examples from a published study.

Used effectively, with the leadership of an experienced qualitative researcher, the Framework Method is a systematic and flexible approach to analysing qualitative data and is appropriate for use in research teams even where not all members have previous experience of conducting qualitative research.

The Framework Method for the management and analysis of qualitative data has been used since the 1980s [ 1 ]. The method originated in large-scale social policy research but is becoming an increasingly popular approach in medical and health research; however, there is some confusion about its potential application and limitations. In this article we discuss when it is appropriate to use the Framework Method and how it compares to other qualitative analysis methods. In particular, we explore how it can be used in multi-disciplinary health research teams. Multi-disciplinary and mixed methods studies are becoming increasingly commonplace in applied health research. As well as disciplines familiar with qualitative research, such as nursing, psychology and sociology, teams often include epidemiologists, health economists, management scientists and others. Furthermore, applied health research often has clinical representation and, increasingly, patient and public involvement [ 2 ]. We argue that while leadership is undoubtedly required from an experienced qualitative methodologist, non-specialists from the wider team can and should be involved in the analysis process. We then present a step-by-step guide to the application of the Framework Method, illustrated using a worked example (See Additional File 1 ) from a published study [ 3 ] to illustrate the main stages of the process. Technical terms are included in the glossary (below). Finally, we discuss the strengths and limitations of the approach.

Glossary of key terms used in the Framework Method

Analytical framework: A set of codes organised into categories that have been jointly developed by researchers involved in analysis that can be used to manage and organise the data. The framework creates a new structure for the data (rather than the full original accounts given by participants) that is helpful to summarize/reduce the data in a way that can support answering the research questions.

Analytic memo: A written investigation of a particular concept, theme or problem, reflecting on emerging issues in the data that captures the analytic process (see Additional file 1 , Section 7).

Categories: During the analysis process, codes are grouped into clusters around similar and interrelated ideas or concepts. Categories and codes are usually arranged in a tree diagram structure in the analytical framework. While categories are closely and explicitly linked to the raw data, developing categories is a way to start the process of abstraction of the data (i.e. towards the general rather than the specific or anecdotal).

Charting: Entering summarized data into the Framework Method matrix (see Additional File 1 , Section 6).

Code: A descriptive or conceptual label that is assigned to excerpts of raw data in a process called ‘coding’ (see Additional File 1 , Section 3).

Data: Qualitative data usually needs to be in textual form before analysis. These texts can either be elicited texts (written specifically for the research, such as food diaries), or extant texts (pre-existing texts, such as meeting minutes, policy documents or weblogs), or can be produced by transcribing interview or focus group data, or creating ‘field’ notes while conducting participant-observation or observing objects or social situations.

Indexing: The systematic application of codes from the agreed analytical framework to the whole dataset (see Additional File 1 , Section 5).

Matrix: A spreadsheet containing numerous cells into which summarized data are entered, organised by codes (columns) and cases (rows) (see Additional File 1 , Section 6).

Themes: Interpretive concepts or propositions that describe or explain aspects of the data, which are the final output of the analysis of the whole dataset. Themes are articulated and developed by interrogating data categories through comparison between and within cases. Usually a number of categories would fall under each theme or sub-theme [ 3 ].

Transcript: A written verbatim (word-for-word) account of a verbal interaction, such as an interview or conversation.

The Framework Method sits within a broad family of analysis methods often termed thematic analysis or qualitative content analysis. These approaches identify commonalities and differences in qualitative data, before focusing on relationships between different parts of the data, thereby seeking to draw descriptive and/or explanatory conclusions clustered around themes. The Framework Method was developed by researchers, Jane Ritchie and Liz Spencer, from the Qualitative Research Unit at the National Centre for Social Research in the United Kingdom in the late 1980s for use in large-scale policy research [ 1 ]. It is now used widely in other areas, including health research [ 3 – 12 ]. Its defining feature is the matrix output: rows (cases), columns (codes) and ‘cells’ of summarised data, providing a structure into which the researcher can systematically reduce the data, in order to analyse it by case and by code [ 1 ]. Most often a ‘case’ is an individual interviewee, but this can be adapted to other units of analysis, such as predefined groups or organisations. While in-depth analyses of key themes can take place across the whole data set, the views of each research participant remain connected to other aspects of their account within the matrix so that the context of the individual’s views is not lost. Comparing and contrasting data is vital to qualitative analysis and the ability to compare with ease data across cases as well as within individual cases is built into the structure and process of the Framework Method.
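The matrix output described above (cases as rows, codes as columns, summarized data in the cells) can be sketched with a pandas DataFrame. The interviewees, codes, and cell summaries below are invented for illustration only; they are not from the study the article cites.

```python
# A minimal sketch of a Framework Method matrix: cases as rows, codes
# as columns, cells holding summarized data. Interviewees, codes and
# summaries are all invented for illustration.
import pandas as pd

codes = ["Onset of symptoms", "Help-seeking", "Lifestyle change"]
cases = ["Interviewee 01", "Interviewee 02", "Interviewee 03"]

matrix = pd.DataFrame(index=cases, columns=codes, dtype=object)

# Charting: entering summarized data into the matrix, cell by cell
matrix.loc["Interviewee 01", "Onset of symptoms"] = "Gradual; attributed to stress"
matrix.loc["Interviewee 01", "Help-seeking"] = "Delayed; consulted GP after 6 months"
matrix.loc["Interviewee 02", "Onset of symptoms"] = "Sudden; linked to family history"
matrix.loc["Interviewee 03", "Lifestyle change"] = "Changed diet after diagnosis"

# Analyse by code (read down a column) or by case (read across a row)
print(matrix["Onset of symptoms"].dropna())
print(matrix.loc["Interviewee 01"].dropna())
```

Reading down a column compares cases within one code; reading across a row keeps each participant’s views connected to the rest of their account, which is the comparative structure the method is built around.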

The Framework Method provides clear steps to follow and produces highly structured outputs of summarised data. It is therefore useful where multiple researchers are working on a project, particularly in multi-disciplinary research teams where not all members have experience of qualitative data analysis, and for managing large data sets where obtaining a holistic, descriptive overview of the entire data set is desirable. However, caution is recommended before selecting the method as it is not a suitable tool for analysing all types of qualitative data or for answering all qualitative research questions, nor is it an ‘easy’ version of qualitative research for quantitative researchers. Importantly, the Framework Method cannot accommodate highly heterogeneous data, i.e. data must cover similar topics or key issues so that it is possible to categorize it. Individual interviewees may, of course, have very different views or experiences in relation to each topic, which can then be compared and contrasted. The Framework Method is most commonly used for the thematic analysis of semi-structured interview transcripts, which is what we focus on in this article, although it could, in principle, be adapted for other types of textual data [ 13 ], including documents, such as meeting minutes or diaries [ 12 ], or field notes from observations [ 10 ].

For quantitative researchers working with qualitative colleagues or when exploring qualitative research for the first time, the nature of the Framework Method is seductive because its methodical processes and ‘spreadsheet’ approach seem more closely aligned to the quantitative paradigm [ 14 ]. Although the Framework Method is a highly systematic method of categorizing and organizing what may seem like unwieldy qualitative data, it is not a panacea for problematic issues commonly associated with qualitative data analysis such as how to make analytic choices and make interpretive strategies visible and auditable. Qualitative research skills are required to appropriately interpret the matrix, and facilitate the generation of descriptions, categories, explanations and typologies. Moreover, reflexivity, rigour and quality are issues that are requisite in the Framework Method just as they are in other qualitative methods. It is therefore essential that studies using the Framework Method for analysis are overseen by an experienced qualitative researcher, though this does not preclude those new to qualitative research from contributing to the analysis as part of a wider research team.

There are a number of approaches to qualitative data analysis, including those that pay close attention to language and how it is being used in social interaction such as discourse analysis [ 15 ] and ethnomethodology [ 16 ]; those that are concerned with experience, meaning and language such as phenomenology [ 17 , 18 ] and narrative methods [ 19 ]; and those that seek to develop theory derived from data through a set of procedures and interconnected stages such as Grounded Theory [ 20 , 21 ]. Many of these approaches are associated with specific disciplines and are underpinned by philosophical ideas which shape the process of analysis [ 22 ]. The Framework Method, however, is not aligned with a particular epistemological, philosophical, or theoretical approach. Rather it is a flexible tool that can be adapted for use with many qualitative approaches that aim to generate themes.

The development of themes is a common feature of qualitative data analysis, involving the systematic search for patterns to generate full descriptions capable of shedding light on the phenomenon under investigation. In particular, many qualitative approaches use the ‘constant comparative method’ , developed as part of Grounded Theory, which involves making systematic comparisons across cases to refine each theme [ 21 , 23 ]. Unlike Grounded Theory, the Framework Method is not necessarily concerned with generating social theory, but can greatly facilitate constant comparative techniques through the review of data across the matrix.

Perhaps because the Framework Method is so obviously systematic, it has often, as other commentators have noted, been conflated with a deductive approach to qualitative analysis [ 13 , 14 ]. However, the tool itself has no allegiance to either inductive or deductive thematic analysis; where the research sits along this inductive-deductive continuum depends on the research question. A question such as, ‘Can patients give an accurate biomedical account of the onset of their cardiovascular disease?’ is essentially a yes/no question (although it may be nuanced by the extent of their account or by appropriate use of terminology) and so requires a deductive approach to both data collection and analysis (e.g. structured or semi-structured interviews and directed qualitative content analysis [ 24 ]). Similarly, a deductive approach may be taken if basing analysis on a pre-existing theory, such as behaviour change theories, for example in the case of a research question such as ‘How does the Theory of Planned Behaviour help explain GP prescribing?’ [ 11 ]. However, a research question such as, ‘How do people construct accounts of the onset of their cardiovascular disease?’ would require a more inductive approach that allows for the unexpected, and permits more socially-located responses [ 25 ] from interviewees that may include matters of cultural beliefs, habits of food preparation, concepts of ‘fate’, or links to other important events in their lives, such as grief, which cannot be predicted by the researcher in advance (e.g. an interviewee-led open ended interview and grounded theory [ 20 ]). In all these cases, it may be appropriate to use the Framework Method to manage the data. 
The difference would become apparent in how themes are selected: in the deductive approach, themes and codes are pre-selected based on previous literature, previous theories or the specifics of the research question; whereas in the inductive approach, themes are generated from the data though open (unrestricted) coding, followed by refinement of themes. In many cases, a combined approach is appropriate when the project has some specific issues to explore, but also aims to leave space to discover other unexpected aspects of the participants’ experience or the way they assign meaning to phenomena. In sum, the Framework Method can be adapted for use with deductive, inductive, or combined types of qualitative analysis. However, there are some research questions where analysing data by case and theme is not appropriate and so the Framework Method should be avoided. For instance, depending on the research question, life history data might be better analysed using narrative analysis [ 19 ]; recorded consultations between patients and their healthcare practitioners using conversation analysis [ 26 ]; and documentary data, such as resources for pregnant women, using discourse analysis [ 27 ].

It is not within the scope of this paper to consider study design or data collection in any depth, but before moving on to describe the Framework Method analysis process, it is worth taking a step back to consider briefly what needs to happen before analysis begins. The selection of analysis method should have been considered at the proposal stage of the research and should fit with the research questions and overall aims of the study. Many qualitative studies, particularly ones using inductive analysis, are emergent in nature; this can be a challenge and the researchers can only provide an “imaginative rehearsal” of what is to come [ 28 ]. In mixed methods studies, the role of the qualitative component within the wider goals of the project must also be considered. In the data collection stage, resources must be allocated for properly trained researchers to conduct the qualitative interviewing because it is a highly skilled activity. In some cases, a research team may decide that they would like to use lay people, patients or peers to do the interviews [ 29 – 32 ] and in this case they must be properly trained and mentored which requires time and resources. At this early stage it is also useful to consider whether the team will use Computer Assisted Qualitative Data Analysis Software (CAQDAS), which can assist with data management and analysis.

As any form of qualitative or quantitative analysis is not a purely technical process, but influenced by the characteristics of the researchers and their disciplinary paradigms, critical reflection throughout the research process is paramount, including in the design of the study, the construction or collection of data, and the analysis. All members of the team should keep a research diary, where they record reflexive notes, impressions of the data and thoughts about analysis throughout the process. Experienced qualitative researchers become more skilled at sifting through data and analysing it in a rigorous and reflexive way. They cannot be too attached to certainty, but must remain flexible and adaptive throughout the research in order to generate rich and nuanced findings that embrace and explain the complexity of real social life and can be applied to complex social issues. It is important to remember when using the Framework Method that, unlike quantitative research where data collection and data analysis are strictly sequential and mutually exclusive stages of the research process, in qualitative analysis there is, to a greater or lesser extent depending on the project, ongoing interplay between data collection, analysis, and theory development. For example, new ideas or insights from participants may suggest potentially fruitful lines of enquiry, or close analysis might reveal subtle inconsistencies in an account which require further exploration.

Procedure for analysis

Stage 1: Transcription

A good quality audio recording and, ideally, a verbatim (word for word) transcription of the interview are needed. For Framework Method analysis, it is not necessarily important to include the conventions of dialogue transcription, which can be difficult to read (e.g. pauses or two people talking simultaneously), because the content is what is of primary interest. Transcripts should have large margins and adequate line spacing for later coding and making notes. The process of transcription is a good opportunity to become immersed in the data and is to be strongly encouraged for new researchers. However, in some projects, the decision may be made that it is a better use of resources to outsource this task to a professional transcriber.

Stage 2: Familiarisation with the interview

Becoming familiar with the whole interview using the audio recording and/or transcript and any contextual or reflective notes that were recorded by the interviewer is a vital stage in interpretation. It can also be helpful to re-listen to all or parts of the audio recording. In multi-disciplinary or large research projects, those involved in analysing the data may be different from those who conducted or transcribed the interviews, which makes this stage particularly important. One margin can be used to record any analytical notes, thoughts or impressions.

Stage 3: Coding

After familiarisation, the researcher carefully reads the transcript line by line, applying a paraphrase or label (a ‘code’) that describes what they have interpreted in the passage as important. In more inductive studies, at this stage ‘open coding’ takes place, i.e. coding anything that might be relevant from as many different perspectives as possible. Codes could refer to substantive things (e.g. particular behaviours, incidents or structures), values (e.g. those that inform or underpin certain statements, such as a belief in evidence-based medicine or in patient choice), emotions (e.g. sorrow, frustration, love) and more impressionistic/methodological elements (e.g. interviewee found something difficult to explain, interviewee became emotional, interviewer felt uncomfortable) [ 33 ]. In purely deductive studies, the codes may have been pre-defined (e.g. by an existing theory, or specific areas of interest to the project), so this stage may not be strictly necessary and you could move straight on to indexing; even so, it is generally helpful, when taking a broadly deductive approach, to do some open coding on at least a few of the transcripts to ensure important aspects of the data are not missed. Coding aims to classify all of the data so that it can be compared systematically with other parts of the data set. At least two researchers (or at least one from each discipline or speciality in a multi-disciplinary research team) should independently code the first few transcripts, if feasible. Patients, public involvement representatives or clinicians can also be productively involved at this stage, because they can offer alternative viewpoints, thus ensuring that one particular perspective does not dominate. It is vital in inductive coding to look out for the unexpected and not simply to code in a literal, descriptive way, so the involvement of people from different perspectives can greatly aid this.
As well as getting a holistic impression of what was said, coding line-by-line can often alert the researcher to consider that which may ordinarily remain invisible because it is not clearly expressed or does not ‘fit’ with the rest of the account. In this way the developing analysis is challenged; to reconcile and explain anomalies in the data can make the analysis stronger. Coding can also be done digitally using CAQDAS, which is a useful way to keep track automatically of new codes. However, some researchers prefer to do the early stages of coding with a paper and pen, and only start to use CAQDAS once they reach Stage 5 (see below).

Stage 4: Developing a working analytical framework

After coding the first few transcripts, all researchers involved should meet to compare the labels they have applied and agree on a set of codes to apply to all subsequent transcripts. Codes can be grouped together into categories (using a tree diagram if helpful), which are then clearly defined. This forms a working analytical framework. It is likely that several iterations of the analytical framework will be required before no additional codes emerge. It is always worth having an ‘other’ code under each category to avoid ignoring data that does not fit; the analytical framework is never ‘final’ until the last transcript has been coded.
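As an informal illustration (not part of the original method; all category and code names below are hypothetical), the working analytical framework amounts to a small hierarchical structure of categories grouping agreed codes, and agreeing it involves pooling the labels coders applied independently:

```python
# Sketch of a working analytical framework: categories grouping agreed codes,
# each with an 'other' catch-all so data that does not fit is not ignored.
# All category and code names are hypothetical examples.
framework = {
    "experience_of_symptoms": ["symptom_onset", "pain_description", "other"],
    "use_of_services": ["gp_visits", "emergency_care", "other"],
    "beliefs_and_values": ["fate", "trust_in_medicine", "other"],
}

def merge_coder_labels(*label_sets):
    """Combine the open codes applied independently by several coders into
    one candidate list for the team to discuss, group, and refine."""
    merged = set()
    for labels in label_sets:
        merged.update(labels)
    return sorted(merged)

# Two researchers code the first transcript independently, then compare.
coder_a = {"symptom_onset", "fate", "gp_visits"}
coder_b = {"symptom_onset", "pain_description", "trust_in_medicine"}
candidates = merge_coder_labels(coder_a, coder_b)
```

In practice the grouping of candidate codes into categories is a team discussion, not a computation; the sketch only shows the bookkeeping.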

Stage 5: Applying the analytical framework

The working analytical framework is then applied by indexing subsequent transcripts using the existing categories and codes. Each code is usually assigned a number or abbreviation for easy identification (so the full names of the codes do not have to be written out each time) and written directly onto the transcripts. CAQDAS is particularly useful at this stage because it can speed up the process and ensures that, at later stages, data are easily retrievable. It is worth noting that, unlike software for statistical analyses, which actually carries out the calculations once given the correct instructions, putting the data into a qualitative analysis software package does not analyse the data; it is simply an effective way of storing and organising the data so that they are accessible for the analysis process.
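To make the indexing step concrete, here is a minimal sketch (with hypothetical codes and a hypothetical numbering scheme) of assigning each agreed code a short identifier and applying those identifiers to passages of a transcript:

```python
# Assign each agreed code a short identifier for quick annotation.
codes = ["symptom_onset", "pain_description", "gp_visits", "fate"]
code_ids = {name: f"C{i}" for i, name in enumerate(codes, start=1)}

# Indexing: each passage of a transcript is tagged with the codes it expresses.
indexed_transcript = [
    {"lines": "12-18", "codes": ["C1", "C4"]},  # onset attributed to fate
    {"lines": "19-25", "codes": ["C3"]},        # account of a GP visit
]

def passages_for(code_id, transcript):
    """Retrieve all passages indexed with a given code, for later retrieval
    and comparison across the data set."""
    return [p["lines"] for p in transcript if code_id in p["codes"]]
```

This retrieval-by-code is exactly what CAQDAS packages automate; done on paper, the identifiers are simply written in the transcript margin.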

Stage 6: Charting data into the framework matrix

Qualitative data are voluminous (an hour of interview can generate 15–30 pages of text) and being able to manage and summarise (reduce) the data is a vital aspect of the analysis process. A spreadsheet is used to generate a matrix and the data are ‘charted’ into the matrix. Charting involves summarising the data by category from each transcript. Good charting requires an ability to strike a balance between reducing the data on the one hand and retaining the original meanings and ‘feel’ of the interviewees’ words on the other. The chart should include references to interesting or illustrative quotations. These can be tagged automatically if you are using CAQDAS to manage your data (NVivo version 9 onwards has the capability to generate framework matrices); otherwise a capital ‘Q’, an (anonymised) transcript number, and a page and line reference will suffice. It is helpful in multi-disciplinary teams to compare and contrast styles of summarising in the early stages of the analysis process to ensure consistency within the team. Any abbreviations used should be agreed by the team. Once members of the team are familiar with the analytical framework and well practised at coding and charting, it will take, on average, about half a day per hour-long transcript to reach this stage. In the early stages, it takes much longer.
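As an informal illustration (not part of the original method; the case names, categories, summaries, and quote tags are all hypothetical), the framework matrix can be thought of as a case-by-category table whose cells hold summaries plus tagged quotations, read ‘across’ for a case or ‘down’ for a theme:

```python
# Sketch of a framework matrix: one row per case (interviewee), one column
# per category; each cell holds a summary and tagged quotes. Quote tags
# follow a Q + transcript number, page, line convention.
matrix = {
    "case_01": {
        "experience_of_symptoms": {
            "summary": "Gradual onset over months; attributed to stress.",
            "quotes": ["Q1 p.3 l.12"],
        },
        "use_of_services": {
            "summary": "Delayed seeing the GP until symptoms worsened.",
            "quotes": ["Q1 p.5 l.4"],
        },
    },
    "case_02": {
        "experience_of_symptoms": {
            "summary": "Sudden onset; linked to a bereavement.",
            "quotes": ["Q2 p.2 l.20"],
        },
        "use_of_services": {"summary": "", "quotes": []},  # empty cell
    },
}

def column(matrix, category):
    """Read 'down' a column to compare all cases on one category."""
    return {case: cells.get(category, {}).get("summary", "")
            for case, cells in matrix.items()}
```

Empty cells are kept visible rather than dropped, since absences and deviant cases are themselves analytically interesting.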

Stage 7: Interpreting the data

It is useful throughout the research to have a separate notebook or computer file to note down impressions, ideas and early interpretations of the data. It may be worth breaking off at any stage to explore an interesting idea, concept or potential theme by writing an analytic memo [ 20 , 21 ] to then discuss with other members of the research team, including lay and clinical members. Gradually, characteristics of and differences between the data are identified, perhaps generating typologies, interrogating theoretical concepts (either prior concepts or ones emerging from the data) or mapping connections between categories to explore relationships and/or causality. If the data are rich enough, the findings generated through this process can go beyond description of particular cases to explanation of, for example, reasons for the emergence of a phenomenon, predicting how an organisation or other social actor is likely to instigate or respond to a situation, or identifying areas that are not functioning well within an organisation or system. It is worth noting that this stage often takes longer than anticipated and that any project plan should ensure that sufficient time is allocated to meetings and individual researcher time to conduct interpretation and writing up of findings (see Additional file 1 , Section 7).

The Framework Method has been developed and used successfully in research for over 25 years, and has recently become a popular analysis method in qualitative health research. The issue of how to assess quality in qualitative research has been highly debated [ 20 , 34 – 40 ], but ensuring rigour and transparency in analysis is a vital component. There are, of course, many ways to do this but in the Framework Method the following are helpful:

Summarizing the data during charting, as well as being a practical way to reduce the data, means that all members of a multi-disciplinary team, including lay, clinical and (quantitative) academic members can engage with the data and offer their perspectives during the analysis process without necessarily needing to read all the transcripts or be involved in the more technical parts of analysis.

Charting also ensures that researchers pay close attention to describing the data using each participant’s own subjective frames and expressions in the first instance, before moving onto interpretation.

The summarized data is kept within the wider context of each case, thereby encouraging thick description that pays attention to complex layers of meaning and understanding [ 38 ].

The matrix structure is visually straightforward and can facilitate recognition of patterns in the data by any member of the research team, including through drawing attention to contradictory data, deviant cases or empty cells.

The systematic procedure (described in this article) makes it easy to follow, even for multi-disciplinary teams and/or with large data sets.

It is flexible enough that non-interview data (such as field notes taken during the interview or reflexive considerations) can be included in the matrix.

It is not aligned with a particular epistemological viewpoint or theoretical approach and therefore can be adapted for use in inductive or deductive analysis or a combination of the two (e.g. using pre-existing theoretical constructs deductively, then revising the theory with inductive aspects; or using an inductive approach to identify themes in the data, before returning to the literature and using theories deductively to help further explain certain themes).

It is easy to identify relevant data extracts to illustrate themes and to check whether there is sufficient evidence for a proposed theme.

Finally, there is a clear audit trail from original raw data to final themes, including the illustrative quotes.

There are also a number of potential pitfalls to this approach:

The systematic approach and matrix format, as we noted in the background, is intuitively appealing to those trained quantitatively, but the ‘spreadsheet’ look perhaps further increases the temptation for those without an in-depth understanding of qualitative research to attempt to quantify qualitative data (e.g. “13 out of 20 participants said X”). This kind of statement is clearly meaningless because the sampling in qualitative research is not designed to be representative of a wider population, but purposive, to capture diversity around a phenomenon [ 41 ].

Like all qualitative analysis methods, the Framework Method is time consuming and resource-intensive. When involving multiple stakeholders and disciplines in the analysis and interpretation of the data, as is good practice in applied health research, the time needed is extended. This time needs to be factored into the project proposal at the pre-funding stage.

There is a high training component to successfully using the method in a new multi-disciplinary team. Depending on their role in the analysis, members of the research team may have to learn how to code, index, and chart data, to think reflexively about how their identities and experience affect the analysis process, and/or to learn about the methods of generalisation (i.e. analytic generalisation and transferability, rather than statistical generalisation [ 41 ]) that help them legitimately interpret the meaning and significance of the data.

While the Framework Method is amenable to the participation of non-experts in data analysis, it is critical to the successful use of the method that an experienced qualitative researcher leads the project (even if the overall lead for a large mixed methods study is a different person). The qualitative lead would ideally be joined by other researchers with at least some prior training in or experience of qualitative analysis. The responsibilities of the lead qualitative researcher are: to contribute to study design, project timelines and resource planning; to mentor junior qualitative researchers; to train clinical, lay and other (non-qualitative) academics to contribute as appropriate to the analysis process; to facilitate analysis meetings in a way that encourages critical and reflexive engagement with the data and other team members; and finally to lead the write-up of the study.

We have argued that Framework Method studies can be conducted by multi-disciplinary research teams that include, for example, healthcare professionals, psychologists, sociologists, economists, and lay people/service users. The inclusion of so many different perspectives means that decision-making in the analysis process can be very time consuming and resource-intensive. It may require extensive, reflexive and critical dialogue about how the ideas expressed by interviewees and identified in the transcript are related to pre-existing concepts and theories from each discipline, and to the real ‘problems’ in the health system that the project is addressing. This kind of team effort is, however, an excellent forum for driving forward interdisciplinary collaboration, as well as clinical and lay involvement in research, to ensure that ‘the whole is greater than the sum of the parts’, by enhancing the credibility and relevance of the findings.

The Framework Method is appropriate for thematic analysis of textual data, particularly interview transcripts, where it is important to be able to compare and contrast data by themes across many cases, while also situating each perspective in context by retaining the connection to other aspects of each individual’s account. Experienced qualitative researchers should lead and facilitate all aspects of the analysis, although the Framework Method’s systematic approach makes it suitable for involving all members of a multi-disciplinary team. An open, critical and reflexive approach from all team members is essential for rigorous qualitative analysis.

Acceptance of the complexity of real life health systems and the existence of multiple perspectives on health issues is necessary to produce high quality qualitative research. If done well, qualitative studies can shed explanatory and predictive light on important phenomena, relate constructively to quantitative parts of a larger study, and contribute to the improvement of health services and development of health policy. The Framework Method, when selected and implemented appropriately, can be a suitable tool for achieving these aims through producing credible and relevant findings.

The Framework Method is an excellent tool for supporting thematic (qualitative content) analysis because it provides a systematic model for managing and mapping the data.

The Framework Method is most suitable for analysis of interview data, where it is desirable to generate themes by making comparisons within and between cases.

The management of large data sets is facilitated by the Framework Method as its matrix form provides an intuitively structured overview of summarised data.

The clear, step-by-step process of the Framework Method makes it suitable for interdisciplinary and collaborative projects.

The use of the method should be led and facilitated by an experienced qualitative researcher.

Ritchie J, Lewis J: Qualitative research practice: a guide for social science students and researchers. 2003, London: Sage


Ives J, Damery S, Redwood S: PPI, paradoxes and Plato: who's sailing the ship?. J Med Ethics. 2013, 39 (3): 181-185. 10.1136/medethics-2011-100150.


Heath G, Cameron E, Cummins C, Greenfield S, Pattison H, Kelly D, Redwood S: Paediatric ‘care closer to home’: stake-holder views and barriers to implementation. Health Place. 2012, 18 (5): 1068-1073. 10.1016/j.healthplace.2012.05.003.

Elkington H, White P, Addington-Hall J, Higgs R, Petternari C: The last year of life of COPD: a qualitative study of symptoms and services. Respir Med. 2004, 98 (5): 439-445. 10.1016/j.rmed.2003.11.006.

Murtagh J, Dixey R, Rudolf M: A qualitative investigation into the levers and barriers to weight loss in children: opinions of obese children. Archives Dis Child. 2006, 91 (11): 920-923. 10.1136/adc.2005.085712.

Barnard M, Webster S, O’Connor W, Jones A, Donmall M: The drug treatment outcomes research study (DTORS): qualitative study. 2009, London: Home Office

Ayatollahi H, Bath PA, Goodacre S: Factors influencing the use of IT in the emergency department: a qualitative study. Health Inform J. 2010, 16 (3): 189-200. 10.1177/1460458210377480.

Sheard L, Prout H, Dowding D, Noble S, Watt I, Maraveyas A, Johnson M: Barriers to the diagnosis and treatment of venous thromboembolism in advanced cancer patients: a qualitative study. Palliative Med. 2012, 27 (2): 339-348.

Ellis J, Wagland R, Tishelman C, Williams ML, Bailey CD, Haines J, Caress A, Lorigan P, Smith JA, Booton R, et al: Considerations in developing and delivering a nonpharmacological intervention for symptom management in lung cancer: the views of patients and informal caregivers. J Pain Symptom Manag. 2012, 44 (6): 831-842. 10.1016/j.jpainsymman.2011.12.274.

Gale N, Sultan H: Telehealth as ‘peace of mind’: embodiment, emotions and the home as the primary health space for people with chronic obstructive pulmonary disorder. Health place. 2013, 21: 140-147.

Rashidian A, Eccles MP, Russell I: Falling on stony ground? A qualitative study of implementation of clinical guidelines’ prescribing recommendations in primary care. Health policy. 2008, 85 (2): 148-161. 10.1016/j.healthpol.2007.07.011.

Jones RK: The unsolicited diary as a qualitative research tool for advanced research capacity in the field of health and illness. Qualitative Health Res. 2000, 10 (4): 555-567. 10.1177/104973200129118543.

Pope C, Ziebland S, Mays N: Analysing qualitative data. British Med J. 2000, 320: 114-116. 10.1136/bmj.320.7227.114.

Pope C, Mays N: Critical reflections on the rise of qualitative research. British Med J. 2009, 339: 737-739.

Fairclough N: Critical discourse analysis: the critical study of language. 2010, London: Longman

Garfinkel H: Ethnomethodology’s program. Soc Psychol Q. 1996, 59 (1): 5-21. 10.2307/2787116.

Merleau-Ponty M: The phenomenology of perception. 1962, London: Routledge and Kegan Paul

Svenaeus F: The phenomenology of health and illness. Handbook of phenomenology and medicine. 2001, Netherlands: Springer, 87-108.

Riessman CK: Narrative methods for the human sciences. 2008, London: Sage

Charmaz K: Constructing grounded theory: a practical guide through qualitative analysis. 2006, London: Sage

Glaser BG, Strauss AL: The discovery of grounded theory. 1967, Chicago: Aldine

Crotty M: The foundations of social research: meaning and perspective in the research process. 1998, London: Sage

Boeije H: A purposeful approach to the constant comparative method in the analysis of qualitative interviews. Qual Quant. 2002, 36 (4): 391-409. 10.1023/A:1020909529486.

Hsieh H-F, Shannon SE: Three approaches to qualitative content analysis. Qual Health Res. 2005, 15 (9): 1277-1288. 10.1177/1049732305276687.

Redwood S, Gale NK, Greenfield S: ‘You give us rangoli, we give you talk’: using an art-based activity to elicit data from a seldom heard group. BMC Med Res Methodol. 2012, 12 (1): 7. 10.1186/1471-2288-12-7.

Mishler EG: The struggle between the voice of medicine and the voice of the lifeworld. The sociology of health and illness: critical perspectives. Edited by: Conrad P, Kern R. 1990, New York: St Martins Press, Third

Hodges BD, Kuper A, Reeves S: Discourse analysis. British Med J. 2008, 337: 570-572. 10.1136/bmj.39370.701782.DE.

Sandelowski M, Barroso J: Writing the proposal for a qualitative research methodology project. Qual Health Res. 2003, 13 (6): 781-820. 10.1177/1049732303013006003.

Ellins J: It’s better together: involving older people in research. HSMC Newsletter Focus Serv Users Publ. 2010, 16 (1): 4-

Phillimore J, Goodson L, Hennessy D, Ergun E: Empowering Birmingham’s migrant and refugee community organisations: making a difference. 2009, York: Joseph Rowntree Foundation

Leamy M, Clough R: How older people became researchers. 2006, York: Joseph Rowntree Foundation

Glasby J, Miller R, Ellins J, Durose J, Davidson D, McIver S, Littlechild R, Tanner D, Snelling I, Spence K: Final report NIHR service delivery and organisation programme. Understanding and improving transitions of older people: a user and carer centred approach. 2012, London: The Stationery Office

Saldaña J: The coding manual for qualitative researchers. 2009, London: Sage

Lincoln YS: Emerging criteria for quality in qualitative and interpretive research. Qual Inquiry. 1995, 1 (3): 275-289. 10.1177/107780049500100301.

Mays N, Pope C: Qualitative research in health care: assessing quality in qualitative research. BMJ British Med J. 2000, 320 (7226): 50. 10.1136/bmj.320.7226.50.

Seale C: Quality in qualitative research. Qual Inquiry. 1999, 5 (4): 465-478. 10.1177/107780049900500402.

Dingwall R, Murphy E, Watson P, Greatbatch D, Parker S: Catching goldfish: quality in qualitative research. J Health serv Res Policy. 1998, 3 (3): 167-172.

Popay J, Rogers A, Williams G: Rationale and standards for the systematic review of qualitative literature in health services research. Qual Health Res. 1998, 8 (3): 341-351. 10.1177/104973239800800305.

Morse JM, Barrett M, Mayan M, Olson K, Spiers J: Verification strategies for establishing reliability and validity in qualitative research. Int J Qual Methods. 2008, 1 (2): 13-22.

Smith JA: Reflecting on the development of interpretative phenomenological analysis and its contribution to qualitative research in psychology. Qual Res Psychol. 2004, 1 (1): 39-54.

Polit DF, Beck CT: Generalization in quantitative and qualitative research: Myths and strategies. Int J Nurs Studies. 2010, 47 (11): 1451-1458. 10.1016/j.ijnurstu.2010.06.004.

Pre-publication history

The pre-publication history for this paper can be accessed here: http://www.biomedcentral.com/1471-2288/13/117/prepub


Acknowledgments

All authors were funded by the National Institute for Health Research (NIHR) through the Collaborations for Leadership in Applied Health Research and Care for Birmingham and Black Country (CLAHRC-BBC) programme. The views in this publication expressed are those of the authors and not necessarily those of the NHS, the NIHR or the Department of Health.

Author information

Authors and affiliations

Health Services Management Centre, University of Birmingham, Park House, 40 Edgbaston Park Road, Birmingham, B15 2RT, UK

Nicola K Gale

School of Health and Population Sciences, University of Birmingham, Edgbaston, Birmingham, B15 2TT, UK

Gemma Heath & Sabi Redwood

School of Life and Health Sciences, Aston University, Aston Triangle, Birmingham, B4 7ET, UK

Elaine Cameron

East and North Hertfordshire NHS Trust, Lister Hospital, Coreys Mill Lane, Stevenage, SG1 4AB, UK

Sabina Rashid


Corresponding author

Correspondence to Nicola K Gale .

Additional information

Competing interests

The authors declare that they have no competing interests.

Authors’ contributions

All authors were involved in the development of the concept of the article and drafting the article. NG wrote the first draft of the article, GH and EC prepared the text and figures related to the illustrative example, SRa did the literature search to identify if there were any similar articles currently available and contributed to drafting of the article, and SRe contributed to drafting of the article and the illustrative example. All authors read and approved the final manuscript.

Electronic supplementary material

Additional file 1: Illustrative example of the use of the Framework Method. (DOCX 167 KB)

Rights and permissions

This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License ( http://creativecommons.org/licenses/by/2.0 ), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.


About this article

Cite this article

Gale, N.K., Heath, G., Cameron, E. et al. Using the framework method for the analysis of qualitative data in multi-disciplinary health research. BMC Med Res Methodol 13 , 117 (2013). https://doi.org/10.1186/1471-2288-13-117

Received : 17 December 2012

Accepted : 06 September 2013

Published : 18 September 2013

DOI : https://doi.org/10.1186/1471-2288-13-117


  • Qualitative research
  • Qualitative content analysis
  • Multi-disciplinary research

BMC Medical Research Methodology

ISSN: 1471-2288


Framework Analysis: Methods and Use Cases


Introduction


Among qualitative methods in social research, framework analysis stands out as a structured approach to analyzing qualitative data . Originally developed for applied policy analysis and multi-disciplinary health research, this method has found application in various domains due to its emphasis on transparency and systematic data analysis. As with other research methods, the objective remains to extract meaningful themes and patterns, but the framework method provides a specific roadmap for doing so.

Whether you're a seasoned researcher or someone new to the realm of qualitative methodology, understanding the nuances of framework analysis can enhance the depth and rigor of your research efforts. In this article, we will explore the methods and use cases of framework analysis, diving deep into its benefits and the analytical framework it enables researchers to develop.

What is framework analysis?

Framework analysis is a systematic approach for analyzing qualitative data . Rooted in social research traditions relevant to policy making, it has proved a useful analytic tool in multi-disciplinary health research, where the analysis of qualitative data can identify themes and actionable insights relevant to policy outcomes.

Unlike some other qualitative analysis methods , framework analysis is explicitly focused on addressing specific research questions , making it particularly suitable for applied or policy-related qualitative research .

Purpose of framework analysis

The primary aim of framework analysis is to offer a clear and transparent process for conducting qualitative research by managing, reducing, and analyzing large datasets without losing sight of the original context. Given the vast amounts of data often generated in qualitative studies, having a systematic method to sift through this data is crucial.

By using the framework method, researchers can remain focused on their research questions while ensuring that the data collection and analysis process retains its integrity and depth.

Characteristics of framework analysis

Transparent structure: One of the distinct features of framework analysis is its emphasis on transparency . Every step in the analysis process is documented, allowing for easy scrutiny and replication by multiple researchers.

Thematic framework: Central to framework analysis is the development of a framework identifying key themes, concepts, and relationships in the data. The framework guides the subsequent stages of coding and charting.

Flexibility: While it provides a clear structure, framework analysis is also adaptable. Depending on the objectives of the study, researchers can modify the process to better suit their data and questions.

Iterative process: The process in framework analysis is not linear. As data is collected and data analysis progresses, researchers often revisit earlier stages, refining the framework or revising codes to better capture the nuances in the data.

Benefits of framework analysis

Conducting framework analysis has several advantages:

Rigorous data management: The structured approach means data is managed and analyzed with a high level of rigor, minimizing the potential influence of preconceptions.

Inclusivity: Framework analysis accommodates both a priori issues, driven by the research questions , and emergent issues that arise from the data itself.

Comparability: Given its structured nature, framework analysis allows researchers to compare and contrast data, facilitating the identification of patterns and differences.

Accessibility: By presenting data in a summarized, charted form , findings from framework analysis become more accessible and comprehensible, aiding in reporting and disseminating results.

Relevance for applied research: Given its origins in policy research and its clear focus on addressing specific research questions, framework analysis is particularly relevant for studies aiming to inform policy or practice.


Successfully conducting framework analysis involves a series of structured steps. Proper implementation of framework analysis not only ensures the rigor of a qualitative analysis but also that the findings are credible and meaningful.

Familiarization with the data

Before undertaking a more detailed analysis, it's paramount to understand the breadth and depth of the data at hand.

Reading and re-reading: Begin by reading textual data such as transcripts , field notes , and other data sources multiple times. This immersion allows researchers to understand participants' perspectives and grasp the overall context.

Noting preliminary ideas: As researchers familiarize themselves with the data, preliminary themes or ideas may start to emerge. Jotting these down in memos helps in forming an initial understanding and can be instrumental in the subsequent phase of developing a set of themes.

Developing a thematic framework

As is the case across nearly all types of qualitative methodology , central to framework analysis is the construction of a robust analytical framework . This structure aids in organizing and interpreting the data .

Identifying key themes: Based on the initial familiarization, it's important to identify themes that occur in the multimedia or textual data. These themes should be relevant to the research question . Researchers can begin assigning codes to specific chunks of data to capture emerging themes.

Categorizing and coding: Each identified theme can further be broken down into sub-themes or brought together under categories. At this stage, researchers can continue coding (or recoding ) their data according to these themes or categories.

Refining the framework: As the analysis progresses, the initial themes represented by your coding framework may need adjustments. It's an iterative process, where the framework can be continually refined to better fit the data.

Indexing and charting the data

Once the framework is established, the next phase involves systematically applying it to the data.

Indexing: Using the resulting coding framework , systematically apply codes to the relevant portions of the data. This ensures every relevant piece of data is categorized under the appropriate theme or sub-theme.

Charting: This step involves creating charts or matrices for each theme. Data from different sources (like interviews or focus groups ) is summarized under the relevant theme. For example, a table can be created with each theme in a column and each data source in a row, and researchers can then populate the cells with relevant data extracts or notes. These charts provide a visual representation , allowing researchers to easily see patterns or discrepancies in the data.

Mapping and interpretation: With the data systematically charted, researchers can begin to map the relationships between themes and interpret the broader implications . This step is where the true essence of the research emerges, as researchers link the patterns in the data to the broader objectives of the study.
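The charting step above can be sketched as a simple data structure. The following is a hypothetical illustration, not from any real study: theme names, source names, and extracts are invented for the example, and a real project would typically use a spreadsheet or CAQDAS package rather than hand-rolled code.

```python
# Sketch of the charting step: a framework matrix with one column per
# theme and one row per data source, populated with summarized extracts.
# All theme/source names and extracts below are illustrative only.

themes = ["workload", "peer support", "career plans"]
sources = ["interview_01", "interview_02", "focus_group_A"]

# Initialise an empty matrix: {source: {theme: [summaries]}}
matrix = {source: {theme: [] for theme in themes} for source in sources}

def chart(source, theme, summary):
    """Enter a summarized data extract into the appropriate cell."""
    matrix[source][theme].append(summary)

chart("interview_01", "workload", "Describes long shifts with little rest")
chart("interview_01", "peer support", "Relies on informal peer debriefs")
chart("focus_group_A", "workload", "Group agrees workload peaked in winter")

# Reading down a "column" compares sources within one theme;
# reading across a "row" keeps each source's account in context.
workload_column = [matrix[s]["workload"] for s in sources]
```

The point of the structure is the two directions of reading: comparison across cases within a theme, and the preservation of each case's full account across themes.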

Framework analysis is an involved process, with intentional decision-making at every step of the way. As a result, implementing structured qualitative methodologies such as framework analysis requires patience, meticulous attention to detail, and a clear understanding of the research objectives. When conducted diligently, it offers a transparent and systematic approach to analyzing qualitative data , ensuring the research not only has depth but also clarity.

Whether comparing data across multiple sources or drilling down into the nuances of individual narratives, framework analysis equips researchers with the tools needed to derive meaningful insights from their qualitative data . As more researchers across disciplines recognize its value, it stands to become an even more integral part of the research landscape.



Using Framework Analysis in nursing research: a worked example

Affiliation.

  • 1 School of Nursing, Midwifery & Social Work, University of Manchester, UK.
  • PMID: 23517523
  • DOI: 10.1111/jan.12127

Aims: To demonstrate Framework Analysis using a worked example and to illustrate how criticisms of qualitative data analysis including issues of clarity and transparency can be addressed.

Background: Critics of the analysis of qualitative data sometimes cite lack of clarity and transparency about analytical procedures; this can deter nurse researchers from undertaking qualitative studies. Framework Analysis is flexible, systematic, and rigorous, offering clarity, transparency, an audit trail, an option for theme-based and case-based analysis and for readily retrievable data. This paper offers further explanation of the process undertaken which is illustrated with a worked example.

Data source and research design: Data were collected from 31 nursing students in 2009 using semi-structured interviews.

Discussion: The data collected are not reported directly here but used as a worked example for the five steps of Framework Analysis. Suggestions are provided to guide researchers through essential steps in undertaking Framework Analysis. The benefits and limitations of Framework Analysis are discussed.

Implications for nursing: Nurses increasingly use qualitative research methods and need to use an analysis approach that offers transparency and rigour which Framework Analysis can provide. Nurse researchers may find the detailed critique of Framework Analysis presented in this paper a useful resource when designing and conducting qualitative studies.

Conclusion: Qualitative data analysis presents challenges in relation to the volume and complexity of data obtained and the need to present an 'audit trail' for those using the research findings. Framework Analysis is an appropriate, rigorous and systematic method for undertaking qualitative analysis.

Keywords: Framework Analysis; nursing; qualitative data analysis.

© 2013 Blackwell Publishing Ltd.




BMC Med Res Methodol

Using the framework method for the analysis of qualitative data in multi-disciplinary health research

Nicola K Gale

1 Health Services Management Centre, University of Birmingham, Park House, 40 Edgbaston Park Road, Birmingham B15 2RT, UK

Gemma Heath

2 School of Health and Population Sciences, University of Birmingham, Edgbaston, Birmingham B15 2TT, UK

Elaine Cameron

3 School of Life and Health Sciences, Aston University, Aston Triangle, Birmingham B4 7ET, UK

Sabina Rashid

4 East and North Hertfordshire NHS Trust, Lister Hospital, Coreys Mill Lane, Stevenage SG1 4AB, UK

Sabi Redwood

Abstract

The Framework Method is becoming an increasingly popular approach to the management and analysis of qualitative data in health research. However, there is confusion about its potential application and limitations.

The article discusses when it is appropriate to adopt the Framework Method and explains the procedure for using it in multi-disciplinary health research teams, or those that involve clinicians, patients and lay people. The stages of the method are illustrated using examples from a published study.

Used effectively, with the leadership of an experienced qualitative researcher, the Framework Method is a systematic and flexible approach to analysing qualitative data and is appropriate for use in research teams even where not all members have previous experience of conducting qualitative research.

The Framework Method for the management and analysis of qualitative data has been used since the 1980s [ 1 ]. The method originated in large-scale social policy research but is becoming an increasingly popular approach in medical and health research; however, there is some confusion about its potential application and limitations. In this article we discuss when it is appropriate to use the Framework Method and how it compares to other qualitative analysis methods. In particular, we explore how it can be used in multi-disciplinary health research teams. Multi-disciplinary and mixed methods studies are becoming increasingly commonplace in applied health research. As well as disciplines familiar with qualitative research, such as nursing, psychology and sociology, teams often include epidemiologists, health economists, management scientists and others. Furthermore, applied health research often has clinical representation and, increasingly, patient and public involvement [ 2 ]. We argue that while leadership is undoubtedly required from an experienced qualitative methodologist, non-specialists from the wider team can and should be involved in the analysis process. We then present a step-by-step guide to the application of the Framework Method, illustrated using a worked example (See Additional File 1 ) from a published study [ 3 ] to illustrate the main stages of the process. Technical terms are included in the glossary (below). Finally, we discuss the strengths and limitations of the approach.

Glossary of key terms used in the Framework Method

Analytical framework: A set of codes organised into categories that have been jointly developed by researchers involved in analysis that can be used to manage and organise the data. The framework creates a new structure for the data (rather than the full original accounts given by participants) that is helpful to summarize/reduce the data in a way that can support answering the research questions.

Analytic memo: A written investigation of a particular concept, theme or problem, reflecting on emerging issues in the data that captures the analytic process (see Additional file 1 , Section 7).

Categories: During the analysis process, codes are grouped into clusters around similar and interrelated ideas or concepts. Categories and codes are usually arranged in a tree diagram structure in the analytical framework. While categories are closely and explicitly linked to the raw data, developing categories is a way to start the process of abstraction of the data (i.e. towards the general rather than the specific or anecdotal).

Charting: Entering summarized data into the Framework Method matrix (see Additional File 1 , Section 6).

Code: A descriptive or conceptual label that is assigned to excerpts of raw data in a process called ‘coding’ (see Additional File 1 , Section 3).

Data: Qualitative data usually needs to be in textual form before analysis. These texts can either be elicited texts (written specifically for the research, such as food diaries), or extant texts (pre-existing texts, such as meeting minutes, policy documents or weblogs), or can be produced by transcribing interview or focus group data, or creating ‘field’ notes while conducting participant-observation or observing objects or social situations.

Indexing: The systematic application of codes from the agreed analytical framework to the whole dataset (see Additional File 1 , Section 5).

Matrix: A spreadsheet containing numerous cells into which summarized data are entered by codes (columns) and cases (rows) (see Additional File 1 , Section 6).

Themes: Interpretive concepts or propositions that describe or explain aspects of the data, which are the final output of the analysis of the whole dataset. Themes are articulated and developed by interrogating data categories through comparison between and within cases. Usually a number of categories would fall under each theme or sub-theme [ 3 ].

Transcript: A written verbatim (word-for-word) account of a verbal interaction, such as an interview or conversation.

The Framework Method sits within a broad family of analysis methods often termed thematic analysis or qualitative content analysis. These approaches identify commonalities and differences in qualitative data, before focusing on relationships between different parts of the data, thereby seeking to draw descriptive and/or explanatory conclusions clustered around themes. The Framework Method was developed by researchers, Jane Ritchie and Liz Spencer, from the Qualitative Research Unit at the National Centre for Social Research in the United Kingdom in the late 1980s for use in large-scale policy research [ 1 ]. It is now used widely in other areas, including health research [ 3 - 12 ]. Its defining feature is the matrix output: rows (cases), columns (codes) and ‘cells’ of summarised data, providing a structure into which the researcher can systematically reduce the data, in order to analyse it by case and by code [ 1 ]. Most often a ‘case’ is an individual interviewee, but this can be adapted to other units of analysis, such as predefined groups or organisations. While in-depth analyses of key themes can take place across the whole data set, the views of each research participant remain connected to other aspects of their account within the matrix so that the context of the individual’s views is not lost. Comparing and contrasting data is vital to qualitative analysis and the ability to compare with ease data across cases as well as within individual cases is built into the structure and process of the Framework Method.

The Framework Method provides clear steps to follow and produces highly structured outputs of summarised data. It is therefore useful where multiple researchers are working on a project, particularly in multi-disciplinary research teams where not all members have experience of qualitative data analysis, and for managing large data sets where obtaining a holistic, descriptive overview of the entire data set is desirable. However, caution is recommended before selecting the method as it is not a suitable tool for analysing all types of qualitative data or for answering all qualitative research questions, nor is it an ‘easy’ version of qualitative research for quantitative researchers. Importantly, the Framework Method cannot accommodate highly heterogeneous data, i.e. data must cover similar topics or key issues so that it is possible to categorize it. Individual interviewees may, of course, have very different views or experiences in relation to each topic, which can then be compared and contrasted. The Framework Method is most commonly used for the thematic analysis of semi-structured interview transcripts, which is what we focus on in this article, although it could, in principle, be adapted for other types of textual data [ 13 ], including documents, such as meeting minutes or diaries [ 12 ], or field notes from observations [ 10 ].

For quantitative researchers working with qualitative colleagues or when exploring qualitative research for the first time, the nature of the Framework Method is seductive because its methodical processes and ‘spreadsheet’ approach seem more closely aligned to the quantitative paradigm [ 14 ]. Although the Framework Method is a highly systematic method of categorizing and organizing what may seem like unwieldy qualitative data, it is not a panacea for problematic issues commonly associated with qualitative data analysis such as how to make analytic choices and make interpretive strategies visible and auditable. Qualitative research skills are required to appropriately interpret the matrix, and facilitate the generation of descriptions, categories, explanations and typologies. Moreover, reflexivity, rigour and quality are issues that are requisite in the Framework Method just as they are in other qualitative methods. It is therefore essential that studies using the Framework Method for analysis are overseen by an experienced qualitative researcher, though this does not preclude those new to qualitative research from contributing to the analysis as part of a wider research team.

There are a number of approaches to qualitative data analysis, including those that pay close attention to language and how it is being used in social interaction such as discourse analysis [ 15 ] and ethnomethodology [ 16 ]; those that are concerned with experience, meaning and language such as phenomenology [ 17 , 18 ] and narrative methods [ 19 ]; and those that seek to develop theory derived from data through a set of procedures and interconnected stages such as Grounded Theory [ 20 , 21 ]. Many of these approaches are associated with specific disciplines and are underpinned by philosophical ideas which shape the process of analysis [ 22 ]. The Framework Method, however, is not aligned with a particular epistemological, philosophical, or theoretical approach. Rather it is a flexible tool that can be adapted for use with many qualitative approaches that aim to generate themes.

The development of themes is a common feature of qualitative data analysis, involving the systematic search for patterns to generate full descriptions capable of shedding light on the phenomenon under investigation. In particular, many qualitative approaches use the ‘constant comparative method’ , developed as part of Grounded Theory, which involves making systematic comparisons across cases to refine each theme [ 21 , 23 ]. Unlike Grounded Theory, the Framework Method is not necessarily concerned with generating social theory, but can greatly facilitate constant comparative techniques through the review of data across the matrix.

Perhaps because the Framework Method is so obviously systematic, it has often, as other commentators have noted, been conflated with a deductive approach to qualitative analysis [ 13 , 14 ]. However, the tool itself has no allegiance to either inductive or deductive thematic analysis; where the research sits along this inductive-deductive continuum depends on the research question. A question such as, ‘Can patients give an accurate biomedical account of the onset of their cardiovascular disease?’ is essentially a yes/no question (although it may be nuanced by the extent of their account or by appropriate use of terminology) and so requires a deductive approach to both data collection and analysis (e.g. structured or semi-structured interviews and directed qualitative content analysis [ 24 ]). Similarly, a deductive approach may be taken if basing analysis on a pre-existing theory, such as behaviour change theories, for example in the case of a research question such as ‘How does the Theory of Planned Behaviour help explain GP prescribing?’ [ 11 ]. However, a research question such as, ‘How do people construct accounts of the onset of their cardiovascular disease?’ would require a more inductive approach that allows for the unexpected, and permits more socially-located responses [ 25 ] from interviewees that may include matters of cultural beliefs, habits of food preparation, concepts of ‘fate’, or links to other important events in their lives, such as grief, which cannot be predicted by the researcher in advance (e.g. an interviewee-led open ended interview and grounded theory [ 20 ]). In all these cases, it may be appropriate to use the Framework Method to manage the data. 
The difference would become apparent in how themes are selected: in the deductive approach, themes and codes are pre-selected based on previous literature, previous theories or the specifics of the research question; whereas in the inductive approach, themes are generated from the data though open (unrestricted) coding, followed by refinement of themes. In many cases, a combined approach is appropriate when the project has some specific issues to explore, but also aims to leave space to discover other unexpected aspects of the participants’ experience or the way they assign meaning to phenomena. In sum, the Framework Method can be adapted for use with deductive, inductive, or combined types of qualitative analysis. However, there are some research questions where analysing data by case and theme is not appropriate and so the Framework Method should be avoided. For instance, depending on the research question, life history data might be better analysed using narrative analysis [ 19 ]; recorded consultations between patients and their healthcare practitioners using conversation analysis [ 26 ]; and documentary data, such as resources for pregnant women, using discourse analysis [ 27 ].
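The distinction between pre-selected and data-driven codes can be made concrete with a small sketch. This is a hypothetical illustration only: the code lists are invented (the deductive list loosely echoes Theory of Planned Behaviour constructs mentioned above), and the "refinement" step is reduced to merging near-duplicate open codes.

```python
# Illustrative sketch of where codes come from on the
# deductive-inductive continuum; all code names are invented.

# Deductive: codes pre-selected from an existing theory or the
# research question before the data are examined.
deductive_codes = ["attitude", "subjective norm", "perceived control"]

# Inductive: open (unrestricted) coding generates codes from the data
# itself; refinement then collapses near-duplicates into shared themes.
open_codes = ["fate", "grief", "food preparation", "fatalism"]

def refine(codes, merge_map):
    """Collapse near-duplicate open codes into refined themes."""
    return sorted({merge_map.get(code, code) for code in codes})

refined = refine(open_codes, {"fatalism": "fate"})
# refined == ["fate", "food preparation", "grief"]
```

A combined approach would simply start from the deductive list and allow open coding to add unexpected codes alongside it.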

It is not within the scope of this paper to consider study design or data collection in any depth, but before moving on to describe the Framework Method analysis process, it is worth taking a step back to consider briefly what needs to happen before analysis begins. The selection of analysis method should have been considered at the proposal stage of the research and should fit with the research questions and overall aims of the study. Many qualitative studies, particularly ones using inductive analysis, are emergent in nature; this can be a challenge and the researchers can only provide an “imaginative rehearsal” of what is to come [ 28 ]. In mixed methods studies, the role of the qualitative component within the wider goals of the project must also be considered. In the data collection stage, resources must be allocated for properly trained researchers to conduct the qualitative interviewing because it is a highly skilled activity. In some cases, a research team may decide that they would like to use lay people, patients or peers to do the interviews [ 29 - 32 ] and in this case they must be properly trained and mentored which requires time and resources. At this early stage it is also useful to consider whether the team will use Computer Assisted Qualitative Data Analysis Software (CAQDAS), which can assist with data management and analysis.

As any form of qualitative or quantitative analysis is not a purely technical process, but influenced by the characteristics of the researchers and their disciplinary paradigms, critical reflection throughout the research process is paramount, including in the design of the study, the construction or collection of data, and the analysis. All members of the team should keep a research diary, where they record reflexive notes, impressions of the data and thoughts about analysis throughout the process. Experienced qualitative researchers become more skilled at sifting through data and analysing it in a rigorous and reflexive way. They cannot be too attached to certainty, but must remain flexible and adaptive throughout the research in order to generate rich and nuanced findings that embrace and explain the complexity of real social life and can be applied to complex social issues. It is important to remember when using the Framework Method that, unlike quantitative research where data collection and data analysis are strictly sequential and mutually exclusive stages of the research process, in qualitative analysis there is, to a greater or lesser extent depending on the project, ongoing interplay between data collection, analysis, and theory development. For example, new ideas or insights from participants may suggest potentially fruitful lines of enquiry, or close analysis might reveal subtle inconsistencies in an account which require further exploration.

Procedure for analysis

Stage 1: transcription.

A good quality audio recording and, ideally, a verbatim (word for word) transcription of the interview is needed. For Framework Method analysis, it is not necessarily important to include the conventions of dialogue transcriptions which can be difficult to read (e.g. pauses or two people talking simultaneously), because the content is what is of primary interest. Transcripts should have large margins and adequate line spacing for later coding and making notes. The process of transcription is a good opportunity to become immersed in the data and is to be strongly encouraged for new researchers. However, in some projects, the decision may be made that it is a better use of resources to outsource this task to a professional transcriber.

Stage 2: Familiarisation with the interview

Becoming familiar with the whole interview using the audio recording and/or transcript and any contextual or reflective notes that were recorded by the interviewer is a vital stage in interpretation. It can also be helpful to re-listen to all or parts of the audio recording. In multi-disciplinary or large research projects, those involved in analysing the data may be different from those who conducted or transcribed the interviews, which makes this stage particularly important. One margin can be used to record any analytical notes, thoughts or impressions.

Stage 3: Coding

After familiarisation, the researcher carefully reads the transcript line by line, applying a paraphrase or label (a ‘code’) that describes what they have interpreted in the passage as important. In more inductive studies, ‘open coding’ takes place at this stage, i.e. coding anything that might be relevant from as many different perspectives as possible. Codes could refer to substantive things (e.g. particular behaviours, incidents or structures), values (e.g. those that inform or underpin certain statements, such as a belief in evidence-based medicine or in patient choice), emotions (e.g. sorrow, frustration, love) and more impressionistic/methodological elements (e.g. interviewee found something difficult to explain, interviewee became emotional, interviewer felt uncomfortable) [ 33 ].

In purely deductive studies, the codes may have been pre-defined (e.g. by an existing theory, or by specific areas of interest to the project), so this stage may not be strictly necessary and you could move straight on to indexing. However, even when taking a broadly deductive approach, it is generally helpful to do some open coding on at least a few of the transcripts to ensure that important aspects of the data are not missed.

Coding aims to classify all of the data so that it can be compared systematically with other parts of the data set. If feasible, at least two researchers (or at least one from each discipline or speciality in a multi-disciplinary research team) should independently code the first few transcripts. Patients, public involvement representatives or clinicians can also be productively involved at this stage, because they offer alternative viewpoints and thus ensure that no single perspective dominates. In inductive coding it is vital to look out for the unexpected and not just to code in a literal, descriptive way, so the involvement of people with different perspectives can aid greatly in this.
As well as providing a holistic impression of what was said, coding line by line often alerts the researcher to content that might otherwise remain invisible because it is not clearly expressed or does not ‘fit’ with the rest of the account. In this way the developing analysis is challenged; reconciling and explaining anomalies in the data can make the analysis stronger. Coding can also be done digitally using Computer Assisted Qualitative Data Analysis Software (CAQDAS), which is a useful way of keeping track of new codes automatically. However, some researchers prefer to do the early stages of coding with paper and pen, and only start to use CAQDAS once they reach Stage 5 (see below).

Stage 4: Developing a working analytical framework

After coding the first few transcripts, all researchers involved should meet to compare the labels they have applied and agree on a set of codes to apply to all subsequent transcripts. Codes can be grouped together into categories (using a tree diagram if helpful), which are then clearly defined. This forms a working analytical framework. It is likely that several iterations of the analytical framework will be required before no additional codes emerge. It is always worth having an ‘other’ code under each category to avoid ignoring data that does not fit; the analytical framework is never ‘final’ until the last transcript has been coded.

Stage 5: Applying the analytical framework

The working analytical framework is then applied by indexing subsequent transcripts using the existing categories and codes. Each code is usually assigned a number or abbreviation for easy identification (so that the full names of the codes do not have to be written out each time) and written directly onto the transcripts. CAQDAS is particularly useful at this stage because it speeds up the process and ensures that, at later stages, data are easily retrievable. It is worth noting that, unlike software for statistical analyses, which actually carries out calculations when given the correct instructions, putting the data into a qualitative analysis software package does not analyse the data; it is simply an effective way of storing and organising the data so that they are accessible for the analysis process.
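As a concrete illustration of the indexing logic described above, the sketch below shows one minimal way it could be represented in Python. All code names, IDs, transcripts and line ranges here are invented examples, not part of any real study.

```python
# Hypothetical working analytical framework: each code gets a short ID
# ("category.code") so it can be written against transcript passages.
framework = {
    "1.1": "Experience of diagnosis",
    "1.2": "Communication with staff",
    "2.1": "Barriers to treatment",
}

# Indexed passages: (transcript id, (start line, end line), code id).
index = [
    ("T01", (12, 18), "1.1"),
    ("T01", (19, 25), "2.1"),
    ("T02", (4, 9), "1.1"),
]

def retrieve(code_id):
    """Return all passages indexed under a given code, across transcripts."""
    return [(transcript, lines) for transcript, lines, code in index if code == code_id]

print(retrieve("1.1"))  # passages about "Experience of diagnosis" from T01 and T02
```

This is exactly the kind of storage-and-retrieval bookkeeping that CAQDAS packages automate; the software does no more "analysis" than the lookup function above does.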

Stage 6: Charting data into the framework matrix

Qualitative data are voluminous (an hour of interview can generate 15–30 pages of text), so being able to manage and summarize (reduce) the data is a vital aspect of the analysis process. A spreadsheet is used to generate a matrix and the data are ‘charted’ into it. Charting involves summarizing the data by category from each transcript. Good charting requires striking a balance between reducing the data on the one hand and retaining the original meanings and ‘feel’ of the interviewees’ words on the other. The chart should include references to interesting or illustrative quotations. These can be tagged automatically if you are using CAQDAS to manage your data (NVivo version 9 onwards can generate framework matrices); otherwise a capital ‘Q’, an (anonymized) transcript number, and a page and line reference will suffice. In multi-disciplinary teams it is helpful to compare and contrast styles of summarizing in the early stages of the analysis process to ensure consistency within the team, and any abbreviations used should be agreed by the team. Once members of the team are familiar with the analytical framework and well practised at coding and charting, it will take, on average, about half a day per hour-long transcript to reach this stage; in the early stages it takes much longer.
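The charting step can be sketched as building a case-by-category matrix and exporting it to a spreadsheet. In the sketch below, the cases, categories, cell summaries and quote tags are all invented for illustration; only the structure (one row per case, one column per framework category, tagged quotes in cells) follows the method.

```python
import csv

# Hypothetical categories from the analytical framework.
categories = ["Experience of diagnosis", "Barriers to treatment"]

# One row per case; cells hold short summaries plus a tagged quote
# reference (Q + transcript, page, line), as described in Stage 6.
matrix = {
    "T01": {
        "Experience of diagnosis": "Felt dismissed at first visit (Q T01 p.3 l.45)",
        "Barriers to treatment": "Travel costs; clinic hours (Q T01 p.7 l.102)",
    },
    "T02": {
        "Experience of diagnosis": "Relief at having a name for symptoms (Q T02 p.2 l.20)",
        "Barriers to treatment": "",  # an empty cell is itself analytically interesting
    },
}

with open("framework_matrix.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["Case"] + categories)
    for case, row in matrix.items():
        writer.writerow([case] + [row.get(c, "") for c in categories])
```

The resulting spreadsheet gives exactly the visually straightforward overview discussed later: patterns, contradictions and empty cells become easy to spot across cases.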

Stage 7: Interpreting the data

It is useful throughout the research to have a separate notebook or computer file to note down impressions, ideas and early interpretations of the data. It may be worth breaking off at any stage to explore an interesting idea, concept or potential theme by writing an analytic memo [ 20 , 21 ] to then discuss with other members of the research team, including lay and clinical members. Gradually, characteristics of and differences between the data are identified, perhaps generating typologies, interrogating theoretical concepts (either prior concepts or ones emerging from the data) or mapping connections between categories to explore relationships and/or causality. If the data are rich enough, the findings generated through this process can go beyond description of particular cases to explanation of, for example, reasons for the emergence of a phenomenon, predicting how an organisation or other social actor is likely to instigate or respond to a situation, or identifying areas that are not functioning well within an organisation or system. It is worth noting that this stage often takes longer than anticipated and that any project plan should ensure that sufficient time is allocated to meetings and individual researcher time to conduct interpretation and writing up of findings (see Additional file 1 , Section 7).

The Framework Method has been developed and used successfully in research for over 25 years, and has recently become a popular analysis method in qualitative health research. The issue of how to assess quality in qualitative research has been highly debated [ 20 , 34 - 40 ], but ensuring rigour and transparency in analysis is a vital component. There are, of course, many ways to do this but in the Framework Method the following are helpful:

•Summarizing the data during charting, as well as being a practical way to reduce the data, means that all members of a multi-disciplinary team, including lay, clinical and (quantitative) academic members can engage with the data and offer their perspectives during the analysis process without necessarily needing to read all the transcripts or be involved in the more technical parts of analysis.

•Charting also ensures that researchers pay close attention to describing the data using each participant’s own subjective frames and expressions in the first instance, before moving onto interpretation.

•The summarized data is kept within the wider context of each case, thereby encouraging thick description that pays attention to complex layers of meaning and understanding [ 38 ].

•The matrix structure is visually straightforward and can facilitate recognition of patterns in the data by any member of the research team, including through drawing attention to contradictory data, deviant cases or empty cells.

•The systematic procedure (described in this article) makes it easy to follow, even for multi-disciplinary teams and/or with large data sets.

•It is flexible enough that non-interview data (such as field notes taken during the interview or reflexive considerations) can be included in the matrix.

•It is not aligned with a particular epistemological viewpoint or theoretical approach and therefore can be adapted for use in inductive or deductive analysis or a combination of the two (e.g. using pre-existing theoretical constructs deductively, then revising the theory with inductive aspects; or using an inductive approach to identify themes in the data, before returning to the literature and using theories deductively to help further explain certain themes).

•It is easy to identify relevant data extracts to illustrate themes and to check whether there is sufficient evidence for a proposed theme.

•Finally, there is a clear audit trail from original raw data to final themes, including the illustrative quotes.

There are also a number of potential pitfalls to this approach:

•The systematic approach and matrix format are, as we noted in the background, intuitively appealing to those trained quantitatively, but the ‘spreadsheet’ look perhaps increases the temptation for those without an in-depth understanding of qualitative research to attempt to quantify qualitative data (e.g. “13 out of 20 participants said X”). This kind of statement is meaningless because sampling in qualitative research is not designed to be representative of a wider population; it is purposive, to capture diversity around a phenomenon [ 41 ].

•Like all qualitative analysis methods, the Framework Method is time consuming and resource-intensive. When involving multiple stakeholders and disciplines in the analysis and interpretation of the data, as is good practice in applied health research, the time needed is extended. This time needs to be factored into the project proposal at the pre-funding stage.

•There is a high training component to successfully using the method in a new multi-disciplinary team. Depending on their role in the analysis, members of the research team may have to learn how to code, index, and chart data, to think reflexively about how their identities and experience affect the analysis process, and/or they may have to learn about the methods of generalisation (i.e. analytic generalisation and transferability, rather than statistical generalisation [ 41 ]) to help to interpret legitimately the meaning and significance of the data.

While the Framework Method is amenable to the participation of non-experts in data analysis, it is critical to the successful use of the method that an experienced qualitative researcher leads the project (even if the overall lead for a large mixed methods study is a different person). The qualitative lead would ideally be joined by other researchers with at least some prior training in or experience of qualitative analysis. The responsibilities of the lead qualitative researcher are: to contribute to study design, project timelines and resource planning; to mentor junior qualitative researchers; to train clinical, lay and other (non-qualitative) academics to contribute as appropriate to the analysis process; to facilitate analysis meetings in a way that encourages critical and reflexive engagement with the data and other team members; and finally to lead the write-up of the study.

We have argued that Framework Method studies can be conducted by multi-disciplinary research teams that include, for example, healthcare professionals, psychologists, sociologists, economists, and lay people/service users. The inclusion of so many different perspectives means that decision-making in the analysis process can be very time consuming and resource-intensive. It may require extensive, reflexive and critical dialogue about how the ideas expressed by interviewees and identified in the transcript are related to pre-existing concepts and theories from each discipline, and to the real ‘problems’ in the health system that the project is addressing. This kind of team effort is, however, an excellent forum for driving forward interdisciplinary collaboration, as well as clinical and lay involvement in research, to ensure that ‘the whole is greater than the sum of the parts’, by enhancing the credibility and relevance of the findings.

The Framework Method is appropriate for thematic analysis of textual data, particularly interview transcripts, where it is important to be able to compare and contrast data by themes across many cases, while also situating each perspective in context by retaining the connection to other aspects of each individual’s account. Experienced qualitative researchers should lead and facilitate all aspects of the analysis, although the Framework Method’s systematic approach makes it suitable for involving all members of a multi-disciplinary team. An open, critical and reflexive approach from all team members is essential for rigorous qualitative analysis.

Acceptance of the complexity of real life health systems and the existence of multiple perspectives on health issues is necessary to produce high quality qualitative research. If done well, qualitative studies can shed explanatory and predictive light on important phenomena, relate constructively to quantitative parts of a larger study, and contribute to the improvement of health services and development of health policy. The Framework Method, when selected and implemented appropriately, can be a suitable tool for achieving these aims through producing credible and relevant findings.

•The Framework Method is an excellent tool for supporting thematic (qualitative content) analysis because it provides a systematic model for managing and mapping the data.

•The Framework Method is most suitable for analysis of interview data, where it is desirable to generate themes by making comparisons within and between cases.

•The management of large data sets is facilitated by the Framework Method as its matrix form provides an intuitively structured overview of summarised data.

•The clear, step-by-step process of the Framework Method makes it suitable for interdisciplinary and collaborative projects.

•The use of the method should be led and facilitated by an experienced qualitative researcher.

Competing interests

The authors declare that they have no competing interests.

Authors’ contributions

All authors were involved in the development of the concept of the article and drafting the article. NG wrote the first draft of the article, GH and EC prepared the text and figures related to the illustrative example, SRa did the literature search to identify if there were any similar articles currently available and contributed to drafting of the article, and SRe contributed to drafting of the article and the illustrative example. All authors read and approved the final manuscript.

Pre-publication history

The pre-publication history for this paper can be accessed here:

http://www.biomedcentral.com/1471-2288/13/117/prepub

Supplementary Material

Illustrative Example of the use of the Framework Method.

Acknowledgments

All authors were funded by the National Institute for Health Research (NIHR) through the Collaborations for Leadership in Applied Health Research and Care for Birmingham and Black Country (CLAHRC-BBC) programme. The views in this publication expressed are those of the authors and not necessarily those of the NHS, the NIHR or the Department of Health.

References

  • Ritchie J, Lewis J. Qualitative research practice: a guide for social science students and researchers. London: Sage; 2003.
  • Ives J, Damery S, Redwod S. PPI, paradoxes and Plato: who's sailing the ship? J Med Ethics. 2013;39(3):181–185. doi:10.1136/medethics-2011-100150.
  • Heath G, Cameron E, Cummins C, Greenfield S, Pattison H, Kelly D, Redwood S. Paediatric ‘care closer to home’: stake-holder views and barriers to implementation. Health Place. 2012;18(5):1068–1073. doi:10.1016/j.healthplace.2012.05.003.
  • Elkington H, White P, Addington-Hall J, Higgs R, Petternari C. The last year of life of COPD: a qualitative study of symptoms and services. Respir Med. 2004;98(5):439–445. doi:10.1016/j.rmed.2003.11.006.
  • Murtagh J, Dixey R, Rudolf M. A qualitative investigation into the levers and barriers to weight loss in children: opinions of obese children. Arch Dis Child. 2006;91(11):920–923. doi:10.1136/adc.2005.085712.
  • Barnard M, Webster S, O’Connor W, Jones A, Donmall M. The drug treatment outcomes research study (DTORS): qualitative study. London: Home Office; 2009.
  • Ayatollahi H, Bath PA, Goodacre S. Factors influencing the use of IT in the emergency department: a qualitative study. Health Inform J. 2010;16(3):189–200. doi:10.1177/1460458210377480.
  • Sheard L, Prout H, Dowding D, Noble S, Watt I, Maraveyas A, Johnson M. Barriers to the diagnosis and treatment of venous thromboembolism in advanced cancer patients: a qualitative study. Palliative Med. 2012;27(2):339–348.
  • Ellis J, Wagland R, Tishelman C, Williams ML, Bailey CD, Haines J, Caress A, Lorigan P, Smith JA, Booton R, et al. Considerations in developing and delivering a nonpharmacological intervention for symptom management in lung cancer: the views of patients and informal caregivers. J Pain Symptom Manag. 2012;44(6):831–842. doi:10.1016/j.jpainsymman.2011.12.274.
  • Gale N, Sultan H. Telehealth as ‘peace of mind’: embodiment, emotions and the home as the primary health space for people with chronic obstructive pulmonary disorder. Health Place. 2013;21:140–147.
  • Rashidian A, Eccles MP, Russell I. Falling on stony ground? A qualitative study of implementation of clinical guidelines’ prescribing recommendations in primary care. Health Policy. 2008;85(2):148–161. doi:10.1016/j.healthpol.2007.07.011.
  • Jones RK. The unsolicited diary as a qualitative research tool for advanced research capacity in the field of health and illness. Qual Health Res. 2000;10(4):555–567. doi:10.1177/104973200129118543.
  • Pope C, Ziebland S, Mays N. Analysing qualitative data. BMJ. 2000;320:114–116. doi:10.1136/bmj.320.7227.114.
  • Pope C, Mays N. Critical reflections on the rise of qualitative research. BMJ. 2009;339:737–739.
  • Fairclough N. Critical discourse analysis: the critical study of language. London: Longman; 2010.
  • Garfinkel H. Ethnomethodology’s program. Soc Psychol Q. 1996;59(1):5–21. doi:10.2307/2787116.
  • Merleau-Ponty M. The phenomenology of perception. London: Routledge and Kegan Paul; 1962.
  • Svenaeus F. The phenomenology of health and illness. In: Handbook of phenomenology and medicine. Netherlands: Springer; 2001. pp. 87–108.
  • Riessman CK. Narrative methods for the human sciences. London: Sage; 2008.
  • Charmaz K. Constructing grounded theory: a practical guide through qualitative analysis. London: Sage; 2006.
  • Glaser BG, Strauss AL. The discovery of grounded theory. Chicago: Aldine; 1967.
  • Crotty M. The foundations of social research: meaning and perspective in the research process. London: Sage; 1998.
  • Boeije H. A purposeful approach to the constant comparative method in the analysis of qualitative interviews. Qual Quant. 2002;36(4):391–409. doi:10.1023/A:1020909529486.
  • Hsieh H-F, Shannon SE. Three approaches to qualitative content analysis. Qual Health Res. 2005;15(9):1277–1288. doi:10.1177/1049732305276687.
  • Redwood S, Gale NK, Greenfield S. ‘You give us rangoli, we give you talk’: using an art-based activity to elicit data from a seldom heard group. BMC Med Res Methodol. 2012;12(1):7. doi:10.1186/1471-2288-12-7.
  • Mishler EG. The struggle between the voice of medicine and the voice of the lifeworld. In: Conrad P, Kern R, editors. The sociology of health and illness: critical perspectives. 3rd ed. New York: St Martins Press; 1990.
  • Hodges BD, Kuper A, Reeves S. Discourse analysis. BMJ. 2008;337:570–572. doi:10.1136/bmj.39370.701782.DE.
  • Sandelowski M, Barroso J. Writing the proposal for a qualitative research methodology project. Qual Health Res. 2003;13(6):781–820. doi:10.1177/1049732303013006003.
  • Ellins J. It’s better together: involving older people in research. HSMC Newsletter Focus Serv Users Publ. 2010;16(1):4.
  • Phillimore J, Goodson L, Hennessy D, Ergun E. Empowering Birmingham’s migrant and refugee community organisations: making a difference. York: Joseph Rowntree Foundation; 2009.
  • Leamy M, Clough R. How older people became researchers. York: Joseph Rowntree Foundation; 2006.
  • Glasby J, Miller R, Ellins J, Durose J, Davidson D, McIver S, Littlechild R, Tanner D, Snelling I, Spence K. Understanding and improving transitions of older people: a user and carer centred approach. Final report, NIHR Service Delivery and Organisation programme. London: The Stationery Office; 2012.
  • Saldaña J. The coding manual for qualitative researchers. London: Sage; 2009.
  • Lincoln YS. Emerging criteria for quality in qualitative and interpretive research. Qual Inquiry. 1995;1(3):275–289. doi:10.1177/107780049500100301.
  • Mays N, Pope C. Qualitative research in health care: assessing quality in qualitative research. BMJ. 2000;320(7226):50. doi:10.1136/bmj.320.7226.50.
  • Seale C. Quality in qualitative research. Qual Inquiry. 1999;5(4):465–478. doi:10.1177/107780049900500402.
  • Dingwall R, Murphy E, Watson P, Greatbatch D, Parker S. Catching goldfish: quality in qualitative research. J Health Serv Res Policy. 1998;3(3):167–172.
  • Popay J, Rogers A, Williams G. Rationale and standards for the systematic review of qualitative literature in health services research. Qual Health Res. 1998;8(3):341–351. doi:10.1177/104973239800800305.
  • Morse JM, Barrett M, Mayan M, Olson K, Spiers J. Verification strategies for establishing reliability and validity in qualitative research. Int J Qual Methods. 2008;1(2):13–22.
  • Smith JA. Reflecting on the development of interpretative phenomenological analysis and its contribution to qualitative research in psychology. Qual Res Psychol. 2004;1(1):39–54.
  • Polit DF, Beck CT. Generalization in quantitative and qualitative research: myths and strategies. Int J Nurs Studies. 2010;47(11):1451–1458. doi:10.1016/j.ijnurstu.2010.06.004.


How to Do Thematic Analysis | Step-by-Step Guide & Examples

Published on September 6, 2019 by Jack Caulfield . Revised on June 22, 2023.

Thematic analysis is a method of analyzing qualitative data. It is usually applied to a set of texts, such as interview transcripts. The researcher closely examines the data to identify common themes – topics, ideas and patterns of meaning that come up repeatedly.

There are various approaches to conducting thematic analysis, but the most common form follows a six-step process: familiarization, coding, generating themes, reviewing themes, defining and naming themes, and writing up. Following this process can also help you avoid confirmation bias when formulating your analysis.

This process was originally developed for psychology research by Virginia Braun and Victoria Clarke . However, thematic analysis is a flexible method that can be adapted to many different kinds of research.

Table of contents

  • When to use thematic analysis
  • Different approaches to thematic analysis
  • Step 1: Familiarization
  • Step 2: Coding
  • Step 3: Generating themes
  • Step 4: Reviewing themes
  • Step 5: Defining and naming themes
  • Step 6: Writing up

Thematic analysis is a good approach to research where you’re trying to find out something about people’s views, opinions, knowledge, experiences or values from a set of qualitative data – for example, interview transcripts , social media profiles, or survey responses .

Some types of research questions you might use thematic analysis to answer:

  • How do patients perceive doctors in a hospital setting?
  • What are young women’s experiences on dating sites?
  • What are non-experts’ ideas and opinions about climate change?
  • How is gender constructed in high school history teaching?

To answer any of these questions, you would collect data from a group of relevant participants and then analyze it. Thematic analysis gives you a lot of flexibility in interpreting the data and lets you approach large data sets more easily by sorting them into broad themes.

However, it also involves the risk of missing nuances in the data. Thematic analysis is often quite subjective and relies on the researcher’s judgement, so you have to reflect carefully on your own choices and interpretations.

Pay close attention to the data to ensure that you’re not picking up on things that are not there – or obscuring things that are.


Once you’ve decided to use thematic analysis, there are different approaches to consider.

There’s the distinction between inductive and deductive approaches:

  • An inductive approach involves allowing the data to determine your themes.
  • A deductive approach involves coming to the data with some preconceived themes you expect to find reflected there, based on theory or existing knowledge.

Ask yourself: Does my theoretical framework give me a strong idea of what kind of themes I expect to find in the data (deductive), or am I planning to develop my own framework based on what I find (inductive)?

There’s also the distinction between a semantic and a latent approach:

  • A semantic approach involves analyzing the explicit content of the data.
  • A latent approach involves reading into the subtext and assumptions underlying the data.

Ask yourself: Am I interested in people’s stated opinions (semantic) or in what their statements reveal about their assumptions and social context (latent)?

After you’ve decided thematic analysis is the right method for analyzing your data, and you’ve thought about the approach you’re going to take, you can follow the six steps developed by Braun and Clarke .

The first step is to get to know our data. It’s important to get a thorough overview of all the data we collected before we start analyzing individual items.

This might involve transcribing audio , reading through the text and taking initial notes, and generally looking through the data to get familiar with it.

Next up, we need to code the data. Coding means highlighting sections of our text – usually phrases or sentences – and coming up with shorthand labels or “codes” to describe their content.

Let’s take a short example text. Say we’re researching perceptions of climate change among conservative voters aged 50 and up, and we have collected data through a series of interviews. An extract from one interview looks like this:

Coding qualitative data

Interview extract: “Personally, I’m not sure. I think the climate is changing, sure, but I don’t know why or how. People say you should trust the experts, but who’s to say they don’t have their own reasons for pushing this narrative? I’m not saying they’re wrong, I’m just saying there’s reasons not to 100% trust them. The facts keep changing – it used to be called global warming.”

Codes: uncertainty; distrust of experts; changing terminology

In this extract, we’ve highlighted various phrases in different colors corresponding to different codes. Each code describes the idea or feeling expressed in that part of the text.

At this stage, we want to be thorough: we go through the transcript of every interview and highlight everything that jumps out as relevant or potentially interesting. As well as highlighting all the phrases and sentences that match these codes, we can keep adding new codes as we go through the text.

After we’ve been through the text, we collate all the data into groups identified by code. These codes give us a condensed overview of the main points and common meanings that recur throughout the data.
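Collating extracts by code is, mechanically, just a grouping operation. The sketch below illustrates it in Python; the coded extracts are paraphrased from the example interview above and the code names are illustrative.

```python
from collections import defaultdict

# (code, extract) pairs, as produced by highlighting and labelling the text.
coded_extracts = [
    ("uncertainty", "Personally, I'm not sure."),
    ("distrust of experts", "who's to say they don't have their own reasons"),
    ("uncertainty", "I don't know why or how"),
    ("changing terminology", "it used to be called global warming"),
]

# Group every extract under its code to get the condensed overview.
by_code = defaultdict(list)
for code, extract in coded_extracts:
    by_code[code].append(extract)

for code, extracts in by_code.items():
    print(f"{code}: {len(extracts)} extract(s)")
```

With all interviews collated this way, you can see at a glance which codes recur often and pull up every passage that supports them.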

Next, we look over the codes we’ve created, identify patterns among them, and start coming up with themes.

Themes are generally broader than codes. Most of the time, you’ll combine several codes into a single theme. In our example, we might start combining codes into themes like this:

Turning codes into themes

Example themes: uncertainty; distrust of experts; misinformation (each formed by combining one or more related codes)

At this stage, we might decide that some of our codes are too vague or not relevant enough (for example, because they don’t appear very often in the data), so they can be discarded.

Other codes might become themes in their own right. In our example, we decided that the code “uncertainty” made sense as a theme, with some other codes incorporated into it.

Again, what we decide will vary according to what we’re trying to find out. We want to create potential themes that tell us something helpful about the data for our purposes.
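The grouping of codes into broader themes can be sketched as a simple mapping. The assignments below are illustrative starting points only, not definitive: reviewing the themes (the next step) may well move codes between them.

```python
# Hypothetical first-pass assignment of codes to themes for the climate example.
code_to_theme = {
    "uncertainty": "uncertainty",                    # a code kept as a theme in its own right
    "changing terminology": "distrust of experts",   # provisional; may be moved on review
    "distrust of experts": "distrust of experts",
    "misinformation": "misinformation",
}

# Invert the mapping: each theme lists the codes it combines.
themes = {}
for code, theme in code_to_theme.items():
    themes.setdefault(theme, []).append(code)

print(themes)
```

Reviewing a theme then amounts to changing an entry in the mapping and checking that the extracts grouped under it still tell a coherent story.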

Now we have to make sure that our themes are useful and accurate representations of the data. Here, we return to the data set and compare our themes against it. Are we missing anything? Are these themes really present in the data? What can we change to make our themes work better?

If we encounter problems with our themes, we might split them up, combine them, discard them or create new ones: whatever makes them more useful and accurate.

For example, we might decide upon looking through the data that “changing terminology” fits better under the “uncertainty” theme than under “distrust of experts,” since the data labelled with this code involves confusion, not necessarily distrust.

Now that you have a final list of themes, it’s time to name and define each of them.

Defining themes involves formulating exactly what we mean by each theme and figuring out how it helps us understand the data.

Naming themes involves coming up with a succinct and easily understandable name for each theme.

For example, we might look at “distrust of experts” and determine exactly who we mean by “experts” in this theme. We might decide that a better name for the theme is “distrust of authority” or “conspiracy thinking”.

Finally, we’ll write up our analysis of the data. Like all academic texts, writing up a thematic analysis requires an introduction to establish our research question, aims and approach.

We should also include a methodology section, describing how we collected the data (e.g. through semi-structured interviews or open-ended survey questions ) and explaining how we conducted the thematic analysis itself.

The results or findings section usually addresses each theme in turn. We describe how often the themes come up and what they mean, including examples from the data as evidence. Finally, our conclusion explains the main takeaways and shows how the analysis has answered our research question.

In our example, we might argue that conspiracy thinking about climate change is widespread among older conservative voters, point out the uncertainty with which many voters view the issue, and discuss the role of misinformation in respondents’ perceptions.

If you want to know more about statistics, methodology, or research bias, make sure to check out some of our other articles with explanations and examples.



Caulfield, J. (2023, June 22). How to Do Thematic Analysis | Step-by-Step Guide & Examples. Scribbr. Retrieved September 16, 2024, from https://www.scribbr.com/methodology/thematic-analysis/



  • Open access
  • Published: 12 September 2024

An open-source framework for end-to-end analysis of electronic health record data

  • Lukas Heumos 1 , 2 , 3 ,
  • Philipp Ehmele 1 ,
  • Tim Treis 1 , 3 ,
  • Julius Upmeier zu Belzen   ORCID: orcid.org/0000-0002-0966-4458 4 ,
  • Eljas Roellin 1 , 5 ,
  • Lilly May 1 , 5 ,
  • Altana Namsaraeva 1 , 6 ,
  • Nastassya Horlava 1 , 3 ,
  • Vladimir A. Shitov   ORCID: orcid.org/0000-0002-1960-8812 1 , 3 ,
  • Xinyue Zhang   ORCID: orcid.org/0000-0003-4806-4049 1 ,
  • Luke Zappia   ORCID: orcid.org/0000-0001-7744-8565 1 , 5 ,
  • Rainer Knoll 7 ,
  • Niklas J. Lang 2 ,
  • Leon Hetzel 1 , 5 ,
  • Isaac Virshup 1 ,
  • Lisa Sikkema   ORCID: orcid.org/0000-0001-9686-6295 1 , 3 ,
  • Fabiola Curion 1 , 5 ,
  • Roland Eils 4 , 8 ,
  • Herbert B. Schiller 2 , 9 ,
  • Anne Hilgendorff 2 , 10 &
  • Fabian J. Theis   ORCID: orcid.org/0000-0002-2419-1943 1 , 3 , 5  

Nature Medicine (2024)


  • Epidemiology
  • Translational research

With progressive digitalization of healthcare systems worldwide, large-scale collection of electronic health records (EHRs) has become commonplace. However, an extensible framework for comprehensive exploratory analysis that accounts for data heterogeneity is missing. Here we introduce ehrapy, a modular open-source Python framework designed for exploratory analysis of heterogeneous epidemiology and EHR data. ehrapy incorporates a series of analytical steps, from data extraction and quality control to the generation of low-dimensional representations. Complemented by rich statistical modules, ehrapy facilitates associating patients with disease states, differential comparison between patient clusters, survival analysis, trajectory inference, causal inference and more. Leveraging ontologies, ehrapy further enables data sharing and training EHR deep learning models, paving the way for foundational models in biomedical research. We demonstrate ehrapy’s features in six distinct examples. We applied ehrapy to stratify patients affected by unspecified pneumonia into finer-grained phenotypes. Furthermore, we reveal biomarkers for significant differences in survival among these groups. Additionally, we quantify medication-class effects of pneumonia medications on length of stay. We further leveraged ehrapy to analyze cardiovascular risks across different data modalities. We reconstructed disease state trajectories in patients with severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) based on imaging data. Finally, we conducted a case study to demonstrate how ehrapy can detect and mitigate biases in EHR data. ehrapy, thus, provides a framework that we envision will standardize analysis pipelines on EHR data and serve as a cornerstone for the community.


Electronic health records (EHRs) are becoming increasingly common due to standardized data collection 1 and digitalization in healthcare institutions. EHRs collected at medical care sites serve as efficient storage and sharing units of health information 2 , enabling the informed treatment of individuals using the patient’s complete history 3 . Routinely collected EHR data are approaching genomic-scale size and complexity 4 , posing challenges in extracting information without quantitative analysis methods. The application of such approaches to EHR databases 1 , 5 , 6 , 7 , 8 , 9 has enabled the prediction and classification of diseases 10 , 11 , study of population health 12 , determination of optimal treatment policies 13 , 14 , simulation of clinical trials 15 and stratification of patients 16 .

However, current EHR datasets suffer from serious limitations, such as data collection issues, inconsistencies and lack of data diversity. EHR data collection and sharing problems often arise due to non-standardized formats, with disparate systems using exchange protocols, such as Health Level Seven International (HL7) and Fast Healthcare Interoperability Resources (FHIR) 17 . In addition, EHR data are stored in various on-disk formats, including, but not limited to, relational databases and CSV, XML and JSON formats. These variations pose challenges with respect to data retrieval, scalability, interoperability and data sharing.

Beyond format variability, inherent biases of the collected data can compromise the validity of findings. Selection bias stemming from non-representative sample composition can lead to skewed inferences about disease prevalence or treatment efficacy 18 , 19 . Filtering bias arises through inconsistent criteria for data inclusion, obscuring true variable relationships 20 . Surveillance bias exaggerates associations between exposure and outcomes due to differential monitoring frequencies 21 . EHR data are further prone to missing data 22 , 23 , which can be broadly classified into three categories: missing completely at random (MCAR), where missingness is unrelated to the data; missing at random (MAR), where missingness depends on observed data; and missing not at random (MNAR), where missingness depends on unobserved data 22 , 23 . Information and coding biases, related to inaccuracies in data recording or coding inconsistencies, respectively, can lead to misclassification and unreliable research conclusions 24 , 25 . Data may even contradict itself, such as when measurements were reported for deceased patients 26 , 27 . Technical variation and differing data collection standards lead to distribution differences and inconsistencies in representation and semantics across EHR datasets 28 , 29 . Attrition and confounding biases, resulting from differential patient dropout rates or unaccounted external variable effects, can significantly skew study outcomes 30 , 31 , 32 . The diversity of EHR data that comprise demographics, laboratory results, vital signs, diagnoses, medications, x-rays, written notes and even omics measurements amplifies all the aforementioned issues.
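The three missingness categories can be made concrete with a small simulation. This is an illustrative stdlib-only sketch (the thresholds and probabilities are invented); it shows why MNAR is the most dangerous case — the observed values become systematically biased:

```python
import random

random.seed(0)

# Simulated lab values for 1,000 patient visits.
values = [random.gauss(100.0, 15.0) for _ in range(1000)]

def apply_missingness(values, mechanism):
    """Return a copy with values masked (None) under a given mechanism."""
    out = []
    for v in values:
        if mechanism == "MCAR":
            # Missing completely at random: fixed chance, unrelated to anything.
            missing = random.random() < 0.2
        elif mechanism == "MAR":
            # Missing at random: depends on an *observed* covariate.
            # Here, pretend older patients (observed age) skip the test more.
            age = random.uniform(0, 90)
            missing = random.random() < (0.4 if age > 60 else 0.1)
        else:  # "MNAR"
            # Missing not at random: depends on the *unobserved* value itself,
            # e.g. extreme measurements are less likely to be recorded.
            missing = v > 120.0 and random.random() < 0.8
        out.append(None if missing else v)
    return out

mnar = apply_missingness(values, "MNAR")
observed = [v for v in mnar if v is not None]
true_mean = sum(values) / len(values)
obs_mean = sum(observed) / len(observed)
# Under MNAR, the observed mean is biased downward relative to the true mean,
# because high values were preferentially dropped.
print(round(true_mean, 1), round(obs_mean, 1))
```

Under MCAR, by contrast, the observed mean is an unbiased estimate of the true mean, which is why simple complete-case analysis is only safe in that regime.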

Addressing these challenges requires rigorous study design, careful data pre-processing and continuous bias evaluation through exploratory data analysis. Several EHR data pre-processing and analysis workflows were previously developed 4 , 33 , 34 , 35 , 36 , 37 , but none of them enables the analysis of heterogeneous data, provides in-depth documentation, is available as a software package or allows for exploratory visual analysis. Current EHR analysis pipelines, therefore, differ considerably in their approaches and are often commercial, vendor-specific solutions 38 . This is in contrast to strategies using community standards for the analysis of omics data, such as Bioconductor 39 or scverse 40 . As a result, EHR data frequently remain underexplored and are commonly investigated only for a particular research question 41 . Even in such cases, EHR data are then frequently input into machine learning models with serious data quality issues that greatly impact prediction performance and generalizability 42 .

To address this lack of analysis tooling, we developed the EHR Analysis in Python framework, ehrapy, which enables exploratory analysis of diverse EHR datasets. The ehrapy package is purpose-built to organize, analyze, visualize and statistically compare complex EHR data. ehrapy can be applied to datasets of different data types, sizes, diseases and origins. To demonstrate this versatility, we applied ehrapy to datasets obtained from EHR and population-based studies. Using the Pediatric Intensive Care (PIC) EHR database 43 , we stratified patients diagnosed with ‘unspecified pneumonia’ into distinct clinically relevant groups, extracted clinical indicators of pneumonia through statistical analysis and quantified medication-class effects on length of stay (LOS) with causal inference. Using the UK Biobank 44 (UKB), a population-scale cohort comprising over 500,000 participants from the United Kingdom, we employed ehrapy to explore cardiovascular risk factors using clinical predictors, metabolomics, genomics and retinal imaging-derived features. Additionally, we performed image analysis to project disease progression through fate mapping in patients affected by coronavirus disease 2019 (COVID-19) using chest x-rays. Finally, we demonstrate how exploratory analysis with ehrapy unveils and mitigates biases in over 100,000 visits by patients with diabetes across 130 US hospitals. We provide online links to additional use cases that demonstrate ehrapy’s usage with further datasets, including MIMIC-II (ref. 45 ), and for various medical conditions, such as patients subject to indwelling arterial catheter usage. ehrapy is compatible with any EHR dataset that can be transformed into vectors and is accessible as a user-friendly open-source software package hosted at https://github.com/theislab/ehrapy and installable from PyPI. It comes with comprehensive documentation, tutorials and further examples, all available at https://ehrapy.readthedocs.io .

ehrapy: a framework for exploratory EHR data analysis

The foundation of ehrapy is a robust and scalable data storage backend that is combined with a series of pre-processing and analysis modules. In ehrapy, EHR data are organized as a data matrix where observations are individual patient visits (or patients, in the absence of follow-up visits), and variables represent all measured quantities ( Methods ). These data matrices are stored together with metadata of observations and variables. By leveraging the AnnData (annotated data) data structure that implements this design, ehrapy builds upon established standards and is compatible with analysis and visualization functions provided by the omics scverse 40 ecosystem. Readers are also available in R, Julia and Javascript 46 . We additionally provide a dataset module with more than 20 public loadable EHR datasets in AnnData format to kickstart analysis and development with ehrapy.
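The observations × variables layout described above can be mimicked with plain Python structures. This sketch is not the anndata API — it only illustrates the design: a data matrix X whose rows are visits and whose columns are variables, with aligned metadata tables for both axes (all values invented):

```python
# A minimal stand-in for the AnnData layout: a matrix X with one row per
# patient visit and one column per measured variable, plus aligned metadata
# for observations (obs) and variables (var).
X = [
    [62.0, 7.1, 1],   # visit 1
    [55.0, 7.4, 0],   # visit 2
    [70.0, 6.9, 1],   # visit 3
]
obs = [  # one entry per row of X
    {"visit_id": "v1", "icu": "PICU"},
    {"visit_id": "v2", "icu": "GICU"},
    {"visit_id": "v3", "icu": "PICU"},
]
var = [  # one entry per column of X
    {"name": "weight_kg"},
    {"name": "blood_ph"},
    {"name": "ventilated"},
]

# The invariant AnnData enforces: metadata stays aligned with the matrix.
assert len(X) == len(obs) and len(X[0]) == len(var)

# Select all PICU visits and read off their blood pH.
ph_col = [v["name"] for v in var].index("blood_ph")
picu_ph = [row[ph_col] for row, meta in zip(X, obs) if meta["icu"] == "PICU"]
print(picu_ph)
```

Keeping annotations on both axes in one object is what lets every analysis step attach its intermediate results (embeddings, cluster labels, QC flags) back onto the same structure.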

For standardized analysis of EHR data, it is crucial that these data are encoded and stored in consistent, reusable formats. Thus, ehrapy requires that input data are organized in structured vectors. Readers for common formats, such as CSV, OMOP 47 or SQL databases, are available in ehrapy. Data loaded into AnnData objects can be mapped against several hierarchical ontologies 48 , 49 , 50 , 51 ( Methods ). Clinical keywords of free text notes can be automatically extracted ( Methods ).

Powered by scanpy, which scales to millions of observations 52 ( Methods and Supplementary Table 1 ) and the machine learning library scikit-learn 53 , ehrapy provides more than 100 composable analysis functions organized in modules from which custom analysis pipelines can be built. Each function directly interacts with the AnnData object and adds all intermediate results for simple access and reuse of information to it. To facilitate setting up these pipelines, ehrapy guides analysts through a general analysis pipeline (Fig. 1 ). At any step of an analysis pipeline, community software packages can be integrated without any vendor lock-in. Because ehrapy is built on open standards, it can be purposefully extended to solve new challenges, such as the development of foundational models ( Methods ).

Figure 1

a , Heterogeneous health data are first loaded into memory as an AnnData object with patient visits as observational rows and variables as columns. Next, the data can be mapped against ontologies, and key terms are extracted from free text notes. b , The EHR data are subject to quality control where low-quality or spurious measurements are removed or imputed. Subsequently, numerical data are normalized, and categorical data are encoded. Data from different sources with data distribution shifts are integrated, embedded, clustered and annotated in a patient landscape. c , Further downstream analyses depend on the question of interest and can include the inference of causal effects and trajectories, survival analysis or patient stratification.

In the ehrapy analysis pipeline, EHR data are initially inspected for quality issues by analyzing feature distributions that may skew results and by detecting visits and features with high missing rates that ehrapy can then impute ( Methods ). ehrapy tracks all filtering steps while keeping track of population dynamics to highlight potential selection and filtering biases ( Methods ). Subsequently, ehrapy’s normalization and encoding functions ( Methods ) are applied to achieve a uniform numerical representation that facilitates data integration and corrects for dataset shift effects ( Methods ). Calculated lower-dimensional representations can subsequently be visualized, clustered and annotated to obtain a patient landscape ( Methods ). Such annotated groups of patients can be used for statistical comparisons to find differences in features among them to ultimately learn markers of patient states.
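The QC → imputation → normalization sequence can be sketched with stdlib Python. This is a conceptual illustration, not ehrapy's functions — the missingness threshold, mean imputation and z-scoring stand in for the richer options the framework provides:

```python
import statistics

# Toy feature matrix with missing values (None): rows = visits, cols = features.
X = [
    [1.0, None, 10.0],
    [2.0, None, 12.0],
    [3.0, None, None],
    [4.0, None, 14.0],
]

def missing_rate(col):
    return sum(v is None for v in col) / len(col)

cols = list(zip(*X))

# 1) Quality control: drop features with more than 50% missing values.
kept = [j for j, col in enumerate(cols) if missing_rate(col) <= 0.5]

# 2) Imputation: replace remaining missing values with the feature mean.
imputed = []
for j in kept:
    col = list(cols[j])
    mean = statistics.mean(v for v in col if v is not None)
    imputed.append([mean if v is None else v for v in col])

# 3) Normalization: z-score each feature so scales are comparable.
normalized = []
for col in imputed:
    mu, sd = statistics.mean(col), statistics.pstdev(col)
    normalized.append([(v - mu) / sd for v in col])

print(kept)  # the fully missing middle column was dropped
```

Tracking which visits and features each step removed — as ehrapy does — is what makes the selection and filtering biases mentioned above visible rather than silent.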

As analysis goals can differ between users and datasets, the ehrapy analysis pipeline is customizable during the final knowledge inference step. ehrapy provides statistical methods for group comparison and extensive support for survival analysis ( Methods ), enabling the discovery of biomarkers. Furthermore, ehrapy offers functions for causal inference to go from statistically determined associations to causal relations ( Methods ). Moreover, patient visits in aggregated EHR data can be regarded as snapshots where individual measurements taken at specific timepoints might not adequately reflect the underlying progression of disease and result from unrelated variation due to, for example, day-to-day differences 54 , 55 , 56 . Therefore, disease progression models should rely on analysis of the underlying clinical data, as disease progression in an individual patient may not be monotonous in time. ehrapy allows for the use of advanced trajectory inference methods to overcome sparse measurements 57 , 58 , 59 . We show that this approach can order snapshots to calculate a pseudotime that can adequately reflect the progression of the underlying clinical process. Given a sufficient number of snapshots, ehrapy increases the potential to understand disease progression, which is likely not robustly captured within a single EHR but, rather, across several.
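The idea of ordering snapshots into a pseudotime can be illustrated with a deliberately simple scheme: rank each snapshot by its distance from a chosen root state in a low-dimensional embedding. Real trajectory inference methods are far more sophisticated; the coordinates and root choice below are invented:

```python
import math

# Toy 2-D embeddings of patient snapshots (e.g. coordinates in a UMAP space).
snapshots = {
    "p1": (0.1, 0.0),
    "p2": (2.0, 1.9),
    "p3": (1.0, 1.1),
    "p4": (3.1, 3.0),
}
root = "p1"  # a snapshot judged to represent the earliest disease state

def dist(a, b):
    return math.hypot(a[0] - b[0], a[1] - b[1])

# Pseudotime: rank every snapshot by its distance from the root, so similar
# states end up adjacent along the ordering, then rescale to [0, 1].
order = sorted(snapshots, key=lambda p: dist(snapshots[p], snapshots[root]))
pseudotime = {p: i / (len(order) - 1) for i, p in enumerate(order)}
print(order)
```

The point of the construction is exactly what the paragraph states: even though each visit is only a snapshot, an ordering over many patients can recover a continuous axis of disease progression.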

ehrapy enables patient stratification in pneumonia cases

To demonstrate ehrapy’s capability to analyze heterogeneous datasets from a broad patient set across multiple care units, we applied our exploratory strategy to the PIC 43 database. The PIC database is a single-center database hosting information on children admitted to critical care units at the Children’s Hospital of Zhejiang University School of Medicine in China. It contains 13,499 distinct hospital admissions of 12,881 individual pediatric patients admitted between 2010 and 2018 for whom demographics, diagnoses, doctors’ notes, vital signs, laboratory and microbiology tests, medications, fluid balances and more were collected (Extended Data Figs. 1 and 2a and Methods ). After missing data imputation and subsequent pre-processing (Extended Data Figs. 2b,c and 3 and Methods ), we generated a uniform manifold approximation and projection (UMAP) embedding to visualize variation across all patients using ehrapy (Fig. 2a ). This visualization of the low-dimensional patient manifold shows the heterogeneity of the collected data in the PIC database, with malformations, perinatal and respiratory being the most abundant International Classification of Diseases (ICD) chapters (Fig. 2b ). The most common respiratory disease categories (Fig. 2c ) were labeled pneumonia and influenza ( n  = 984). We focused on pneumonia to apply ehrapy to a challenging, broad-spectrum disease that affects all age groups. Pneumonia is a prevalent respiratory infection that poses a substantial burden on public health 60 and is characterized by inflammation of the alveoli and distal airways 60 . Individuals with pre-existing chronic conditions are particularly vulnerable, as are children under the age of 5 (ref. 61 ). Pneumonia can be caused by a range of microorganisms, encompassing bacteria, respiratory viruses and fungi.

Figure 2

a , UMAP of all patient visits in the ICU with primary discharge diagnosis grouped by ICD chapter. b , The prevalence of respiratory diseases prompted us to investigate them further. c , Respiratory categories show the abundance of influenza and pneumonia diagnoses that we investigated more closely. d , We observed the ‘unspecified pneumonia’ subgroup, which led us to investigate and annotate it in more detail. e , The previously ‘unspecified pneumonia’-labeled patients were annotated using several clinical features (Extended Data Fig. 5 ), of which the most important ones are shown in the heatmap ( f ). g , Example disease progression of an individual child with pneumonia illustrating pharmacotherapy over time until positive A. baumannii swab.

We selected the age group ‘youths’ (13 months to 18 years of age) for further analysis, addressing a total of 265 patients who dominated the pneumonia cases and were diagnosed with ‘unspecified pneumonia’ (Fig. 2d and Extended Data Fig. 4 ). Neonates (0–28 d old) and infants (29 d to 12 months old) were excluded from the analysis as the disease context is significantly different in these age groups due to distinct anatomical and physical conditions. Patients were 61% male, had a total of 277 admissions, had a mean age at admission of 54 months (median, 38 months) and had an average LOS of 15 d (median, 7 d). Of these, 152 patients were admitted to the pediatric intensive care unit (PICU), 118 to the general ICU (GICU), four to the surgical ICU (SICU) and three to the cardiac ICU (CICU). Laboratory measurements typically had 12–14% missing data, except for serum procalcitonin (PCT), a marker for bacterial infections, with 24.5% missing, and C-reactive protein (CRP), a marker of inflammation, with 16.8% missing. Measurements assigned as ‘vital signs’ contained between 44% and 54% missing values. Stratifying patients with unspecified pneumonia further enables a more nuanced understanding of the disease, potentially facilitating tailored approaches to treatment.

To deepen clinical phenotyping for the disease group ‘unspecified pneumonia’, we calculated a k -nearest neighbor graph to cluster patients into groups and visualize these in UMAP space ( Methods ). Leiden clustering 62 identified four patient groupings with distinct clinical features that we annotated (Fig. 2e ). To identify the laboratory values, medications and pathogens that were most characteristic for these four groups (Fig. 2f ), we applied t -tests for numerical data and g -tests for categorical data between the identified groups using ehrapy (Extended Data Fig. 5 and Methods ). Based on this analysis, we identified patient groups with ‘sepsis-like’, ‘severe pneumonia with co-infection’, ‘viral pneumonia’ and ‘mild pneumonia’ phenotypes. The ‘sepsis-like’ group of patients ( n  = 28) was characterized by rapid disease progression as exemplified by an increased number of deaths (adjusted P  ≤ 5.04 × 10 −3 , 43% ( n  = 28), 95% confidence interval (CI): 23%, 62%); indication of multiple organ failure, such as elevated creatinine (adjusted P  ≤ 0.01, 52.74 ± 23.71 μmol L −1 ) or reduced albumin levels (adjusted P  ≤ 2.89 × 10 −4 , 33.40 ± 6.78 g L −1 ); and increased expression levels and peaks of inflammation markers, including PCT (adjusted P  ≤ 3.01 × 10 −2 , 1.42 ± 2.03 ng ml −1 ), whole blood cell count, neutrophils, lymphocytes, monocytes and lower platelet counts (adjusted P  ≤ 6.3 × 10 −2 , 159.30 ± 142.00 × 10 9 per liter) and changes in electrolyte levels—that is, lower potassium levels (adjusted P  ≤ 0.09 × 10 −2 , 3.14 ± 0.54 mmol L −1 ).
Patients whom we associated with the term ‘severe pneumonia with co-infection’ ( n  = 74) were characterized by prolonged ICU stays (adjusted P  ≤ 3.59 × 10 −4 , 15.01 ± 29.24 d); organ affection, such as higher levels of creatinine (adjusted P  ≤ 1.10 × 10 −4 , 52.74 ± 23.71 μmol L −1 ) and lower platelet count (adjusted P  ≤ 5.40 × 10 −23 , 159.30 ± 142.00 × 10 9 per liter); increased inflammation markers, such as peaks of PCT (adjusted P  ≤ 5.06 × 10 −5 , 1.42 ± 2.03 ng ml −1 ), CRP (adjusted P  ≤ 1.40 × 10 −6 , 50.60 ± 37.58 mg L −1 ) and neutrophils (adjusted P  ≤ 8.51 × 10 −6 , 13.01 ± 6.98 × 10 9 per liter); detection of bacteria in combination with additional pathogen fungals in sputum samples (adjusted P  ≤ 1.67 × 10 −2 , 26% ( n  = 74), 95% CI: 16%, 36%); and increased application of medication, including antifungals (adjusted P  ≤ 1.30 × 10 −4 , 15% ( n  = 74), 95% CI: 7%, 23%) and catecholamines (adjusted P  ≤ 2.0 × 10 −2 , 45% ( n  = 74), 95% CI: 33%, 56%). Patients in the ‘mild pneumonia’ group were characterized by positive sputum cultures in the presence of relatively lower inflammation markers, such as PCT (adjusted P  ≤ 1.63 × 10 −3 , 1.42 ± 2.03 ng ml −1 ) and CRP (adjusted P  ≤ 0.03 × 10 −1 , 50.60 ± 37.58 mg L −1 ), while receiving antibiotics more frequently (adjusted P  ≤ 1.00 × 10 −5 , 80% ( n  = 78), 95% CI: 70%, 89%) and additional medications (electrolytes, blood thinners and circulation-supporting medications) (adjusted P  ≤ 1.00 × 10 −5 , 82% ( n  = 78), 95% CI: 73%, 91%). 
Finally, patients in the ‘viral pneumonia’ group were characterized by shorter LOSs (adjusted P  ≤ 8.00 × 10 −6 , 15.01 ± 29.24 d), a lack of non-viral pathogen detection in combination with higher lymphocyte counts (adjusted P  ≤ 0.01, 4.11 ± 2.49 × 10 9 per liter), lower levels of PCT (adjusted P  ≤ 0.03 × 10 −2 , 1.42 ± 2.03 ng ml −1 ) and reduced application of catecholamines (adjusted P  ≤ 5.96 × 10 −7 , 15% (n = 97), 95% CI: 8%, 23%), antibiotics (adjusted P  ≤ 8.53 × 10 −6 , 41% ( n  = 97), 95% CI: 31%, 51%) and antifungals (adjusted P  ≤ 5.96 × 10 −7 , 0% ( n  = 97), 95% CI: 0%, 0%).
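For the numerical comparisons above, the underlying statistic is a two-sample t-test between patient groups. A minimal Welch t-statistic can be computed by hand as below; the values are toy numbers, and a real analysis would use a statistics library plus multiple-testing correction to obtain the adjusted P values reported here:

```python
import math

# Toy measurements of one lab feature (e.g. an inflammation marker) in two
# patient clusters. Welch's t-test does not assume equal variances.
group_a = [1.2, 1.5, 1.1, 1.7, 1.4]
group_b = [2.4, 2.1, 2.8, 2.5, 2.2]

def welch_t(a, b):
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    va = sum((x - ma) ** 2 for x in a) / (len(a) - 1)  # sample variances
    vb = sum((x - mb) ** 2 for x in b) / (len(b) - 1)
    return (ma - mb) / math.sqrt(va / len(a) + vb / len(b))

t = welch_t(group_a, group_b)
print(round(t, 2))
```

A large negative t here simply reflects that the first group's mean is well below the second's relative to the pooled uncertainty; categorical features (pathogens, medications) would instead go through a g-test of independence.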

To demonstrate the ability of ehrapy to examine EHR data from different levels of resolution, we additionally reconstructed a case from the ‘severe pneumonia with co-infection’ group (Fig. 2g ). In this case, the analysis revealed that CRP levels remained elevated despite broad-spectrum antibiotic treatment until a positive Acinetobacter baumannii result led to a change in medication and a subsequent decrease in CRP and monocyte levels.

ehrapy facilitates extraction of pneumonia indicators

ehrapy’s survival analysis module allowed us to identify clinical indicators of disease stages that could be used as biomarkers through Kaplan–Meier analysis. We found strong variance in overall aspartate aminotransferase (AST), alanine aminotransferase (ALT), gamma-glutamyl transferase (GGT) and bilirubin levels (Fig. 3a ), including changes over time (Extended Data Fig. 6a,b ), in all four ‘unspecified pneumonia’ groups. Routinely used to assess liver function, studies provide evidence that AST, ALT and GGT levels are elevated during respiratory infections 63 , including severe pneumonia 64 , and can guide diagnosis and management of pneumonia in children 63 . We confirmed reduced survival in more severely affected children (‘sepsis-like pneumonia’ and ‘severe pneumonia with co-infection’) using Kaplan–Meier curves and a multivariate log-rank test (Fig. 3b ; P  ≤ 1.09 × 10 −18 ) through ehrapy. To verify the association of this trajectory with altered AST, ALT and GGT expression levels, we further grouped all patients based on liver enzyme reference ranges ( Methods and Supplementary Table 2 ). By Kaplan–Meier survival analysis, cases with peaks of GGT ( P  ≤ 1.4 × 10 −2 , 58.01 ± 2.03 U L −1 ), ALT ( P  ≤ 2.9 × 10 −2 , 43.59 ± 38.02 U L −1 ) and AST ( P  ≤ 4.8 × 10 −4 , 78.69 ± 60.03 U L −1 ) in ‘outside the norm’ were found to correlate with lower survival in all groups (Fig. 3c and Extended Data Fig. 6 ), in line with previous studies 63 , 65 . Bilirubin was not found to significantly affect survival ( P  ≤ 2.1 × 10 −1 , 12.57 ± 21.22 mg dl −1 ).
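The Kaplan–Meier estimator used above is a product-limit calculation that can be written out directly. This is a sketch with invented data; a real analysis would use a survival library (and a log-rank test for group comparisons):

```python
# Minimal Kaplan-Meier product-limit estimator.
# Each subject is (time, event) with event=1 for death, 0 for censoring.
subjects = [(2, 1), (3, 0), (4, 1), (4, 1), (5, 0), (6, 1)]

def kaplan_meier(subjects):
    times = sorted({t for t, e in subjects if e == 1})  # distinct event times
    survival, s = [], 1.0
    for t in times:
        at_risk = sum(1 for ti, _ in subjects if ti >= t)
        deaths = sum(1 for ti, e in subjects if ti == t and e == 1)
        s *= 1.0 - deaths / at_risk       # product-limit update
        survival.append((t, s))
    return survival

curve = kaplan_meier(subjects)
print(curve)
```

Censored subjects (event = 0) never trigger a drop in the curve but do count toward the at-risk denominator until their censoring time — which is how the estimator uses partial follow-up information.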

Figure 3

a , Line plots of major hepatic system laboratory measurements per group show variance in the measurements per pneumonia group. b , Kaplan–Meier survival curves demonstrate lower survival for ‘sepsis-like’ and ‘severe pneumonia with co-infection’ groups. c , Kaplan–Meier survival curves for children with GGT measurements outside the norm range display lower survival.

ehrapy quantifies medication class effect on LOS

Pneumonia requires case-specific medications due to its diverse causes. To demonstrate the potential of ehrapy’s causal inference module, we quantified the effect of medication on ICU LOS to evaluate case-specific administration of medication. In contrast to causal discovery that attempts to find a causal graph reflecting the causal relationships, causal inference is a statistical process used to investigate possible effects when altering a provided system, as represented by a causal graph and observational data (Fig. 4a ) 66 . This approach allows identifying and quantifying the impact of specific interventions or treatments on outcome measures, thereby providing insight for evidence-based decision-making in healthcare. Causal inference relies on datasets incorporating interventions to accurately quantify effects.

Figure 4

a , ehrapy’s causal module is based on the strategy of the tool ‘dowhy’. Here, EHR data containing treatment, outcome and measurements and a causal graph serve as input for causal effect quantification. The process includes the identification of the target estimand based on the causal graph, the estimation of causal effects using various models and, finally, refutation where sensitivity analyses and refutation tests are performed to assess the robustness of the results and assumptions. b , Curated causal graph using age, liver damage and inflammation markers as disease progression proxies together with medications as interventions to assess the causal effect on length of ICU stay. c , Determined causal effect strength on LOS in days of administered medication categories.

We manually constructed a minimal causal graph with ehrapy (Fig. 4b ) on records of treatment with corticosteroids, carbapenems, penicillins, cephalosporins and antifungal and antiviral medications as interventions (Extended Data Fig. 7 and Methods ). We assumed that the medications affect disease progression proxies, such as inflammation markers and markers of organ function. The selection of ‘interventions’ is consistent with current treatment standards for bacterial pneumonia and respiratory distress 67 , 68 . Based on the approach of the tool ‘dowhy’ 69 (Fig. 4a ), ehrapy’s causal module identified the application of corticosteroids, antivirals and carbapenems to be associated with shorter LOSs, in line with current evidence 61 , 70 , 71 , 72 . In contrast, penicillins and cephalosporins were associated with longer LOSs, whereas antifungal medication did not strongly influence LOS (Fig. 4c ).
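The intuition behind adjusting for a causal graph can be shown with a deliberately simple estimator: stratify on a confounder (here, disease severity) and average the within-stratum treatment effects. This is a toy backdoor adjustment with invented numbers, not dowhy's estimators or the study's actual results:

```python
# Sketch: effect of a treatment on LOS, adjusted for a confounder (severity)
# via stratification. A naive pooled comparison would be confounded because
# sicker patients are both more likely to be treated and stay longer.
records = [
    # (severity, treated, los_days) -- all values invented
    ("mild", 1, 4), ("mild", 1, 5), ("mild", 0, 6), ("mild", 0, 7),
    ("severe", 1, 10), ("severe", 1, 11), ("severe", 0, 14), ("severe", 0, 13),
]

def mean(xs):
    return sum(xs) / len(xs)

effects, weights = [], []
for s in {sv for sv, _, _ in records}:
    treated = [l for sv, t, l in records if sv == s and t == 1]
    control = [l for sv, t, l in records if sv == s and t == 0]
    effects.append(mean(treated) - mean(control))
    weights.append(sum(1 for sv, _, _ in records if sv == s))

# Weighted average of within-stratum effects = adjusted treatment effect.
ate = sum(e * w for e, w in zip(effects, weights)) / sum(weights)
print(ate)
```

A negative adjusted effect corresponds to the paper's finding for corticosteroids, antivirals and carbapenems: treatment associated with a shorter length of stay once confounding paths in the graph are blocked.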

ehrapy enables deriving population-scale risk factors

To illustrate the advantages of using a unified data management and quality control framework, such as ehrapy, we modeled myocardial infarction risk using Cox proportional hazards models on UKB 44 data. Large population cohort studies, such as the UKB, enable the investigation of common diseases across a wide range of modalities, including genomics, metabolomics, proteomics, imaging data and common clinical variables (Fig. 5a,b ). From these, we used a publicly available polygenic risk score for coronary heart disease 73 comprising 6.6 million variants, 80 nuclear magnetic resonance (NMR) spectroscopy-based metabolomics 74 features, 81 features derived from retinal optical coherence tomography 75 , 76 and the Framingham Risk Score 77 feature set, which includes known clinical predictors, such as age, sex, body mass index, blood pressure, smoking behavior and cholesterol levels. We excluded features with more than 10% missingness and imputed the remaining missing values ( Methods ). Furthermore, individuals with events up to 1 year after the sampling time were excluded from the analyses, ultimately selecting 29,216 individuals for whom all mentioned data types were available (Extended Data Figs. 8 and 9 and Methods ). Myocardial infarction, as defined by our mapping to the phecode nomenclature 51 , was defined as the endpoint (Fig. 5c ). We modeled the risk for myocardial infarction 1 year after either the metabolomic sample was obtained or imaging was performed.

Figure 5

a , The UKB includes 502,359 participants from 22 assessment centers. Most participants have genetic data (97%) and physical measurement data (93%), but fewer have data for complex measures, such as metabolomics, retinal imaging or proteomics. b , We found a distinct cluster of individuals (bottom right) from the Birmingham assessment center in the retinal imaging data, which is an artifact of the image acquisition process and was, thus, excluded. c , Myocardial infarctions are recorded for 15% of the male and 7% of the female study population. Kaplan–Meier estimators with 95% CIs are shown. d , For every modality combination, a linear Cox proportional hazards model was fit to determine the prognostic potential of these for myocardial infarction. Cardiovascular risk factors show expected positive log hazard ratios (log (HRs)) for increased blood pressure or total cholesterol and negative ones for sampling age and systolic blood pressure (BP). log (HRs) with 95% CIs are shown. e , Combining all features yields a C-index of 0.81. c – e , Error bars indicate 95% CIs ( n  = 29,216).

Predictive performance for each modality was assessed by fitting Cox proportional hazards models (Fig. 5c ) on each of the feature sets using ehrapy (Fig. 5d ). The age of the first occurrence served as the time to event; alternatively, date of death or date of the last record in the EHR served as censoring times. Models were evaluated using the concordance index (C-index) ( Methods ). Combining multiple modalities successfully improved the predictive performance for coronary heart disease, increasing the C-index from 0.63 (genetic) to 0.76 (genetics, age and sex), to 0.77 (clinical predictors) and to 0.81 (imaging and clinical predictors) for combinations of feature sets (Fig. 5e ). Our finding is in line with previous observations of complementary effects between different modalities, where a broader ‘major adverse cardiac event’ phenotype was modeled in the UKB, achieving a C-index of 0.72 (ref. 78 ). Adding genetic data improves predictive potential because it is independent of sampling age and is only partly captured by other modalities 79 . The addition of metabolomic data did not improve predictive power (Fig. 5e ).
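The C-index used for evaluation can be made concrete with a small function; the following is a simplified sketch on hypothetical data (no handling of tied event times), not ehrapy’s implementation:

```python
import numpy as np

def concordance_index(time, event, risk):
    """Fraction of comparable pairs in which the higher-risk individual
    experiences the event earlier (simplified: risk ties count as 0.5)."""
    concordant, comparable = 0.0, 0
    n = len(time)
    for i in range(n):
        if not event[i]:
            continue  # pair must be anchored by an observed event
        for j in range(n):
            if time[j] > time[i]:  # j outlived i, so the pair is comparable
                comparable += 1
                if risk[i] > risk[j]:
                    concordant += 1
                elif risk[i] == risk[j]:
                    concordant += 0.5
    return concordant / comparable

# Hypothetical cohort: follow-up times, event indicators and model risk scores
time = np.array([2.0, 5.0, 7.0, 9.0])
event = np.array([1, 1, 0, 1])
risk = np.array([0.9, 0.6, 0.2, 0.1])
cindex = concordance_index(time, event, risk)
```

Here every comparable pair is ordered correctly by the risk score, so the C-index is 1.0; a value of 0.5 would indicate random ordering.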

Imaging-based disease severity projection via fate mapping

To demonstrate ehrapy’s ability to handle diverse image data and recover disease stages, we embedded pulmonary imaging data obtained from patients with COVID-19 into a lower-dimensional space and computationally inferred disease progression trajectories using pseudotemporal ordering. This describes a continuous trajectory or ordering of individual points based on feature similarity 80 . Continuous trajectories enable mapping the fate of new patients onto precise states to potentially predict their future condition.

In COVID-19, a highly contagious respiratory illness caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), symptoms range from mild flu-like symptoms to severe respiratory distress. Chest x-rays typically show opacities (bilateral patchy, ground glass) associated with disease severity 81 .

We used COVID-19 chest x-ray images from the BrixIA 82 dataset consisting of 192 images (Fig. 6a ) with expert annotations of disease severity. We used the BrixIA database scores, which are based on six regions annotated by radiologists, to classify disease severity ( Methods ). We embedded raw image features using a pre-trained DenseNet model ( Methods ) and further processed this embedding into a nearest-neighbors-based UMAP space using ehrapy (Fig. 6b and Methods ). Fate mapping based on imaging information ( Methods ) determined a severity ordering from mild to critical cases (Fig. 6b–d ). Images labeled as ‘normal’ are projected to stay within the healthy group, illustrating the robustness of our approach. Images of diseased patients were ordered by disease severity, highlighting clear trajectories from ‘normal’ to ‘critical’ states despite the heterogeneity of the x-ray images stemming from, for example, different zoom levels (Fig. 6a ).

figure 6

a , Randomly selected chest x-ray images from the BrixIA dataset demonstrate its variance. b , UMAP visualization of the BrixIA dataset embedding shows a separation of disease severity classes. c , Calculated pseudotime for all images increases with distance to the ‘normal’ images. d , Stream projection of fate mapping in UMAP space showcases disease severity trajectory of the COVID-19 chest x-ray images.

Detecting and mitigating biases in EHR data with ehrapy

To showcase how exploratory analysis using ehrapy can reveal and mitigate biases, we analyzed the Fairlearn 83 version of the Diabetes 130-US Hospitals 84 dataset. The dataset covers 10 years (1999–2008) of clinical records from 130 US hospitals, detailing 47 features of diabetes diagnoses, laboratory tests, medications and additional data from up to 14 d of inpatient care of 101,766 diagnosed patient visits ( Methods ). It was originally collected to explore the link between the measurement of hemoglobin A1c (HbA1c) and early readmission.

The cohort primarily consists of White and African American individuals, with only a minority of cases from Asian or Hispanic backgrounds (Extended Data Fig. 10a ). ehrapy’s cohort tracker unveiled selection and surveillance biases when filtering for Medicare recipients for further analysis, resulting in a shift of the age distribution toward over 60 years in addition to an increasing ratio of White participants. Using ehrapy’s visualization modules, our analysis showed that HbA1c was measured in only 18.4% of inpatients, with a higher frequency in emergency admissions compared to referral cases (Extended Data Fig. 10b ). Normalization biases can skew data relationships when standardization techniques ignore subgroup variability or assume incorrect distributions. The choice of normalization strategy must, therefore, be carefully considered to avoid obscuring important factors. When normalizing the number of applied medications individually, differences in distributions between age groups remained. However, when normalizing both distributions jointly with age group as an additional group variable, differences between age groups were masked (Extended Data Fig. 10c ). To investigate missing data and imputation biases, we introduced missingness for the number of applied medications according to an MCAR mechanism, which we verified using ehrapy’s Little’s test (P ≤ 0.01 × 10−2), and an MAR mechanism ( Methods ). Whereas imputing the mean in the MCAR case did not affect the overall location of the distribution, it led to an underestimation of the variance, with the standard deviation dropping from 8.1 in the original data to 6.8 in the imputed data (Extended Data Fig. 10d ). Mean imputation in the MAR case skewed both location and variance, shifting the mean from 16.02 to 14.66 with a standard deviation of only 5.72 (Extended Data Fig. 10d ).
Using ehrapy’s multiple-imputation-based MissForest 85 imputation on the MAR data resulted in a mean of 16.04 and a standard deviation of 6.45. To predict patient readmission in fewer than 30 d, we merged the three smallest race groups, ‘Asian’, ‘Hispanic’ and ‘Other’. Furthermore, we dropped the gender group ‘Unknown/Invalid’ owing to its small sample size, which made meaningful assessment impossible, and we performed balanced random undersampling, resulting in 5,677 cases from each condition. We observed an overall balanced accuracy of 0.59 using a logistic regression model. However, the false-negative rate was highest for the races ‘Other’ and ‘Unknown’, whereas their selection rate was lowest; this model was, therefore, biased (Extended Data Fig. 10e ). Using ehrapy’s compatibility with existing machine learning packages, we used Fairlearn’s ThresholdOptimizer ( Methods ), which improved the selection rates for ‘Other’ from 0.32 to 0.38 and for ‘Unknown’ from 0.23 to 0.42 and the false-negative rates for ‘Other’ from 0.48 to 0.42 and for ‘Unknown’ from 0.61 to 0.45 (Extended Data Fig. 10e ).
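The variance-shrinking effect of mean imputation under MCAR reported above can be reproduced in a few lines on synthetic data; the numbers below are illustrative only and are not taken from the Diabetes 130-US Hospitals dataset:

```python
import numpy as np

rng = np.random.default_rng(42)
x = rng.normal(loc=16.0, scale=8.0, size=10_000)  # hypothetical medication counts

# MCAR: remove 30% of values completely at random
mask = rng.random(x.size) < 0.3
x_missing = x.copy()
x_missing[mask] = np.nan

# Mean imputation: the location is preserved, but the variance shrinks,
# because every imputed value sits exactly at the center of the distribution
x_imputed = np.where(np.isnan(x_missing), np.nanmean(x_missing), x_missing)
```

With 30% missingness, the imputed standard deviation drops to roughly sqrt(0.7) of the original, the same qualitative behavior as the 8.1 to 6.8 drop observed in the use case.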

Clustering offers a hypothesis-free alternative to supervised classification when clear hypotheses or labels are missing. It has enabled the identification of heart failure subtypes 86 and progression pathways 87 and COVID-19 severity states 88 . This concept, which is central to ehrapy, further allowed us to identify fine-grained groups of ‘unspecified pneumonia’ cases in the PIC dataset while discovering biomarkers and quantifying effects of medications on LOS. Such retroactive characterization showcases ehrapy’s ability to put complex evidence into context. This approach supports feedback loops to improve diagnostic and therapeutic strategies, leading to more efficiently allocated resources in healthcare.

ehrapy’s flexible data structures enabled us to integrate the heterogeneous UKB data for predictive performance in myocardial infarction. The different data types and distributions posed a challenge for predictive models that were overcome with ehrapy’s pre-processing modules. Our analysis underscores the potential of combining phenotypic and health data at population scale through ehrapy to enhance risk prediction.

By adapting pseudotime approaches that are commonly used in other omics domains, we successfully recovered disease trajectories from raw imaging data with ehrapy. The determined pseudotime, however, only orders data but does not necessarily provide a future projection per patient. Understanding the driver features for fate mapping in image-based datasets is challenging. The incorporation of image segmentation approaches could mitigate this issue and provide a deeper insight into the spatial and temporal dynamics of disease-related processes.

Limitations of our analyses include the lack of control for informative missingness where the absence of information represents information in itself 89 . Translation from Chinese to English in the PIC database can cause information loss and inaccuracies because the Chinese ICD-10 codes are seven characters long compared to the five-character English codes. Incompleteness of databases, such as the lack of radiology images in the PIC database, low sample sizes, underrepresentation of non-White ancestries and participant self-selection, cannot be accounted for and limit generalizability. This restricts deeper phenotyping of, for example, all ‘unspecified pneumonia’ cases with respect to their survival, which could be overcome by the use of multiple databases. Our causal inference use case is limited by unrecorded variables, such as Sequential Organ Failure Assessment (SOFA) scores, and pneumonia-related pathogens that are missing in the causal graph due to dataset constraints, such as high sparsity and substantial missing data, which risk overfitting and can lead to overinterpretation. We counterbalanced this by employing several refutation methods that statistically reject the causal hypothesis, such as a placebo treatment, a random common cause or an unobserved common cause. The longer hospital stays associated with penicillins and cephalosporins may be dataset specific and stem from higher antibiotic resistance, their use as first-line treatments, more severe initial cases, comorbidities and hospital-specific protocols.

Most analysis steps can introduce algorithmic biases where results are misleading or unfavorably affect specific groups. This is particularly relevant in the context of missing data 22 where determining the type of missing data is necessary to handle it correctly. ehrapy includes an implementation of Little’s test 90 , which tests whether data are distributed MCAR to discern missing data types. For MCAR data, single-imputation approaches, such as mean, median or mode imputation, can suffice, but these methods are known to reduce variability 91 , 92 . Multiple imputation strategies, such as Multiple Imputation by Chained Equations (MICE) 93 and MissForest 85 , as implemented in ehrapy, are effective for both MCAR and MAR data 22 , 94 , 95 . MNAR data require pattern-mixture or shared-parameter models that explicitly incorporate the mechanism by which data are missing 96 . Because MNAR involves unobserved data, the assumptions about the missingness mechanism cannot be directly verified, making sensitivity analysis crucial 21 . ehrapy’s wide range of normalization functions and grouping functionality makes it possible to account for intrinsic variability within subgroups, and its compatibility with Fairlearn 83 can potentially mitigate predictor biases. Generally, we recommend assessing all pre-processing steps in an iterative manner with respect to downstream applications, such as patient stratification. Moreover, sensitivity analysis can help verify the robustness of all inferred knowledge 97 .

These diverse use cases illustrate ehrapy’s potential to sufficiently address the need for a computationally efficient, extendable, reproducible and easy-to-use framework. ehrapy is compatible with major standards, such as the Observational Medical Outcomes Partnership (OMOP) Common Data Model (CDM) 47 , HL7, FHIR or openEHR, with flexible support for common tabular data formats. Once loaded into an AnnData object, subsequent sharing of analysis results is made easy because AnnData objects can be stored and read platform independently. ehrapy’s rich documentation of the application programming interface (API) and extensive hands-on tutorials make EHR analysis accessible to both novices and experienced analysts.

As ehrapy remains under active development, users can expect ehrapy to continuously evolve. We are improving support for the joint analysis of EHR, genetics and molecular data where ehrapy serves as a bridge between the EHR and the omics communities. We further anticipate the generation of EHR-specific reference datasets, so-called atlases 98 , to enable query-to-reference mapping where new datasets get contextualized by transferring annotations from the reference to the new dataset. To promote the sharing and collective analysis of EHR data, we envision adapted versions of interactive single-cell data explorers, such as CELLxGENE 99 or the UCSC Cell Browser 100 , for EHR data. Such web interfaces would also include disparity dashboards 20 to unveil trends of preferential outcomes for distinct patient groups. Additional modules specifically for high-frequency time-series data, natural language processing and other data types are currently under development. With the widespread availability of code-generating large language models, frameworks such as ehrapy are becoming accessible to medical professionals without coding expertise who can leverage its analytical power directly. Therefore, ehrapy, together with a lively ecosystem of packages, has the potential to enhance the scientific discovery pipeline to shape the era of EHR analysis.

All datasets that were used during the development of ehrapy and the use cases were used according to their terms of use as indicated by each provider.

Design and implementation of ehrapy

A unified pipeline as provided by our ehrapy framework streamlines the analysis of EHR data by providing an efficient, standardized approach, which reduces the complexity and variability in data pre-processing and analysis. This consistency ensures reproducibility of results and facilitates collaboration and sharing within the research community. Additionally, the modular structure allows for easy extension and customization, enabling researchers to adapt the pipeline to their specific needs while building on a solid foundational framework.

ehrapy was designed from the ground up as an open-source effort with community support. The package, together with all associated tutorials and dataset preparation scripts, is open source. Development takes place publicly on GitHub, where the developers discuss feature requests and issues directly with users. This tight interaction between both groups ensures that we implement the most pressing needs to cater to the most important use cases and can guide users when difficulties arise. The open-source nature, extensive documentation and modular structure of ehrapy are designed for other developers to build upon and extend ehrapy’s functionality where necessary. This allows us to focus ehrapy on the most important features to keep the number of dependencies to a minimum.

ehrapy was implemented in the Python programming language and builds upon numerous existing numerical and scientific open-source libraries, specifically matplotlib 101 , seaborn 102 , NumPy 103 , numba 104 , Scipy 105 , scikit-learn 53 and Pandas 106 . Although taking considerable advantage of all packages implemented, ehrapy also shares the limitations of these libraries, such as a lack of GPU support or small performance losses due to the translation layer cost for operations between the Python interpreter and the lower-level C language for matrix operations. However, by building on very widely used open-source software, we ensure seamless integration and compatibility with a broad range of tools and platforms to promote community contributions. Additionally, by doing so, we enhance security by allowing a larger pool of developers to identify and address vulnerabilities 107 . All functions are grouped into task-specific modules whose implementation is complemented with additional dependencies.

Data preparation

Dataloaders.

ehrapy is compatible with any type of vectorized data, where vectorized refers to the data being stored in structured tables in either on-disk or database form. The input and output module of ehrapy provides readers for common formats, such as OMOP, CSV tables or SQL databases, through Pandas. When reading in such datasets, the data are stored in the appropriate slots in a new AnnData 46 object. ehrapy’s data module provides access to more than 20 public EHR datasets that feature diseases including, but not limited to, Parkinson’s disease, breast cancer and chronic kidney disease. All dataloaders return AnnData objects to allow for immediate analysis.
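A minimal version of such a reader can be sketched with Pandas: load a table and split it into a numeric data matrix plus per-observation annotations. The table and column names below are hypothetical, and this is an illustration of the idea rather than ehrapy’s reader API:

```python
import io
import pandas as pd

# Hypothetical patient table, as it might arrive from a CSV export
csv = io.StringIO(
    "patient_id,age,systolic_bp,hba1c\n"
    "p1,64,120,5.4\n"
    "p2,71,135,6.1\n"
)
df = pd.read_csv(csv, index_col="patient_id")

# Split into the pieces an AnnData-style container expects:
# a numeric measurement matrix and per-observation annotations
X = df[["systolic_bp", "hba1c"]].to_numpy()
obs = df[["age"]]
```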

AnnData for EHR data

Our framework required a versatile data structure capable of handling various matrix formats, including Numpy 103 for general use cases and interoperability, Scipy 105 sparse matrices for efficient storage, Dask 108 matrices for larger-than-memory analysis and Awkward array 109 for irregular time-series data. We needed a single data structure that not only stores data but also includes comprehensive annotations for thorough contextual analysis. It was essential for this structure to be widely used and supported, which ensures robustness and continual updates. Interoperability with other analytical packages was a key criterion to facilitate seamless integration within existing tools and workflows. Finally, the data structure had to support both in-memory operations and on-disk storage using formats such as HDF5 (ref. 110 ) and Zarr 111 , ensuring efficient handling and accessibility of large datasets and the ability to easily share them with collaborators.

All of these requirements are fulfilled by the AnnData format, which is a popular data structure in single-cell genomics. At its core, an AnnData object encapsulates diverse components, providing a holistic representation of data and metadata that are always aligned in dimensions and easily accessible. A data matrix (commonly referred to as ‘X’) stands as the foundational element, embodying the measured data. This matrix can be dense (as Numpy array), sparse (as Scipy sparse matrix) or ragged (as Awkward array) where dimensions do not align within the data matrix. The AnnData object can feature several such data matrices stored in ‘layers’. Examples of such layers can be unnormalized or unencoded data. These data matrices are complemented by an observations (commonly referred to as ‘obs’) segment where annotations on the level of patients or visits are stored. Patients’ age or sex, for instance, are often used as such annotations. The variables (commonly referred to as ‘var’) section complements the observations, offering supplementary details about the features in the dataset, such as missing data rates. The observation-specific matrices (commonly referred to as ‘obsm’) section extends the capabilities of the AnnData structure by allowing the incorporation of observation-specific matrices. These matrices can represent various types of information at the level of individual observations, such as principal component analysis (PCA) results, t-distributed stochastic neighbor embedding (t-SNE) coordinates or other dimensionality reduction outputs. Analogously, AnnData features a variable-specific matrices (commonly referred to as ‘varm’) component. The observation-specific pairwise relationships (commonly referred to as ‘obsp’) segment complements the ‘obsm’ section by accommodating observation-specific pairwise relationships. This can include connectivity matrices, indicating relationships between patients.
The inclusion of an unstructured annotations (commonly referred to as ‘uns’) component further enhances flexibility. This segment accommodates unstructured annotations or arbitrary data that might not conform to the structured observations or variables categories. Any AnnData object can be stored on disk in h5ad or Zarr format to facilitate data exchange.
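The slot layout described above can be mimicked with plain containers to make the semantics concrete. The values are hypothetical, and this dictionary is only a schematic stand-in for an actual AnnData object:

```python
import numpy as np
import pandas as pd

n_obs, n_var = 3, 2  # three patients, two features

adata_like = {
    # data matrix: rows are observations (patients), columns are variables
    "X": np.array([[120.0, 5.4], [135.0, 6.1], [110.0, 4.9]]),
    "layers": {"raw": np.array([[120.0, 5.4], [135.0, 6.1], [110.0, 4.9]])},
    # per-observation annotations, aligned with the rows of X
    "obs": pd.DataFrame({"age": [64, 71, 58], "sex": ["F", "M", "F"]},
                        index=["patient_1", "patient_2", "patient_3"]),
    # per-variable annotations, aligned with the columns of X
    "var": pd.DataFrame({"missing_rate": [0.0, 0.1]},
                        index=["systolic_bp", "hba1c"]),
    "obsm": {"X_pca": np.zeros((n_obs, 2))},    # observation-specific matrices
    "varm": {"loadings": np.zeros((n_var, 2))}, # variable-specific matrices
    "obsp": {"connectivities": np.eye(n_obs)},  # pairwise patient relationships
    "uns": {"dataset": "hypothetical example"}, # unstructured metadata
}

# Dimensions stay aligned: obs annotates rows of X, var annotates columns
assert adata_like["X"].shape == (len(adata_like["obs"]), len(adata_like["var"]))
```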

ehrapy natively interfaces with the scientific Python ecosystem via Pandas 112 and Numpy 103 . The development of deep learning models for EHR data 113 is further accelerated through compatibility with pathml 114 , a unified framework for whole-slide image analysis in pathology, and scvi-tools 115 , which provides data loaders for loading tensors from AnnData objects into PyTorch 116 or Jax arrays 117 to facilitate the development of generalizing foundational models for medical artificial intelligence 118 .

Feature annotation

After AnnData creation, any metadata can be mapped against ontologies using Bionty ( https://github.com/laminlabs/bionty-base ). Bionty provides access to the Human Phenotype, Phecodes, Phenotype and Trait, Drug, Mondo and Human Disease ontologies.

Key medical terms stored in an AnnData object in free text can be extracted using the Medical Concept Annotation Toolkit (MedCAT) 119 .

Data processing

Cohort tracking.

ehrapy provides a CohortTracker tool that traces all filtering steps applied to an associated AnnData object. To calculate cohort summary statistics, the implementation makes use of tableone 120 and can subsequently be plotted as bar charts together with flow diagrams 121 that visualize the order and reasoning of filtering operations.

Basic pre-processing and quality control

ehrapy encompasses a suite of functionalities for fundamental data processing that are adopted from scanpy 52 but adapted to EHR data:

Regress out: To address unwanted sources of variation, a regression procedure is integrated, enhancing the dataset’s robustness.

Subsample: Selects a specified fraction of observations.

Balanced sample: Balances groups in the dataset by random oversampling or undersampling.

Highly variable features: The identification and annotation of highly variable features following the ‘highly variable genes’ function of scanpy is seamlessly incorporated, providing users with insights into pivotal elements influencing the dataset.

To identify and minimize quality issues, ehrapy provides several quality control functions:

Basic quality control: Determines the relative and absolute number of missing values per feature and per patient.

Winsorization: For data refinement, ehrapy implements a winsorization process, creating a version of the input array less susceptible to extreme values.

Feature clipping: Imposes limits on features to enhance dataset reliability.

Detect biases: Computes pairwise correlations between features, standardized mean differences for numeric features between groups of sensitive features, categorical feature value count differences between groups of sensitive features and feature importances when predicting a target variable.

Little’s MCAR test: Applies Little’s MCAR test whose null hypothesis is that data are MCAR. Rejecting the null hypothesis may not always mean that data are not MCAR, nor is accepting the null hypothesis a guarantee that data are MCAR. For more details, see Schouten et al. 122 .

Summarize features: Calculates statistical indicators per feature, including minimum, maximum and average values. This can be especially useful to reduce complex data with multiple measurements per feature per patient into sets of columns with single values.
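Winsorization and feature clipping, as listed above, both amount to capping extreme values; the following is a minimal numpy sketch on hypothetical data, with illustrative percentile and clipping limits:

```python
import numpy as np

rng = np.random.default_rng(0)
values = np.append(rng.normal(50, 5, size=99), 500.0)  # one extreme outlier

# Winsorize: cap values beyond the 5th/95th percentiles at those percentiles
lo, hi = np.percentile(values, [5, 95])
winsorized = np.clip(values, lo, hi)

# Feature clipping: impose fixed, domain-motivated limits instead
clipped = np.clip(values, 0.0, 100.0)
```

The difference is the source of the limits: winsorization derives them from the data, whereas clipping imposes externally chosen bounds.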

Imputation is crucial in data analysis to address missing values, ensuring the completeness of datasets that can be required for specific algorithms. The ‘ehrapy’ pre-processing module offers a range of imputation techniques:

Explicit Impute: Replaces missing values, in either all columns or a user-specified subset, with a designated replacement value.

Simple Impute: Imputes missing values in numerical data using mean, median or the most frequent value, contributing to a more complete dataset.

KNN Impute: Uses k -nearest neighbor imputation to fill in missing values in the input AnnData object, preserving local data patterns.

MissForest Impute: Implements the MissForest strategy for imputing missing data, providing a robust approach for handling complex datasets.

MICE Impute: Applies the MICE algorithm for imputing data. This implementation is based on the miceforest ( https://github.com/AnotherSamWilson/miceforest ) package.
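The simple and KNN strategies above can be illustrated directly with scikit-learn, on which ehrapy builds; the matrix and parameters are hypothetical, and this is not ehrapy’s own API:

```python
import numpy as np
from sklearn.impute import SimpleImputer, KNNImputer

# Hypothetical matrix with missing entries
X = np.array([[1.0, 2.0], [np.nan, 3.0], [7.0, np.nan], [4.0, 5.0]])

# Simple imputation: replace NaNs with the column mean
X_mean = SimpleImputer(strategy="mean").fit_transform(X)

# KNN imputation: replace NaNs with averages over the nearest observations,
# preserving local data patterns
X_knn = KNNImputer(n_neighbors=2).fit_transform(X)
```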

Data encoding is required if the dataset contains categorical values, because most algorithms in ehrapy are compatible only with numerical values. ehrapy offers two encoding algorithms based on scikit-learn 53 :

One-Hot Encoding: Transforms categorical variables into binary vectors, creating a binary feature for each category and capturing the presence or absence of each category in a concise representation.

Label Encoding: Assigns a unique numerical label to each category, facilitating the representation of categorical data as ordinal values and supporting algorithms that require numerical input.
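Both encodings can be illustrated with the underlying scikit-learn classes; the admission-type column below is hypothetical:

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder, LabelEncoder

admission = np.array([["emergency"], ["referral"], ["emergency"], ["elective"]])

# One-hot: one binary column per category, exactly one 1 per row
onehot = OneHotEncoder().fit_transform(admission).toarray()

# Label encoding: one integer per category (alphabetical order of categories)
labels = LabelEncoder().fit_transform(admission.ravel())
```

One-hot encoding avoids implying an ordering between categories, whereas label encoding is compact but should be reserved for ordinal data or algorithms that are insensitive to the induced order.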

To ensure that the distributions of the heterogeneous data are aligned, ehrapy offers several normalization procedures:

Log Normalization: Applies the natural logarithm function to the data, useful for handling skewed distributions and reducing the impact of outliers.

Max-Abs Normalization: Scales each feature by its maximum absolute value, ensuring that the maximum absolute value for each feature is 1.

Min-Max Normalization: Transforms the data to a specific range (commonly (0, 1)) by scaling each feature based on its minimum and maximum values.

Power Transformation Normalization: Applies a power transformation to make the data more Gaussian like, often useful for stabilizing variance and improving the performance of models sensitive to distributional assumptions.

Quantile Normalization: Aligns the distributions of multiple variables, ensuring that their quantiles match, which can be beneficial for comparing datasets or removing batch effects.

Robust Scaling Normalization: Scales data using the interquartile range, making it robust to outliers and suitable for datasets with extreme values.

Scaling Normalization: Standardizes data by subtracting the mean and dividing by the standard deviation, creating a distribution with a mean of 0 and a standard deviation of 1.

Offset to Positive Values: Shifts all values by a constant offset to make all values non-negative, with the lowest negative value becoming 0.
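A few of these normalizations can be illustrated with the underlying scikit-learn scalers plus a plain numpy offset; the matrix is hypothetical:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[1.0, 200.0], [2.0, 300.0], [3.0, 400.0]])

# Min-max: each feature rescaled to the range [0, 1]
X_minmax = MinMaxScaler().fit_transform(X)

# Scaling: each feature centered to mean 0 and scaled to unit variance
X_scaled = StandardScaler().fit_transform(X)

# Offset to positive values: shift so the lowest value per feature becomes 0
X_offset = X - X.min(axis=0)
```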

Dataset shifts can be corrected using the scanpy implementation of the ComBat 123 algorithm, which employs a parametric and non-parametric empirical Bayes framework for adjusting data for batch effects that is robust to outliers.

Finally, a neighbors graph can be efficiently computed using scanpy’s implementation.

To obtain meaningful lower-dimensional embeddings that can subsequently be visualized and reused for downstream algorithms, ehrapy provides the following algorithms based on scanpy’s implementation:

t-SNE: Uses a probabilistic approach to embed high-dimensional data into a lower-dimensional space, emphasizing the preservation of local similarities and revealing clusters in the data.

UMAP: Embeds data points by modeling their local neighborhood relationships, offering an efficient and scalable technique that captures both global and local structures in high-dimensional data.

Force-Directed Graph Drawing: Uses a physical simulation to position nodes in a graph, with edges representing pairwise relationships, creating a visually meaningful representation that emphasizes connectedness and clustering in the data.

Diffusion Maps: Applies spectral methods to capture the intrinsic geometry of high-dimensional data by modeling diffusion processes, providing a way to uncover underlying structures and patterns.

Density Calculation in Embedding: Quantifies the density of observations within an embedding, considering conditions or groups, offering insights into the concentration of data points in different regions and aiding in the identification of densely populated areas.
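Two of these embedding steps can be sketched with scikit-learn; ehrapy’s own functions wrap scanpy, so this is only an illustration of the underlying techniques on hypothetical data:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 10))  # hypothetical: 60 patients, 10 features

# PCA often precedes neighbor-based embeddings to denoise and speed them up
X_pca = PCA(n_components=5).fit_transform(X)

# t-SNE embeds into 2-D while preserving local neighborhood structure
X_tsne = TSNE(n_components=2, perplexity=10, random_state=0).fit_transform(X_pca)
```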

ehrapy further provides algorithms for clustering and trajectory inference based on scanpy:

Leiden Clustering: Uses the Leiden algorithm to cluster observations into groups, revealing distinct communities within the dataset with an emphasis on intra-cluster cohesion.

Hierarchical Clustering Dendrogram: Constructs a dendrogram through hierarchical clustering based on specified group by categories, illustrating the hierarchical relationships among observations and facilitating the exploration of structured patterns.

Feature ranking

ehrapy provides two ways of ranking feature contributions to clusters and target variables:

Statistical tests: To compare any obtained clusters to obtain marker features that are significantly different between the groups, ehrapy extends scanpy’s ‘rank genes groups’. The original implementation, which features a t -test for numerical data, is complemented by a g -test for categorical data.

Feature importance: Calculates feature rankings for a target variable using linear regression, support vector machine or random forest models from scikit-learn. ehrapy evaluates the relative importance of each predictor by fitting the model and extracting model-specific metrics, such as coefficients or feature importances.
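Both ranking approaches can be sketched with scipy and scikit-learn on simulated data in which only the first feature separates the groups; this illustrates the idea rather than ehrapy’s API:

```python
import numpy as np
from scipy.stats import ttest_ind
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
group = rng.integers(0, 2, size=200)
# Hypothetical features: the first separates the groups, the second is noise
X = np.column_stack([group + rng.normal(scale=0.3, size=200),
                     rng.normal(size=200)])

# Statistical test: compare a numerical feature between the two groups
t_stat, p_value = ttest_ind(X[group == 0, 0], X[group == 1, 0])

# Feature importance: fit a model and read off model-specific metrics
importances = RandomForestClassifier(random_state=0).fit(X, group).feature_importances_
```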

Dataset integration

Based on scanpy’s ‘ingest’ function, ehrapy facilitates the integration of labels and embeddings from a well-annotated reference dataset into a new dataset, enabling the mapping of cluster annotations and spatial relationships for consistent comparative analysis. This process ensures harmonized clinical interpretations across datasets, especially useful when dealing with multiple experimental diseases or batches.

Knowledge inference

Survival analysis.

ehrapy’s implementation of survival analysis algorithms is based on lifelines 124 :

Ordinary Least Squares (OLS) Model: Creates a linear regression model using OLS from a specified formula and an AnnData object, allowing for the analysis of relationships between variables and observations.

Generalized Linear Model (GLM): Constructs a GLM from a given formula, distribution and AnnData, providing a versatile framework for modeling relationships with nonlinear data structures.

Kaplan–Meier: Fits the Kaplan–Meier curve to generate survival curves, offering a visual representation of the probability of survival over time in a dataset.

Cox Hazard Model: Constructs a Cox proportional hazards model using a specified formula and an AnnData object, enabling the analysis of survival data by modeling the hazard rates and their relationship to predictor variables.

Log-Rank Test: Calculates the P value for the log-rank test, comparing the survival functions of two groups, providing statistical significance for differences in survival distributions.

GLM Comparison: Given two fit GLMs, where the larger encompasses the parameter space of the smaller, this function returns the P value, indicating the significance of the larger model and adding explanatory power beyond the smaller model.
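The Kaplan–Meier estimator at the heart of these survival curves can be written out directly with the product-limit formula; the following is a hand-rolled sketch on hypothetical follow-up data, not the lifelines implementation that ehrapy uses:

```python
import numpy as np

def kaplan_meier(time, event):
    """Product-limit estimate of S(t) at each distinct event time."""
    order = np.argsort(time)
    time, event = time[order], event[order]
    event_times, surv = [], []
    s = 1.0
    for t in np.unique(time[event == 1]):
        at_risk = np.sum(time >= t)           # still under observation at t
        deaths = np.sum((time == t) & (event == 1))
        s *= 1.0 - deaths / at_risk           # conditional survival through t
        event_times.append(t)
        surv.append(s)
    return np.array(event_times), np.array(surv)

# Hypothetical follow-up times; censored observations have event = 0
time = np.array([2.0, 3.0, 3.0, 5.0, 8.0, 10.0])
event = np.array([1, 1, 0, 1, 0, 1])
t_km, s_km = kaplan_meier(time, event)
```

Censored patients leave the risk set without triggering a drop in the curve, which is exactly why censoring must be recorded rather than discarded.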

Trajectory inference

Trajectory inference is a computational approach that reconstructs and models the developmental paths and transitions within heterogeneous clinical data, providing insights into the temporal progression underlying complex systems. ehrapy offers several inbuilt algorithms for trajectory inference based on scanpy:

Diffusion Pseudotime: Infers the progression of observations by measuring geodesic distance along the graph, providing a pseudotime metric that represents the developmental trajectory within the dataset.

Partition-based Graph Abstraction (PAGA): Maps out the coarse-grained connectivity structures of complex manifolds using a partition-based approach, offering a comprehensive visualization of relationships in high-dimensional data and aiding in the identification of macroscopic connectivity patterns.

Because ehrapy is compatible with scverse, further trajectory inference-based algorithms, such as CellRank, can be seamlessly applied.
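The core idea behind diffusion pseudotime, ordering observations by their geodesic distance from a root along the neighbor graph, can be illustrated in highly simplified form. The sketch below uses plain breadth-first shortest-path distance on a toy unweighted graph as a stand-in for the diffusion-based metric, and the graph itself is invented for illustration:

```python
from collections import deque

def graph_pseudotime(adjacency, root):
    """Breadth-first geodesic distance from `root` on an unweighted
    neighbor graph, as a toy stand-in for a pseudotime ordering."""
    dist = {root: 0}
    queue = deque([root])
    while queue:
        node = queue.popleft()
        for nb in adjacency[node]:
            if nb not in dist:
                dist[nb] = dist[node] + 1
                queue.append(nb)
    return dist

# toy 5-node neighbor graph: 0 - 1 - 2 - 3 with a side branch 1 - 4
adj = {0: [1], 1: [0, 2, 4], 2: [1, 3], 3: [2], 4: [1]}
order = graph_pseudotime(adj, root=0)   # {0: 0, 1: 1, 2: 2, 4: 2, 3: 3}
```

Observations farther from the chosen root along the graph receive larger pseudotime values, which is the ordering principle the diffusion-based implementation refines.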

Causal inference

ehrapy’s causal inference module builds on ‘dowhy’ 69 and follows four key steps, all of which are implemented in ehrapy:

Graphical Model Specification: Define a causal graphical model representing relationships between variables and potential causal effects.

Causal Effect Identification: Automatically identify whether a causal effect can be inferred from the given data, addressing confounding and selection bias.

Causal Effect Estimation: Employ automated tools to estimate causal effects, using methods such as matching, instrumental variables or regression.

Sensitivity Analysis and Testing: Perform sensitivity analysis to assess the robustness of causal inferences and conduct statistical testing to determine the significance of the estimated causal effects.
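The estimation step can be sketched in a few lines of numpy; this is an illustration of the principle (regression adjustment for a confounder on simulated data), not dowhy's or ehrapy's actual interface, and all variable names and coefficients are invented for the example:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000

# simulate: confounder -> treatment, confounder and treatment -> outcome
confounder = rng.normal(size=n)
treatment = (confounder + rng.normal(size=n) > 0).astype(float)
outcome = 2.0 * treatment + 1.5 * confounder + rng.normal(size=n)

# the naive difference in means is biased by the confounder
naive = outcome[treatment == 1].mean() - outcome[treatment == 0].mean()

# OLS that includes the confounder recovers the simulated effect of 2.0
X = np.column_stack([np.ones(n), treatment, confounder])
coef, *_ = np.linalg.lstsq(X, outcome, rcond=None)
adjusted = coef[1]
```

The gap between the naive and adjusted estimates is exactly the confounding that the identification step is meant to expose before any estimate is trusted.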

Patient stratification

ehrapy’s complete pipeline, from pre-processing to the generation of lower-dimensional embeddings, clustering, statistical comparison between determined groups and more, facilitates the stratification of patients.

Visualization

ehrapy features an extensive visualization pipeline that is customizable and yet offers reasonable defaults. Almost every analysis function is matched with at least one visualization function that often shares the name but is available through the plotting module. For example, after importing ehrapy as ‘ep’, ‘ep.tl.umap(adata)’ runs the UMAP algorithm on an AnnData object, and ‘ep.pl.umap(adata)’ would then plot a scatter plot of the UMAP embedding.

ehrapy further offers a suite of more generally usable and modifiable plots:

Scatter Plot: Visualizes data points along observation or variable axes, offering insights into the distribution and relationships between individual data points.

Heatmap: Represents feature values in a grid, providing a comprehensive overview of the data’s structure and patterns.

Dot Plot: Displays count values of specified variables as dots, offering a clear depiction of the distribution of counts for each variable.

Filled Line Plot: Illustrates trends in data with filled lines, emphasizing variations in values over a specified axis.

Violin Plot: Presents the distribution of data through mirrored density plots, offering a concise view of the data’s spread.

Stacked Violin Plot: Combines multiple violin plots, stacked to allow for visual comparison of distributions across categories.

Group Mean Heatmap: Creates a heatmap displaying the mean count per group for each specified variable, providing insights into group-wise trends.

Hierarchically Clustered Heatmap: Uses hierarchical clustering to arrange data in a heatmap, revealing relationships and patterns among variables and observations.

Rankings Plot: Visualizes rankings within the data, offering a clear representation of the order and magnitude of values.

Dendrogram Plot: Plots a dendrogram of categories defined in a group by operation, illustrating hierarchical relationships within the dataset.

Benchmarking ehrapy

We generated a subset of the UKB data by selecting 261 features and 488,170 patient visits. We removed all features with missingness rates greater than 70%. To demonstrate speed and memory consumption for various scenarios, we subsampled the data to 20%, 30% and 50%. We ran a minimal ehrapy analysis pipeline on each of those subsets and the full data, including the calculation of quality control metrics, filtering of variables by a missingness threshold, nearest neighbor imputation, normalization, dimensionality reduction and clustering (Supplementary Table 1). We conducted our benchmark on a single CPU with eight threads and 60 GB of maximum memory.

ehrapy further provides out-of-core implementations using Dask 108 for many of its algorithms, such as the normalization functions or the PCA implementation. Out-of-core computation refers to techniques that process data that do not fit entirely in memory, using disk storage to manage data overflow. This approach is crucial for handling large datasets without being constrained by system memory limits. Because the principal components are reused by other computationally expensive algorithms, such as the neighbors graph calculation, this effectively enables the analysis of very large datasets. We are currently working on supporting out-of-core computation for all computationally expensive algorithms in ehrapy.
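The principle behind such out-of-core computation can be sketched with a single streaming pass that accumulates sufficient statistics chunk by chunk; this is an illustration of the idea only, not ehrapy's Dask-backed implementation:

```python
import numpy as np

def chunked_mean_std(chunks):
    """One streaming pass over data chunks: only running sums are kept,
    so no more than one chunk is ever held in memory."""
    n, s, ss = 0, 0.0, 0.0
    for chunk in chunks:            # each chunk could be read from disk
        n += chunk.size
        s += chunk.sum()
        ss += (chunk ** 2).sum()
    mean = s / n
    std = np.sqrt(ss / n - mean ** 2)
    return mean, std

data = np.arange(10.0)
chunks = np.array_split(data, 4)    # stand-in for on-disk partitions
m, sd = chunked_mean_std(chunks)    # matches data.mean(), data.std()
```

Scaling normalization built on such statistics then only ever needs one chunk in memory at a time, which is why memory usage stays flat as the dataset grows.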

We demonstrate the memory benefits in a hosted tutorial where the in-memory pipeline for 50,000 patients with 1,000 features required about 2 GB of memory, and the corresponding out-of-core implementation required less than 200 MB of memory.

The code for benchmarking is available at https://github.com/theislab/ehrapy-reproducibility . The implementation of ehrapy is accessible at https://github.com/theislab/ehrapy together with extensive API documentation and tutorials at https://ehrapy.readthedocs.io .

PIC database analysis

Study design.

We collected clinical data from the PIC 43 version 1.1.0 database. PIC is a single-center, bilingual (English and Chinese) database hosting information on children admitted to critical care units at the Children’s Hospital of Zhejiang University School of Medicine in China. The requirement for individual patient consent was waived because the study did not impact clinical care, and all protected health information was de-identified. The database contains 13,499 distinct hospital admissions of 12,881 distinct pediatric patients. These patients were admitted to five ICU units with 119 total critical care beds—GICU, PICU, SICU, CICU and NICU—between 2010 and 2018. The mean age of the patients was 2.5 years, and 42.5% of the patients were female. The in-hospital mortality was 7.1%; the mean hospital stay was 17.6 d; the mean ICU stay was 9.3 d; and 468 (3.6%) patients were admitted multiple times. Demographics, diagnoses, doctors’ notes, laboratory and microbiology tests, prescriptions, fluid balances, vital signs and radiographic reports were collected from all patients. For more details, see the original publication of Zeng et al. 43 .

Study participants

Individuals older than 18 years were excluded from the study. We grouped the data into three distinct groups: ‘neonates’ (0–28 d of age; 2,968 patients), ‘infants’ (1–12 months of age; 4,876 patients) and ‘youths’ (13 months to 18 years of age; 6,097 patients). We primarily analyzed the ‘youths’ group with the discharge diagnosis ‘unspecified pneumonia’ (277 patients).

Data collection

The collected clinical data included demographics, laboratory and vital sign measurements, diagnoses, microbiology and medication information and mortality outcomes. The five-character English ICD-10 codes were used, whose values are based on the seven-character Chinese ICD-10 codes.

Dataset extraction and analysis

We downloaded the PIC database version 1.1.0 from Physionet 1 to obtain 17 CSV tables. Using Pandas, we selected all information with a coverage rate of more than 50%, including demographics and laboratory and vital sign measurements (Fig. 2). To reduce the amount of noise, we calculated and added only the minimum, maximum and average of all measurements that had multiple values per patient. Examination reports were removed because they describe only diagnostics and not detailed findings. All further diagnoses and microbiology and medication information were included in the observations slot to ensure that the data were not used for the calculation of embeddings but were still available for the analysis. This ensured that any calculated embedding would not be divided into treated and untreated groups but, rather, solely based on phenotypic features. We imputed all missing data through k-nearest neighbors imputation (k = 20) using the knn_impute function of ehrapy. Next, we log-normalized the data with ehrapy using the log_norm function. Afterwards, we winsorized the data using ehrapy’s winsorize function to obtain 277 ICU visits (n = 265 patients) with 572 features. Of those 572 features, 254 were stored in the matrix X and the remaining 318 in the ‘obs’ slot in the AnnData object. For clustering and visualization purposes, we calculated 50 principal components using ehrapy’s pca function. The obtained principal component representation was then used to calculate a nearest neighbors graph using the neighbors function of ehrapy. The nearest neighbors graph then served as the basis for a UMAP embedding calculation using ehrapy’s umap function.
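The winsorization step in this pipeline amounts to clipping extreme values at chosen percentiles. A minimal numpy sketch of that idea follows; the 5%/95% limits here are illustrative assumptions, not the thresholds used in the study:

```python
import numpy as np

def winsorize(values, lower_pct=5, upper_pct=95):
    """Clip values below/above the given percentiles to limit
    the influence of outliers on downstream embeddings."""
    lo, hi = np.percentile(values, [lower_pct, upper_pct])
    return np.clip(values, lo, hi)

x = np.array([1.0, 2.0, 2.5, 3.0, 100.0])   # 100.0 is an outlier
w = winsorize(x)                             # extreme values pulled to the limits
```

Unlike dropping outliers, winsorizing keeps the sample size intact while bounding each feature's range.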

We applied the community detection algorithm Leiden with resolution 0.6 on the nearest neighbor graph using ehrapy’s leiden function. The four obtained clusters served as input for two-sided t-tests for all numerical values and two-sided g-tests for all categorical values, comparing each of the four clusters against the union of the other three clusters, respectively. This was conducted using ehrapy’s rank_features_groups function, which also corrects P values for multiple testing with the Benjamini–Hochberg method 125 . We presented the four groups and the statistically significantly different features between the groups to two pediatricians, who annotated the groups with labels.

Our determined groups can be confidently labeled owing to their distinct clinical profiles. Nevertheless, we could only take into account clinical features that were measured. Insightful features, such as lung function tests, are missing. Moreover, the feature representation of the time-series data is simplified, which can hide some nuances between the groups. Generally, deciding on a clustering resolution is difficult. However, more fine-grained clusters obtained via higher clustering resolutions may become too specific and not generalize well enough.

Kaplan–Meier survival analysis

We selected patients with up to 360 h of total stay for Kaplan–Meier survival analysis to ensure a sufficiently high number of participants. We proceeded with the AnnData object prepared as described in the ‘Patient stratification’ subsection to conduct Kaplan–Meier analysis among all four determined pneumonia groups using ehrapy’s kmf function. Significance was tested through ehrapy’s test_kmf_logrank function, which tests whether the difference between two Kaplan–Meier series is statistically significant, employing a chi-squared test statistic under the null hypothesis. Let h_1(t) and h_2(t) be the hazard rates of the two groups at time t and c a constant representing a proportional change in the hazard rate between them; the log-rank test then evaluates the null hypothesis H0: h_1(t) = h_2(t) for all t (that is, c = 1) against the alternative HA: h_1(t) = c · h_2(t) with c ≠ 1.

This implicitly uses the log-rank weights. An additional Kaplan–Meier analysis was conducted for all children jointly concerning the liver markers AST, ALT and GGT. To determine whether measurements were inside or outside the norm range, we used reference ranges (Supplementary Table 2 ). P values less than 0.05 were labeled significant.

Our Kaplan–Meier curve analysis depends on the groups being well defined and shares the same limitations as the patient stratification. Additionally, the analysis is sensitive to the reference table: we selected limits that generalize well across the age ranges but, because children of different ages were examined, they may not be perfectly accurate for every child.

Causal effect of mechanism of action on LOS

Although the dataset was not initially intended for investigating causal effects of interventions, we adapted it for this purpose by focusing on the LOS in the ICU, measured in months, as the outcome variable. This choice aligns with the clinical aim of stabilizing patients sufficiently for ICU discharge. We constructed a causal graph to explore how different drug administrations could potentially reduce the LOS. Based on consultations with clinicians, we included several biomarkers of liver damage (AST, ALT and GGT) and inflammation (CRP and PCT) in our model. Patient age was also considered a relevant variable.

Because several different medications act by the same mechanisms, we grouped specific medications by their drug classes. This grouping was achieved by cross-referencing the drugs listed in the dataset with DrugBank release 5.1 (ref. 126 ), using Levenshtein distances for partial string matching. After manual verification, we extracted the corresponding DrugBank categories, counted the number of features per category and compiled a list of commonly prescribed medications, as advised by clinicians. This approach facilitated the modeling of the causal graph depicted in Fig. 4 , where an intervention is defined as the administration of at least one drug from a specified category.

Causal inference was then conducted with ehrapy’s ‘dowhy’ 69 -based causal inference module using the expert-curated causal graph. Medication groups were designated as causal interventions, and the LOS was the outcome of interest. Linear regression served as the estimation method for analyzing these causal effects. We excluded four patients from the analysis owing to their notably long hospital stays exceeding 90 d, which were deemed outliers. To validate the robustness of our causal estimates, we incorporated several refutation methods:

Placebo Treatment Refuter: This method involved replacing the treatment assignment with a placebo to test the effect of the treatment variable being null.

Random Common Cause: A randomly generated variable was added to the data to assess the sensitivity of the causal estimate to the inclusion of potential unmeasured confounders.

Data Subset Refuter: The stability of the causal estimate was tested across various random subsets of the data to ensure that the observed effects were not dependent on a specific subset.

Add Unobserved Common Cause: This approach tested the effect of an omitted variable by adding a theoretically relevant unobserved confounder to the model, evaluating how much an unmeasured variable could influence the causal relationship.

Dummy Outcome: Replaces the true outcome variable with a random variable. If the causal effect nullifies, it supports the validity of the original causal relationship, indicating that the outcome is not driven by random factors.

Bootstrap Validation: Employs bootstrapping to generate multiple samples from the dataset, testing the consistency of the causal effect across these samples.

The selection of these refuters addresses a broad spectrum of potential biases and model sensitivities, including unobserved confounders and data dependencies. This comprehensive approach ensures robust verification of the causal analysis. Each refuter provides an orthogonal perspective, targeting specific vulnerabilities in causal analysis, which strengthens the overall credibility of the findings.
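Two of the refuters above, the placebo treatment and the dummy outcome, can be sketched directly on simulated data. This is an illustration of the logic only, not dowhy's automated implementation, and the simulated effect sizes are invented for the example:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 4000

# simulated data with a known treatment effect of 2.0
confounder = rng.normal(size=n)
treatment = (confounder + rng.normal(size=n) > 0).astype(float)
outcome = 2.0 * treatment + 1.5 * confounder + rng.normal(size=n)

def effect(y, t, c):
    """Treatment coefficient from OLS of y on [1, t, c]."""
    X = np.column_stack([np.ones(len(y)), t, c])
    return np.linalg.lstsq(X, y, rcond=None)[0][1]

real = effect(outcome, treatment, confounder)
# placebo refuter: a shuffled treatment should show no effect
placebo = effect(outcome, rng.permutation(treatment), confounder)
# dummy outcome refuter: a random outcome should show no effect
dummy = effect(rng.normal(size=n), treatment, confounder)
```

If either refuted estimate stayed close to the real one, the original estimate would be suspect; here both collapse toward zero while the real effect persists.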

UKB analysis

Study population.

We used information from the UKB cohort, which includes 502,164 study participants from the general UK population without enrichment for specific diseases. The study involved the enrollment of individuals between 2006 and 2010 across 22 different assessment centers throughout the United Kingdom. The tracking of participants is still ongoing. Within the UKB dataset, metabolomics, proteomics and retinal optical coherence tomography data are available for a subset of individuals without any enrichment for specific diseases. Additionally, EHRs, questionnaire responses and other physical measures are available for almost everyone in the study. Furthermore, a variety of genotype information is available for nearly the entire cohort, including whole-genome sequencing, whole-exome sequencing, genotyping array data as well as imputed genotypes from the genotyping array 44 . Because only the latter two are available for download, and are sufficient for polygenic risk score calculation as performed here, we used the imputed genotypes in the present study. Participants visited the assessment center up to four times for additional and repeat measurements and completed additional online follow-up questionnaires.

In the present study, we restricted the analyses to data obtained from the initial assessment, including the blood draw, for obtaining the metabolomics data and the retinal imaging as well as physical measures. This restricts the study population to 33,521 individuals for whom all of these modalities are available. We have a clear study start point for each individual with the date of their initial assessment center visit. The study population has a mean age of 57 years, is 54% female and is censored at age 69 years on average; 4.7% experienced an incident myocardial infarction; and 8.1% have prevalent type 2 diabetes. The study population comes from six of the 22 assessment centers due to the retinal imaging being performed only at those.

For the myocardial infarction endpoint definition, we relied on the first occurrence data available in the UKB, which compiles the first date that each diagnosis was recorded for a participant in a hospital in ICD-10 nomenclature. Subsequently, we mapped these data to phecodes and focused on phecode 404.1 for myocardial infarction.

The Framingham Risk Score was developed on data from 8,491 participants in the Framingham Heart Study to assess general cardiovascular risk 77 . It includes easily obtainable predictors and is, therefore, readily applicable in clinical practice, although newer and more specific risk scores exist and might be used more frequently. The score comprises age, sex, smoking behavior, blood pressure, total and low-density lipoprotein cholesterol as well as information on insulin, antihypertensive and cholesterol-lowering medications, all of which are routinely collected in the UKB and used in this study as the Framingham feature set.

The metabolomics data used in this study were obtained using proton NMR spectroscopy, a low-cost method with relatively low batch effects. It covers established clinical predictors, such as albumin and cholesterol, as well as a range of lipids, amino acids and carbohydrate-related metabolites.

The retinal optical coherence tomography–derived features were returned by researchers to the UKB 75 , 76 . They used the available scans and determined the macular volume, macular thickness, retinal pigment epithelium thickness, disc diameter, cup-to-disk ratio across different regions as well as the thickness between the inner nuclear layer and external limiting membrane, inner and outer photoreceptor segments and the retinal pigment epithelium across different regions. Furthermore, they determined a wide range of quality metrics for each scan, including the image quality score, minimum motion correlation and inner limiting membrane (ILM) indicator.

Data analysis

After exporting the data from the UKB, all timepoints were transformed into participant age entries. Only participants without prevalent myocardial infarction (relative to the first assessment center visit at which all data were collected) were included.

The data were pre-processed for the retinal imaging and metabolomics subsets separately to enable a clear analysis of missing data and to allow for k-nearest neighbors–based imputation (k = 20) of missing values when less than 10% were missing for a given participant. Otherwise, participants were dropped from the analyses. The imputed genotypes and Framingham features were available for almost every participant and were, therefore, not imputed; individuals lacking them were, instead, dropped from the analyses. Because genetic risk modeling poses entirely different methodological and computational challenges, we applied a published polygenic risk score for coronary heart disease using 6.6 million variants 73 . This was computed using the plink2 score option on the imputed genotypes available in the UKB.

UMAP embeddings were computed using default parameters on the full feature sets with ehrapy’s umap function. For all analyses, the same time-to-event and event-indicator columns were used. The event indicator is a Boolean variable indicating whether a myocardial infarction was observed for a study participant. If the event was observed, the time to event is defined as the timespan between the start of the study, in this case the date of the first assessment center visit, and the occurrence of the event. Otherwise, it is the timespan from the start of the study to the start of censoring; in this case, this is set to the last date for which EHRs were available, unless a participant died, in which case the date of death is the start of censoring. Kaplan–Meier curves and Cox proportional hazards models were fit using ehrapy’s survival analysis module and the lifelines 124 package’s CoxPHFitter class with default parameters. For Cox proportional hazards models with multiple feature sets, individually imputed and quality-controlled feature sets were concatenated, and the model was fit on the resulting matrix. Models were evaluated using the C-index 127 as a metric. It can be seen as an extension of the common area under the receiver operating characteristic curve to time-to-event datasets, in which events are not observed for every sample, and it ranges from 0.0 (entirely false) over 0.5 (random) to 1.0 (entirely correct). CIs for the C-index were computed by bootstrapping: sampling 1,000 times with replacement from all computed partial hazards and computing the C-index over each of these samples. The percentiles at 2.5% and 97.5% then give the lower and upper bounds of the 95% CIs.
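The bootstrapped CI for the C-index can be sketched as follows. This is a naive O(n²) concordance implementation on simulated stand-in data (lifelines and ehrapy use optimized versions), and the sample size and number of bootstrap draws are reduced for illustration:

```python
import numpy as np

def c_index(time, event, risk):
    """Naive O(n^2) concordance: fraction of comparable pairs whose
    predicted risk orders the observed event times correctly."""
    concordant, comparable = 0.0, 0
    n = len(time)
    for i in range(n):
        for j in range(n):
            # pair (i, j) is comparable if i has an observed event before j's time
            if event[i] and time[i] < time[j]:
                comparable += 1
                if risk[i] > risk[j]:
                    concordant += 1.0
                elif risk[i] == risk[j]:
                    concordant += 0.5
    return concordant / comparable

rng = np.random.default_rng(0)
n = 120
risk = rng.normal(size=n)                  # stand-in for computed partial hazards
time = rng.exponential(np.exp(-risk))      # higher risk -> shorter time to event
event = rng.random(n) < 0.7                # ~30% of observations censored

scores = []
for _ in range(100):                       # percentile bootstrap with replacement
    idx = rng.integers(0, n, n)
    scores.append(c_index(time[idx], event[idx], risk[idx]))
lo, hi = np.percentile(scores, [2.5, 97.5])  # bounds of the 95% CI
```

Resampling whole observations (time, event and risk together) preserves their dependence, which is what makes the percentile bootstrap valid here.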

In all UKB analyses, the unit of study for a statistical test or predictive model is always an individual study participant.

The generalizability of the analysis is limited, as the UK Biobank cohort may not represent the general population, with potential selection biases and underrepresentation of certain demographic groups. Additionally, by restricting the analysis to initial assessment data and censoring based on the last available EHR or the date of death, our analysis does not account for longitudinal changes and can introduce follow-up bias, especially if participants lost to follow-up have different risk profiles.

In-depth quality control of retina-derived features

A UMAP plot of the retina-derived features indicating the assessment centers shows a cluster of samples that lie somewhat outside the general population and mostly attended the Birmingham assessment center (Fig. 5b ). To further investigate this, we performed Leiden clustering of resolution 0.3 (Extended Data Fig. 9a ) and isolated this group in cluster 5. When comparing cluster 5 to the rest of the population in the retina-derived feature space, we noticed that many individuals in cluster 5 showed overall retinal pigment epithelium (RPE) thickness measures substantially elevated over the rest of the population in both eyes (Extended Data Fig. 9b ), which is mostly a feature of this cluster (Extended Data Fig. 9c ). To investigate potential confounding, we computed ratios between cluster 5 and the rest of the population over the ‘obs’ DataFrame containing the Framingham features, diabetes-related phecodes and genetic principal components. Out of the top and bottom five highest ratios observed, six are in genetic principal components, which are commonly used to represent genetic ancestry in a continuous space (Extended Data Fig. 9d ). Additionally, diagnoses for type 1 and type 2 diabetes and antihypertensive use are enriched in cluster 5. Further investigating the ancestry, we computed log ratios for self-reported ancestries and absolute counts, which showed no robust enrichment and depletion effects.

A closer look at three quality control measures of the imaging pipeline revealed that cluster 5 was an outlier in terms of either image quality (Extended Data Fig. 9e ) or minimum motion correlation (Extended Data Fig. 9f ) and the ILM indicator (Extended Data Fig. 9g ), all of which can be indicative of artifacts in image acquisition and downstream processing 128 . Subsequently, we excluded 301 individuals from cluster 5 from all analyses.

COVID-19 chest-x-ray fate determination

Dataset overview.

We used the public BrixIA COVID-19 dataset, which contains 192 chest x-ray images annotated with BrixIA scores 82 . Six lung regions were each annotated with a disease severity score ranging from 0 to 3 by a senior radiologist with more than 20 years of experience and a junior radiologist. A global score was determined as the sum of these six regional scores and, therefore, ranges from 0 to 18 (S-Global). S-Global scores of 0 were classified as normal. Images that only had severity values up to 1 in all six regions were classified as mild. Images with severity values greater than or equal to 2, but an S-Global score of less than 7, were classified as moderate. All images that contained at least one 3 in any of the six regions with an S-Global score between 7 and 10 were classified as severe, and all remaining images with S-Global scores greater than 10 and at least one 3 were labeled critical. The dataset and instructions to download the images can be found at https://github.com/ieee8023/covid-chestxray-dataset .
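The labeling rules above can be written out directly as a small helper. The function below is a hypothetical sketch, not part of the published analysis code, and score combinations the text does not cover are returned as 'unclassified':

```python
def brixia_severity(region_scores):
    """Map six BrixIA region scores (each 0-3) to a severity class
    following the rules described in the text."""
    assert len(region_scores) == 6
    s_global = sum(region_scores)                      # ranges 0-18
    if s_global == 0:
        return "normal"
    if all(s <= 1 for s in region_scores):
        return "mild"
    if s_global < 7:                # at least one score >= 2 is guaranteed here
        return "moderate"
    if 3 in region_scores and 7 <= s_global <= 10:
        return "severe"
    if 3 in region_scores and s_global > 10:
        return "critical"
    return "unclassified"           # combinations the text does not specify

brixia_severity([3, 2, 1, 1, 0, 0])   # an S-Global of 7 with a 3: "severe"
```

Note that the stated rules leave a gap (for example, S-Global of 7 or more without any region scoring 3), which the helper makes explicit rather than silently resolving.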

We first resized all images to 224 × 224. Afterwards, the images underwent a random affine transformation that involved rotation, translation and scaling. The rotation angle was randomly selected from a range of −45° to 45°. The images were also subject to horizontal and vertical translation, with the maximum translation being 15% of the image size in either direction. Additionally, the images were scaled by a factor ranging from 0.85 to 1.15. The purpose of applying these transformations was to enhance the dataset and introduce variations, ultimately improving the robustness and generalization of the model.

To generate embeddings, we used a pre-trained DenseNet model with the weights densenet121-res224-all of TorchXRayVision 129 . A DenseNet is a convolutional neural network that makes use of dense connections between layers (Dense Blocks), where all layers with matching feature map sizes directly connect with each other. To maintain a feed-forward nature, every layer in the DenseNet architecture receives supplementary inputs from all preceding layers and transmits its own feature maps to all subsequent layers. The model was trained on the nih-pc-chex-mimic_ch-google-openi-rsna dataset 130 .

Next, we calculated 50 principal components on the feature representation of the DenseNet model of all images using ehrapy’s pca function. The principal component representation served as input for a nearest neighbors graph calculation using ehrapy’s neighbors function. This graph served as the basis for the calculation of a UMAP embedding with three components that was finally visualized using ehrapy.

We randomly picked a root in the group of images that was labeled ‘Normal’. First, we calculated so-called pseudotime by fitting a trajectory through the calculated UMAP space using diffusion maps as implemented in ehrapy’s dpt function 57 . Each image’s pseudotime value represents its estimated position along this trajectory, serving as a proxy for its severity stage relative to others in the dataset. To determine fates, we employed CellRank 58 , 59 with the PseudotimeKernel. This kernel computes transition probabilities for patient visits based on the connectivity of the k-nearest neighbors graph and the pseudotime values of patient visits, which resembles their progression through a process. Directionality is infused into the nearest neighbors graph in this process, where the kernel either removes or downweights edges that contradict the directional flow of increasing pseudotime, thereby refining the graph to better reflect the developmental trajectory. We computed the transition matrix with a soft threshold scheme (a parameter of the PseudotimeKernel), which downweights edges that point against the direction of increasing pseudotime. Finally, we calculated a projection on top of the UMAP embedding with CellRank using the plot_projection function of the PseudotimeKernel, which we subsequently plotted.

This analysis is limited by the small dataset of 192 chest x-ray images, which may affect the model’s generalizability and robustness. Annotation subjectivity from radiologists can further introduce variability in severity scores. Additionally, the random selection of a root from ‘Normal’ images can introduce bias in pseudotime calculations and subsequent analyses.

Diabetes 130-US hospitals analysis

We used data from the Diabetes 130-US hospitals dataset that were collected between 1999 and 2008. It contains clinical care information at 130 hospitals and integrated delivery networks. The extracted database information pertains to hospital admissions specifically for patients diagnosed with diabetes. These encounters required a hospital stay ranging from 1 d to 14 d, during which both laboratory tests and medications were administered. The selection criteria focused exclusively on inpatient encounters with these defined characteristics. More specifically, we used a version that was curated by the Fairlearn team where the target variable ‘readmitted’ was binarized and a few features renamed or binned ( https://fairlearn.org/main/user_guide/datasets/diabetes_hospital_data.html ). The dataset contains 101,877 patient visits and 25 features. The dataset predominantly consists of White patients (74.8%), followed by African Americans (18.9%), with other racial groups, such as Hispanic, Asian and Unknown categories, comprising smaller percentages. Females make up a slight majority in the data at 53.8%, with males accounting for 46.2% and a negligible number of entries listed as unknown or invalid. A substantial majority of the patients are over 60 years of age (67.4%), whereas those aged 30–60 years represent 30.2%, and those 30 years or younger constitute just 2.5%.

All of the following descriptions start by loading the Fairlearn version of the Diabetes 130-US hospitals dataset using ehrapy’s dataloader as an AnnData object.

Selection and filtering bias

An overview of sensitive variables was generated using tableone. Subsequently, ehrapy’s CohortTracker was used to track the age, gender and race variables. The cohort was filtered for all Medicare recipients and subsequently plotted.

Surveillance bias

We plotted the HbA1c measurement ratios using ehrapy’s catplot .

Missing data and imputation bias

MCAR-type missing data for the number of medications variable (‘num_medications’) were introduced by randomly setting 30% of the values to missing using Numpy’s choice function. We tested that the data are MCAR by applying ehrapy’s implementation of Little’s MCAR test, which returned a non-significant P value of 0.71. MAR data for the number of medications variable (‘num_medications’) were introduced by scaling the ‘time_in_hospital’ variable to have a mean of 0 and a standard deviation of 1, adjusting these values by multiplying by 1.2 and subtracting 0.6 to influence the overall missingness rate, and then using these values to generate MAR data in the ‘num_medications’ variable via a logistic transformation and binomial sampling. We verified that the newly introduced missing values are not MCAR with respect to the ‘time_in_hospital’ variable by applying ehrapy’s implementation of Little’s test, which was significant (P = 0.01 × 10−2). The missing data were imputed using ehrapy’s mean imputation and MissForest implementation.
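The MAR mechanism described above can be sketched in numpy. The coefficients 1.2 and -0.6 follow the text; the simulated data themselves (hospital stays of 1-14 d, a toy medications column) are invented stand-ins for the real variables:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 20000

time_in_hospital = rng.integers(1, 15, n).astype(float)   # toy stand-in, 1-14 d

# standardize, then scale by 1.2 and shift by -0.6 to steer the missingness rate
z = (time_in_hospital - time_in_hospital.mean()) / time_in_hospital.std()
p_missing = 1.0 / (1.0 + np.exp(-(1.2 * z - 0.6)))        # logistic transformation

num_medications = rng.normal(20, 5, n)                    # toy target variable
mask = rng.binomial(1, p_missing).astype(bool)            # binomial sampling
num_medications[mask] = np.nan                            # MAR w.r.t. time_in_hospital
```

Because the missingness probability rises with the (observed) length of stay, the mechanism is MAR rather than MCAR, which is exactly what Little's test detects.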

Algorithmic bias

Variables ‘race’, ‘gender’, ‘age’, ‘readmitted’, ‘readmit_binary’ and ‘discharge_disposition_id’ were moved to the ‘obs’ slot of the AnnData object to ensure that they were not used for model training. We built a binary label ‘readmit_30_days’ indicating whether a patient had been readmitted in fewer than 30 d. Next, we combined the ‘Asian’ and ‘Hispanic’ categories into a single ‘Other’ category within the ‘race’ column of our AnnData object, then filtered out and discarded any samples labeled as ‘Unknown/Invalid’ under the ‘gender’ column and subsequently moved the ‘gender’ data to the variable matrix X of the AnnData object. All categorical variables were encoded. The data were split into train and test groups with a test size of 50%. The data were scaled, and a logistic regression model was trained using scikit-learn, which was also used to determine the balanced accuracy score. Fairlearn’s MetricFrame function was used to inspect the target model performance against the sensitive variable ‘race’. We subsequently fit Fairlearn’s ThresholdOptimizer using the logistic regression estimator with balanced_accuracy_score as the objective. The algorithmic demonstration of Fairlearn’s abilities on this dataset is shown here: https://github.com/fairlearn/talks/tree/main/2021_scipy_tutorial .

Normalization bias

We one-hot encoded all categorical variables with ehrapy using the encode function. We then applied ehrapy’s scaling normalization (scale_norm function) with and without the ‘Age group’ variable as group key, scaling the data separately per group and jointly, respectively.
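The effect of a group key on scaling can be illustrated with pandas; the synthetic ‘age_group’ data below are an assumption for illustration, and the actual analysis used ehrapy's scale_norm:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical two age groups whose feature has different baselines.
df = pd.DataFrame({
    "age_group": ["young"] * 50 + ["old"] * 50,
    "value": np.concatenate([rng.normal(0, 1, 50), rng.normal(5, 1, 50)]),
})

# Joint scaling: one mean/std for everyone preserves the group offset.
joint = (df["value"] - df["value"].mean()) / df["value"].std()

# Group-wise scaling: each group is centered separately,
# which removes the between-group difference.
grouped = df.groupby("age_group")["value"].transform(
    lambda v: (v - v.mean()) / v.std()
)
```

Joint scaling keeps the groups distinguishable, whereas per-group scaling centers each group at zero and can therefore mask real distributional differences, which is the normalization bias examined here.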

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.

Data availability

PhysioNet provides access to the PIC database 43 at https://physionet.org/content/picdb/1.1.0 for credentialed users. The BrixIA images 82 are available at https://github.com/BrixIA/Brixia-score-COVID-19 . The data used in this study were obtained from the UK Biobank 44 ( https://www.ukbiobank.ac.uk/ ). Access to the UK Biobank resource was granted under application number 49966. The data are available to researchers upon application to the UK Biobank in accordance with their data access policies and procedures. The Diabetes 130-US Hospitals dataset is available at https://archive.ics.uci.edu/dataset/296/diabetes+130-us+hospitals+for+years+1999-2008 .

Code availability

The ehrapy source code is available at https://github.com/theislab/ehrapy under an Apache 2.0 license. Further documentation, tutorials and examples are available at https://ehrapy.readthedocs.io . We are actively developing the software and invite contributions from the community.

Jupyter notebooks to reproduce our analysis and figures, including Conda environments that specify all versions, are available at https://github.com/theislab/ehrapy-reproducibility .

Goldberger, A. L. et al. PhysioBank, PhysioToolkit, and PhysioNet: components of a new research resource for complex physiologic signals. Circulation 101 , E215–E220 (2000).

Atasoy, H., Greenwood, B. N. & McCullough, J. S. The digitization of patient care: a review of the effects of electronic health records on health care quality and utilization. Annu. Rev. Public Health 40 , 487–500 (2019).

Jamoom, E. W., Patel, V., Furukawa, M. F. & King, J. EHR adopters vs. non-adopters: impacts of, barriers to, and federal initiatives for EHR adoption. Health (Amst.) 2 , 33–39 (2014).

Rajkomar, A. et al. Scalable and accurate deep learning with electronic health records. NPJ Digit. Med. 1 , 18 (2018).

Wolf, A. et al. Data resource profile: Clinical Practice Research Datalink (CPRD) Aurum. Int. J. Epidemiol. 48 , 1740–1740g (2019).

Sudlow, C. et al. UK biobank: an open access resource for identifying the causes of a wide range of complex diseases of middle and old age. PLoS Med. 12 , e1001779 (2015).

Pollard, T. J. et al. The eICU Collaborative Research Database, a freely available multi-center database for critical care research. Sci. Data 5 , 180178 (2018).

Johnson, A. E. W. et al. MIMIC-III, a freely accessible critical care database. Sci. Data 3 , 160035 (2016).

Hyland, S. L. et al. Early prediction of circulatory failure in the intensive care unit using machine learning. Nat. Med. 26 , 364–373 (2020).

Rasmy, L. et al. Recurrent neural network models (CovRNN) for predicting outcomes of patients with COVID-19 on admission to hospital: model development and validation using electronic health record data. Lancet Digit. Health 4 , e415–e425 (2022).

Marcus, J. L. et al. Use of electronic health record data and machine learning to identify candidates for HIV pre-exposure prophylaxis: a modelling study. Lancet HIV 6 , e688–e695 (2019).

Kruse, C. S., Stein, A., Thomas, H. & Kaur, H. The use of electronic health records to support population health: a systematic review of the literature. J. Med. Syst. 42 , 214 (2018).

Sheikh, A., Jha, A., Cresswell, K., Greaves, F. & Bates, D. W. Adoption of electronic health records in UK hospitals: lessons from the USA. Lancet 384 , 8–9 (2014).

Sheikh, A. et al. Health information technology and digital innovation for national learning health and care systems. Lancet Digit. Health 3 , e383–e396 (2021).

McCord, K. A. & Hemkens, L. G. Using electronic health records for clinical trials: where do we stand and where can we go? Can. Med. Assoc. J. 191 , E128–E133 (2019).

Landi, I. et al. Deep representation learning of electronic health records to unlock patient stratification at scale. NPJ Digit. Med. 3 , 96 (2020).

Ayaz, M., Pasha, M. F., Alzahrani, M. Y., Budiarto, R. & Stiawan, D. The Fast Health Interoperability Resources (FHIR) standard: systematic literature review of implementations, applications, challenges and opportunities. JMIR Med. Inform. 9 , e21929 (2021).

Peskoe, S. B. et al. Adjusting for selection bias due to missing data in electronic health records-based research. Stat. Methods Med. Res. 30 , 2221–2238 (2021).

Haneuse, S. & Daniels, M. A general framework for considering selection bias in EHR-based studies: what data are observed and why? EGEMS (Wash. DC) 4 , 1203 (2016).

Gallifant, J. et al. Disparity dashboards: an evaluation of the literature and framework for health equity improvement. Lancet Digit. Health 5 , e831–e839 (2023).

Sauer, C. M. et al. Leveraging electronic health records for data science: common pitfalls and how to avoid them. Lancet Digit. Health 4 , e893–e898 (2022).

Li, J. et al. Imputation of missing values for electronic health record laboratory data. NPJ Digit. Med. 4 , 147 (2021).

Rubin, D. B. Inference and missing data. Biometrika 63 , 581 (1976).

Scheid, L. M., Brown, L. S., Clark, C. & Rosenfeld, C. R. Data electronically extracted from the electronic health record require validation. J. Perinatol. 39 , 468–474 (2019).

Phelan, M., Bhavsar, N. A. & Goldstein, B. A. Illustrating informed presence bias in electronic health records data: how patient interactions with a health system can impact inference. EGEMS (Wash. DC). 5 , 22 (2017).

Secondary Analysis of Electronic Health Records (ed MIT Critical Data) (Springer, 2016).

Jetley, G. & Zhang, H. Electronic health records in IS research: quality issues, essential thresholds and remedial actions. Decis. Support Syst. 126 , 113137 (2019).

McCormack, J. P. & Holmes, D. T. Your results may vary: the imprecision of medical measurements. BMJ 368 , m149 (2020).

Hobbs, F. D. et al. Is the international normalised ratio (INR) reliable? A trial of comparative measurements in hospital laboratory and primary care settings. J. Clin. Pathol. 52 , 494–497 (1999).

Huguet, N. et al. Using electronic health records in longitudinal studies: estimating patient attrition. Med. Care 58 Suppl 6 Suppl 1 , S46–S52 (2020).

Zeng, J., Gensheimer, M. F., Rubin, D. L., Athey, S. & Shachter, R. D. Uncovering interpretable potential confounders in electronic medical records. Nat. Commun. 13 , 1014 (2022).

Getzen, E., Ungar, L., Mowery, D., Jiang, X. & Long, Q. Mining for equitable health: assessing the impact of missing data in electronic health records. J. Biomed. Inform. 139 , 104269 (2023).

Tang, S. et al. Democratizing EHR analyses with FIDDLE: a flexible data-driven preprocessing pipeline for structured clinical data. J. Am. Med. Inform. Assoc. 27 , 1921–1934 (2020).

Dagliati, A. et al. A process mining pipeline to characterize COVID-19 patients’ trajectories and identify relevant temporal phenotypes from EHR data. Front. Public Health 10 , 815674 (2022).

Sun, Y. & Zhou, Y.-H. A machine learning pipeline for mortality prediction in the ICU. Int. J. Digit. Health 2 , 3 (2022).

Mandyam, A., Yoo, E. C., Soules, J., Laudanski, K. & Engelhardt, B. E. COP-E-CAT: cleaning and organization pipeline for EHR computational and analytic tasks. In Proc. of the 12th ACM Conference on Bioinformatics, Computational Biology, and Health Informatics. https://doi.org/10.1145/3459930.3469536 (Association for Computing Machinery, 2021).

Gao, C. A. et al. A machine learning approach identifies unresolving secondary pneumonia as a contributor to mortality in patients with severe pneumonia, including COVID-19. J. Clin. Invest. 133 , e170682 (2023).

Makam, A. N. et al. The good, the bad and the early adopters: providers’ attitudes about a common, commercial EHR. J. Eval. Clin. Pract. 20 , 36–42 (2014).

Amezquita, R. A. et al. Orchestrating single-cell analysis with Bioconductor. Nat. Methods 17 , 137–145 (2020).

Virshup, I. et al. The scverse project provides a computational ecosystem for single-cell omics data analysis. Nat. Biotechnol. 41 , 604–606 (2023).

Zou, Q. et al. Predicting diabetes mellitus with machine learning techniques. Front. Genet. 9 , 515 (2018).

Cios, K. J. & William Moore, G. Uniqueness of medical data mining. Artif. Intell. Med. 26 , 1–24 (2002).

Zeng, X. et al. PIC, a paediatric-specific intensive care database. Sci. Data 7 , 14 (2020).

Bycroft, C. et al. The UK Biobank resource with deep phenotyping and genomic data. Nature 562 , 203–209 (2018).

Lee, J. et al. Open-access MIMIC-II database for intensive care research. Annu. Int. Conf. Proc. IEEE Eng. Med. Biol. Soc. 2011 , 8315–8318 (2011).

Virshup, I., Rybakov, S., Theis, F. J., Angerer, P. & Wolf, F. A. anndata: annotated data. Preprint at bioRxiv https://doi.org/10.1101/2021.12.16.473007 (2021).

Voss, E. A. et al. Feasibility and utility of applications of the common data model to multiple, disparate observational health databases. J. Am. Med. Inform. Assoc. 22 , 553–564 (2015).

Vasilevsky, N. A. et al. Mondo: unifying diseases for the world, by the world. Preprint at medRxiv https://doi.org/10.1101/2022.04.13.22273750 (2022).

Harrison, J. E., Weber, S., Jakob, R. & Chute, C. G. ICD-11: an international classification of diseases for the twenty-first century. BMC Med. Inform. Decis. Mak. 21 , 206 (2021).

Köhler, S. et al. Expansion of the Human Phenotype Ontology (HPO) knowledge base and resources. Nucleic Acids Res. 47 , D1018–D1027 (2019).

Wu, P. et al. Mapping ICD-10 and ICD-10-CM codes to phecodes: workflow development and initial evaluation. JMIR Med. Inform. 7 , e14325 (2019).

Wolf, F. A., Angerer, P. & Theis, F. J. SCANPY: large-scale single-cell gene expression data analysis. Genome Biol. 19 , 15 (2018).

Pedregosa, F. et al. Scikit-learn: machine learning in Python. J. Mach. Learn. Res . 12 , 2825–2830 (2011).

de Haan-Rietdijk, S., Kuppens, P. & Hamaker, E. L. What’s in a day? A guide to decomposing the variance in intensive longitudinal data. Front. Psychol. 7 , 891 (2016).

Pedersen, E. S. L., Danquah, I. H., Petersen, C. B. & Tolstrup, J. S. Intra-individual variability in day-to-day and month-to-month measurements of physical activity and sedentary behaviour at work and in leisure-time among Danish adults. BMC Public Health 16 , 1222 (2016).

Roffey, D. M., Byrne, N. M. & Hills, A. P. Day-to-day variance in measurement of resting metabolic rate using ventilated-hood and mouthpiece & nose-clip indirect calorimetry systems. JPEN J. Parenter. Enter. Nutr. 30 , 426–432 (2006).

Haghverdi, L., Büttner, M., Wolf, F. A., Buettner, F. & Theis, F. J. Diffusion pseudotime robustly reconstructs lineage branching. Nat. Methods 13 , 845–848 (2016).

Lange, M. et al. CellRank for directed single-cell fate mapping. Nat. Methods 19 , 159–170 (2022).

Weiler, P., Lange, M., Klein, M., Pe'er, D. & Theis, F. CellRank 2: unified fate mapping in multiview single-cell data. Nat. Methods 21 , 1196–1205 (2024).

Zhang, S. et al. Cost of management of severe pneumonia in young children: systematic analysis. J. Glob. Health 6 , 010408 (2016).

Torres, A. et al. Pneumonia. Nat. Rev. Dis. Prim. 7 , 25 (2021).

Traag, V. A., Waltman, L. & van Eck, N. J. From Louvain to Leiden: guaranteeing well-connected communities. Sci. Rep. 9 , 5233 (2019).

Kamin, W. et al. Liver involvement in acute respiratory infections in children and adolescents—results of a non-interventional study. Front. Pediatr. 10 , 840008 (2022).

Shi, T. et al. Risk factors for mortality from severe community-acquired pneumonia in hospitalized children transferred to the pediatric intensive care unit. Pediatr. Neonatol. 61 , 577–583 (2020).

Dudnyk, V. & Pasik, V. Liver dysfunction in children with community-acquired pneumonia: the role of infectious and inflammatory markers. J. Educ. Health Sport 11 , 169–181 (2021).

Charpignon, M.-L. et al. Causal inference in medical records and complementary systems pharmacology for metformin drug repurposing towards dementia. Nat. Commun. 13 , 7652 (2022).

Grief, S. N. & Loza, J. K. Guidelines for the evaluation and treatment of pneumonia. Prim. Care 45 , 485–503 (2018).

Paul, M. Corticosteroids for pneumonia. Cochrane Database Syst. Rev. 12 , CD007720 (2017).

Sharma, A. & Kiciman, E. DoWhy: an end-to-end library for causal inference. Preprint at arXiv https://doi.org/10.48550/ARXIV.2011.04216 (2020).

Khilnani, G. C. et al. Guidelines for antibiotic prescription in intensive care unit. Indian J. Crit. Care Med. 23 , S1–S63 (2019).

Harris, L. K. & Crannage, A. J. Corticosteroids in community-acquired pneumonia: a review of current literature. J. Pharm. Technol. 37 , 152–160 (2021).

Dou, L. et al. Decreased hospital length of stay with early administration of oseltamivir in patients hospitalized with influenza. Mayo Clin. Proc. Innov. Qual. Outcomes 4 , 176–182 (2020).

Khera, A. V. et al. Genome-wide polygenic scores for common diseases identify individuals with risk equivalent to monogenic mutations. Nat. Genet. 50 , 1219–1224 (2018).

Julkunen, H. et al. Atlas of plasma NMR biomarkers for health and disease in 118,461 individuals from the UK Biobank. Nat. Commun. 14 , 604 (2023).

Ko, F. et al. Associations with retinal pigment epithelium thickness measures in a large cohort: results from the UK Biobank. Ophthalmology 124 , 105–117 (2017).

Patel, P. J. et al. Spectral-domain optical coherence tomography imaging in 67 321 adults: associations with macular thickness in the UK Biobank study. Ophthalmology 123 , 829–840 (2016).

D’Agostino Sr, R. B. et al. General cardiovascular risk profile for use in primary care: the Framingham Heart Study. Circulation 117 , 743–753 (2008).

Buergel, T. et al. Metabolomic profiles predict individual multidisease outcomes. Nat. Med. 28 , 2309–2320 (2022).

Xu, Y. et al. An atlas of genetic scores to predict multi-omic traits. Nature 616 , 123–131 (2023).

Saelens, W., Cannoodt, R., Todorov, H. & Saeys, Y. A comparison of single-cell trajectory inference methods. Nat. Biotechnol. 37 , 547–554 (2019).

Rousan, L. A., Elobeid, E., Karrar, M. & Khader, Y. Chest x-ray findings and temporal lung changes in patients with COVID-19 pneumonia. BMC Pulm. Med. 20 , 245 (2020).

Signoroni, A. et al. BS-Net: learning COVID-19 pneumonia severity on a large chest X-ray dataset. Med. Image Anal. 71 , 102046 (2021).

Bird, S. et al. Fairlearn: a toolkit for assessing and improving fairness in AI. https://www.microsoft.com/en-us/research/publication/fairlearn-a-toolkit-for-assessing-and-improving-fairness-in-ai/ (2020).

Strack, B. et al. Impact of HbA1c measurement on hospital readmission rates: analysis of 70,000 clinical database patient records. BioMed. Res. Int. 2014 , 781670 (2014).

Stekhoven, D. J. & Bühlmann, P. MissForest—non-parametric missing value imputation for mixed-type data. Bioinformatics 28 , 112–118 (2012).

Banerjee, A. et al. Identifying subtypes of heart failure from three electronic health record sources with machine learning: an external, prognostic, and genetic validation study. Lancet Digit. Health 5 , e370–e379 (2023).

Nagamine, T. et al. Data-driven identification of heart failure disease states and progression pathways using electronic health records. Sci. Rep. 12 , 17871 (2022).

Da Silva Filho, J. et al. Disease trajectories in hospitalized COVID-19 patients are predicted by clinical and peripheral blood signatures representing distinct lung pathologies. Preprint at bioRxiv https://doi.org/10.1101/2023.09.08.23295024 (2023).

Haneuse, S., Arterburn, D. & Daniels, M. J. Assessing missing data assumptions in EHR-based studies: a complex and underappreciated task. JAMA Netw. Open 4 , e210184 (2021).

Little, R. J. A. A test of missing completely at random for multivariate data with missing values. J. Am. Stat. Assoc. 83 , 1198–1202 (1988).

Jakobsen, J. C., Gluud, C., Wetterslev, J. & Winkel, P. When and how should multiple imputation be used for handling missing data in randomised clinical trials—a practical guide with flowcharts. BMC Med. Res. Methodol. 17 , 162 (2017).

Dziura, J. D., Post, L. A., Zhao, Q., Fu, Z. & Peduzzi, P. Strategies for dealing with missing data in clinical trials: from design to analysis. Yale J. Biol. Med. 86 , 343–358 (2013).

White, I. R., Royston, P. & Wood, A. M. Multiple imputation using chained equations: issues and guidance for practice. Stat. Med. 30 , 377–399 (2011).

Jäger, S., Allhorn, A. & Bießmann, F. A benchmark for data imputation methods. Front. Big Data 4 , 693674 (2021).

Waljee, A. K. et al. Comparison of imputation methods for missing laboratory data in medicine. BMJ Open 3 , e002847 (2013).

Ibrahim, J. G. & Molenberghs, G. Missing data methods in longitudinal studies: a review. Test (Madr.) 18 , 1–43 (2009).

Li, C., Alsheikh, A. M., Robinson, K. A. & Lehmann, H. P. Use of recommended real-world methods for electronic health record data analysis has not improved over 10 years. Preprint at bioRxiv https://doi.org/10.1101/2023.06.21.23291706 (2023).

Regev, A. et al. The Human Cell Atlas. eLife 6 , e27041 (2017).

Megill, C. et al. cellxgene: a performant, scalable exploration platform for high dimensional sparse matrices. Preprint at bioRxiv https://doi.org/10.1101/2021.04.05.438318 (2021).

Speir, M. L. et al. UCSC Cell Browser: visualize your single-cell data. Bioinformatics 37 , 4578–4580 (2021).

Hunter, J. D. Matplotlib: a 2D graphics environment. Comput. Sci. Eng. 9 , 90–95 (2007).

Waskom, M. seaborn: statistical data visualization. J. Open Source Softw. 6 , 3021 (2021).

Harris, C. R. et al. Array programming with NumPy. Nature 585 , 357–362 (2020).

Lam, S. K., Pitrou, A. & Seibert, S. Numba: a LLVM-based Python JIT compiler. In Proc. of the Second Workshop on the LLVM Compiler Infrastructure in HPC. https://doi.org/10.1145/2833157.2833162 (Association for Computing Machinery, 2015).

Virtanen, P. et al. SciPy 1.0: fundamental algorithms for scientific computing in Python. Nat. Methods 17 , 261–272 (2020).

McKinney, W. Data structures for statistical computing in Python. In Proc. of the 9th Python in Science Conference (eds van der Walt, S. & Millman, J.). https://doi.org/10.25080/majora-92bf1922-00a (SciPy, 2010).

Boulanger, A. Open-source versus proprietary software: is one more reliable and secure than the other? IBM Syst. J. 44 , 239–248 (2005).

Rocklin, M. Dask: parallel computation with blocked algorithms and task scheduling. In Proc. of the 14th Python in Science Conference. https://doi.org/10.25080/majora-7b98e3ed-013 (SciPy, 2015).

Pivarski, J. et al. Awkward Array. https://doi.org/10.5281/ZENODO.4341376

Collette, A. Python and HDF5: Unlocking Scientific Data (O’Reilly Media, Inc., 2013).

Miles, A. et al. zarr-developers/zarr-python: v2.13.6. https://doi.org/10.5281/zenodo.7541518 (2023).

The pandas development team. pandas-dev/pandas: Pandas. https://doi.org/10.5281/ZENODO.3509134 (2024).

Weberpals, J. et al. Deep learning-based propensity scores for confounding control in comparative effectiveness research: a large-scale, real-world data study. Epidemiology 32 , 378–388 (2021).

Rosenthal, J. et al. Building tools for machine learning and artificial intelligence in cancer research: best practices and a case study with the PathML toolkit for computational pathology. Mol. Cancer Res. 20 , 202–206 (2022).

Gayoso, A. et al. A Python library for probabilistic analysis of single-cell omics data. Nat. Biotechnol. 40 , 163–166 (2022).

Paszke, A. et al. PyTorch: an imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems 32 (eds Wallach, H. et al.). 8024–8035 (Curran Associates, 2019).

Frostig, R., Johnson, M. & Leary, C. Compiling machine learning programs via high-level tracing. https://cs.stanford.edu/~rfrostig/pubs/jax-mlsys2018.pdf (2018).

Moor, M. et al. Foundation models for generalist medical artificial intelligence. Nature 616 , 259–265 (2023).

Kraljevic, Z. et al. Multi-domain clinical natural language processing with MedCAT: the Medical Concept Annotation Toolkit. Artif. Intell. Med. 117 , 102083 (2021).

Pollard, T. J., Johnson, A. E. W., Raffa, J. D. & Mark, R. G. An open source Python package for producing summary statistics for research papers. JAMIA Open 1 , 26–31 (2018).

Ellen, J. G. et al. Participant flow diagrams for health equity in AI. J. Biomed. Inform. 152 , 104631 (2024).

Schouten, R. M. & Vink, G. The dance of the mechanisms: how observed information influences the validity of missingness assumptions. Sociol. Methods Res. 50 , 1243–1258 (2021).

Johnson, W. E., Li, C. & Rabinovic, A. Adjusting batch effects in microarray expression data using empirical Bayes methods. Biostatistics 8 , 118–127 (2007).

Davidson-Pilon, C. lifelines: survival analysis in Python. J. Open Source Softw. 4 , 1317 (2019).

Benjamini, Y. & Hochberg, Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. J. R. Stat. Soc. Ser. B Stat. Methodol. 57 , 289–300 (1995).

Wishart, D. S. et al. DrugBank: a comprehensive resource for in silico drug discovery and exploration. Nucleic Acids Res. 34 , D668–D672 (2006).

Harrell, F. E. Jr, Califf, R. M., Pryor, D. B., Lee, K. L. & Rosati, R. A. Evaluating the yield of medical tests. JAMA 247 , 2543–2546 (1982).

Currant, H. et al. Genetic variation affects morphological retinal phenotypes extracted from UK Biobank optical coherence tomography images. PLoS Genet. 17 , e1009497 (2021).

Cohen, J. P. et al. TorchXRayVision: a library of chest X-ray datasets and models. In Proc. of the 5th International Conference on Medical Imaging with Deep Learning (eds Konukoglu, E. et al.). 172 , 231–249 (PMLR, 2022).

Cohen, J.P., Hashir, M., Brooks, R. & Bertrand, H. On the limits of cross-domain generalization in automated X-ray prediction. In Proceedings of Machine Learning Research , Vol. 121 (eds Arbel, T. et al.) 136–155 (PMLR, 2020).

Acknowledgements

We thank M. Ansari, who designed the ehrapy logo. The authors thank F. A. Wolf, M. Lücken, J. Steinfeldt, B. Wild, G. Rätsch and D. Shung for feedback on the project. We further thank L. Halle, Y. Ji, M. Lücken and R. K. Rubens for constructive comments on the paper. We thank F. Hashemi for her help in implementing the survival analysis module. This research was conducted using data from the UK Biobank, a major biomedical database ( https://www.ukbiobank.ac.uk ), under application number 49966. This work was supported by the German Center for Lung Research (DZL), the Helmholtz Association and the CRC/TRR 359 Perinatal Development of Immune Cell Topology (PILOT). N.H. and F.J.T. acknowledge support from the German Federal Ministry of Education and Research (BMBF) (LODE, 031L0210A), co-funded by the European Union (ERC, DeepCell, 101054957). A.N. is supported by the Konrad Zuse School of Excellence in Learning and Intelligent Systems (ELIZA) through the DAAD program Konrad Zuse Schools of Excellence in Artificial Intelligence, sponsored by the Federal Ministry of Education and Research. This work was also supported by the Chan Zuckerberg Initiative (CZIF2022-007488; Human Cell Atlas Data Ecosystem).

Open access funding provided by Helmholtz Zentrum München - Deutsches Forschungszentrum für Gesundheit und Umwelt (GmbH).

Author information

Authors and affiliations.

Institute of Computational Biology, Helmholtz Munich, Munich, Germany

Lukas Heumos, Philipp Ehmele, Tim Treis, Eljas Roellin, Lilly May, Altana Namsaraeva, Nastassya Horlava, Vladimir A. Shitov, Xinyue Zhang, Luke Zappia, Leon Hetzel, Isaac Virshup, Lisa Sikkema, Fabiola Curion & Fabian J. Theis

Institute of Lung Health and Immunity and Comprehensive Pneumology Center with the CPC-M bioArchive; Helmholtz Zentrum Munich; member of the German Center for Lung Research (DZL), Munich, Germany

Lukas Heumos, Niklas J. Lang, Herbert B. Schiller & Anne Hilgendorff

TUM School of Life Sciences Weihenstephan, Technical University of Munich, Munich, Germany

Lukas Heumos, Tim Treis, Nastassya Horlava, Vladimir A. Shitov, Lisa Sikkema & Fabian J. Theis

Health Data Science Unit, Heidelberg University and BioQuant, Heidelberg, Germany

Julius Upmeier zu Belzen & Roland Eils

Department of Mathematics, School of Computation, Information and Technology, Technical University of Munich, Munich, Germany

Eljas Roellin, Lilly May, Luke Zappia, Leon Hetzel, Fabiola Curion & Fabian J. Theis

Konrad Zuse School of Excellence in Learning and Intelligent Systems (ELIZA), Darmstadt, Germany

Altana Namsaraeva

Systems Medicine, Deutsches Zentrum für Neurodegenerative Erkrankungen (DZNE), Bonn, Germany

Rainer Knoll

Center for Digital Health, Berlin Institute of Health (BIH) at Charité – Universitätsmedizin Berlin, Berlin, Germany

Roland Eils

Research Unit, Precision Regenerative Medicine (PRM), Helmholtz Munich, Munich, Germany

Herbert B. Schiller

Center for Comprehensive Developmental Care (CDeCLMU) at the Social Pediatric Center, Dr. von Hauner Children’s Hospital, LMU Hospital, Ludwig Maximilian University, Munich, Germany

Anne Hilgendorff

Contributions

L. Heumos and F.J.T. conceived the study. L. Heumos, P.E., X.Z., E.R., L.M., A.N., L.Z., V.S., T.T., L. Hetzel, N.H., R.K. and I.V. implemented ehrapy. L. Heumos, P.E., N.L., L.S., T.T. and A.H. analyzed the PIC database. J.U.z.B. and L. Heumos analyzed the UK Biobank database. X.Z. and L. Heumos analyzed the COVID-19 chest x-ray dataset. L. Heumos, P.E. and J.U.z.B. wrote the paper. F.J.T., A.H., H.B.S. and R.E. supervised the work. All authors read, corrected and approved the final paper.

Corresponding author

Correspondence to Fabian J. Theis .

Ethics declarations

Competing interests.

L. Heumos is an employee of LaminLabs. F.J.T. consults for Immunai Inc., Singularity Bio B.V., CytoReason Ltd. and Omniscope Ltd. and has ownership interest in Dermagnostix GmbH and Cellarity. The remaining authors declare no competing interests.

Peer review

Peer review information.

Nature Medicine thanks Leo Anthony Celi and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. Primary handling editor: Lorenzo Righetto, in collaboration with the Nature Medicine team.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data

Extended Data Fig. 1 Overview of the Paediatric Intensive Care Database (PIC).

The database consists of several tables corresponding to several data modalities and measurement types. All tables colored in green were selected for analysis and all tables in blue were discarded based on coverage rate. Despite the high coverage rate, we discarded the ‘OR_EXAM_REPORTS’ table because of the lack of detail in the exam reports.

Extended Data Fig. 2 Preprocessing of the Paediatric Intensive Care (PIC) dataset with ehrapy.

( a ) Heterogeneous data of the PIC database was stored in ‘data’ (matrix that is used for computations) and ‘observations’ (metadata per patient visit). During quality control, further annotations are added to the ‘variables’ (metadata per feature) slot. ( b ) Preprocessing steps of the PIC dataset. ( c ) Example of the function calls in the data analysis pipeline that correspond to the preprocessing steps in (b) using ehrapy.

Extended Data Fig. 3 Missing data distribution for the ‘youths’ group of the PIC dataset.

The x-axis represents the percentage of missing values in each feature. The y-axis reflects the number of features in each bin with text labels representing the names of the individual features.

Extended Data Fig. 4 Patient selection during analysis of the PIC dataset.

Filtering for the pneumonia cohort of the youths filters out care units except for the general intensive care unit and the pediatric intensive care unit.

Extended Data Fig. 5 Feature rankings of stratified patient groups.

Scores reflect the z-score underlying the p-value per measurement for each group. Higher scores (above 0) reflect overrepresentation of the measurement compared to all other groups and vice versa. ( a ) By clinical chemistry. ( b ) By liver markers. ( c ) By medication type. ( d ) By infection markers.

Extended Data Fig. 6 Liver marker value progression for the ‘youths’ group and Kaplan-Meier curves.

( a ) Viral and severe pneumonia with co-infection groups display enriched gamma-glutamyl transferase levels in blood serum. ( b ) Aspartate transferase (AST) and Alanine transaminase (ALT) levels are enriched for severe pneumonia with co-infection during early ICU stay. ( c ) and ( d ) Kaplan-Meier curves for ALT and AST demonstrate lower survivability for children with measurements outside the norm.

Extended Data Fig. 7 Overview of medication categories used for causal inference.

( a ) Feature engineering process to group administered medications into medication categories using drugbank. ( b ) Number of medications per medication category. ( c ) Number of patients that received (dark blue) and did not receive specific medication categories (light blue).

Extended Data Fig. 8 UK-Biobank data overview and quality control across modalities.

( a ) UMAP plot of the metabolomics data demonstrating a clear gradient with respect to age at sampling, and ( b ) type 2 diabetes prevalence. ( c ) Analogously, the features derived from retinal imaging show a less pronounced age gradient, and ( d ) type 2 diabetes prevalence gradient. ( e ) Stratifying myocardial infarction risk by the type 2 diabetes comorbidity confirms vastly increased risk with a prior type 2 diabetes (T2D) diagnosis. Kaplan-Meier estimators with 95% confidence intervals are shown. ( f ) Similarly, the polygenic risk score for coronary heart disease used in this work substantially enriches myocardial infarction risk in its top 5%. Kaplan-Meier estimators with 95% confidence intervals are shown. ( g ) UMAP visualization of the metabolomics features colored by the assessment center shows no discernible biases. (A-G) n = 29,216.

Extended Data Fig. 9 UK-Biobank retina derived feature quality control.

( a ) Leiden clustering of the retina-derived feature space. ( b ) Comparison of ‘overall retinal pigment epithelium (RPE) thickness’ values between cluster 5 (n = 301) and the rest of the population (n = 28,915). ( c ) RPE thickness in the right eye outliers on the UMAP largely corresponds to cluster 5. ( d ) Log ratio of top and bottom 5 fields in obs dataframe between cluster 5 and the rest of the population. ( e ) Image quality of the optical coherence tomography scan as reported in the UKB. ( f ) Minimum motion correlation quality control indicator. ( g ) Inner limiting membrane (ILM) quality control indicator. (D-G) Data are shown for the right eye only; comparable results for the left eye are omitted. (A-G) n = 29,216.

Extended Data Fig. 10 Bias detection and mitigation study on the Diabetes 130-US hospitals dataset (n = 101,766 hospital visits, one patient can have multiple visits).

( a ) Filtering to the visits of Medicare recipients results in an increase of Caucasians. ( b ) Proportion of visits where HbA1c measurements are recorded, stratified by admission type. Adjusted P values were calculated with chi-squared tests and Bonferroni correction (adjusted P values: Emergency vs Referral 3.3E-131, Emergency vs Other 1.4E-101, Referral vs Other 1.6E-4). ( c ) Normalizing feature distributions jointly vs. separately can mask distribution differences. ( d ) Imputing the number of medications for visits. Onto the complete data (blue), MCAR (30% missing data) and MAR (38% missing data) missingness was introduced (orange), with the MAR mechanism depending on the time in hospital. Mean imputation (green) can reduce the variance of the distribution under MCAR and MAR mechanisms and bias the center of the distribution under an MAR mechanism. Multiple imputation methods, such as MissForest, can impute meaningfully even in MAR cases when given access to the variables involved in the MAR mechanism. Each boxplot represents the IQR of the data, with the horizontal line inside the box indicating the median value. The left and right bounds of the box represent the first and third quartiles, respectively. The whiskers extend to the minimum and maximum values within 1.5 times the IQR from the lower and upper quartiles, respectively. ( e ) Predicting early readmission within 30 days after release on a per-stay level. Balanced accuracy can mask differences in selection rate and false negative rate between sensitive groups.

Supplementary information

Supplementary Tables 1 and 2 and reporting summary.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ .


About this article

Cite this article

Heumos, L., Ehmele, P., Treis, T. et al. An open-source framework for end-to-end analysis of electronic health record data. Nat Med (2024). https://doi.org/10.1038/s41591-024-03214-0


Received : 11 December 2023

Accepted : 25 July 2024

Published : 12 September 2024

DOI : https://doi.org/10.1038/s41591-024-03214-0



IMAGES

  1. Example of analytical framework.

  2. Figure. Analytical Framework. Analytical Framework: Definitions Idea

  3. Analytical Framework.

  4. Analytical Framework In Qualitative Research

  5. An illustration of the analytical framework.

  6. 1: Analytical framework of the research

VIDEO

  1. What is a Theoretical Framework really? simple explanation

  2. Research Alignment

  3. Research Methodology Logic Part 6

  4. OUtCoMES: A New Framework to Guide Analytical Evidence-based Projects

  5. Introduction to the theoretical/ analytical framework

  6. Analytical framework for mapping the professional education ecosystem for out-of-field teachers

COMMENTS

  1. Writing theoretical frameworks, analytical frameworks and conceptual

An analytical framework is, the way I see it, a model that helps explain how a certain type of analysis will be conducted. For example, in this paper, Franks and Cleaver develop an analytical framework that includes scholarship on poverty measurement to help us understand how water governance and poverty are interrelated.

  2. Framework Analysis

    Framework Analysis. Definition: Framework Analysis is a qualitative research method that involves organizing and analyzing data using a predefined analytical framework. The analytical framework is a set of predetermined themes or categories that are derived from the research questions or objectives. The framework provides a structured approach ...

  3. What is a Theoretical Framework? How to Write It (with Examples)

    A theoretical framework guides the research process like a roadmap for the study, so you need to get this right. Theoretical framework 1,2 is the structure that supports and describes a theory. A theory is a set of interrelated concepts and definitions that present a systematic view of phenomena by describing the relationship among the variables for explaining these phenomena.

  4. Beginner's Guide to Analytical Frameworks

Real-world examples of analytical framework application. Tips for choosing the right framework for your needs. ... This could range from internal performance metrics to external market research. Apply the Framework: Methodically fill in the sections of the chosen framework with the collected data. This process often reveals insights that were ...

  5. Analytical Research: What is it, Importance + Examples

    For example, it can look into why the value of the Japanese Yen has decreased. This is so that an analytical study can consider "how" and "why" questions. Another example is that someone might conduct analytical research to identify a study's gap. It presents a fresh perspective on your data.

  6. Analytical Approach and Framework

    An analytical framework is a structure that helps us make sense of data in an organized way. ... There are a few qualitative research analytical frameworks we could use depending on the context of the business environment. ... Let's look go through each of those framework steps with a business example of an online media company that wants to ...

  7. Theoretical Framework

    Theoretical Framework. Definition: Theoretical framework refers to a set of concepts, theories, ideas, and assumptions that serve as a foundation for understanding a particular phenomenon or problem. It provides a conceptual framework that helps researchers to design and conduct their research, as well as to analyze and interpret their findings.

  8. PDF Chapter 3 Analytical Framework and Research Methodology 3.1

    In Chapter 1 (par. 1.2), it was stated that the general aim of this study is to revise and improve the African Language Translation Facilitation Course (ALTFC) presented at the Directorate Language Services (D Lang). More specifically, the aim of this study was divided into two main and two secondary aims.

  9. Theoretical Framework Example for a Thesis or Dissertation

    Theoretical Framework Example for a Thesis or Dissertation. Published on October 14, 2015 by Sarah Vinz. Revised on July 18, 2023 by Tegan George. Your theoretical framework defines the key concepts in your research, suggests relationships between them, and discusses relevant theories based on your literature review.

  10. Creating tables and diagrams to describe theoretical, conceptual, and

    Doctoral supervisors (and often, editors!) will ask you to create a conceptual, theoretical and/or analytical framework for your book, dissertation, chapter, or journal article. This is a good idea. I used to get confused by all the "framework"-associated terms, so I wrote THIS blog post:

  11. Using the framework method for the analysis of qualitative data in

    The Framework Method is becoming an increasingly popular approach to the management and analysis of qualitative data in health research. However, there is confusion about its potential application and limitations. The article discusses when it is appropriate to adopt the Framework Method and explains the procedure for using it in multi-disciplinary health research teams, or those that involve ...

  12. Using Framework Analysis in Applied Qualitative Research

    through participating in framework analysis research led by an experienced qualitative researcher (Gale et al., 2013) and through exposure to detailed examples of research using framework analysis. This paper is an example of the latter form of support.

  13. (PDF) Using the framework approach to analyse qualitative data: a

    Framework analysis is an approach to qualitative research that is increasingly used across multiple disciplines, including psychology, social policy, and nursing research. The stages of framework ...

  14. Full article: Developing an analytical framework for multiple

    Example from our own empirical research. ... To exemplify the use of the analytical framework suggested, we focus on respondents' perception and anticipation of task share in childcare before first-time parenthood, and any anticipation of changes over time as the child grows up. For illustrative purposes, we selected two couples with similar ...

  15. What Is a Theoretical Framework?

    A theoretical framework is a foundational review of existing theories that serves as a roadmap for developing the arguments you will use in your own work. Theories are developed by researchers to explain phenomena, draw connections, and make predictions. In a theoretical framework, you explain the existing theories that support your research ...

  16. Framework Analysis: Methods and Use Cases

    Thematic framework: Central to framework analysis is the development of a framework identifying key themes, concepts, and relationships in the data. The framework guides the subsequent stages of coding and charting. Flexibility: While it provides a clear structure, framework analysis is also adaptable. Depending on the objectives of the study ...

  17. Developing an analytical framework for your literature review

    So you have read tons of papers for your lit review. Now what? Watch this video to learn how to structure this information effectively. It looks at natural a...

  18. Using framework analysis methods for qualitative research: AMEE Guide

    A defining feature of FAMs is the development and application of a matrix-based analytical framework. These methods can be used across research paradigms and are thus particularly useful tools in the health professions education (HPE) researcher's toolbox. Despite their utility, FAMs are not frequently used in HPE research.

  19. Using Framework Analysis in nursing research: a worked example

    Framework Analysis is flexible, systematic, and rigorous, offering clarity, transparency, an audit trail, an option for theme-based and case-based analysis and for readily retrievable data. This paper offers further explanation of the process undertaken which is illustrated with a worked example. Data source and research design: Data were ...

  20. A Step-by-Step Process of Thematic Analysis to Develop a Conceptual

    Thematic analysis is a research method used to identify and interpret patterns or themes in a data set; it often leads to new insights and understanding (Boyatzis, 1998; Elliott, 2018; Thomas, 2006). However, it is critical that researchers avoid letting their own preconceptions interfere with the identification of key themes (Morse & Mitcham, 2002; Patton, 2015).

  21. Analytical studies: a framework for quality improvement design and

    An analytical study is one in which action will be taken on a cause system to improve the future performance of the system of interest. The aim of an enumerative study is estimation, while an analytical study focuses on prediction. Because of the temporal nature of improvement, the theory and methods for analytical studies are a critical ...

  22. Using the framework method for the analysis of qualitative data in

    The Framework Method has been developed and used successfully in research for over 25 years, and has recently become a popular analysis method in qualitative health research. The issue of how to assess quality in qualitative research has been highly debated [20, 34-40], but ensuring rigour and transparency in analysis is a vital component.

  23. How to Do Thematic Analysis

    When to use thematic analysis. Thematic analysis is a good approach to research where you're trying to find out something about people's views, opinions, knowledge, experiences or values from a set of qualitative data - for example, interview transcripts, social media profiles, or survey responses. Some types of research questions you might use thematic analysis to answer:

  24. An open-source framework for end-to-end analysis of electronic ...

    Here we introduce ehrapy, a modular open-source Python framework designed for exploratory analysis of heterogeneous epidemiology and EHR data. ehrapy incorporates a series of analytical steps ...