SquareShift Engineering Team

Evaluating Retrieval-Augmented Generation (RAG) Systems in Generative AI

Retrieval-Augmented Generation (RAG) systems represent a significant advancement in natural language processing, combining the strengths of information retrieval and text generation. However, evaluating these systems effectively is crucial to ensure they deliver accurate, relevant, and high-quality responses. In this article, we will explore various techniques for evaluating RAG systems, illustrated with anecdotal examples to highlight their practical applications.


Understanding RAG Systems


Before diving into evaluation techniques, it’s essential to understand what RAG systems do. These systems retrieve relevant documents from a knowledge base and then generate responses based on the retrieved content. The challenge lies in ensuring that both the retrieval and generation components work seamlessly together.


1. TRIAD Framework


The TRIAD framework provides a structured approach to evaluate RAG systems by focusing on three critical components: Context Relevance, Faithfulness (Groundedness), and Answer Relevance.


Context Relevance


Example: Imagine a RAG system designed to assist users with medical inquiries. When a user asks about the side effects of a specific medication, the system retrieves several documents from its database. Context relevance evaluates how many of these documents are actually pertinent to the user's query.

In practice, if the system retrieves five documents but only two are directly related to the medication's side effects, precision is only 0.4. Evaluators can use metrics such as precision and recall to quantify this aspect: if 70 of 100 retrieved documents are relevant, precision is 0.7 (70%).
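As a concrete illustration, here is a minimal sketch of how precision and recall could be computed for a single query, assuming relevance judgments (which retrieved documents are actually pertinent) are available; the document IDs are hypothetical.

```python
def precision_recall(retrieved_ids, relevant_ids):
    """Compute precision and recall for one query's retrieved set."""
    retrieved = set(retrieved_ids)
    relevant = set(relevant_ids)
    hits = retrieved & relevant  # relevant documents that were actually retrieved
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall

# Example above: 5 documents retrieved, only 2 pertain to the medication's side effects.
p, r = precision_recall(
    retrieved_ids=["doc1", "doc2", "doc3", "doc4", "doc5"],
    relevant_ids=["doc2", "doc4"],
)
print(f"precision={p:.2f}, recall={r:.2f}")  # precision=0.40, recall=1.00
```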


Controlling these metrics is also crucial in the context of LLMs: irrelevant documents needlessly fill the context window (the model's input token capacity), leaving less room for content that actually helps answer the query.


Faithfulness (Groundedness)


Example: Consider a scenario where a user queries about climate change impacts on polar bears. The RAG system retrieves several articles discussing climate change but generates a response that inaccurately states that polar bears are thriving due to increased tourism in their habitats.


Faithfulness evaluation checks whether the generated response is grounded in the retrieved documents. This can be assessed through human evaluations or automated fact-checking tools. If evaluators find that the response aligns with the content of the retrieved articles, it scores high on groundedness.
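One way to automate this check is to ask a judge LLM whether each sentence of the generated answer is supported by the retrieved passages. The sketch below is only illustrative: it assumes a hypothetical `call_llm(prompt)` helper that returns the judge model's text output, and it is not tied to any specific provider or fact-checking tool.

```python
def faithfulness_score(answer_sentences, retrieved_passages, call_llm):
    """Fraction of answer sentences the judge model deems supported by the retrieved context."""
    context = "\n\n".join(retrieved_passages)
    supported = 0
    for sentence in answer_sentences:
        prompt = (
            "Context:\n" + context + "\n\n"
            "Statement: " + sentence + "\n"
            "Is the statement fully supported by the context? Answer YES or NO."
        )
        verdict = call_llm(prompt)  # hypothetical judge-LLM call
        if verdict.strip().upper().startswith("YES"):
            supported += 1
    return supported / len(answer_sentences) if answer_sentences else 0.0
```

In the polar bear example, the claim about tourism would receive a NO verdict because nothing in the retrieved articles supports it, pulling the groundedness score down.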


Answer Relevance


Example: A user might ask, “What are the benefits of meditation?” If the RAG system generates a response that includes irrelevant information about yoga instead of focusing on meditation's benefits, it fails in answer relevance.


While human evaluation is one method for assessing answer relevance, there are automated alternatives that can reduce manual effort. For instance, embedding-based similarity measures can compare generated responses with reference answers using cosine similarity metrics. This method allows for a more scalable evaluation process by leveraging vector representations of text.
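For instance, using the sentence-transformers library (and assuming a reference answer is available and a small general-purpose checkpoint such as all-MiniLM-L6-v2 is acceptable), an answer-relevance score could be computed roughly as follows.

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # small general-purpose embedding model

generated = "Meditation can reduce stress, improve focus, and support emotional well-being."
reference = "Regular meditation lowers stress and improves concentration and emotional health."

embeddings = model.encode([generated, reference], convert_to_tensor=True)
similarity = util.cos_sim(embeddings[0], embeddings[1]).item()  # cosine similarity in [-1, 1]
print(f"answer relevance (cosine similarity): {similarity:.2f}")
```

A response that drifts into yoga rather than meditation would land noticeably lower on this scale than one that stays on topic.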


2. eRAG Evaluation Approach


The eRAG approach focuses on evaluating the retriever model in a RAG system.

It does so by scoring each document retrieved for a specific query according to how useful that document is during the generation process.


Example: Suppose a user queries about recent advancements in renewable energy technology. The RAG system retrieves ten articles but generates an answer based on only three of them. The eRAG approach would involve assessing how well each document contributes to generating an accurate response.


By using an LLM to generate outputs for each document independently and then evaluating these outputs based on downstream task metrics (like accuracy), developers can identify which documents are most useful for generating high-quality responses. This method has been shown to correlate better with actual performance than traditional end-to-end evaluations.


Example: 


Take the query: "Who discovered penicillin?"

The retrieval model returns a ranked list of documents:

  1. Document 1: "Alexander Fleming discovered penicillin in 1928."

  2. Document 2: "Penicillin was a breakthrough antibiotic..."

  3. Document 3: "Marie Curie won the Nobel Prize..."


A traditional way to evaluate the retriever would be to score the query-document link directly, for example with cosine similarity or an LLM judge. In practice, such scores correlate poorly with the actual end-to-end performance of the system. eRAG therefore proposes first evaluating the downstream performance of the entire RAG system on each retrieved document individually; this yields a real-world performance score for each document, which serves as its relevance label.


This can be a one-time job: each query now has a relevance score for every retrieved document, and together these scores form a dataset.


This dataset can be used to measure the retriever model’s ability to retrieve only relevant documents.
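The sketch below illustrates the idea under simplifying assumptions: a hypothetical `generate_answer(query, document)` function runs the RAG system's generator on one retrieved document at a time, and a simple containment check stands in for the downstream task metric.

```python
def erag_document_scores(query, retrieved_docs, ground_truth, generate_answer):
    """Score each retrieved document by the downstream quality of the answer it yields on its own."""
    scores = []
    for doc in retrieved_docs:
        answer = generate_answer(query, doc)  # hypothetical: generator conditioned on a single document
        # Downstream metric: a simple containment check stands in for accuracy / exact match.
        scores.append(1.0 if ground_truth.lower() in answer.lower() else 0.0)
    return scores

# Penicillin example: only the first document lets the generator answer correctly.
docs = [
    "Alexander Fleming discovered penicillin in 1928.",
    "Penicillin was a breakthrough antibiotic...",
    "Marie Curie won the Nobel Prize...",
]
# erag_document_scores("Who discovered penicillin?", docs, "Alexander Fleming", generate_answer)
# -> e.g. [1.0, 0.0, 0.0], per-document relevance labels for judging the retriever's ranking
```

The resulting per-document labels can then be aggregated with standard ranking metrics to score the retriever itself.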


3. Human-in-the-Loop Evaluation


Incorporating human feedback is invaluable for assessing RAG systems effectively.


Spot Checks


Example: A tech support chatbot powered by a RAG system might provide solutions based on retrieved manuals and forum posts. Subject-matter experts (SMEs) can perform spot checks by reviewing random interactions and assessing whether the solutions provided were accurate and helpful.

This qualitative feedback can guide further training and refinement of the model, ensuring it meets user needs effectively.


Online Experimentation


Example: Consider an e-commerce platform using a RAG system to answer customer queries about products. By conducting A/B testing—where one group interacts with an older version of the system while another uses an updated version—developers can gather data on user satisfaction and engagement metrics.


If users interacting with the updated version report higher satisfaction rates or spend more time engaging with responses, it indicates improved performance over previous iterations.
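To judge whether such a difference is more than noise, a simple significance test can be run on the two groups' satisfaction counts. The sketch below uses statsmodels' two-proportion z-test with made-up numbers.

```python
from statsmodels.stats.proportion import proportions_ztest

# Hypothetical A/B results: satisfied users out of total users per variant.
satisfied = [420, 465]   # variant A (old system), variant B (updated system)
totals = [1000, 1000]

stat, p_value = proportions_ztest(count=satisfied, nobs=totals)
print(f"z={stat:.2f}, p={p_value:.4f}")  # a small p-value suggests the difference is unlikely to be chance
```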


4. Domain-Specific RAG Techniques


Key Components of Domain-Specific RAG Evaluation


Domain-specific RAG evaluation involves several critical components:


  1. Rubric-Based Evaluation Metrics:


    • Rubrics: These are structured scoring systems that provide detailed descriptions for each score level, typically ranging from 1 to 5. They help evaluators assess model outputs based on relevance and accuracy with respect to domain-specific questions. For instance, a custom rubric might define scores based on how closely an answer matches a known ground truth or how relevant it is to the question asked (see the scoring sketch after this list).

    • Reference-Based and Reference-Free Approaches: Evaluations can be conducted using reference-based methods (comparing outputs to known correct answers) or reference-free methods (assessing quality without direct comparisons) to provide a comprehensive view of model performance.


  2. Scenario-Specific Datasets:


    • The creation of datasets tailored to specific domains is essential for effective evaluation. For example, the RAGEval framework generates scenario-specific datasets through a multi-stage process that includes schema summarization, document generation, question-reference-answer generation, keypoint extraction, and metric evaluation. This ensures that the evaluation captures the unique characteristics of the domain being assessed.
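As an illustration, a reference-based rubric evaluation might be wired up roughly as follows. The rubric wording, the 1-to-5 scale, and the judge helper are assumptions made for this sketch, not part of the RAGEval framework itself.

```python
RUBRIC = {
    5: "Fully correct and directly answers the domain question; matches the ground truth.",
    4: "Mostly correct, with minor omissions or imprecision.",
    3: "Partially correct; misses important details or mixes in irrelevant content.",
    2: "Largely incorrect or off-topic, but shows some connection to the question.",
    1: "Incorrect or irrelevant answer.",
}

def rubric_prompt(question, answer, ground_truth):
    """Build a judge prompt asking an LLM to score an answer against the rubric (1-5)."""
    levels = "\n".join(f"{score}: {desc}" for score, desc in sorted(RUBRIC.items(), reverse=True))
    return (
        f"Question: {question}\n"
        f"Ground truth: {ground_truth}\n"
        f"Candidate answer: {answer}\n\n"
        f"Score the candidate answer using this rubric:\n{levels}\n"
        "Respond with a single integer from 1 to 5."
    )

# score = int(call_llm(rubric_prompt(question, answer, ground_truth)))  # hypothetical judge-LLM call
```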


Specific Metrics for Evaluation


Retrieval Metrics


Precision and Recall are foundational metrics for evaluating how well a RAG system retrieves relevant documents.


Example: If a legal assistant AI retrieves 50 documents in response to a query about contract law but only 20 are relevant, its precision is 0.4 (40%). This insight helps developers understand retrieval effectiveness.
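Because retrievers return ranked lists, it is often more informative to compute these metrics at a cutoff k. A minimal sketch with hypothetical document IDs:

```python
def precision_recall_at_k(ranked_ids, relevant_ids, k):
    """Precision@k and recall@k for a single query's ranked retrieval results."""
    top_k = ranked_ids[:k]
    relevant = set(relevant_ids)
    hits = sum(1 for doc_id in top_k if doc_id in relevant)
    return hits / k, (hits / len(relevant) if relevant else 0.0)

# In the legal-assistant case, precision_recall_at_k(ranked, relevant, k=10) would show
# whether the 20 relevant documents actually appear near the top of the 50 retrieved.
```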


Response Metrics


Metrics like BLEU and ROUGE compare generated text against reference texts to measure quality but have limitations regarding context comparison.


  • BLEU focuses primarily on precision by measuring how many words in the generated output appear in reference texts.

  • ROUGE, particularly ROUGE-N (which includes ROUGE-1 and ROUGE-2), measures recall by assessing how many words from reference summaries appear in generated summaries.


While these metrics provide valuable insights into surface-level similarity, they do not inherently capture context or semantic meaning between sentences. To address this gap, embedding-based methods can be employed, where sentence embeddings from models like BERT or Sentence-BERT are compared using cosine similarity or other distance metrics to assess contextual relevance more effectively.
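As a rough illustration, ROUGE scores and an embedding-based similarity can be computed side by side. This assumes the rouge-score package and the same sentence-transformers checkpoint used earlier; BLEU could be added analogously with NLTK.

```python
from rouge_score import rouge_scorer
from sentence_transformers import SentenceTransformer, util

reference = "Meditation reduces stress and improves concentration."
generated = "Practicing meditation lowers stress levels and sharpens focus."

# Surface-level overlap: ROUGE-1 and ROUGE-2 F-scores against the reference text.
scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2"], use_stemmer=True)
rouge = scorer.score(reference, generated)
print(rouge["rouge1"].fmeasure, rouge["rouge2"].fmeasure)

# Semantic similarity: cosine similarity between sentence embeddings.
model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = model.encode([reference, generated], convert_to_tensor=True)
print(util.cos_sim(embeddings[0], embeddings[1]).item())
```

The ROUGE scores will typically be modest here because few words overlap, while the embedding similarity is likely to remain higher since the two sentences mean nearly the same thing, which is the gap this combination is meant to expose.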


Advanced Techniques


Re-ranking techniques can significantly enhance retrieval performance by re-scoring the initially retrieved documents with a model that considers the query and each document together, improving the system's understanding of context.


Example: After retrieving initial results for a query about travel destinations, re-ranking could prioritize documents that provide more recent or more comprehensive information based on user preferences or past interactions.

Embedding models can also improve retrieval accuracy by capturing semantic relationships between words better than traditional keyword-based methods.
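One common way to implement re-ranking is with a cross-encoder that scores each query-document pair jointly. The sketch below assumes the sentence-transformers CrossEncoder class and a public MS MARCO checkpoint; in a real system the candidates would come from the first-stage retriever.

```python
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

query = "best time to visit Japan for cherry blossoms"
candidates = [
    "Cherry blossom season in Japan typically peaks from late March to early April.",
    "Japan's rail network makes it easy to travel between major cities.",
    "Autumn foliage in Kyoto is most vivid in November.",
]

# Score each (query, document) pair jointly, then sort candidates by relevance score.
scores = reranker.predict([(query, doc) for doc in candidates])
reranked = [doc for _, doc in sorted(zip(scores, candidates), reverse=True)]
print(reranked[0])  # the cherry blossom passage should rank first
```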


Conclusion


Evaluating Retrieval-Augmented Generation systems requires a comprehensive approach that combines quantitative metrics with qualitative assessments and human feedback. By employing frameworks like TRIAD and eRAG alongside various retrieval and response metrics, developers can gain valuable insights into both retrieval effectiveness and response quality. Continuous refinement based on evaluation results is crucial for optimizing RAG system performance, ensuring they remain reliable tools for users seeking accurate information in an increasingly complex digital landscape.

