Evaluating the Accuracy and Choosing the Right Model for a RAG-based Knowledge Base

Hello Habr! My name is Tanya, and I'm a quality analyst on the Just AI knowledge base team. We're developing a Retrieval Augmented Generation (RAG) product for customer knowledge bases, along with turnkey knowledge base creation. A crucial part of our Proof of Concept (POC) for customers is assessing the quality and accuracy of system responses and selecting the optimal large language model (LLM) to achieve the desired performance. The higher the accuracy of the answers, the greater the user trust and the less manual work is required to find additional information. Ninety percent accuracy is a key requirement for most of our clients when choosing a knowledge base. This article details how we conduct accuracy assessments and choose the right model; given its length, I've included a table of contents for easier navigation.

Table of Contents

  1. Creating a Dataset for Quality Assessment
    • General Rules for Compiling Synthetic and Manual Datasets
    • Synthetic Datasets: Advantages and Disadvantages
    • Manual Datasets: Advantages and Disadvantages
    • Hybrid Approach: Combining Synthetic and Manual Data
    • Determining the Optimal Dataset Size
  2. Case Study: Knauf Knowledge Base
    • Project Overview and Initial Setup
    • Dataset Creation and Characteristics
    • The Testing Process: A Step-by-Step Guide
  3. Prompt Engineering for LLM Generation
    • Importance of Effective Prompting
    • Structure and Components of Our Prompt
    • Considerations for Consistency and Fairness in Evaluation
  4. Choosing an Independent LLM Judge
    • Reasons for Automated Evaluation
    • Selection of the Evaluation Model: GPT-4 vs. Claude
    • Addressing Bias and Ensuring Objective Assessment
  5. Metrics for Evaluating LLM Generation
    • Defining Key Metrics: Correctness, Faithfulness, Relevance, and Response Time
    • Importance of Clear and Reproducible Evaluation Criteria
  6. Assessment Scales and Results Analysis
    • Comparing 1-10 and 1-5 Scales
    • Visualizing Results: Box Plots and Column Charts
    • Interpreting Results and Identifying Trends
  7. Model Selection and Performance Comparison
    • Participating Models and Their Performance
    • Evaluating Models Across Metrics
    • Choosing the Best Model: GPT-4O-Mini
  8. Conclusion and Call to Action

1. Creating a Dataset for Quality Assessment

To assess the quality of RAG generation, we often utilize both synthetic and manual datasets. Let's examine each approach:

General Rules for Compiling Synthetic and Manual Datasets

Regardless of the dataset type, several principles should guide its creation:

  • Diversity of Subjects: Questions should cover a wide range of topics within the knowledge base.
  • Inclusion of Specialized Terminology: If the documentation contains abbreviations, acronyms, product names, or terms in other languages (e.g., German in the Knauf example), questions incorporating these elements are essential.
  • Incorporation of Errors: Include questions with misspellings and typos to mimic real-world user queries.
  • Question Complexity: The dataset should include simple, factual questions as well as more complex ones requiring descriptive, reasoning, or comparative answers.
  • Informal Language: Use questions that reflect the informal language often found in real user queries.
  • Multi-part Questions: Include questions with multiple parts or sub-questions to test the LLM's ability to handle complex inquiries.
  • Dialogue-Based Questions: Incorporate questions that mimic conversational interactions.
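
To make these rules concrete, here is a minimal sketch of how a dataset entry could be represented in code. The schema and field names are our illustration for this article, not a required or production format.

```python
from dataclasses import dataclass, field

@dataclass
class EvalQuestion:
    """One entry of a quality-assessment dataset (illustrative schema)."""
    question: str                  # user-style query, possibly with typos or informal language
    reference_answer: str          # the "gold" answer the RAG system should reproduce
    topic: str                     # subject area, to track coverage of the knowledge base
    complexity: str = "factual"    # e.g. "factual", "reasoning", "comparative", "multi-part"
    has_typos: bool = False        # whether the question deliberately contains misspellings
    terminology: list[str] = field(default_factory=list)  # product names, acronyms, foreign terms

# An example entry reflecting several of the rules above
example = EvalQuestion(
    question="whats the difference betwen KNAUF GOLDBAND and KNAUF PERLFIX?",
    reference_answer="A short reference answer taken from the documentation would go here.",
    topic="plasters and adhesives",
    complexity="comparative",
    has_typos=True,
    terminology=["KNAUF GOLDBAND", "KNAUF PERLFIX"],
)
```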

Synthetic Datasets: Advantages and Disadvantages

Synthetic datasets, consisting of question-answer pairs generated using LLMs, offer several advantages:

  • Ease and Speed of Creation: Generating large numbers of question-answer pairs is quick and straightforward.
  • Scalability: Almost limitless numbers of questions can be generated for each document.

However, synthetic datasets also have limitations:

  • Lack of Real-World Context: The questions tend to be overly precise and artificial, unlike real user queries. The high quality of answers to these questions doesn't guarantee that the system will perform well with authentic user questions. This "sterility" limits their effectiveness in capturing the nuances of real-world user interactions.
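
As an illustration of how such pairs are typically produced, the sketch below asks an LLM to generate question-answer pairs for one document through an OpenAI-compatible client. The prompt wording and model name are assumptions, not our production generator.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

GENERATION_PROMPT = """You are preparing test questions for a knowledge base.
Based on the document below, write {n} question-answer pairs.
Return JSON: [{{"question": "...", "answer": "..."}}, ...]

Document:
{document}
"""

def generate_pairs(document_text: str, n: int = 3, model: str = "gpt-4o-mini") -> str:
    """Ask an LLM to produce synthetic question-answer pairs for a single document."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": GENERATION_PROMPT.format(n=n, document=document_text)}],
        temperature=0.7,
    )
    return response.choices[0].message.content  # JSON string to parse and validate downstream
```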

Manual Datasets: Advantages and Disadvantages

Manual datasets, created by human annotators or provided by the client, offer a more realistic representation of user queries:

  • Real-World Relevance: Questions reflect the actual queries users ask, providing a strong indication of the system's performance in a real-world setting.

The drawbacks include:

  • Time and Resource Intensive: Creating, verifying, and updating a manual dataset is considerably more time-consuming and expensive than generating a synthetic one.
  • Potential Bias: The dataset may not capture the full range of user queries if the creators lack a thorough understanding of user needs.

Hybrid Approach: Combining Synthetic and Manual Data

Often, the most effective approach combines both synthetic and manual datasets. A synthetic dataset can be used for initial testing and parameter tuning, followed by a manual dataset to evaluate performance with more realistic user queries. This hybrid approach balances the speed and scalability of synthetic data with the realism of manual data.

Determining the Optimal Dataset Size

The ideal dataset size depends on several factors, including the number of documents in the knowledge base, the size and complexity of those documents, and the desired level of detail in the quality assessment. We generally recommend between 100-200 and 500 questions. For more thorough testing, consider 3-5 times the number of documents, or 2-3 questions per document or knowledge unit. The optimal number can vary, however: one client provided a dataset of 2,000 questions. The overall quality percentage was similar to that obtained from a 200-question sample, but the larger dataset allowed for a more precise evaluation of the areas where the system underperformed.
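
As a quick illustration of the rule of thumb, a small helper like the one below (our own sketch, not a product feature) gives the 2-3 questions-per-document range:

```python
def recommended_dataset_size(num_documents: int, per_doc: tuple[int, int] = (2, 3)) -> tuple[int, int]:
    """Rule-of-thumb range: 2-3 questions per document, with a floor of roughly 100 questions."""
    low = max(100, num_documents * per_doc[0])
    high = max(100, num_documents * per_doc[1])
    return low, high

# For a knowledge base of 125 documents (as in the Knauf case study below):
print(recommended_dataset_size(125))  # (250, 375)
```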

2. Case Study: Knauf Knowledge Base

To demonstrate LLM performance on our platform, we used a knowledge base and dataset from our client, Knauf.

Project Overview and Initial Setup

The project involved migrating a classic FAQ-based chatbot to a new RAG and LLM architecture. The initial knowledge base comprised 125 documents (491 MB, 2.9 million characters, or 400,000 words). A second iteration added 40 documents (140,000 characters) for 8 new products. Our initial quality test used a 320-question dataset, supplemented by a client-provided 456-question dataset covering products and systems.

Dataset Creation and Characteristics

The Knauf knowledge base uses specialized construction terminology and product names, often incorporating German words (e.g., KNAUF GOLDBAND, KNAUF FEGEN HID, KNAUF-SATENGIPS, KNAUF PERLFIX). The knowledge base aims to provide detailed information about material characteristics, applications, and relevant standards to assist customers in product selection.

The Testing Process: A Step-by-Step Guide

Our testing process followed these steps:

  1. LLM Selection: Choosing the LLM to generate responses.
  2. Query Processing: Submitting questions to the RAG system using the selected LLM and recording the responses.
  3. Quality Assessment: Evaluating the responses based on predefined metrics.
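
In code, the loop is roughly the sketch below; `query_rag` and `judge_response` stand in for the RAG query endpoint and the LLM judge described later, and their names and signatures are assumptions for illustration.

```python
import time

def run_evaluation(dataset, query_rag, judge_response):
    """Run every dataset question through the RAG system and collect judge scores.

    Assumptions: query_rag(question) returns (answer, retrieved_context);
    judge_response(question, reference, answer, context) returns a dict of metric scores.
    """
    results = []
    for item in dataset:
        started = time.monotonic()
        answer, context = query_rag(item.question)          # step 2: query processing
        elapsed = time.monotonic() - started

        scores = judge_response(item.question, item.reference_answer, answer, context)  # step 3
        scores["response_time_s"] = round(elapsed, 2)
        results.append({"question": item.question, "answer": answer, **scores})
    return results
```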

3. Prompt Engineering for LLM Generation

Effective prompting significantly impacts response quality. Different models often require different prompting strategies to achieve optimal results. To ensure a fair comparison, we used a single, consistent prompt across all models.

Importance of Effective Prompting

A well-crafted prompt guides the LLM towards generating accurate, relevant, and contextually appropriate responses. Poorly designed prompts can lead to inaccurate or irrelevant answers.

Structure and Components of Our Prompt

Our prompt included several components:

  • Instructions: Clear instructions on how to generate the response.
  • Client Requirements: Specifications about communication style and desired information.
  • Contextual Information: Additional tips and comments, such as details about product assortment.
  • Channel-Specific Instructions: Guidelines related to the knowledge base's publication channel (e.g., indicating the lack of formatting due to channel limitations).
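
The production prompt is client-specific, but its skeleton could look roughly like this (the wording below is illustrative, not the prompt we actually ship):

```python
SYSTEM_PROMPT = """You are a support assistant answering questions from the company knowledge base.

Instructions:
- Answer strictly from the provided context; if the answer is not there, say you don't know.

Client requirements:
- Keep a polite, professional tone and answer in the user's language.

Contextual information:
- The assortment covers dry mixes, plasters, and finishing products; mention alternatives when relevant.

Channel-specific instructions:
- The publication channel does not support formatting, so reply in plain text without lists or markdown.

Context:
{context}
"""
```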

Considerations for Consistency and Fairness in Evaluation

Using a single prompt ensures a fair comparison by eliminating variations in prompt design as a source of performance differences between models. However, this approach assumes that all models respond equally well to the chosen prompt, which may not always be true. Differences in model sensitivity to prompt wording or length could influence the perceived quality of responses. Further assessments with different prompts and datasets may be necessary for a more comprehensive evaluation.

4. Choosing an Independent LLM Judge

To evaluate response quality, we used an independent LLM as a judge. The judge received the question, the reference answer, and extracted context, and provided a quality assessment based on predefined metrics.

Reasons for Automated Evaluation

Automated evaluation offers significant advantages over human assessment:

  • Speed and Efficiency: LLMs can quickly process large datasets and provide assessments in a fraction of the time it would take human annotators.
  • Consistency: LLMs apply the same criteria to all examples, ensuring consistent evaluation.
  • Handling Specialized Terminology: For knowledge bases with complex terminology, an LLM judge can more effectively assess the accuracy of responses than a human unfamiliar with the specific domain.

Selection of the Evaluation Model: GPT-4 vs. Claude

We primarily used GPT-4 as the evaluation model, except when assessing the OpenAI models themselves (GPT-4O and GPT-4O-Mini). To avoid bias, we used Claude-Sonnet-3.7 to evaluate the OpenAI family of models, as Claude is considered more performant in current benchmarks.

Addressing Bias and Ensuring Objective Assessment

Using an independent model for evaluation helps mitigate bias and improve objectivity. Self-evaluation is avoided to prevent the judge from favoring its own outputs.

5. Metrics for Evaluating LLM Generation

We used several key metrics to evaluate LLM-generated answers:

  • Answer Correctness: Accuracy of the answer compared to the reference answer, evaluating the transmission of main facts and the absence of distortions.
  • Faithfulness: How accurately the generated answer reflects information from the extracted documents. A high score indicates that the model does not fabricate information ("hallucinations").
  • Answer Relevance: How well the answer addresses the question, assessing the absence of irrelevant or extraneous information.
  • Response Time: The time required to process the request.
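
To make the criteria concrete, here is a sketch of how the three LLM-judged metrics can be encoded in a judge prompt (response time is measured by the test harness itself). The wording and JSON schema are our illustration; the 1-5 scale shown here is the one we settle on in the next section.

```python
JUDGE_PROMPT = """You are an impartial judge evaluating an answer produced by a knowledge-base assistant.

Question: {question}
Reference answer: {reference_answer}
Retrieved context: {context}
Generated answer: {answer}

Score the generated answer from 1 to 5 on each criterion:
- answer_correctness: the main facts match the reference answer, with no distortions;
- faithfulness: every claim is supported by the retrieved context (no hallucinations);
- answer_relevance: the answer addresses the question without extraneous information.

Return only JSON: {{"answer_correctness": int, "faithfulness": int, "answer_relevance": int}}
"""
```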

6. Assessment Scales and Results Analysis

To ensure reproducibility and consistent interpretation, we tested two scales: 1-10 and 1-5.

Comparing 1-10 and 1-5 Scales

We conducted three assessments on the same 100 question-answer pairs using both scales. The 1-10 scale showed minor variations in average scores between attempts (0.06), but greater variability in specific score ranges (6, 8, and 9). Interpreting the differences between close values (e.g., 6 vs. 7) proved challenging. The higher granularity increased the risk of the evaluator "blurring" its scores, selecting intermediate values that don't accurately reflect quality.

The 1-5 scale showed more pronounced differences between attempts, which suggested that the model interpreted each score level more clearly. The more evaluation options a scale offers (as with 1-10), the greater the chance of inconsistent interpretation and scoring. We therefore chose the 1-5 scale.
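
The between-attempt comparison can be reproduced with a few lines of pandas; the values below are placeholders standing in for the judge's scores.

```python
import pandas as pd

# Scores from three assessment attempts on the same question-answer pairs
# (placeholder values; substitute the real judge outputs).
scores = pd.DataFrame({
    "attempt_1": [8, 9, 6, 8],
    "attempt_2": [8, 9, 7, 8],
    "attempt_3": [9, 8, 6, 8],
})

attempt_means = scores.mean()                        # average score per attempt
spread = attempt_means.max() - attempt_means.min()   # difference between attempt averages (0.06 in our 1-10 run)
per_pair_std = scores.std(axis=1)                    # how much a single pair's score varies between attempts
print(spread, per_pair_std.mean())
```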

Visualizing Results: Box Plots and Column Charts

We used box plots to visualize the distribution of scores (min, max, median, quartiles) for each model and metric. Wider boxes indicate greater score variability. Column charts displayed the distribution of scores across the 1-5 scale, facilitating comparisons between models.
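
A minimal matplotlib sketch of both chart types, assuming a long-format pandas DataFrame `df` with `model`, `metric`, and `score` columns (column names of our choosing):

```python
import matplotlib.pyplot as plt

def plot_scores(df, metric="answer_correctness"):
    """Box plot of score distributions per model plus a column chart of 1-5 score counts."""
    data = df[df["metric"] == metric]
    models = sorted(data["model"].unique())

    fig, (ax_box, ax_bar) = plt.subplots(1, 2, figsize=(12, 4))

    # Box plot: min, quartiles, median, and max of the scores each model received
    ax_box.boxplot([data.loc[data["model"] == m, "score"] for m in models])
    ax_box.set_xticks(range(1, len(models) + 1), models, rotation=45, ha="right")
    ax_box.set_title(f"{metric}: score distribution per model")

    # Column chart: how many answers received each score on the 1-5 scale (all models combined)
    counts = data["score"].value_counts().reindex([1, 2, 3, 4, 5], fill_value=0)
    ax_bar.bar(counts.index, counts.values)
    ax_bar.set_title(f"{metric}: distribution of scores (1-5)")
    ax_bar.set_xlabel("score")
    ax_bar.set_ylabel("number of answers")

    plt.tight_layout()
    plt.show()
```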

Interpreting Results and Identifying Trends

Box plots helped assess the consistency of model performance. Narrow boxes indicated consistent scores, while wider boxes suggested less consistent results. Column charts provided a quick overview of the number of positive and negative assessments for each model.

7. Model Selection and Performance Comparison

Nineteen models from eight families participated in our assessment: OpenAI, Qwen, Llama, Claude, DeepSeek, GigaChat, YandexGPT, and T-Tech.

Participating Models and Their Performance

The models included GPT-4O-Mini, GPT-4O, and various Qwen, Llama, T-Tech, DeepSeek, Claude, GigaChat, and YandexGPT models. Integration was achieved using our MLOps platform, Caila.

Evaluating Models Across Metrics

We analyzed each model's performance across the four metrics (Correctness, Faithfulness, Relevance, and Response Time). Claude-Sonnet-3.7 consistently performed well, though it sometimes provided excessive detail. YandexGPT models often received low scores due to internal censorship.

Choosing the Best Model: GPT-4O-Mini

GPT-4O-Mini provided a good balance of performance across metrics and response time (under 10 seconds), making it the most cost-effective choice. Its quality was comparable to GPT-4O, but at a lower cost.

8. Conclusion and Call to Action

This article detailed our process of evaluating LLM performance for a RAG-based knowledge base. We employed a combination of synthetic and manual datasets, carefully designed prompts, and an independent LLM judge to ensure objectivity and consistency. The choice of GPT-4O-Mini as the optimal model highlighted the importance of balancing performance with cost-effectiveness. We encourage sharing your experiences in the comments, including the metrics you use, preferred models, and challenges encountered.
