Hello Habr! My name is Tanya, and I'm a quality analyst on the Just AI knowledge base team. We develop a product for customer knowledge bases using Retrieval Augmented Generation (RAG) and offer turnkey base creation. A crucial aspect of our Proof of Concept (POC) for clients is evaluating the system's response quality and accuracy, as well as selecting the optimal language model (LLM). Higher accuracy translates to increased user trust and reduced manual information searches. Achieving 90% accuracy is a primary requirement for many of our clients. This article details our accuracy assessment methodology and model selection process.
Creating a Dataset for Quality Assessment
To effectively assess RAG generation quality, we utilize a hybrid approach combining synthetic and manual datasets.
General Rules for Compiling Synthetic and Manual Datasets
Synthetic Datasets: These are question-answer pairs generated using an LLM based on existing documents. While creating synthetic datasets is quick and easy, allowing for rapid generation of numerous questions per document, their inherent "sterility" is a limitation. The questions tend to be overly precise and correct, similar to exam questions. This high accuracy on synthetic data doesn't guarantee strong performance with real user queries.
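As a rough illustration of how such pairs can be produced, here is a minimal sketch that asks an LLM to generate question-answer pairs from a document. The model name, prompt wording, and output parsing are assumptions for the sketch, not our production pipeline.

```python
# Minimal sketch of synthetic Q&A generation from a source document.
# Assumptions: an OpenAI-compatible API and the "gpt-4o-mini" model name;
# a real pipeline would add chunking, validation, and more robust parsing.
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

GENERATION_PROMPT = (
    "Based on the document below, write {n} question-answer pairs that a real "
    "customer might ask. Return them as a JSON list of objects with the keys "
    "'question' and 'answer'.\n\nDocument:\n{document}"
)

def generate_pairs(document_text: str, n: int = 3) -> list[dict]:
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user",
                   "content": GENERATION_PROMPT.format(n=n, document=document_text)}],
        temperature=0.7,
    )
    # Parsing may need hardening in practice (models sometimes wrap JSON in prose).
    return json.loads(response.choices[0].message.content)

if __name__ == "__main__":
    doc = open("docs/loans.md", encoding="utf-8").read()
    for pair in generate_pairs(doc):
        print(pair["question"], "->", pair["answer"])
```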
Manual Datasets: Alternatively, we use manually created or client-provided datasets. Clients often possess a better understanding of typical user queries. Accurate responses to real-world questions significantly increase the likelihood of the knowledge base performing effectively in a production environment. However, manual dataset creation is time-consuming, requiring creation, verification, and potential future updates as document content changes. Furthermore, new clients might lack the data needed for comprehensive testing.
The choice between synthetic and client datasets depends on the resources and capabilities of both the development team and the client.
Best Practices for Dataset Creation (a sample dataset fragment follows the list):
- Diversity of Questions: Include a broad range of question types.
- Handling Abbreviations and Terminology: Questions should include abbreviations, acronyms, and product names found within the documentation. For example, include questions that require explaining abbreviations or deciphering product names, potentially including transliterations.
- Incorporating Errors: Include questions with misspellings and typos to simulate real-world scenarios (e.g., "Configure the Face Audi" instead of "Configure Face ID").
- Question Complexity: Utilize a mix of simple (e.g., "What is a loan?") and complex questions, including descriptive, reasoning, and comparative questions.
- Informal Language: Include informally phrased questions, mirroring real user interactions (e.g., "I can't find it, can you tell me about BIG (bin) bank?").
- Multi-part Questions: Incorporate questions with sub-questions (e.g., "I transferred money; where is it? How do I lift the hold?").
- Dialogue Support: Include questions simulating conversational interactions (e.g., "Hello, how do I get a 200k loan? What about a 300k loan?").
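To make these question types concrete, here is a hypothetical fragment of a test dataset; the field names and example answers are illustrative and are not taken from any client data.

```python
# Hypothetical dataset fragment illustrating the question types above.
# The field names ("type", "question", "reference_answer") are a convention
# used only in this sketch, not a required format.
test_dataset = [
    {"type": "simple",
     "question": "What is a loan?",
     "reference_answer": "A loan is money a bank lends to a customer ..."},
    {"type": "typo",
     "question": "How do I configure the Face Audi?",  # misspelling of "Face ID"
     "reference_answer": "Face ID is configured in the security settings ..."},
    {"type": "multi_part",
     "question": "I transferred money; where is it? How do I lift the hold?",
     "reference_answer": "Transfers are usually credited within ... A hold can be lifted by ..."},
    {"type": "dialogue",
     "question": "Hello, how do I get a 200k loan? What about a 300k loan?",
     "reference_answer": "For amounts up to 300k the bank requires ..."},
]
```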
Optimal Dataset Size:
We've found that anywhere from 100-200 up to 500 questions works well. For more thorough testing, we recommend 3-5 times the number of documents or 2-3 questions per knowledge base section. The optimal number can vary depending on document/section size, text volume, and content. If a document contains many points needing verification, the number of questions per document can be increased. One client provided a dataset of 2,000 questions. While the overall quality percentage was similar to that obtained with a 200-question sample, the larger dataset allowed for more precise identification of areas where our solution was lacking.
Case Study: Knauf
To evaluate LLM response generation on our platform, we used a knowledge base and dataset from our client, Knauf.
Project Overview
To optimize the Knauf project, we migrated their classic FAQ bot to a new architecture using RAG and an LLM. Initial tests revealed substantial improvements in bot knowledge.
- Iteration 1: The database contained 125 documents (491 MB total, 2.9 million characters, or 400,000 words).
- Iteration 2: 40 additional documents (140,000 characters) were added for 8 new products.
For quality testing, we generated a 320-question dataset, and the client provided a 456-question dataset covering products and systems.
Knauf Knowledge Base Characteristics
The Knauf knowledge base features specialized construction terminology and product names incorporating German words (e.g., KNAUF GOLDBAND, KNAUF FEGEN HID, KNAUF-SATENGIPS, KNAUF PERLFIX). The knowledge base aims to provide detailed information on material properties, applications, and relevant standards to aid customer decision-making.
Knauf Testing Process
Our Knauf quality testing process involved the following steps (a simplified pipeline sketch follows the list):
- LLM Selection: Choosing an LLM to generate responses.
- Query Processing: Submitting questions through the RAG system using the selected LLM and recording the responses.
- Quality Assessment: Evaluating response quality using established metrics.
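The query-processing step can be sketched roughly as follows, assuming a hypothetical rag_answer() helper that stands in for the platform's retrieval-and-generation call; the real pipeline runs on our platform and differs in detail.

```python
# Minimal sketch of the query-processing step: run every dataset question
# through the RAG pipeline and record the generated answers for later judging.
# `rag_answer` is a hypothetical placeholder for the actual platform call.
import csv
import time

def rag_answer(question: str, model: str) -> tuple[str, list[str]]:
    """Placeholder: retrieve relevant chunks and generate an answer with `model`."""
    raise NotImplementedError

def run_dataset(dataset: list[dict], model: str, out_path: str) -> None:
    with open(out_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(
            f, fieldnames=["question", "reference_answer", "generated_answer",
                           "contexts", "response_time_s"])
        writer.writeheader()
        for item in dataset:
            start = time.monotonic()
            answer, contexts = rag_answer(item["question"], model)
            writer.writerow({
                "question": item["question"],
                "reference_answer": item["reference_answer"],
                "generated_answer": answer,
                "contexts": " | ".join(contexts),
                "response_time_s": round(time.monotonic() - start, 2),
            })
```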
Prompt Engineering for Generation
Prompt engineering significantly impacts response quality. Different models often require different prompts to achieve optimal results. However, to ensure consistent evaluation across models, we used a single prompt for all models. This allows for a fair comparison of how each model interprets and responds to the same input data. This approach doesn't provide universal conclusions about each LLM's overall generation capabilities but allows us to understand its effectiveness for the client's specific tasks. A limitation of this approach is the potential sensitivity of LLMs to specific phrasing or prompt length. For increased accuracy, reevaluation with different datasets and prompt variations may be necessary.
Our prompt incorporated several elements (a simplified template sketch follows the list):
- Basic Response Generation Instructions: Clear instructions on how to format the response.
- Client Communication Style: Client-specified preferences for communication style.
- Product Information: Additional tips and comments regarding the client's product range.
- Channel-Specific Considerations: Instructions addressing any response formatting limitations imposed by the knowledge base's publication channel.
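A simplified sketch of how these elements might be assembled into one generation prompt; the wording and placeholders are illustrative rather than the actual Knauf prompt.

```python
# Simplified generation-prompt template assembling the elements listed above.
# The wording is illustrative; the production prompt is client-specific.
GENERATION_PROMPT_TEMPLATE = """\
You are a support assistant answering strictly from the provided documents.

Instructions:
- Answer only using the context below; if the answer is not there, say you don't know.
- Communication style: {client_style}
- Product notes: {product_notes}
- Channel constraints: {channel_constraints}

Context:
{retrieved_chunks}

Question: {question}
Answer:"""

prompt = GENERATION_PROMPT_TEMPLATE.format(
    client_style="polite, concise, no slang",
    product_notes="use official product names from the catalog",
    channel_constraints="plain text only, no markdown tables",
    retrieved_chunks="...",
    question="How do I apply KNAUF PERLFIX?",
)
```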
Choosing an Independent LLM Judge
To assess response quality, we employed an independent LLM judge. The LLM received the question, the reference answer, and the generated response. It then provided a quality assessment based on pre-defined metrics and a final overall score.
Why Use an LLM Judge?
We chose an automated LLM evaluation rather than human evaluation due to cost and efficiency. LLM judges quickly process large datasets and provide evaluations faster than humans. Furthermore, LLMs provide greater assessment consistency by adhering to the same criteria for all examples.
Another reason for choosing an LLM judge is the specialized terminology and product names in the knowledge base, which might be challenging for someone unfamiliar with the company's catalog.
We used GPT-4 as the judge for all models except GPT-4 and GPT-4 Mini themselves, since we avoid self-evaluation to prevent bias. For evaluating the OpenAI model family, we used Claude 3.7 Sonnet, which ranks as a stronger model in current benchmarks.
Evaluation Metrics
We used the following metrics (a judging-prompt sketch follows the list):
- Answer Correctness: Accuracy of the answer compared to the reference answer; whether key facts are correctly conveyed and undistorted.
- Faithfulness: How accurately the generated answer reflects information from the source documents. High faithfulness indicates the absence of fabricated information; low faithfulness suggests potential hallucinations.
- Answer Relevance: How well the answer addresses the question; whether the answer contains irrelevant or extraneous information.
- Response Time: The time taken to process the request.
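Putting the judge setup and the metrics together, a judging call might look roughly like the sketch below. The rubric wording, the JSON output format, and the retrieved-context field are assumptions, and a 1-5 scale is used here (the scale choice is discussed in the next section); the judge model is passed in explicitly so that self-evaluation can be avoided as described above.

```python
# Sketch of the LLM-judge call covering the three quality metrics above.
# The rubric wording, JSON output format, and context field are assumptions.
import json
from openai import OpenAI

judge_client = OpenAI()

JUDGE_PROMPT = """\
You are an impartial judge. Rate the generated answer on each metric from 1 to 5:
- answer_correctness: key facts match the reference answer and are not distorted;
- faithfulness: the answer states only what the retrieved context supports (no hallucinations);
- answer_relevance: the answer addresses the question without extraneous information.

Question: {question}
Reference answer: {reference}
Retrieved context: {context}
Generated answer: {generated}

Return JSON: {{"answer_correctness": 1-5, "faithfulness": 1-5, "answer_relevance": 1-5}}"""

def judge(question: str, reference: str, context: str, generated: str, model: str) -> dict:
    # `model` is chosen per the policy above (never the model being evaluated).
    response = judge_client.chat.completions.create(
        model=model,
        temperature=0,
        response_format={"type": "json_object"},
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(
            question=question, reference=reference,
            context=context, generated=generated)}],
    )
    return json.loads(response.choices[0].message.content)
```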
Assessment Scale
Since LLM assessments can be subjective, we tested two scales: 1-10 and 1-5. We performed three evaluations on the same 100 question-answer pairs using both scales. Histograms visualized the results, showing the frequency of each score across the three attempts. Significant variation in score frequency indicates evaluator instability.
The 1-10 scale showed minor differences in average scores (0.06) across attempts. However, the graph revealed greater differences in scores around 6, 8, and 9. Interpreting these close values (e.g., 6 vs. 7) proved difficult for both models and humans. The detailed scale might lead to the judge "blurring" its estimates by selecting intermediate values that don't accurately reflect the model's quality.
The 1-5 scale yielded more pronounced differences, suggesting that the model interprets the 1-5 scale more definitively than the 1-10 scale. We concluded that more evaluation options increase the likelihood of inconsistent interpretations and evaluations. We chose the 1-5 scale over the 1-10 scale.
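The stability check itself is straightforward to reproduce: judge the same pairs several times and compare how often each score appears per attempt. The sketch below uses made-up scores purely for illustration.

```python
# Sketch of the judge-stability check: score the same question-answer pairs
# several times and compare how often each score value appears per attempt.
from collections import Counter

def score_frequencies(scores: list[int]) -> Counter:
    """Frequency of each score value (e.g. 1..5) in one evaluation attempt."""
    return Counter(scores)

# Hypothetical results of three attempts on the same pairs (truncated).
attempts = {
    "attempt_1": [5, 4, 5, 3, 5],
    "attempt_2": [5, 5, 4, 3, 5],
    "attempt_3": [4, 5, 5, 3, 5],
}

for name, scores in attempts.items():
    freq = score_frequencies(scores)
    mean = sum(scores) / len(scores)
    print(name, dict(sorted(freq.items())), f"mean={mean:.2f}")
# Large swings in per-score frequency between attempts indicate an unstable judge scale.
```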
Model Evaluation Results
Nineteen models from eight families participated in our assessment: OpenAI, Qwen, Llama, Claude, DeepSeek, GigaChat, YandexGPT, and T-Tech.
(Insert table here listing all 19 models)
Model integration was handled using our Caila MLOps platform, which can directly access models via internal APIs if they're on our servers or use third-party solutions like OpenRouter.
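For the third-party route, one common option is an OpenAI-compatible client pointed at OpenRouter, as sketched below; the endpoint URL and model identifier are assumptions about that setup, and the Caila internal API is not shown.

```python
# Sketch of calling a third-party hosted model through OpenRouter's
# OpenAI-compatible endpoint; the Caila internal API is not shown here.
import os
from openai import OpenAI

router = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key=os.environ["OPENROUTER_API_KEY"],
)

response = router.chat.completions.create(
    model="deepseek/deepseek-r1",  # model identifier as listed by the router
    messages=[{"role": "user", "content": "What is KNAUF PERLFIX used for?"}],
)
print(response.choices[0].message.content)
```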
We used two visualization methods to present the results: box plots and column charts.
Box Plots
Box plots show the distribution of scores (minimum, first quartile, median, third quartile, maximum). The box represents the interquartile range (IQR), the line within the box marks the median, and points outside the box represent outliers. For Answer Correctness, Faithfulness, and Answer Relevance, a smaller box indicates more consistent scores (homogeneous answer quality). For Response Time, a lower box and median indicate faster response times. DeepSeek-R1 showed the longest response times, while GPT-4, GigaChat Lite, and GigaChat 2 Lite were the fastest.
Column Charts
Visualizing results on a 5-point scale for many models showed low interquartile ranges, with most scores clustered around 5. Many box plots were reduced to a single line with numerous outliers. To compare score distributions, we used column charts to quickly evaluate positive and negative assessments.
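Both visualizations are easy to reproduce with matplotlib; the sketch below uses made-up per-model response times and judge scores purely to show the plotting approach.

```python
# Sketch of the two visualizations: box plots of response times and a column
# chart of score frequencies per model. All data here is made up.
from collections import Counter
import matplotlib.pyplot as plt

response_times = {            # seconds per request, hypothetical values
    "model_a": [3.1, 4.0, 5.2, 3.8],
    "model_b": [8.5, 9.1, 12.3, 10.0],
}
correctness_scores = {        # 1-5 judge scores, hypothetical values
    "model_a": [5, 5, 4, 5, 3],
    "model_b": [4, 5, 5, 2, 5],
}

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

# Box plot: a lower box and median mean faster responses.
ax1.boxplot(list(response_times.values()), labels=list(response_times.keys()))
ax1.set_ylabel("Response time, s")

# Column chart: frequency of each score per model.
width = 0.35
for i, (model, scores) in enumerate(correctness_scores.items()):
    freq = Counter(scores)
    xs = [score + i * width for score in range(1, 6)]
    ax2.bar(xs, [freq.get(s, 0) for s in range(1, 6)], width=width, label=model)
ax2.set_xlabel("Answer Correctness score")
ax2.set_ylabel("Count")
ax2.legend()

plt.tight_layout()
plt.show()
```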
(Insert relevant charts and graphs here, visualizing the results for Answer Correctness, Faithfulness, Answer Relevance, and Response Time across the different models.)
Model Selection
GPT-4 Mini was selected for the project. It demonstrated good performance across all three quality metrics, and its median response time did not exceed 10 seconds. Its performance was comparable to GPT-4, but at a lower cost.
Conclusion and Call to Action
This comprehensive methodology allows for robust evaluation of RAG-based knowledge bases. We encourage sharing your experiences in the comments. What metrics do you use? Which models do you find most suitable for RAG architecture, and why? What challenges have you encountered?