Dataster Documentation

Dataster helps you build Generative AI applications with better accuracy and lower latency.

Human Evaluation Tests Overview

When developing a GenAI application, ensuring the accuracy of the generated answers is paramount. Accuracy directly impacts user trust and satisfaction, as users rely on the application to provide reliable and correct information. Evaluating the quality of answers involves rigorous testing and validation against a diverse set of queries to ensure the AI can handle various contexts and nuances. High accuracy not only enhances the user experience but also reduces the risk of misinformation, which can have significant consequences. Therefore, continuous monitoring and improvement of answer quality are essential to maintain the application's credibility and effectiveness.


To address the challenge of ensuring high-quality outputs, Dataster offers a comprehensive quality assessment framework. It enables builders to evaluate their use case at scale by generating and analyzing hundreds of responses from different combinations of system prompts, Large Language Models (LLMs), and, optionally, vector stores for Retrieval-Augmented Generation (RAG). Builders can then compare these combinations to determine which configuration consistently delivers the most accurate and reliable answers for their use case and users. This systematic approach helps fine-tune the application to meet high standards of answer quality.
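As a rough illustration of the idea, the sketch below enumerates every system prompt, model, and vector store combination in plain Python. The prompt texts, model names, and index name are placeholders chosen for the example, not values or identifiers taken from Dataster.

    # Illustrative sketch only: enumerate the configurations whose outputs would later
    # be scored by a human evaluator. All names here are placeholders.
    from itertools import product

    system_prompts = ["You are a concise assistant.", "You are a detailed assistant."]
    models = ["gpt-4o", "claude-3-5-sonnet", "llama-3-70b"]
    vector_stores = [None, "product-docs-index"]  # None means no RAG for that combination

    # Each (system prompt, model, vector store) triple is one candidate configuration.
    combinations = list(product(system_prompts, models, vector_stores))
    for prompt, model, store in combinations:
        print(f"model={model:<20} rag={str(store):<20} prompt={prompt}")

With two prompts, three models, and two retrieval options, this already yields twelve configurations, which is why evaluating at scale quickly produces hundreds of responses to assess.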


Human Evaluation Job Input

This capability works by presenting both inputs and outputs to a human evaluator, whose task is to assess the quality of the outputs in relation to the inputs. Jobs can be configured to rate outputs either in a binary manner (good or bad) or on a scale from one to five. Additionally, Dataster allows the human evaluator to select options that help reduce bias. For example, they can choose to hide the model or RAG name during the evaluation. Similarly, they can opt to conceal the system prompt used to generate the outputs. Finally, they can randomize the order in which input and output pairs are presented, making it virtually impossible to determine which model, RAG, and system prompt produced a particular output.
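To make these options concrete, a job configuration along the following lines could capture the rating scale and the bias-reduction settings described above. The field names and values are hypothetical, chosen for the sketch; they are not Dataster's actual schema.

    # Illustrative sketch only: a hypothetical human evaluation job configuration.
    human_eval_job = {
        "rating_scale": "binary",     # or "one_to_five"
        "hide_model_name": True,      # conceal which model/RAG produced each output
        "hide_system_prompt": True,   # conceal the system prompt used for generation
        "randomize_order": True,      # shuffle input/output pairs before presentation
    }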


Human Evaluation Job Run

Once all the evaluations have been submitted, Dataster compiles statistics for each model, each RAG, and each system prompt, presenting the average score for each.
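To make the aggregation concrete, the sketch below computes the mean score per model, per RAG, and per system prompt from a handful of made-up evaluation records. The record structure is an assumption for illustration, not Dataster's output format.

    # Illustrative sketch only: average the human scores along each dimension.
    from collections import defaultdict
    from statistics import mean

    # Hypothetical evaluation records: each carries the configuration that produced
    # the output and the score assigned by the human evaluator.
    evaluations = [
        {"model": "gpt-4o", "rag": "product-docs-index", "system_prompt": "concise", "score": 4},
        {"model": "gpt-4o", "rag": None, "system_prompt": "detailed", "score": 3},
        {"model": "llama-3-70b", "rag": "product-docs-index", "system_prompt": "concise", "score": 5},
    ]

    def average_by(records, key):
        # Group the records by the given key and return the mean score per group.
        groups = defaultdict(list)
        for record in records:
            groups[record[key]].append(record["score"])
        return {value: mean(scores) for value, scores in groups.items()}

    for dimension in ("model", "rag", "system_prompt"):
        print(dimension, average_by(evaluations, dimension))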


Human Evaluation Job Result

Conclusion

Dataster's human evaluation framework empowers builders to rigorously assess the quality of their GenAI applications' outputs. By facilitating thorough testing across various RAG combinations and LLMs, Dataster ensures that developers can identify the optimal setup for delivering accurate and reliable answers. This meticulous approach enhances user trust and satisfaction, ultimately contributing to the overall success and credibility of GenAI applications.


If you encounter any issues or need further assistance, please contact our support team at support@dataster.com.