How Do We Know if an AI Model Is Good?
Not all AI models are created equal. Some are fast but imprecise; others are accurate but slow or expensive to run. To decide which models are worth using, and how to improve them, AI developers and researchers use a range of tests, scores, and real-world indicators. In this section, we explain the most common methods for evaluating AI performance. These include standardized academic benchmarks, speed metrics, and human feedback. Together, they help teams ensure models are not just powerful, but also useful and safe.
Benchmarks (e.g., MMLU, HellaSwag, TruthfulQA)
What it means: Benchmarks are standardized sets of test questions used to measure and compare how well different models perform across specific types of tasks or reasoning.
Examples:
- MMLU: Evaluates knowledge and reasoning across 57 academic and professional subjects.
- HellaSwag: Tests the model’s ability to apply common sense and continue sentences in plausible ways.
- TruthfulQA: Measures whether a model gives truthful answers to questions that people often get wrong because of common misconceptions.
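To make the idea concrete: for a multiple-choice benchmark, the headline score is often just the share of questions the model answers correctly. The sketch below is a minimal illustration under stated assumptions, not the official harness of any benchmark. The `BenchmarkItem` type, the `score_benchmark` function, and the toy questions are made up for this example, and `pick_first` is a hypothetical stand-in for a real model call.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class BenchmarkItem:
    question: str
    choices: Sequence[str]
    answer_index: int  # position of the correct choice in `choices`


def score_benchmark(
    items: Sequence[BenchmarkItem],
    choose: Callable[[str, Sequence[str]], int],
) -> float:
    """Return accuracy: the fraction of items where the chosen index is correct."""
    if not items:
        raise ValueError("benchmark has no items")
    correct = sum(
        1 for item in items
        if choose(item.question, item.choices) == item.answer_index
    )
    return correct / len(items)


# Toy data for illustration only; real benchmarks are far larger.
toy_items = [
    BenchmarkItem("2 + 2 = ?", ["3", "4", "5", "6"], 1),
    BenchmarkItem("Which is a mammal?", ["whale", "trout", "eagle", "gecko"], 0),
    BenchmarkItem("Water freezes at (Celsius)?", ["100", "50", "0", "-10"], 2),
    BenchmarkItem("Opposite of 'hot'?", ["cold", "warm", "dry", "wet"], 0),
]


def pick_first(question: str, choices: Sequence[str]) -> int:
    """Hypothetical stand-in for a model: always picks the first choice."""
    return 0


accuracy = score_benchmark(toy_items, pick_first)
print(f"accuracy = {accuracy:.2f}")  # 2 of 4 correct -> accuracy = 0.50
```

Because the same questions and the same scoring rule are applied to every model, the resulting numbers can be compared directly, which is the main point of a standardized benchmark.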
Latency
What it means: Latency is the time it takes for an AI model to start responding after a prompt is submitted.
Why it matters: Low latency feels fast and responsive. High latency can make even a smart model frustrating to use in real-time applications.
How it shows up in Gloo: Responsive interactions are essential for ministry and organizational workflows. Gloo monitors latency across each model provider we integrate and routes requests to the fastest reliable option. This helps Chat for Teams feel immediate, especially when retrieving larger documents from the Data Engine.
Throughput
What it means: Throughput measures how many prompts a model can handle per second when running at scale.
Why it matters: Higher throughput means the system can serve more users at once, which is critical for apps that rely on AI for live chat, search, or customer service.
How it shows up in Gloo: High throughput is important for partners who rely on Gloo for large-scale search, enrichment, or chat workloads. Gloo manages throughput behind the scenes by autoscaling infrastructure when necessary, keeping performance consistent as volume grows.
Tokens per Second (TPS)
What it means: TPS tells you how quickly a model can generate text. AI models process and output content in units called tokens (which are usually whole words or pieces of words).
Example: If a model runs at 50 TPS, it can generate a 250-token reply (about 2–3 paragraphs) in 5 seconds, because 250 ÷ 50 = 5.
Why it matters: Higher TPS leads to better user experiences, especially for tools that generate long or complex responses.
How it shows up in Gloo: Gloo tracks TPS to ensure that longer answers, such as sermon summaries and policy explanations, generate quickly during live use. Higher TPS creates a smoother experience in Chat for Teams and helps Studio users iterate on content without waiting for slow responses.
BLEU / ROUGE / F1 Score
What it means: These are automated scores that measure how closely a model’s outputs match expected answers.
- BLEU: Used mostly in translation tasks.
- ROUGE: Commonly used to evaluate summarization.
- F1 Score: Balances precision (how much of the output is correct) with recall (how much of the correct output was included).
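As a rough illustration of how precision, recall, and F1 fit together, the sketch below scores a model answer against an expected answer by counting shared words. This is one simplified way to apply F1 to text, not the exact procedure of any particular evaluation suite. The function name `token_f1` and the sample sentences are assumptions made for this example, and real evaluation scripts usually normalize text more carefully (for instance, by stripping punctuation).

```python
from collections import Counter
from typing import Tuple


def token_f1(prediction: str, reference: str) -> Tuple[float, float, float]:
    """Return (precision, recall, f1) from overlapping lowercase word tokens."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        return 0.0, 0.0, 0.0
    # Counter intersection keeps each shared token at its minimum count.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0, 0.0, 0.0
    precision = overlap / len(pred_tokens)  # share of the output that is correct
    recall = overlap / len(ref_tokens)      # share of the expected answer covered
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f1


# The model's answer contains the whole expected answer plus extra words.
precision, recall, f1 = token_f1("the cat sat on the mat", "the cat sat")
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
# precision=0.50 recall=1.00 f1=0.67
```

Here the answer covers everything expected (recall of 1.0), but half of its words are extra (precision of 0.5). F1 combines the two as a harmonic mean, so it lands closer to the lower of the two values than a simple average would.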
Human Eval
What it means: Human evaluation is when people judge model outputs based on qualities like helpfulness, clarity, and tone.
Why it matters: Humans can catch things that automated scores miss, such as whether an answer is polite, well-structured, or socially appropriate. Human eval is especially important for AI used in education, counseling, or creative writing.
How it shows up in Gloo: Human evaluation plays a major role in shaping Gloo’s alignment system. Our team of Safety reviewers assesses model outputs to ensure responses are pastorally appropriate, theologically sound, and supportive of flourishing. These evaluations continuously inform Gloo’s safety instructions, grounding strategy, and model routing.
Next Up: Review and What’s Next
In the next section, we’ll answer: “What have you learned, and what can you do next with Gloo and AI?”

