
How Do We Know if an AI Model Is Good?

Not all AI models are created equal. Some are fast but imprecise; others are accurate but slow or expensive to run. To decide which models are worth using—and how to improve them—AI developers and researchers use a range of tests, scores, and real-world indicators. In this section, we explain the most common methods for evaluating AI performance. These include standardized academic benchmarks, speed metrics, and human feedback. Together, they help teams ensure models are not just powerful, but also useful and safe.

Benchmarks (e.g., MMLU, HellaSwag, TruthfulQA)

What it means: Benchmarks are standardized sets of test questions used to measure and compare how well different models perform across specific types of tasks or reasoning. Examples:
  • MMLU: Evaluates knowledge and reasoning across 57 academic and professional subjects.
  • HellaSwag: Tests the model’s ability to apply common sense and continue sentences in plausible ways.
  • TruthfulQA: Measures whether a model produces accurate and truthful responses.
Why it matters: Benchmarks let researchers compare different models in a consistent way and track improvements over time.
How it shows up in Gloo: Gloo reviews external academic benchmarks when selecting foundation and frontier models, but we go further by developing our own Flourishing AI Benchmarks. These proprietary evaluations measure how well a model supports the dimensions of human flourishing, such as emotional health, relational strength, purpose, and resilience. Early results show that Gloo-aligned models score significantly higher on these measures than off-the-shelf systems.
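To make the idea concrete, here is a minimal sketch of how a multiple-choice benchmark like MMLU is typically scored: each question is sent to the model, the model's answer is compared against the answer key, and the final score is simple accuracy. The ask_model function and the sample question below are placeholders, not part of any real benchmark or Gloo API.

```python
# Minimal sketch of multiple-choice benchmark scoring (MMLU-style).
questions = [
    {
        "prompt": "Which planet is closest to the Sun?\nA) Venus\nB) Mercury\nC) Mars\nD) Earth",
        "answer": "B",
    },
    # A real benchmark has thousands of questions spanning many subjects.
]

def ask_model(prompt: str) -> str:
    """Placeholder: call your model here and return a single answer letter (A-D)."""
    return "B"

correct = sum(
    1 for q in questions if ask_model(q["prompt"]).strip().upper() == q["answer"]
)
accuracy = correct / len(questions)
print(f"Benchmark accuracy: {accuracy:.1%}")
```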

Latency

What it means: Latency is the time it takes for an AI model to start responding after a prompt is submitted.
Why it matters: Low latency feels fast and responsive. High latency can make even a smart model frustrating to use in real-time applications.
How it shows up in Gloo: Responsive interactions are essential for ministry and organizational workflows. Gloo monitors latency across each model provider we integrate and routes requests to the fastest reliable option. This helps Chat for Teams feel immediate, especially when retrieving larger documents from the Data Engine.
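As a rough illustration, latency is often measured as "time to first token": start a timer, send the prompt, and stop the timer as soon as the first piece of the response arrives. The sketch below assumes a hypothetical streaming function, stream_model_response, standing in for a real provider's streaming API.

```python
import time

def stream_model_response(prompt: str):
    """Placeholder streaming call: yields response tokens one at a time."""
    for token in ["Hello", ",", " world", "!"]:
        time.sleep(0.05)  # simulate model/network delay per token
        yield token

start = time.perf_counter()
stream = stream_model_response("Summarize this document.")
first_token = next(stream)  # block until the first token arrives
latency_ms = (time.perf_counter() - start) * 1000
print(f"Time to first token: {latency_ms:.0f} ms")
```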

Throughput

What it means: Throughput measures how many prompts a model can handle per second when running at scale.
Why it matters: Higher throughput means the system can serve more users at once—critical for apps that rely on AI for live chat, search, or customer service.
How it shows up in Gloo: High throughput is important for partners who rely on Gloo for large-scale search, enrichment, or chat workloads. Gloo manages throughput behind the scenes by autoscaling infrastructure when necessary, ensuring consistent performance regardless of volume.
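One simple way to picture throughput is to fire many requests concurrently and count how many complete per second. The sketch below uses a hypothetical call_model function with a fixed 100 ms response time; in practice you would swap in a real client and real prompts.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def call_model(prompt: str) -> str:
    """Placeholder request that takes ~100 ms, like a fast model endpoint."""
    time.sleep(0.1)
    return "response"

prompts = [f"Request {i}" for i in range(100)]

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=20) as pool:
    list(pool.map(call_model, prompts))
elapsed = time.perf_counter() - start

print(f"Throughput: {len(prompts) / elapsed:.1f} prompts/second")
```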

Tokens per Second (TPS)

What it means: TPS tells you how quickly a model can generate text. AI models process and output content in units called tokens (which are usually chunks of words or syllables). Example: If a model runs at 50 TPS, it can generate a 250-token reply (about 2–3 paragraphs) in 5 seconds.
Why it matters: Faster TPS leads to better user experiences, especially for tools that generate long or complex responses.
How it shows up in Gloo: Gloo tracks TPS to ensure that longer answers, such as sermon summaries and policy explanations, generate quickly during live use. Faster TPS creates a smoother experience in Chat for Teams and helps Studio users iterate on content without waiting for slow responses.
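The arithmetic behind TPS is straightforward: divide the number of tokens generated by the time the generation took. The sketch below approximates tokens with a whitespace split and uses a placeholder generate function; real measurements use the model's own tokenizer.

```python
import time

def generate(prompt: str) -> str:
    """Placeholder generation call: pretend the model spends 2 seconds producing ~100 tokens."""
    time.sleep(2.0)
    return "word " * 100

start = time.perf_counter()
text = generate("Write a short summary.")
elapsed = time.perf_counter() - start

token_count = len(text.split())  # rough whitespace approximation of tokens
print(f"{token_count} tokens in {elapsed:.1f}s -> {token_count / elapsed:.0f} TPS")
# At 50 TPS, a 250-token reply takes 250 / 50 = 5 seconds, as in the example above.
```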

BLEU / ROUGE / F1 Score

What it means: These are automated scores that measure how closely a model’s outputs match expected answers.
  • BLEU: Used mostly in translation tasks.
  • ROUGE: Commonly used to evaluate summarization.
  • F1 Score: Balances precision (how much of the output is correct) with recall (how much of the correct output was included).
Why it matters: These scores provide fast, repeatable ways to evaluate accuracy, especially during model development.
How it shows up in Gloo: Traditional accuracy metrics matter most for enrichment and document processing tasks. When Gloo extracts topics, classifies documents, or creates metadata, automated scoring methods help validate that enrichment is consistent and high quality across an organization’s full content library.
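As an illustration of how one of these scores works, the sketch below computes a token-overlap F1 between a model's answer and a reference answer: precision asks how much of the output matched, recall asks how much of the reference was covered, and F1 is their harmonic mean. This is a simplified example for illustration, not the exact scoring code of any particular benchmark.

```python
from collections import Counter

def f1_score(prediction: str, reference: str) -> float:
    """Token-level F1 between a model output and a reference answer."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)  # how much of the output is correct
    recall = overlap / len(ref_tokens)      # how much of the reference was included
    return 2 * precision * recall / (precision + recall)

print(f1_score("the cat sat on the mat", "a cat sat on the mat"))  # ~0.83
```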

Human Eval

What it means: Human evaluation is when people judge model outputs based on qualities like helpfulness, clarity, and tone.
Why it matters: Humans can catch things machines can’t—like whether an answer is polite, well-structured, or socially appropriate. Human eval is especially important for AI used in education, counseling, or creative writing.
How it shows up in Gloo: Human evaluation plays a major role in shaping Gloo’s alignment system. Our team of safety reviewers assesses model outputs to ensure responses are pastorally appropriate, theologically sound, and supportive of flourishing. These evaluations continuously inform Gloo’s safety instructions, grounding strategy, and model routing.
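Human evaluation is a process rather than an algorithm, but the results are usually collected and compared numerically. The sketch below shows one common pattern: reviewers rate each response on a small rubric, and average scores are compared across models. The rubric, model names, and ratings here are invented for illustration and do not represent Gloo's actual review data.

```python
from statistics import mean

# Each row: (model, helpfulness, clarity, tone), rated 1-5 by a human reviewer.
ratings = [
    ("model_a", 5, 4, 5),
    ("model_a", 4, 4, 4),
    ("model_b", 3, 4, 2),
    ("model_b", 4, 3, 3),
]

for model in sorted({r[0] for r in ratings}):
    rows = [r for r in ratings if r[0] == model]
    print(
        model,
        f"helpfulness={mean(r[1] for r in rows):.1f}",
        f"clarity={mean(r[2] for r in rows):.1f}",
        f"tone={mean(r[3] for r in rows):.1f}",
    )
```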
Next Up: Review and What’s Next

In the next section, we’ll answer: “What have you learned and what can I do next with Gloo and AI?”