Gloo AI is committed to building transparent, value-aligned, and rigorous evaluation systems for large language models. Our benchmarks are not just technical; they are human, faith-aware, and centered on real-world use in communities.

What We’re Building

We’re developing a shared benchmark framework that is:
  • Open: Designed to invite contribution from the wider faith-tech and AI ethics community.
  • Collaborative: Built alongside values-aligned partners, not just for internal use.
  • Multidimensional: Built to measure value alignment, relevance, and judgment, not just accuracy.

What We Evaluate

Each model is scored against a broad range of criteria across three primary dimensions (a minimal scoring sketch follows the list):
  1. Objective – Factual correctness, reasoning quality.
  2. Subjective – Tone, empathy, value alignment.
  3. Tangential – Off-topic drift, misunderstanding, or indirect errors.
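
To make this rubric concrete, here is a minimal sketch of how per-response scores along these three dimensions might be represented. The dimension groupings mirror the list above, but the individual field names, the 0-1 scale, and the unweighted aggregate are illustrative assumptions, not our published schema.

```python
from dataclasses import dataclass

@dataclass
class EvalScores:
    """Scores for one model response, grouped by the three dimensions above.

    All values are on a 0-1 scale; the rubric items and the aggregation
    below are illustrative assumptions, not Gloo's actual schema.
    """
    # Objective: factual correctness, reasoning quality
    factual_correctness: float
    reasoning_quality: float
    # Subjective: tone, empathy, value alignment
    tone: float
    empathy: float
    value_alignment: float
    # Tangential: penalties for drift and indirect errors
    topic_drift: float        # 0 = fully on topic, 1 = fully off topic
    misunderstanding: float   # 0 = none observed

    def aggregate(self) -> float:
        """Unweighted aggregate: average the quality scores, subtract drift."""
        quality = (
            self.factual_correctness + self.reasoning_quality
            + self.tone + self.empathy + self.value_alignment
        ) / 5
        penalty = (self.topic_drift + self.misunderstanding) / 2
        return max(0.0, quality - penalty)
```

Treating the tangential dimension as a penalty rather than a positive score reflects the intent above: off-topic drift and misunderstanding subtract from an otherwise strong answer.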

Our Testing Approach

  • More than 3,000 curated evaluation prompts mapped to seven dimensions of human flourishing.
  • Faith-specific QA and worldview-sensitive questions sourced from real communities.
  • Evaluation by both LLM judges and human reviewers, with checks for model self-awareness and consistency.
  • Support for comparing open-source and proprietary models side-by-side.
We run evaluations across top models, including Gemini, DeepSeek, Mistral, and Grok, to identify strengths, failure modes, and opportunities for alignment; a minimal comparison harness is sketched below.
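
Here is a sketch of what a side-by-side run can look like, assuming every model, open-source or proprietary, is wrapped behind a common prompt-in, text-out interface. The `ModelFn`, `JudgeFn`, and `compare_models` names are hypothetical, and the single `judge` callable stands in for the combined LLM-judge-plus-human scoring described above.

```python
import statistics
from typing import Callable

# Placeholder types: in a real harness these would wrap actual API
# clients (open-source or proprietary) behind one common interface.
ModelFn = Callable[[str], str]          # prompt -> model response
JudgeFn = Callable[[str, str], float]   # (prompt, response) -> 0-1 score

def compare_models(prompts: list[str],
                   models: dict[str, ModelFn],
                   judge: JudgeFn) -> dict[str, float]:
    """Run every model on every prompt and return its mean judged score."""
    results: dict[str, float] = {}
    for name, model in models.items():
        scores = [judge(p, model(p)) for p in prompts]
        results[name] = statistics.mean(scores)
    return results

# Hypothetical usage; the client functions are assumptions:
# scores = compare_models(prompts,
#                         {"gemini": call_gemini, "mistral": call_mistral},
#                         llm_judge)
```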

Results That Matter

Our benchmark results are designed to:
  • Inform partners choosing models for high-trust environments.
  • Guide internal training and fine-tuning decisions.
  • Track changes over time as the AI landscape evolves.
  • Provide transparent model performance reports for trust-building.

Tools You Can Use

We’re also building tools for:
  • Automating model evaluation at scale.
  • Embedding benchmarks into CI/CD pipelines for ML (see the sketch after this list).
  • Visualizing results across multiple dimensions and LLM types.
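
For the CI/CD case, here is a minimal sketch of a benchmark gate written as a pytest check. The results path, JSON layout, and 0.75 threshold are assumptions for illustration, not our actual tooling.

```python
# test_benchmark_gate.py -- illustrative CI gate; the file path,
# threshold, and results format are assumptions, not Gloo tooling.
import json
from pathlib import Path

THRESHOLD = 0.75  # hypothetical minimum aggregate score to pass CI

def test_models_meet_benchmark_threshold():
    """Fail the pipeline if any candidate model regresses below threshold.

    Assumes an earlier pipeline step wrote aggregate scores to
    results/benchmark.json as {"model_name": score, ...}.
    """
    results = json.loads(Path("results/benchmark.json").read_text())
    failing = {m: s for m, s in results.items() if s < THRESHOLD}
    assert not failing, f"Models below threshold {THRESHOLD}: {failing}"
```
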
We’re not just measuring what models can say; we’re measuring what they should say, and how well they serve the communities we care about.