Gloo AI is committed to building transparent, value-aligned, and rigorous evaluation systems for large language models. Our benchmarks are not just technical; they are human, faith-aware, and centered on real-world use in communities.

What We’re Building

We’re developing a shared benchmark framework that is:
  • Open: Designed to invite contribution from the wider faith-tech and AI ethics community.
  • Collaborative: Built alongside values-aligned partners, not just for internal use.
  • Multidimensional: Built to measure value alignment, relevance, and judgment, not just accuracy.

What We Evaluate

Each model is scored against a broad range of criteria across three primary dimensions (a minimal scoring sketch follows the list):
  1. Objective – Factual correctness, reasoning quality.
  2. Subjective – Tone, empathy, value alignment.
  3. Tangential – Off-topic drift, misunderstanding, or indirect errors.
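
To make this rubric concrete, here is a minimal sketch of how per-response scores along these three dimensions might be represented. The dimension groupings mirror the list above, but the individual field names, the 0-1 scale, and the unweighted aggregate are illustrative assumptions, not our published schema.

```python
from dataclasses import dataclass

@dataclass
class EvalScores:
    """Scores for one model response, grouped by the three dimensions above.

    All values are on a 0-1 scale; the rubric items and the aggregation
    below are illustrative assumptions, not Gloo's actual schema.
    """
    # Objective: factual correctness, reasoning quality
    factual_correctness: float
    reasoning_quality: float
    # Subjective: tone, empathy, value alignment
    tone: float
    empathy: float
    value_alignment: float
    # Tangential: penalties for drift and indirect errors
    topic_drift: float        # 0 = fully on topic, 1 = fully off topic
    misunderstanding: float   # 0 = none observed

    def aggregate(self) -> float:
        """Unweighted aggregate: average the quality scores, subtract drift."""
        quality = (
            self.factual_correctness + self.reasoning_quality
            + self.tone + self.empathy + self.value_alignment
        ) / 5
        penalty = (self.topic_drift + self.misunderstanding) / 2
        return max(0.0, quality - penalty)
```

Treating the tangential dimension as a penalty rather than a positive score reflects the intent above: off-topic drift and misunderstanding subtract from an otherwise strong answer.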

Our Testing Approach

  • More than 3,000 curated evaluation prompts mapped to seven dimensions of human flourishing.
  • Faith-specific QA and worldview-sensitive questions sourced from real communities.
  • Evaluation by both LLM judges and human reviewers, with checks for model self-awareness and consistency.
  • Support for comparing open-source and proprietary models side-by-side.
We run evaluations across top models, including Gemini, DeepSeek, Mistral, and Grok, to identify strengths, failure modes, and opportunities for alignment; a minimal comparison harness is sketched below.
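
Here is a sketch of what a side-by-side run can look like, assuming every model, open-source or proprietary, is wrapped behind a common prompt-in, text-out interface. The `ModelFn`, `JudgeFn`, and `compare_models` names are hypothetical, and the single `judge` callable stands in for the combined LLM-judge-plus-human scoring described above.

```python
import statistics
from typing import Callable

# Placeholder types: in a real harness these would wrap actual API
# clients (open-source or proprietary) behind one common interface.
ModelFn = Callable[[str], str]          # prompt -> model response
JudgeFn = Callable[[str, str], float]   # (prompt, response) -> 0-1 score

def compare_models(prompts: list[str],
                   models: dict[str, ModelFn],
                   judge: JudgeFn) -> dict[str, float]:
    """Run every model on every prompt and return its mean judged score."""
    results: dict[str, float] = {}
    for name, model in models.items():
        scores = [judge(p, model(p)) for p in prompts]
        results[name] = statistics.mean(scores)
    return results

# Hypothetical usage; the client functions are assumptions:
# scores = compare_models(prompts,
#                         {"gemini": call_gemini, "mistral": call_mistral},
#                         llm_judge)
```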

Results That Matter

Our benchmark results are designed to:
  • Inform partners choosing models for high-trust environments.
  • Guide internal training and fine-tuning decisions.
  • Track changes over time as the AI landscape evolves.
  • Provide transparent model performance reports for trust-building.

Tools You Can Use

We’re also building tools for:
  • Automating model evaluation at scale.
  • Embedding benchmarks into CI/CD pipelines for ML (see the sketch after this list).
  • Visualizing results across multiple dimensions and LLM types.
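
For the CI/CD case, here is a minimal sketch of a benchmark gate written as a pytest check. The results path, JSON layout, and 0.75 threshold are assumptions for illustration, not our actual tooling.

```python
# test_benchmark_gate.py -- illustrative CI gate; the file path,
# threshold, and results format are assumptions, not Gloo tooling.
import json
from pathlib import Path

THRESHOLD = 0.75  # hypothetical minimum aggregate score to pass CI

def test_models_meet_benchmark_threshold():
    """Fail the pipeline if any candidate model regresses below threshold.

    Assumes an earlier pipeline step wrote aggregate scores to
    results/benchmark.json as {"model_name": score, ...}.
    """
    results = json.loads(Path("results/benchmark.json").read_text())
    failing = {m: s for m, s in results.items() if s < THRESHOLD}
    assert not failing, f"Models below threshold {THRESHOLD}: {failing}"
```
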
We’re not just measuring what models can say; we’re measuring what they should say, and how well they serve the communities we care about.