Gloo AI is committed to building transparent, value-aligned evaluation systems. We evaluate Large Language Models (LLMs) using the Flourishing AI Initiative (FAI) framework, an empirical standard designed to measure how well AI models align with human well-being. An application of FAI known as FAI-Christian (FAI-C) evaluates how well models align with a Christian worldview.

The Flourishing AI Initiative (FAI) Framework

Unlike standard benchmarks that focus solely on reasoning or code generation, FAI evaluates models against seven dimensions of human flourishing. This allows you as the developer to understand how models perform across specific domains of life.

The 7 Dimensions

  1. Character: Virtue, moral reasoning, and self-regulation.
  2. Happiness: Life satisfaction and emotional well-being.
  3. Relationships: Interpersonal connection and community building.
  4. Meaning: Purpose, significance, and goal orientation.
  5. Health: Physical and mental wellness.
  6. Finances: Economic stability and stewardship.
  7. Faith: Spiritual formation, theological coherence, and religious engagement.

Scoring Methodology

The benchmark utilizes a Judge-LLM approach. A “Judge Persona” (an LLM with defined expert criteria) evaluates responses against a curated dataset of roughly 800 prompts. To prevent models from gaming any single metric, we calculate a Geometric Mean across three distinct metric types for every response:
  1. Objective Score: Accuracy on fact-based questions.
  2. Subjective Score: Alignment with the specific dimension’s flourishing criteria (e.g., Does this advice promote long-term character growth?).
  3. Tangential Score: Relevance and safety (checking for off-topic drift or hallucinations).
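The geometric mean described above can be sketched as follows. The function name and the sample scores are illustrative (not part of the FAI specification); the point is that a geometric mean punishes uneven performance, so a response cannot offset a low Tangential score with a high Objective score.

```python
import math

def combined_response_score(objective: float, subjective: float, tangential: float) -> float:
    """Combine the three per-response metrics with a geometric mean.

    A single weak metric drags the combined score down sharply,
    unlike an arithmetic mean where strong metrics can mask it.
    Scores are assumed to be on a 0-100 scale.
    """
    return (objective * subjective * tangential) ** (1 / 3)

# Hypothetical scores for one response (not real benchmark data):
geo = combined_response_score(90.0, 75.0, 80.0)
arith = (90.0 + 75.0 + 80.0) / 3
print(round(geo, 1), round(arith, 1))
```

Note that the geometric mean here lands below the arithmetic mean; the gap widens as the three metrics diverge.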

Dual Standards: FAI-G vs. FAI-C

Depending on the target audience you are building for, you may want to evaluate models against one of two distinct standards:

| Standard | Description | Judge Persona |
| --- | --- | --- |
| FAI-G (General) | Measures broad human well-being and general flourishing. | Secular Domain Experts |
| FAI-C (Christian) | Measures alignment with biblical grounding and Christian moral clarity. | Christian Domain Experts |

Key Insights for Developers

Our latest Insights Report highlights critical performance deltas relevant to application development:
  • Frontier models (like GPT-4o or Claude 3.5 Sonnet) often perform well on pragmatic dimensions (Health, Finances) but systematically underperform on values-based dimensions (Faith, Meaning) due to safety-tuning that forces neutral responses.
  • There is often a significant score drop when comparing a model’s FAI-G score to its FAI-C score, indicating that generic models struggle with theological distinctiveness.
  • Gloo-hybrid models (fine-tuned on the Gloo stack) have demonstrated an increase of roughly 30 points in the Faith dimension compared to base frontier models.
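The FAI-G vs. FAI-C score drop mentioned above is easy to surface programmatically once you have per-dimension scores. The scores below are invented for illustration only; they are not taken from any leaderboard.

```python
# Hypothetical per-dimension scores (0-100 scale) for one model.
# These values are made up to illustrate the comparison, not real results.
fai_g = {"Health": 85, "Finances": 82, "Meaning": 74, "Faith": 70}
fai_c = {"Health": 83, "Finances": 80, "Meaning": 62, "Faith": 48}

# Per-dimension drop when moving from the general to the Christian standard.
deltas = {dim: fai_g[dim] - fai_c[dim] for dim in fai_g}

# The largest drop flags where the model lacks theological distinctiveness.
worst = max(deltas, key=deltas.get)
print(worst, deltas[worst])
```

In this illustrative data, the Faith dimension shows the steepest drop, mirroring the pattern the report describes.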

Additional Resources

If you want to learn more, check out the full research findings and access the leaderboards below: