Aidena

Evaluation & Testing

Frameworks for measuring LLM output quality, accuracy, and reliability

13 tools
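At their core, most tools in this category automate a loop like the one below: run a model over a dataset of prompt/expected pairs, score each output with a metric, and aggregate. This is a minimal, dependency-free sketch of that pattern; all names and data are illustrative, not any listed tool's API.

```python
# Generic evaluation loop: score model outputs against expected answers.
# Real frameworks layer on richer metrics, tracing, and dashboards.

def exact_match(output: str, expected: str) -> float:
    """Score 1.0 when the normalized output matches the expected answer."""
    return 1.0 if output.strip().lower() == expected.strip().lower() else 0.0

def run_eval(model, dataset, metric=exact_match) -> float:
    """Run `model` (a callable prompt -> str) over the dataset, average the metric."""
    scores = [metric(model(ex["prompt"]), ex["expected"]) for ex in dataset]
    return sum(scores) / len(scores)

# Toy model and dataset for demonstration only.
dataset = [
    {"prompt": "2 + 2 = ?", "expected": "4"},
    {"prompt": "Capital of France?", "expected": "Paris"},
]
toy_model = lambda prompt: {"2 + 2 = ?": "4", "Capital of France?": "paris"}[prompt]

accuracy = run_eval(toy_model, dataset)  # 1.0: both answers match after normalization
```

The tools below differ mainly in what they plug into each slot: the metrics (exact match, LLM-as-judge, safety checks), the dataset management, and the reporting around the loop.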

Arthur AI

Enterprise

Full-lifecycle AI evaluation and monitoring platform supporting ML, GenAI, and agentic systems. Offers built-in guardrails.

Braintrust

Freemium

End-to-end LLM evaluation and development platform with dataset management, a prompt playground, automated scoring, and tracing.

DeepEval

Freemium

Open-source LLM evaluation framework with 50+ research-backed metrics for testing AI applications. Differentiates with native Pytest integration for unit-test-style evals.

Evidently AI

Freemium

Open-source ML and LLM observability platform. Provides pre-built evaluators for text quality and LLM output correctness.

Giskard

Freemium

Open-source testing and vulnerability scanning framework for LLM applications. Detects hallucinations, bias, and harmful content.

Inspect AI

Open Source

LLM evaluation framework developed by the UK AI Safety Institute (AISI). Designed for safety and capability evaluations.

LM Evaluation Harness

Open Source

Unified framework by EleutherAI for evaluating language models across hundreds of academic benchmarks and tasks. The de facto standard for benchmarking open models.

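Benchmark harnesses like this typically frame tasks as multiple-choice: the model scores every candidate answer, the highest-scoring candidate becomes the prediction, and accuracy is aggregated across the task. Below is a dependency-free sketch of that pattern; the scorer and tasks are toys, and none of these names come from the lm-eval API.

```python
# Multiple-choice scoring pattern common to academic LLM benchmarks:
# the model rates each candidate answer; the top-rated one is the prediction.

def pick_choice(score_fn, question: str, choices: list[str]) -> int:
    """Return the index of the choice `score_fn` rates highest for this question."""
    scores = [score_fn(question, c) for c in choices]
    return scores.index(max(scores))

def accuracy(score_fn, tasks) -> float:
    """Fraction of tasks where the top-rated choice is the labeled answer."""
    correct = sum(
        pick_choice(score_fn, t["q"], t["choices"]) == t["answer"] for t in tasks
    )
    return correct / len(tasks)

# Toy "model": prefers the choice sharing more words with the question.
# Real harnesses use the model's log-likelihood of each choice instead.
def toy_score(question: str, choice: str) -> float:
    return len(set(question.lower().split()) & set(choice.lower().split()))

tasks = [
    {"q": "Which animal barks", "choices": ["the cat", "the dog that barks"], "answer": 1},
    {"q": "Which number is even", "choices": ["an even number", "seven"], "answer": 0},
]
```

Swapping the word-overlap scorer for per-choice log-likelihoods from a real model recovers the standard loglikelihood-based accuracy metric used by many benchmarks.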

OpenAI Evals

Open Source

Open-source framework for evaluating LLMs and LLM-based systems with a registry of pre-built benchmarks. Provides eval templates for building custom evaluations.

Parea AI

Freemium

Platform for testing, evaluating, and observing LLM applications with experiment tracking, human annotation, and production monitoring.

Promptfoo

Open Source

Open-source tool for testing and evaluating LLM prompts and models. Supports automated red teaming and A/B comparison of models and prompts.

RAGAS

Open Source

Open-source framework for evaluating Retrieval-Augmented Generation (RAG) pipelines using LLM-assisted metrics. Provides metrics such as faithfulness, answer relevancy, and context precision.

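Faithfulness, the flagship LLM-assisted RAG metric, is the fraction of claims in an answer that are supported by the retrieved context. Real implementations use an LLM to extract claims and judge support; the sketch below swaps the judge for a crude word-overlap check purely to show the metric's shape. It is not the RAGAS API, and every threshold here is an illustrative assumption.

```python
# Faithfulness-style RAG metric: what fraction of the answer's claims
# are supported by the retrieved context? An LLM judge normally does the
# claim extraction and support check; word overlap stands in for it here.

def claims(answer: str) -> list[str]:
    """Split an answer into claims; naive sentence splitting stands in for LLM extraction."""
    return [s.strip() for s in answer.split(".") if s.strip()]

def supported(claim: str, context: str, threshold: float = 0.5) -> bool:
    """Crude stand-in for an LLM judge: enough of the claim's words appear in the context."""
    words = set(claim.lower().split())
    return len(words & set(context.lower().split())) / len(words) >= threshold

def faithfulness(answer: str, context: str) -> float:
    cs = claims(answer)
    return sum(supported(c, context) for c in cs) / len(cs)

context = "Paris is the capital of France. It lies on the Seine river."
answer = "Paris is the capital of France. Paris has ten million robots."
# The first claim is supported by the context, the second is not -> 0.5
```

Related metrics follow the same judged-fraction shape with the roles swapped: answer relevancy judges claims against the question, and context precision judges retrieved chunks against the answer.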

TruLens

Open Source

Evaluation and tracking library for LLM applications built by TruEra. Provides feedback functions to evaluate RAG quality.

UpTrain

Freemium

Open-source LLM evaluation and observability platform with 20+ pre-built checks including response quality and factual accuracy.
