
Testing LLM-enabled applications through evaluations


This page describes common methods for testing applications powered by large language models (LLMs) through evaluations.

Evaluations overview

Evaluations, also known as evals, are a methodology for assessing the quality of AI software.

Evaluations provide insights into the performance and efficiency of applications based on LLMs, and allow teams to quantify how well their AI implementation works, measure improvements, and catch regressions.

The term evaluation initially referred to a way to rate and compare AI models. It has since expanded to cover application-level testing, including Retrieval-Augmented Generation (RAG), function calling, and agent-based applications.

The evaluation process involves running a dataset of inputs through an LLM or application code, together with a method to determine whether each returned response matches the expected response (a minimal dataset sketch follows the list below). There are many evaluation methodologies, such as:

  • LLM-assisted evaluations

  • Human evaluations

  • Intrinsic and extrinsic evaluations

  • Bias and fairness checks

  • Readability evaluations
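For illustration, an evaluation dataset can be as simple as a list of inputs, each paired with an expected response and a checking method. The sketch below is hypothetical and not tied to any particular framework; field names such as input, expected, assertion, and threshold are placeholders.

```yaml
# Hypothetical eval dataset: each case pairs an input with an expected
# response and the method used to judge the model's answer.
# Field names are illustrative, not taken from any specific framework.
- input: "Explain what a 'flaky test' is in one sentence."
  expected: "A test that passes and fails intermittently without code changes."
  assertion: similarity      # e.g. embedding-based match against the reference
  threshold: 0.8             # minimum acceptable similarity score
- input: "Does this reply contain medical advice? Answer yes or no."
  expected: "no"
  assertion: exact-match     # deterministic string comparison
```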

Evaluations can cover many aspects of a model's performance, including its ability to:

  • Comprehend specific jargon

  • Make accurate predictions

  • Avoid hallucinations and generate relevant content

  • Respond in a fair and unbiased way, and within a specific style

  • Avoid certain expressions

Automating evaluations with CircleCI

Evaluations can be expressed as classic software tests, typically characterised by the "input, expected output, assertion" format, and as such they can be automated into CircleCI pipelines.

There are two important differences between evals and classic software tests to keep in mind:

  • LLMs are predominantly non-deterministic, leading to flaky evaluations, unlike deterministic software tests.

  • Evaluation results are subjective. Small regressions in a metric might not necessarily be a cause for concern, unlike failing tests in regular software testing.

With CircleCI, you can define, automate, and run evaluations using your preferred evaluation framework. By declaring the necessary commands in your config.yml, you can ensure these evaluations run within your CircleCI pipeline.
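For example, a minimal config.yml might run an evaluation suite as an ordinary job. This is a sketch only: the job name, Docker image, and the script and file names (run_evals.py, evals/dataset.yaml, results.json) are placeholders for whatever your chosen evaluation framework uses.

```yaml
version: 2.1

jobs:
  run-evals:
    docker:
      - image: cimg/python:3.12
    steps:
      - checkout
      - run:
          name: Install dependencies
          command: pip install -r requirements.txt
      - run:
          name: Run evaluations
          # Replace with the command of your evaluation framework.
          command: python run_evals.py --dataset evals/dataset.yaml --output results.json
      - store_artifacts:
          # Keep the raw results available for review in the CircleCI web app.
          path: results.json

workflows:
  evaluate:
    jobs:
      - run-evals
```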

Using an open source library or third-party tools can simplify defining evaluations, tracking progress, and reviewing the evaluation outcomes.

The CircleCI Evals orb

CircleCI provides an official Evals orb that simplifies the definition and execution of evaluation jobs using popular third-party tools, and generates reports of evaluation results.

Given the volatile nature of evaluations, those orchestrated by the CircleCI Evals orb do not halt the pipeline when an evaluation fails. This approach ensures that the inherent flakiness of evaluations does not disrupt the development cycle.

Instead, a summary of the evaluation results is created and presented:

  • As a comment on the corresponding GitHub pull request (currently available only for projects integrated with GitHub OAuth)

  • As an artifact within the CircleCI web app

You can review the summary and, if required, proceed to a detailed analysis of individual evaluations on your third-party evaluation platform.

Further documentation on how to use the orb is available on the Orb page. Orb usage examples are available in the LLM-eval-examples sample repository.
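As a rough, hedged sketch of how the orb is wired into a pipeline: an orb is imported through the orbs key and its jobs are referenced in a workflow. The orb reference below assumes the circleci namespace, and the job and parameter names are hypothetical; the actual interface is documented on the Orb page and in the LLM-eval-examples repository.

```yaml
version: 2.1

orbs:
  # Import the Evals orb. Pin a published version from the Orb page
  # instead of "volatile" in real pipelines.
  evals: circleci/evals@volatile

workflows:
  evaluate:
    jobs:
      # Hypothetical job and parameter names: the actual jobs and
      # parameters are documented on the Orb page.
      - evals/run:
          context: llmops
```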

Storing credentials for your evaluations

CircleCI makes it easy to store your credentials for LLM providers as well as LLMOps tools. Navigate to Project Settings > LLMOps to enter, verify, and access your OpenAI secrets. You will also find a starting template for your config.yml file.

You can also save the credentials to your preferred evaluation platform, including Braintrust and LangSmith. These credentials can then be used when setting up a pipeline that leverages the Evals orb.

To get started, navigate to Project Settings > LLMOps.

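Once stored, credentials reach your evaluation jobs without being committed to the repository: values saved as project environment variables are injected into the project's jobs, and credentials kept in a context can be attached to individual jobs. The snippet below is an illustrative excerpt extending the earlier config.yml example; llmops is a placeholder context name, and the variable names in the comment are examples.

```yaml
# Excerpt extending the config.yml example above. Attach the context (or rely
# on project environment variables) so the job can read credentials such as
# OPENAI_API_KEY, BRAINTRUST_API_KEY, or LANGSMITH_API_KEY at runtime.
workflows:
  evaluate:
    jobs:
      - run-evals:
          context: llmops   # placeholder context name
```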
