How to Add Confidence Intervals to LLM Judges

Precision-Based Sampling for LLM Judges

May 17, 2025

Overview

In this article, we will answer the questions:

  • How do we counteract sampling noise in LLM judges?
  • How many evaluator repetitions are enough?
  • How does cost scale with reliability and quality?

We will use confidence intervals to quantify the required level of score precision, and sample statistics to estimate the number of samples required to achieve that precision.

We will also derive some rules-of-thumb scaling laws for how this strategy affects the time, cost, and quality of the evaluation.

We will derive that if we collect IID samples of a scalar evaluator's score, the expected number of samples $n$ required to achieve a $100(1-\alpha)\%$ confidence interval for the mean is:

$$n \;=\; 9\, z_{\alpha/2}^{2}\, K^{2}\, \delta^{2}$$

and the mean score has the following confidence interval:

$$\bar{X} \pm d \;=\; \bar{X} \pm \frac{R}{3K}$$

where:

  • $n$ is the total number of samples collected
  • $z_{\alpha/2}$ is the z-score for the confidence level $\alpha$ (statistical confidence)
  • $K$ is the number of bins or classes (granularity of the score)
  • $R$ is the range of the score (max - min)
  • $\delta$ is the normalized sample standard deviation (noisiness/discriminative power of the evaluator)

For example, suppose the evaluator's scores follow this (hidden) distribution on a 1-5 Likert scale, with mean = 4 and std = 0.6:

$$p(x) = \begin{cases} 0.0000 & x = 1 \\ 0.0010 & x = 2 \\ 0.1772 & x = 3 \\ 0.6429 & x = 4 \\ 0.1789 & x = 5 \end{cases}$$

Then the expected number of samples required for a 90% confidence interval for the mean, with half-width $d$, is:

$$n \;=\; 9\,(1.645)^{2}\, 5^{2}\, (0.6/4)^{2} \;\approx\; 14$$

We will use simulation to see that this holds true.
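Here is a minimal simulation sketch of that check: it draws scores from the hidden distribution above and applies the stopping rule derived later in this article (a 90% CI with half-width $d = R/3K$).

import math
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
values, probs = [1, 2, 3, 4, 5], [0.0000, 0.0010, 0.1772, 0.6429, 0.1789]
z = stats.norm.ppf(0.95)  # z-score for a 90% two-sided CI
K, R = 5, 4               # 5 Likert classes, range 5 - 1 = 4
d = R / (3 * K)           # target half-width

stopping_points = []
for _ in range(1000):
    x = list(rng.choice(values, size=10, p=probs))        # pilot samples
    while z * np.std(x, ddof=1) / math.sqrt(len(x)) > d:  # CI still too wide
        x.append(rng.choice(values, p=probs))              # draw one more score
    stopping_points.append(len(x))
print(np.mean(stopping_points))  # typically lands close to the predicted ~14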

[Figure: Parameter Effects]

Git repo with code: https://github.com/sunnybak/precision-based-sampling


Problem: LLM evaluators are stochastic

Let's say you wish to evaluate the quality of your agent's output using an LLM as a judge. You might see something as follows:

|        | evaluator 1 | evaluator 2 | evaluator 3 |
|--------|-------------|-------------|-------------|
| scores | 4           | 5           | 5           |

At face value, this might seem great. However, when you run each evaluator on the same prompt 5 times, the results might tell a different story:

|             | evaluator 1   | evaluator 2   | evaluator 3   |
|-------------|---------------|---------------|---------------|
| scores      | 4, 4, 5, 5, 5 | 5, 4, 3, 4, 2 | 5, 4, 5, 4, 5 |
| mean scores | 4.6           | 3.6           | 4.6           |
| std scores  | 0.55          | 1.14          | 0.55          |

Evaluator 2 seems to be producing scores that have a higher variance than the other evaluators, whereas evaluator 3 seems to be more consistent.

This does not necessarily mean that evaluator 2 is worse than evaluator 3. There are some explanations as to why evaluator 2 has more variance than evaluator 3:

  • the metric measured by evaluator 2 is harder to grade, more subjective, or more general
  • evaluator 2 is set up to discriminate with more granularity than evaluator 3

An example of evaluator 2 could be "How well does the agent understand the user's intent?".

Evaluator 3 could be "Is the output toxic?", a metric that is easier to grade and likely has a lower variance.

Now, let's define the problem and work through the solution step by step.

Setup

Let's assume that an evaluator maps outputs to a 1D scalar score on a specific dimension (e.g., helpfulness, correctness, creativity).

This scalar score is then discretized into $K$ bins or classes, e.g., Likert 1–5 (5 classes) or binary "Pass/Fail" (2 classes).

Note: it's not recommended to evaluate on more than 1 dimension at a time, since the evaluator scores will likely be biased if multiple dimensions are evaluated at once.

Let's also assume that there is some objective truth to the score, and that the evaluator's score is a noisy estimate of that truth.

If we take $n$ IID samples of the evaluator's score, we can estimate the mean and variance of the score.

Precision Criteria

We want to keep sampling until the two-sided confidence interval half-width is at most $d$. In other words:

$$P\left( \mu - d \leq \bar{X} \leq \mu + d \right) \geq 1 - \alpha$$

For example, if we want to be 95% confident that the true mean is within $\pm d$ of the sample mean, we can set $\alpha = 0.05$.

Now let's give more thought to the value of $d$.

Value of the half-width $d$

The value of $d$ must be chosen carefully to be able to distinguish between the classes. This number depends on 2 factors:

  • the range $R$ of the scale being used
  • the number of bins or classes $K$ we want to sort into

Each bin is thus $R/K$ wide. The maximum half-width that guarantees no overlap between adjacent intervals is $R/2K$. It is best practice to also leave some buffer between the intervals so that the CI sits comfortably inside its bin, and so we use thirds instead of halves. This means that our intervals are not only disjoint but also leave some buffer between each other, increasing precision. The formula we will use is

$$d = \frac{R}{3K}$$

Here are some examples for different scales and their corresponding half-widths:

| Scale | Range $R$ | Classes $K$ | Half-width $d$ | Intervals |
|-------|-----------|-------------|----------------|-----------|
| 1-5   | 4         | 5           | 0.267          | (0.733, 1.267), (1.733, 2.267), ... |
| 1-10  | 9         | 10          | 0.3            | (0.7, 1.3), (1.7, 2.3), ... |
| 0-1   | 1         | 3           | 0.111          | (-0.111, 0.111), (0.389, 0.611), (0.889, 1.111) |
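These half-widths and intervals can be reproduced with a small helper sketch (the function names are illustrative, not from the repo). It assumes the acceptance intervals are centered on the bin centers, i.e., the integer scale points for Likert scales:

def half_width(R, K):
    # one-third-gap rule: a CI of this half-width sits comfortably inside its bin
    return R / (3 * K)

def intervals(scale_min, scale_max, K):
    R = scale_max - scale_min
    d = half_width(R, K)
    step = R / (K - 1) if K > 1 else 0              # spacing between bin centers
    centers = [scale_min + i * step for i in range(K)]
    return [(round(c - d, 3), round(c + d, 3)) for c in centers]

print(intervals(1, 5, 5))    # [(0.733, 1.267), (1.733, 2.267), ...]
print(intervals(1, 10, 10))  # [(0.7, 1.3), (1.7, 2.3), ...]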

Sequential Sampling Algorithm

The sequential sampling algorithm works by iteratively collecting samples until we achieve the desired precision. Here's how it works:

  1. Start with a small batch of pilot samples ($n_0 = 10$) to get an initial estimate of the variance
  2. Calculate the current confidence interval half-width $h$
  3. If $h$ is small enough for the required precision ($h \le d$), we're done
  4. Otherwise, estimate how many more samples we need based on the current variance
  5. Collect those additional samples and repeat from step 2

The algorithm is efficient because it adapts to the variance of the data: if the evaluator is very noisy (high variance), it will collect more samples, and if the evaluator is very precise (low variance), it will collect fewer samples.

The following is a Python implementation of the sequential sampling algorithm:

import math
import asyncio
import numpy as np
import scipy.stats as st

async def batch_eval(response, evaluator, n):
    # run the evaluator n times concurrently on the same response
    return await asyncio.gather(*(evaluator(response) for _ in range(n)))

async def seq_sample(response, evaluator, alpha, K, R):
    n0 = 10                         # number of pilot samples
    z = st.norm.ppf(1 - alpha / 2)  # z-score for the confidence interval
    d = R / (3 * K)                 # target half-width of the confidence interval
    x = list(await batch_eval(response, evaluator, n0))  # pilot samples
    while True:
        n = len(x)                  # current sample size
        std = np.std(x, ddof=1)     # sample standard deviation

        # current confidence interval half-width
        h = z * std / np.sqrt(n)
        if h <= d:
            # we have enough samples
            break

        delta = std / R             # normalized sample standard deviation
        n_req = math.ceil((3 * z * K * delta) ** 2)  # total number of samples required
        # get more samples, but limit the batch size
        n_additional = max(1, min(n_req - n, n0))
        x.extend(await batch_eval(response, evaluator, n_additional))
    return np.mean(x)               # return the mean of the samples
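As a quick usage sketch of the function above, here it is wired to a stand-in evaluator that simulates a noisy 1-5 judge (replace it with a real LLM call in practice):

import asyncio
import random

async def fake_evaluator(response):
    # stand-in for an LLM judge: returns a noisy Likert score around 4
    return random.choice([3, 4, 4, 4, 5])

mean_score = asyncio.run(seq_sample("some agent output", fake_evaluator,
                                    alpha=0.10, K=5, R=4))
print(round(mean_score, 2))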

You can find the full implementation in the repo: https://github.com/sunnybak/precision-based-sampling.

Mixed‑Expert Sampling

Given that we assume some true mean which is (theoretically) model-agnostic, one way to improve robustness is to sample from multiple LLMs as judges in the same batch and treat each judge's vote as just another IID sample. This reduces dependence on the quirks of any single model while requiring no change to the algorithm or the scaling laws. Here's how we can implement this in Python:

  • make a list of LLM models to sample from and set up their API keys
  • use konfigure for managing prompts and parameters
  • use itertools.cycle to cycle through the models in a round-robin fashion (with an asyncio lock for thread safety)
  • use litellm for model routing
  • sample from each model in the list using the cycle

Here's a simple implementation of mixed-expert sampling:

import asyncio
from itertools import cycle
from typing import Optional

import konfigure
from litellm import acompletion

config = konfigure.load('/path/to/config.yaml')

LIKERT_SCALE_MIN = 1  # lower end of the Likert scale
LIKERT_SCALE_MAX = 5  # upper end of the Likert scale

model_cycle = cycle(config.models)
prompt_cycle = cycle(config.prompts)
cycle_lock = asyncio.Lock()

async def get_llm_rating(response: str) -> Optional[int]:
    try:
        # get the next model and prompt in a thread-safe way
        async with cycle_lock:
            selected_model = next(model_cycle)
            selected_prompt = next(prompt_cycle)

        completion = await acompletion(
            model=selected_model,
            messages=[{
                "role": "user",
                "content": selected_prompt.render(
                    response=response,
                    likert_scale_min=LIKERT_SCALE_MIN,
                    likert_scale_max=LIKERT_SCALE_MAX,
                )
            }],
        )
        content = completion.choices[0].message.content
        return int(content.strip())
    except Exception:
        # any API or parsing failure yields no rating
        return None
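To plug this into the sequential sampler from earlier, the round-robin judge can simply be passed in as the evaluator. A small wiring sketch (the mixed_expert_judge wrapper is illustrative; it assumes the seq_sample function defined above and retries when a model returns an unparseable rating, so add a retry cap in practice):

import asyncio

async def mixed_expert_judge(response):
    # keep asking the next model in the cycle until we get a parseable rating
    while True:
        rating = await get_llm_rating(response)
        if rating is not None:
            return rating

mean_score = asyncio.run(seq_sample(
    "some agent output",
    mixed_expert_judge,
    alpha=0.10,
    K=LIKERT_SCALE_MAX - LIKERT_SCALE_MIN + 1,  # 5 classes for a 1-5 Likert scale
    R=LIKERT_SCALE_MAX - LIKERT_SCALE_MIN,      # range of 4
))
print(mean_score)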

The full implementation is in the repo: https://github.com/sunnybak/precision-based-sampling.

Estimating the Number of Samples to Poll

  1. CI width (large-$n$ normal approximation)
$$h_n \;=\; z_{\alpha/2}\,\frac{\sigma}{\sqrt{n}}$$
  2. Precision target from the one-third-gap rule
$$d = \frac{R}{3K}, \qquad \text{where } R = b - a \text{ is the range of the scale}$$
  3. Solve for $n_{\text{req}}$
$$h_n \le d \;\Longrightarrow\; n_{\text{req}} \;\ge\; \Bigl(z_{\alpha/2}\,\frac{\sigma}{d}\Bigr)^{2} = \Bigl(z_{\alpha/2}\,\frac{3K\sigma}{R}\Bigr)^{2}$$

Since we don't know the true standard deviation $\sigma$, we can replace it by the sample standard deviation $S$ to get the estimated number of samples required to achieve the desired precision:

$$n_{\text{req}} = \Bigl\lceil \Bigl(3\, z_{\alpha/2}\, K\, \frac{S}{R}\Bigr)^{2} \Bigr\rceil$$

Let us define the normalized scale-invariant sample standard deviation as

$$\delta = S/R$$

This is because we want to know the amount of variation irrespective of the scale being used. Finally, we get:

$$n_{\text{req}} \;=\; \Bigl\lceil\, 9\, z_{\alpha/2}^{2}\, K^{2}\, \delta^{2} \,\Bigr\rceil$$
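In code, this estimate is a one-liner; here is a minimal helper sketch (the name n_required is mine, not from the repo):

import math
from scipy import stats

def n_required(alpha, K, delta):
    # samples needed for a (1 - alpha) CI of half-width R / (3K),
    # where delta = S / R is the normalized sample standard deviation
    z = stats.norm.ppf(1 - alpha / 2)
    return math.ceil(9 * z**2 * K**2 * delta**2)

print(n_required(alpha=0.10, K=5, delta=0.15))  # 14, matching the Likert example from the overview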

Scaling Laws — how each knob changes the bottom line

Expected Number of Samples

The expected number of total samples needed to achieve a $100(1-\alpha)\%$ confidence interval, $n$, is:

$$n \;=\; 9\, z_{\alpha/2}^{2}\, K^{2}\, \delta^{2}$$

where

  • $z_{\alpha/2}$ is the z-score for the confidence level $\alpha$ (statistical confidence)
  • $K$ is the number of bins or classes (granularity of the score)
  • $\delta$ is the normalized sample standard deviation (noisiness/discriminative power of the evaluator)

Based on this, we get the following relationships:

$$n \;\propto\; z_{\alpha/2}^{2} \;\approx\; \ln(1/\alpha), \qquad n \;\propto\; \delta^{2}, \qquad n \;\propto\; K^{2}$$
  1. Confidence level $\alpha$: Sharper CIs get expensive slowly (log-like). Going from 95% to 99% confidence multiplies $n$ by only about 1.7. Reliability is relatively inexpensive!
  2. Normalized std-dev $\delta$: Halving variability quarters the required runs. If the evaluator has more variability, quadratically more samples are needed to be confident. This value is high for evaluators that produce a wider range of scores.
  3. Number of bins $K$: Increasing the number of bins $K$ over the same range quadratically increases the number of samples required, because we want higher granularity. For example, if a 0-1 scale with 2 bins requires 9 samples, the same scale with 4 bins will require about 36 (a quick numeric check follows this list).
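Here is that numeric check of the first and third relationships, assuming nothing beyond scipy:

from scipy import stats

z95 = stats.norm.ppf(0.975)  # z-score for a 95% CI, about 1.96
z99 = stats.norm.ppf(0.995)  # z-score for a 99% CI, about 2.576

# tightening 95% -> 99% multiplies n by (z99 / z95)^2
print(round((z99 / z95) ** 2, 2))  # ~1.73

# doubling the number of bins K multiplies n by (2K / K)^2
print((4 / 2) ** 2)  # 4.0: a 2-bin scale needing 9 samples needs ~36 with 4 bins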

[Figure: Parameter Effects]

Impacts on Time, Cost and Quality of Evaluation

$$\text{Cost} \;\propto\; n, \qquad \text{Latency} \;\propto\; \frac{n}{\text{mean\_batch\_size}}, \qquad \text{Quality} \uparrow \text{ as } \alpha \downarrow \text{ and } K \uparrow$$
  1. Since each sample uses roughly the same number of input and output tokens, cost is directly proportional to the total number of samples, assuming a constant price per token.
  2. Latency is proportional to the number of sequential batches of concurrent LLM calls, which is $n / \text{mean\_batch\_size}$. Increasing the minimum and maximum batch sizes leads to faster convergence, but can overshoot the number of samples required. (A back-of-the-envelope sketch follows this list.)
  3. Quality is some function of the confidence level $\alpha$ and the half-width $d$. Lower values of $\alpha$ and $d$ improve the reliability and granularity of the metric.
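Here is that back-of-the-envelope model in code; the token count, price, and per-batch latency below are placeholder numbers, not measurements:

import math

def eval_cost_and_latency(n, mean_batch_size,
                          tokens_per_call=800,       # placeholder: prompt + completion tokens per judge call
                          price_per_1k_tokens=0.01,  # placeholder: blended $ per 1K tokens
                          secs_per_batch=3.0):       # placeholder: latency of one concurrent batch
    cost = n * tokens_per_call / 1000 * price_per_1k_tokens
    latency = math.ceil(n / mean_batch_size) * secs_per_batch
    return cost, latency

print(eval_cost_and_latency(n=36, mean_batch_size=10))  # (0.288, 12.0): ~$0.29 and ~12 s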

Discussion — practical knobs & tips

Initial sample size

A pilot of 5-10 samples gives a rough but serviceable first variance estimate. If you have historical data, use its $\delta = S/R$ to seed $n_{\text{req}}$ before the first run.
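For example, seeding from historical judge scores might look like the following sketch (it reuses the n_required helper sketched earlier; the historical scores here are made up):

import numpy as np

historical_scores = [4, 5, 4, 3, 4, 4, 5, 4]   # hypothetical past judge outputs on a 1-5 scale
R, K = 4, 5

delta = np.std(historical_scores, ddof=1) / R  # normalized std from history
print(n_required(alpha=0.10, K=K, delta=delta))  # 16 for this toy data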

Batch size

If the batch size is too large, the latency will increase, since there are always some API calls that take much longer than the rest. From my experiments, 5-10 concurrent API calls work best for OpenAI.

Here's a latency plot for concurrent API calls for GPT-4.1

[Figure: Latency Analysis]

As you can see, the error bars on latency grow with the number of concurrent API calls, pushing the mean latency up.
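A sketch of how such a measurement could be produced with litellm (the model name and batch sizes are illustrative):

import asyncio
import time
from litellm import acompletion

async def time_batch(batch_size, model="gpt-4.1"):
    # fire batch_size identical calls concurrently and time the slowest one
    messages = [{"role": "user", "content": "Rate this response from 1 to 5: 'hello'"}]
    start = time.perf_counter()
    await asyncio.gather(*(acompletion(model=model, messages=messages)
                           for _ in range(batch_size)))
    return time.perf_counter() - start

async def main():
    for batch_size in (1, 5, 10, 20):
        print(batch_size, "concurrent calls:", round(await time_batch(batch_size), 2), "s")

asyncio.run(main())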

What if $n_{\text{req}}$ is still huge?

Depending on your use case, you might want to trade off cost, latency, and quality by bringing $n_{\text{req}}$ down:

  • Tighten the evaluator rubric so that $\sigma$ drops. A more specific rubric that triggers infrequently will have a lower $\sigma$.
  • Accept 90% CIs instead of 95%.
  • Reduce classes: going from 1-10 ratings to 1-5 ratings halves $K$ and cuts the number of samples (and hence cost) to a quarter.

Git repo with code: https://github.com/sunnybak/precision-based-sampling