📰 May 30th, 2023. Issue #14
While trust & safety is at the forefront of most community-driven platforms, anyone actually integrating an LLM into their app knows that tuning models and automated systems is a frustrating, time-consuming problem. Maintaining consistency across your automated systems can feel like a huge task, especially when your providers don't work nearly as well as they say they do.
I outline a general process grounded in systematic engineering principles and introduce Apollo API, an open-source tool that implements four types of grading systems: programmatic, semantic, LLM-based, and human-based.
Why do we need this?
Mitchell Hashimoto defined the term blind prompting to describe the trial-and-error approach to prompt engineering. Tuning a complex prompt is like playing whack-a-mole.
Specific to T&S, the prompts we feed in are written samples of UGC. Once you solve a problem for one use case, something else breaks in an unrelated edge case. This unpredictability means you hit the point of diminishing returns before reaching reliable, quality output.
Trial-and-error is also impossible to scale across large organizations, because you’ll inevitably be peppered with anecdotal feedback on model outputs. This makes it likely that the final prompt (piece of UGC) is chosen by the loudest voice, instead of by its quality.
For these reasons, trial-and-error prompting is a terribly inefficient way to spend development resources, and quickly reaches diminishing returns.
A process for systematic engineering
A great engineering process will ground improvements in quantitative terms. You should be able to say things like, “this new Fraud stack or Safety stack performs better with a precision rate of 93% compared to 85%, and a recall rate of 87% compared to 76%.”
With this goal in mind, here is the process that I follow for “engineering” quality output:
Define test cases: Identify relevant scenarios and inputs for your application. Create a set of user generated content and test cases that closely represent these scenarios.
Create a hypothesis and prepare an evaluation: Once you have an idea for improving your model's output, specify the templates, test cases, and models you want to test. This creates num_models * num_templates * num_inputs prompt candidates (see the sketch after this list).
Run the evaluation: Record the model outputs for each piece of UGC along with other metrics of interest (speed, cost, token usage, etc.)
Create a grading rubric: Ideally one would grade outputs by quantitative metrics such as precision and recall. In other cases, subjective criteria such as empathy or coherence may be more important. Mark each output pass/fail, or give it a score.
Analyze the results: Compare results side-by-side and review metrics. Select the prompt with the highest total score.
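To make the arithmetic in step 2 concrete, here's a minimal plain-Python sketch of how models, templates, and inputs expand into prompt candidates (the templates and models below are illustrative, not tied to apollo-sdk):

from itertools import product

# Hypothetical templates, models, and UGC test cases, for illustration only
templates = [
    "You're a moderator for platform xyz. Does this post violate the {policy} policy? Post: {post}",
    "Classify the following post as 'violating' or 'ok' under the {policy} policy: {post}",
]
models = ["openai:completion", "openai:chat"]
test_cases = [
    {"policy": "hate", "post": "this is hate speech"},
    {"policy": "spam", "post": "this is spam"},
]

# Every combination of model x template x input is one prompt candidate
candidates = [
    {"model": model, "prompt": template.format(**case), "case": case}
    for model, template, case in product(models, templates, test_cases)
]
print(len(candidates))  # 2 * 2 * 2 = 8 candidate evaluations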
None of this is very innovative or surprising - it's pretty normal stuff for iterating on a complex system and making informed decisions. What hasn't become commonplace yet is applying this level of rigor to UGC or general prompts.
Scaling this approach
"Analyze the results" leaves a lot of room for interpretation. In the most basic case, the engineer would simply eyeball the output and mark each test output pass/fail. The winner is simply the piece of UGC that passed the most test cases.
But if you're iterating frequently, and you have a lot of test cases, the manual approach won't be feasible.
Programmatic test cases
Just like a normal unit test, the engineer writes plain old vanilla code to check for some property of the output.
Because programmatic evaluation is cheap, quick, and deterministic, it should be preferred whenever possible.
This approach can be used to test expectations like:
Ensure the output contains the desired result
Ensure the output has a valid ranking
Ensure that the output contains the correct categorical classification
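For example, a programmatic check for the categorical-classification case might look something like this minimal sketch (the helper and expected label are illustrative):

def check_classification(output: str, expected_label: str) -> bool:
    """Pass if the model output mentions the expected policy label."""
    return expected_label.lower() in output.lower()

# Illustrative usage against a model output
output = "This post violates the spam policy."
assert check_classification(output, "spam")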
How to evaluate multiple prompts programmatically with the apollo CLI
First, we'll set up apollo-sdk and create a template directory by running apollo-sdk init.
Let's edit the prompts.txt file to include some prompt or expected UGC variations.
You're a moderator for platform xyz.
---
You are reviewing a piece of UGC. Why is this {{post}} violating {{policy}}?
Next, create a vars.csv file with your test values for policy and the posts you want to evaluate.
The test runner uses the expected value to determine whether each test passes; the condition is evaluated as a Jinja template and expects a value.
policy, post
hate, this is hate speech
spam, this is spam
threats, this is a threat
Now, run the test.
apollo-sdk -p ./prompts.txt -v ./vars.csv -r openai:completion
This produces a matrix output view comparing outputs across prompts and variables. (For this use case we're using the openai:completion model for prompt testing, but using a custom provider to review the accuracy of scored content is a more realistic implementation.)
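To make "evaluated as a Jinja template" a little more concrete, here's a rough standalone illustration using the jinja2 library directly; this is not apollo-sdk code, and the condition shown is hypothetical:

from jinja2 import Template

# A hypothetical pass condition: the model's answer must mention the policy under test
condition = "{{ policy in output.lower() }}"

output = "This post violates the spam policy because it advertises a product."
rendered = Template(condition).render(policy="spam", output=output)

passed = rendered.strip() == "True"
print(passed)  # True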
Semantic evaluation
Semantic evaluation assesses the relatedness between the expected and output text by focusing on their underlying meanings, rather than relying solely on exact word matches. This is done with text embedding models such as OpenAI's Ada model.
Semantic grading is useful for cases where multiple correct answers exist, or where the specific wording isn't as important as the overall meaning.
Example use cases:
Summarization
Text translation
Although testing semantic similarity isn't supported in apollo-sdk's current release, we're welcoming new contributors to help us build! Star the repo and pick an issue.
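In the meantime, here's a minimal sketch of what semantic grading can look like using OpenAI's ada embedding model directly (a standalone illustration, not apollo-sdk functionality; it assumes the pre-1.0 openai Python client, and the 0.85 threshold is an assumption you'd tune):

import numpy as np
import openai  # assumes OPENAI_API_KEY is set in the environment

def embed(text: str) -> np.ndarray:
    response = openai.Embedding.create(model="text-embedding-ada-002", input=text)
    return np.array(response["data"][0]["embedding"])

def semantic_pass(expected: str, output: str, threshold: float = 0.85) -> bool:
    """Pass if the output is semantically close to the expected answer."""
    a, b = embed(expected), embed(output)
    similarity = float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
    return similarity >= threshold

print(semantic_pass("This post is spam.", "The content is unsolicited advertising."))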
Outsource evaluation to an LLM
Sometimes output evaluation just can't be reduced to a handful of logical checks. Depending on the nature of your criteria, you may be able to trust an LLM to do the grading, or at least do a first pass. This may be cheaper and quicker than a human.
The model that grades outputs can be different from the model that produced the outputs. For example, you might prefer a model with superior reasoning capability.
Examples of LLM-graded expectations include:
Grammar and spelling
Presence of phrases, topics, or themes
Presence of specific categories of information (e.g. ensure output includes an address, datetime, code, etc.)
Depending on how strict your requirements are, you can also ask the LLM to evaluate very subjective criteria such as tone.
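Here's a rough sketch of what an LLM grader can look like (again a standalone illustration rather than apollo-sdk functionality; it assumes the pre-1.0 openai Python client, and the rubric and model choice are assumptions):

import openai  # assumes OPENAI_API_KEY is set in the environment

GRADER_PROMPT = """You are grading a moderation model's output.
Rubric: the output must name the violated policy and explain why in one sentence.
Output to grade:
{output}
Reply with exactly PASS or FAIL."""

def llm_grade(output: str) -> bool:
    response = openai.ChatCompletion.create(
        model="gpt-4",
        messages=[{"role": "user", "content": GRADER_PROMPT.format(output=output)}],
        temperature=0,
    )
    verdict = response["choices"][0]["message"]["content"].strip().upper()
    return verdict.startswith("PASS")

print(llm_grade("This post violates the spam policy because it advertises a scam."))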
Outsource evaluation to other humans
Sometimes you won't be able to evaluate prompts programmatically or with AI. This might be the case if quality evaluation is so subjective that it requires multiple datapoints or special training to evaluate.
In this case, you can outsource rubric grading to human raters. The raters could either score outputs individually, or choose the superior output from a lineup.
Examples of subjective expectations might include:
Coherence: how well the response flows and maintains logical connections
Empathy: whether the response shows understanding and compassion
Tone: whether the output tone conforms to some standard
Values alignment: whether there are signs of bias
After running all the test cases, you can take the test outputs and present them to a human for grading.
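A simple way to aggregate those human judgments is a scoring sheet keyed by test case; the sketch below is hypothetical and not an apollo-sdk format:

from collections import defaultdict

# Hypothetical pass/fail scores collected from two raters (1 = pass, 0 = fail)
ratings = [
    {"test_case": "hate-001", "rater": "alice", "score": 1},
    {"test_case": "hate-001", "rater": "bob", "score": 1},
    {"test_case": "spam-002", "rater": "alice", "score": 0},
    {"test_case": "spam-002", "rater": "bob", "score": 1},
]

scores_by_case = defaultdict(list)
for rating in ratings:
    scores_by_case[rating["test_case"]].append(rating["score"])

for case, scores in scores_by_case.items():
    print(case, sum(scores) / len(scores))  # average rater score per test case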
How to export prompt candidates and outputs with apollo-sdk
Assuming you've set up apollo-sdk and are running tests, you can get outputs in front of human raters by exporting them to a portable format (CSV or JSON) and displaying the results in your preferred interface:
apollo-sdk -o results.csv
or
apollo-sdk -o results.json
Closing the feedback loop
As a company scales, instead of manually assembling a golden dataset, it can achieve quality across a broad range of inputs by collecting test cases from moderators and T&S managers.
In practice, this means asking moderators to mark particularly good or bad LLM outputs. For example, collecting 👍/👎 ratings will give you some signal on cases that are particularly interesting or valuable. This has the added bonus of helping you fine-tune a model if that's something you want to do eventually.
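A hypothetical sketch of turning moderator 👍/👎 feedback into new test cases (the record shape and file name are assumptions):

import json

def record_feedback(post: str, policy: str, model_output: str, thumbs_up: bool,
                    path: str = "golden_dataset.jsonl") -> None:
    """Append a moderator's rating of a model output to a growing test-case file."""
    record = {
        "post": post,
        "policy": policy,
        "model_output": model_output,
        "label": "good" if thumbs_up else "bad",
    }
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")

# A moderator flags a bad output; it becomes a regression test case
record_feedback("this is spam", "spam", "No violation found.", thumbs_up=False)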
What's outlined in this post is one part of a larger system:
UGC tuning & evaluation
Prompt/UGC version control
Continuous integration and deployment
Once a prompt is validated through evaluation, the continuous integration system will release it to a staging or production environment, or a live experiment. Ideally, prompt/UGC evaluations become part of our development infrastructure in the same way that unit tests are.
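As a sketch of what that CI gate might look like, assuming you export results with apollo-sdk -o results.json as shown earlier (the schema with a boolean pass field and the 95% threshold are assumptions; the real export format may differ):

import json
import sys

PASS_RATE_THRESHOLD = 0.95  # assumption: block releases below a 95% pass rate

# Assumption: results.json is a list of records with a boolean "pass" field
with open("results.json") as f:
    results = json.load(f)

pass_rate = sum(1 for r in results if r.get("pass")) / len(results)
print(f"pass rate: {pass_rate:.2%}")

if pass_rate < PASS_RATE_THRESHOLD:
    sys.exit(1)  # fail the CI job so the prompt change doesn't ship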
Other prompt engineering principles
Here are some principles that apply to all prompt/systematic engineering. They basically boil down to:
Avoid prompt/UGC engineering whenever possible
Avoid subjective grading criteria whenever possible
Prefer small, testable prompts/pieces of UGC
Prefer concise prompts that are specific enough to generate a limited range of potential outputs. This helps minimize edge cases and makes it easier to automate evaluation.
With this approach, you can work toward full automation (you can automate with apollo as well ;) ) and deploy new changes without worrying about unexpected regressions.
Here's an example of extending your tests with embedded automation:
# import the package
from apollo.client import Apollo
# Use our custom model to test building decisions; token defaults to the sandbox API if none is provided
Apollo.use("apollo", token="YOUR_API_TOKEN_HERE")  # If you have a token
# Let's check to see if a phrase contains threats
Apollo.detectText("Phrase1", "contains", "Threats")
# Create custom rules which creates a task!
Apollo.rule('Phrase1', '>=', '0.8')
# Connect with other models!
Apollo.use('Google', "violence", ...)
Apollo.detectImage('Image1', 'contains', 'VERY_LIKELY') # Image Analysis/OCR
Apollo.detectSpeech('Audio1', 'contains', 'UNLIKELY') # Audio Processing
Apollo.detectVideo('Video1', 'contains', 'POSSIBLE') # Video Analysis
Apollo.detectText('Phrase1', 'contains', 'UNKNOWN') # Text Analysis
Simplify your evaluation rubric
A simple rubric helps streamline the evaluation process and minimize subjectivity. Focus on the most important criteria for your application and establish clear guidelines for each metric.
Prefer to use programmatic tests, then semantic tests, LLM tests, and lastly human raters.
Fine-tune
When it comes down to it, prompt engineering is not 100% reliable for certain use cases. Fine-tuning allows you to focus more on the overall system and less on tweaking prompts, but it's expensive and requires dedicated resources.
I'm building this
This blog post is just a really long way to say, I haven't found solutions to any of the above in the wild, so I'm building my own.
Check out Apollo API, an open-source toolkit for integrity engineering that implements the process above and provides a way to extend model management with automation & data integrations.
Most notably, it includes a CLI that outputs a matrix view for quickly comparing outputs across multiple prompts, variables, and models. This means that you can easily compare prompts/scored pieces of UGC over hundreds or thousands of test cases.