Using SSE to load test an LLM API
Learn how to use server-sent events to load test an LLM API
Server-sent events (SSE) is a server-push technology that allows clients to receive updates from a server over an HTTP connection: once the initial client connection is established, the server can initiate data transmission to the client at any time.
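To make that concrete, an SSE response is plain text streamed over a kept-open HTTP connection, with each event carried on data: lines and separated by blank lines. A minimal illustration (the payloads here are placeholders, not actual OpenAI output):

HTTP/1.1 200 OK
Content-Type: text/event-stream

data: first chunk of the answer

data: next chunk of the answer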
Gatling allows you to test server-sent events as an extension of the HTTP DSL.
Step 1: Setting up the project
First, ensure you have Gatling installed and set up. If you don't have Gatling installed, you can download it from the documentation. Then clone this project from GitHub and open the articles/load-test-llm-sse folder.
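Assuming you work from a local clone, the setup looks like this (the repository URL is left as a placeholder):

git clone <repository-url>
cd <repository>/articles/load-test-llm-sse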
Step 2: Configuring the HTTP protocol
Picture yourself at a grand theater in Paris, comfortably seated and admiring the set and ambiance. In Gatling, just as the theater environment shapes the audience experience, the HTTP protocol provides the framework for your test scenarios. The baseUrl defines where the performance takes place, guiding all interactions to the correct destination.
In your Gatling project, configure the HTTP protocol to specify the base URL of the OpenAI (ChatGPT) API. We also use sseUnmatchedInboundMessageBufferSize to buffer inbound SSE messages that don't match a check, so we can process them later.
import static io.gatling.javaapi.core.CoreDsl.*;
import static io.gatling.javaapi.http.HttpDsl.*;

import io.gatling.javaapi.core.*;
import io.gatling.javaapi.http.*;

public class SSELLM extends Simulation {

  // Read the OpenAI API key from the environment
  String api_key = System.getenv("api_key");

  HttpProtocolBuilder httpProtocol =
      http.baseUrl("https://api.openai.com/v1/chat")
          // buffer up to 100 unmatched inbound SSE messages for later processing
          .sseUnmatchedInboundMessageBufferSize(100);
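As a variation (not required by this guide), the Authorization header could be declared once at the protocol level instead of on each request:

// Variation: set the Authorization header once for every request.
HttpProtocolBuilder httpProtocolWithAuth =
    http.baseUrl("https://api.openai.com/v1/chat")
        .header("Authorization", "Bearer " + api_key)
        .sseUnmatchedInboundMessageBufferSize(100);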
Step 3: Defining the scenario
Now the play has started: the actors take the stage and follow their scripts. In Gatling, we call this a scenario, and it defines the steps your test will take (connecting, parsing messages, user interactions, etc.).
In our case, the scenario is pretty small. Users will:
- connect to the completions endpoint of OpenAI,
- send a prompt using SSE,
- process all the messages until ChatGPT sends us {"data":"[DONE]"},
- close the SSE connection.
ScenarioBuilder prompt = scenario("Scenario").exec(
    // open the SSE stream and send the prompt
    sse("Connect to LLM and get Answer")
        .post("/completions")
        .header("Authorization", "Bearer " + api_key)
        .body(StringBody("{\"model\": \"gpt-3.5-turbo\",\"stream\":true,\"messages\":[{\"role\":\"user\",\"content\":\"Just say HI\"}]}"))
        .asJson(),
    // consume inbound messages until the [DONE] marker arrives
    asLongAs("#{stop.isUndefined()}").on(
        sse.processUnmatchedMessages((messages, session) -> {
            // set "stop" once the [DONE] marker is seen in any buffered message
            return messages.stream()
                .anyMatch(message -> message.message().contains("{\"data\":\"[DONE]\"}"))
                    ? session.set("stop", true)
                    : session;
        })
    ),
    // release the connection
    sse("close").close()
);
The processUnmatchedMessages method allows us to process the buffered inbound messages. The handler receives all the messages that ChatGPT sent us, and when we receive {"data":"[DONE]"}, we set a stop session attribute to true in order to exit the loop.
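The handler can do more than watch for the terminator. As a sketch (the chunks session attribute is our own illustrative addition, not part of the original guide), it could also count how many SSE chunks each user received:

// Sketch: count streamed chunks while still detecting [DONE].
sse.processUnmatchedMessages((messages, session) -> {
    int seen = session.contains("chunks") ? session.getInt("chunks") : 0;
    Session updated = session.set("chunks", seen + messages.size());
    boolean done = messages.stream()
        .anyMatch(message -> message.message().contains("{\"data\":\"[DONE]\"}"));
    return done ? updated.set("stop", true) : updated;
})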
Step 4: Injecting users
As the audience arrives and fills their seats, the theater comes alive. In Gatling, this is the injection profile. It lets you choose how and when users enter your test, whether gradually, all at once, or in waves.
In this guide, we will simulate a small number of users (10) arriving all at once. Do you want to use different user arrival profiles? Check out our various injection profiles.
{
    setUp(
        prompt.injectOpen(atOnceUsers(10))
    ).protocols(httpProtocol);
}
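If you would rather have users arrive gradually, a ramp profile is one option (a variation on the guide's setup, not what the project ships with):

// Variation: ramp the same 10 users over 30 seconds instead of all at once.
{
    setUp(
        prompt.injectOpen(rampUsers(10).during(30))
    ).protocols(httpProtocol);
}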
Step 5: Running the simulation
Run the simulation to see how the LLM API handles the load.
First, set the API key environment variable.
Linux/macOS:
export api_key=<API-token-value>
Windows:
set api_key=<API-token-value>
Then launch the test.
Linux/macOS:
./mvnw gatling:test
Windows:
mvnw.cmd gatling:test
Step 6: Analyzing the results
After the simulation completes, Gatling prints a link to the HTML report in the terminal. Review metrics such as response times and the number of successful and failed connections to spot potential issues with your service.
Conclusion
By extending its SSE support with a post method, Gatling enables load testing of applications that stream responses this way, LLMs being a prime example. This practical example using the OpenAI API demonstrates how you can use Gatling to ensure your applications effectively handle user demand. So, don't streSSE about it: use Gatling to keep your servers and users happy.