Our opinionated abstractions and internal tooling for building/maintaining production LLM systems
Arjun S on Apr 7, 2024
Metaforms (previously WorkHack) has always been a Gen-AI-native company. Since its inception, we've built and deployed a number of production LLM applications for both enterprise and consumer products. Every time we did a major refactor (or rebuilt from the ground up), we re-imagined the abstractions and tooling needed to rapidly build and maintain reliable LLM systems.
In this article, I'll walk through a few key design choices and systems we've put in place.
Modules & Abstractions
Previous state
Like most people, we started with simple wrappers to make LLM completion/embedding calls that abstract away details like selecting the right keys based on the preferred model, handling exceptions and retries based on error codes, and logging requests/responses.
Over time we added a few bells and whistles, like selecting models based on required tokens, maintaining a global state across services to store error rates for different keys/regions, and using that information for more intelligent key switching.
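For illustration, here's a stripped-down sketch of that kind of wrapper; the key pool, endpoint, and retry policy are assumptions rather than our actual code:

```typescript
// A stripped-down completion wrapper of the kind described above (illustrative only).
type ApiKey = { id: string; value: string; region: string; errorRate: number };

const keyPool: ApiKey[] = [
  { id: "us-1", value: process.env.LLM_KEY_US ?? "", region: "us-east-1", errorRate: 0 },
  { id: "eu-1", value: process.env.LLM_KEY_EU ?? "", region: "eu-west-1", errorRate: 0 },
];

// Pick the healthiest key based on the shared error-rate state.
function pickKey(): ApiKey {
  return [...keyPool].sort((a, b) => a.errorRate - b.errorRate)[0];
}

async function complete(prompt: string, maxRetries = 3): Promise<string> {
  for (let attempt = 0; attempt < maxRetries; attempt++) {
    const key = pickKey();
    try {
      const res = await fetch("https://api.openai.com/v1/chat/completions", {
        method: "POST",
        headers: { Authorization: `Bearer ${key.value}`, "Content-Type": "application/json" },
        body: JSON.stringify({ model: "gpt-4", messages: [{ role: "user", content: prompt }] }),
      });
      if (!res.ok) throw new Error(`HTTP ${res.status}`); // map error codes to retries
      const data = await res.json();
      console.log("llm_request", { keyId: key.id, promptChars: prompt.length }); // request/response logging
      return data.choices[0].message.content;
    } catch (err) {
      key.errorRate += 1; // feeds back into key selection on the next attempt
      if (attempt === maxRetries - 1) throw err;
    }
  }
  throw new Error("unreachable");
}
```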
Problems
Building new LLM-powered features was still not as fast as we wanted it to be (hours, not days). You still had to define controllers and services (where you'd also have to handle auth and access control) if the feature was served directly to the frontend (say, a sentence-completion feature), define types and input validations, write transformations, and so on. You also had to handle logging (we use Redshift for all event logs). It was clear there was a lot of boilerplate code, and it was time for another layer of abstraction.
Current state
Here's a look at how we build LLM-powered features in the backend now. An LLM-completion-based feature is just an instance of the "LLMComponentFunction" class, as shown below, which abstracts away all the boilerplate mentioned above. Writing an LLM component is as simple as writing sequential steps of transformations and completions. Need to expose it as an API (with runtime schema validation)? Need to define access-control rules? Need to enable streaming and clean/transform streamed phrases? All of this now happens declaratively.
In addition, you get out-of-the-box standard event logging for every request, along with intermediate results, first- and last-token latency, error codes, etc. These request events share the same trace ID as every other event inserted while processing that request, so you can write an SQL query to find, say, the total LLM usage cost incurred by a specific user, or by a specific feature across users. All the details abstracted away previously, like key rotation and model selection, still apply.
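As a hypothetical example of the kind of roll-up the shared trace ID enables (the table and column names below are assumptions, not our actual Redshift schema):

```typescript
// Hypothetical cost roll-up over the event log, joined on the request trace ID.
const llmCostByFeatureForUser = `
  SELECT e.feature_name,
         SUM(e.prompt_tokens * e.prompt_price_per_token +
             e.completion_tokens * e.completion_price_per_token) AS total_cost_usd
  FROM llm_request_events e
  JOIN request_traces t ON t.trace_id = e.trace_id
  WHERE t.user_id = :user_id
  GROUP BY e.feature_name
  ORDER BY total_cost_usd DESC;
`;
```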
Defining an LLM component
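A minimal sketch of the shape of such a definition. The LLMComponentFunction name is from our codebase, but the option names, types, and stub class below are illustrative assumptions rather than our exact API:

```typescript
// Illustrative only: a minimal stub so the sketch type-checks. The real
// LLMComponentFunction lives in our backend and its options differ.
import { z } from "zod";

type Step =
  | { type: "transform"; run: (input: Record<string, unknown>) => Record<string, unknown> }
  | { type: "completion"; model: string; temperature?: number };

interface LLMComponentOptions {
  name: string;                          // also the name the frontend hook refers to
  inputSchema: z.ZodTypeAny;             // runtime schema validation at the API boundary
  accessControl?: { roles: string[] };   // declarative access-control rules
  steps: Step[];                         // sequential transformations and completions
  streaming?: {
    enabled: boolean;
    minChunkChars?: number;                     // minimum characters per streamed chunk
    transformChunk?: (chunk: string) => string; // clean/transform streamed phrases
  };
}

class LLMComponentFunction {
  constructor(readonly options: LLMComponentOptions) {}
  // Placeholder: the real implementation validates input, checks access, runs the
  // steps, streams results, and logs events tagged with the request's trace ID.
  async run(input: unknown, _ctx?: { traceId?: string }): Promise<{ text: string }> {
    this.options.inputSchema.parse(input);
    return { text: "" };
  }
}

// Example component: sentence completion served directly to the frontend.
const sentenceCompletion = new LLMComponentFunction({
  name: "sentence_completion",
  inputSchema: z.object({ draft: z.string().min(1) }),
  accessControl: { roles: ["editor"] },
  steps: [
    { type: "transform", run: (input) => ({ prompt: `Complete this sentence: ${String(input.draft)}` }) },
    { type: "completion", model: "gpt-4", temperature: 0.3 },
  ],
  streaming: { enabled: true, minChunkChars: 20, transformChunk: (chunk) => chunk.trimEnd() },
});
```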
(Re)using an LLM component
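Continuing the sketch above, reusing the same component from another backend feature is just a function call (the run signature is an assumption, as before):

```typescript
// The same component can be called directly from another backend feature, while the
// framework also exposes it under its name as an HTTP/WebSocket route.
async function draftOpeningLine(formTitle: string): Promise<string> {
  // Validation, access control, logging, key rotation and model selection all come
  // from the component itself, not from this call site.
  const result = await sentenceCompletion.run(
    { draft: `An opening line for a form titled "${formTitle}"` },
    { traceId: `form:${formTitle}` },
  );
  return result.text;
}
```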
Especially for processing streaming results, the new LLM components have some nifty features up their sleeve. We use streaming for a variety of features where the resulting completion is returned to the client line by line as soon as it's processed. One example is the syntax-highlighter-like annotation of a user's natural-language description of the data points to collect: we mark data points, validation rules, conditions, etc. in different colors and let users hover to see the actions they can perform on them, right in the text field.
This being a latency-critical feature (we superimpose highlights within a second as the user is typing), streaming was an obvious choice. For most such stream-processing use cases, the code that iterates line by line gets a lot simpler if we can guarantee that each streamed chunk belongs to exactly one line and never straddles two. Similarly, guaranteeing that each chunk returned is at least N characters long means we avoid sending too many updates to the client when LLMs stream smaller token increments (we use WebSocket via AWS API Gateway, which means each update from the backend is sent to the gateway as a POST REST request).
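A minimal sketch of that re-chunking idea, assuming an async iterator of raw token deltas; this illustrates the guarantee rather than our exact implementation:

```typescript
// Buffer raw token deltas and only emit chunks that (a) never span a line break and
// (b) are at least `minChars` long, except when a line ends or the stream ends.
async function* rechunkByLine(
  tokens: AsyncIterable<string>,
  minChars = 24,
): AsyncGenerator<string> {
  let buffer = "";
  for await (const delta of tokens) {
    buffer += delta;
    let newlineIdx: number;
    // Flush every completed line as its own chunk.
    while ((newlineIdx = buffer.indexOf("\n")) !== -1) {
      yield buffer.slice(0, newlineIdx + 1);
      buffer = buffer.slice(newlineIdx + 1);
    }
    // Within a line, only flush once enough characters have accumulated, so each
    // WebSocket POST to API Gateway carries a meaningful update.
    if (buffer.length >= minChars) {
      yield buffer;
      buffer = "";
    }
  }
  if (buffer.length > 0) yield buffer; // flush the trailing partial line
}
```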
This abstraction extends to our frontend React codebase as well. There is a completion hook for both streaming and non-streaming requests: just plug in the component/function name used in the "LLMComponentFunction" instance in the backend code to initiate/stop streaming, and get back the loading state and results.
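Roughly, usage looks like the sketch below; the hook name, signature, and stub are assumptions about what such a hook could look like, not our actual frontend API:

```tsx
import { useCallback, useState } from "react";

// Minimal stub so the usage below compiles; illustrative only.
function useLLMStream(componentName: string) {
  const [result] = useState("");
  const [isLoading, setLoading] = useState(false);
  const start = useCallback((input: Record<string, unknown>) => {
    // Real hook: open the WebSocket, send { component: componentName, input },
    // append streamed chunks to the result, and clear isLoading when the stream ends.
    console.debug("start stream", componentName, input);
    setLoading(true);
  }, [componentName]);
  const stop = useCallback(() => setLoading(false), []);
  return { result, isLoading, start, stop };
}

// Usage: the name matches the LLMComponentFunction instance defined on the backend.
export function SentenceSuggestion({ draft }: { draft: string }) {
  const { result, isLoading, start, stop } = useLLMStream("sentence_completion");
  return (
    <div>
      <button onClick={() => start({ draft })} disabled={isLoading}>Suggest</button>
      {isLoading && <button onClick={stop}>Stop</button>}
      <p>{result}</p>
    </div>
  );
}
```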
Internal tooling
Prompt management
Problems
Decoupling from the codebase: Shipping prompt updates was slower when prompts were tracked in code. Non-developers did not have access to them and had to rely on developers.
Automated testing and benchmarking: Unit-testing tools were not suited to testing LLM features. These are typically long-running jobs, so you wouldn't want to block code deployments for that long. Such tools were also not convenient for running evaluations on results and storing metrics for review.
Solution
We ended up building a simple in-house admin console to manage prompts. First, it was more convenient to decouple prompts from the codebase and store them in a DB. Shipping prompt updates and rolling them back became much quicker: you don't have to wait for long CI/CD pipelines that run static type checking, unit tests, e2e tests, etc. just to ship a prompt update. You also don't necessarily have to be a developer to ship one.
Decoupling from the codebase meant we had to rebuild one critical piece of functionality git gave us: versioning and rollbacks. Since we owned the interface, this was fairly simple to do. Every prompt update is saved with a commit message to help track the reasoning behind it.
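A hypothetical sketch of what a versioned prompt record and a rollback can look like (the field names are assumptions, not our actual schema):

```typescript
// Hypothetical shape of a versioned prompt record stored in the DB.
interface PromptVersion {
  promptId: string;       // stable identifier referenced by LLM components
  version: number;        // monotonically increasing per prompt
  template: string;       // prompt text, with template variables
  commitMessage: string;  // the reasoning behind this update
  environment: "dev" | "production";
  createdBy: string;
  createdAt: Date;
}

// Rolling back just re-points the active version; history is never deleted.
function rollback(history: PromptVersion[], toVersion: number): PromptVersion {
  const target = history.find((v) => v.version === toVersion);
  if (!target) throw new Error(`No version ${toVersion} for prompt ${history[0]?.promptId}`);
  return target;
}
```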
We built an automated testing and eval batch job from scratch that gets triggered every time a prompt update is made. Curating a dataset for these evals is a high-effort task, so we skip them for V0 releases until we have enough feedback about a feature to know it's here to stay (we constantly ship and quickly kill features, and believe that's the fastest way to iterate).
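A rough sketch of the shape of such an eval job; the case format, scoring function, and metric storage are all assumptions:

```typescript
// Runs the updated prompt over a curated dataset and stores aggregate metrics for review,
// instead of blocking code deployments on a long-running job.
interface EvalCase { input: Record<string, unknown>; expected: string }

async function runEvalBatch(
  promptVersion: number,
  dataset: EvalCase[],
  complete: (input: Record<string, unknown>) => Promise<string>,
  score: (output: string, expected: string) => number, // e.g. exact match, rubric, or LLM-as-judge
) {
  const scores: number[] = [];
  for (const testCase of dataset) {
    const output = await complete(testCase.input);
    scores.push(score(output, testCase.expected));
  }
  const mean = scores.reduce((a, b) => a + b, 0) / Math.max(scores.length, 1);
  console.log("eval_run", { promptVersion, cases: dataset.length, meanScore: mean });
  return mean;
}
```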
And finally, we realized the best way to enforce processes is through tools that enable them. To ship a prompt update, you update it in dev and then click "Migrate to production". That brings up a confirmation message asking you to confirm you've tested all relevant functionality in dev. A simple reminder, but it solved the silly human errors we used to have.
Conclusion
There are a lot of LLM wrappers and tools out there. We've tried a bunch of them, and while those out-of-the-box solutions are great generic abstractions, what ended up helping us ship fastest was a different kind: one that keeps you in control of what is almost always unique (prompts, transformations, chaining) and abstracts away almost everything else across the application stack.