Running your own embedding model API server - Part 2
In part 1, we chose a text embedding model and saved it with BentoML. Now, let's create a BentoML Service that contains the serving logic. Create a file named service.py:
import bentoml
import torch
from bentoml.io import JSON
from pydantic import BaseModel
from typing import Any, List
from torch import Tensor


class EmbeddingRequest(BaseModel):
    texts: List[str]


class EmbeddingResponse(BaseModel):
    token_length: int
    embedding_vector: Any


input_spec = JSON(pydantic_model=EmbeddingRequest)
output_spec = JSON(pydantic_model=EmbeddingResponse)

# Runners wrap the tokenizer and model saved in part 1.
tokenizer_runner = bentoml.transformers.get("e5_base_tokenizer").to_runner()
model_runner = bentoml.transformers.get("e5_base_model").to_runner()

svc = bentoml.Service("e5_base", runners=[tokenizer_runner, model_runner])


def average_pool(last_hidden_states: Tensor, attention_mask: Tensor) -> Tensor:
    # Mean-pool the token embeddings, ignoring padding positions.
    last_hidden = last_hidden_states.masked_fill(~attention_mask[..., None].bool(), 0.0)
    return last_hidden.sum(dim=1) / attention_mask.sum(dim=1)[..., None]


@svc.api(input=input_spec, output=output_spec, route="v1/embedding")
async def instructor_embedding(input: EmbeddingRequest) -> dict:
    texts = input.texts
    # Tokenize the input texts.
    encoded = tokenizer_runner.run(texts, max_length=512, padding=True, truncation=True, return_tensors='pt')
    # Run the model and pool the last hidden states into one vector per text.
    model_output = model_runner.run(**encoded)
    embeddings = average_pool(model_output.last_hidden_state, encoded['attention_mask'])
    embeddings = embeddings.detach().numpy()
    # Total number of input tokens across the batch (padding included).
    token_length = len(torch.flatten(encoded.data['input_ids'], 0))
    return {"embedding_vector": embeddings, "token_length": token_length}
Let’s look at the core parts of the code.
- Specifying input and output data specifications:
  - input_spec: an instance of the JSON class from bentoml.io that specifies the input data type as JSON and uses the EmbeddingRequest Pydantic model for validation.
  - output_spec: an instance of the JSON class from bentoml.io that specifies the output data type as JSON and uses the EmbeddingResponse Pydantic model for validation.
- Creating the BentoML service:
  - tokenizer_runner: a BentoML runner for tokenizing text inputs using the pre-trained tokenizer.
  - model_runner: a BentoML runner for generating embeddings using the pre-trained model.
  - svc: an instance of bentoml.Service that represents the BentoML service. It is named "e5_base" and uses the tokenizer_runner and model_runner for processing input data.
- Defining the main API endpoint:
  - @svc.api: a decorator that defines the main API endpoint for the BentoML service. It specifies the input and output data specifications, as well as the route path ("/v1/embedding").
  - instructor_embedding: the API function that receives the input data and processes it to generate embeddings. It also calculates the total token length and returns the embeddings and token length as the API response (see the example payloads after this list).
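For reference, the request and response bodies for this endpoint look roughly like the following (illustrative values; e5-base produces 768-dimensional vectors, truncated here):

{"texts": ["query: how do I serve an embedding model?"]}

{"embedding_vector": [[0.0123, -0.0456, ...]], "token_length": 12}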
Overall, this code defines a BentoML service that provides an API endpoint for generating word embeddings using a transformer-based model, with input validation and data type specifications.
Now we can run the BentoML server with the new service in development mode.
$ bentoml serve service:svc --development --reload
2022-09-18T21:11:22-0700 [INFO] [cli] Prometheus metrics for HTTP BentoServer from "service.py:svc" can be accessed at <http://localhost:3000/metrics>.
2022-09-18T21:11:22-0700 [INFO] [cli] Starting development HTTP BentoServer from "service.py:svc" listening on 0.0.0.0:3000 (Press CTRL+C to quit)
2022-09-18 21:11:23 circus[80177] [INFO] Loading the plugin...
2022-09-18 21:11:23 circus[80177] [INFO] Endpoint: 'tcp://127.0.0.1:61825'
2022-09-18 21:11:23 circus[80177] [INFO] Pub/sub: 'tcp://127.0.0.1:61826'
Once the development server has started successfully, we can view the generated API docs and test them directly. Open http://localhost:3000 in your browser. You can check that a POST /v1/embedding endpoint has been created, which is the one defined with the @svc.api decorator in service.py.
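For example, we can call the endpoint from the command line (a sample request; adjust the texts to your use case):

curl -X POST http://localhost:3000/v1/embedding \
  -H 'Content-Type: application/json' \
  -d '{"texts": ["query: how do I serve an embedding model?"]}'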

When we send an embedding API request to the server, we receive the embedded data back, so the development server works fine. How, then, do we run a production server? One of the best ways is to build a Docker image and run it with Docker. Let's build one.
Create Bento
We can build the model and service into a bento. A bento is the distribution format for a BentoML service: a self-contained archive that contains all the source code, model files, and dependency specifications required to run the service. Once a bento is created, a Docker image can be generated from it automatically.
To build a bento, create a bentofile.yaml:
service: "service:svc" # Same as the argument passed to `bentoml serve`
labels:
owner: nlp_team
stage: dev
include:
- "*.py" # A pattern for matching which files to include in the bento
python:
requirements_txt: "./requirements.txt"
docker:
distro: debian
python_version: "3.8.12"
cuda_version: "11.6.2"
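The referenced requirements.txt is not shown here; at a minimum it would list the libraries imported in service.py, so a rough sketch might look like this (pin versions to match your environment):

torch
transformers
pydantic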
Now build the bento and check that it has been saved.
$ bentoml build --version 1.0.0
Building BentoML service "e5_base:1.0.0" from build context "/home/ubuntu/workspace/embedding-model-server/e5_base".
Packing model "e5_base_model:hiitriq7uscd7w34"
Packing model "e5_base_tokenizer:hiitria7uscd7w34"
██████╗░███████╗███╗░░██╗████████╗░█████╗░███╗░░░███╗██╗░░░░░
██╔══██╗██╔════╝████╗░██║╚══██╔══╝██╔══██╗████╗░████║██║░░░░░
██████╦╝█████╗░░██╔██╗██║░░░██║░░░██║░░██║██╔████╔██║██║░░░░░
██╔══██╗██╔══╝░░██║╚████║░░░██║░░░██║░░██║██║╚██╔╝██║██║░░░░░
██████╦╝███████╗██║░╚███║░░░██║░░░╚█████╔╝██║░╚═╝░██║███████╗
╚═════╝░╚══════╝╚═╝░░╚══╝░░░╚═╝░░░░╚════╝░╚═╝░░░░░╚═╝╚══════╝
Successfully built Bento(tag="e5_base:1.0.0").
Possible next steps:
* Containerize your Bento with `bentoml containerize`:
$ bentoml containerize e5_base:1.0.0
* Push to BentoCloud with `bentoml push`:
$ bentoml push e5_base:1.0.0
$ bentoml list
Tag Size Creation Time
e5_base:1.0.0 1.06 GiB 2023-07-11 10:09:55
After the bento is saved, we can serve it right away:
$ bentoml serve e5_base:1.0.0
2023-07-11T10:26:44+0000 [INFO] [cli] Environ for worker 0: set CPU thread count to 2
2023-07-11T10:26:44+0000 [INFO] [cli] Environ for worker 0: set CPU thread count to 2
2023-07-11T10:26:44+0000 [INFO] [cli] Prometheus metrics for HTTP BentoServer from "e5_base:1.0.0" can be accessed at <http://localhost:3000/metrics>.
2023-07-11T10:26:44+0000 [INFO] [cli] Starting production HTTP BentoServer from "e5_base:1.0.0" listening on <http://0.0.0.0:3000> (Press CTRL+C to quit)
Build Docker image
Since the bento has been saved, a Docker image can be generated automatically with the bentoml containerize command.
$ bentoml containerize e5_base:1.0.0
Successfully built Bento container for "e5_base:1.0.0" with tag(s) "e5_base:1.0.0"
To run your newly built Bento container, run:
docker run -it --rm -p 3000:3000 e5_base:1.0.0 serve
Now, run the created Docker image:
docker run -it --rm -p 3000:3000 e5_base:1.0.0 serve
Just like the development server we ran with the BentoML CLI, it works like a charm, returning embedding responses.
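As a final sanity check, here is a minimal Python client (a sketch using the requests library, assuming the container is listening on localhost:3000):

import requests

resp = requests.post(
    "http://localhost:3000/v1/embedding",
    json={"texts": ["query: how do I serve an embedding model?"]},
    timeout=30,
)
resp.raise_for_status()
body = resp.json()
print(body["token_length"])           # total number of input tokens
print(len(body["embedding_vector"]))  # one embedding per input text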
Conclusion
In conclusion, building your own text embedding API server using BentoML is a powerful way to leverage the capabilities of text embedding models and provide seamless integration with your applications. By following the steps outlined in this article, you have learned how to write a BentoML Service, save the Bento bundle, and build a Docker image for deployment. This approach offers flexibility, scalability, and ease of use, enabling you to unlock the potential of text embedding in various use cases, such as search engines, recommendation systems, and sentiment analysis. With BentoML, you have the tools at your disposal to create a robust and efficient text embedding API server tailored to your specific needs. Start building your own text embedding solution today and unlock the power of natural language understanding in your applications!