Running your own embedding model API server - Part 2

In Part 1, we chose a text embedding model and saved it with BentoML. Now, let’s create a BentoML Service that contains the serving logic. Create a file named service.py:

from typing import Any, List

import bentoml
import torch
from bentoml.io import JSON
from pydantic import BaseModel
from torch import Tensor

# Request body: a batch of texts to embed.
class EmbeddingRequest(BaseModel):
    texts: List[str]

# Response body: the total input-token count plus the embedding vectors.
class EmbeddingResponse(BaseModel):
    token_length: int
    embedding_vector: Any

input_spec = JSON(pydantic_model=EmbeddingRequest)
output_spec = JSON(pydantic_model=EmbeddingResponse)

tokenizer_runner = bentoml.transformers.get("e5_base_tokenizer").to_runner()
model_runner = bentoml.transformers.get("e5_base_model").to_runner()

svc = bentoml.Service("e5_base", runners=[tokenizer_runner, model_runner])

def average_pool(last_hidden_states: Tensor, attention_mask: Tensor) -> Tensor:
    # Mean-pool the token embeddings into one vector per text, ignoring padding positions.
    last_hidden = last_hidden_states.masked_fill(~attention_mask[..., None].bool(), 0.0)
    return last_hidden.sum(dim=1) / attention_mask.sum(dim=1)[..., None]

@svc.api(input=input_spec, output=output_spec, route="v1/embedding")
async def instructor_embedding(input: EmbeddingRequest) -> dict:
    texts = input.texts

    # Tokenize the batch, padding/truncating each text to the model's 512-token limit.
    encoded = tokenizer_runner.run(texts, max_length=512, padding=True, truncation=True, return_tensors='pt')

    # Run the model and average-pool the last hidden states into one embedding per text.
    model_output = model_runner.run(**encoded)
    embeddings = average_pool(model_output.last_hidden_state, encoded['attention_mask'])
    embeddings = embeddings.detach().numpy()

    # Total number of input tokens (including padding) across the whole batch.
    token_length = len(torch.flatten(encoded['input_ids'], 0))

    return {"embedding_vector": embeddings, "token_length": token_length}

Let’s look at the core parts of the code.

  1. Defining the input and output specifications:
    • input_spec: An instance of the JSON class from bentoml.io that specifies the input data type as JSON and uses the EmbeddingRequest Pydantic model for validation.
    • output_spec: An instance of the JSON class from bentoml.io that specifies the output data type as JSON and uses the EmbeddingResponse Pydantic model for validation.
  2. Creating BentoML service:
    • tokenizer_runner: A BentoML runner for tokenizing text inputs using a pre-trained tokenizer.
    • model_runner: A BentoML runner for generating embeddings using a pre-trained model.
    • svc: An instance of bentoml.Service that represents the BentoML service. It is named “e5_base” and uses the tokenizer_runner and model_runner for processing input data.
  3. Defining the main API endpoint:
    • @svc.api: A decorator that defines the main API endpoint for the BentoML service. It specifies the input and output data specifications, as well as the route path (“/v1/embedding”).
    • instructor_embedding: The API function that receives the input data and processes it to generate embeddings.
    • The function calculates the total token length and returns the embeddings and token length as the API response.

Overall, this code defines a BentoML service that exposes an API endpoint for generating text embeddings with a transformer-based model, with input validation and data type specifications.
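
To make the pooling step concrete, here is a tiny standalone sketch (independent of the service, with made-up numbers) showing that average_pool takes the mean of only the non-padding token embeddings:

import torch
from torch import Tensor

def average_pool(last_hidden_states: Tensor, attention_mask: Tensor) -> Tensor:
    last_hidden = last_hidden_states.masked_fill(~attention_mask[..., None].bool(), 0.0)
    return last_hidden.sum(dim=1) / attention_mask.sum(dim=1)[..., None]

# One "text" of three token embeddings with hidden size 2; the third token is padding.
hidden_states = torch.tensor([[[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]]])  # shape (1, 3, 2)
attention_mask = torch.tensor([[1, 1, 0]])                            # 0 marks the padding token

print(average_pool(hidden_states, attention_mask))  # tensor([[2., 3.]]), the mean of the two real tokens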

Now we can run the BentoML server with the new service in development mode.

$ bentoml serve service:svc --development --reload
2022-09-18T21:11:22-0700 [INFO] [cli] Prometheus metrics for HTTP BentoServer from "service.py:svc" can be accessed at http://localhost:3000/metrics.
2022-09-18T21:11:22-0700 [INFO] [cli] Starting development HTTP BentoServer from "service.py:svc" listening on 0.0.0.0:3000 (Press CTRL+C to quit)
2022-09-18 21:11:23 circus[80177] [INFO] Loading the plugin...
2022-09-18 21:11:23 circus[80177] [INFO] Endpoint: 'tcp://127.0.0.1:61825'
2022-09-18 21:11:23 circus[80177] [INFO] Pub/sub: 'tcp://127.0.0.1:61826'

When the development server has started successfully, we can view the generated API docs and test the endpoint directly. Open http://localhost:3000 in your browser. You can see that the POST /v1/embedding endpoint, defined with the @svc.api decorator in service.py, has been created.
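
We can also call the endpoint from code. Here is a minimal Python client sketch (it assumes the `requests` package is installed and the development server above is still running; the example text is only illustrative):

import requests

payload = {"texts": ["query: what is a text embedding?"]}
resp = requests.post("http://localhost:3000/v1/embedding", json=payload)
resp.raise_for_status()

result = resp.json()
print(result["token_length"])           # total number of input tokens
print(len(result["embedding_vector"]))  # one embedding vector per input text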

When we send an embedding API request to the server, we receive the embedded data, which confirms that the development server works. Now, how do we run a production server? One of the best ways is to build a Docker image and run it with Docker. Let’s build a Docker image.

Create Bento

We can build the model and service into a Bento. A Bento is the distribution format for a BentoML service: a self-contained archive that contains all the source code, model files, and dependency specifications required to run the service. Once a Bento is created, a Docker image can be generated from it automatically.

To build a Bento, create a bentofile.yaml:

service: "service:svc"  # Same as the argument passed to `bentoml serve`
labels:
    owner: nlp_team
    stage: dev
include:
  - "*.py"  # A pattern for matching which files to include in the bento
python:
  requirements_txt: "./requirements.txt"
docker:
    distro: debian
    python_version: "3.8.12"
    cuda_version: "11.6.2"

Now, build the Bento and check that it has been saved.

$ bentoml build --version 1.0.0
Building BentoML service "e5_base:1.0.0" from build context "/home/ubuntu/workspace/embedding-model-server/e5_base".
Packing model "e5_base_model:hiitriq7uscd7w34"
Packing model "e5_base_tokenizer:hiitria7uscd7w34"

██████╗░███████╗███╗░░██╗████████╗░█████╗░███╗░░░███╗██╗░░░░░
██╔══██╗██╔════╝████╗░██║╚══██╔══╝██╔══██╗████╗░████║██║░░░░░
██████╦╝█████╗░░██╔██╗██║░░░██║░░░██║░░██║██╔████╔██║██║░░░░░
██╔══██╗██╔══╝░░██║╚████║░░░██║░░░██║░░██║██║╚██╔╝██║██║░░░░░
██████╦╝███████╗██║░╚███║░░░██║░░░╚█████╔╝██║░╚═╝░██║███████╗
╚═════╝░╚══════╝╚═╝░░╚══╝░░░╚═╝░░░░╚════╝░╚═╝░░░░░╚═╝╚══════╝

Successfully built Bento(tag="e5_base:1.0.0").

Possible next steps:

 * Containerize your Bento with `bentoml containerize`:
    $ bentoml containerize e5_base:1.0.0

 * Push to BentoCloud with `bentoml push`:
    $ bentoml push e5_base:1.0.0

$ bentoml list
Tag            Size      Creation Time
e5_base:1.0.0  1.06 GiB  2023-07-11 10:09:55

After the Bento is saved, we can serve it right away:

$ bentoml serve e5_base:1.0.0
2023-07-11T10:26:44+0000 [INFO] [cli] Environ for worker 0: set CPU thread count to 2
2023-07-11T10:26:44+0000 [INFO] [cli] Environ for worker 0: set CPU thread count to 2
2023-07-11T10:26:44+0000 [INFO] [cli] Prometheus metrics for HTTP BentoServer from "e5_base:1.0.0" can be accessed at http://localhost:3000/metrics.
2023-07-11T10:26:44+0000 [INFO] [cli] Starting production HTTP BentoServer from "e5_base:1.0.0" listening on http://0.0.0.0:3000 (Press CTRL+C to quit)

Build Docker image

Since the Bento is saved, a Docker image can be generated automatically with the `bentoml containerize` command.

$ bentoml containerize e5_base:1.0.0

Successfully built Bento container for "e5_base:1.0.0" with tag(s) "e5_base:1.0.0"
To run your newly built Bento container, run:
    docker run -it --rm -p 3000:3000 e5_base:1.0.0 serve

Now, run the newly created Docker image:

$ docker run -it --rm -p 3000:3000 e5_base:1.0.0 serve

Just like when we served in development mode with the BentoML CLI, it works like a charm, returning the embedding response.

Conclusion

In conclusion, building your own text embedding API server with BentoML is a powerful way to leverage text embedding models and integrate them seamlessly into your applications. By following the steps outlined in this article, you have learned how to write a BentoML Service, build the Bento bundle, and create a Docker image for deployment. This approach offers flexibility, scalability, and ease of use across use cases such as search engines, recommendation systems, and sentiment analysis. With BentoML, you have the tools to create a robust and efficient text embedding API server tailored to your specific needs. Start building your own text embedding solution today and bring natural language understanding to your applications!