Serving an open source LLM as an OpenAI API

llm
huggingface
openai
litellm
Published

January 8, 2024

In this tutorial, we’ll see how to serve an open source language model behind an OpenAI-compatible API using HuggingFace Text Generation Inference (TGI) and the LiteLLM OpenAI proxy server. This lets us use any tool that speaks the OpenAI API.

We’ll use a 4-bit quantized Llama-2 7B chat model to lower the GPU memory requirement (roughly 3.5 GB of weights instead of about 14 GB in float16). Launch the TGI server on a machine with a GPU:

#!/bin/bash

# Cache directory shared with the TGI container so model weights persist across runs
VOLUME="${HOME}/.cache/huggingface/tgi"
mkdir -p "${VOLUME}"

docker run --gpus all --shm-size 1g \
    -p 8080:80 \
    -v "${VOLUME}":/data \
    ghcr.io/huggingface/text-generation-inference:latest \
    --trust-remote-code \
    --model-id "NousResearch/llama-2-7b-chat-hf" \
    --quantize bitsandbytes-nf4 \
    --dtype float16
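
Once the container is up, we can sanity-check TGI directly through its native /generate endpoint before putting the proxy in front of it:

curl http://localhost:8080/generate \
    -X POST \
    -H "Content-Type: application/json" \
    -d '{"inputs": "What is deep learning?", "parameters": {"max_new_tokens": 50}}'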

The HuggingFace Text Generation Inference server supports only text completion. However, we want to send chat messages with system, user, and assistant roles, just as with OpenAI models (i.e. the ChatML format). Fortunately, LiteLLM supports the Llama-2 chat template, which converts chat messages to a text prompt before calling the TGI server. All we need to do is specify the model parameter as huggingface/meta-llama/Llama-2-7b. For models not supported by LiteLLM, we can create a custom template.
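
For reference, the Llama-2 chat template wraps the conversation in [INST] ... [/INST] blocks with the system message inside a <<SYS>> section. If we were serving a model that LiteLLM doesn’t recognize, we could register a similar template ourselves. Below is a rough sketch using litellm.register_prompt_template; the model id is hypothetical and the pre/post strings only approximate the Llama-2 format:

import litellm

# Hypothetical example: register a Llama-2-style prompt template for a model
# that LiteLLM doesn't already know about. The strings below approximate the
# Llama-2 chat format and may need adjusting for a real model.
litellm.register_prompt_template(
    model="my-org/my-llama-finetune",  # hypothetical model id
    roles={
        "system": {"pre_message": "[INST] <<SYS>>\n", "post_message": "\n<</SYS>>\n [/INST]\n"},
        "user": {"pre_message": "[INST] ", "post_message": " [/INST]\n"},
        "assistant": {"post_message": "\n"},
    },
)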

Here is the config for the LiteLLM OpenAI proxy server.

# config.yaml
model_list:
  - model_name: llama-2-7b-chat # arbitrary alias for our model
    litellm_params: # actual params for litellm.completion()
      model: "huggingface/meta-llama/Llama-2-7b"
      api_base: "http://localhost:8080/"
      max_tokens: 1024

litellm_settings:
  set_verbose: True

Launch the LiteLLM OpenAI proxy server:

litellm --config config.yaml
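
Before wiring up any client, we can send a plain OpenAI-style chat completion request to the proxy with curl. This is a minimal sketch assuming the default port 8000 and the /chat/completions route; the model name must match the alias from config.yaml:

curl http://localhost:8000/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "llama-2-7b-chat",
        "messages": [{"role": "user", "content": "Say hello in one sentence."}]
    }'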

Set the API base URL of the proxy.

API_BASE="http://localhost:8000/"

Let’s first use the model with the completion function provided by the litellm library.

from litellm import completion

messages = [
    {"content": "You are a helpful assistant.", "role": "system"},
    {"content": "Tell me 3 reasons to live in Istanbul.", "role": "user"},
]

response = completion(
    api_base=API_BASE,
    model="llama-2-7b-chat",
    custom_llm_provider="openai",  # so that messages are sent to the proxy server as-is
    messages=messages,
    temperature=0.3,
)

print(response.choices[0].message.content)

Istanbul is a city with a rich history and culture, and there are many reasons to live there. Here are three:

1. Cultural Diversity: Istanbul is a city that straddles two continents, Europe and Asia, and has a unique cultural identity that reflects its history as a crossroads of civilizations. The city is home to a diverse population, including Turks, Kurds, Greeks, Armenians, and other ethnic groups, each with their own traditions and customs. This diversity is reflected in the city's architecture, food, music, and art, making Istanbul a vibrant and exciting place to live.
2. Historical Landmarks: Istanbul is home to some of the most impressive historical landmarks in the world, including the Hagia Sophia, the Blue Mosque, and the Topkapi Palace. These landmarks are not only important cultural and religious sites, but also serve as a reminder of the city's rich history and its role in the development of civilizations. Living in Istanbul, you are surrounded by these incredible structures, which are a source of inspiration and pride for the city's residents.
3. Gastronomy: Istanbul is known for its delicious and diverse food scene, which reflects the city's cultural diversity. From traditional Turkish dishes like kebabs and baklava, to Greek and Middle Eastern cuisine, there is something for every taste and budget. Living in Istanbul, you have access to a wide range of fresh produce, spices, and other ingredients, which are used to create mouth-watering dishes that are both healthy and delicious.

Overall, Istanbul is a city that offers a unique and enriching experience for those who live there. Its cultural diversity, historical landmarks, and gastronomy make it a vibrant and exciting place to call home.
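
Since the proxy speaks the OpenAI API, the official openai Python client can also be pointed at it. Here is a minimal sketch, assuming openai>=1.0 and the API_BASE defined above; the api_key value is arbitrary because no master key is configured on the proxy in this setup:

from openai import OpenAI

# Point the official OpenAI client at the local LiteLLM proxy.
client = OpenAI(api_key="not-needed", base_url=API_BASE)

response = client.chat.completions.create(
    model="llama-2-7b-chat",  # alias defined in config.yaml
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Tell me 3 reasons to live in Istanbul."},
    ],
    temperature=0.3,
)
print(response.choices[0].message.content)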

Now, let’s use the model with the llama-index library. The subtle point is that the LiteLLM class in llama-index expects the custom_llm_provider parameter inside the additional_kwargs argument.

from llama_index.llms import ChatMessage, LiteLLM

llm = LiteLLM(
    api_base=API_BASE,
    api_key="",
    model="llama-2-7b-chat", 
    temperature=0.3,
    additional_kwargs=dict(
        custom_llm_provider="openai", # so that messages are sent to proxy server as they are
    ),
)

messages = [
    ChatMessage.parse_obj({"content": "You are a helpful assistant.", "role": "system"}),
    ChatMessage.parse_obj({"content": "Tell me 3 reasons to live in London.", "role": "user"}),
]
response = llm.chat(messages) 
print(response.message.content)


1. Cultural diversity: London is a melting pot of cultures, with people from all over the world calling it home. This diversity is reflected in the city's food, art, music, and fashion, making it a vibrant and exciting place to live.
2. World-class amenities: London has some of the best amenities in the world, including top-notch restaurants, theaters, museums, and sports venues. Whether you're looking for a night out on the town or a quiet evening at home, London has something for everyone.
3. Investment opportunities: London is a major financial hub, with many opportunities for investment in real estate, business, and other industries. The city's strong economy and stable political environment make it an attractive place to invest and grow your wealth.
Let’s try a different system prompt with the same LiteLLM instance:

messages = [
    ChatMessage(content="You are a hilarious comedian who is famous for their sarcastic jokes.", role="system"),
    ChatMessage(content="Tell me a joke about front-end developers.", role="user"),
]
response = llm.chat(messages) 
print(response.message.content)


I'm glad you think I'm hilarious! Here's a joke for you:

Why did the front-end developer break up with his girlfriend?

Because he wanted a more responsive relationship! 😂

I hope you found that one amusing! Front-end developers can be a funny topic, but I'm sure they won't mind a good-natured jab or two. Let me know if you want another one!