Ollama
SCRP offers Ollama with a customized start-up script that makes it easier to run on a cluster. Models are loaded directly from our local repository, so they do not use up your own storage quota.
While Ollama gives you the freedom to run a wide variety of models, SCRP-Chat, our managed generative AI service, is easier to use and capable of much higher throughput, so you should consider that option first.
To use Ollama, you first start the Ollama server, followed by the Ollama client.
Starting the Ollama Server
First start the Ollama server on a GPU node. You will need a GPU with enough VRAM to load your model:
| Model size | GPU type | Command |
|---|---|---|
| < 20B | RTX 3060 | gpu scrp-ollama serve |
| 20B - 40B | RTX 3090 | compute --gpus-per-task=rtx3090 scrp-ollama serve |
| > 40B | A100 | compute -p a100 --gpus-per-task=a100 scrp-ollama serve |
Note that when launched this way, the Ollama server terminates when the terminal you launched it from closes. If you would like to keep the server running, launch it with sbatch instead, as sketched below.
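A minimal job script could look like the following. This is only a sketch: the #SBATCH directives shown (job name, GPU request, time limit) are assumptions and should be adapted to the GPU selection table above and to SCRP's Slurm configuration.

#!/bin/bash
#SBATCH --job-name=ollama-server      # hypothetical job name
#SBATCH --gpus-per-task=rtx3090       # choose the GPU type per the table above; add "#SBATCH -p a100" for an A100
#SBATCH --time=04:00:00               # assumed time limit; adjust as needed

# The host address reported by the server appears in the job's output file
scrp-ollama serve

Submit the script with sbatch and check the job's output file (slurm-<jobid>.out by default) for the host address reported by scrp-ollama serve.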
Running the Ollama Client
Once the server is running, you can run a model by starting the client in another terminal:
scrp-ollama run model_name

where model_name is the name of a model in the local repository.
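For example, assuming the gemma3 model used in the API examples below is available in the local repository:

scrp-ollama run gemma3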
You can also show the list of available models with the following command:
scrp-ollama list
To see what models have already been loaded:
scrp-ollama ps
API Access
Ollama supports its own API as well as an OpenAI-compatible API.
Ollama API
Below are examples using the Ollama Python API. Please note the following:
- You will need to replace host_address with the address reported by scrp-ollama serve.
- The Ollama server is only accessible from within the cluster.
- Ollama does not provide any form of authentication, so anyone on the cluster can connect to your Ollama server.
Ollama provides a one-off .generate() method as well as a more complex .chat() method:
# One-off generation
import ollama
client = ollama.Client(host='host_address')
response = client.generate(model='gemma3', prompt='What is the capital of France?')
print(response['response'])
# The chat method supports roles and message history
import ollama
client = ollama.Client(host='host_address')
messages = [
    {'role': 'user', 'content': 'Hello, how are you?'},
    {'role': 'assistant', 'content': 'I am fine, thank you.'},
    {'role': 'user', 'content': 'What can you do?'},
]
chat_response = client.chat(model='gemma3', messages=messages)
print(chat_response['message']['content'])
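The client can also stream a reply as it is generated. Below is a minimal sketch, reusing the host_address placeholder and gemma3 model from the examples above; with stream=True, .chat() returns an iterator of partial responses:

# Streaming chat: print each piece of the reply as it arrives
import ollama

client = ollama.Client(host='host_address')
stream = client.chat(
    model='gemma3',
    messages=[{'role': 'user', 'content': 'What can you do?'}],
    stream=True,
)
for chunk in stream:
    print(chunk['message']['content'], end='', flush=True)
print()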
OpenAI API
Ollama also provides an OpenAI-compatible API. Again, please note the following:
- You will need to replace host_address with the address reported by scrp-ollama serve.
- The Ollama server is only accessible from within the cluster.
- Ollama does not provide any form of authentication, so anyone on the cluster can connect to your Ollama server.
from openai import OpenAI

client = OpenAI(
    base_url='host_address/v1',
    api_key='ollama',  # required, but unused
)
response = client.chat.completions.create(
    model='gemma3',
    messages=[
        {'role': 'system', 'content': 'You are a helpful assistant.'},
        {'role': 'user', 'content': 'Hello, how are you?'},
        {'role': 'assistant', 'content': 'I am fine, thank you.'},
        {'role': 'user', 'content': 'What can you do?'},
    ],
)
print(response.choices[0].message.content)
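If you need to discover model names programmatically, recent Ollama versions also expose the model list through the OpenAI-compatible endpoint. The following sketch assumes the same host_address placeholder as above:

# List the models the server reports via the OpenAI-compatible /v1/models endpoint
from openai import OpenAI

client = OpenAI(base_url='host_address/v1', api_key='ollama')
for model in client.models.list():
    print(model.id)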