SCRP offers Ollama with a customized start-up script that makes it easier to run on a cluster. Models are loaded directly from our local repository, so they do not use up your own storage quota.

While Ollama gives you the freedom to run a wide variety of models, SCRP-Chat, our managed generative AI service, is easier to use and offers much higher throughput, so consider that option first.

To use Ollama, first start the Ollama server, then connect to it with the Ollama client.

Starting the Ollama Server

First start the Ollama server on a GPU node. You will need a GPU with enough VRAM to load your model:

Model size   GPU Type    Command
< 20B        RTX 3060    gpu scrp-ollama serve
20 - 40B     RTX 3090    compute --gpus-per-task=rtx3090 scrp-ollama serve
> 40B        A100        compute -p a100 --gpus-per-task=a100 scrp-ollama serve

Note that when launched this way, the Ollama server terminates as soon as the terminal you launched it from is closed. If you would like to keep the server alive, launch it with sbatch instead, as in the sketch below.
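
The following is a minimal batch script sketch, not an official template: the GPU option is taken from the table above, while the job name and time limit are illustrative values you should adjust to your own needs.

#!/bin/bash
#SBATCH --job-name=ollama-serve      # illustrative job name
#SBATCH --gpus-per-task=rtx3090      # pick the GPU type from the table above
#SBATCH --time=12:00:00              # example time limit; adjust as needed

scrp-ollama serve

Submit the script with sbatch; the address reported by scrp-ollama serve will then appear in the job's output file.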

Running the Ollama Client

Once the server is running, you can run a model by starting the client in another terminal:

scrp-ollama run model_name

where model_name is a model in the local repository.

You can also show the list of available models with the following command:

scrp-ollama list

To see what models have already been loaded:

scrp-ollama ps

API Access

Ollama supports its own API as well as an OpenAI-compatible API.

Ollama API

Below is an example using the Ollama Python API. Please note the following:

  • You will need to replace host_address with what is reported by scrp-ollama serve.
  • The Ollama server is only accessible within the cluster.
  • Ollama does not provide any form of authentication, so anyone on the cluster can connect to your Ollama server.

Ollama provides a one-off .generate() method as well as a .chat() method that supports roles and message history:

# One-off generation
import ollama

client = ollama.Client(host='host_address')

response = client.generate(model='gemma3', prompt='What is the capital of France?')
print(response['response'])

# The chat method supports roles and message history
import ollama

client = ollama.Client(host='host_address')

messages = [
    {'role': 'user', 'content': 'Hello, how are you?'},
    {'role': 'assistant', 'content': 'I am fine, thank you.'},
    {'role': 'user', 'content': 'What can you do?'},
]
chat_response = client.chat(model='gemma3', messages=messages)
print(chat_response['message']['content'])
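
Both methods also accept stream=True, in which case they return an iterator of partial responses rather than a single reply. A minimal sketch, reusing the client and messages defined above:

# Stream the reply as it is generated instead of waiting for the full response
for part in client.chat(model='gemma3', messages=messages, stream=True):
    print(part['message']['content'], end='', flush=True)
print()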

OpenAI API

Ollama also provides an OpenAI-compatible API. Again, please note the following:

  • You will need to replace host_address with what is reported by scrp-ollama serve.
  • The Ollama server is only accessible within the cluster.
  • Ollama does not provide any form of authentication, so anyone on the cluster can connect to your Ollama server.

from openai import OpenAI

client = OpenAI(
    base_url='host_address/v1',
    api_key='ollama',  # required by the client library, but unused by Ollama
)

response = client.chat.completions.create(
    model='gemma3',
    messages=[
        {'role': 'system', 'content': 'You are a helpful assistant.'},
        {'role': 'user', 'content': 'Hello, how are you?'},
        {'role': 'assistant', 'content': 'I am fine, thank you.'},
        {'role': 'user', 'content': 'What can you do?'},
    ],
)
print(response.choices[0].message.content)
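
The OpenAI-compatible endpoint also supports streaming responses. A brief sketch reusing the client created above; the single user message is just an illustrative prompt:

# Stream the completion chunk by chunk
stream = client.chat.completions.create(
    model='gemma3',
    messages=[{'role': 'user', 'content': 'What can you do?'}],
    stream=True,
)
for chunk in stream:
    if chunk.choices[0].delta.content is not None:
        print(chunk.choices[0].delta.content, end='', flush=True)
print()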