Ollama
SCRP offers Ollama with a customized start-up script that makes it easier to run on a cluster. Models are loaded directly from our local repository, so they do not use up your own storage quota.
While Ollama gives you the freedom to run a wide variety of models, SCRP-Chat, our managed generative AI service, is easier to use and capable of much higher throughput, so you should consider that option first.
To use Ollama, you first start the Ollama server, followed by the Ollama client.
Starting the Ollama Server
First start the Ollama server on a GPU node. You will need a GPU with enough VRAM to load your model:
| Model size | GPU type | Command |
|---|---|---|
| < 20B | RTX 3060 | gpu scrp-ollama serve |
| 20B - 40B | RTX 3090 | compute --gpus-per-task=rtx3090 scrp-ollama serve |
| > 40B | A100 | compute -p a100 --gpus-per-task=a100 scrp-ollama serve |
Note that when launched this way, the Ollama server terminates when the terminal you launched it from closes. If you would like to keep the server running, launch it with sbatch instead, as sketched below.
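A minimal job script could look like the following. This is only a sketch: the #SBATCH directives shown (job name, GPU request, time limit) are assumptions and should be adapted to the GPU selection table above and to SCRP's Slurm configuration.

#!/bin/bash
#SBATCH --job-name=ollama-server      # hypothetical job name
#SBATCH --gpus-per-task=rtx3090       # choose the GPU type per the table above; add "#SBATCH -p a100" for an A100
#SBATCH --time=04:00:00               # assumed time limit; adjust as needed

# The host address reported by the server appears in the job's output file
scrp-ollama serve

Submit the script with sbatch and check the job's output file (slurm-<jobid>.out by default) for the host address reported by scrp-ollama serve.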
Running the Ollama Client
Once the server is running, you can run a model by starting the client in another terminal:
scrp-ollama run model_name

where model_name is the name of a model in the local repository.
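For example, assuming the gemma3 model used in the API examples below is available in the local repository:

scrp-ollama run gemma3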
You can also show the list of available models with the following command:
scrp-ollama list
To see what models have already been loaded:
scrp-ollama ps
API Access
Ollama supports its own API as well as an OpenAI-compatible API.
Ollama API
Below are examples using the Ollama Python API. Please note the following:
- You will need to replace host_address with the address reported by scrp-ollama serve.
- The Ollama server is only accessible from within the cluster.
- Ollama does not provide any form of authentication, so anyone on the cluster can connect to your Ollama server.
Ollama provides a one-off .generate() method as well as a more complex .chat() method:
# One-off generation
import ollama
client = ollama.Client(host='host_address')
response = client.generate(model='gemma3', prompt='What is the capital of France?')
print(response['response'])
# The chat method supports roles and message history
import ollama
client = ollama.Client(host='host_address')
messages = [
    {'role': 'user', 'content': 'Hello, how are you?'},
    {'role': 'assistant', 'content': 'I am fine, thank you.'},
    {'role': 'user', 'content': 'What can you do?'},
]
chat_response = client.chat(model='gemma3', messages=messages)
print(chat_response['message']['content'])
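The client can also stream a reply as it is generated. Below is a minimal sketch, reusing the host_address placeholder and gemma3 model from the examples above; with stream=True, .chat() returns an iterator of partial responses:

# Streaming chat: print each piece of the reply as it arrives
import ollama

client = ollama.Client(host='host_address')
stream = client.chat(
    model='gemma3',
    messages=[{'role': 'user', 'content': 'What can you do?'}],
    stream=True,
)
for chunk in stream:
    print(chunk['message']['content'], end='', flush=True)
print()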
OpenAI API
Ollama also provides an OpenAI-compatible API. Again, please note the following:
- You will need to replace host_address with the address reported by scrp-ollama serve.
- The Ollama server is only accessible from within the cluster.
- Ollama does not provide any form of authentication, so anyone on the cluster can connect to your Ollama server.
from openai import OpenAI

client = OpenAI(
    base_url='host_address/v1',
    api_key='ollama',  # required, but unused
)
response = client.chat.completions.create(
    model='gemma3',
    messages=[
        {'role': 'system', 'content': 'You are a helpful assistant.'},
        {'role': 'user', 'content': 'Hello, how are you?'},
        {'role': 'assistant', 'content': 'I am fine, thank you.'},
        {'role': 'user', 'content': 'What can you do?'},
    ],
)
print(response.choices[0].message.content)
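If you need to discover model names programmatically, recent Ollama versions also expose the model list through the OpenAI-compatible endpoint. The following sketch assumes the same host_address placeholder as above:

# List the models the server reports via the OpenAI-compatible /v1/models endpoint
from openai import OpenAI

client = OpenAI(base_url='host_address/v1', api_key='ollama')
for model in client.models.list():
    print(model.id)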