Generative AI Chat
SCRP provides access to locally-hosted, state-of-the-art generative AI models. Information passed to the models never leaves the cluster, making it suitable for processing proprietary data.
Web Access
- Log in to https://scrp-chat.econ.cuhk.edu.hk with your SCRP credentials.
- If necessary, choose a model at the top. The default model is Llama 3.1 70B.
- Files can be uploaded by clicking the + button on the left of the dialog box at the bottom. The models are capable of parsing many types of files, including plain text, Excel spreadsheets, CSVs, and PDFs.
API Access
SCRP provides an OpenAI-compatible API. To access the API,
- Log in to https://scrp-chat.econ.cuhk.edu.hk with your SCRP credentials.
- Click on the user avatar on the top-right corner, then click Settings.
- In Account, click Show next to ‘API Keys’.
- Click + Create new secret key.
- Copy the created API key and provide it along with the full model name when accessing the API through HTTP requests or the OpenAI Python library. See the examples below.
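Once you have a key, a quick sanity check is to list the available models. Below is a minimal sketch that assumes the server implements the standard OpenAI models endpoint under the base URL used in the examples later on this page:

from openai import OpenAI

client = OpenAI(
    base_url='https://scrp-chat.econ.cuhk.edu.hk/api',
    api_key='your_api_key_here',
)

# List the models the server exposes; confirms the key and base URL work.
for model in client.models.list():
    print(model.id)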
Available Models
SCRP currently provides access to the following models:
Model | Max. Throughput | Context Length | Quantization | Full Model Name |
---|---|---|---|---|
Llama 3.1 70B (default) | 60 token/s | 95K | AWQ INT4 | hugging-quants/Meta-Llama-3.1-70B-Instruct-AWQ-INT4 |
Llama 3.1 405B | 5 token/s | 128K | Q4_K_M | llama3.1:405b-instruct-q4_K_M |
Llama 3.2 1B | 200 token/s | 95K | FP16 | meta-llama/Llama-3.2-1B-Instruct |
Llama 3.2 11B Vision | 50 token/s | 300 | FP8 | neuralmagic/Llama-3.2-11B-Vision-Instruct-FP8-dynamic |
Larger models are more capable but run slower. The difference is particularly noticeable in fact recall, mathematics, and logical reasoning. A comparison of the models’ performance can be found here.
Tips for Getting the Best Results
- Models can hallucinate. During text generation, a Large Language Model (LLM) probabilistically chooses the most likely next word, so it can generate convincing-looking but completely fabricated content.
- When formulating your prompt, be specific about what you are asking for. For example, if you believe the answer is uncertain, tell the model so; otherwise it will likely provide a wrong answer with confidence.
- Asking the models for concise output will degrade output quality. The models do not have the ability to think privately, so asking them for concise output is equivalent to asking them not to think in detail. Instead, ask them to “think step-by-step”, followed by outputting a concise answer at the end (see the sketch after this list).
- Plain text is preferred to formatted text. For example, if you want the model to edit your manuscript, provide the raw LaTeX source instead of the generated PDF file.
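As an illustration of the step-by-step tip above, the following prompt sketch asks the model to reason in full and keep only the last line concise. The task and wording are hypothetical examples, not a prescribed template:

# Illustrative prompt following the "think step-by-step" tip above.
# The task and wording are hypothetical examples.
messages = [
    {"role": "system", "content": "You are a careful assistant."},
    {"role": "user", "content": (
        "Is 1003 a prime number? Think step-by-step, showing your "
        "reasoning, then give a one-word final answer on the last line."
    )},
]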
Text Generation Settings
Below is a list of common settings for text generation. For web access, the settings are shown when you click the control panel icon next to your avatar in the top-right corner.
- Temperature: A lower temperature sharpens the distribution from which the model samples the next word. Setting temperature to 0 makes the model’s output deterministic. Exposed as temperature in the OpenAI API.
- Top K: Only sample from the top 𝑘 most likely words. Not available in the OpenAI API.
- Top P: Only sample from the smallest set of words whose cumulative probability reaches 𝑝. Exposed as top_p in the OpenAI API.
- Frequency Penalty: A higher penalty lowers the probability of sampling words that have already been sampled. Exposed as frequency_penalty in the OpenAI API.
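As a sketch of how these settings map onto the OpenAI Python library, the snippet below passes all three API-exposed parameters. The parameter values and the prompt are arbitrary illustrations, not recommendations:

from openai import OpenAI

client = OpenAI(
    base_url='https://scrp-chat.econ.cuhk.edu.hk/api',
    api_key='your_api_key_here',
)

response = client.chat.completions.create(
    model="hugging-quants/Meta-Llama-3.1-70B-Instruct-AWQ-INT4",
    messages=[{"role": "user", "content": "Name three uses of panel data."}],
    temperature=0.7,        # lower values sharpen the sampling distribution
    top_p=0.9,              # only sample from the top 90% of probability mass
    frequency_penalty=0.5,  # discourage words that have already appeared
)
print(response.choices[0].message.content)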
Examples
Below is an example of using the OpenAI Python library to access Llama 3.1 70B on SCRP, with temperature set to 1 and top-p set to 0.5:
from openai import OpenAI

# Connect to SCRP's OpenAI-compatible API endpoint
client = OpenAI(
    base_url='https://scrp-chat.econ.cuhk.edu.hk/api',
    api_key='your_api_key_here',
)

response = client.chat.completions.create(
    model="hugging-quants/Meta-Llama-3.1-70B-Instruct-AWQ-INT4",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Where is Hong Kong?"},
    ],
    temperature=1,
    top_p=0.5,
)
print(response.choices[0].message.content)
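If you prefer to receive tokens as they are generated rather than waiting for the full reply, the library also supports streaming. A minimal sketch, reusing the client defined above:

# Streaming variant of the example above: print tokens as they arrive.
stream = client.chat.completions.create(
    model="hugging-quants/Meta-Llama-3.1-70B-Instruct-AWQ-INT4",
    messages=[{"role": "user", "content": "Where is Hong Kong?"}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:  # some chunks carry no text
        print(delta, end="", flush=True)
print()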
You can also interact with the models through HTTP requests:
curl https://scrp-chat.econ.cuhk.edu.hk/ollama/api/generate \
  -H "Authorization: Bearer your_api_key_here" \
  -H 'Content-Type: application/json' \
  -d '{
        "model": "hugging-quants/Meta-Llama-3.1-70B-Instruct-AWQ-INT4",
        "prompt": "Where is Hong Kong?",
        "stream": false
      }'
Below is an example of using the OpenAI Python library to access Llama 3.2 11B Vision. The image is referenced by its URL in the image_url field. For accurate reading of the content in the image, a low temperature is recommended.
from openai import OpenAI

client = OpenAI(
    base_url='https://scrp-chat.econ.cuhk.edu.hk/api',
    api_key='your_api_key_here',
)

response = client.chat.completions.create(
    model="neuralmagic/Llama-3.2-11B-Vision-Instruct-FP8-dynamic",
    messages=[
        {"role": "user", "content": [
            {"type": "text", "text": "Describe this image:"},
            {"type": "image_url",
             "image_url": {"url": "https://www.example.com/my-picture.png"}},
        ]},
    ],
    temperature=0.1,  # low temperature for accurate reading of the image
)
print(response.choices[0].message.content)
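If the image is a local file rather than a URL, the OpenAI vision message format also accepts a base64 data URL. A sketch reusing the client above (the file name is illustrative):

import base64

# Embed a local image as a base64 data URL (the file name is hypothetical).
with open("my-picture.png", "rb") as f:
    b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="neuralmagic/Llama-3.2-11B-Vision-Instruct-FP8-dynamic",
    messages=[
        {"role": "user", "content": [
            {"type": "text", "text": "Describe this image:"},
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{b64}"}},
        ]},
    ],
    temperature=0.1,
)
print(response.choices[0].message.content)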