SCRP provides access to locally hosted, state-of-the-art generative AI models. Information passed to the models never leaves the cluster, making it suitable for processing proprietary data.

Web Access

  1. Log in to https://scrp-chat.econ.cuhk.edu.hk with your SCRP credentials.
  2. If necessary, choose a model at the top. The default model is Llama 3.1 70B.
  3. Files can be uploaded by clicking the + button on the left of the dialog box at the bottom. The models can parse many types of files, including plain text, Excel spreadsheets, CSVs and PDFs.

API Access

SCRP provides an OpenAI-compatible API. To access the API,

  1. Log in to https://scrp-chat.econ.cuhk.edu.hk with your SCRP credentials.
  2. Click on the user avatar in the top-right corner, then click Settings.
  3. In Account, click Show next to ‘API Keys’.
  4. Click + Create new secret key.
  5. Copy the created API key and provide it along with the full model name when accessing the API through HTTP requests or the OpenAI Python library. See the examples below.
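
To check that your key works, you can list the models visible to your account. Below is a minimal sketch using the OpenAI Python library; it assumes the server exposes the standard OpenAI-compatible model-listing endpoint under the same base URL used in the examples later in this page:

from openai import OpenAI

# Point the client at SCRP's OpenAI-compatible endpoint.
client = OpenAI(
    base_url='https://scrp-chat.econ.cuhk.edu.hk/api',
    api_key='your_api_key_here',
)

# List the models visible to your account (standard OpenAI-compatible call).
for model in client.models.list():
    print(model.id)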

Available Models

SCRP currently provides access to the following models:

Model                     Max. Throughput   Context Length   Quantization   Full Model Name
Llama 3.1 70B (default)   60 token/s        95K              AWQ INT4       hugging-quants/Meta-Llama-3.1-70B-Instruct-AWQ-INT4
Llama 3.1 405B            5 token/s         128K             Q4_K_M         llama3.1:405b-instruct-q4_K_M
Llama 3.2 1B              200 token/s       95K              FP16           meta-llama/Llama-3.2-1B-Instruct
Llama 3.2 11B Vision      50 token/s        300              FP8            neuralmagic/Llama-3.2-11B-Vision-Instruct-FP8-dynamic

Larger models are more capable but run slower. The difference is particularly noticeable in fact recall, mathematics and logical reasoning. A comparison of the models’ performance can be found here.

Tips for Getting the Best Results

  • Models can hallucinate. During text generation, a Large Language Model (LLM) probabilistically chooses the most likely next word, so it can generate convincing-looking but completely fabricated content.
  • When formulating your prompt, be specific about what you are asking for. For example, if you believe the answer is uncertain, tell the model so; otherwise it will likely provide a wrong answer with confidence.
  • Asking the models for concise output will reduce output quality. The models do not have the ability to think privately, so asking them for concise output is equivalent to asking them not to think in detail. Instead, ask them to “think step-by-step”, then output a concise answer at the end (see the example after this list).
  • Pure text is preferred to formatted text. For example, if you were to provide the model with your manuscript for editing, provide the raw LaTeX instead of the generated PDF file.
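
As an illustration of the “think step-by-step” tip above, the sketch below asks the model to reason first and finish with a one-line answer. The instruction wording is only a suggestion, not a required format:

from openai import OpenAI

client = OpenAI(
    base_url='https://scrp-chat.econ.cuhk.edu.hk/api',
    api_key='your_api_key_here',
)

# Ask for step-by-step reasoning, then a concise final answer on one line.
response = client.chat.completions.create(
    model="hugging-quants/Meta-Llama-3.1-70B-Instruct-AWQ-INT4",
    messages=[
        {"role": "system",
         "content": "Think step-by-step, then end with a single line "
                    "starting with 'Answer:' that gives a concise answer."},
        {"role": "user", "content": "Is 1019 a prime number?"},
    ],
)
print(response.choices[0].message.content)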

Text Generation Settings

Below is a list of common settings for text generation. For web access, the settings are shown when you click the control panel icon next to your avatar in the top-right corner.

  • Temperature: a lower temperature sharpens the distribution from which the model samples the next word. Setting the temperature to 0 makes the model’s output deterministic. Exposed as temperature in the OpenAI API.
  • Top K: only sample from the k most likely words. Not available in the OpenAI API.
  • Top P: only sample from the smallest set of words whose cumulative probability reaches p. Exposed as top_p in the OpenAI API.
  • Frequency Penalty: a higher penalty lowers the probability of sampling words that have already been generated. Exposed as frequency_penalty in the OpenAI API. An example of passing these settings is shown below.
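
For API access, these settings are passed directly to the chat completion call. Below is a minimal sketch; note that the top_k pass-through via extra_body is an assumption that depends on the backend accepting non-standard parameters, whereas temperature, top_p and frequency_penalty are standard OpenAI parameters:

from openai import OpenAI

client = OpenAI(
    base_url='https://scrp-chat.econ.cuhk.edu.hk/api',
    api_key='your_api_key_here',
)

response = client.chat.completions.create(
    model="hugging-quants/Meta-Llama-3.1-70B-Instruct-AWQ-INT4",
    messages=[{"role": "user", "content": "Name three uses of panel data."}],
    temperature=0.7,        # lower values sharpen the sampling distribution
    top_p=0.9,              # keep only the top 90% cumulative probability mass
    frequency_penalty=0.5,  # discourage repeating words already generated
    # Top K is not part of the OpenAI API; extra_body forwards it to the
    # backend, assuming the server accepts a top_k parameter.
    extra_body={"top_k": 40},
)
print(response.choices[0].message.content)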

Examples

Below is an example of using the OpenAI Python Library to access Llama 3.1 70B on SCRP, with temperature set to 1 and top-p set to 0.5:

from openai import OpenAI

# Point the client at SCRP's OpenAI-compatible endpoint.
client = OpenAI(
    base_url='https://scrp-chat.econ.cuhk.edu.hk/api',
    api_key='your_api_key_here',
)

# Use the full model name from the Available Models table above.
response = client.chat.completions.create(
    model="hugging-quants/Meta-Llama-3.1-70B-Instruct-AWQ-INT4",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Where is Hong Kong?"},
    ],
    temperature=1,
    top_p=0.5,
)
print(response.choices[0].message.content)
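
Long responses can also be streamed token-by-token with the same library, assuming the endpoint supports streaming as the OpenAI API does. A minimal sketch:

from openai import OpenAI

client = OpenAI(
    base_url='https://scrp-chat.econ.cuhk.edu.hk/api',
    api_key='your_api_key_here',
)

# stream=True yields chunks as the model generates them.
stream = client.chat.completions.create(
    model="hugging-quants/Meta-Llama-3.1-70B-Instruct-AWQ-INT4",
    messages=[{"role": "user", "content": "Where is Hong Kong?"}],
    stream=True,
)
for chunk in stream:
    # Each chunk carries an incremental piece of the reply (may be empty).
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()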

You can also interact with the models through an HTTP request:

curl https://scrp-chat.econ.cuhk.edu.hk/ollama/api/generate \
  -H 'Authorization: Bearer your_api_key_here' \
  -H 'Content-Type: application/json' \
  -d '{ "model": "hugging-quants/Meta-Llama-3.1-70B-Instruct-AWQ-INT4",
        "prompt": "Where is Hong Kong?",
        "stream": false }'

Below is an example of using the OpenAI Python Library to access Llama 3.2 11B Vision. The image is referenced by its URL in the image_url field. For accurate reading of the content in the image, a low temperature is recommended.

from openai import OpenAI

client = OpenAI(
    base_url='https://scrp-chat.econ.cuhk.edu.hk/api',
    api_key='your_api_key_here',
)

# Vision models take a list of content parts: text plus one or more images.
response = client.chat.completions.create(
    model="neuralmagic/Llama-3.2-11B-Vision-Instruct-FP8-dynamic",
    messages=[
        {"role": "user", "content": [
            {"type": "text", "text": "Describe this image:"},
            {"type": "image_url",
             "image_url": {"url": "https://www.example.com/my-picture.png"}},
        ]},
    ],
    # A low temperature helps the model read the image content accurately.
    temperature=0.1,
)
print(response.choices[0].message.content)
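
If the image is a local file rather than a URL, it can be embedded as a base64 data URL, assuming the endpoint accepts data URLs in image_url as the OpenAI API does. A sketch, where my-picture.png is a placeholder path:

import base64

from openai import OpenAI

client = OpenAI(
    base_url='https://scrp-chat.econ.cuhk.edu.hk/api',
    api_key='your_api_key_here',
)

# Read a local image and encode it as a base64 data URL.
with open('my-picture.png', 'rb') as f:
    b64 = base64.b64encode(f.read()).decode('ascii')

response = client.chat.completions.create(
    model="neuralmagic/Llama-3.2-11B-Vision-Instruct-FP8-dynamic",
    messages=[
        {"role": "user", "content": [
            {"type": "text", "text": "Describe this image:"},
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{b64}"}},
        ]},
    ],
    temperature=0.1,
)
print(response.choices[0].message.content)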