Setting Up Your First LLM Model

Learn how to configure and deploy an LLM inference endpoint on the Syntera Platform

~15 minutes

Prerequisites

  • A Syntera Platform account with appropriate permissions
  • Access to the Syntera LLM Inference Service
  • Understanding of your inference requirements (latency, throughput, etc.)
  • Budget approval for LLM inference costs (if applicable)

LLM Inference Endpoint Setup Workflow

Browse Model Catalog
Configure Inference Endpoint
Create Dedicated Endpoint
Test & Monitor

Phase 1: Browse Model Catalog

Begin by browsing the model catalog to find the right LLM for your application needs.

View Model Cards

Browse model cards with key specifications to quickly compare available models

Look for context window size, supported languages, and inference speed

Filter by Criteria

Filter models by size, capabilities, cost, and language support

Narrow down options based on your specific requirements

Compare Performance

Review benchmark metrics to compare model performance

Compare throughput, latency, and quality metrics

Review Pricing

Check pricing and resource requirements for each model

Consider both deployment costs and per-token pricing

How to browse and select a model

  1. Click the "BROWSE MODEL CATALOG" button
    This opens the model catalog interface
  2. Use the filter panel to narrow down options
    Filter by model size, capabilities, and language support (a small filtering sketch follows the tip at the end of this phase)
  3. Review model cards for each candidate
    Compare specifications, benchmark scores, and pricing
  4. Select your desired model
    Click "Select model" to proceed to the configuration phase
Pro Tip: For production workloads, consider selecting a model that balances performance and cost efficiency. Smaller models are faster and cheaper but may not handle complex tasks as well as larger ones.
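
If you want to reason about the trade-off in the Pro Tip above outside the UI, you can apply the same filters to a handful of model cards in a few lines of code. The sketch below is illustrative only: the model names, context windows, prices, and latencies are made-up placeholders, not Syntera catalog data.

# Minimal sketch: filter and rank illustrative model cards the same way the
# catalog UI does. All model names and numbers below are made-up examples.
from dataclasses import dataclass

@dataclass
class ModelCard:
    name: str
    context_window: int        # tokens
    cost_per_1k_output: float  # USD per 1K output tokens
    latency_ms: float          # typical response latency

catalog = [
    ModelCard("large-general", 8192, 0.030, 900),
    ModelCard("medium-general", 8192, 0.012, 450),
    ModelCard("small-fast", 4096, 0.004, 180),
]

# Filter by a hard requirement (context window), then rank by a simple
# cost/latency trade-off, mirroring the "Filter by Criteria" step above.
candidates = [m for m in catalog if m.context_window >= 8000]
candidates.sort(key=lambda m: (m.cost_per_1k_output, m.latency_ms))

for m in candidates:
    print(f"{m.name}: ${m.cost_per_1k_output}/1K output tokens, {m.latency_ms} ms")

Ranking by cost first and latency second mirrors the "balance performance and cost" advice; swap the sort key if latency is your hard constraint.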

Phase 2: Configure Inference Endpoint

After selecting a model, configure the inference endpoint to match your specific requirements.

Core Configurations

Set Memory Allocation

Allocate sufficient memory for model and inference workloads

Select Base Model and Version

Choose specific model version for your deployment

Choose Hardware Tier

Select GPU type and count for optimal performance

Inference Parameters

Temperature

Adjust temperature to control randomness in outputs

Top-p / Top-k Sampling

Configure sampling methods for text generation (see the sampling sketch below)

Maximum Token Length

Set the maximum token length for generated responses
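
Temperature and top-p both shape how the next token is drawn from the model's output distribution: temperature rescales the logits before the softmax, and top-p (nucleus sampling) keeps only the most probable tokens whose cumulative probability reaches p. The toy sketch below illustrates that mechanic with made-up logits; it is not the platform's actual sampling code.

# Toy illustration of temperature and top-p (nucleus) sampling over made-up logits.
import math
import random

def sample_token(logits, temperature=0.7, top_p=0.95):
    # Temperature: divide logits before softmax; values < 1 sharpen the
    # distribution, values > 1 flatten it (more randomness).
    scaled = [l / temperature for l in logits]
    max_scaled = max(scaled)
    exps = [math.exp(s - max_scaled) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Top-p: keep the most probable tokens until their cumulative mass reaches top_p.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break

    # Renormalise over the kept tokens and draw one index at random.
    kept_total = sum(probs[i] for i in kept)
    weights = [probs[i] / kept_total for i in kept]
    return random.choices(kept, weights=weights)[0]

# Four candidate tokens with arbitrary logits; prints the index of the sampled token.
print(sample_token([2.0, 1.0, 0.5, -1.0]))

Lower temperature plus a tighter top-p makes outputs more deterministic, which usually suits extraction-style tasks; higher values suit open-ended generation.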

Advanced Settings

Context Window Size

Define the context window for model processing

Response Caching Policy

Configure caching to improve performance and reduce costs

Configuration Preview

Endpoint Configuration
# Inference Endpoint Configuration
name: my-first-llm-endpoint
model:
  base: gpt-4
  version: 2023-12-01
hardware:
  gpu_type: NVIDIA_A100
  gpu_count: 1
  memory: 32GB
parameters:
  temperature: 0.7
  max_tokens: 2048
  top_p: 0.95
  context_window: 8192
  response_cache:
    enabled: true
    ttl_seconds: 3600
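
If you keep this configuration in version control, a quick sanity check before submitting can catch missing fields. A minimal sketch, assuming the preview above is saved as endpoint.yaml and PyYAML is installed; the required-field list simply mirrors the settings this guide uses.

# Sketch: load the endpoint configuration shown above and check the fields
# used in this guide. Assumes the YAML is saved as endpoint.yaml (PyYAML required).
import yaml

REQUIRED = {
    "model": ["base", "version"],
    "hardware": ["gpu_type", "gpu_count", "memory"],
    "parameters": ["temperature", "max_tokens", "top_p", "context_window"],
}

with open("endpoint.yaml") as f:
    config = yaml.safe_load(f)

missing = [
    f"{section}.{key}"
    for section, keys in REQUIRED.items()
    for key in keys
    if key not in config.get(section, {})
]

if missing:
    raise ValueError(f"Missing configuration fields: {missing}")

print(f"Configuration '{config['name']}' looks complete.")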

Steps to Configure Your Endpoint

  1. Set memory allocation
    Choose sufficient memory based on model size and expected traffic (a rough sizing sketch follows the note below)
  2. Select specific model version
    Choose between available versions of your selected model
  3. Configure hardware tier
    Select GPU type and count for optimal performance
  4. Adjust inference parameters
    Set temperature, sampling methods, and token limitations
  5. Configure advanced settings
    Set context window size and caching policies
  6. Click "Submit Configuration"
    Proceed to the endpoint creation review screen
Important Note: Hardware configurations directly impact both performance and cost. Over-provisioning will increase costs unnecessarily, while under-provisioning may lead to poor performance or request failures.
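
For step 1 (memory allocation), a rough rule of thumb, not a Syntera sizing formula, is about 2 bytes per parameter for 16-bit weights plus headroom for the KV cache and runtime overhead. A back-of-the-envelope check:

# Back-of-the-envelope memory estimate for step 1 above. The 2-bytes-per-parameter
# figure assumes fp16/bf16 weights; the 30% headroom for KV cache and runtime
# overhead is a rough rule of thumb, not a platform formula.
def estimate_memory_gb(params_billions: float, bytes_per_param: int = 2,
                       overhead_fraction: float = 0.3) -> float:
    weights_gb = params_billions * bytes_per_param  # 1B params × 2 bytes ≈ 2 GB
    return weights_gb * (1 + overhead_fraction)

for size in (7, 13, 70):
    print(f"{size}B parameters → roughly {estimate_memory_gb(size):.0f} GB")

Always cross-check against the model card's stated hardware requirements before committing to a tier.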

Phase 3: Create Dedicated Endpoint

Review your configuration, accept pricing terms, and create your dedicated inference endpoint.

Configuration Summary

Model: GPT-4 (2023-12-01)
Hardware: 1× NVIDIA A100 GPU
Memory: 32GB
Max Tokens: 2,048
Temperature: 0.7
Context Window: 8,192 tokens

Pricing and Usage Terms

Hardware (A100 GPU): $2.50/hour
Input Tokens: $0.01/1K tokens
Output Tokens: $0.03/1K tokens
Estimated Monthly Cost: $1,800 - $3,500*

* Based on average usage of 5,000 requests/day with 1,000 tokens per request
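
You can rebuild the footnote's estimate with simple arithmetic. In the sketch below, only the hardware rate and per-token prices come from the table; the 80/20 input/output split is an assumption, and the total is sensitive to that split, which is why the table quotes a range rather than a single figure.

# Rebuild the monthly estimate from the rates above. The request volume matches
# the footnote; the 80/20 input/output token split is an illustrative assumption.
GPU_RATE = 2.50              # $/hour for the A100, from the table
INPUT_PRICE = 0.01 / 1000    # $ per input token
OUTPUT_PRICE = 0.03 / 1000   # $ per output token

requests_per_day = 5_000
tokens_per_request = 1_000
input_share = 0.8            # assumed share of tokens that are prompt (input) tokens

monthly_tokens = requests_per_day * 30 * tokens_per_request
input_tokens = monthly_tokens * input_share
output_tokens = monthly_tokens - input_tokens

hardware_cost = GPU_RATE * 24 * 30
token_cost = input_tokens * INPUT_PRICE + output_tokens * OUTPUT_PRICE

print(f"Hardware: ${hardware_cost:,.0f}/month")                # $1,800
print(f"Tokens:   ${token_cost:,.0f}/month under this split")  # $2,100
print(f"Total:    ${hardware_cost + token_cost:,.0f}/month")   # $3,900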

Click "Create Endpoint"

Initiate the endpoint creation process

Wait for Provisioning

System provisions resources (typically 5-10 minutes)

Endpoint Ready

Receive endpoint details and API keys

Endpoint Resources

System Provisions

Dedicated hardware, networking, and scaling infrastructure

Endpoint Details & API Keys

Secure credentials for accessing your endpoint

Example API Integration

Python
import requests
import json

# Your endpoint details
API_URL = "https://api.syntera.ai/v1/inference/endpoints/your-endpoint-id"
API_KEY = "your-api-key" 

headers = {
    "Content-Type": "application/json",
    "Authorization": f"Bearer {API_KEY}"
}

# Example request
data = {
    "prompt": "Explain the concept of machine learning in simple terms.",
    "max_tokens": 500,
    "temperature": 0.7
}

response = requests.post(API_URL, headers=headers, data=json.dumps(data))

# Process the response
if response.status_code == 200:
    result = response.json()
    print(result["generated_text"])
else:
    print(f"Error: {response.status_code}")
    print(response.text)
Note: Store your API keys securely and never expose them in client-side code. Use environment variables or secure secret management services to handle credentials.
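
One way to follow the note above is to read the key from an environment variable rather than hard-coding it. A minimal sketch; the variable name SYNTERA_API_KEY is just an example, not a platform convention.

# Read the API key from the environment instead of embedding it in source code.
# SYNTERA_API_KEY is an example name; use whatever your deployment standardises on.
import os

API_KEY = os.environ.get("SYNTERA_API_KEY")
if not API_KEY:
    raise RuntimeError("SYNTERA_API_KEY is not set; export it before running this script.")

headers = {
    "Content-Type": "application/json",
    "Authorization": f"Bearer {API_KEY}",
}

In CI or container deployments, inject the variable from your secret manager at runtime rather than baking it into the image.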

Phase 4: Test and Monitor Your Endpoint

Test your newly created endpoint and set up monitoring to ensure optimal performance.

Test Your Endpoint

Send Test Prompts

Quickly test your endpoint with sample prompts

Access Interactive Playground

Experiment with different parameters and prompts

View Responses and Metrics

Analyze response quality, latency, and token usage (see the timing sketch after this list)

Review Comprehensive Documentation

Access detailed guides and reference materials

Copy Example Code

Get integration examples in multiple languages
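
To put numbers behind the metrics step above, you can time a few test calls yourself. The sketch below reuses the endpoint URL and request shape from the earlier Python example; the latency figure is plain client-side timing, and only generated_text is assumed to be present in the response body.

# Sketch: send a couple of test prompts and record client-side latency.
# Reuses the endpoint URL and payload shape from the earlier example.
import os
import time
import requests

API_URL = "https://api.syntera.ai/v1/inference/endpoints/your-endpoint-id"
headers = {
    "Content-Type": "application/json",
    "Authorization": f"Bearer {os.environ['SYNTERA_API_KEY']}",
}

test_prompts = [
    "Explain the concept of machine learning in simple terms.",
    "Summarize the benefits of response caching in one sentence.",
]

for prompt in test_prompts:
    start = time.perf_counter()
    resp = requests.post(API_URL, headers=headers,
                         json={"prompt": prompt, "max_tokens": 200, "temperature": 0.7},
                         timeout=60)
    latency_ms = (time.perf_counter() - start) * 1000
    resp.raise_for_status()
    text = resp.json().get("generated_text", "")
    print(f"{latency_ms:.0f} ms | {len(text)} characters returned")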

Monitor Production Usage

Monitor Performance Metrics

Track key metrics like response time and throughput

Analyze Usage Patterns

Identify peak usage times and common request types

Track Request Volume and Latency

Monitor traffic patterns and response times

Review Billing and Projections

Track costs and forecast future expenditures

Manage API Security Controls

Set rate limits and configure access controls

Test Console

Input Prompt

Send Request

Response

Machine learning is like teaching a computer to learn from examples instead of explicitly programming it with specific instructions.

Imagine you want to teach a child to recognize cats. You wouldn't explain all the detailed features like "four legs, fur, whiskers, etc." Instead, you'd show the child many pictures of cats, and eventually, they'd learn to identify cats on their own.

Similarly, with machine learning:

  1. You give the computer lots of examples (data)
  2. The computer finds patterns in this data
  3. After "learning" from these patterns, it can make predictions or decisions about new data it hasn't seen before

For example, a machine learning system can learn to recognize spam emails after seeing thousands of examples of spam and non-spam messages, without being explicitly programmed with rules about what makes an email "spammy."

Tokens: Input: 9 / Output: 174
Latency: 842ms
Cost: $0.0052

Monitoring Dashboard Preview

Endpoint: my-first-llm-endpoint (last 24 hours)

Requests per Minute: 47.3 (12.4% change)
Average Latency (ms): 842 (5.2% change)
Tokens per Request: 183 (0.8% change)
Error Rate: 0.3% (0.1% change)
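
Before wiring up the alerts described in the next steps, a simple threshold check against metrics like the ones above can catch obvious regressions. In the sketch below the observed values mirror the dashboard preview and the thresholds are illustrative choices; in practice you would populate the dictionary from your own request logs or exported metrics.

# Sketch: compare observed metrics against simple alert thresholds. The observed
# values mirror the dashboard preview above; the thresholds are illustrative.
observed = {
    "requests_per_minute": 47.3,
    "avg_latency_ms": 842,
    "tokens_per_request": 183,
    "error_rate_pct": 0.3,
}

thresholds = {
    "avg_latency_ms": 1500,   # alert if average latency exceeds 1.5 s
    "error_rate_pct": 1.0,    # alert if more than 1% of requests fail
}

alerts = [
    f"{metric} = {observed[metric]} exceeds threshold {limit}"
    for metric, limit in thresholds.items()
    if observed[metric] > limit
]

print("\n".join(alerts) if alerts else "All monitored metrics within thresholds.")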

Next Steps for Production

  1. Configure alerts for key metrics
    Set up notifications for error spikes, latency issues, or high costs
  2. Integrate with your application
    Implement API calls to your endpoint from your application
  3. Establish monitoring routine
    Set up regular checks of endpoint performance
  4. Optimize based on usage patterns
    Adjust configurations as you observe real-world usage
Congratulations! You've successfully set up your first LLM inference endpoint. Monitor its performance, gather user feedback, and optimize configurations as needed to ensure the best experience for your users.

Troubleshooting

If endpoint creation fails

  • Check resource quota: Ensure your account has sufficient quota for the selected hardware tier.
  • Verify configuration: Review your endpoint configuration for any invalid settings.
  • Contact support: If issues persist, contact Syntera support with your endpoint configuration details.

If latency is higher than expected

  • Check hardware tier: Consider upgrading to a more powerful GPU tier.
  • Reduce context window: Large context windows require more processing time.
  • Enable caching: If your workload has repetitive queries, enable response caching.
  • Monitor concurrent requests: High traffic can cause queuing and increased latency.

If API requests fail

  • Verify API key: Ensure you're using the correct API key and it hasn't expired.
  • Check endpoint URL: Confirm you're using the correct endpoint URL in your requests.
  • Review request format: Ensure your API request follows the correct format and parameters.
  • Network connectivity: Check if there are any network issues or firewall restrictions (a retry sketch for transient failures follows below).

If costs are higher than expected

  • Set budget alerts: Configure alerts to notify you when costs exceed expected thresholds.
  • Monitor usage patterns: Regularly review your usage dashboard for unexpected spikes.
  • Optimize token usage: Review your prompts to reduce unnecessary tokens in requests.
  • Consider auto-scaling: Configure auto-scaling to reduce resources during low-traffic periods.
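
For the transient API errors in the list above (rate limiting, brief network issues, 5xx responses), a retry with exponential backoff is a common client-side mitigation. A minimal sketch; the set of retryable status codes and the delay schedule are judgment calls, not Syntera requirements.

# Sketch: retry transient failures with exponential backoff. The retryable status
# codes and delay values are illustrative, not platform-mandated settings.
import time
import requests

RETRYABLE = {429, 500, 502, 503, 504}

def post_with_retries(url, headers, payload, max_attempts=4):
    delay = 1.0
    for attempt in range(1, max_attempts + 1):
        try:
            resp = requests.post(url, headers=headers, json=payload, timeout=60)
            if resp.status_code not in RETRYABLE:
                return resp             # success or a non-retryable error
        except requests.RequestException:
            pass                        # network hiccup: fall through to retry
        if attempt < max_attempts:
            time.sleep(delay)
            delay *= 2                  # 1s, 2s, 4s, ...
    raise RuntimeError(f"Request failed after {max_attempts} attempts")

Call it with the same URL, headers, and payload used in the earlier integration example.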