Setting Up Your First LLM Model

Learn how to configure and deploy an LLM inference endpoint on the Syntera Platform

~15 minutes

Prerequisites

  • A Syntera Platform account with appropriate permissions
  • Access to the Syntera LLM Inference Service
  • Understanding of your inference requirements (latency, throughput, etc.)
  • Budget approval for LLM inference costs (if applicable)

LLM Inference Endpoint Setup Workflow

Browse Model Catalog
Configure Inference Endpoint
Create Dedicated Endpoint
Test & Monitor

Phase 1: Browse Model Catalog

Begin by browsing the model catalog to find the right LLM for your application needs.

View Model Cards

Browse model cards with key specifications to quickly compare available models

Look for context window size, supported languages, and inference speed

Filter by Criteria

Filter models by size, capabilities, cost, and language support

Narrow down options based on your specific requirements

Compare Performance

Review benchmark metrics to compare model performance

Compare throughput, latency, and quality metrics

Review Pricing

Check pricing and resource requirements for each model

Consider both deployment costs and per-token pricing

How to browse and select a model

  1. Click the "BROWSE MODEL CATALOG" button
    This opens the model catalog interface
  2. Use the filter panel to narrow down options
    Filter by model size, capabilities, and language support (a small filtering sketch follows the tip at the end of this phase)
  3. Review model cards for each candidate
    Compare specifications, benchmark scores, and pricing
  4. Select your desired model
    Click "Select model" to proceed to the configuration phase
Pro Tip: For production workloads, consider selecting a model that balances performance and cost efficiency. Smaller models are faster and cheaper but may not handle complex tasks as well as larger ones.
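
If you want to reason about the trade-off in the Pro Tip above outside the UI, you can apply the same filters to a handful of model cards in a few lines of code. The sketch below is illustrative only: the model names, context windows, prices, and latencies are made-up placeholders, not Syntera catalog data.

# Minimal sketch: filter and rank illustrative model cards the same way the
# catalog UI does. All model names and numbers below are made-up examples.
from dataclasses import dataclass

@dataclass
class ModelCard:
    name: str
    context_window: int        # tokens
    cost_per_1k_output: float  # USD per 1K output tokens
    latency_ms: float          # typical response latency

catalog = [
    ModelCard("large-general", 8192, 0.030, 900),
    ModelCard("medium-general", 8192, 0.012, 450),
    ModelCard("small-fast", 4096, 0.004, 180),
]

# Filter by a hard requirement (context window), then rank by a simple
# cost/latency trade-off, mirroring the "Filter by Criteria" step above.
candidates = [m for m in catalog if m.context_window >= 8000]
candidates.sort(key=lambda m: (m.cost_per_1k_output, m.latency_ms))

for m in candidates:
    print(f"{m.name}: ${m.cost_per_1k_output}/1K output tokens, {m.latency_ms} ms")

Ranking by cost first and latency second mirrors the "balance performance and cost" advice; swap the sort key if latency is your hard constraint.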

Phase 2: Configure Inference Endpoint

After selecting a model, configure the inference endpoint to match your specific requirements.

Core Configurations

Set Memory Allocation

Allocate sufficient memory for model and inference workloads

Select Base Model and Version

Choose specific model version for your deployment

Choose Hardware Tier

Select GPU type and count for optimal performance

Inference Parameters

Temperature

Adjust temperature to control randomness in outputs

Top-p / Top-k Sampling

Configure sampling methods for text generation (see the sampling sketch below)

Maximum Token Length

Set the maximum token length for generated responses
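
Temperature and top-p both shape how the next token is drawn from the model's output distribution: temperature rescales the logits before the softmax, and top-p (nucleus sampling) keeps only the most probable tokens whose cumulative probability reaches p. The toy sketch below illustrates that mechanic with made-up logits; it is not the platform's actual sampling code.

# Toy illustration of temperature and top-p (nucleus) sampling over made-up logits.
import math
import random

def sample_token(logits, temperature=0.7, top_p=0.95):
    # Temperature: divide logits before softmax; values < 1 sharpen the
    # distribution, values > 1 flatten it (more randomness).
    scaled = [l / temperature for l in logits]
    max_scaled = max(scaled)
    exps = [math.exp(s - max_scaled) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Top-p: keep the most probable tokens until their cumulative mass reaches top_p.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break

    # Renormalise over the kept tokens and draw one index at random.
    kept_total = sum(probs[i] for i in kept)
    weights = [probs[i] / kept_total for i in kept]
    return random.choices(kept, weights=weights)[0]

# Four candidate tokens with arbitrary logits; prints the index of the sampled token.
print(sample_token([2.0, 1.0, 0.5, -1.0]))

Lower temperature plus a tighter top-p makes outputs more deterministic, which usually suits extraction-style tasks; higher values suit open-ended generation.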

Advanced Settings

Context Window Size

Define the context window for model processing

Response Caching Policy

Configure caching to improve performance and reduce costs

Configuration Preview

Endpoint Configuration
# Inference Endpoint Configuration
name: my-first-llm-endpoint
model:
  base: gpt-4
  version: 2023-12-01
hardware:
  gpu_type: NVIDIA_A100
  gpu_count: 1
  memory: 32GB
parameters:
  temperature: 0.7
  max_tokens: 2048
  top_p: 0.95
  context_window: 8192
  response_cache:
    enabled: true
    ttl_seconds: 3600
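
If you keep this configuration in version control, a quick sanity check before submitting can catch missing fields. A minimal sketch, assuming the preview above is saved as endpoint.yaml and PyYAML is installed; the required-field list simply mirrors the settings this guide uses.

# Sketch: load the endpoint configuration shown above and check the fields
# used in this guide. Assumes the YAML is saved as endpoint.yaml (PyYAML required).
import yaml

REQUIRED = {
    "model": ["base", "version"],
    "hardware": ["gpu_type", "gpu_count", "memory"],
    "parameters": ["temperature", "max_tokens", "top_p", "context_window"],
}

with open("endpoint.yaml") as f:
    config = yaml.safe_load(f)

missing = [
    f"{section}.{key}"
    for section, keys in REQUIRED.items()
    for key in keys
    if key not in config.get(section, {})
]

if missing:
    raise ValueError(f"Missing configuration fields: {missing}")

print(f"Configuration '{config['name']}' looks complete.")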

Steps to Configure Your Endpoint

  1. Set memory allocation
    Choose sufficient memory based on model size and expected traffic (a rough sizing sketch follows the note below)
  2. Select specific model version
    Choose between available versions of your selected model
  3. Configure hardware tier
    Select GPU type and count for optimal performance
  4. Adjust inference parameters
    Set temperature, sampling methods, and token limitations
  5. Configure advanced settings
    Set context window size and caching policies
  6. Click "Submit Configuration"
    Proceed to the endpoint creation review screen
Important Note: Hardware configurations directly impact both performance and cost. Over-provisioning will increase costs unnecessarily, while under-provisioning may lead to poor performance or request failures.
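
For step 1 (memory allocation), a rough rule of thumb, not a Syntera sizing formula, is about 2 bytes per parameter for 16-bit weights plus headroom for the KV cache and runtime overhead. A back-of-the-envelope check:

# Back-of-the-envelope memory estimate for step 1 above. The 2-bytes-per-parameter
# figure assumes fp16/bf16 weights; the 30% headroom for KV cache and runtime
# overhead is a rough rule of thumb, not a platform formula.
def estimate_memory_gb(params_billions: float, bytes_per_param: int = 2,
                       overhead_fraction: float = 0.3) -> float:
    weights_gb = params_billions * bytes_per_param  # 1B params × 2 bytes ≈ 2 GB
    return weights_gb * (1 + overhead_fraction)

for size in (7, 13, 70):
    print(f"{size}B parameters → roughly {estimate_memory_gb(size):.0f} GB")

Always cross-check against the model card's stated hardware requirements before committing to a tier.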

Phase 3: Create Dedicated Endpoint

Review your configuration, accept pricing terms, and create your dedicated inference endpoint.

Configuration Summary

Model: GPT-4 (2023-12-01)
Hardware: 1× NVIDIA A100 GPU
Memory: 32GB
Max Tokens: 2,048
Temperature: 0.7
Context Window: 8,192 tokens

Pricing and Usage Terms

Hardware (A100 GPU): $2.50/hour
Input Tokens: $0.01/1K tokens
Output Tokens: $0.03/1K tokens
Estimated Monthly Cost: $1,800 - $3,500*

* Based on average usage of 5,000 requests/day with 1,000 tokens per request
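
You can rebuild the footnote's estimate with simple arithmetic. In the sketch below, only the hardware rate and per-token prices come from the table; the 80/20 input/output split is an assumption, and the total is sensitive to that split, which is why the table quotes a range rather than a single figure.

# Rebuild the monthly estimate from the rates above. The request volume matches
# the footnote; the 80/20 input/output token split is an illustrative assumption.
GPU_RATE = 2.50              # $/hour for the A100, from the table
INPUT_PRICE = 0.01 / 1000    # $ per input token
OUTPUT_PRICE = 0.03 / 1000   # $ per output token

requests_per_day = 5_000
tokens_per_request = 1_000
input_share = 0.8            # assumed share of tokens that are prompt (input) tokens

monthly_tokens = requests_per_day * 30 * tokens_per_request
input_tokens = monthly_tokens * input_share
output_tokens = monthly_tokens - input_tokens

hardware_cost = GPU_RATE * 24 * 30
token_cost = input_tokens * INPUT_PRICE + output_tokens * OUTPUT_PRICE

print(f"Hardware: ${hardware_cost:,.0f}/month")                # $1,800
print(f"Tokens:   ${token_cost:,.0f}/month under this split")  # $2,100
print(f"Total:    ${hardware_cost + token_cost:,.0f}/month")   # $3,900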

Click "Create Endpoint"

Initiate the endpoint creation process

Wait for Provisioning

System provisions resources (typically 5-10 minutes)

Endpoint Ready

Receive endpoint details and API keys

Endpoint Resources

System Provisions

Dedicated hardware, networking, and scaling infrastructure

Endpoint Details & API Keys

Secure credentials for accessing your endpoint

Example API Integration

Python
import requests
import json

# Your endpoint details
API_URL = "https://api.syntera.ai/v1/inference/endpoints/your-endpoint-id"
API_KEY = "your-api-key" 

headers = {
    "Content-Type": "application/json",
    "Authorization": f"Bearer {API_KEY}"
}

# Example request
data = {
    "prompt": "Explain the concept of machine learning in simple terms.",
    "max_tokens": 500,
    "temperature": 0.7
}

response = requests.post(API_URL, headers=headers, data=json.dumps(data))

# Process the response
if response.status_code == 200:
    result = response.json()
    print(result["generated_text"])
else:
    print(f"Error: {response.status_code}")
    print(response.text)
Note: Store your API keys securely and never expose them in client-side code. Use environment variables or secure secret management services to handle credentials.
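
One way to follow the note above is to read the key from an environment variable rather than hard-coding it. A minimal sketch; the variable name SYNTERA_API_KEY is just an example, not a platform convention.

# Read the API key from the environment instead of embedding it in source code.
# SYNTERA_API_KEY is an example name; use whatever your deployment standardises on.
import os

API_KEY = os.environ.get("SYNTERA_API_KEY")
if not API_KEY:
    raise RuntimeError("SYNTERA_API_KEY is not set; export it before running this script.")

headers = {
    "Content-Type": "application/json",
    "Authorization": f"Bearer {API_KEY}",
}

In CI or container deployments, inject the variable from your secret manager at runtime rather than baking it into the image.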

Phase 4: Test and Monitor Your Endpoint

Test your newly created endpoint and set up monitoring to ensure optimal performance.

Test Your Endpoint

Send Test Prompts

Quickly test your endpoint with sample prompts

Access Interactive Playground

Experiment with different parameters and prompts

View Responses and Metrics

Analyze response quality, latency, and token usage (see the timing sketch after this list)

Review Comprehensive Documentation

Access detailed guides and reference materials

Copy Example Code

Get integration examples in multiple languages
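
To put numbers behind the metrics step above, you can time a few test calls yourself. The sketch below reuses the endpoint URL and request shape from the earlier Python example; the latency figure is plain client-side timing, and only generated_text is assumed to be present in the response body.

# Sketch: send a couple of test prompts and record client-side latency.
# Reuses the endpoint URL and payload shape from the earlier example.
import os
import time
import requests

API_URL = "https://api.syntera.ai/v1/inference/endpoints/your-endpoint-id"
headers = {
    "Content-Type": "application/json",
    "Authorization": f"Bearer {os.environ['SYNTERA_API_KEY']}",
}

test_prompts = [
    "Explain the concept of machine learning in simple terms.",
    "Summarize the benefits of response caching in one sentence.",
]

for prompt in test_prompts:
    start = time.perf_counter()
    resp = requests.post(API_URL, headers=headers,
                         json={"prompt": prompt, "max_tokens": 200, "temperature": 0.7},
                         timeout=60)
    latency_ms = (time.perf_counter() - start) * 1000
    resp.raise_for_status()
    text = resp.json().get("generated_text", "")
    print(f"{latency_ms:.0f} ms | {len(text)} characters returned")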

Monitor Production Usage

Monitor Performance Metrics

Track key metrics like response time and throughput

Analyze Usage Patterns

Identify peak usage times and common request types

Track Request Volume and Latency

Monitor traffic patterns and response times

Review Billing and Projections

Track costs and forecast future expenditures

Manage API Security Controls

Set rate limits and configure access controls

Test Console

Input Prompt

Send Request

Response

Machine learning is like teaching a computer to learn from examples instead of explicitly programming it with specific instructions.

Imagine you want to teach a child to recognize cats. You wouldn't explain all the detailed features like "four legs, fur, whiskers, etc." Instead, you'd show the child many pictures of cats, and eventually, they'd learn to identify cats on their own.

Similarly, with machine learning:

  1. You give the computer lots of examples (data)
  2. The computer finds patterns in this data
  3. After "learning" from these patterns, it can make predictions or decisions about new data it hasn't seen before

For example, a machine learning system can learn to recognize spam emails after seeing thousands of examples of spam and non-spam messages, without being explicitly programmed with rules about what makes an email "spammy."

Tokens: Input: 9 / Output: 174
Latency: 842ms
Cost: $0.0052

Monitoring Dashboard Preview

Endpoint: my-first-llm-endpoint (last 24 hours)

Requests per Minute: 47.3 (12.4% change)
Average Latency (ms): 842 (5.2% change)
Tokens per Request: 183 (0.8% change)
Error Rate: 0.3% (0.1% change)
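
Before wiring up the alerts described in the next steps, a simple threshold check against metrics like the ones above can catch obvious regressions. In the sketch below the observed values mirror the dashboard preview and the thresholds are illustrative choices; in practice you would populate the dictionary from your own request logs or exported metrics.

# Sketch: compare observed metrics against simple alert thresholds. The observed
# values mirror the dashboard preview above; the thresholds are illustrative.
observed = {
    "requests_per_minute": 47.3,
    "avg_latency_ms": 842,
    "tokens_per_request": 183,
    "error_rate_pct": 0.3,
}

thresholds = {
    "avg_latency_ms": 1500,   # alert if average latency exceeds 1.5 s
    "error_rate_pct": 1.0,    # alert if more than 1% of requests fail
}

alerts = [
    f"{metric} = {observed[metric]} exceeds threshold {limit}"
    for metric, limit in thresholds.items()
    if observed[metric] > limit
]

print("\n".join(alerts) if alerts else "All monitored metrics within thresholds.")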

Next Steps for Production

  1. Configure alerts for key metrics
    Set up notifications for error spikes, latency issues, or high costs
  2. Integrate with your application
    Implement API calls to your endpoint from your application
  3. Establish monitoring routine
    Set up regular checks of endpoint performance
  4. Optimize based on usage patterns
    Adjust configurations as you observe real-world usage
Congratulations! You've successfully set up your first LLM inference endpoint. Monitor its performance, gather user feedback, and optimize configurations as needed to ensure the best experience for your users.

Troubleshooting

If endpoint creation fails

  • Check resource quota: Ensure your account has sufficient quota for the selected hardware tier.
  • Verify configuration: Review your endpoint configuration for any invalid settings.
  • Contact support: If issues persist, contact Syntera support with your endpoint configuration details.

If latency is higher than expected

  • Check hardware tier: Consider upgrading to a more powerful GPU tier.
  • Reduce context window: Large context windows require more processing time.
  • Enable caching: If your workload has repetitive queries, enable response caching.
  • Monitor concurrent requests: High traffic can cause queuing and increased latency.

If API requests fail

  • Verify API key: Ensure you're using the correct API key and it hasn't expired.
  • Check endpoint URL: Confirm you're using the correct endpoint URL in your requests.
  • Review request format: Ensure your API request follows the correct format and parameters.
  • Network connectivity: Check if there are any network issues or firewall restrictions (a retry sketch for transient failures follows below).

If costs are higher than expected

  • Set budget alerts: Configure alerts to notify you when costs exceed expected thresholds.
  • Monitor usage patterns: Regularly review your usage dashboard for unexpected spikes.
  • Optimize token usage: Review your prompts to reduce unnecessary tokens in requests.
  • Consider auto-scaling: Configure auto-scaling to reduce resources during low-traffic periods.
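
For the transient API errors in the list above (rate limiting, brief network issues, 5xx responses), a retry with exponential backoff is a common client-side mitigation. A minimal sketch; the set of retryable status codes and the delay schedule are judgment calls, not Syntera requirements.

# Sketch: retry transient failures with exponential backoff. The retryable status
# codes and delay values are illustrative, not platform-mandated settings.
import time
import requests

RETRYABLE = {429, 500, 502, 503, 504}

def post_with_retries(url, headers, payload, max_attempts=4):
    delay = 1.0
    for attempt in range(1, max_attempts + 1):
        try:
            resp = requests.post(url, headers=headers, json=payload, timeout=60)
            if resp.status_code not in RETRYABLE:
                return resp             # success or a non-retryable error
        except requests.RequestException:
            pass                        # network hiccup: fall through to retry
        if attempt < max_attempts:
            time.sleep(delay)
            delay *= 2                  # 1s, 2s, 4s, ...
    raise RuntimeError(f"Request failed after {max_attempts} attempts")

Call it with the same URL, headers, and payload used in the earlier integration example.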