Setting Up Your First LLM Model
Learn how to configure and deploy an LLM inference endpoint on the Syntera Platform
Prerequisites
- A Syntera Platform account with appropriate permissions
- Access to the Syntera LLM Inference Service
- Understanding of your inference requirements (latency, throughput, etc.)
- Budget approval for LLM inference costs (if applicable)
LLM Inference Endpoint Setup Workflow
Phase 1: Browse Model Catalog
Begin by browsing the model catalog to find the right LLM for your application needs.
- View Model Cards: Browse model cards with key specifications to quickly compare available models.
- Filter by Criteria: Filter models by size, capabilities, cost, and language support.
- Compare Performance: Review benchmark metrics to compare model performance.
- Review Pricing: Check pricing and resource requirements for each model.
How to browse and select a model

1. Click the "BROWSE MODEL CATALOG" button. This opens the model catalog interface.
2. Use the filter panel to narrow down options. Filter by model size, capabilities, and language support.
3. Review model cards for each candidate. Compare specifications, benchmark scores, and pricing.
4. Select your desired model. Click "Select model" to proceed to the configuration phase. A scripted alternative is sketched after this list.
Phase 2: Configure Inference Endpoint
After selecting a model, configure the inference endpoint to match your specific requirements.
Configuration Preview
```yaml
# Inference Endpoint Configuration
name: my-first-llm-endpoint
model:
  base: gpt-4
  version: 2023-12-01
hardware:
  gpu_type: NVIDIA_A100
  gpu_count: 1
  memory: 32GB
parameters:
  temperature: 0.7
  max_tokens: 2048
  top_p: 0.95
  context_window: 8192
response_cache:
  enabled: true
  ttl_seconds: 3600
```
Steps to Configure Your Endpoint

1. Set memory allocation. Choose sufficient memory based on model size and expected traffic.
2. Select a specific model version. Choose between the available versions of your selected model.
3. Configure the hardware tier. Select GPU type and count for optimal performance.
4. Adjust inference parameters. Set temperature, sampling methods, and token limits.
5. Configure advanced settings. Set context window size and caching policies.
6. Click "Submit Configuration". This takes you to the endpoint creation review screen; a programmatic alternative is sketched after this list.
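If you manage configuration as code, the settings shown in the preview above can in principle be submitted over HTTP instead of through the UI. This is a minimal sketch assuming a management resource at /v1/inference/endpoints that accepts the configuration as JSON; the path and payload schema are assumptions, so verify them against the Syntera API reference.

```python
import requests

API_KEY = "your-api-key"

# Hypothetical management endpoint; the path and payload schema are assumptions.
ENDPOINTS_URL = "https://api.syntera.ai/v1/inference/endpoints"

# Mirrors the YAML configuration preview shown above.
config = {
    "name": "my-first-llm-endpoint",
    "model": {"base": "gpt-4", "version": "2023-12-01"},
    "hardware": {"gpu_type": "NVIDIA_A100", "gpu_count": 1, "memory": "32GB"},
    "parameters": {
        "temperature": 0.7,
        "max_tokens": 2048,
        "top_p": 0.95,
        "context_window": 8192,
    },
    "response_cache": {"enabled": True, "ttl_seconds": 3600},
}

response = requests.post(
    ENDPOINTS_URL,
    headers={"Authorization": f"Bearer {API_KEY}"},
    json=config,
    timeout=30,
)
response.raise_for_status()
print("Created endpoint:", response.json())
```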
Phase 3: Create Dedicated Endpoint
Review your configuration, accept pricing terms, and create your dedicated inference endpoint.
On the review screen, check the Configuration Summary and accept the Pricing and Usage Terms. When the endpoint is created, the platform provisions the following resources:

- System Provisions: Dedicated hardware, networking, and scaling infrastructure.
- Endpoint Details & API Keys: Secure credentials for accessing your endpoint.
Example API Integration
```python
import requests

# Your endpoint details
API_URL = "https://api.syntera.ai/v1/inference/endpoints/your-endpoint-id"
API_KEY = "your-api-key"

headers = {
    "Content-Type": "application/json",
    "Authorization": f"Bearer {API_KEY}",
}

# Example request
data = {
    "prompt": "Explain the concept of machine learning in simple terms.",
    "max_tokens": 500,
    "temperature": 0.7,
}

response = requests.post(API_URL, headers=headers, json=data)

# Process the response
if response.status_code == 200:
    result = response.json()
    print(result["generated_text"])
else:
    print(f"Error: {response.status_code}")
    print(response.text)
```
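For production use, it is worth wrapping the call above with retries, since rate limiting and transient faults are routine for inference services. The sketch below assumes the endpoint signals rate limiting with HTTP 429 and transient failures with 5xx status codes; confirm the actual codes in the Syntera documentation before relying on them.

```python
import time

import requests

def generate_with_retries(url, headers, payload, max_attempts=4):
    """POST to the inference endpoint, retrying on 429/5xx with backoff.

    The 429/5xx retry conditions are assumptions about how the service
    signals transient failures; confirm them against the Syntera docs.
    """
    for attempt in range(max_attempts):
        response = requests.post(url, headers=headers, json=payload, timeout=60)
        if response.status_code == 200:
            return response.json()
        if response.status_code == 429 or response.status_code >= 500:
            time.sleep(2 ** attempt)  # exponential backoff: 1s, 2s, 4s, ...
            continue
        response.raise_for_status()  # non-retryable client error
    raise RuntimeError(f"Giving up after {max_attempts} attempts")

# Usage with the variables from the integration example above:
# result = generate_with_retries(API_URL, headers, data)
```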
Phase 4: Test and Monitor Your Endpoint
Test your newly created endpoint and set up monitoring to ensure optimal performance.
Test Your Endpoint

- Send Test Prompts: Quickly test your endpoint with sample prompts (a scripted smoke test is sketched after this list).
- Access the Interactive Playground: Experiment with different parameters and prompts.
- View Responses and Metrics: Analyze response quality, latency, and token usage.
- Review the Documentation: Access detailed guides and reference materials.
- Copy Example Code: Get integration examples in multiple languages.
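Beyond the playground, a short script makes it easy to rerun the same smoke test after every configuration change. The sketch below reuses the endpoint URL and request shape from the integration example above; the generated_text field mirrors that example and may differ from the actual response schema.

```python
import time

import requests

API_URL = "https://api.syntera.ai/v1/inference/endpoints/your-endpoint-id"
HEADERS = {"Authorization": "Bearer your-api-key"}

test_prompts = [
    "Explain the concept of machine learning in simple terms.",
    "Summarize the benefits of response caching in one sentence.",
]

for prompt in test_prompts:
    start = time.monotonic()
    response = requests.post(
        API_URL,
        headers=HEADERS,
        json={"prompt": prompt, "max_tokens": 200, "temperature": 0.7},
        timeout=60,
    )
    latency = time.monotonic() - start
    response.raise_for_status()
    body = response.json()
    # "generated_text" follows the integration example; verify the real schema.
    print(f"{latency:.2f}s  {body.get('generated_text', '')[:80]!r}")
```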
Monitor Production Usage

- Monitor Performance Metrics: Track key metrics like response time and throughput (a metrics-polling sketch follows this list).
- Analyze Usage Patterns: Identify peak usage times and common request types.
- Track Request Volume and Latency: Monitor traffic patterns and response times.
- Review Billing and Projections: Track costs and forecast future expenditures.
- Manage API Security Controls: Set rate limits and configure access controls.
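Many of these metrics are likely also retrievable programmatically. The sketch below assumes a hypothetical /metrics sub-resource on the endpoint returning aggregate counters; the path, the window parameter, and the field names are illustrative assumptions rather than documented Syntera API.

```python
import requests

API_KEY = "your-api-key"
ENDPOINT_ID = "your-endpoint-id"

# Hypothetical metrics sub-resource; the path and fields are assumptions.
METRICS_URL = f"https://api.syntera.ai/v1/inference/endpoints/{ENDPOINT_ID}/metrics"

response = requests.get(
    METRICS_URL,
    headers={"Authorization": f"Bearer {API_KEY}"},
    params={"window": "24h"},  # assumed aggregation-window parameter
    timeout=30,
)
response.raise_for_status()

metrics = response.json()
print("Requests:", metrics.get("request_count"))
print("p95 latency (ms):", metrics.get("latency_p95_ms"))
print("Tokens generated:", metrics.get("tokens_generated"))
```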
Test Console

Input Prompt: "Explain the concept of machine learning in simple terms."

Response:
Machine learning is like teaching a computer to learn from examples instead of explicitly programming it with specific instructions.
Imagine you want to teach a child to recognize cats. You wouldn't explain all the detailed features like "four legs, fur, whiskers, etc." Instead, you'd show the child many pictures of cats, and eventually, they'd learn to identify cats on their own.
Similarly, with machine learning:
- You give the computer lots of examples (data)
- The computer finds patterns in this data
- After "learning" from these patterns, it can make predictions or decisions about new data it hasn't seen before
For example, a machine learning system can learn to recognize spam emails after seeing thousands of examples of spam and non-spam messages, without being explicitly programmed with rules about what makes an email "spammy."
Monitoring Dashboard Preview
Next Steps for Production

1. Configure alerts for key metrics. Set up notifications for error spikes, latency issues, or high costs (a sketch follows this list).
2. Integrate with your application. Implement API calls to your endpoint from your application code.
3. Establish a monitoring routine. Set up regular checks of endpoint performance.
4. Optimize based on usage patterns. Adjust configurations as you observe real-world usage.
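As a starting point for the first item, alert rules can often be expressed as small declarative payloads. The sketch below posts two hypothetical rules, one for error rate and one for daily spend; the /alerts path, the rule fields, and the thresholds are all assumptions to adapt to Syntera's actual alerting interface.

```python
import requests

API_KEY = "your-api-key"
ENDPOINT_ID = "your-endpoint-id"

# Hypothetical alerts resource; everything below the URL is an assumed schema.
ALERTS_URL = f"https://api.syntera.ai/v1/inference/endpoints/{ENDPOINT_ID}/alerts"

alert_rules = [
    {"metric": "error_rate", "threshold": 0.05, "window": "5m", "notify": "oncall@example.com"},
    {"metric": "daily_cost_usd", "threshold": 100, "window": "24h", "notify": "oncall@example.com"},
]

for rule in alert_rules:
    response = requests.post(
        ALERTS_URL,
        headers={"Authorization": f"Bearer {API_KEY}"},
        json=rule,
        timeout=30,
    )
    response.raise_for_status()
    print("Created alert:", response.json())
```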
Troubleshooting

Endpoint creation failures

- Check resource quota: Ensure your account has sufficient quota for the selected hardware tier.
- Verify configuration: Review your endpoint configuration for any invalid settings.
- Contact support: If issues persist, contact Syntera support with your endpoint configuration details.

High latency

- Check hardware tier: Consider upgrading to a more powerful GPU tier.
- Reduce the context window: Large context windows require more processing time.
- Enable caching: If your workload has repetitive queries, enable response caching; a client-side variant is sketched after this section.
- Monitor concurrent requests: High traffic can cause queuing and increased latency.

API errors

- Verify your API key: Ensure you're using the correct API key and that it hasn't expired.
- Check the endpoint URL: Confirm you're using the correct endpoint URL in your requests.
- Review the request format: Ensure your API request follows the correct format and parameters.
- Check network connectivity: Rule out network issues or firewall restrictions.

Unexpected costs

- Set budget alerts: Configure alerts to notify you when costs exceed expected thresholds.
- Monitor usage patterns: Regularly review your usage dashboard for unexpected spikes.
- Optimize token usage: Review your prompts to reduce unnecessary tokens in requests.
- Consider auto-scaling: Configure auto-scaling to scale down resources during low-traffic periods.
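One concrete way to act on the caching and token-usage tips above is a small client-side cache, so identical prompts never reach the endpoint at all. This is a minimal in-process sketch, not a Syntera feature: the request shape reuses the integration example, and for multi-instance deployments you would want a shared store such as Redis instead of a local dict.

```python
import hashlib
import json

import requests

API_URL = "https://api.syntera.ai/v1/inference/endpoints/your-endpoint-id"
HEADERS = {"Authorization": "Bearer your-api-key"}

_cache = {}  # in-process cache; use a shared store (e.g. Redis) across instances

def cached_generate(prompt, max_tokens=500, temperature=0.0):
    """Return a cached response for identical requests, calling the API once.

    Deterministic settings (temperature 0.0) make cached reuse safe; at higher
    temperatures a cache pins every repeat of a prompt to one sampled variant.
    """
    payload = {"prompt": prompt, "max_tokens": max_tokens, "temperature": temperature}
    key = hashlib.sha256(json.dumps(payload, sort_keys=True).encode()).hexdigest()
    if key not in _cache:
        response = requests.post(API_URL, headers=HEADERS, json=payload, timeout=60)
        response.raise_for_status()
        _cache[key] = response.json()
    return _cache[key]
```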