| # 📖 User Guide - ZeroGPU LLM Inference |
|
|
| ## Quick Start (5 Minutes) |
|
|
| ### 1. Choose Your Model |
| The model dropdown shows 30+ options organized by size: |
| - **Compact (<2B)**: Fast, lightweight - great for quick responses |
| - **Mid-size (2-8B)**: Best balance of speed and quality |
| - **Large (14B+)**: Highest quality, slower but more capable |
|
|
| **Recommendation for beginners**: Start with `Qwen3-4B-Instruct-2507` |
|
|
| ### 2. Try an Example Prompt |
| Click on any example below the chat box to get started: |
| - "Explain quantum computing in simple terms" |
| - "Write a Python function..." |
| - "What are the latest developments..." (requires web search) |
|
|
| ### 3. Start Chatting! |
| Type your message and press Enter or click "📤 Send" |
|
|
| ## Core Features |
|
|
| ### 💬 Chat Interface |
|
|
| The main chat area shows: |
| - Your messages on one side |
| - AI responses with a 🤖 avatar |
| - Copy button on each message |
| - Smooth streaming as tokens generate |
|
|
| **Tips:** |
| - Press Enter to send (Shift+Enter for new line) |
| - Click Copy button to save responses |
| - Scroll up to review history |
| - Use Clear Chat to start fresh |
|
|
| ### 🤖 Model Selection |
|
|
| **When to use each size:** |
|
|
| | Model Size | Best For | Speed | Quality | |
| |------------|----------|-------|---------| |
| | <2B | Quick questions, testing | ⚡⚡⚡ | ⭐⭐ | |
| | 2-8B | General chat, coding help | ⚡⚡ | ⭐⭐⭐ | |
| | 14B+ | Complex reasoning, long-form | ⚡ | ⭐⭐⭐⭐ | |
|
|
| **Specialized Models:** |
| - **Phi-4-mini-Reasoning**: Math, logic problems |
| - **Qwen2.5-Coder**: Programming tasks |
| - **DeepSeek-R1-Distill**: Step-by-step reasoning |
| - **Apriel-1.5-15b-Thinker**: Multimodal understanding |
|
|
| ### 🔍 Web Search |
|
|
| Enable this when you need: |
| - Current events and news |
| - Recent information (after model training cutoff) |
| - Facts that change frequently |
| - Real-time data |
|
|
| **How it works:** |
| 1. Toggle "🔍 Enable Web Search" |
| 2. Web search settings accordion appears |
| 3. System prompt updates automatically |
| 4. Search runs in background (won't block chat) |
| 5. Results injected into context |
|
|
| **Settings explained:** |
| - **Max Results**: How many search results to fetch (4 is good default) |
| - **Max Chars/Result**: Limit length per result (50 prevents overwhelming context) |
| - **Search Timeout**: Maximum wait time (5s recommended) |
|
|
| ### 📝 System Prompt |
|
|
| This defines the AI's personality and behavior. |
|
|
| **Default prompts:** |
| - Without search: Helpful, creative assistant |
| - With search: Includes search results and current date |
|
|
| **Customization ideas:** |
| ``` |
| You are a professional code reviewer... |
| You are a creative writing coach... |
| You are a patient tutor explaining concepts simply... |
| You are a technical documentation writer... |
| ``` |
|
|
| ## Advanced Features |
|
|
| ### 🎛️ Advanced Generation Parameters |
|
|
| Click the accordion to reveal these controls: |
|
|
| #### Max Tokens (64-16384) |
| - **What it does**: Sets maximum response length |
| - **Lower (256-512)**: Quick, concise answers |
| - **Medium (1024)**: Balanced (default) |
| - **Higher (2048+)**: Long-form content, detailed explanations |
|
|
| #### Temperature (0.1-2.0) |
| - **What it does**: Controls randomness/creativity |
| - **Low (0.1-0.3)**: Focused, deterministic (good for facts, code) |
| - **Medium (0.7)**: Balanced creativity (default) |
| - **High (1.2-2.0)**: Very creative, unpredictable (stories, brainstorming) |
|
|
| #### Top-K (1-100) |
| - **What it does**: Limits token choices to top K most likely |
| - **Lower (10-20)**: More focused |
| - **Medium (40)**: Balanced (default) |
| - **Higher (80-100)**: More varied vocabulary |
|
|
| #### Top-P (0.1-1.0) |
| - **What it does**: Nucleus sampling threshold |
| - **Lower (0.5-0.7)**: Conservative choices |
| - **Medium (0.9)**: Balanced (default) |
| - **Higher (0.95-1.0)**: Full vocabulary range |
|
|
| #### Repetition Penalty (1.0-2.0) |
| - **What it does**: Reduces repeated words/phrases |
| - **Low (1.0-1.1)**: Allows some repetition |
| - **Medium (1.2)**: Balanced (default) |
| - **High (1.5+)**: Strongly avoids repetition (may hurt coherence) |
|
|
| ### Preset Configurations |
|
|
| **For Creative Writing:** |
| ``` |
| Temperature: 1.2 |
| Top-P: 0.95 |
| Top-K: 80 |
| Max Tokens: 2048 |
| ``` |
|
|
| **For Code Generation:** |
| ``` |
| Temperature: 0.3 |
| Top-P: 0.9 |
| Top-K: 40 |
| Max Tokens: 1024 |
| Repetition Penalty: 1.1 |
| ``` |
|
|
| **For Factual Q&A:** |
| ``` |
| Temperature: 0.5 |
| Top-P: 0.85 |
| Top-K: 30 |
| Max Tokens: 512 |
| Enable Web Search: Yes |
| ``` |
|
|
| **For Reasoning Tasks:** |
| ``` |
| Model: Phi-4-mini-Reasoning or DeepSeek-R1 |
| Temperature: 0.7 |
| Max Tokens: 2048 |
| ``` |
|
|
| ## Tips & Tricks |
|
|
| ### 🎯 Getting Better Results |
|
|
| 1. **Be Specific**: "Write a Python function to sort a list" → "Write a Python function that sorts a list of dictionaries by a specific key" |
|
|
| 2. **Provide Context**: "Explain recursion" → "Explain recursion to someone learning programming for the first time, with a simple example" |
|
|
| 3. **Use System Prompts**: Define role/expertise in system prompt instead of every message |
|
|
| 4. **Iterate**: Use follow-up questions to refine responses |
|
|
| 5. **Experiment with Models**: Try different models for the same task |
|
|
| ### ⚡ Performance Tips |
|
|
| 1. **Start Small**: Test with smaller models first |
| 2. **Adjust Max Tokens**: Don't request more than you need |
| 3. **Use Cancel**: Stop bad generations early |
| 4. **Clear Cache**: Clear chat if experiencing slowdowns |
| 5. **One Task at a Time**: Don't send multiple requests simultaneously |
|
|
| ### 🔍 When to Use Web Search |
|
|
| **✅ Good use cases:** |
| - "What happened in the latest SpaceX launch?" |
| - "Current cryptocurrency prices" |
| - "Recent AI research papers" |
| - "Today's weather in Paris" |
|
|
| **❌ Don't need search for:** |
| - General knowledge questions |
| - Code writing/debugging |
| - Math problems |
| - Creative writing |
| - Theoretical explanations |
|
|
| ### 💭 Understanding Thinking Mode |
|
|
| Some models output `<think>...</think>` blocks: |
|
|
| ``` |
| <think> |
| Let me break this down step by step... |
| First, I need to consider... |
| </think> |
| |
| Here's the answer: ... |
| ``` |
|
|
| **In the UI:** |
| - Thinking shows as "💭 Thought" |
| - Answer shows separately |
| - Helps you see the reasoning process |
|
|
| **Best for:** |
| - Complex math problems |
| - Multi-step reasoning |
| - Debugging logic |
| - Learning how AI thinks |
|
|
| ## Troubleshooting |
|
|
| ### Generation is Slow |
| - Try a smaller model |
| - Reduce Max Tokens |
| - Disable web search if not needed |
| - Clear chat history |
|
|
| ### Responses are Repetitive |
| - Increase Repetition Penalty |
| - Reduce Temperature slightly |
| - Try different model |
|
|
| ### Responses are Random/Nonsensical |
| - Decrease Temperature |
| - Reduce Top-P |
| - Reduce Top-K |
| - Try more stable model |
|
|
| ### Web Search Not Working |
| - Check timeout isn't too short |
| - Verify internet connection |
| - Try increasing Max Results |
| - Check search query in debug panel |
|
|
| ### Cancel Button Doesn't Work |
| - Wait a moment (might be processing) |
| - Refresh page if persists |
| - Check browser console for errors |
|
|
| ## Keyboard Shortcuts |
|
|
| - **Enter**: Send message |
| - **Shift+Enter**: New line in input |
| - **Ctrl+C**: Copy (when text selected) |
| - **Ctrl+A**: Select all in input |
|
|
| ## Best Practices |
|
|
| ### For Beginners |
| 1. Start with example prompts |
| 2. Use default settings initially |
| 3. Try 2-4 different models |
| 4. Gradually explore advanced settings |
| 5. Read responses fully before replying |
|
|
| ### For Power Users |
| 1. Create custom system prompts |
| 2. Fine-tune parameters per task |
| 3. Use debug panel for prompt engineering |
| 4. Experiment with model combinations |
| 5. Utilize web search strategically |
|
|
| ### For Developers |
| 1. Study the debug output |
| 2. Test code generation thoroughly |
| 3. Use lower temperature for determinism |
| 4. Compare multiple models |
| 5. Save working configurations |
|
|
| ## Privacy & Safety |
|
|
| - **No data collection**: Conversations not stored permanently |
| - **Model limitations**: May produce incorrect information |
| - **Verify important info**: Don't rely solely on AI for critical decisions |
| - **Web search**: Uses DuckDuckGo (privacy-focused) |
| - **Open source**: Code is transparent and auditable |
|
|
| ## Support & Feedback |
|
|
| Found a bug? Have a suggestion? |
| - Check GitHub issues |
| - Submit feature requests |
| - Contribute improvements |
| - Share your use cases |
|
|
| --- |
|
|
| **Happy chatting! 🎉** |
|
|