# Qwen3.5-40B-RoughHouse-Claude-4.6-Opus — oQ4e ⭐ Recommended (4.8 bpw)
oQ4e mixed-precision quantization of DavidAU/Qwen3.5-40B-RoughHouse-Claude-4.6-Opus-Polar-Deckard-Uncensored-Heretic-Thinking
Quantized using oMLX oQ — data-driven sensitivity-aware mixed-precision quantization for Apple Silicon. Standard mlx-lm compatible safetensors. Works with oMLX, LM Studio, mlx-lm, and any MLX-compatible inference server.
⭐ This is the recommended quant. Near-lossless quality at 70% size reduction. Three benchmarks actually improved over bf16. Fits on a 32GB MacBook with TurboQuant KV cache compression.
## ⚠️ Content Warning
This is a fully uncensored model. The original was abliterated via Heretic and fine-tuned without safety alignment. It may produce graphic, offensive, or inappropriate content. Use responsibly.
## About the Original Model
Created by DavidAU. This is the "RoughHouse" variant: the raw version, released without its final training pass, after two 27B Qwen 3.5 fine-tunes were expanded to 40B parameters.
- Architecture: 40B dense (not MoE), 96 layers, 1275 tensors
- Base: Qwen3.5-27B expanded to 40B (50% more layers than the base; see the sketch below)
- Training: Claude/Polaris (5 datasets) + Deckard/PDK (5 datasets) + Heretic uncensoring
- RoughHouse: Raw release without final training step after expansion
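DavidAU's exact expansion recipe is not documented in this card. As a rough illustration of passthrough-style depth expansion, the sketch below maps a 64-layer base (inferred from "50% more layers" and the 96-layer total) onto 96 target layers by duplicating every second block; the duplication pattern is an assumption for illustration only.

```python
# Hypothetical sketch of depth expansion via layer duplication (passthrough merging).
# The 64-layer base and the every-second-layer pattern are illustrative assumptions,
# not DavidAU's actual recipe.

def expand_layer_map(n_source: int = 64, n_target: int = 96) -> list[int]:
    """Map target layer slots onto source layers, reusing some weights twice."""
    stride = n_source // (n_target - n_source)  # 64 // 32: duplicate every 2nd layer
    mapping = []
    for i in range(n_source):
        mapping.append(i)
        if i % stride == stride - 1:
            mapping.append(i)  # this source layer's weights appear twice in the stack
    return mapping

layer_map = expand_layer_map()
assert len(layer_map) == 96  # 50% more layers than the 64-layer base
```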
## All Available Quants
| Quant | BPW | Size | MMLU | TruthfulQA | ARC-C | HellaSwag | Status |
|---|---|---|---|---|---|---|---|
| bf16 (reference) | 16.0 | 73.6 GB | 86.2% | 85.3% | 94.3% | 90.5% | Source |
| oQ4e ⭐ (this) | ~4.8 | ~22 GB | 85.4% | 85.7% | 96.0% | 91.5% | ✅ Recommended |
| oQ2e | ~3.1 | ~14.3 GB | 43.2% | 40.0% | 40.3% | 20.6% | Available |
## Benchmark Results — oQ4e (4.8 bpw, ~22 GB)

### Full Comparison: bf16 vs oQ4e vs oQ2e
| Benchmark | Samples | bf16 | oQ4e (this) | oQ4e Δ | oQ2e | oQ2e Δ |
|---|---|---|---|---|---|---|
| MMLU | 1000 | 86.2% | 85.4% | -0.8 | 43.2% | -43.0 |
| HellaSwag | 200 | 90.5% | 91.5% | +1.0 | 20.6% | -69.9 |
| TruthfulQA | Full (817) | 85.3% | 85.7% | +0.4 | 40.0% | -45.3 |
| ARC-Challenge | 300 | 94.3% | 96.0% | +1.7 | 40.3% | -54.0 |
| Winogrande | 300 | 83.0% | 78.3% | -4.7 | 44.0% | -39.0 |
| GSM8K | 100 | 97.0% | 95.0% | -2.0 | 30.5% | -66.5 |
| HumanEval | Full (164) | 82.3% | 82.3% | 0.0 | 29.3% | -53.0 |
| MBPP | 200 | 75.0% | 73.0% | -2.0 | 2.0% | -73.0 |
| LiveCodeBench | 100 | 28.0% | 27.0% | -1.0 | 5.0% | -23.0 |
### Key Findings
- oQ4e is essentially lossless: an average delta of -0.8 points across all nine benchmarks
- Three benchmarks improved over bf16: HellaSwag (+1.0), TruthfulQA (+0.4), ARC-C (+1.7)
- HumanEval identical at 82.3% — coding ability fully preserved
- 73.6 GB → 22 GB — 70% size reduction with zero meaningful quality loss
- Fits on 32GB devices with TurboQuant KV cache compression enabled (see the sizing sketch below)
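A back-of-the-envelope sizing sketch makes the 32GB claim concrete. The card does not publish the attention geometry, so the KV head count and head dimension below are illustrative guesses, and 4-bit KV compression is assumed for TurboQuant; macOS typically lets the GPU use roughly three quarters of unified memory, so a 32GB machine has around 24GB to work with.

```python
# Rough KV-cache sizing for a 32GB Mac (~24GB usable by the GPU). The GQA
# geometry (8 KV heads x 128 head_dim) is an illustrative assumption; the card
# does not publish the model's attention config.

N_LAYERS = 96         # from the architecture notes above
KV_HEADS = 8          # assumed grouped-query KV heads
HEAD_DIM = 128        # assumed head dimension
WEIGHTS_GB = 22.0     # oQ4e weights, from the table above

def kv_cache_gb(context_tokens: int, bytes_per_elem: float) -> float:
    per_token = 2 * N_LAYERS * KV_HEADS * HEAD_DIM * bytes_per_elem  # keys + values
    return per_token * context_tokens / 1024**3

for ctx in (8_192, 16_384):
    fp16 = kv_cache_gb(ctx, 2.0)   # uncompressed fp16 cache
    q4 = kv_cache_gb(ctx, 0.5)     # ~4-bit compressed cache (assumed)
    print(f"{ctx:>6} tokens: fp16 KV {fp16:.2f} GB vs 4-bit KV {q4:.2f} GB "
          f"-> total ~{WEIGHTS_GB + q4:.1f} GB with compression")
```

Under these assumptions, a 16k context adds about 6 GB of uncompressed cache on top of the 22 GB of weights, blowing the budget, while the compressed cache keeps the total near 23.5 GB.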
### MMLU Category Breakdown — oQ4e
- Top (100%): Astronomy, College Biology, Computer Security, Conceptual Physics, HS Computer Science, HS Government & Politics, HS World History, International Law, Logical Fallacies, Medical Genetics, Sociology, US Foreign Policy
- Bottom 5: College Chemistry (42.9%), Global Facts (57.1%), Virology (58.3%), Anatomy (60.0%), Electrical Engineering (60.0%)
## Comparison with GLM-5 (4.8-bit MLX, 744B MoE)
| Benchmark | RoughHouse oQ4e (22GB) | GLM-5 4.8bit (~430GB) |
|---|---|---|
| MMLU | 85.4% | 87.4% |
| TruthfulQA | 85.7% | 90.5% |
| HumanEval | 82.3% | 84.2% |
| GSM8K | 95.0% | — |
| ARC-C | 96.0% | — |
This 22 GB oQ4e quant scores within 2 points of GLM-5 (a 744B MoE model at ~430 GB) on MMLU while being roughly 20x smaller.
## Quantization Settings
| Parameter | Value |
|---|---|
| Method | oQ (oMLX Universal Dynamic Quantization) |
| Level | oQ4e (Enhanced) |
| Enhanced (+) | Yes (GPTQ error compensation) |
| Effective BPW | ~4.8 |
| Calibration Dataset | Code + Multilingual + Tool Calling |
| Calibration Samples | 128 |
| Sequence Length | 512 |
| Hardware | Apple M3 Ultra, 512GB Unified Memory |
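The card does not publish oQ's algorithm, but the general shape of data-driven, sensitivity-aware bit allocation can be sketched. Everything below (the round-trip error metric, the greedy budget loop, the candidate bit widths) is a generic stand-in, not oMLX code:

```python
import numpy as np

# Generic sketch of sensitivity-aware mixed-precision allocation. This is NOT
# oMLX's oQ implementation; it only illustrates spending bits where the
# calibration error is highest.

def fake_quantize(w: np.ndarray, bits: int) -> np.ndarray:
    """Uniform symmetric round-trip quantization used to probe sensitivity."""
    scale = np.abs(w).max() / (2 ** (bits - 1) - 1)
    return np.round(w / scale) * scale

def allocate_bits(layers, calib, budget_bpw=4.8, choices=(2, 3, 4, 6, 8)):
    """Pick a per-layer bit width so the size-weighted average meets the budget."""
    idx = {name: 0 for name in layers}  # every layer starts at the cheapest width

    def err(name, i):
        w = layers[name]
        delta = calib @ w - calib @ fake_quantize(w, choices[i])
        return float(np.mean(delta**2))  # output error on the calibration batch

    def avg_bpw():
        total = sum(w.size for w in layers.values())
        return sum(choices[idx[n]] * layers[n].size for n in layers) / total

    # Greedily upgrade whichever layer gains the most from one step up in
    # precision, until the average bits-per-weight budget (~4.8 here) is spent.
    while avg_bpw() < budget_bpw:
        gains = {n: err(n, idx[n]) - err(n, idx[n] + 1)
                 for n in layers if idx[n] + 1 < len(choices)}
        if not gains:
            break
        idx[max(gains, key=gains.get)] += 1
    return {n: choices[idx[n]] for n in layers}

# Toy usage: four equally shaped layers, 128 calibration rows (as in the table)
rng = np.random.default_rng(0)
layers = {f"layers.{i}.mlp": rng.normal(size=(256, 256)) for i in range(4)}
calib = rng.normal(size=(128, 256))
print(allocate_bits(layers, calib))
```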
## How to Use

### oMLX
Drop the model folder into your oMLX models directory. Auto-detected on server start.
### mlx-lm
```python
from mlx_lm import load, generate

# Downloads the quantized weights from the Hub on first use
model, tokenizer = load("Hunterx/Qwen3.5-40B-RoughHouse-Claude-4.6-Opus-oQ4e")

messages = [{"role": "user", "content": "Your prompt here"}]

# enable_thinking switches on the Qwen thinking mode in the chat template
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True, enable_thinking=True
)

response = generate(model, tokenizer, prompt=prompt, max_tokens=2048)
```
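For incremental output, recent mlx-lm releases also provide `stream_generate`. A minimal sketch, assuming a current mlx-lm version where the generator yields response chunks with a `.text` field:

```python
from mlx_lm import load, stream_generate

model, tokenizer = load("Hunterx/Qwen3.5-40B-RoughHouse-Claude-4.6-Opus-oQ4e")

prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": "Your prompt here"}],
    tokenize=False,
    add_generation_prompt=True,
)

# Print tokens as they are generated instead of waiting for the full response
for chunk in stream_generate(model, tokenizer, prompt=prompt, max_tokens=2048):
    print(chunk.text, end="", flush=True)
```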
### LM Studio

Search for the model and download it. Works with the MLX backend on Apple Silicon.
## Recommended Settings
Per DavidAU's guidance for the RoughHouse variant:
- Temperature: 0.5 - 1.0 (lower for factual, higher for creative)
- Min context window: 8k - 16k
- Rep penalty: 1.05 - 1.1 (if looping occurs)
- System prompt: Even a single sentence helps stabilize the "wild" nature
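In recent mlx-lm releases, these settings can be wired in through `make_sampler` and `make_logits_processors` from `mlx_lm.sample_utils`; a minimal sketch (the kwargs below assume a current mlx-lm version):

```python
from mlx_lm import load, generate
from mlx_lm.sample_utils import make_logits_processors, make_sampler

model, tokenizer = load("Hunterx/Qwen3.5-40B-RoughHouse-Claude-4.6-Opus-oQ4e")

messages = [
    {"role": "system", "content": "You are a concise, helpful assistant."},  # stabilizer
    {"role": "user", "content": "Your prompt here"},
]
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)

response = generate(
    model,
    tokenizer,
    prompt=prompt,
    max_tokens=2048,
    sampler=make_sampler(temp=0.8),  # 0.5 for factual work, up to 1.0 for creative
    logits_processors=make_logits_processors(repetition_penalty=1.05),  # only if looping
)
```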
## Credits
- Original Model: DavidAU — fine-tuning, expansion to 40B, Heretic uncensoring
- Base Architecture: Qwen3.5-27B by Alibaba/Qwen Team
- Quantization: oQ by jundot/oMLX
- Benchmarks & Quantization by: Hunterx — oMLX v0.2.20 Intelligence Benchmark suite on M3 Ultra (512GB)
## License
Apache 2.0 (inherited from original model)