| frameworks: | |
| - Pytorch | |
| license: other | |
| tasks: | |
| - text-generation | |
| # Model Card for CodeFuse-CodeLlama-34B-4bits | |
| <p align="left"> | |
| <img src="./LOGO.png" width="100%" /> | |
| </p> | |
| [[中文]](#chinese) [[English]](#english) | |
| <a id="english"></a> | |
| ## Model Description | |
| CodeFuse-CodeLlama-34B-4bits is the 4-bit quantized version of CodeFuse-CodeLlama-34B, a 34B code LLM fine-tuned on multiple code tasks (600k instructions/answers) from the base model CodeLlama-34b-Python. | |
| After 4-bit quantization, CodeFuse-CodeLlama-34B-4bits can be loaded on a single A10 (24GB VRAM) or a single RTX 4090 (24GB VRAM). The quantized model still achieves 73.8% pass@1 on HumanEval. | |
| <br> | |
| ## News and Updates | |
| 🔥🔥🔥 2023-09-26 We are pleased to announce the release of the 4-bit quantized version of CodeFuse-CodeLlama-34B. Despite the quantization process, the model still achieves a remarkable 73.8% accuracy (greedy decoding) on the HumanEval pass@1 metric. | |
| 🔥🔥🔥 2023-09-11 CodeFuse-CodeLlama-34B achieved 74.4% pass@1 (greedy decoding) on HumanEval, the SOTA result for open-source LLMs at the time of release. | |
| <br> | |
| ## Code Community | |
| **Homepage**: 🏡 https://github.com/codefuse-ai (**Please give us your support with a Star🌟 + Fork🚀 + Watch👀**) | |
| + If you wish to fine-tune the model yourself, you can visit ✨[MFTCoder](https://github.com/codefuse-ai/MFTCoder)✨✨ | |
| + If you wish to deploy the model yourself, you can visit ✨[FasterTransformer4CodeFuse](https://github.com/codefuse-ai/FasterTransformer4CodeFuse)✨✨ | |
| + If you wish to see a demo of the model, you can visit ✨[CodeFuse Demo](https://github.com/codefuse-ai/codefuse)✨✨ | |
| <br> | |
| ## Performance | |
| | Model | HumanEval(pass@1) | Date | | |
| |:--------------------------------|:-----------------:|:-------:| | |
| | **CodeFuse-CodeLlama-34B** | **74.4%** | 2023.9 | | |
| |**CodeFuse-CodeLlama-34B-4bits** | **73.8%** | 2023.9 | | |
| | WizardCoder-Python-34B-V1.0 | 73.2% | 2023.8 | | |
| | GPT-4(zero-shot) | 67.0% | 2023.3 | | |
| | PanGu-Coder2 15B | 61.6% | 2023.8 | | |
| | CodeLlama-34b-Python | 53.7% | 2023.8 | | |
| | CodeLlama-34b | 48.8% | 2023.8 | | |
| | GPT-3.5(zero-shot) | 48.1% | 2022.11 | | |
| | OctoCoder | 46.2% | 2023.8 | | |
| | StarCoder-15B | 33.6% | 2023.5 | | |
| | LLaMA 2 70B(zero-shot) | 29.9% | 2023.7 | | |
| <br> | |
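Under greedy decoding each problem receives a single sample, so pass@1 is simply the fraction of problems solved. HumanEval contains 164 problems, which lets the percentages in the table be converted back to approximate solved counts. The sketch below is illustrative only: the helper name `solved_count` is hypothetical, and the counts are back-computed from the rounded percentages rather than reported by the authors.

```python
HUMANEVAL_PROBLEMS = 164  # number of problems in the HumanEval benchmark


def solved_count(pass_at_1_percent, total=HUMANEVAL_PROBLEMS):
    """Back-compute the number of solved problems from a rounded pass@1 percentage."""
    return round(pass_at_1_percent / 100 * total)


# Percentages copied from the performance table.
scores = {
    "CodeFuse-CodeLlama-34B": 74.4,
    "CodeFuse-CodeLlama-34B-4bits": 73.8,
    "CodeLlama-34b-Python": 53.7,
}

for name, pct in scores.items():
    n = solved_count(pct)
    print(f"{name}: ~{n}/{HUMANEVAL_PROBLEMS} solved ({n / HUMANEVAL_PROBLEMS:.1%})")
```

By this estimate, the gap between the bfloat16 and 4-bit models corresponds to roughly one problem out of 164.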
| ## GPU Memory Usage | |
| We measured the GPU memory usage after loading the model, as well as the memory usage when encoding 2048/1024 tokens and generating 1024/2048 tokens. The results are presented in the table below. | |
| | Precision | Idle Model | Encoding 2048 tokens and Generating 1024 tokens | Encoding 1024 tokens and Generating 2048 tokens | | |
| |:--------------------------------|:-------------------|:------------------------:|:------------:| | |
| |bfloat16 | 64.89GB | 69.31GB | 66.41GB | | |
| |int4 | 19.09GB | 22.19GB | 20.78GB | | |
| <br> | |
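To make the table easier to read against a 24GB card, the following sketch computes the reduction factor and the remaining headroom for each scenario. It assumes the table's GB and the card's nominal 24GB are the same unit, and it treats the model as having a nominal 34 billion parameters; the names `usage_gb` and `headroom_gb` are introduced here for illustration.

```python
# Figures copied from the table above (GB, as reported).
usage_gb = {
    "bfloat16": {"idle": 64.89, "enc2048_gen1024": 69.31, "enc1024_gen2048": 66.41},
    "int4": {"idle": 19.09, "enc2048_gen1024": 22.19, "enc1024_gen2048": 20.78},
}
GPU_VRAM_GB = 24  # nominal VRAM of an A10 or RTX 4090


def headroom_gb(used_gb, vram_gb=GPU_VRAM_GB):
    """Memory left on the card after the reported usage, rounded to 2 decimals."""
    return round(vram_gb - used_gb, 2)


for scenario, int4_gb in usage_gb["int4"].items():
    bf16_gb = usage_gb["bfloat16"][scenario]
    print(f"{scenario}: {bf16_gb / int4_gb:.2f}x smaller, "
          f"{headroom_gb(int4_gb):.2f} GB headroom on a {GPU_VRAM_GB} GB card")

# Rough lower bound for the packed weights alone: 34e9 parameters at 4 bits each.
# The measured idle figure is higher; quantization metadata, any unquantized
# layers, and runtime overhead are plausible (unverified) contributors.
approx_weights_gb = 34e9 * 4 / 8 / 1e9
print(f"packed 4-bit weights alone: ~{approx_weights_gb:.1f} GB")
```

The tightest case is encoding 2048 tokens and generating 1024, so longer contexts than those measured may not fit on a single 24GB card.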
| ## Requirements | |
| * python>=3.8 | |
| * pytorch>=2.0.0 | |
| * transformers==4.32.0 | |
| * auto_gptq==0.4.2 | |
| * Sentencepiece | |
| * CUDA 11.4 | |
| <br> | |
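Because two of the requirements are pinned to exact versions, it can help to check the environment before loading the model. The sketch below uses only the standard library. The PyPI distribution names (`torch` for pytorch, `auto-gptq` for auto_gptq) are assumptions about how the packages are published, and the helper names are hypothetical. The CUDA version is not checked.

```python
import sys
from importlib.metadata import PackageNotFoundError, version

# Exact pins from the requirements list (distribution names are assumed).
PINNED = {"transformers": "4.32.0", "auto-gptq": "0.4.2"}


def base_version(found):
    """Drop a local build suffix such as '+cu118' before comparing."""
    return found.split("+")[0]


def installed(name):
    """Return the installed version string, or None if the package is missing."""
    try:
        return version(name)
    except PackageNotFoundError:
        return None


def check_requirements():
    """Return a list of human-readable problems; an empty list means all checks passed."""
    problems = []
    if sys.version_info < (3, 8):
        problems.append("python>=3.8 is required")
    for name, wanted in PINNED.items():
        found = installed(name)
        if found is None:
            problems.append(f"{name} is not installed (want {wanted})")
        elif base_version(found) != wanted:
            problems.append(f"{name}=={found} found, want {wanted}")
    torch_version = installed("torch")
    if torch_version is None:
        problems.append("torch is not installed (want >=2.0.0)")
    elif int(torch_version.split(".")[0]) < 2:
        problems.append(f"torch=={torch_version} found, want >=2.0.0")
    if installed("sentencepiece") is None:
        problems.append("sentencepiece is not installed")
    return problems


if __name__ == "__main__":
    for problem in check_requirements() or ["all requirement checks passed"]:
        print(problem)
```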
| ## Inference String Format | |
| The inference string concatenates the conversation turns (human and bot contents) in the same format used for training, and is passed to the model as input at inference time. | |
| Here is an example format of the concatenated string: | |
| ```python | |
| """ | |
| <|role_start|>human<|role_end|>Human 1st round input | |
| <|role_start|>bot<|role_end|>Bot 1st round output</s> | |
| <|role_start|>human<|role_end|>Human 2nd round input | |
| <|role_start|>bot<|role_end|>Bot 2nd round output</s> | |
| ... | |
| ... | |
| ... | |
| <|role_start|>human<|role_end|>Human nth round input | |
| <|role_start|>bot<|role_end|>{Bot output to be generated}</s> | |
| """ | |
| ``` | |
| At inference time, always end the input string with "<|role_start|>bot<|role_end|>" to prompt the model to generate an answer. | |
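The format above can be assembled programmatically for multi-round conversations. The following is a minimal sketch, not the authors' implementation: `build_prompt` is a hypothetical helper, and the newline after each `</s>` is an assumption read from the line layout of the example string.

```python
ROLE_START = "<|role_start|>"
ROLE_END = "<|role_end|>"
EOS = "</s>"


def build_prompt(history, user_input):
    """Build an inference string.

    history: list of (human_text, bot_text) pairs from completed rounds.
    user_input: the human input for the round to be answered.
    """
    parts = []
    for human_text, bot_text in history:
        parts.append(f"{ROLE_START}human{ROLE_END}{human_text}\n")
        # Assumption: completed bot turns end with </s> followed by a newline.
        parts.append(f"{ROLE_START}bot{ROLE_END}{bot_text}{EOS}\n")
    if not user_input.endswith("\n"):
        user_input += "\n"
    parts.append(f"{ROLE_START}human{ROLE_END}{user_input}")
    # The string must end with the bot tag so the model generates the answer.
    parts.append(f"{ROLE_START}bot{ROLE_END}")
    return "".join(parts)


if __name__ == "__main__":
    history = [("Write hello world in Python", "print('hello world')")]
    print(build_prompt(history, "Now do it in C"))
```

With an empty history this produces the same string as the single-round Quickstart code builds inline.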
| <br> | |
| ## Quickstart | |
| ```bash | |
| git clone https://huggingface.co/codefuse-ai/CodeFuse-CodeLlama-34B-4bits.git | |
| ``` | |
| ```bash | |
| pip install -r requirements.txt | |
| ``` | |
| ```python | |
| import os | |
| import time | |
| import torch | |
| from transformers import AutoTokenizer | |
| from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig | |
| os.environ["TOKENIZERS_PARALLELISM"] = "false" | |
| def load_model_tokenizer(model_name_or_local_path): | |
|     """ | |
|     Load model and tokenizer based on the given model name or local path of the downloaded model. | |
|     """ | |
|     tokenizer = AutoTokenizer.from_pretrained(model_name_or_local_path, | |
|                                               trust_remote_code=True, | |
|                                               use_fast=False, | |
|                                               legacy=False) | |
|     # Pad on the left so generation continues directly from the prompt | |
|     tokenizer.padding_side = "left" | |
|     model = AutoGPTQForCausalLM.from_quantized(model_name_or_local_path, | |
|                                                inject_fused_attention=False, | |
|                                                inject_fused_mlp=False, | |
|                                                use_cuda_fp16=True, | |
|                                                disable_exllama=False, | |
|                                                device_map='auto'  # Support multi-gpus | |
|                                                ) | |
|     return model, tokenizer | |
| def inference(model, tokenizer, prompt): | |
|     """ | |
|     Use the given model and tokenizer to generate an answer for the specified prompt. | |
|     """ | |
|     st = time.time() | |
|     prompt = prompt if prompt.endswith('\n') else f'{prompt}\n' | |
|     # The input must end with the bot role tag so the model generates the answer | |
|     inputs = f"<|role_start|>human<|role_end|>{prompt}<|role_start|>bot<|role_end|>" | |
|     input_ids = tokenizer.encode(inputs, | |
|                                  return_tensors="pt", | |
|                                  padding=True, | |
|                                  add_special_tokens=False).to("cuda") | |
|     with torch.no_grad(): | |
|         generated_ids = model.generate( | |
|             input_ids=input_ids, | |
|             top_p=0.95, | |
|             temperature=0.1, | |
|             do_sample=True, | |
|             max_new_tokens=512, | |
|             eos_token_id=tokenizer.eos_token_id, | |
|             pad_token_id=tokenizer.pad_token_id | |
|         ) | |
|     # Keep only the newly generated tokens (everything after the prompt) | |
|     new_token_ids = generated_ids[0][input_ids.size(1):] | |
|     print(f'generated tokens num is {len(new_token_ids)}') | |
|     # Decoding the new tokens directly avoids slicing the decoded text by prompt length | |
|     output = tokenizer.decode(new_token_ids, skip_special_tokens=True) | |
|     print(f'generated text is {output}') | |
|     latency = time.time() - st | |
|     print('latency is {} seconds'.format(latency)) | |
| if __name__ == "__main__": | |
|     model_name_or_local_path = '<Model name (e.g. codefuse-ai/CodeFuse-CodeLlama-34B-4bits) or local path of the downloaded model>' | |
|     prompt = 'Please write a QuickSort program in Python' | |
|     model, tokenizer = load_model_tokenizer(model_name_or_local_path) | |
|     inference(model, tokenizer, prompt) | |
| ``` | |
| **The current inference example code is based on [AutoGPTQ](https://github.com/PanQiWei/AutoGPTQ). If you want to achieve higher inference speed, it is recommended to combine it with [TensorRT-LLM (Early Access)](https://developer.nvidia.com/tensorrt-llm-early-access).** | |
| <br> | |
| ## Consistency Check | |
| Here, SHA256 values are provided for the model-related files for consistency check during the download. | |
| | File | SHA256 | | |
| |-------------------------------:|:--------------------------------:| | |
| |config.json | bd1b92f942549f76d7e02e65fd346b39903943912d6d6a2ff8ff345e43e1115b | | |
| |generation_config.json | b625bd13a52d0685313c32919324b9bdc9e75a4f1338ca5c28226d1693e130a3 | | |
| |gptq_model-4bit-64g.bin | 79441bad1d5ab852d0238ed7e113b9912f31189cf9181d7119dd297c4beb454a | | |
| |pytorch_model.bin.index.json | 9a714170172282cfbcaa120af13c0df08b06d040ff24dab30229d8a010821d3d | | |
| |quantize_config.json | 3c1744a928e9d6c3f9a2cbb1bb5a89539077e7d456948bf5aee0deed6a7b8028 | | |
| |special_tokens_map.json | ff3b4a612c4e447acb02d40071bddd989fe0da87eb5b7fe0dbadfc4f74de7531 | | |
| |tokenizer.json | f7b50bcf6d6672eade5e43514d48e9c1e4e63a56aef7b14acdaca94ce93436f7 | | |
| |tokenizer.model | 9e556afd44213b6bd1be2b850ebbbd98f5481437a8021afaf58ee7fb1818d347 | | |
| |tokenizer_config.json | c12441e82f2dce0baff87cf5948e82d6e9b51cc0b5266369c30c319fb771eeb2 | | |
| <br> | |
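The check can be automated with the standard library. This is a minimal sketch: the helper names are hypothetical, only two of the hashes from the table are listed to keep it short (add the remaining rows the same way), and the model directory path is a placeholder.

```python
import hashlib
from pathlib import Path

# Two entries copied from the table above; extend with the remaining files.
EXPECTED = {
    "config.json": "bd1b92f942549f76d7e02e65fd346b39903943912d6d6a2ff8ff345e43e1115b",
    "gptq_model-4bit-64g.bin": "79441bad1d5ab852d0238ed7e113b9912f31189cf9181d7119dd297c4beb454a",
}


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so multi-gigabyte weights need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(model_dir, expected=EXPECTED):
    """Return the names of files that are missing or whose hash differs."""
    failed = []
    for name, want in expected.items():
        path = Path(model_dir) / name
        if not path.is_file() or sha256_of(path) != want:
            failed.append(name)
    return failed


if __name__ == "__main__":
    # Placeholder path: point this at the cloned repository.
    failed = verify("./CodeFuse-CodeLlama-34B-4bits")
    print("all files match" if not failed else f"mismatch or missing: {failed}")
```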
| <br> | |
| <a id="chinese"></a> | |
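| ## Model Description | |
| CodeFuse-CodeLlama-34B-4bits is the 4-bit quantized version of CodeFuse-CodeLlama-34B, a code LLM obtained by fine-tuning the base model CodeLlama-34b-Python on multiple code tasks with QLoRA. The model's input length is 4K. | |
| After 4-bit quantization, CodeFuse-CodeLlama-34B-4bits can be loaded on a single A10 (24GB VRAM) or a single RTX 4090 (24GB VRAM), and the quantized model still reaches 73.8% pass@1 on HumanEval. | |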
| <br> | |
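| ## News | |
| 🔥🔥🔥 2023-09-26 The 4-bit quantized version of CodeFuse-CodeLlama-34B is released; the quantized model reaches 73.8% pass@1 on HumanEval (greedy decoding). | |
| 🔥🔥🔥 2023-09-11 CodeFuse-CodeLlama-34B is released, reaching 74.4% pass@1 on HumanEval (greedy decoding), the open-source SOTA at the time of release. | |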
| <br> | |
| ## Code Community | |
| **Homepage**: 🏡 https://github.com/codefuse-ai (**Please support our project with a Star🌟 + Fork🚀 + Watch👀**) | |
| + If you wish to fine-tune the model yourself, you can visit ✨[MFTCoder](https://github.com/codefuse-ai/MFTCoder)✨✨ | |
| + If you wish to deploy the model yourself, you can visit ✨[FasterTransformer4CodeFuse](https://github.com/codefuse-ai/FasterTransformer4CodeFuse)✨✨ | |
| + If you wish to see a demo of the model, you can visit ✨[CodeFuse Demo](https://github.com/codefuse-ai/codefuse)✨✨ | |
| <br> | |
| ## Evaluation (Code) | |
| Model | HumanEval(pass@1) | Date | |
| |:--------------------------------|:-----------------:|:-------:| | |
| | **CodeFuse-CodeLlama-34B** | **74.4%** | 2023.9 | | |
| |**CodeFuse-CodeLlama-34B-4bits** | **73.8%** | 2023.9 | | |
| | WizardCoder-Python-34B-V1.0 | 73.2% | 2023.8 | | |
| | GPT-4(zero-shot) | 67.0% | 2023.3 | | |
| | PanGu-Coder2 15B | 61.6% | 2023.8 | | |
| | CodeLlama-34b-Python | 53.7% | 2023.8 | | |
| | CodeLlama-34b | 48.8% | 2023.8 | | |
| | GPT-3.5(zero-shot) | 48.1% | 2022.11 | | |
| | OctoCoder | 46.2% | 2023.8 | | |
| | StarCoder-15B | 33.6% | 2023.5 | | |
| | LLaMA 2 70B(zero-shot) | 29.9% | 2023.7 | | |
| <br> | |
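| ## GPU Memory Usage | |
| We measured GPU memory usage after loading the model, and when taking 2048/1024 input tokens and producing 1024/2048 output tokens. The results are shown in the table below. | |
| Precision | Idle Model | 2048 input tokens + 1024 output tokens | 1024 input tokens + 2048 output tokens | |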
| |:--------------------------------|:-------------------|:------------------------:|:------------:| | |
| |bfloat16 | 64.89GB | 69.31GB | 66.41GB | | |
| |int4 | 19.09GB | 22.19GB | 20.78GB | | |
| <br> | |
| ## Requirements | |
| * python>=3.8 | |
| * pytorch>=2.0.0 | |
| * transformers==4.32.0 | |
| * auto_gptq==0.4.2 | |
| * Sentencepiece | |
| * CUDA 11.4 | |
| <br> | |
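| ## Inference Data Format | |
| Inference data is a string concatenated in the model's training data format; the input prompt is concatenated the same way at inference time: | |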
| ```python | |
| """ | |
| <|role_start|>human<|role_end|>Human 1st round input | |
| <|role_start|>bot<|role_end|>Bot 1st round output</s> | |
| <|role_start|>human<|role_end|>Human 2nd round input | |
| <|role_start|>bot<|role_end|>Bot 2nd round output</s> | |
| ... | |
| ... | |
| ... | |
| <|end|><|role_start|>human<|role_end|>Human nth round input | |
| <|end|><|role_start|>bot<|role_end|>{Bot output to be generated}</s> | |
| """ | |
| ``` | |
| At inference time, make sure the concatenated prompt string ends with "<|role_start|>bot<|role_end|>" to prompt the model to generate an answer. | |
| <br> | |
| ## Quickstart | |
| ```bash | |
| git clone https://huggingface.co/codefuse-ai/CodeFuse-CodeLlama-34B-4bits.git | |
| ``` | |
| ```bash | |
| pip install -r requirements.txt | |
| ``` | |
| ```python | |
| import os | |
| import time | |
| import torch | |
| from transformers import AutoTokenizer | |
| from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig | |
| os.environ["TOKENIZERS_PARALLELISM"] = "false" | |
| def load_model_tokenizer(model_name_or_local_path): | |
|     """ | |
|     Load model and tokenizer based on the given model name or local path of the downloaded model. | |
|     """ | |
|     tokenizer = AutoTokenizer.from_pretrained(model_name_or_local_path, | |
|                                               trust_remote_code=True, | |
|                                               use_fast=False, | |
|                                               legacy=False) | |
|     # Pad on the left so generation continues directly from the prompt | |
|     tokenizer.padding_side = "left" | |
|     model = AutoGPTQForCausalLM.from_quantized(model_name_or_local_path, | |
|                                                inject_fused_attention=False, | |
|                                                inject_fused_mlp=False, | |
|                                                use_cuda_fp16=True, | |
|                                                disable_exllama=False, | |
|                                                device_map='auto'  # Support multi-gpus | |
|                                                ) | |
|     return model, tokenizer | |
| def inference(model, tokenizer, prompt): | |
|     """ | |
|     Use the given model and tokenizer to generate an answer for the specified prompt. | |
|     """ | |
|     st = time.time() | |
|     prompt = prompt if prompt.endswith('\n') else f'{prompt}\n' | |
|     # The input must end with the bot role tag so the model generates the answer | |
|     inputs = f"<|role_start|>human<|role_end|>{prompt}<|role_start|>bot<|role_end|>" | |
|     input_ids = tokenizer.encode(inputs, | |
|                                  return_tensors="pt", | |
|                                  padding=True, | |
|                                  add_special_tokens=False).to("cuda") | |
|     with torch.no_grad(): | |
|         generated_ids = model.generate( | |
|             input_ids=input_ids, | |
|             top_p=0.95, | |
|             temperature=0.1, | |
|             do_sample=True, | |
|             max_new_tokens=512, | |
|             eos_token_id=tokenizer.eos_token_id, | |
|             pad_token_id=tokenizer.pad_token_id | |
|         ) | |
|     # Keep only the newly generated tokens (everything after the prompt) | |
|     new_token_ids = generated_ids[0][input_ids.size(1):] | |
|     print(f'generated tokens num is {len(new_token_ids)}') | |
|     # Decoding the new tokens directly avoids slicing the decoded text by prompt length | |
|     output = tokenizer.decode(new_token_ids, skip_special_tokens=True) | |
|     print(f'generated text is {output}') | |
|     latency = time.time() - st | |
|     print('latency is {} seconds'.format(latency)) | |
| if __name__ == "__main__": | |
|     model_name_or_local_path = '<Model name (e.g. codefuse-ai/CodeFuse-CodeLlama-34B-4bits) or local path of the downloaded model>' | |
|     prompt = 'Please implement a quicksort algorithm in Python' | |
|     model, tokenizer = load_model_tokenizer(model_name_or_local_path) | |
|     inference(model, tokenizer, prompt) | |
| ``` | |
| **The current inference example code is based on [AutoGPTQ](https://github.com/PanQiWei/AutoGPTQ). If you want higher inference speed, it is recommended to combine it with [TensorRT-LLM (Early Access)](https://developer.nvidia.com/tensorrt-llm-early-access).** | |
| <br> | |
| ## Consistency Check | |
| SHA256 values of the model-related files are provided here for checking download consistency. | |
| File | SHA256 | |
| |-------------------------------:|:--------------------------------:| | |
| |config.json | bd1b92f942549f76d7e02e65fd346b39903943912d6d6a2ff8ff345e43e1115b | | |
| |generation_config.json | b625bd13a52d0685313c32919324b9bdc9e75a4f1338ca5c28226d1693e130a3 | | |
| |gptq_model-4bit-64g.bin | 79441bad1d5ab852d0238ed7e113b9912f31189cf9181d7119dd297c4beb454a | | |
| |pytorch_model.bin.index.json | 9a714170172282cfbcaa120af13c0df08b06d040ff24dab30229d8a010821d3d | | |
| |quantize_config.json | 3c1744a928e9d6c3f9a2cbb1bb5a89539077e7d456948bf5aee0deed6a7b8028 | | |
| |special_tokens_map.json | ff3b4a612c4e447acb02d40071bddd989fe0da87eb5b7fe0dbadfc4f74de7531 | | |
| |tokenizer.json | f7b50bcf6d6672eade5e43514d48e9c1e4e63a56aef7b14acdaca94ce93436f7 | | |
| |tokenizer.model | 9e556afd44213b6bd1be2b850ebbbd98f5481437a8021afaf58ee7fb1818d347 | | |
| |tokenizer_config.json | c12441e82f2dce0baff87cf5948e82d6e9b51cc0b5266369c30c319fb771eeb2 | | |