MartyNattakit committed on
Commit 81c4c9c · 1 Parent(s): ef0d2f0

add app.py, Dockerfile, README for HF Spaces

Files changed (3)
  1. README.md +78 -3
  2. app.py +30 -0
  3. dockerfile +24 -0
README.md CHANGED
@@ -1,7 +1,82 @@
- # AI Builders 2025 : CodeSentinel | CWE Classifier for C/C++
+ ---
+ title: CodeSentinel
+ emoji: 🛡️
+ colorFrom: green
+ colorTo: gray
+ sdk: docker
+ app_port: 7860
+ pinned: false
+ ---
 
- ## Overview:
- CodeSentinel is an AI-based system designed to automatically detect and classify software vulnerabilities in C and C++ code using the Common Weakness Enumeration (CWE) standard. By utilizing Machine Learning (ML) and Natural Language Processing (NLP) techniques, this project aims to automate the identification of common security vulnerabilities in source code, significantly reducing the time and effort required for manual analysis.
+ # CodeSentinel
+
+ Vulnerability classification tool combining fine-tuned ML models with MITRE framework coverage.
+
+ Paste a **code snippet**, **CVE description**, or **bug report**, and CodeSentinel identifies the vulnerability type, severity, and (for AI/ML inputs) the relevant ATLAS attack technique.
+
+ ## What it does
+
+ - **Code input** → Qwen2.5-Coder 7B analyzes the code → RoBERTa classifies the CWE
+ - **Text input** → RoBERTa classifies directly from the description
+ - **AI/ML input** → ATLAS pattern matcher identifies the relevant attack technique
+
+ ## Models
+
+ | Model | Purpose | Accuracy |
+ |-------|---------|----------|
+ | [`martynattakit/vuln-classifier-roberta`](https://huggingface.co/martynattakit/vuln-classifier-roberta) | CWE classification from text | Macro F1: 0.850 |
+ | [`martynattakit/vuln-analyzer-qwen-lora`](https://huggingface.co/martynattakit/vuln-analyzer-qwen-lora) | Code → vulnerability description | Eval loss: not reported |
+
+ ## Coverage
+
+ **CWE Top 25** (MITRE 2024):
+ CWE-787, CWE-79, CWE-89, CWE-416, CWE-78, CWE-20, CWE-125, CWE-22, CWE-352, CWE-434, CWE-862, CWE-476, CWE-287, CWE-190, CWE-502, CWE-77, CWE-119, CWE-798, CWE-918, CWE-306, CWE-362, CWE-269, CWE-94, CWE-863, CWE-276
+
+ **MITRE ATLAS** (25 techniques):
+ Prompt injection, data poisoning, model extraction, membership inference, adversarial examples, jailbreaking, and more.
+
+ ## Known limitations
+
+ - **CWE-77**: 0 F1 due to insufficient training samples. Predictions for this class are unreliable.
+ - **CWE-863**: F1 0.60; semantic overlap with CWE-862 makes these hard to distinguish.
+ - **ATLAS matching** uses keyword signals + retrieval, not a fine-tuned classifier. Confidence scores reflect signal overlap, not ground-truth accuracy. No labeled ATLAS dataset exists yet.
+ - **Code analysis** training data is primarily C/C++ (BigVul). Python/JS/Go descriptions may be less precise.
+
+ ## Stack
+
+ ```
+ RoBERTa-base        fine-tuned on 165k CVE→CWE pairs (xamxte/cve-to-cwe)
+ Qwen2.5-Coder-7B    QLoRA fine-tuned on BigVul (1,596 samples)
+ ATLAS matcher       keyword RAG over 25 hand-crafted MITRE case studies
+ FastAPI             REST API backend
+ ```
+
+ ## Local development
+
+ ```bash
+ pip install -r requirements.txt
+ python app.py
+ # → http://localhost:7860
+ ```
+
+ ## Project structure
+
+ ```
+ pipeline/
+   classifier.py       RoBERTa inference wrapper
+   code_analyzer.py    Qwen inference wrapper
+   atlas_matcher.py    ATLAS pattern matcher
+   router.py           Input routing + output card
+ api/
+   main.py             FastAPI endpoints
+ frontend/
+   index.html          Web UI
+ data/
+   atlas_cases.json    25 MITRE ATLAS techniques (hand-crafted)
+ notebooks/
+   01_roberta_finetune.ipynb
+   02_qwen_qlora.ipynb
+ ```
 
  ## Links
  - [Try the application here!](https://huggingface.co/spaces/martynattakit/CodeSentinel-CWE_Classification)
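
Review note on the README's "What it does" and "Known limitations" sections: the three input routes, and the statement that ATLAS confidence reflects signal overlap, can be illustrated with a small sketch. This is not the code in `pipeline/router.py` or `pipeline/atlas_matcher.py`, neither of which is part of this commit. The keyword sets, the `CODE_MARKERS` regex, the `Route` type, and the 0.5 threshold are all hypothetical, chosen only to make the idea concrete; the real technique entries live in `data/atlas_cases.json`, whose schema is not shown here.

```python
import re
from dataclasses import dataclass
from typing import Optional, Tuple

# Hypothetical keyword signals for three ATLAS-style techniques.
# The project's actual entries are in data/atlas_cases.json (schema unknown).
ATLAS_SIGNALS = {
    "prompt injection": {"prompt", "injection", "instructions", "jailbreak"},
    "data poisoning": {"poisoning", "poisoned", "training", "dataset"},
    "model extraction": {"extraction", "steal", "queries", "surrogate"},
}

# Crude, assumed markers that suggest source code rather than prose.
CODE_MARKERS = re.compile(r"[{};]|#include|\bint\s+main\b|\bdef\s+\w+\(|->")


@dataclass
class Route:
    kind: str                        # "code", "text", or "ai_ml"
    technique: Optional[str] = None  # set only for the "ai_ml" route
    overlap: float = 0.0             # fraction of a technique's signals found


def atlas_overlap(text: str) -> Tuple[Optional[str], float]:
    """Return the technique with the highest keyword overlap and its score."""
    tokens = set(re.findall(r"[a-z]+", text.lower()))
    best_name, best_score = None, 0.0
    for name, signals in ATLAS_SIGNALS.items():
        score = len(tokens & signals) / len(signals)
        if score > best_score:
            best_name, best_score = name, score
    return best_name, best_score


def route(text: str, threshold: float = 0.5) -> Route:
    """Pick one of the three routes described in the README."""
    name, score = atlas_overlap(text)
    if name is not None and score >= threshold:
        return Route("ai_ml", name, score)
    if CODE_MARKERS.search(text):
        return Route("code")  # would go to the code analyzer, then the classifier
    return Route("text")      # would go straight to the classifier
```

The `overlap` value here is only the share of a technique's keywords that appear in the input, which is why a score of this kind says nothing about how often the match is actually correct.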
app.py ADDED
@@ -0,0 +1,30 @@
+ """
+ app.py
+ HF Spaces entry point: serves both the FastAPI backend and the frontend UI.
+ HF Spaces runs this file directly with: python app.py
+ """
+
+ import uvicorn
+ from fastapi.staticfiles import StaticFiles
+ from fastapi.responses import FileResponse
+ from pathlib import Path
+
+ from api.main import app
+
+ # ── Serve frontend ────────────────────────────────────────────────────────────
+ # Mount the frontend folder so index.html is served at "/"
+
+ FRONTEND_DIR = Path(__file__).parent / "frontend"
+
+ @app.get("/", include_in_schema=False)
+ async def serve_frontend():
+     return FileResponse(FRONTEND_DIR / "index.html")
+
+ # ── Run ───────────────────────────────────────────────────────────────────────
+ if __name__ == "__main__":
+     uvicorn.run(
+         "app:app",
+         host="0.0.0.0",
+         port=7860,  # HF Spaces default port
+         reload=False,
+     )
dockerfile ADDED
@@ -0,0 +1,24 @@
+ FROM python:3.10-slim
+
+ WORKDIR /app
+
+ # Install system dependencies
+ RUN apt-get update && apt-get install -y \
+     git \
+     curl \
+     && rm -rf /var/lib/apt/lists/*
+
+ # Copy requirements first for layer caching
+ COPY requirements.txt .
+
+ # Install Python dependencies
+ RUN pip install --no-cache-dir -r requirements.txt
+
+ # Copy project files
+ COPY . .
+
+ # HF Spaces runs on port 7860
+ EXPOSE 7860
+
+ # Start the app
+ CMD ["python", "app.py"]
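
Review note on the Dockerfile and `app.py`: both assume the server ends up listening on port 7860. A quick way to confirm that after starting the container locally is a small port check like the one below. It is a sketch using only the standard library; the helper name `wait_for_port` and its default timeout and interval are assumptions, not something this commit provides.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout: float = 30.0, interval: float = 0.5) -> bool:
    """Return True once a TCP connection to host:port succeeds.

    Retries until `timeout` seconds have elapsed, then returns False.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            # A successful connect means something is accepting on the port.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)
```

For example, after `docker build -f dockerfile -t codesentinel .` and `docker run -p 7860:7860 codesentinel` (the image name `codesentinel` is a placeholder), calling `wait_for_port("localhost", 7860)` should return `True` once uvicorn has started. Model loading may make startup slow, so a longer timeout may be needed.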