BGB 40/10
Exported from 14 run(s) (50 norms).
Score table
Score: average text similarity to the statutory text (0-100%), measured as normalized Levenshtein distance.
Net Correctness: correct answers minus wrong answers. Abstentions are not penalized, hallucinations are. Each cell also lists the counts as ✓ correct, ✗ wrong, ○ abstained.
Calibration: share of cases in which the model makes the right decision: answering when it knows the answer, abstaining when it does not.
Hallucination rate: how often the model answers incorrectly in abstention mode even though it was wrong in forced mode.
Distribution: TP / FP / TN / FN.

| Rank | Model | Provider | Score | Net Correctness | Calibration | Hallucination rate | Distribution (TP / FP / TN / FN) |
|---|---|---|---|---|---|---|---|
| 1 | Claude Opus 4.5 | Anthropic | 61.92% | -7.0 (✓21 ✗28 ○1) | 44.0% | 92.9% | 21 / 28 / 1 / 0 |
| 2 | Gemini 3 Pro Preview | Google | 61.57% | -13.0 (✓18 ✗31 ○1) | 38.0% | 96.4% | 18 / 31 / 1 / 0 |
| 3 | Gemini 3 Flash Preview | Google | 60.23% | -12.0 (✓19 ✗31 ○0) | 38.0% | 96.6% | 19 / 31 / 0 / 0 |
| 4 | Claude Opus 4.6 | Anthropic | 59.36% | -15.0 (✓16 ✗31 ○3) | 38.0% | 87.1% | 16 / 31 / 3 / 0 |
| 5 | Gemini 2.5 Pro | Google | 51.39% | -15.0 (✓17 ✗32 ○1) | 36.0% | 85.7% | 17 / 32 / 1 / 0 |
| 6 | GPT-5.3 Chat | OpenAI | 50.24% | 10.0 (✓12 ✗2 ○36) | 94.0% | 5.3% | 12 / 2 / 35 / 1 |
| 7 | GPT-5 | OpenAI | 46.34% | -6.0 (✓8 ✗14 ○28) | 70.0% | 32.5% | 8 / 14 / 27 / 1 |
| 8 | Mistral Large 2512 | MistralAI | 46.03% | -8.0 (✓13 ✗21 ○16) | 56.0% | 55.9% | 13 / 21 / 15 / 1 |
| 9 | GPT-5.1 | OpenAI | 45.94% | 8.0 (✓10 ✗2 ○38) | 94.0% | 5.0% | 10 / 2 / 37 / 1 |
| 10 | GPT-5.2 | OpenAI | 45.33% | 9.0 (✓11 ✗2 ○37) | 96.0% | 4.9% | 11 / 2 / 37 / 0 |
| 11 | o3 | OpenAI | 45.28% | -13.0 (✓7 ✗20 ○23) | 58.0% | 46.3% | 7 / 20 / 22 / 1 |
| 12 | GPT-5.2 Chat | OpenAI | 45.21% | 5.0 (✓10 ✗5 ○35) | 86.0% | 12.5% | 10 / 5 / 33 / 2 |
| 13 | Claude Opus 4.1 | Anthropic | 45.18% | -14.0 (✓9 ✗23 ○18) | 54.0% | 52.6% | 9 / 23 / 18 / 0 |
| 14 | Claude Opus 4 | Anthropic | 45.08% | -14.0 (✓8 ✗22 ○20) | 56.0% | 45.9% | 8 / 22 / 20 / 0 |
| 15 | Llama 4 Maverick | Meta | 42.68% | -32.0 (✓9 ✗41 ○0) | 18.0% | 100.0% | 9 / 41 / 0 / 0 |
| 16 | GPT-5.4 | OpenAI | 41.24% | 6.0 (✓10 ✗4 ○36) | 92.0% | 9.8% | 10 / 4 / 36 / 0 |
| 17 | Grok 4 | xAI | 37.23% | -23.0 (✓10 ✗33 ○7) | 34.0% | 82.5% | 10 / 33 / 7 / 0 |
| 18 | Kimi K2 Thinking | Moonshot AI | 35.58% | -28.0 (✓8 ✗36 ○6) | 28.0% | 81.8% | 8 / 36 / 6 / 0 |
| 19 | DeepSeek-V3.2 | DeepSeek | 35.11% | -19.0 (✓6 ✗25 ○19) | 48.0% | 52.4% | 6 / 25 / 18 / 1 |
| 20 | GPT-4.1 | OpenAI | 34.28% | -25.0 (✓6 ✗31 ○13) | 38.0% | 69.8% | 6 / 31 / 13 / 0 |
| 21 | Gemini 2.5 Flash | Google | 33.65% | -29.0 (✓9 ✗38 ○3) | 24.0% | 90.2% | 9 / 38 / 3 / 0 |
| 22 | Qwen3 Max | Alibaba | 26.71% | -25.0 (✓5 ✗30 ○15) | 40.0% | 66.7% | 5 / 30 / 15 / 0 |
| 23 | Grok 4.1 Fast | xAI | 21.61% | -44.0 (✓0 ✗44 ○6) | 12.0% | 88.0% | 0 / 44 / 6 / 0 |
| 24 | GPT-3.5 | OpenAI | 16.78% | -47.0 (✓0 ✗47 ○3) | 6.0% | 93.8% | 0 / 47 / 3 / 0 |
| 25 | Kimi K2.5 | Moonshot AI | 0.00% | -18.0 (✓8 ✗26 ○16) | — | — | — |

Total: 25 models
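
The metric definitions can be made concrete with a short Python sketch. It is an illustration under stated assumptions, not the benchmark's actual scoring code. The page only says "normalized Levenshtein distance", so dividing by the length of the longer string is an assumption. Reading Calibration as (TP + TN) / total is also an assumption, although it reproduces the published values for the rows checked (for example Claude Opus 4.5 and GPT-5.3 Chat). All function names are hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance (insertions, deletions, substitutions), two-row DP."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute or match
            ))
        previous = current
    return previous[-1]


def similarity_percent(answer: str, reference: str) -> float:
    """Text similarity in percent.

    Assumption: the distance is normalized by the longer string's length.
    """
    longest = max(len(answer), len(reference))
    if longest == 0:
        return 100.0  # two empty strings are identical
    return 100.0 * (1 - levenshtein(answer, reference) / longest)


def net_correctness(tp: int, fp: int) -> int:
    """Correct answers minus wrong answers; abstentions do not count."""
    return tp - fp


def calibration_percent(tp: int, fp: int, tn: int, fn: int) -> float:
    """Share of right decisions (assumed reading: TP and TN are 'right')."""
    total = tp + fp + tn + fn
    return 100.0 * (tp + tn) / total
```

With the counts listed for Claude Opus 4.5 (21 / 28 / 1 / 0), `net_correctness(21, 28)` returns `-7` and `calibration_percent(21, 28, 1, 0)` returns `44.0`, matching the row. The hallucination rate is left out because its denominator (cases answered wrongly in forced mode) does not appear in the table.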