BGB 40/10

Exported from 17 run(s) (50 norms).

Model selection

12 / 28 models

Filters combine with OR within a category and AND between categories. For example, Open Source + Europe → only European open-source models.

Size
Type
Region
Provider
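
The filter rule described above (OR within a category, AND between categories) can be sketched as follows. This is a minimal illustration, not the page's actual implementation: the function names `matches` and `filter_models`, the attribute keys, and the three-entry catalogue are all hypothetical.

```python
from typing import Dict, List, Set

Model = Dict[str, str]
Selection = Dict[str, Set[str]]


def matches(model: Model, selection: Selection) -> bool:
    """Return True if the model satisfies every category that has a selection.

    Within one category any selected value is enough (OR);
    all categories with a selection must be satisfied (AND).
    """
    return all(
        model.get(category) in allowed
        for category, allowed in selection.items()
        if allowed  # an empty selection leaves that category unfiltered
    )


def filter_models(models: List[Model], selection: Selection) -> List[Model]:
    """Keep only the models that pass the combined filter."""
    return [m for m in models if matches(m, selection)]


# Hypothetical catalogue; names and attribute values are illustrative only.
catalogue = [
    {"name": "model-a", "type": "open source", "region": "Europe"},
    {"name": "model-b", "type": "open source", "region": "USA"},
    {"name": "model-c", "type": "proprietary", "region": "Europe"},
]

# AND between categories: open source AND Europe.
european_open = filter_models(
    catalogue, {"type": {"open source"}, "region": {"Europe"}}
)

# OR within a category: Europe OR USA.
either_region = filter_models(catalogue, {"region": {"Europe", "USA"}})
```

Treating an empty selection as "no filter" is a design assumption; it keeps the full list visible until the user picks at least one value in a category.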

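Score table

| Rank | Model | Provider | Score | Net Correctness | Calibration | Hallucination rate | Distribution TP / FP / TN / FN |
|---|---|---|---|---|---|---|---|
| 1 | GPT-5.5 | OpenAI | 77.47% | 26.0 (30 / 4 / 16) | 84.0% | 23.5% | 30 / 4 / 12 / 4 |
| 2 | Claude Opus 4.7 | Anthropic | 68.51% | 4.0 (23 / 19 / 8) | 60.0% | 72.0% | 23 / 19 / 7 / 1 |
| 3 | Gemini 3 Pro Preview | Google | 61.57% | -13.0 (18 / 31 / 1) | 38.0% | 96.4% | 18 / 31 / 1 / 0 |
| 4 | Claude Opus 4.6 | Anthropic | 59.36% | -3.0 (16 / 19 / 15) | 60.0% | 51.6% | 16 / 19 / 14 / 1 |
| 5 | DeepSeek-V4-Pro | DeepSeek | 53.88% | 2.0 (17 / 15 / 18) | 68.0% | 32.4% | 17 / 15 / 17 / 1 |
| 6 | GPT-5 | OpenAI | 46.34% | -6.0 (8 / 14 / 28) | 70.0% | 32.5% | 8 / 14 / 27 / 1 |
| 7 | Mistral Large 2512 | Mistral AI | 46.03% | -8.0 (13 / 21 / 16) | 56.0% | 55.9% | 13 / 21 / 15 / 1 |
| 8 | Llama 4 Maverick | Meta | 42.68% | -32.0 (9 / 41 / 0) | 18.0% | 100.0% | 9 / 41 / 0 / 0 |
| 9 | GPT-5.4 | OpenAI | 41.24% | 6.0 (10 / 4 / 36) | 92.0% | 9.8% | 10 / 4 / 36 / 0 |
| 10 | Grok 4 | xAI | 37.23% | -23.0 (10 / 33 / 7) | 34.0% | 82.5% | 10 / 33 / 7 / 0 |
| 11 | GPT-4.1 | OpenAI | 34.28% | -25.0 (6 / 31 / 13) | 38.0% | 69.8% | 6 / 31 / 13 / 0 |
| 12 | GPT-3.5 Turbo | OpenAI | 16.78% | -47.0 (0 / 47 / 3) | 6.0% | 93.8% | 0 / 47 / 3 / 0 |

Total (12 models)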
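
The page does not define its metrics. The reported figures are, however, consistent with two simple relations over the TP / FP / TN / FN counts: Net Correctness = TP - FP, and Calibration = (TP + TN) / 50 norms. The sketch below checks that reading against all twelve rows. It is an inference from the numbers, not a documented formula; the names `Counts`, `net_correctness`, `calibration`, and `mismatches` are hypothetical. Score and Hallucination rate are deliberately left out, because the counts alone do not reproduce them.

```python
import math
from typing import Dict, List, NamedTuple, Tuple


class Counts(NamedTuple):
    tp: int
    fp: int
    tn: int
    fn: int


def net_correctness(c: Counts) -> float:
    """Assumed relation: true positives minus false positives."""
    return float(c.tp - c.fp)


def calibration(c: Counts) -> float:
    """Assumed relation: share of TP + TN among all norms, in percent."""
    return 100.0 * (c.tp + c.tn) / sum(c)


# model -> (counts, reported Net Correctness, reported Calibration in %)
ROWS: Dict[str, Tuple[Counts, float, float]] = {
    "GPT-5.5": (Counts(30, 4, 12, 4), 26.0, 84.0),
    "Claude Opus 4.7": (Counts(23, 19, 7, 1), 4.0, 60.0),
    "Gemini 3 Pro Preview": (Counts(18, 31, 1, 0), -13.0, 38.0),
    "Claude Opus 4.6": (Counts(16, 19, 14, 1), -3.0, 60.0),
    "DeepSeek-V4-Pro": (Counts(17, 15, 17, 1), 2.0, 68.0),
    "GPT-5": (Counts(8, 14, 27, 1), -6.0, 70.0),
    "Mistral Large 2512": (Counts(13, 21, 15, 1), -8.0, 56.0),
    "Llama 4 Maverick": (Counts(9, 41, 0, 0), -32.0, 18.0),
    "GPT-5.4": (Counts(10, 4, 36, 0), 6.0, 92.0),
    "Grok 4": (Counts(10, 33, 7, 0), -23.0, 34.0),
    "GPT-4.1": (Counts(6, 31, 13, 0), -25.0, 38.0),
    "GPT-3.5 Turbo": (Counts(0, 47, 3, 0), -47.0, 6.0),
}


def mismatches(rows: Dict[str, Tuple[Counts, float, float]]) -> List[str]:
    """Return the models whose reported values disagree with the assumed relations."""
    bad = []
    for name, (counts, reported_net, reported_cal) in rows.items():
        ok = (
            sum(counts) == 50  # every row should cover all 50 norms
            and math.isclose(net_correctness(counts), reported_net)
            and math.isclose(calibration(counts), reported_cal)
        )
        if not ok:
            bad.append(name)
    return bad


print(mismatches(ROWS))  # prints []
```

Under this reading, the three numbers shown next to each Net Correctness value are TP, FP, and TN + FN, which is why they also sum to 50 in every row.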