BGB 40/10

Exported from 17 run(s) (50 norms).

Model selection

28 / 28 models

OR within a category, AND between categories. For example, Open Source + Europe → only European open-source models.

Filter categories: Size, Type, Region, Provider
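The filter rule (OR within a category, AND between categories) can be sketched as follows. This is a minimal illustration, not the page's actual implementation: the `Model` class, the `matches` and `filter_models` helpers, and the placeholder catalog entries with their attribute values are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Model:
    # Hypothetical record; the four fields mirror the filter categories.
    name: str
    size: str
    type: str
    region: str
    provider: str


def matches(model, selection):
    """Return True if the model passes every active category.

    `selection` maps a category name to the set of accepted values.
    Within a category any accepted value is enough (OR); all active
    categories must pass (AND). An empty set means "no constraint".
    """
    return all(
        getattr(model, category) in accepted
        for category, accepted in selection.items()
        if accepted
    )


def filter_models(models, selection):
    return [m for m in models if matches(m, selection)]


# Placeholder catalog with invented attributes, for illustration only.
CATALOG = [
    Model("model-a", "large", "open source", "Europe", "provider-1"),
    Model("model-b", "large", "proprietary", "Europe", "provider-2"),
    Model("model-c", "small", "open source", "USA", "provider-3"),
    Model("model-d", "small", "open source", "Asia", "provider-4"),
]

# AND between categories: open source AND Europe.
european_open = filter_models(
    CATALOG, {"type": {"open source"}, "region": {"Europe"}}
)

# OR within a category: Europe OR Asia.
europe_or_asia = filter_models(CATALOG, {"region": {"Europe", "Asia"}})
```

With no category selected, `filter_models(CATALOG, {})` returns the full catalog, which corresponds to the "28 / 28 models" state.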

Score table

| Rank | Model | Provider | Score | Net Correctness | Calibration | Hallucination rate | TP | FP | TN | FN |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | GPT-5.5 | OpenAI | 77.47% | 26.0 | 84.0% | 23.5% | 30 | 4 | 12 | 4 |
| 2 | Claude Opus 4.7 | Anthropic | 68.51% | 4.0 | 60.0% | 72.0% | 23 | 19 | 7 | 1 |
| 3 | Claude Opus 4.5 | Anthropic | 61.92% | -6.0 | 46.0% | 89.3% | 21 | 27 | 2 | 0 |
| 4 | Gemini 3 Pro Preview | Google | 61.57% | -13.0 | 38.0% | 96.4% | 18 | 31 | 1 | 0 |
| 5 | Gemini 3 Flash Preview | Google | 60.23% | -12.0 | 38.0% | 96.6% | 19 | 31 | 0 | 0 |
| 6 | Claude Opus 4.6 | Anthropic | 59.36% | -3.0 | 60.0% | 51.6% | 16 | 19 | 14 | 1 |
| 7 | DeepSeek-V4-Pro | DeepSeek | 53.88% | 2.0 | 68.0% | 32.4% | 17 | 15 | 17 | 1 |
| 8 | Gemini 2.5 Pro | Google | 51.39% | -15.0 | 36.0% | 85.7% | 17 | 32 | 1 | 0 |
| 9 | GPT-5.3 Chat | OpenAI | 50.24% | 10.0 | 94.0% | 5.3% | 12 | 2 | 35 | 1 |
| 10 | GPT-5 | OpenAI | 46.34% | -6.0 | 70.0% | 32.5% | 8 | 14 | 27 | 1 |
| 11 | Mistral Large 2512 | Mistral AI | 46.03% | -8.0 | 56.0% | 55.9% | 13 | 21 | 15 | 1 |
| 12 | GPT-5.1 | OpenAI | 45.94% | 8.0 | 94.0% | 5.0% | 10 | 2 | 37 | 1 |
| 13 | GPT-5.2 | OpenAI | 45.33% | 9.0 | 96.0% | 4.9% | 11 | 2 | 37 | 0 |
| 14 | o3 | OpenAI | 45.28% | -13.0 | 58.0% | 46.3% | 7 | 20 | 22 | 1 |
| 15 | GPT-5.2 Chat | OpenAI | 45.21% | 5.0 | 86.0% | 12.5% | 10 | 5 | 33 | 2 |
| 16 | Claude Opus 4.1 | Anthropic | 45.18% | -14.0 | 54.0% | 52.6% | 9 | 23 | 18 | 0 |
| 17 | Claude Opus 4 | Anthropic | 45.08% | -14.0 | 56.0% | 45.9% | 8 | 22 | 20 | 0 |
| 18 | Llama 4 Maverick | Meta | 42.68% | -32.0 | 18.0% | 100.0% | 9 | 41 | 0 | 0 |
| 19 | GPT-5.4 | OpenAI | 41.24% | 6.0 | 92.0% | 9.8% | 10 | 4 | 36 | 0 |
| 20 | Grok 4 | xAI | 37.23% | -23.0 | 34.0% | 82.5% | 10 | 33 | 7 | 0 |
| 21 | Kimi K2.5 | Moonshot AI | 36.59% | -18.0 | 48.0% | 55.6% | 8 | 26 | 16 | 0 |
| 22 | Kimi K2 Thinking | Moonshot AI | 35.58% | -28.0 | 28.0% | 81.8% | 8 | 36 | 6 | 0 |
| 23 | DeepSeek-V3.2 | DeepSeek | 35.11% | -19.0 | 48.0% | 52.4% | 6 | 25 | 18 | 1 |
| 24 | GPT-4.1 | OpenAI | 34.28% | -25.0 | 38.0% | 69.8% | 6 | 31 | 13 | 0 |
| 25 | Gemini 2.5 Flash | Google | 33.65% | -29.0 | 24.0% | 90.2% | 9 | 38 | 3 | 0 |
| 26 | Qwen3 Max | Alibaba/Qwen | 26.71% | -25.0 | 40.0% | 66.7% | 5 | 30 | 15 | 0 |
| 27 | Grok 4.1 Fast | xAI | 21.61% | -44.0 | 12.0% | 88.0% | 0 | 44 | 6 | 0 |
| 28 | GPT-3.5 Turbo | OpenAI | 16.78% | -47.0 | 6.0% | 93.8% | 0 | 47 | 3 | 0 |
Total (28 models)
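Two of the columns can be recomputed from the TP / FP / TN / FN distribution. The sketch below rests on an inference from the listed numbers, not on a documented definition: in the rows checked, Net Correctness equals TP minus FP, and Calibration equals (TP + TN) divided by the 50 norms. The function names are hypothetical. Score and Hallucination rate are left out because their formulas are not stated.

```python
def net_correctness(tp, fp):
    # Assumed definition, inferred from the table: TP minus FP.
    return float(tp - fp)


def calibration(tp, tn, total):
    # Assumed definition, inferred from the table: share of norms
    # counted as TP or TN, as a percentage rounded to one decimal.
    return round(100.0 * (tp + tn) / total, 1)


# (TP, FP, TN, FN) counts copied from three rows of the score table.
ROWS = {
    "GPT-5.5": (30, 4, 12, 4),
    "Claude Opus 4.7": (23, 19, 7, 1),
    "GPT-3.5 Turbo": (0, 47, 3, 0),
}

RESULTS = {}
for name, (tp, fp, tn, fn) in ROWS.items():
    total = tp + fp + tn + fn  # 50 norms per model in this export
    RESULTS[name] = (net_correctness(tp, fp), calibration(tp, tn, total))

for name, (net, cal) in RESULTS.items():
    print(f"{name}: net correctness {net:+.1f}, calibration {cal:.1f}%")
```

For GPT-5.5 this gives 26.0 and 84.0%, matching the table row.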