Work in Progress

BGB 40/10

Exported from 14 runs (50 norms).

Model selection

25 / 25 models

Filters combine with OR within a category and AND between categories. For example, Open Source + Europe → only European open-source models.

Size
Type
Region
Provider
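
The filter rule stated above (OR within a category, AND between categories) can be sketched as follows. This is a minimal illustration, not the page's actual implementation: the `Model` class, its field names, the `matches` and `filter_models` helpers, and the placeholder model entries are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Model:
    # Hypothetical record; the fields mirror the four filter categories.
    name: str
    size: str
    kind: str      # the "Type" category, e.g. open source vs. proprietary
    region: str
    provider: str


def matches(model, selection):
    """Return True if the model passes every active category filter.

    `selection` maps a category name to the set of accepted values.
    An empty set means the category is not filtered at all.
    """
    return all(
        # OR within a category: any accepted value is enough.
        not accepted or getattr(model, category) in accepted
        # AND between categories: every category must pass.
        for category, accepted in selection.items()
    )


def filter_models(models, selection):
    return [m for m in models if matches(m, selection)]


# Placeholder data, not taken from the leaderboard.
models = [
    Model("model-a", "large", "open source", "europe", "provider-x"),
    Model("model-b", "large", "proprietary", "europe", "provider-y"),
    Model("model-c", "small", "open source", "usa", "provider-z"),
]

european_open = filter_models(
    models, {"kind": {"open source"}, "region": {"europe"}}
)
print([m.name for m in european_open])  # ['model-a']
```

Selecting two values in one category widens the result, while adding a second category narrows it: `{"region": {"europe", "usa"}, "kind": {"open source"}}` keeps `model-a` and `model-c`.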

Score table

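| Rank | Model | Provider | Score | Net Correctness (adjacent counts) | Calibration | Hallucination rate | Distribution (TP / FP / TN / FN) |
|---|---|---|---|---|---|---|---|
| 1 | Claude Opus 4.5 | Anthropic | 61.92% | -7.0 (21 / 28 / 1) | 44.0% | 92.9% | 21 / 28 / 1 / 0 |
| 2 | Gemini 3 Pro Preview | Google | 61.57% | -13.0 (18 / 31 / 1) | 38.0% | 96.4% | 18 / 31 / 1 / 0 |
| 3 | Gemini 3 Flash Preview | Google | 60.23% | -12.0 (19 / 31 / 0) | 38.0% | 96.6% | 19 / 31 / 0 / 0 |
| 4 | Claude Opus 4.6 | Anthropic | 59.36% | -15.0 (16 / 31 / 3) | 38.0% | 87.1% | 16 / 31 / 3 / 0 |
| 5 | Gemini 2.5 Pro | Google | 51.39% | -15.0 (17 / 32 / 1) | 36.0% | 85.7% | 17 / 32 / 1 / 0 |
| 6 | GPT-5.3 Chat | OpenAI | 50.24% | 10.0 (12 / 2 / 36) | 94.0% | 5.3% | 12 / 2 / 35 / 1 |
| 7 | GPT-5 | OpenAI | 46.34% | -6.0 (8 / 14 / 28) | 70.0% | 32.5% | 8 / 14 / 27 / 1 |
| 8 | Mistral Large 2512 | MistralAI | 46.03% | -8.0 (13 / 21 / 16) | 56.0% | 55.9% | 13 / 21 / 15 / 1 |
| 9 | GPT-5.1 | OpenAI | 45.94% | 8.0 (10 / 2 / 38) | 94.0% | 5.0% | 10 / 2 / 37 / 1 |
| 10 | GPT-5.2 | OpenAI | 45.33% | 9.0 (11 / 2 / 37) | 96.0% | 4.9% | 11 / 2 / 37 / 0 |
| 11 | o3 | OpenAI | 45.28% | -13.0 (7 / 20 / 23) | 58.0% | 46.3% | 7 / 20 / 22 / 1 |
| 12 | GPT-5.2 Chat | OpenAI | 45.21% | 5.0 (10 / 5 / 35) | 86.0% | 12.5% | 10 / 5 / 33 / 2 |
| 13 | Claude Opus 4.1 | Anthropic | 45.18% | -14.0 (9 / 23 / 18) | 54.0% | 52.6% | 9 / 23 / 18 / 0 |
| 14 | Claude Opus 4 | Anthropic | 45.08% | -14.0 (8 / 22 / 20) | 56.0% | 45.9% | 8 / 22 / 20 / 0 |
| 15 | Llama 4 Maverick | Meta | 42.68% | -32.0 (9 / 41 / 0) | 18.0% | 100.0% | 9 / 41 / 0 / 0 |
| 16 | GPT-5.4 | OpenAI | 41.24% | 6.0 (10 / 4 / 36) | 92.0% | 9.8% | 10 / 4 / 36 / 0 |
| 17 | Grok 4 | xAI | 37.23% | -23.0 (10 / 33 / 7) | 34.0% | 82.5% | 10 / 33 / 7 / 0 |
| 18 | Kimi K2 Thinking | Moonshot AI | 35.58% | -28.0 (8 / 36 / 6) | 28.0% | 81.8% | 8 / 36 / 6 / 0 |
| 19 | DeepSeek-V3.2 | DeepSeek | 35.11% | -19.0 (6 / 25 / 19) | 48.0% | 52.4% | 6 / 25 / 18 / 1 |
| 20 | GPT-4.1 | OpenAI | 34.28% | -25.0 (6 / 31 / 13) | 38.0% | 69.8% | 6 / 31 / 13 / 0 |
| 21 | Gemini 2.5 Flash | Google | 33.65% | -29.0 (9 / 38 / 3) | 24.0% | 90.2% | 9 / 38 / 3 / 0 |
| 22 | Qwen3 Max | Alibaba | 26.71% | -25.0 (5 / 30 / 15) | 40.0% | 66.7% | 5 / 30 / 15 / 0 |
| 23 | Grok 4.1 Fast | xAI | 21.61% | -44.0 (0 / 44 / 6) | 12.0% | 88.0% | 0 / 44 / 6 / 0 |
| 24 | GPT-3.5 | OpenAI | 16.78% | -47.0 (0 / 47 / 3) | 6.0% | 93.8% | 0 / 47 / 3 / 0 |
| 25 | Kimi K2.5 | Moonshot AI | 0.00% | -18.0 (8 / 26 / 16) | | | |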
Total (25 models)
Country: DE
Law: BGB
Norms: 50
Models: 25
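
In every complete row of the score table, Net Correctness equals TP minus FP, and the calibration percentage equals (TP + TN) divided by the 50 norms. The page does not state these definitions, so the sketch below is an inference from the rows shown, not the benchmark's documented formula. The function names are hypothetical. Score and hallucination rate are left out because their formulas cannot be recovered from the counts.

```python
def net_correctness(tp, fp):
    # Assumed relation: correct answers minus wrong answers.
    return tp - fp


def calibration_pct(tp, fp, tn, fn):
    # Assumed relation: share of norms handled correctly (TP or TN), in percent.
    total = tp + fp + tn + fn
    return round(100 * (tp + tn) / total, 1)


# Counts copied from two rows of the table (TP, FP, TN, FN).
rows = {
    "Claude Opus 4.5": (21, 28, 1, 0),
    "GPT-5.3 Chat": (12, 2, 35, 1),
}

for name, (tp, fp, tn, fn) in rows.items():
    print(name, net_correctness(tp, fp), calibration_pct(tp, fp, tn, fn))
# Claude Opus 4.5 -7 44.0
# GPT-5.3 Chat 10 94.0
```

Both results match the table's values for those rows (-7.0 and 44.0%, 10.0 and 94.0%).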