METHODOLOGY
Independent. Reproducible. Swiss.
Every KI-Assurance evaluation follows the same rigorous methodology – whether we evaluate one model or thirty. No opinions. No black boxes. Only reproducible, evidence-based results.
The Engine
Inspect AI
The evaluation infrastructure of the UK AI Safety Institute, used by leading AI labs including xAI, with contributions from DeepMind and Anthropic. Open source (MIT license), with over 100 evaluation tasks and a proven architecture for systematic AI testing.
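For illustration, a minimal Inspect AI task looks like the following. This is a generic sketch against the public Inspect AI API, not one of our actual task definitions; the task name, sample, and model identifier are placeholders.

```python
from inspect_ai import Task, task
from inspect_ai.dataset import Sample
from inspect_ai.scorer import match
from inspect_ai.solver import generate, system_message

@task
def placeholder_legal_qa() -> Task:
    # A single hard-coded sample stands in for a real dataset.
    return Task(
        dataset=[
            Sample(
                input="Which Swiss federal act governs data protection?",
                target="FADP",
            )
        ],
        solver=[system_message("Answer with the act's abbreviation only."), generate()],
        scorer=match(),
    )

# Run against any supported model provider, e.g.:
# from inspect_ai import eval
# eval(placeholder_legal_qa(), model="openai/gpt-4o-mini")
```

Because tasks, solvers, and scorers are declared in code, the same definition can be re-run unchanged against any model under evaluation.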
Compl-AI
The EU AI Act compliance benchmark suite from ETH Zurich, INSAIT, and LatticeFlow AI. Maps 27+ established benchmarks to the 6 Trustworthy AI principles (EU HLEG). Published methodology (arXiv: 2410.07959).
Swiss-Bench
Our proprietary evaluation scenarios for Swiss languages (German, French, Italian), legal terminology, financial domain language, and domain-specific failure modes in the Swiss regulatory environment.
KIAS Score: 6 Dimensions
Accuracy & Performance
Does the model perform its task correctly?
Robustness & Reliability
Does it behave consistently under stress?
Fairness & Non-Discrimination
Does it treat all groups equitably?
Data Protection
Does it protect personal data?
Transparency & Explainability
Can its decisions be traced and understood?
Swiss Regulatory Alignment
Is it suitable for the Swiss regulatory environment?
Each dimension is scored 0–100, with confidence intervals and sample sizes.
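A dimension score with a confidence interval can be derived directly from per-sample pass/fail results. A minimal sketch (not our production scorer) using the Wilson score interval at 95% confidence:

```python
import math

def dimension_score(results: list[bool], z: float = 1.96) -> tuple[float, float, float]:
    """Score a dimension 0-100 from binary pass/fail results,
    returning (score, ci_low, ci_high) via the Wilson interval.
    Note: the Wilson centre differs slightly from the raw pass rate."""
    n = len(results)
    p = sum(results) / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    margin = (z / denom) * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return (
        100 * centre,
        100 * max(0.0, centre - margin),
        100 * min(1.0, centre + margin),
    )

# Example: 87 of 100 samples pass
score, lo, hi = dimension_score([True] * 87 + [False] * 13)
```

Reporting the interval alongside the point score makes clear how much of a gap between two models is signal rather than sampling noise.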
The Process
Scoping
We jointly define evaluation objectives, models, and benchmarks (1 hour).
Configuration
We configure the evaluation pipeline for your specific models and data (2–4 hours).
Evaluation
The engine runs automated benchmarks. No manual intervention. Fully reproducible.
Analysis
We interpret the results, identify failure modes, and map gaps to regulatory requirements.
Report
You receive a standardized evaluation report with KIAS scores, gap analysis, and recommendations.
Handover
You receive the complete evaluation harness. You can rerun every test yourself.
Reproducibility Guarantee
Every evaluation report includes:
- Complete evaluation configuration (Inspect AI task definitions, scorer logic, datasets)
- Model version identifiers and API parameters used
- Seed values and sampling parameters
- Cryptographic timestamp of raw results
- The complete evaluation harness – rerunnable at any time
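The cryptographic timestamp above amounts to hashing the raw result file and recording when the hash was taken. A minimal sketch (the field names and example values are illustrative, not our report schema):

```python
import hashlib
import json
from datetime import datetime, timezone

def result_fingerprint(raw_results: bytes, config: dict) -> dict:
    """Produce a reproducibility record: SHA-256 of the raw results
    plus the exact run configuration and a UTC timestamp."""
    return {
        "results_sha256": hashlib.sha256(raw_results).hexdigest(),
        "config": config,  # model version, seeds, sampling parameters
        "timestamp_utc": datetime.now(timezone.utc).isoformat(),
    }

record = result_fingerprint(
    b'{"task": "swiss-bench", "passed": 87, "total": 100}',
    {"model": "example-model-v1", "seed": 42, "temperature": 0.0},
)
print(json.dumps(record, indent=2))
```

Anyone holding the raw results can recompute the hash and confirm nothing was altered after the run; a third-party timestamping authority (RFC 3161) can strengthen this further if required.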
We do not use proprietary, non-reproducible methods.
Independence
We have no commercial relationships with AI model providers. No commissions. No vendor partnerships. No pay-for-score. Every model is evaluated with the same methodology.
Data Sovereignty
Every engagement fits one of four deployment options:
- Remote: you provide an API key; we run the evaluation.
- On-premise: our dockerized engine runs on your infrastructure.
- Air-gapped: we bring dedicated hardware to your site, with a complete air gap.
- Anonymized: you anonymize your data first using our script.
No data leaves Switzerland. No data is retained beyond the engagement.
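Pre-evaluation anonymization can be as simple as replacing direct identifiers with stable pseudonyms before any data reaches the engine. An illustrative sketch (the patterns and salt handling here are examples, not our actual script):

```python
import hashlib
import re

EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
AHV = re.compile(r"756\.\d{4}\.\d{4}\.\d{2}")  # Swiss social insurance number

def pseudonym(value: str, salt: str) -> str:
    """Stable pseudonym: the same input always maps to the same token."""
    return "ID_" + hashlib.sha256((salt + value).encode()).hexdigest()[:10]

def anonymize(text: str, salt: str) -> str:
    """Replace emails and AHV numbers with salted pseudonyms."""
    text = EMAIL.sub(lambda m: pseudonym(m.group(), salt), text)
    text = AHV.sub(lambda m: pseudonym(m.group(), salt), text)
    return text

sample = "Contact anna.muster@example.ch, AHV 756.1234.5678.97."
print(anonymize(sample, salt="engagement-2024"))
```

Stable pseudonyms preserve cross-record consistency (the same person maps to the same token), which keeps evaluation results meaningful while the identifiers themselves never leave your environment.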
Ready for an independent evaluation?
Contact us for a no-obligation initial consultation. In a 30-minute call we will clarify your evaluation needs.
Get in touch →