SearchrAInk · Technical Paper · IEEE Format
A Multi-Model Algorithm for AI Visibility Measurement
Design, Analysis, and Convergence Properties of the SearchrAInk Framework
Hasse Muller
Independent Researcher · The Netherlands
Abstract
The emergence of large language models (LLMs) has transformed information retrieval from ranked search results to synthesised, conversational outputs. This shift introduces challenges in measuring brand visibility, as exposure is implicit and context-dependent. This paper presents a formalised algorithmic framework for SearchrAInk, a system that quantifies brand visibility across AI-generated responses. The approach integrates prompt orchestration, multi-model querying, semantic feature extraction, and weighted aggregation into a unified scoring pipeline. We introduce a five-dimensional evaluation model and provide formal proofs of boundedness, unbiasedness, consistency, and convergence. The framework enables reproducible benchmarking in AI-mediated discovery environments.
Index Terms
Generative AI, LLM Evaluation, Information Retrieval, Generative Engine Optimisation (GEO), Multi-Model Systems, Statistical Consistency, Hoeffding Bounds.
I. Introduction
Large language models (LLMs) have shifted information retrieval from explicit ranking to implicit synthesis. Unlike traditional search engines, which return lists of ranked documents, LLMs generate narrative outputs in which brand exposure is mediated by context, prompt framing, and latent model preferences. As a consequence, visibility becomes difficult to quantify with legacy rank-based metrics.
This paper formalises the SearchrAInk framework, which evaluates brand visibility across multiple LLMs and aggregates results into a unified, bounded metric with provable statistical guarantees.
II. System Model
Let:
- \(Q = \{q_1, \dots, q_n\}\) denote prompts drawn from a distribution \(\mathcal{D}_Q\);
- \(M = \{m_1, \dots, m_K\}\) denote the set of LLMs queried;
- \(R_{i,j} = m_j(q_i)\) denote the response of model \(m_j\) to prompt \(q_i\).
For each response we extract a feature vector
\[ f_{i,j} = \phi(R_{i,j}) \in \mathbb{R}^{d}, \]
and define dimension scores
\[ X_{i,j}^{(k)} = g_k(f_{i,j}) \in [0,1], \qquad k = 1, \dots, 5, \]
one for each axis of the five-dimensional evaluation model. The empirical per-dimension score aggregates over prompts and models:
\[ \hat{S}_k = \frac{1}{nK} \sum_{i=1}^{n} \sum_{j=1}^{K} X_{i,j}^{(k)}, \]
and the overall visibility score is
\[ \hat{V} = \sum_{k=1}^{5} w_k \hat{S}_k, \qquad w_k \ge 0, \quad \sum_{k=1}^{5} w_k = 1. \]
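As a purely illustrative instance (the numbers here are hypothetical, not measured), take equal weights \(w_k = 0.2\) and empirical dimension scores \((\hat{S}_1, \dots, \hat{S}_5) = (0.8, 0.6, 0.7, 0.5, 0.9)\); then
\[ \hat{V} = 0.2\,(0.8 + 0.6 + 0.7 + 0.5 + 0.9) = 0.2 \times 3.5 = 0.70, \]
which lies in \([0,1]\), as Theorem 1 below guarantees.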
III. Algorithm
- Initialise \(R \leftarrow \emptyset\)
- for each \(q_i \in Q\) do
- for each \(m_j \in M\) do
- \(r \leftarrow m_j(q_i)\)
- \(f \leftarrow \mathrm{ExtractFeatures}(r)\)
- append \((i,j,f)\) to \(R\)
- end for
- end for
- for each dimension \(k\) do
- \(\hat{S}_k \leftarrow \tfrac{1}{|R|} \sum_{(i,j,f) \in R} g_k(f)\)
- end for
- \(\hat{V} \leftarrow \sum_k w_k \hat{S}_k\)
- return \(\hat{V}\)
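As a minimal sketch, the pipeline above can be implemented as follows. Here `models`, `extract_features`, and the dimension scorers `g` are hypothetical stand-ins (the paper does not specify their implementations), and each scorer is assumed to return a value in \([0,1]\):

```python
from typing import Callable, Sequence

def searchraink_score(
    prompts: Sequence[str],
    models: Sequence[Callable[[str], str]],    # m_j: prompt -> response text
    extract_features: Callable[[str], dict],   # hypothetical semantic extractor
    g: Sequence[Callable[[dict], float]],      # dimension scorers g_k, values in [0, 1]
    weights: Sequence[float],                  # w_k >= 0 with sum(w_k) == 1
) -> float:
    """Visibility estimate V_hat = sum_k w_k * S_hat_k over all prompt-model pairs."""
    assert all(w >= 0 for w in weights) and abs(sum(weights) - 1.0) < 1e-9
    # Query every model with every prompt; keep one feature vector per response.
    features = [extract_features(m(q)) for q in prompts for m in models]
    # Empirical per-dimension scores S_hat_k, averaged over all n*K responses.
    s_hat = [sum(g_k(f) for f in features) / len(features) for g_k in g]
    # Weighted aggregation into the overall score V_hat.
    return sum(w * s for w, s in zip(weights, s_hat))
```

Because each \(g_k\) is bounded in \([0,1]\) and the weights form a convex combination, the returned value inherits the \([0,1]\) bound of Theorem 1.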
IV. Mathematical Analysis
A. Assumptions
- Scores are bounded: \(X_{i,j}^{(k)} \in [0,1]\).
- Samples across prompts and models are independent and identically distributed (i.i.d.).
- All dimension scores have finite expectation and variance.
Theorem 1 — Boundedness.
The final score satisfies \(0 \le \hat{V} \le 1\).
Proof. Since \(\hat{S}_k \in [0,1]\) and \(\sum_k w_k = 1\) with \(w_k \ge 0\), the weighted sum remains in \([0,1]\). ◻
Theorem 2 — Unbiasedness.
\(\mathbb{E}[\hat{S}_k] = S_k^{\star}\).
Proof. Writing \(S_k^{\star} = \mathbb{E}\big[X_{i,j}^{(k)}\big]\) for the common mean, linearity of expectation gives \(\mathbb{E}[\hat{S}_k] = \frac{1}{nK} \sum_{i,j} \mathbb{E}\big[X_{i,j}^{(k)}\big] = S_k^{\star}\). ◻
Theorem 3 — Strong Consistency.
\(\hat{S}_k \xrightarrow{\text{a.s.}} S_k^{\star}\).
Proof. A direct application of the Strong Law of Large Numbers to the i.i.d. samples \(X_{i,j}^{(k)}\). ◻
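As a quick numerical illustration of this convergence (the Beta(2, 3) distribution below is a synthetic stand-in for a real dimension score, not part of the paper's pipeline):

```python
import random

# Synthetic check of Theorem 3: for i.i.d. scores in [0, 1], the empirical mean
# S_hat approaches the population mean S_star as the sample count nK grows.
# Beta(2, 3) draws have population mean 2 / (2 + 3) = 0.4.
random.seed(0)
for nk in (100, 10_000, 1_000_000):
    s_hat = sum(random.betavariate(2, 3) for _ in range(nk)) / nk
    print(f"nK = {nk:>9,}: S_hat = {s_hat:.4f}   (S_star = 0.4000)")
```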
Theorem 4 — Convergence of the Final Score.
The aggregate score converges almost surely to its population counterpart:
\[ \hat{V} \xrightarrow{\text{a.s.}} V^{\star} = \sum_{k=1}^{5} w_k S_k^{\star}. \]
Proof. Immediate from Theorem 3, since \(\hat{V}\) is a finite weighted sum of the \(\hat{S}_k\). ◻
Theorem 5 — Hoeffding Concentration.
For every \(\epsilon > 0\),
\[ \Pr\!\left( \left| \hat{S}_k - S_k^{\star} \right| \ge \epsilon \right) \le 2 \exp\!\left( -2 n K \epsilon^{2} \right). \]
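Setting the right-hand side equal to a failure probability \(\delta\) yields a sample-size rule of thumb (a direct rearrangement of the bound, not stated explicitly in the paper):
\[ nK \ge \frac{\ln(2/\delta)}{2\epsilon^{2}}. \]
For example, guaranteeing \(|\hat{S}_k - S_k^{\star}| < 0.05\) with probability at least \(0.95\) requires \(nK \ge \ln(40)/0.005 \approx 738\) prompt-model samples.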
Theorem 6 — Aggregate Error Bound.
If \(|\hat{S}_k - S_k^{\star}| \le \epsilon_k\) for each \(k\), then
\[ \left| \hat{V} - V^{\star} \right| \le \sum_{k=1}^{5} w_k \epsilon_k. \]
Proof. By the triangle inequality, \(|\hat{V} - V^{\star}| \le \sum_k w_k \, |\hat{S}_k - S_k^{\star}| \le \sum_k w_k \epsilon_k\). ◻
Theorem 7 — Asymptotic Normality.
As \(nK \to \infty\),
\[ \sqrt{nK} \left( \hat{S}_k - S_k^{\star} \right) \xrightarrow{d} \mathcal{N}\!\left( 0, \sigma_k^{2} \right), \qquad \sigma_k^{2} = \operatorname{Var}\!\left( X_{i,j}^{(k)} \right). \]
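Theorem 7 makes approximate interval estimates possible; a standard large-sample \((1-\alpha)\) confidence interval (a textbook consequence, not derived in the paper) is
\[ \hat{S}_k \pm z_{1-\alpha/2}\, \frac{\hat{\sigma}_k}{\sqrt{nK}}, \]
where \(\hat{\sigma}_k\) is the empirical standard deviation of the per-response scores and \(z_{1-\alpha/2}\) the standard normal quantile (e.g. \(z_{0.975} \approx 1.96\)).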
V. Discussion
The framework exhibits three properties:
- Robustness. Multi-model aggregation smooths per-model idiosyncrasies.
- Semantic awareness. Feature extraction captures nuance beyond token matching.
- Statistical consistency. The estimator converges and concentrates exponentially in \(nK\).
Limitations include:
- Model drift. Underlying LLMs evolve; benchmarks must be re-run periodically.
- Prompt bias. The prompt distribution \(\mathcal{D}_Q\) shapes outcomes and must be curated carefully.
- Non-stationarity. LLM responses are not strictly i.i.d. over time, so the guarantees of Theorems 3–7 hold only within piecewise-stationary regimes.
VI. Conclusion
We presented a formal algorithm for measuring AI visibility across multiple LLMs. The resulting estimator is unbiased and strongly consistent, admits sub-Gaussian concentration, and satisfies a weighted-error decomposition that is convenient for interpretability. Together, these properties make SearchrAInk suitable for reproducible benchmarking of brand presence in generative AI systems.
Acknowledgment
The author thanks the broader AI research community for ongoing work on LLM evaluation and statistical NLP.