Welcome to the Disrupting AI: Expert Insights Series by AllAboutAI.com—a platform dedicated to unravelling the complexities and innovations shaping artificial intelligence today.
In this edition, we bring you an exclusive interview with Oguzhan (Ouz) Gencoglu, Head of AI at RootSignals. With over 15 years of experience in machine learning and entrepreneurship, Oguz sheds light on the evolving challenges and opportunities in building scalable and trustworthy large language model (LLM) applications.
From addressing hallucinations and biases to defining metrics that matter, this conversation is packed with actionable insights for AI enthusiasts and professionals alike.
Discover the critical themes from this engaging discussion below, or watch the full interview here.
The Ultimate Guide to Choosing and Evaluating LLMs
LLMs are transforming industries at an unprecedented pace, but their success depends on careful evaluation and thoughtful application.
Oguz’s expert advice offers a roadmap for businesses to unlock the full potential of LLMs, ensuring these tools deliver accurate, scalable, and trustworthy results.
This blog distils the key highlights from the conversation, offering an in-depth look at how businesses can refine their AI strategies, streamline evaluation processes, and build more reliable, scalable LLM-powered solutions.
Trustworthiness and Challenges in LLMs (5:38 – 7:04)
Trustworthiness is critical in deploying LLMs for real-world applications. According to Oguz, issues like hallucinations, where models generate inaccurate or fabricated information, and biases in responses are inherent to the nature of LLMs.
These challenges make it difficult to confidently expose such models to external users, as errors could lead to misinformation or breaches of trust. Despite advancements in reducing costs and improving latency, trust remains a primary obstacle.
Oguz believes that hallucinations are not simply bugs to be fixed but are features of the technology rooted in how neural networks operate.
The largest, biggest challenge of unlocking proper business value and going to production with LLMs is trustworthiness, meaning hallucinations, relevance, all that stuff…
We demand more complex things from AI, then it will hallucinate in those complex stuff. So, this is an inherent feature of this technology, deep neural networks.
Oguz’s perspective highlights the need to approach LLM trustworthiness as a design challenge rather than a simple technical fix. Businesses must acknowledge the inherent limitations of LLMs and build robust mechanisms to identify and manage these challenges.
Transparency and control mechanisms, like guardrails or real-time evaluations, can help mitigate risks and ensure outputs align with business goals.
Organizations should focus on application-specific evaluations and continually adapt to new challenges rather than expecting one-time fixes.
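To make the "guardrails or real-time evaluations" idea concrete, here is a minimal sketch of an output gate: score a model response with a set of evaluators and only release it when every score clears a threshold. The function names, evaluator shape, and threshold are hypothetical illustrations, not RootSignals' actual API:

```python
def guardrail(output, evaluators, threshold=0.7):
    """Run real-time evaluators on a model output before exposing it.

    evaluators: mapping of metric name -> scoring function returning 0.0-1.0
    Returns (passed, scores); below-threshold outputs can be retried or
    replaced with a safe fallback response.
    """
    scores = {name: fn(output) for name, fn in evaluators.items()}
    passed = all(score >= threshold for score in scores.values())
    return passed, scores
```

In production, the evaluator functions would call real scoring models rather than heuristics, and a failed check might trigger a retry, a fallback answer, or human review.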
Metrics for LLM Evaluation (7:13 – 9:36)
Evaluation metrics are the foundation of any successful LLM application. Oguz argues that traditional machine learning metrics, such as perplexity and likelihood, are not as relevant for business applications.
Instead, evaluation should focus on metrics closer to real-world use cases, such as safety, relevance, and faithfulness. These higher-level metrics must be tailored to the specific domain of the application, whether it’s healthcare, legal, or customer support.
Moreover, metrics must be established early in the development process, as they guide the choice of models, prompts, and configurations.
We think metrics that are closer to business, closer to real use cases, are more relevant. These are stuff like hallucinations… faithfulness, truthfulness, answer relevance… and other abstract metrics like safety for children, politeness, engagingness, helpfulness.
Oguz’s approach underscores the importance of grounding evaluation metrics in business outcomes and user needs. For AI practitioners, this means moving beyond purely technical measures and focusing on how LLM outputs align with real-world expectations.
Custom metrics tailored to organizational goals ensure that evaluations are meaningful and actionable. Businesses can benefit from integrating these metrics early, providing a roadmap for fine-tuning and optimizing their LLM solutions.
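As a deliberately crude illustration of such an application-level metric, the sketch below scores "faithfulness" by checking how many answer sentences have most of their content words grounded in the source context. Real evaluators are far more sophisticated; this only shows what a custom, business-facing metric might look like in code:

```python
def faithfulness_score(context: str, answer: str) -> float:
    """Fraction of answer sentences whose content words mostly appear
    in the source context (a toy word-overlap heuristic)."""
    context_words = set(context.lower().split())
    sentences = [s.strip() for s in answer.split(".") if s.strip()]
    if not sentences:
        return 0.0
    supported = 0
    for sentence in sentences:
        # Crude content-word filter: ignore short function words.
        words = [w for w in sentence.lower().split() if len(w) > 3]
        if words and sum(w in context_words for w in words) / len(words) >= 0.5:
            supported += 1
    return supported / len(sentences)
```

A score of 1.0 means every sentence is (loosely) grounded in the context; answers that introduce unsupported claims score lower, which is the behaviour a hallucination metric is meant to surface.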
LLM Evaluation and Optimization Process (3:09 – 5:12)
Evaluation should not be treated as an afterthought but as the starting point for building LLM-powered applications. Oguz believes that deciding on metrics upfront simplifies the development process, turning the choice of models and prompts into optimization problems.
Once the application is live, continuous evaluation is necessary to detect issues like hallucinations or irrelevant responses. RootSignals provides tools to facilitate this process by offering real-time evaluators and metrics to monitor performance during the development and production stages.
In our perspective at RootSignals, you decide first what you want from this application… Once you do that, because we have the metrics… the choice of model and prompt, it’s just an optimization problem.
Oguz emphasizes the value of a structured, evaluation-first approach, which cuts through much of the complexity of LLM development. By focusing on the metrics that matter, developers can streamline decision-making and avoid unnecessary iterations.
For businesses, this strategy reduces risks and accelerates the deployment timeline. It also ensures that applications meet user expectations from the outset, enhancing their chances of success in real-world scenarios.
Data Quality and Model Behavior (9:58 – 12:29)
Data quality significantly impacts the performance of LLMs, but even pristine datasets cannot entirely prevent hallucinations. Oguz explains that LLMs operate as autoregressive models, predicting the next token based on probabilities. This design inherently leads to variability in output and occasional hallucinations, regardless of how well the data is curated.
Prompt phrasing and training data diversity influence model behaviour, often making outputs unpredictable.
The reason is these models are called autoregressive models… Essentially, there’s no control when training these models… So this hallucination is a feature, unfortunately.
Oguz’s insights reveal that hallucinations are not solely a data problem but a consequence of how LLMs are architected. This understanding encourages businesses to focus on robust evaluation frameworks and post-processing techniques to manage hallucinations.
While data quality remains crucial, companies should also emphasize designing prompts and workflows that minimize the impact of inherent model limitations.
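A toy simulation makes the autoregressive point concrete: the model samples each next token from a probability distribution, so even a distribution learned from perfectly clean data leaves some mass on wrong continuations. The distribution below is invented purely for illustration:

```python
import random

# Invented next-token distribution: even a model trained on clean data
# assigns some probability to incorrect continuations.
NEXT_TOKEN = {
    "the capital of france is": {"paris": 0.90, "lyon": 0.08, "mars": 0.02},
}

def sample_continuation(prefix: str, rng: random.Random) -> str:
    """Sample one next token, as an autoregressive decoder would."""
    tokens, probs = zip(*NEXT_TOKEN[prefix].items())
    return rng.choices(tokens, weights=probs, k=1)[0]

rng = random.Random(0)
samples = [sample_continuation("the capital of france is", rng) for _ in range(1000)]
# Most samples are correct, but a small fraction are fluent, confident errors.
```

No amount of data curation drives those tail probabilities to exactly zero, which is why Oguz calls hallucination a feature of the architecture rather than a bug in the dataset.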
Building and Scaling Applications (34:50 – 36:39)
Scaling LLM-powered applications requires thoughtful strategies to optimize performance while managing costs. Oguz highlights techniques like model quantization to reduce latency and the use of dynamic routing for efficient request handling.
RootSignals provides agnostic support for various frameworks, enabling businesses to choose their preferred tools while benefiting from continuous evaluation and monitoring.
You can dynamically route your requests… If you hit rate limits in certain things, or hardware limits, you can use this other thing… Quantization is pretty helpful, so you don’t want to host a full precision 200 billion model.
Oguz’s experience demonstrates that scaling LLM applications is not just a matter of technical infrastructure but also of cost-effectiveness. Businesses should evaluate the trade-offs between performance and expense, leveraging dynamic routing and quantization to balance these factors.
These strategies help companies maintain reliable and efficient operations while minimizing unnecessary expenditures.
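The dynamic-routing idea Oguz describes can be sketched as a simple fallback chain: try backends in priority order and move to the next when one is rate-limited. The exception class and backend callables here are hypothetical placeholders for real API clients or locally hosted (possibly quantized) models:

```python
class RateLimitError(Exception):
    """Raised by a backend that is currently rate- or hardware-limited."""

def route(request, backends):
    """Send the request to the first available backend in priority order.

    backends: non-empty list of callables wrapping model endpoints.
    Falls through to the next backend on RateLimitError; re-raises the
    last error if every backend is saturated.
    """
    last_err = None
    for backend in backends:
        try:
            return backend(request)
        except RateLimitError as err:
            last_err = err  # this backend is saturated; try the next one
    raise last_err
```

Real routers also weigh cost and latency per backend, but even this minimal chain keeps an application responsive when a primary provider throttles traffic.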
Real-World Applications and Trends (25:10 – 27:52)
Successful deployment of LLMs requires a clear automation mindset. According to Oguz, businesses that embrace full end-to-end automation tend to achieve far greater success than those relying on partial human oversight.
Real-time monitoring and transparent evaluation processes further ensure reliability, enabling companies to trust their systems while achieving higher efficiency.
Most companies who fail to go to production… have the wrong mindset, which is, ‘We’ll still do all these tasks, but with LLMs a bit faster.’… If you have the mindset, ‘No, we will completely end-to-end automate these tasks,’ then they start to ask… ‘Okay, but how do we measure this?’
Oguz’s perspective highlights the importance of committing to automation as a strategic goal rather than a partial enhancement. Businesses must prioritize evaluation and guardrails during the automation process to build trust in their systems.
This proactive approach ensures smoother scaling and more meaningful productivity gains.
Future of LLM Evaluation (29:59 – 31:25)
Emerging paradigms like LLMs as judges promise to transform evaluation processes. Companies can reduce human dependency by training custom models to evaluate other LLMs while maintaining evaluation accuracy.
Open-source tools are complementary, offering cost-effective solutions for basic evaluations, though they often lack the domain-specific insights provided by tailored frameworks.
Large language models are, in fact, able to assess and evaluate other large language models… But if you can train your own custom models, LLMs as judges… it is very, very correlated with human judgment.
Oguz’s insights suggest that automated evaluation using LLMs as judges could streamline quality control processes while maintaining reliability. However, businesses must remain cautious, calibrating these evaluators to align with specific requirements.
Open-source tools can complement such efforts, but companies should tailor their approach to their unique needs.
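A minimal LLM-as-judge wrapper might look like the sketch below: build a rubric prompt, send it to a judge model, and parse a 1–5 score. `call_llm` is an assumed callable standing in for any real model client; production judges typically use richer rubrics and are calibrated against human ratings, as Oguz notes:

```python
def judge_answer(question: str, answer: str, call_llm) -> int:
    """Ask a judge model to rate answer relevance on a 1-5 scale.

    call_llm: any callable taking a prompt string and returning the
    model's text reply (hypothetical stand-in for a real API client).
    """
    prompt = (
        "Rate the following answer for relevance to the question on a "
        "scale of 1 (irrelevant) to 5 (fully relevant). "
        "Reply with a single digit.\n"
        f"Question: {question}\nAnswer: {answer}\nRating:"
    )
    reply = call_llm(prompt).strip()
    return max(1, min(5, int(reply[0])))  # clamp to the valid range
```

Swapping `call_llm` between a general-purpose model and a custom-trained judge is exactly the choice Oguz describes: the custom judge costs more to build but correlates better with human judgment.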
Semantic SEO and Contextual Relevance (3:09 – 5:12)
Oguz highlights the role of context and intent in building effective LLM applications. RootSignals incorporates concepts like semantic SEO to ensure outputs align with user needs.
This involves balancing macro and micro contexts, structuring content with user intent in mind, and maintaining logical connections between related topics.
If you don’t know your metrics… building something doesn’t go to production… You need to decide first what you want from this application… Then it clicks: ‘Oh, we need to evaluate.
Oguz’s emphasis on contextual relevance underlines the importance of semantic understanding in LLM evaluations. Businesses can apply these principles to improve the quality of their AI applications, focusing on user intent and logical structuring to deliver meaningful results.
This approach follows content-creation and SEO best practices, ensuring better alignment with business goals.
Conclusion
LLMs thrive on trust, transparency, and thoughtful evaluation. Oguz emphasizes starting with meaningful metrics and addressing challenges like hallucinations and biases to build reliable applications.
By applying these strategies, businesses can unlock the true potential of LLMs, creating scalable and impactful solutions that drive real-world success. For more insights, explore the Disrupting AI: Expert Insights series.