Hi, I’m Abdullah!

Sr. ML Engineer at Atlassian | Ex-Meta | PhD

RL and multi-agent AI for Confluence/Jira Search. On the Central AI team, I build and improve SMART Answer generation using RL fine-tuned LLMs and a multi-agent AI architecture. Previously at Meta on Ads Ranking, where I fine-tuned LLaMA 3 for large-scale suggestive ad generation.

I write about RecSys, agentic AI, RL fine-tuning, and what actually ships at scale.

The Biggest Gap in Agentic AI: Multi-Agent Evaluation

The missing layer between observability and benchmarks. Over the last year, Agentic AI has exploded. OpenAI has agents. Anthropic has agents. Google DeepMind has agents. Every startup suddenly has a multi-agent architecture diagram. And if you look closely, something interesting has happened. The industry solved observability. The industry largely solved benchmarks. Yet somehow, we still cannot answer a deceptively simple question: Was this agent actually good? Not “did it finish.” ...

The Missing Layer in Agentic AI: Why Evaluation Is the Next Enterprise Platform

Executive Summary: Agentic AI is entering enterprise deployment faster than its evaluation infrastructure is maturing. Most teams can now observe traces and benchmark outcomes, but they still cannot reliably grade how agents behave in production across coordination quality, trajectory correctness, and safety compliance. That missing layer is becoming a strategic bottleneck for executive teams deciding where to place platform bets, set governance controls, and scale high-autonomy workflows with confidence. As of June 2026, the market has largely solved two layers: observability (OpenTelemetry GenAI conventions, AgentOps, OWASP AOS) and benchmark comparison (HAL, GAIA, SWE-bench). The unresolved layer sits between them: an open, framework-agnostic evaluation protocol that takes any OTel-compatible trace and scores agent behavior end-to-end. Without this layer, enterprises can measure activity and final outcomes, but still miss the process-level failures that drive hidden risk, cost overruns, and policy violations in real deployments. That gap is not only a research problem; it is now a platform opportunity with direct implications for deployment risk, governance, and competitive advantage. ...

The Evaluation of RecSys, Part 3: The Deep Learning Era (NCF, Wide & Deep, DeepFM, DIN, DLRM, AdaTT)

Part 3 of the series: how DNNs transformed RecSys from 2016 onward. NCF, Wide & Deep, DeepFM, DIN, DLRM, and AdaTT. Architectures, intuition, and where each shines.

The Evaluation of RecSys, Part 2: Factorization Machines and XGBoost

Part 2 of the series: how Factorization Machines generalized MF to arbitrary features, how XGBoost handled non-linear ranking, and the limitations that pushed the field toward deep neural networks.

The Evaluation of RecSys, Part 1: From Content-Based Filtering to Matrix Factorization

Part 1 of a 4-part series tracing how RecSys evolved from content-based filtering through collaborative filtering to matrix factorization, and where each technique falls short, setting up the next breakthrough.