
Statistics Seminar

Date:
-
Location:
MDS 220
Speaker(s) / Presenter(s):
Dr. Sean D'Rosario

Title: Optimizing Sampling and Diversity for Differentiable Search and Data Labeling Using Large Language Models for eCommerce

Abstract: While Deep Learning and Generative AI have transformed information retrieval, their reliability and efficiency ultimately depend on foundational statistical principles regarding distribution, diversity and sampling. This talk synthesizes research across search indexing and automated data labeling to demonstrate how classical statistical methodologies can solve critical bottlenecks in modern neural architectures. 

By moving beyond the "black box" interpretation of models, we show that enforcing statistical constraints, such as maximizing marginal relevance or optimizing sampling distributions, can significantly enhance system performance compared to standard deep learning baselines. We first examine Differentiable Search Indexing (DSI), showing that modifying training objectives with a Maximal Marginal Relevance (MMR)-inspired diversity component forces the model to learn a more representative distribution of information, balancing relevance with diversity.

We also address data scarcity in Large Language Models (LLMs) through Active Learning, demonstrating that treating LLMs as probabilistic engines that require rigorous uncertainty and diversity sampling strategies drastically reduces annotation costs while maintaining high accuracy. These applications illustrate how concepts like loss function optimization and experimental design remain central to advancing state-of-the-art AI systems. The talk also shows how theoretical statistical ideas translate into real-world industry applications and offers insight into pathways toward industry careers.