aiDM 2024

Recently, the field of Artificial Intelligence (AI) has been experiencing a resurgence. AI broadly covers a wide swath of techniques, including logic-based approaches, probabilistic graphical models, and machine learning approaches such as deep learning. Advances in specialized hardware capabilities (e.g., Graphics Processing Units (GPUs), Tensor Processing Units (TPUs), Field-Programmable Gate Arrays (FPGAs), etc.), software ecosystems (e.g., programming languages such as Python, Data Science frameworks, and accelerated ML libraries), and systems infrastructure (e.g., cloud servers with AI accelerators) have led to widespread adoption of AI techniques in a variety of domains. Examples of such domains include image classification, autonomous driving, automatic speech recognition, and conversational systems (e.g., chatbots). AI solutions not only support multiple data types (e.g., images, speech, or text), but are also available in various configurations and settings, from personal devices to large-scale distributed systems.
In spite of the wide-ranging techniques and applications of AI, their interactions with data management systems remain in their infancy. Database management systems have, for a long time, been used simply as repositories for feeding inputs and storing results. Only very recently have we started seeing new efforts to use AI techniques in data management systems, e.g., enabling natural language interfaces to relational databases and applying machine learning techniques to query optimization. However, a lot more needs to be done to fully exploit the power of AI for data management systems and workloads.
aiDM'24 is a one-day workshop that will bring together people from academia and industry to discuss various ways of integrating AI techniques with data management systems. The primary goal of the workshop is to explore opportunities for AI techniques to enhance different components of data management systems, e.g., user interfaces, tooling, performance optimizations, new query types, and workloads. Special emphasis will be given to transparent exploitation of AI techniques (e.g., using Generative AI frameworks) for data management for enterprise-class workloads. We hope this workshop will identify important areas of research and spur new efforts in this emerging field.

Topics of Interest

The goal of the workshop is to take a holistic view of various AI technologies and investigate how they can be applied to different components of an end-to-end data management pipeline. Special emphasis will be given to how AI techniques can be used to enhance user experience by reducing complexity in tools, providing newer insights, or providing better user interfaces. Topics of interest include, but are not restricted to:

AI-enabled improvements to foundational DB algorithms: sorting, searching, consensus

New AI-enabled business intelligence (BI) queries for relational databases

Integration of Large Language Models with databases and supporting services (e.g., Generative AI)

Natural language queries and conversational interfaces

AI-enabled database programming (e.g., natural language queries, SQL co-pilots, etc.)

Fairness of AI-based system components

Design and Implementation of Vector Databases for unstructured data

Ethics, governance, and societal implications of AI-enabled databases

Reasoning over knowledge bases

Self-tuning Databases using reinforcement learning

Impact of model interpretability

Supporting multiple datatypes (e.g., images or time-series data)

Supporting semi-structured, streaming, and graph databases

Impact of AI on tooling, e.g., ETL or data cleaning

Performance implications of AI-enabled queries

Case studies of AI-accelerated workloads

AI-enabled databases for managing and supporting AI workloads

AI strategies for data provenance, access control, anomaly detection, and cybersecurity

Workshop Schedule (8.30am - 5pm)

Session 1 (8.30-10am) (Chair: Milos Nikolic, University of Edinburgh)

Introductory Remarks

Keynote 1: Novel Applications for Large Language Models in Data Management, Immanuel Trummer, Assistant Professor, Department of Computer Science, Cornell University

Immanuel Trummer is an assistant professor at Cornell University and a member of the Cornell Database Group. His papers were selected for “Best of VLDB”, “Best of SIGMOD”, for the ACM SIGMOD Research Highlight Award, and for publication in CACM as CACM Research Highlight. His online lecture introducing students to database topics has collected over a million views. He received the NSF CAREER Award and multiple Google Faculty Research Awards.
Abstract: The past years have been marked by several breakthrough results in the domain of generative AI, culminating in the rise of tools like ChatGPT, able to solve a variety of language-related tasks without specialized training. In this talk, I outline novel opportunities in the context of data management, enabled by these advances. I discuss several recent research projects at Cornell aimed at exploiting advanced language processing for tasks such as parsing text about databases and data sets to support automated index selection, parameter tuning, or data profiling; mining data for patterns described in natural language; or synthesizing scenario-specific code for data processing via specialized coding assistants. Finally, I discuss recent projects aimed at scaling up the analysis of multimodal data via large language models to large repositories.

Coffee Break (10-10.30am PST)

Session 2 (10.30am - 12pm PST) (Chair: Milos Nikolic, University of Edinburgh)

Low Rank Approximation for Learned Query Optimization, Zixuan Yi, University of Pennsylvania, Yao Tian, The Hong Kong University of Science and Technology, Zachary Ives, University of Pennsylvania, and Ryan Marcus, University of Pennsylvania

Rethinking Table Retrieval from Data Lakes, Jan-Micha Bodensohn, DFKI and Technical University of Darmstadt, and Carsten Binnig, Technical University of Darmstadt and DFKI

Learning Bit Allocations for Z-Order Layouts in Analytic Data Systems, Jenny Gao, MIT, Jialin Ding, Amazon Web Services, Sivaprasad Sudhir, MIT, Samuel Madden, MIT

Lunch Break (12-1.30pm)

Session 3 (1.30-3.30pm) (Chair: Immanuel Trummer, Cornell)

Keynote 2: Solving Difficult Data Management Problems with ML and LLMs, Fatma Ozcan, Principal Engineer, Google

Fatma Ozcan is a Principal Engineer at Systems Research@Google. Before that, she was a Distinguished Research Staff Member and a senior manager at IBM Almaden Research Center. Her current research focuses on ML for databases, and platforms and infrastructure for large-scale data analysis. Dr. Ozcan received her PhD degree in computer science from University of Maryland, College Park, and her BSc degree in computer engineering from METU, Ankara. She has over 23 years of experience in industrial research, and has delivered core technologies into IBM products. She has been a contributor to various SQL standards, including SQL/XML, SQL/JSON and SQL/PTF. Dr. Ozcan co-authored several conference papers and patents. She received the VLDB Women in Database Research Award in 2022. She is an ACM Distinguished Member, and the vice chair of ACM SIGMOD.
Abstract: In this talk, I will discuss using ML and LLMs for data management problems. In the first part, I will review natural language interfaces to data, and how LLMs are fueling new interest and solutions in this space. After reviewing some of the existing issues, we argue that we need semantic data models and better context for improving the accuracy of existing solutions. In the second part of the talk, we introduce pre-trained cardinality and cost models for databases. For learned database tasks, the state of the art is one-off models that need to be trained individually per task and even per database, which causes extremely high training overheads. In this talk, we argue that a new learning paradigm is needed that moves away from one-off models towards more generalizable models that can be used with only minimal overhead for an unseen database. I will present our early prototype and initial results, and conclude with a research roadmap with many open challenges.

Panel (2.30-3.30pm): Foundation Models for and in Databases (Chair: Angela Bonifati, University of Lyon and CNRS)
- Moderator: Angela Bonifati, University of Lyon and CNRS
- Carsten Binnig, TU Darmstadt
- Yiwen Zhu, Gray Systems Lab, Microsoft
- Fatma Ozcan, Google
- Immanuel Trummer, Cornell
- Umar Farooq Minhas, Apple

Coffee Break (3.30-4pm)

Session 4 (4-5pm) (Chair: Danica Porobic, Oracle)

Mallet: SQL Dialect Translation with LLM Rule Generation, Amadou Ngom and Tim Kraska, MIT

Smart Science Needs Linked Open Data with a Dash of Large Language Models and Extended Relations, Hasan Jamil, University of Idaho

Organization

Workshop Steering Committee

Rajesh Bordawekar, IBM T.J. Watson Research Center

Oded Shmueli, Hirundo Ltd., and Emeritus Professor at Technion - Israel Institute of Technology

Workshop Program Chairs

Yael Amsterdamer, Department of Computer Science, Bar-Ilan University

Renata Borovica-Gajic, School of Computing and Information Systems, The University of Melbourne

Donatella Firmani, Department of Statistical Sciences, Sapienza University of Rome

Program Committee

Madelon Hulsebos, UC Berkeley

Sonia-Florina Horchidan, KTH Royal Institute of Technology

Yuliang Li, Reality Lab Research, Meta

Yasuko Matsubara, Osaka University

Apoorva Nitsure, IBM Almaden Research Center

Amit Somech, Bar-Ilan University

Matthias Urban, TU Darmstadt

Jun Wan, Databricks

Wenlu Wang, Texas A&M University

Submission Instructions

Important Dates

Paper Submission: Monday, 25th March 2024, 12 pm PST (EXTENDED)

Notification of Acceptance: Monday, 15th April, 2024

Camera-ready Submission: Monday, 6th May, 2024

Submission Site

All submissions will be handled electronically via EasyChair.

Formatting Guidelines

We will use the same document templates as the SIGMOD/PODS'24 conferences (the ACM format). As with SIGMOD/PODS'24, aiDM submissions should be double-blind.
It is the authors' responsibility to ensure that their submissions adhere strictly to the ACM format. In particular, it is not allowed to modify the format with the objective of squeezing in more material. Submissions that do not comply with the formatting detailed here will be rejected without review.

The length of a full paper is limited to 12 pages, with unlimited pages of references. However, shorter papers (4 or 8 pages) are encouraged as well.

All accepted papers will be indexed via the ACM digital library and available for download from the workshop webpage in the digital library.