aiDM 2024
Seventh International Workshop on Exploiting Artificial Intelligence Techniques for Data Management (aiDM)
 

 
Co-located with ACM SIGMOD/PODS 2024
Friday, June 14, 2024

 
 
Workshop Overview

Recently, the field of Artificial Intelligence (AI) has been experiencing a resurgence. AI covers a wide swath of techniques, including logic-based approaches, probabilistic graphical models, and machine learning approaches such as deep learning. Advances in specialized hardware capabilities (e.g., Graphics Processing Units (GPUs), Tensor Processing Units (TPUs), and Field-Programmable Gate Arrays (FPGAs)), the software ecosystem (e.g., programming languages such as Python, Data Science frameworks, and accelerated ML libraries), and systems infrastructure (e.g., cloud servers with AI accelerators) have led to widespread adoption of AI techniques in a variety of domains. Examples of such domains include image classification, autonomous driving, automatic speech recognition, and conversational systems (e.g., chatbots). AI solutions not only support multiple data types (e.g., images, speech, or text), but are also available in various configurations and settings, from personal devices to large-scale distributed systems.

In spite of the wide-ranging techniques and applications of AI, their interactions with data management systems remain in their infancy. Database management systems have, for a long time, been used simply as repositories for feeding inputs and storing results. Only very recently have we started to see new efforts to use AI techniques in data management systems, e.g., enabling natural language interfaces to relational databases and applying machine learning techniques to query optimization. However, a lot more needs to be done to fully exploit the power of AI for data management systems and workloads.

aiDM'24 is a one-day workshop that will bring together people from academia and industry to discuss various ways of integrating AI techniques with data management systems. The primary goal of the workshop is to explore opportunities for AI techniques to enhance different components of data management systems, e.g., user interfaces, tooling, performance optimizations, new query types, and workloads. Special emphasis will be given to transparent exploitation of AI techniques (e.g., using Generative AI frameworks) for data management for enterprise-class workloads. We hope this workshop will identify important areas of research and spur new efforts in this emerging field.

Topics of Interest

The goal of the workshop is to take a holistic view of various AI technologies and investigate how they can be applied to different components of an end-to-end data management pipeline. Special emphasis will be given to how AI techniques can enhance the user experience by reducing complexity in tools, providing new insights, or providing better user interfaces. Topics of interest include, but are not restricted to:

  • AI-enabled improvements to foundational DB algorithms: sorting, searching, consensus
  • New AI-enabled business intelligence (BI) queries for relational databases
  • Integration of Large Language Models with databases and supporting services (e.g., Generative AI)
  • Natural language queries and conversational interfaces
  • AI-enabled database programming (e.g., natural language queries, SQL co-pilots, etc.)
  • Fairness of AI-based system components
  • Design and Implementation of Vector Databases for unstructured data
  • Ethics, governance, and societal implications of AI-enabled databases
  • Reasoning over knowledge bases
  • Self-tuning Databases using reinforcement learning
  • Impact of model interpretability
  • Supporting multiple datatypes (e.g., images or time-series data)
  • Supporting semi-structured, streaming, and graph databases
  • Impact of AI on tooling, e.g., ETL or data cleaning
  • Performance implications of AI-enabled queries
  • Case studies of AI-accelerated workloads
  • AI-enabled databases for managing and supporting AI workloads
  • AI strategies for data provenance, access control, anomaly detection, and cybersecurity


Workshop Schedule (8.30am - 5pm)


Session 1 (8.30-10am) (Chair: Milos Nikolic, University of Edinburgh)

  • Introductory Remarks
  • Keynote 1: Novel Applications for Large Language Models in Data Management, Immanuel Trummer, Assistant Professor, Department of Computer Science, Cornell University

    Immanuel Trummer is an assistant professor at Cornell University and a member of the Cornell Database Group. His papers were selected for “Best of VLDB”, “Best of SIGMOD”, for the ACM SIGMOD Research Highlight Award, and for publication in CACM as a CACM Research Highlight. His online lecture introducing students to database topics has collected over a million views. He received the NSF CAREER Award and multiple Google Faculty Research Awards.

    Abstract: The past years have been marked by several breakthrough results in the domain of generative AI, culminating in the rise of tools like ChatGPT, able to solve a variety of language-related tasks without specialized training. In this talk, I outline novel opportunities in the context of data management, enabled by these advances. I discuss several recent research projects at Cornell, aimed at exploiting advanced language processing for tasks such as parsing text about databases and data sets to support automated index selection, parameter tuning, or data profiling, mining data for patterns, described in natural language, or synthesizing scenario-specific code for data processing via specialized coding assistants. Finally, I discuss recent projects, aimed at scaling up the analysis of multimodal data via large language models to large repositories.


Coffee Break (10-10.30am PST)


Session 2 (10.30am - 12pm PST) (Chair: Milos Nikolic, University of Edinburgh)

  • Low Rank Approximation for Learned Query Optimization, Zixuan Yi, University of Pennsylvania, Yao Tian, The Hong Kong University of Science and Technology, Zachary Ives, University of Pennsylvania, and Ryan Marcus, University of Pennsylvania
  • Rethinking Table Retrieval from Data Lakes, Jan-Micha Bodensohn, DFKI and Technical University of Darmstadt, and Carsten Binnig, Technical University of Darmstadt and DFKI
  • Learning Bit Allocations for Z-Order Layouts in Analytic Data Systems, Jenny Gao, MIT, Jialin Ding, Amazon Web Services, Sivaprasad Sudhir, MIT, Samuel Madden, MIT


Lunch Break (12-1.30pm)


Session 3 (1.30-3.30pm) (Chair: Immanuel Trummer, Cornell)

  • Keynote 2: Solving Difficult Data Management Problems with ML and LLMs, Fatma Ozcan, Principal Engineer, Google

    Fatma Ozcan is a Principal Engineer at Systems Research@Google. Before that, she was a Distinguished Research Staff Member and a senior manager at IBM Almaden Research Center. Her current research focuses on ML for databases, and platforms and infrastructure for large-scale data analysis. Dr. Ozcan received her PhD degree in computer science from University of Maryland, College Park, and her BSc degree in computer engineering from METU, Ankara. She has over 23 years of experience in industrial research, and has delivered core technologies into IBM products. She has been a contributor to various SQL standards, including SQL/XML, SQL/JSON and SQL/PTF. Dr. Ozcan has co-authored several conference papers and patents. She received the VLDB Women in Database Research Award in 2022. She is an ACM Distinguished Member, and the vice chair of ACM SIGMOD.

    Abstract: In this talk, I will discuss using ML and LLMs for data management problems. In the first part, I will review natural language interfaces to data, and how LLMs are fueling new interest and solutions in this space. After reviewing some of the existing issues, we argue that we need semantic data models and better context for improving the accuracy of existing solutions. In the second part of the talk, we introduce pre-trained cardinality and cost models for databases. For learned database tasks, the state of the art is one-off models that need to be trained individually per task and even per database, which causes extremely high training overheads. In this talk, we argue that a new learning paradigm is needed that moves away from one-off models towards more generalizable models that can be used with only minimal overhead for an unseen database. I will present our early prototype and initial results, and conclude with a research roadmap with many open challenges.
  • Panel (2.30-3.30pm): Foundation Models for and in Databases (Chair: Angela Bonifati, University of Lyon and CNRS)

    • Moderator: Angela Bonifati, University of Lyon and CNRS
    • Angela Bonifati is a Distinguished Professor in Computer Science (Professeur Classe Exceptionnelle) at Lyon 1 University (France), affiliated with the CNRS Liris research lab. She is the Head of the Database group in the same lab. She is also an Adjunct Professor at the University of Waterloo (Canada) in the Data Systems Group (since 2020). She is a Senior Member of the IUF (Institut Universitaire de France), a high distinction that recognizes top scientists across all disciplines in France. She is an IEEE Senior Member and an ACM member.

    • Carsten Binnig, TU Darmstadt
    • Carsten Binnig is a Full Professor in the Computer Science department at TU Darmstadt and a Visiting Researcher at the Google Systems Research Group. Carsten received his Ph.D. at the University of Heidelberg in 2008. Afterwards, he spent time as a postdoctoral researcher in the Systems Group at ETH Zurich and at SAP working on in-memory databases. Currently, his research focus is on the design of modern cloud data systems with a focus on modern data center hardware as well as machine learning for enabling efficient and easy-to-use cloud data systems. His work has been awarded a Google Faculty Award, as well as multiple best paper and best demo awards.

    • Yiwen Zhu, Gray Systems Lab, Microsoft
    • Yiwen Zhu is a principal applied data scientist in the Gray Systems Lab (GSL) at Microsoft. Before that, she received her B.S. from Tsinghua University in 2012 and her Ph.D. in 2017, supervised by Prof. Haris Koutsopoulos (Northeastern University) and Prof. Nigel Wilson (MIT). Her research interests center on the vision of autonomous cloud systems, utilizing machine learning, statistical inference, and operations research techniques. Additionally, Yiwen leads the development of an internal large language model (LLM) application aimed at streamlining and enhancing workflows for software engineers.

    • Fatma Ozcan, Google
    • Fatma Ozcan is a Principal Engineer at Systems Research@Google. Before that, she was a Distinguished Research Staff Member and a senior manager at IBM Almaden Research Center. Her current research focuses on ML for databases, and platforms and infrastructure for large-scale data analysis. Dr. Ozcan received her PhD degree in computer science from University of Maryland, College Park, and her BSc degree in computer engineering from METU, Ankara. She has over 23 years of experience in industrial research, and has delivered core technologies into IBM products. She has been a contributor to various SQL standards, including SQL/XML, SQL/JSON and SQL/PTF. Dr. Ozcan has co-authored several conference papers and patents. She received the VLDB Women in Database Research Award in 2022. She is an ACM Distinguished Member, and the vice chair of ACM SIGMOD.

    • Immanuel Trummer, Cornell
    • Immanuel Trummer is an assistant professor at Cornell University and a member of the Cornell Database Group. His papers were selected for “Best of VLDB”, “Best of SIGMOD”, for the ACM SIGMOD Research Highlight Award, and for publication in CACM as a CACM Research Highlight. His online lecture introducing students to database topics has collected over a million views. He received the NSF CAREER Award and multiple Google Faculty Research Awards.

    • Umar Farooq Minhas, Apple
    • Umar Farooq Minhas is an engineering leader in Siri and Information Intelligence in AIML at Apple. Umar leads various efforts to provide ML-based semantic annotations (e.g., entity linking) exploiting the Apple Knowledge Graph to power various intelligent experiences in Siri Q&A, Spotlight, and Safari. Previously, Umar was a Principal Researcher working on ML for systems at Microsoft Research and a Research Staff Member working on big data at IBM Almaden Research.


Coffee Break (3.30-4pm)


Session 4 (4-5pm) (Chair: Danica Porobic, Oracle)

  • Mallet: SQL Dialect Translation with LLM Rule Generation, Amadou Ngom and Tim Kraska, MIT
  • Smart Science Needs Linked Open Data with a Dash of Large Language Models and Extended Relations, Hasan Jamil, University of Idaho


Organization

Workshop Steering Committee

Workshop Program Chairs

Program Committee

  • Madelon Hulsebos, UC Berkeley
  • Sonia-Florina Horchidan, KTH Royal Institute of Technology
  • Yuliang Li, Reality Lab Research, Meta
  • Yasuko Matsubara, Osaka University
  • Apoorva Nitsure, IBM Almaden Research Center
  • Amit Somech, Bar-Ilan University
  • Matthias Urban, TU Darmstadt
  • Jun Wan, Databricks
  • Wenlu Wang, Texas A&M University

Submission Instructions

Important Dates 

  • Paper Submission: Monday, 25th March 2024, 12 pm PST (EXTENDED)
  • Notification of Acceptance: Monday, 15th April, 2024
  • Camera-ready Submission: Monday, 6th May, 2024

Submission Site 

All submissions will be handled electronically via EasyChair.

Formatting Guidelines 

We will use the same document templates as the SIGMOD/PODS'24 conferences (the ACM format). Like SIGMOD/PODS'24, the aiDM submission should be double-blind.

It is the authors' responsibility to ensure that their submissions adhere strictly to the ACM format. In particular, authors may not modify the format in order to squeeze in more material. Submissions that do not comply with the formatting detailed here will be rejected without review.

The length of a full paper is limited to 12 pages, with unlimited pages of references. However, shorter papers (4 or 8 pages) are encouraged as well.

All accepted papers will be indexed via the ACM digital library and available for download from the workshop webpage in the digital library.