Recently, the field of Artificial Intelligence
(AI) has been experiencing a
resurgence. AI broadly covers
a wide swath of techniques,
which include logic-based
approaches, probabilistic
graphical models, and machine
learning approaches such as
deep learning. Advances in
specialized hardware
capabilities (e.g., Graphics
Processing Units (GPUs),
Tensor Processing Units
(TPUs), Field-Programmable
Gate Arrays (FPGAs), etc.),
software ecosystem (e.g.,
programming languages such as
Python, Data Science frameworks, and
accelerated ML libraries), and
systems infrastructure (e.g.,
cloud servers with AI
accelerators) have led to
widespread adoption of AI
techniques in a variety of
domains. Examples of such
domains include image
classification, autonomous
driving, automatic speech
recognition, and
conversational systems (e.g.,
chatbots). AI solutions not
only support multiple data
types (e.g., images, speech,
or text), but also are
available in various
configurations and settings,
from personal devices to
large-scale distributed
systems.
In spite
of the wide-ranging techniques
and applications of AI, their
interactions with data
management systems remain in
their infancy. Database management
systems have been, for a long
time, simply used as
repositories for feeding
inputs and storing
results. Only very recently
have we started seeing some
new efforts in using AI
techniques in data management
systems, e.g., enabling
natural language interfaces to
relational databases and
applying machine learning
techniques for query
optimization. However, a lot
more needs to be done to fully
exploit the power of AI for
data management systems and
workloads.
aiDM'24 is a one-day workshop that will bring
together people from academia and industry
to discuss various ways of integrating AI techniques with data management systems.
The primary goal of the proposed workshop is to explore opportunities for AI techniques
for enhancing different components of data management systems,
e.g., user interfaces, tooling, performance optimizations, new query types, and workloads.
Special emphasis will be given to transparent exploitation of AI techniques (e.g., using
Generative AI frameworks) for data management for enterprise-class workloads.
We hope this workshop will identify important areas of research and spur new efforts in this emerging field.
The goal of the workshop is to take a holistic view of various AI technologies and
investigate how they can be applied to different components of an end-to-end data management
pipeline. Special emphasis will be given to how AI techniques could be used for enhancing
user experience by reducing complexity in tools, or providing newer insights, or providing
better user interfaces. Topics of interest include, but are not restricted to:
- AI-enabled improvements to foundational DB algorithms: sorting, searching, consensus
- New AI-enabled business intelligence (BI) queries for relational databases
- Integration of Large Language Models with databases and supporting services (e.g., Generative AI)
- Natural language queries and conversational interfaces
- AI-enabled database programming (e.g., natural language queries, SQL co-pilots, etc.)
- Fairness of AI-based system components
- Design and Implementation of Vector Databases for unstructured data
- Ethics, governance, and societal implications of AI-enabled databases
- Reasoning over knowledge bases
- Self-tuning Databases using reinforcement learning
- Impact of model interpretability
- Supporting multiple datatypes (e.g., images or time-series data)
- Supporting semi-structured, streaming, and graph databases
- Impact of AI on tooling, e.g., ETL or data cleaning
- Performance implications of AI-enabled queries
- Case studies of AI-accelerated workloads
- AI-enabled databases for managing and supporting AI workloads
- AI strategies for data provenance, access control, anomaly detection and cyber security
Session 1 (8.30-10am) (Chair: Milos Nikolic, University of Edinburgh)
- Introductory Remarks
- Keynote 1: Novel Applications for
Large Language Models in Data
Management, Immanuel
Trummer, Assistant Professor, Department of Computer Science,
Cornell University
Immanuel
Trummer is an assistant professor at Cornell University and a
member of the Cornell Database Group. His papers were selected for
“Best of VLDB”, “Best of SIGMOD”, for the ACM SIGMOD Research
Highlight Award, and for publication in CACM as CACM Research
Highlight. His online lecture introducing students to database
topics collected over a million views. He received the NSF CAREER
Award and multiple Google Faculty Research Awards.
Abstract: The past years have been
marked by several breakthrough results in the domain of generative
AI, culminating in the rise of tools like ChatGPT, able to solve a
variety of language-related tasks without specialized training. In
this talk, I outline novel opportunities in the context of data
management, enabled by these advances. I discuss several recent
research projects at Cornell, aimed at exploiting advanced language
processing for tasks such as parsing text about databases and data
sets to support automated index selection, parameter tuning, or data
profiling, mining data for patterns described in natural language,
or synthesizing scenario-specific code for data processing via
specialized coding assistants. Finally, I discuss recent projects
aimed at scaling up the analysis of multimodal data via large
language models to large repositories.
Coffee Break
(10-10.30am PST)
Session 2 (10.30am - 12pm PST) (Chair: Milos Nikolic, University of Edinburgh)
-
Low Rank Approximation for
Learned Query
Optimization, Zixuan
Yi, University of
Pennsylvania, Yao Tian, The
Hong Kong University of
Science and Technology,
Zachary Ives, University of
Pennsylvania, and Ryan Marcus,
University of Pennsylvania
-
Rethinking Table
Retrieval from Data
Lakes, Jan-Micha
Bodensohn, DFKI and Technical
University of Darmstadt, and
Carsten Binnig, Technical
University of Darmstadt and DFKI
-
Learning Bit
Allocations for Z-Order
Layouts in Analytic Data
Systems, Jenny Gao,
MIT, Jialin Ding, Amazon Web
Services, Sivaprasad Sudhir,
MIT, Samuel Madden, MIT
Lunch Break (12-1.30pm)
Session 3 (1.30-3.30pm) (Chair: Immanuel Trummer, Cornell)
- Keynote 2: Solving Difficult Data
Management Problems with ML and
LLMs, Fatma Ozcan,
Principal Engineer, Google
Fatma Ozcan is a Principal Engineer at Systems Research@Google. Before
that, she was a Distinguished Research Staff Member and a senior
manager at IBM Almaden Research Center. Her current research focuses
on ML for databases, and platforms and infrastructure for
large-scale data analysis. Dr. Ozcan got her PhD degree in computer
science from University of Maryland, College Park, and her BSc degree
in computer engineering from METU, Ankara. She has over 23 years of
experience in industrial research, and has delivered core technologies
into IBM products. She has been a contributor to various SQL
standards, including SQL/XML, SQL/JSON and SQL/PTF. Dr. Ozcan
co-authored several conference papers and patents. She received the
VLDB Women in Database Research Award in 2022. She is an ACM
Distinguished Member, and the vice chair of ACM SIGMOD.
Abstract: In this talk, I will discuss using ML and LLMs
for data management problems. In the first part, I will review
natural language interfaces to data, and how LLMs are fueling new
interest and solutions in this space. After reviewing some of the
existing issues, we argue that we need semantic data models and
better context for improving the accuracy of existing solutions.
In the second part of the talk, we introduce pre-trained
cardinality and cost models for databases. For learned database
tasks, the state-of-the-art is one-off models that need to be
trained individually per task and even per database, which causes
extremely high training overheads. In this talk, we argue that a
new learning paradigm is needed that moves away from one-off models
towards more generalizable models that can be used with only
minimal overhead for an unseen database. I will present our
early prototype and initial results, and conclude with a research
roadmap with many open challenges.
-
Panel (2.30-3.30pm): Foundation Models for and in Databases (Chair: Angela Bonifati, University of Lyon and CNRS)
- Moderator:
Angela Bonifati,
University of Lyon and
CNRS
Angela Bonifati is a
Distinguished Professor
in Computer Science
(Professeur Classe
Exceptionnelle) at Lyon 1
University (France),
affiliated with the CNRS
Liris research lab. She
is the Head of the Database
group in the same
lab. She is
also an Adjunct
Professor at the
University of Waterloo
(Canada) in the Data
Systems Group (since
2020). She is a Senior
Member of the IUF
(Institut Universitaire
de France), a high
distinction that
recognizes top scientists
across all disciplines in
France. She is an IEEE
Senior Member and an ACM
member.
- Carsten Binnig, TU
Darmstadt
Carsten Binnig is a Full
Professor in the Computer
Science department at TU
Darmstadt and a Visiting
Researcher at the Google
Systems Research
Group. Carsten received
his Ph.D. at the
University of Heidelberg
in 2008. Afterwards, he
spent time as a
postdoctoral researcher in
the Systems Group at ETH
Zurich and at SAP working
on in-memory
databases. Currently, his
research focus is on the
design of modern cloud
data systems with a focus
on modern data center
hardware as well as
machine learning for
enabling efficient and
easy-to-use cloud data
systems. His work has been
awarded a Google Faculty
Award, as well as multiple
best paper and best demo
awards.
- Yiwen
Zhu, Gray Systems Lab,
Microsoft
Yiwen Zhu is a principal
applied data scientist
in the Gray Systems Lab
(GSL) at
Microsoft. Before that,
she received her B.S. at
Tsinghua University in
2012 and Ph.D. in 2017
supervised by
Prof. Haris Koutsopoulos
(Northeastern
University) and
Prof. Nigel Wilson
(MIT). Her research
interests center on the
vision of autonomous
cloud systems, utilizing
machine learning,
statistical inference,
and operations research
techniques. Additionally,
Yiwen leads the
development of an
internal large language
model (LLM) application
aimed at streamlining
and enhancing workflows
for software engineers.
- Fatma
Ozcan, Google
Fatma Ozcan is a
Principal Engineer at
Systems
Research@Google. Before
that, she was a
Distinguished Research
Staff Member and a
senior manager at IBM
Almaden Research
Center. Her current
research focuses on ML
for databases, and
platforms and
infrastructure for
large-scale data
analysis. Dr. Ozcan got
her PhD degree in
computer science from
University of Maryland,
College Park, and her
BSc degree in computer
engineering from METU,
Ankara. She has over 23
years of experience in
industrial research, and
has delivered core
technologies into IBM
products. She has been a
contributor to various
SQL standards, including
SQL/XML, SQL/JSON and
SQL/PTF. Dr. Ozcan
co-authored several
conference papers and
patents. She received
the VLDB Women in
Database Research Award
in 2022. She is an ACM
Distinguished Member,
and the vice chair of
ACM SIGMOD.
- Immanuel
Trummer, Cornell
Immanuel Trummer is an
assistant professor at
Cornell University and a
member of the Cornell
Database Group. His
papers were selected for
“Best of VLDB”, “Best
of SIGMOD”, for the ACM
SIGMOD Research
Highlight Award, and for
publication in CACM as
CACM Research
Highlight. His online
lecture introducing
students to database
topics collected over a
million views. He
received the NSF CAREER
Award and multiple
Google Faculty Research
Awards.
- Umar Farooq Minhas, Apple
Umar Farooq Minhas is an
engineering leader in
Siri and Information
Intelligence in AIML at
Apple. Umar leads
various efforts to
provide ML-based
semantic annotations
(e.g., entity linking)
exploiting the Apple
Knowledge Graph to power
various intelligent
experiences in Siri Q&A,
Spotlight, and
Safari. Previously, Umar
was a Principal
Researcher working on ML
for systems at Microsoft
Research and a Research
Staff Member working on
big data at IBM Almaden
Research.
Coffee Break (3.30-4pm)
Session 4 (4-5pm) (Chair: Danica Porobic, Oracle)
-
Mallet: SQL Dialect
Translation with LLM Rule
Generation, Amadou
Ngom and Tim Kraska, MIT
-
Smart Science Needs Linked
Open Data with a Dash of Large
Language Models and Extended
Relations, Hasan Jamil,
University of Idaho
Workshop Steering Committee
- Rajesh Bordawekar, IBM T.J. Watson Research Center
- Oded Shmueli, Hirundo Ltd., and Emeritus Professor at Technion - Israel Institute of Technology
Workshop Program Chairs
Program Committee
- Madelon Hulsebos, UC Berkeley
- Sonia-Florina Horchidan, KTH Royal Institute of Technology
- Yuliang Li, Reality Lab Research, Meta
- Yasuko Matsubara, Osaka University
- Apoorva Nitsure, IBM Almaden Research Center
- Amit Somech, Bar-Ilan University
- Matthias Urban, TU Darmstadt
- Jun Wan, Databricks
- Wenlu Wang, Texas A&M University
Important Dates
- Paper Submission: Monday, 25th March 2024, 12 pm PST (EXTENDED)
- Notification of Acceptance: Monday, 15th April, 2024
- Camera-ready Submission: Monday, 6th May, 2024
Submission Site
All submissions will be handled electronically via EasyChair.
Formatting Guidelines
We will use the same document templates as the SIGMOD/PODS'24
conferences (the
ACM format). Like SIGMOD/PODS'24, the aiDM submission should be double-blind. It is the authors' responsibility to ensure that
their submissions adhere
strictly to the ACM
format. In particular, it is not allowed to modify the format with the objective of squeezing in more material. Submissions that do not comply with the formatting detailed here will be rejected without review.
The paper length for a full paper is limited to 12
pages, with unlimited pages of references. However, shorter papers
(4 or 8 pages)
are encouraged as
well.
All accepted papers will be
indexed via the ACM digital
library and available for
download from the workshop
webpage in the digital
library.