aiDM 2018

First International Workshop on Exploiting Artificial Intelligence Techniques for Data Management (aiDM)

Sunday, June 10, 2018, Room Fort Bend

In conjunction with SIGMOD/PODS 2018

Workshop Overview

Recently, the Artificial Intelligence (AI) field has been experiencing a resurgence. AI broadly covers a wide swath of techniques, which include logic-based approaches, probabilistic graphical models, and machine learning/deep learning approaches. Advances in hardware capabilities, such as Graphics Processing Units (GPUs), software components (e.g., accelerated libraries, programming frameworks), and systems infrastructures (e.g., GPU-enabled cloud providers) have led to widespread adoption of AI techniques in a variety of domains. Examples of such domains include image classification, autonomous driving, automatic speech recognition (ASR), and conversational systems (chatbots). AI solutions not only support multiple datatypes (e.g., free text, images, or speech), but are also available in various configurations, from personal devices to large-scale distributed systems.
In spite of the wide-ranging applications of AI techniques, their interaction with data management systems remains in its infancy. At present, a majority of database management systems (DBMS) are used primarily as a repository for feeding input data and storing results. Recently, there has been some activity in using AI techniques in data management systems, e.g., enabling natural language interfaces to relational databases and applying machine learning techniques for query optimization. However, a lot more needs to be done to fully exploit the power of AI for data management workloads.
We propose to organize a one-day workshop that will bring together people from academia and industry to discuss various ways of integrating AI techniques with data management systems. The primary goal of the proposed workshop is to explore opportunities for AI techniques to enhance different components of data management systems, e.g., user interfaces, tooling, performance optimizations, new query types, and workloads. Special emphasis will be given to transparent exploitation of AI techniques using existing data management for enterprise-class workloads. We hope this workshop will identify important areas of research and spur new efforts in this emerging field.

Topics of Interest

The goal of the workshop is to take a holistic view of various AI technologies and investigate how they can be applied to different components of an end-to-end data management pipeline. Special emphasis will be given to how AI techniques can be used to enhance user experience by reducing complexity in tools, providing newer insights, or providing better user interfaces. Topics of interest include, but are not restricted to:

Characterizing different AI approaches: Logic-based, Probabilistic Graphical models, and machine learning/deep learning approaches

Evaluation of different learning approaches: unsupervised learning, supervised or reinforcement learning, transfer learning, zero-shot learning, adversarial networks, and deep probabilistic models

New AI-enabled business intelligence (BI) queries for relational databases

Natural language queries and chatbot interfaces

Natural language result summarization

Impact of the lack of model interpretability

Evaluating quality of approximate results from AI-enabled queries

Supporting multiple datatypes (e.g., images or time-series data)

Supporting semi-structured, streaming, and graph databases

Reasoning over knowledge bases

Data exploration and visualization

Integrating structured and unstructured data sources

AI-enabled data integration strategies

Reinforcement Learning for Database Tuning

Impact of AI on tooling, e.g., ETL or data cleaning

Performance implications of AI-enabled queries

Case studies of AI-accelerated workloads

Social Implications of AI-enabled databases

Keynote Presentations

Learning Data Systems Components

Tim Kraska, MIT

Abstract: Database systems have a long history of carefully selecting efficient algorithms, e.g., a merge vs hash-join, based on data statistics. Yet, existing databases remain general purpose systems and are not built on a case-by-case basis for the specific needs and data characteristics of a user. For example, if the goal were to build a highly tuned system to store and query fixed-length records with continuous integer keys (e.g., the keys 1 to 100M), one would not use a conventional design. In this setting, a traditional B-Tree to index the key would not make much sense, since the key itself can be used as an offset, making it a constant $O(1)$ operation to look up any key. Similarly for sorting: if the keys come from a dense integer domain, we can simplify the sorting of incoming records based on the primary key, as the key can again be used as an offset to put the data directly into a sorted array, reducing the complexity of sorting from $O(N \log N)$ to $O(N)$. Joins over dense integer keys also simplify to lookups, requiring only a direct lookup using the key as an offset. Maybe surprisingly, the same optimizations are still possible for other data patterns. For example, if the data contains only even numbers or if the data contains duplicates following a deterministic bell-curved shape, etc., we can still do similar optimizations. In other words, knowing the exact data distributions allows us to highly optimize almost any algorithm and data structure the database system uses and potentially even change the complexity class of some algorithms. Of course, in most real-world use cases the data does not perfectly follow a known pattern, and it is usually not worthwhile to engineer a specialized system for every use case. However, if it were possible to learn the data pattern, correlations, etc. of the data, it might be possible to automatically synthesize index structures, sort and join algorithms, and even query optimizer algorithms that leverage these patterns for significant performance gains. Ideally, if the pattern can be efficiently learned, it might allow us to change the complexity class of algorithms as done in the example above. In this talk, I will start from this premise, present initial results of synthesizing index structures and other algorithms, as well as outline directions for future work.
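
As an illustration of the dense-key case described in the abstract, the following minimal Python sketch uses the key itself as an array offset for both sorting and lookup. It is not taken from the talk: the function names `place_by_key` and `lookup`, the record layout, and the assumption that keys are unique integers within a known range are all hypothetical choices made for this example.

```python
from typing import List, Optional, Tuple

# Hypothetical record layout: (key, payload).
Record = Tuple[int, str]


def place_by_key(
    records: List[Record], min_key: int, max_key: int
) -> List[Optional[Record]]:
    """Arrange records in key order in O(N) time.

    Assumes unique integer keys within [min_key, max_key]. Each key is
    used directly as an offset, so no comparisons are needed.
    """
    slots: List[Optional[Record]] = [None] * (max_key - min_key + 1)
    for key, payload in records:
        slots[key - min_key] = (key, payload)
    return slots


def lookup(
    slots: List[Optional[Record]], min_key: int, key: int
) -> Optional[Record]:
    """Fetch a record in O(1) time by treating the key as an offset."""
    offset = key - min_key
    if 0 <= offset < len(slots):
        return slots[offset]
    return None


# Records arrive unsorted; one pass leaves them in key order.
incoming = [(1003, "c"), (1001, "a"), (1004, "d"), (1002, "b")]
table = place_by_key(incoming, min_key=1001, max_key=1004)
found = lookup(table, min_key=1001, key=1003)
missing = lookup(table, min_key=1001, key=999)
```

The sketch only works because the key range is known and dense. With sparse or skewed keys the array would waste space, which is the gap that a learned model of the key distribution is meant to close.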

Bio: Tim Kraska is an Associate Professor of Electrical Engineering and Computer Science in MIT's Computer Science and Artificial Intelligence Laboratory. Currently, his research focuses on building systems for machine learning, and using machine learning for systems. Before joining MIT, Tim was an Assistant Professor at Brown, spent time at Google Research, and was a PostDoc in the AMPLab at UC Berkeley after he got his PhD from ETH Zurich. Tim is a 2017 Alfred P. Sloan Research Fellow in computer science, received the 2017 VMware Systems Research Award, an NSF CAREER Award, an Air Force Young Investigator award, two Very Large Data Bases (VLDB) conference best-demo awards, and a best-paper award from the IEEE International Conference on Data Engineering (ICDE).

Bringing together Data Integration, Semantic Web and Machine Learning for Drug Safety and Efficacy Predictions

Achille Fokoue, IBM Research AI

Abstract: Accurate predictions of potential interactions between drugs, proteins, genes, and diseases are a major requirement both in drug development and in patient care. Fortunately, the increasingly large amount of drug-related (open) data published on the Web, describing various properties of drugs and their relationships to other drugs, proteins, genes, diseases, related concepts and entities, provides the raw material needed to enable such accurate predictions. In this talk, I will present an end-to-end system that takes in various Web data sources as input and provides as output drug-drug interaction (DDI) and drug-protein interaction (DPI) predictions along with an explanation of why two entities (drug-drug or drug-protein) may interact. The system first creates a knowledge graph out of the input data sources through large-scale semantic integration, and then performs link prediction among drug/protein entities in the graph through large-scale similarity analysis and deep representation learning. Finally, I will discuss some of the key lessons learned from building such a system that might be relevant to the data management community.

Bio: Achille Fokoue is a Research Staff Member and Master Inventor at IBM Research AI. His research has focused on knowledge representation and reasoning, semantic web technologies, and machine learning. He has developed theories, algorithms and systems for scaling reasoning over large and expressive description logics knowledge bases that tolerate inconsistencies and uncertainties. He has also successfully applied semantic web technologies and machine learning in a variety of domains ranging from healthcare & life sciences to text analysis. In particular, in the past three years, he has been exploring the problem of predicting and elucidating the mechanisms of drug-drug interactions and drug-protein interactions using similarity based and deep learning based approaches on knowledge-graphs curated and linked from a variety of sources on the web. He is author or coauthor of 16 patents, 80+ scientific reports and manuscripts that have been cited, in aggregate, more than 2,000 times.

Learning to Make Usable Database Systems

H. V. Jagadish, University of Michigan in Ann Arbor

Abstract: Database systems have traditionally expected users to be experts both in the programming interface/language and in the specifics of the schema. Over the past decade or so, the importance of usability has gradually been realized, and several worthwhile efforts have taken important steps towards making database systems usable. The main challenge is getting the system to understand and act upon poorly and partially specified user intent. But this is a type of problem that artificial intelligence is often used to solve. So AI should be a major tool in improving database usability. In this talk, I will describe some efforts we have made in this direction. Based on this experience, I will speculate on what AI is likely to be good for, and what it will not be as good for.

Bio: H. V. Jagadish is Bernard A Galler Collegiate Professor of Electrical Engineering and Computer Science, and Distinguished Scientist at the Institute for Data Science, at the University of Michigan in Ann Arbor. Prior to 1999, he was Head of the Database Research Department at AT&T Labs, Florham Park, NJ. Professor Jagadish is well known for his broad-ranging research on information management, and has approximately 200 major papers and 37 patents. He is a fellow of the ACM, "The First Society in Computing," (since 2003) and of AAAS (since 2018). He served on the board of the Computing Research Association (2009-2018). He has been an Associate Editor for the ACM Transactions on Database Systems (1992-1995), Program Chair of the ACM SIGMOD annual conference (1996), Program Chair of the ISMB conference (2005), a trustee of the VLDB (Very Large DataBase) foundation (2004-2009), Founding Editor-in-Chief of the Proceedings of the VLDB Endowment (2008-2014), and Program Chair of the VLDB Conference (2014). Among his many awards, he won the ACM SIGMOD Contributions Award in 2013 and the David E Liddle Research Excellence Award (at the University of Michigan) in 2008. He has recently launched an online course on Data Ethics. See https://www.edx.org/course/data-science-ethics-michiganx-ds101x

Workshop Program
(8 am - 6.30 pm), Room Fort Bend

Session 1 (8 - 10.30 am)

Contextual Intelligence for Unified Data Governance, Edward Seabolt, Eser Kandogan and Mary Roth, IBM Almaden Research Center

StreamScope: Automatic Pattern Discovery over Data Streams, Kouki Kawabata, Yasuko Matsubara and Yasushi Sakurai, Kumamoto University

(Keynote Presentation, 9-10.30 am) Bringing together Data Integration, Semantic Web and Machine Learning for Drug Safety and Efficacy Predictions, Achille Fokoue, IBM Research AI

Coffee Break (10.30 - 11 am)

Session 2 (11 am - 12.30 pm)

GridFormation: Towards Self-Driven Online Data Partitioning using Reinforcement Learning, Gabriel Campero Durand, Marcus Pinnecke, Rufat Piriyev, Mahmoud Mohsen, David Broneske, Gunter Saake, Maya Sekeran, Fabián Matias Rodriguez and Laxmi Balami, University of Magdeburg

Deep Reinforcement Learning for Join Order Enumeration, Ryan Marcus and Olga Papaemmanouil, Brandeis University

Deep Reinforcement-Learning Framework for Exploratory Data Analysis, Tova Milo and Amit Somech, Tel Aviv University

Lunch Break (12.30 - 2 pm)

Session 3 (2 - 3.30 pm)

(Keynote Presentation) Learning Data Systems Components, Tim Kraska, MIT

Poster Session (3.30 - 4.30 pm)

Session 4 (4.30 - 6.30 pm)

(Keynote Presentation, 4.30-6 pm) Learning to Make Usable Database Systems, H. V. Jagadish, University of Michigan in Ann Arbor

AI and Data Management: Open Issues and Future Directions, Open Discussion

Organization

Workshop Co-Chairs

Rajesh Bordawekar, IBM T.J. Watson Research Center

Oded Shmueli, Computer Science Department, Technion

For questions regarding the workshop, please send email to bordaw AT us DOT ibm DOT com.

Program Committee

Sandeep Agrawal, Oracle

Subhabrata Mukherjee, Amazon

Tin Kam Ho, IBM Watson

Chid Apte, IBM Research

Vijil Chenthamarakshan, IBM Research

Tim Kraska, Brown University

Jens Dittrich, Saarland University

H. V. Jagadish, Univ. of Michigan

Tim Oates, Univ. of Maryland, Baltimore County

Sharad Mehrotra, University of California, Irvine

Rajesh Parekh, Facebook

Daisy Zhe Wang, University of Florida

Fei Chiang, McMaster University

Ken Pu, University of Ontario Institute of Technology

Yongjoo Park, Univ. of Michigan

Important Dates

Paper Submission: Friday, 16th March, 2018

Notification of Acceptance: Monday, 16th April, 2018

Camera-ready Submission: Monday, 30th April, 2018

Workshop Date: Sunday, 10th June, 2018

Submission Instructions

Submission Site

All submissions will be handled electronically via EasyChair.

Formatting Guidelines

We will use the same document templates as the SIGMOD/PODS'18 conferences (using the 2017 ACM format).
It is the authors' responsibility to ensure that their submissions adhere strictly to the 2017 ACM format. In particular, modifying the format with the objective of squeezing in more material is not allowed. Submissions that do not comply with the formatting detailed here will be rejected without review.

The length of a full paper is limited to 8 pages. However, shorter papers (4 pages) are encouraged as well.

All accepted papers will be indexed via the ACM digital library and available for download from the workshop webpage in the digital library.