Process-Aware Data Quality Assessment


Ognjen Savković
Freire Universität Bozen
September 2016


Data quality is one of the most important problems in data management. In this thesis we focus on two problems of data quality: stability and completeness. Stability and completeness are two sides of the same coin. We define query stability problem as the problem of checking whether the answers of a query will stay the same, i.e., they will not change in the future, when data is changing by an underlying business process. We define query completeness as the problem of checking whether a query posed over a partially complete database returns the answers as if the database was complete, i.e., whether all data required by the query is already present. We investigate query stability for conjunctive queries. In particular, (i) we define a formalism that combines an explicit representation of the control flow of a process with a specification of how data is read and inserted into the database. (ii) We consider different restrictions of the process model and the state of the system, such as negation in process conditions, cyclic executions, read access to written data, presence of pending process instances, and the possibility to start fresh process instances. (iii) We identify for which restriction combinations stability of conjunctive queries is decidable and for which not. (iv) For all decidable cases we provide encodings into suitable variants of Datalog, ranging from non-recursive Datalog to Datalog with negation under stable model semantics. (v) We show that the encodings are optimal with respect to the worst-case complexity of the problem. (vi) We study complexity measures for all parameters of the problem: database instance, process instances, process model, and query. Next, we investigate the problem of query completeness for conjunctive queries. In our setting, partial completeness of a database is expressed using meta- data that describes which records are present in the database. (vii) We develop techniques to reason about query completeness over partially complete databases. (vii) The reasoning procedures can also take into account integrity constraints that hold over a database such as finite domain and foreign key constraints. (viii) For each combination of constraints, we establish a characterization of completeness reasoning and based on that we show how to encode the problem into a logic program. All encodings can be executed immediately by Prolog engines or ASP solvers. (ix) To deal with the case when a query is incomplete, we compute queries that approximate the original query by being more general or more special than the original query, but that are complete. Our results have direct practical implications. (x) Our approach to stability problem provides a possible way for implementations using SQL and Datalog engines. (xi) Our reasoning on completeness has been tested and it can process inputs that mirror the requirements of a real world scenario. (xii) We have implemented a completeness management demo tool, called MAGIK, that can check completeness of queries and that can compute complete query approximations.