I've been meaning for weeks to blog about a contributed article in the April issue of the Communications of the ACM: "Database and Information-Retrieval Methods for Knowledge Discovery," by Gerhard Weikum, Gjergji Kasneci, Maya Ramanath, and Fabian Suchanek. In it, the authors advocate integrating database systems (DB) and information retrieval (IR) methods to address emerging applications on large (especially web-scale) datasets.
The paper outlines the historical domain of each of the two approaches to information management: DB for accounting, records, and transaction processing systems; IR for text understanding, statistical ranking, and so on. In the 40 years of research that has gone into DB and IR there have been a few attempts to bridge them (Datalog, probabilistic relational algebra, similarity joins, etc.), but none of them has really made the leap from research to production. The authors argue that managing large libraries of semi-structured text marked up with metadata (that is, essentially, Semantic Web-style data management) will be the use case that finally brings DB and IR together.
While reading the article, I couldn't help but see these two disciplines as the forerunners of what I'll call the two schools of complex event processing (CEP). There are those who feel that CEP is not CEP unless it uses the tools of IR, and others who feel that reactive streaming DB technologies are also a form of CEP. This post is not an effort to restart that argument, but rather a statement that the future of CEP, like the future of web-scale data management, is likely to involve a rapprochement of these two strands of research.