AUTOMATING THE INTEGRATION OF HETEROGENEOUS DATABASES
Due to the wide range of geographic scales and complex tasks the Government must administer, its data is split in many different ways and is collected at different times by different agencies. The resulting massive data heterogeneity means one cannot effectively locate, share, or compare data across sources, let alone achieve computational data interoperability. Addressing inter-dataset issues is critical to the success of developing the next generation of comprehensive emissions inventories, which is a key aim of efforts such as the Networked Environmental Information System for Global Emissions Inventories (NEISGEI).
To date, all approaches to wrap data collections, or even to create mappings across comparable datasets, require manual effort. Despite some promising work, the automated creation of such mappings is still in its infancy, since equivalences and differences manifest themselves at all levels, from individual data values through metadata to the explanatory text surrounding the data collection as a whole. More general methods are required to effectively address this problem.
Viewing the data mapping problem as a variant of the cross-language mapping problem of Machine Translation (MT), we propose to employ new statistical text alignment and clustering algorithms developed in Natural Language Processing to discover correspondences across comparable datasets at all levels. If our automatically learned mappings are effective, we should be able to significantly reduce the amount of manual labor required in database wrapping.
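To make the idea of discovering correspondences from the data itself more concrete, the following is a minimal sketch and not the proposed algorithms: it proposes column-level mappings between two tables by comparing character n-gram profiles of their values with cosine similarity. The function names, column names, and toy values are all hypothetical and chosen only for illustration.

```python
from collections import Counter
from math import sqrt


def ngram_profile(values, n=2):
    """Count character n-grams over all values in one column."""
    profile = Counter()
    for value in values:
        text = f"^{str(value).lower()}$"  # mark the start and end of each value
        for i in range(len(text) - n + 1):
            profile[text[i:i + n]] += 1
    return profile


def cosine(p, q):
    """Cosine similarity between two n-gram count profiles."""
    dot = sum(count * q[gram] for gram, count in p.items())
    norm = sqrt(sum(c * c for c in p.values())) * sqrt(sum(c * c for c in q.values()))
    return dot / norm if norm else 0.0


def propose_mappings(source, target):
    """For each source column, propose the most similar target column."""
    target_profiles = {name: ngram_profile(vals) for name, vals in target.items()}
    proposals = {}
    for name, vals in source.items():
        profile = ngram_profile(vals)
        best = max(target_profiles, key=lambda t: cosine(profile, target_profiles[t]))
        proposals[name] = (best, cosine(profile, target_profiles[best]))
    return proposals


# Hypothetical toy tables: a district-level source and a statewide target.
district = {
    "pollutant": ["NOx", "SO2", "PM10", "CO"],
    "county": ["Los Angeles", "Orange", "Riverside"],
}
statewide = {
    "CNTY_NAME": ["Orange", "Los Angeles", "Kern"],
    "POL": ["PM10", "NOx", "CO", "VOC"],
}

mappings = propose_mappings(district, statewide)
for column, (match, score) in mappings.items():
    print(f"{column} -> {match} (similarity {score:.2f})")
```

A sketch like this uses only the data values; the approach described above would also draw on metadata and the explanatory text surrounding each collection, and its proposals would still be reviewed by whoever builds the wrapper.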
To evaluate our work, we will collaborate with research partners at Washington University in St. Louis, who are building a network to support data integration, and visualization tools to display geographically based data.
We are working with two varieties of domain data. Air quality data is being provided by EPA staff at the California Air Resources Board in Sacramento, who periodically integrate data from some 35 regional Air Quality Management Districts throughout California into a single California-wide database, and pass this along to the Federal EPA in North Carolina. Fire emissions data will be provided by a different set of EPA offices, the USDA/Forest Service, and the Department of Interior.
To the extent this work succeeds, it has the potential to significantly reduce the amount of human work involved in creating single-point access to multiple heterogeneous databases. This problem is faced by thousands of large enterprises with numerous data collections, from Government agencies at all levels to the chemical and automotive industries to startup companies that link together and integrate websites. By automatically postulating mappings across databases and their metadata, the proposed algorithms will enable the database wrapper builder (whether fully manual or semi-automated) to work more quickly and effectively. The postulated mappings will also help with the creation of metadata standards.
In particular, we will provide our results to our partner agencies in the EPA so that they can transform their data at will. After the first year, working with our partners at the Federal EPA, we will also map appropriate data collections from other US states and from other countries (such as Mexico).
Use semi-automated information integration methods to generate translation protocols between related information sources, e.g., between local air quality management districts and CARB, or between state environmental resources agencies and US EPA. Use data-intensive parallel translation algorithms from the machine translation community on large sample data sets to automatically generate and maintain such translation protocols. (Earlier, more detailed description of this technology.)
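As a rough illustration of how a translation protocol might be learned from parallel sample data, the sketch below counts how often a code in one source co-occurs with a code in another across aligned record pairs, and keeps the most frequent pairing together with its relative frequency. This is a deliberately simplified stand-in for a statistical alignment model, not the algorithms referred to above; the function name and all code values are hypothetical.

```python
from collections import Counter, defaultdict


def learn_value_translation(parallel_pairs):
    """Learn a source-to-target code table from aligned record pairs.

    Each pair is (source_code, target_code) taken from two records that
    are assumed to describe the same underlying observation.
    """
    counts = defaultdict(Counter)
    for source_code, target_code in parallel_pairs:
        counts[source_code][target_code] += 1

    table = {}
    for source_code, targets in counts.items():
        target_code, hits = targets.most_common(1)[0]
        # Keep the relative frequency as a crude confidence score.
        table[source_code] = (target_code, hits / sum(targets.values()))
    return table


# Hypothetical aligned samples: district pollutant code, statewide code.
pairs = [
    ("NOX", "S17"), ("NOX", "S17"), ("NOX", "S04"),
    ("CO", "S04"), ("CO", "S04"),
]

table = learn_value_translation(pairs)
for source_code, (target_code, confidence) in table.items():
    print(f"{source_code} -> {target_code} (confidence {confidence:.2f})")
```

Low-confidence entries, such as a code that maps to several targets about equally often, are the natural candidates to route to a human reviewer in a semi-automated workflow.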