Freyja is a system that supports data discovery over data lakes (i.e., large-scale heterogeneous data repositories). This website is a companion to the research papers revolving around this project.
Freyja's novelty lies in a learning-based approach to data discovery that relies on dataset profiles. These are succinct representations that capture the underlying characteristics of the schemata and data values of datasets, and they can be extracted efficiently in parallel. Profiles are then compared to predict the quality of a join operation between a pair of attributes from different datasets. With these scores, we can build rankings and present a list of the best candidates to perform a join with.
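The profile-compare-rank flow described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the three profile features, the function names `profile`, `score` and `rank`, and in particular the hand-written `score` function, which merely stands in for Freyja's trained predictive model and is not the model itself.

```python
from statistics import mean


def profile(values):
    """Summarise a non-empty column of strings with a few simple features.

    The features are illustrative only; they are not Freyja's actual profile.
    """
    distinct = set(values)
    return {
        "cardinality": len(distinct),
        "uniqueness": len(distinct) / len(values),
        "avg_length": mean(len(v) for v in values),
    }


def score(p, q):
    """Hypothetical stand-in for the learned model: closer profiles score higher."""
    # Relative difference per feature, each in [0, 1]; `or 1` avoids dividing by zero.
    diffs = [abs(p[k] - q[k]) / (max(p[k], q[k]) or 1) for k in p]
    return 1.0 - mean(diffs)


def rank(query_values, candidates):
    """Return (candidate name, score) pairs, best candidate first."""
    q = profile(query_values)
    scored = [(name, score(q, profile(vals))) for name, vals in candidates.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)


query = ["spain", "france", "italy"]
candidates = {
    "countries": ["spain", "france", "italy", "germany"],
    "ids": [str(i) for i in range(1, 11)],
}
ranking = rank(query, candidates)
```

Here the `countries` column ranks above `ids` because its profile is closer to that of the query column. Note that only profiles are compared, never the raw values, which is what keeps the comparison cheap.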
The source code of the system can be found in the following GitHub repository.
The repository contains a README file with all the necessary instructions to run the system.
To develop the predictive model employed in Freyja, we curated our own data lake by collecting 160 datasets from open repositories such as Kaggle and OpenML, covering a wide variety of domains. This search yielded a total of 110,378 candidate pairs of textual attributes, which we filtered by their degree of intersection to define potential join candidates. The resulting 4,318 joins were then manually labeled as either semantic or syntactic. The benchmark can be downloaded from here.
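As a rough illustration of filtering attribute pairs by their degree of intersection, the sketch below measures overlap with set containment and keeps the pairs above a threshold. The choice of containment, the `0.5` threshold, and the names `containment` and `join_candidates` are assumptions made for this example; the exact measure and cut-off used to build the benchmark are not stated here.

```python
from itertools import combinations


def containment(a, b):
    """Fraction of the distinct values of `a` that also appear in `b`."""
    a, b = set(a), set(b)
    return len(a & b) / len(a) if a else 0.0


def join_candidates(columns, threshold=0.5):
    """Keep attribute pairs from different datasets whose overlap is high enough.

    `columns` maps (dataset, attribute) to a list of textual values.
    The threshold is a hypothetical value chosen for the example.
    """
    kept = []
    for (key1, vals1), (key2, vals2) in combinations(columns.items(), 2):
        if key1[0] == key2[0]:
            continue  # skip pairs of attributes within the same dataset
        overlap = max(containment(vals1, vals2), containment(vals2, vals1))
        if overlap >= threshold:
            kept.append((key1, key2, overlap))
    return kept


columns = {
    ("a", "country"): ["spain", "france"],
    ("b", "nation"): ["spain", "france", "italy"],
    ("b", "city"): ["paris"],
    ("c", "id"): ["1", "2"],
}
pairs = join_candidates(columns)
```

In this toy lake only the pair `a.country` and `b.nation` survives the filter. Pairs that pass would then go on to manual labeling as semantic or syntactic, which a purely value-based measure cannot decide.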
We believe in transparent and shareable research [1], [2].
The results described in the paper can be reproduced by following the instructions provided in the GitHub repository, indicating the desired data lake to test. Next, we include references to the original papers in which the employed benchmarks were presented:
To further facilitate the task, we also provide the ground truths employed to evaluate the model on the four external benchmarks; two of them are samples of the original (and exceedingly large) ground truths. To develop these ground truths, we removed unnecessary columns from the original versions and unified the header names in order to streamline the execution of some of the processes.
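The kind of clean-up described above (dropping unneeded columns and unifying header names) can be sketched with Python's standard `csv` module. The header names on both sides of the mapping and the function name `normalise_ground_truth` are hypothetical; the actual headers of the provided ground truths may differ.

```python
import csv
import io

# Hypothetical mapping from one benchmark's original headers to unified names.
# Any column that is not listed here is dropped.
HEADER_MAP = {
    "query_table": "target_ds",
    "query_col": "target_attr",
    "cand_table": "candidate_ds",
    "cand_col": "candidate_attr",
}


def normalise_ground_truth(csv_text, header_map):
    """Rename the mapped columns, drop the rest, and return the new CSV text."""
    reader = csv.DictReader(io.StringIO(csv_text))
    out = io.StringIO()
    writer = csv.DictWriter(
        out, fieldnames=list(header_map.values()), lineterminator="\n"
    )
    writer.writeheader()
    for row in reader:
        writer.writerow({new: row[old] for old, new in header_map.items()})
    return out.getvalue()


raw = "query_table,query_col,cand_table,cand_col,notes\nA.csv,id,B.csv,key,x\n"
cleaned = normalise_ground_truth(raw, HEADER_MAP)
```

With one mapping per benchmark, every ground truth ends up with the same four headers, so a single evaluation script can read all of them.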
Additionally, in the repository we showcase two notebooks that illustrate the development of the project.
Last update: 2024/07/30 by Marc Maynou