Freyja is a system that supports efficient data discovery over data lakes (i.e. large-scale, heterogeneous data repositories). This website is a companion to the research papers revolving around this project.
Freyja's novelty lies in a learning-based approach to data discovery that relies on dataset profiles. These are succinct representations that capture the underlying characteristics of the schemata and data values of datasets, and they can be efficiently extracted in parallel. Profiles are then compared to predict the quality of a join operation between a pair of attributes from different datasets. With these scores, we can build rankings and present a list of the best candidates to join with.
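The profile-then-rank pipeline can be sketched as follows. This is a minimal, hypothetical illustration: the profile features (`cardinality`, `avg_length`) and the distance-based scorer are placeholders, not Freyja's actual features or learned model.

```python
# Hypothetical sketch of profile-based join ranking.
# The profile features and the scoring function are illustrative
# placeholders; Freyja uses its own profile features and a
# learned predictive model instead.

def profile(column):
    """Extract a succinct profile from a column's string values."""
    values = [str(v) for v in column]
    return {
        "cardinality": len(set(values)) / max(len(values), 1),
        "avg_length": sum(len(v) for v in values) / max(len(values), 1),
    }

def predict_join_quality(p_query, p_candidate):
    """Placeholder scorer: turns profile distance into a score in (0, 1]."""
    dist = sum(abs(p_query[k] - p_candidate[k]) for k in p_query)
    return 1.0 / (1.0 + dist)

def rank_candidates(query_col, candidates):
    """Return candidate attribute names ranked by predicted join quality."""
    p_query = profile(query_col)
    scores = {name: predict_join_quality(p_query, profile(col))
              for name, col in candidates.items()}
    return sorted(scores, key=scores.get, reverse=True)
```

The key property this sketch shares with the real system is that candidate columns never need to be compared value-by-value at query time: only their compact profiles are compared.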
Key Contributions:
The source code of the system can be found in the following GitHub repository. This repository contains:
To develop the predictive model employed in Freyja, we curated a data lake of 160 datasets collected from open repositories such as Kaggle and OpenML, spanning a wide variety of domains. This search yielded a total of 110,378 candidate pairs of textual attributes, which we filtered by their degree of intersection to identify potential join candidates. The resulting 4,318 joins were then manually labeled as either semantic or syntactic.
The benchmark can be downloaded from here.
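The intersection-based filtering step described above can be sketched as a simple containment check. The threshold value below is illustrative, not the criterion actually used to derive the 4,318 joins.

```python
# Sketch of filtering candidate attribute pairs by degree of
# intersection (set containment). The 0.1 threshold is an
# assumption for illustration; the paper defines the real criterion.

def containment(query_values, candidate_values):
    """Fraction of distinct query values that appear in the candidate."""
    q, c = set(query_values), set(candidate_values)
    return len(q & c) / len(q) if q else 0.0

def filter_candidates(pairs, threshold=0.1):
    """Keep only attribute pairs whose containment reaches the threshold."""
    return [(q, c) for q, c in pairs if containment(q, c) >= threshold]
```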
For the sake of reproducibility, we provide all the additional artifacts generated during the development and testing of the approach. These are mainly used in the experiments showcased in the notebooks. This includes:
This data can be downloaded from here.
We believe in transparent and shareable research [1], [2]. Reproducing the results described in the paper can be done by following the instructions provided in the GitHub repository, indicating the desired data lake to test. We have included the specific script used to obtain the P@K and R@K results displayed in the paper.
We used a total of 7 benchmarks, all of them to test efficiency (i.e. preprocessing and query speed, alongside the overall capacity to scale) and 6 of them to test effectiveness (i.e. quality of the rankings generated). One of these benchmarks was designed by our team and used to train Freyja's model. Next, we include references to the original papers and repositories of the remaining 6 benchmarks:
Each of these benchmarks has a ground truth (except for SANTOS Big, which is therefore only used to test efficiency). That is, for a series of query columns that we want to find joins for, the ground truth simply indicates which candidate columns should be found. The difference between the ideal set of candidates and the real joins found by each system produces the respective P@K and R@K scores.
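The P@K and R@K scores mentioned above are the standard ranking metrics: for a ranked list of candidates and a ground-truth set of relevant joins, they measure how many of the top K candidates are correct, and how many of the correct joins appear in the top K.

```python
# Standard precision@K and recall@K over a ranked candidate list,
# scored against a ground-truth set of relevant candidates.

def precision_at_k(ranked, relevant, k):
    """Fraction of the top-K ranked candidates that are relevant."""
    top_k = ranked[:k]
    return sum(1 for c in top_k if c in relevant) / k

def recall_at_k(ranked, relevant, k):
    """Fraction of all relevant candidates found within the top K."""
    top_k = ranked[:k]
    return sum(1 for c in top_k if c in relevant) / len(relevant)
```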
To further facilitate the reproduction of our results, we provide all the ground truths employed to evaluate the benchmarks. The D3L and TUS Big ground truths were exceedingly large, so we removed unnecessary columns from the original versions. Both OmniMatch ground truths were not adapted to our evaluation method, so we restructured them and selected a sufficiently large subset. In all cases, we unified the header names in order to streamline the execution of some of the processes.
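Header unification of this kind amounts to a renaming pass over each ground-truth file. The mapping below is purely hypothetical, for illustration; the actual header variants differ per benchmark.

```python
# Hypothetical sketch of unifying ground-truth header names into one
# canonical scheme. The HEADER_MAP entries are illustrative only.

HEADER_MAP = {
    "query_table": "query_dataset",
    "query_col": "query_column",
    "candidate_table": "candidate_dataset",
    "candidate_col": "candidate_column",
}

def unify_headers(rows):
    """Rename known header variants; leave unrecognized keys untouched."""
    return [{HEADER_MAP.get(k, k): v for k, v in row.items()}
            for row in rows]
```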
Out of the four previous systems, we employ two as baselines to test the validity of Freyja: D3L and SANTOS. To further strengthen the evaluation, we included two more open-source systems:
Finally, to round out the comparison, we wanted to include two fully embedding-based approaches:
Unfortunately, these tools do not have open implementations available. Hence, we developed our own versions of the systems, leveraging the information provided in the respective papers and, in the case of DeepJoin, directly employing a set of instructions kindly provided by the authors. These implementations can be found in Freyja's repository (DeepJoin, WarpGate), with an easy-to-use implementation and extensive documentation covering installation guidelines and execution instructions. Details about the implementation decisions can be found here.
In the repository we present four notebooks that illustrate the development of the project. All the datasets and artifacts referenced in these notebooks can be found here (same link as in the 'Other assets' section). Below are links to pages that present the content of the notebooks without the code:
Join Quality metric. Development of the join quality metric.
Models. Steps undertaken to develop the final model.
Plots and statistical analyses. Executed for both the effectiveness and efficiency results.
Justifying K. Further proof of the validity of K as a lightweight semantic assessment tool.
The following video showcases a demo of Freyja, executed from a Python notebook. The video defines a use case and goes through all the necessary stages to perform join discovery, integrating the process into a data augmentation pipeline.
Last updated: 2025/10/14