Freyja is a system that supports efficient data discovery over data lakes (i.e. large-scale, heterogeneous data repositories). This website is a companion to the research papers revolving around this project.
Freyja's novelty lies in a learning-based approach to data discovery that relies on dataset profiles. These are succinct representations that capture the underlying characteristics of the schemata and data values of datasets, and they can be efficiently extracted in parallel. Profiles are then compared to predict the quality of a join operation between a pair of attributes from different datasets. With these scores, we can build rankings and present a list of the best candidates to join with.
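The profile-then-rank pipeline can be sketched as follows. This is a minimal, hypothetical illustration: the profile features (`cardinality`, `avg_length`) and the distance-based scorer are placeholders, not Freyja's actual features or learned model.

```python
# Hypothetical sketch of profile-based join ranking.
# The profile features and the scoring function are illustrative
# placeholders; Freyja uses its own profile features and a
# learned predictive model instead.

def profile(column):
    """Extract a succinct profile from a column's string values."""
    values = [str(v) for v in column]
    return {
        "cardinality": len(set(values)) / max(len(values), 1),
        "avg_length": sum(len(v) for v in values) / max(len(values), 1),
    }

def predict_join_quality(p_query, p_candidate):
    """Placeholder scorer: turns profile distance into a score in (0, 1]."""
    dist = sum(abs(p_query[k] - p_candidate[k]) for k in p_query)
    return 1.0 / (1.0 + dist)

def rank_candidates(query_col, candidates):
    """Return candidate attribute names ranked by predicted join quality."""
    p_query = profile(query_col)
    scores = {name: predict_join_quality(p_query, profile(col))
              for name, col in candidates.items()}
    return sorted(scores, key=scores.get, reverse=True)
```

The key property this sketch shares with the real system is that candidate columns never need to be compared value-by-value at query time: only their compact profiles are compared.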
Key Contributions:
The source code of the system can be found in the following GitHub repository. This repository contains:
To develop the predictive model employed in Freyja, we curated a data lake of 160 datasets collected from open repositories such as Kaggle and OpenML, spanning a wide variety of domains. This search yielded a total of 110,378 candidate pairs of textual attributes, which we filtered by their degree of intersection to identify potential join candidates. The resulting 4,318 joins were then manually labeled as either semantic or syntactic.
The benchmark can be downloaded from here.
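The intersection-based filtering step described above can be sketched as a simple containment check. The threshold value below is illustrative, not the criterion actually used to derive the 4,318 joins.

```python
# Sketch of filtering candidate attribute pairs by degree of
# intersection (set containment). The 0.1 threshold is an
# assumption for illustration; the paper defines the real criterion.

def containment(query_values, candidate_values):
    """Fraction of distinct query values that appear in the candidate."""
    q, c = set(query_values), set(candidate_values)
    return len(q & c) / len(q) if q else 0.0

def filter_candidates(pairs, threshold=0.1):
    """Keep only attribute pairs whose containment reaches the threshold."""
    return [(q, c) for q, c in pairs if containment(q, c) >= threshold]
```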
For the sake of reproducibility, we provide all the additional artifacts generated during the development and testing of the approach. These are mainly used in the experiments showcased in the notebooks. This includes:
This data can be downloaded from here.
We believe in transparent and shareable research [1], [2]. Reproducing the results described in the paper can be done by following the instructions provided in the GitHub repository, indicating the desired data lake to test. We have included the specific script used to obtain the P@K and R@K results displayed in the paper.
We used a total of 7 benchmarks, all of them to test efficiency (i.e. preprocessing and query speed, alongside the overall capacity to scale) and 6 of them to test effectiveness (i.e. quality of the rankings generated). One of these benchmarks was designed by our team and used to train Freyja's model. Next, we include references to the original papers and repositories of the remaining 6 benchmarks:
Each of these benchmarks has a ground truth (except for SANTOS Big, which is therefore only used to test efficiency). That is, for a series of query columns that we want to find joins for, the ground truth simply indicates which candidate columns should be found. The difference between the ideal set of candidates and the real joins found by each system produces the respective P@K and R@K scores.
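The P@K and R@K scores mentioned above are the standard ranking metrics: for a ranked list of candidates and a ground-truth set of relevant joins, they measure how many of the top K candidates are correct, and how many of the correct joins appear in the top K.

```python
# Standard precision@K and recall@K over a ranked candidate list,
# scored against a ground-truth set of relevant candidates.

def precision_at_k(ranked, relevant, k):
    """Fraction of the top-K ranked candidates that are relevant."""
    top_k = ranked[:k]
    return sum(1 for c in top_k if c in relevant) / k

def recall_at_k(ranked, relevant, k):
    """Fraction of all relevant candidates found within the top K."""
    top_k = ranked[:k]
    return sum(1 for c in top_k if c in relevant) / len(relevant)
```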
To further facilitate the reproduction of our results, we provide all the ground truths employed to evaluate the benchmarks. The D3L and TUS Big ground truths were exceedingly large, so we removed unnecessary columns from the original versions. Both OmniMatch ground truths were not adapted to our evaluation method, so we restructured them and selected a sufficiently large subset. In all cases, we unified the header names in order to streamline the execution of some of the processes.
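Header unification of this kind amounts to a renaming pass over each ground-truth file. The mapping below is purely hypothetical, for illustration; the actual header variants differ per benchmark.

```python
# Hypothetical sketch of unifying ground-truth header names into one
# canonical scheme. The HEADER_MAP entries are illustrative only.

HEADER_MAP = {
    "query_table": "query_dataset",
    "query_col": "query_column",
    "candidate_table": "candidate_dataset",
    "candidate_col": "candidate_column",
}

def unify_headers(rows):
    """Rename known header variants; leave unrecognized keys untouched."""
    return [{HEADER_MAP.get(k, k): v for k, v in row.items()}
            for row in rows]
```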
Out of the four previous systems, we employ two as baselines to test the validity of Freyja: D3L and SANTOS. To further strengthen the evaluation, we included two more open-source systems:
Finally, to round out the comparison, we wanted to include two fully embedding-based approaches:
Unfortunately, these tools do not have open implementations available. Hence, we developed our own versions of the systems, leveraging the information provided in the respective papers and, in the case of DeepJoin, directly employing a set of instructions kindly provided by the authors. These implementations can be found in Freyja's repository (DeepJoin, WarpGate), with an easy-to-use implementation and extensive documentation covering installation guidelines and execution instructions. Details about the implementation decisions can be found here.
In the repository we present four notebooks that illustrate the development of the project. All the datasets and artifacts referenced in these notebooks can be found here (same link as in the 'Other assets' section). Below are links to pages that present the content of the notebooks without the code:
Join Quality metric. Development of the join quality metric.
Models. Steps undertaken to develop the final model.
Plots and statistical analyses. Executed for both the effectiveness and efficiency results.
Justifying K. Further proof of the validity of K as a lightweight semantic assessment tool.
The following video showcases a demo of Freyja, executed from a Python notebook. The video defines a use case and goes through all the necessary stages to perform join discovery, integrating the process into a data augmentation pipeline.
Last updated: 2025/10/14