Plots and statistical analyses

Link to notebook. The notebook contains all the code needed to generate the data and the plots. A precomputed copy of this data is also included here, under the k_justification folder.

Here we attach all the results obtained throughout the development of Freyja. This encompasses two main areas:

Next, we provide the raw results for all tests:

A. Query Time Efficiency Analysis (ANOVA and Tukey's HSD)

For each system-benchmark pair, we measured query execution time over N=30 independent runs. For each benchmark, a one-way ANOVA tested for overall performance differences among systems, followed by Tukey's Honest Significant Difference (HSD) post-hoc test for pairwise comparisons against Freyja.
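As a minimal sketch of the first step, the one-way ANOVA F-statistic is the ratio of between-group to within-group mean squares. The function name `one_way_anova_f` and the toy timings below are assumptions made for illustration; they are not the notebook's code or the measured data, and the p-value (which requires the F distribution) is left out.

```python
from statistics import fmean


def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a one-way ANOVA over `groups`."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = fmean([x for g in groups for x in g])
    group_means = [fmean(g) for g in groups]
    # Variation of the group means around the grand mean, weighted by group size.
    ss_between = sum(
        len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means)
    )
    # Variation of each observation around its own group mean.
    ss_within = sum(
        (x - m) ** 2 for g, m in zip(groups, group_means) for x in g
    )
    df_between = k - 1
    df_within = n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Hypothetical query times in seconds (toy values, not benchmark measurements).
timings = {
    "system_a": [1.0, 2.0, 3.0],
    "system_b": [2.0, 3.0, 4.0],
    "system_c": [5.0, 6.0, 7.0],
}
f_stat, df_b, df_w = one_way_anova_f(list(timings.values()))
print(f"F({df_b}, {df_w}) = {f_stat:.4f}")
```

Assuming all seven systems (Freyja plus six competitors) enter each test with 30 runs apiece, the degrees of freedom would be 6 and 203.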

ANOVA Results for Query Execution Time
Benchmark F-Statistic p-value Result
FreyjaBM 17799.8203 2.8223e-273 Significant differences exist
SANTOS Small 20020.1715 1.8979e-278 Significant differences exist
TUS Small 21659.8954 6.5143e-282 Significant differences exist
D3L 21548.4121 1.0989e-281 Significant differences exist
TUS Big 23554.9405 1.3253e-285 Significant differences exist
SANTOS Large 26808.6888 2.6698e-291 Significant differences exist
OM CG 44294010.4899 0 Significant differences exist
OM CR 46319696.2963 0 Significant differences exist
Tukey's HSD Post-hoc Test: Pairwise Comparison of Mean Query Times (Freyja vs. Competitor)
Benchmark Competitor meandiff (s) p-adj lower (95% CI) upper (95% CI) Reject H0
FreyjaBM D3L -129.8883 0.0000 -132.9585 -126.8182 True
DeepJoin 0.0203 1.0000 -3.0498 3.0905 False
KGLiDS 0.0307 1.0000 -3.0395 3.1008 False
SANTOS 244.8883 0.0000 241.8182 247.9585 True
Starmie 0.2723 1.0000 -2.7978 3.3425 False
WarpGate -0.0263 1.0000 -3.0965 3.0438 False
SANTOS Small D3L -141.4730 0.0000 -144.5064 -138.4396 True
DeepJoin 0.0693 1.0000 -2.9641 3.1027 False
KGLiDS -0.0820 1.0000 -3.1154 2.9514 False
SANTOS 254.6730 0.0000 251.6396 257.7064 True
Starmie 1.5473 0.7331 -1.4861 4.5807 False
WarpGate -0.1040 1.0000 -3.1374 2.9294 False
TUS Small D3L -171.8287 0.0000 -174.8276 -168.8297 True
DeepJoin 0.4043 0.9997 -2.5946 3.4033 False
KGLiDS -0.4323 0.9995 -3.4313 2.5666 False
SANTOS 248.1953 0.0000 245.1964 251.1943 True
Starmie 1.7673 0.5799 -1.2316 4.7663 False
WarpGate -0.6310 0.9959 -3.6300 2.3680 False
D3L D3L -152.2390 0.0000 -155.3111 -149.1669 True
DeepJoin -0.1383 1.0000 -3.2104 2.9337 False
KGLiDS -0.1063 1.0000 -3.1784 2.9657 False
SANTOS 266.5057 0.0000 263.4336 269.5777 True
Starmie 3.2457 0.0308 0.1736 6.3177 True
WarpGate -0.1460 1.0000 -3.2181 2.9261 False
TUS Big D3L -192.8350 0.0000 -196.0143 -189.6557 True
DeepJoin -0.6843 0.9953 -3.8636 2.4949 False
KGLiDS -0.8090 0.9885 -3.9883 2.3703 False
SANTOS 273.0350 0.0000 269.8557 276.2143 True
Starmie 2.9763 0.0830 -0.2029 6.1556 False
WarpGate -1.2823 0.8931 -4.4616 1.8969 False
SANTOS Large D3L -238.9717 0.0000 -242.2367 -235.7067 True
DeepJoin 0.2030 1.0000 -3.0620 3.4680 False
KGLiDS -1.0663 0.9593 -4.3313 2.1987 False
SANTOS 282.9050 0.0000 279.6400 286.1700 True
Starmie 15.2267 0.0000 11.9617 18.4917 True
WarpGate -1.9793 0.5460 -5.2443 1.2857 False
OM CG D3L -111.9770 0.0000 -112.0388 -111.9152 True
DeepJoin 0.0363 0.5832 -0.0255 0.0982 False
KGLiDS 0.0457 0.3003 -0.0162 0.1075 False
SANTOS 237.9240 0.0000 237.8622 237.9858 True
Starmie 0.0347 0.5265 -0.0216 0.0909 False
WarpGate -0.0337 0.6685 -0.0955 0.0282 False
OM CR D3L -109.9077 0.0000 -109.9680 -109.8473 True
DeepJoin 0.0243 0.8931 -0.0360 0.0847 False
KGLiDS 0.0543 0.1081 -0.0060 0.1147 False
SANTOS 238.9673 0.0000 238.9070 239.0277 True
Starmie 0.0460 0.2377 -0.0129 0.1049 False
WarpGate -0.0303 0.7462 -0.0907 0.0300 False
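The Reject H0 column can be read directly off each confidence interval: a pairwise difference is declared significant exactly when the 95% interval excludes zero, and the interval is centred on meandiff. The sketch below checks both properties on four FreyjaBM rows copied from the table above; the `TukeyRow` type and the helper names are hypothetical and exist only for this illustration.

```python
from typing import NamedTuple


class TukeyRow(NamedTuple):
    competitor: str
    meandiff: float
    lower: float
    upper: float
    reject: bool


# Four FreyjaBM rows, copied from the table above.
ROWS = [
    TukeyRow("D3L", -129.8883, -132.9585, -126.8182, True),
    TukeyRow("DeepJoin", 0.0203, -3.0498, 3.0905, False),
    TukeyRow("SANTOS", 244.8883, 241.8182, 247.9585, True),
    TukeyRow("Starmie", 0.2723, -2.7978, 3.3425, False),
]


def excludes_zero(row):
    """True when the whole confidence interval lies on one side of zero."""
    return row.lower > 0 or row.upper < 0


def is_centred(row, tol=1e-3):
    """True when meandiff is the midpoint of the interval, up to rounding."""
    return abs((row.lower + row.upper) / 2 - row.meandiff) <= tol


for row in ROWS:
    print(row.competitor, excludes_zero(row), is_centred(row))
```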

B. Effectiveness Analysis (Wilcoxon Signed-Rank Test)

The Wilcoxon Signed-Rank test results below are provided without aggressive scaling, relying on the split format to improve readability.

B.1 Freyja Benchmark Effectiveness

This section presents the pairwise effectiveness comparison between Freyja and all competing systems (SANTOS, Starmie, D3L, WarpGate, DeepJoin, and KGLiDS) on the FreyjaBM benchmark (a non-synthetic data lake). The Wilcoxon signed-rank test results indicate that the performance differences between Freyja and every competitor are statistically significant at all tested values of k for which the test is defined; the nan entries at k=1 are indeterminate.

Freyja Benchmark: Wilcoxon Test p-values for Precision (P@k)
Competitor k=1 k=2 k=3 k=4 k=5 k=6 k=7 k=8 k=9 k=10
SANTOS 0.0000* 0.0000* 0.0000* 0.0000* 0.0000* 0.0000* 0.0000* 0.0000* 0.0000* 0.0000*
Starmie nan* 0.0000* 0.0000* 0.0000* 0.0000* 0.0000* 0.0000* 0.0000* 0.0000* 0.0000*
D3L nan* 0.0000* 0.0000* 0.0000* 0.0000* 0.0000* 0.0000* 0.0000* 0.0000* 0.0000*
WarpGate 0.0000* 0.0000* 0.0000* 0.0000* 0.0000* 0.0000* 0.0000* 0.0000* 0.0000* 0.0000*
DeepJoin nan* 0.0000* 0.0000* 0.0000* 0.0000* 0.0000* 0.0000* 0.0000* 0.0000* 0.0000*
KGLiDS nan* 0.0000* 0.0000* 0.0000* 0.0000* 0.0000* 0.0000* 0.0000* 0.0000* 0.0000*
* Denotes statistical significance (p ≤ 0.0455). nan indicates indeterminate rank sum.
Freyja Benchmark: Wilcoxon Test p-values for Recall (R@k)
Competitor k=1 k=2 k=3 k=4 k=5 k=6 k=7 k=8 k=9 k=10
SANTOS 0.0000* 0.0000* 0.0000* 0.0000* 0.0000* 0.0000* 0.0000* 0.0000* 0.0000* 0.0000*
Starmie nan* 0.0000* 0.0000* 0.0000* 0.0000* 0.0000* 0.0000* 0.0000* 0.0000* 0.0000*
D3L nan* 0.0000* 0.0000* 0.0000* 0.0000* 0.0000* 0.0000* 0.0000* 0.0000* 0.0000*
WarpGate 0.0000* 0.0000* 0.0000* 0.0000* 0.0000* 0.0000* 0.0000* 0.0000* 0.0000* 0.0000*
DeepJoin nan* 0.0000* 0.0000* 0.0000* 0.0000* 0.0000* 0.0000* 0.0000* 0.0000* 0.0000*
KGLiDS nan* 0.0000* 0.0000* 0.0000* 0.0000* 0.0000* 0.0000* 0.0000* 0.0000* 0.0000*
* Denotes statistical significance (p ≤ 0.0455). nan indicates indeterminate rank sum.
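To make the test and the nan entries above concrete, here is a minimal sketch of a two-sided Wilcoxon signed-rank test using the normal approximation. It assumes zero differences are discarded and omits the tie correction to the variance; the function name and the sample scores are hypothetical, and this is not necessarily the routine the notebook uses. Under these assumptions, one plausible way to obtain nan is that both systems score identically on every query at a given k, so no non-zero differences remain to rank.

```python
import math


def wilcoxon_signed_rank(x, y):
    """Two-sided Wilcoxon signed-rank test (normal approximation).

    Returns (W, p). p is nan when every paired difference is zero.
    """
    diffs = [a - b for a, b in zip(x, y) if a != b]  # drop zero differences
    n = len(diffs)
    if n == 0:
        return 0.0, math.nan  # nothing to rank: the test is indeterminate

    # Rank the absolute differences, giving tied values their average rank.
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        avg_rank = (i + j) / 2 + 1
        for t in range(i, j + 1):
            ranks[order[t]] = avg_rank
        i = j + 1

    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    w = min(w_plus, w_minus)

    # Normal approximation of W under the null hypothesis (no tie correction).
    mean = n * (n + 1) / 4
    sd = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (w - mean) / sd
    p = math.erfc(abs(z) / math.sqrt(2))  # equals 2 * (1 - Phi(|z|))
    return w, p


# Hypothetical per-query hit counts in the top 10 (not benchmark data).
freyja_hits = [10, 9, 8, 10, 7, 9, 10, 8]
other_hits = [5, 6, 8, 4, 6, 3, 9, 1]
print(wilcoxon_signed_rank(freyja_hits, other_hits))
print(wilcoxon_signed_rank(freyja_hits, freyja_hits))  # p is nan here
```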
B.2 SANTOS Small Benchmark Effectiveness

This section details the Wilcoxon signed-rank test results comparing Freyja against all baselines on the SANTOS Small benchmark (a synthetic data lake). The analysis for this benchmark, which spans k=1 to k=10, shows that the performance differences between Freyja and its competitors are generally statistically significant, particularly for higher values of k. Notable exceptions occur at the initial ranking positions (k={1, 2}) when comparing Freyja against Starmie, WarpGate, DeepJoin, and KGLiDS, where the differences were found to be not statistically significant.
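For readers unfamiliar with the metrics, the sketch below computes P@k and R@k for a single query under the standard definitions: hits in the top k divided by k, and hits in the top k divided by the number of relevant items. This is an assumption, since the exact normalisation used in the evaluation is not stated here, and the function name and sample identifiers are hypothetical.

```python
def precision_recall_at_k(ranked, relevant, k):
    """Return (P@k, R@k) for one query.

    ranked   -- candidate identifiers, best first
    relevant -- set of ground-truth identifiers for the query
    """
    top_k = ranked[:k]
    hits = sum(1 for item in top_k if item in relevant)
    precision = hits / k
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical ranking and ground truth for one query column.
ranked = ["a", "b", "c", "d"]
relevant = {"a", "c", "e"}
for k in (1, 2, 4):
    print(k, precision_recall_at_k(ranked, relevant, k))
```

At small k only a handful of distinct score values are possible per query, which is one reason systems can be harder to separate statistically at the first ranking positions.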

SANTOS Small Benchmark: Wilcoxon Test p-values for Precision (P@k)
Competitor k=1 k=2 k=3 k=4 k=5 k=6 k=7 k=8 k=9 k=10
SANTOS 0.0000* 0.0000* 0.0000* 0.0000* nan* 0.0000* 0.0000* 0.0000* 0.0000* 0.0000*
Starmie 0.0429* 0.0588 0.0339* 0.0167* 0.0108* 0.0112* 0.0111* 0.0111* 0.0075* 0.0000*
D3L nan* 0.0000* 0.0000* 0.0167* 0.0108* 0.0000* 0.0000* 0.0000* 0.0000* 0.0000*
MJ (Freyja without K) 0.0412* 0.0000* 0.0000* 0.0000* 0.0000* 0.0000* 0.0000* 0.0000* 0.0000* 0.0000*
WarpGate 0.0438* 0.0588 0.0339* 0.0167* 0.0108* 0.0112* 0.0111* 0.0111* 0.0075* 0.0000*
DeepJoin 0.0457 0.0455* 0.0339* 0.0167* 0.0108* 0.0112* 0.0111* 0.0111* 0.0075* 0.0049*
KGLiDS 0.0000* 0.0588 0.0339* 0.0167* 0.0108* 0.0112* 0.0111* 0.0111* 0.0000* 0.0000*
* Denotes statistical significance (p ≤ 0.0455). MJ is the ablation study baseline. nan indicates indeterminate rank sum.
SANTOS Small Benchmark: Wilcoxon Test p-values for Recall (R@k)
Competitor k=1 k=2 k=3 k=4 k=5 k=6 k=7 k=8 k=9 k=10
SANTOS 0.0449* 0.0000* 0.0000* 0.0000* 0.0000* 0.0000* 0.0000* 0.0000* 0.0000* 0.0000*
Starmie 0.0422* 0.0588 0.0339* 0.0169* 0.0112* 0.0113* 0.0114* 0.0112* 0.0075* 0.0000*
D3L 0.0000* 0.0000* 0.0000* 0.0169* 0.0112* 0.0000* 0.0000* 0.0000* 0.0000* 0.0000*
MJ (Freyja without K) 0.0414* 0.0000* 0.0000* 0.0000* 0.0000* 0.0000* 0.0000* 0.0000* 0.0000* 0.0000*
WarpGate 0.0435* 0.0588 0.0394* 0.0169* 0.0112* 0.0113* 0.0114* 0.0112* 0.0000* 0.0000*
DeepJoin 0.0457 0.0588 0.0394* 0.0169* 0.0112* 0.0113* 0.0114* 0.0112* 0.0075* 0.0049*
KGLiDS 0.0449* 0.0588 0.0394* 0.0169* 0.0112* 0.0113* 0.0114* 0.0112* 0.0076* 0.0000*
* Denotes statistical significance (p ≤ 0.0455). MJ is the ablation study baseline.
B.3 TUS Small Benchmark Effectiveness

This section contains the Wilcoxon signed-rank test results for the TUS Small benchmark (a synthetic data lake). The results indicate that the performance differences between Freyja and its competitors (including the Multiset Jaccard ablation baseline, MJ) are statistically significant for all tested values of k.

TUS Small Benchmark: Wilcoxon Test p-values for Precision (P@k)
Competitor k=1 k=2 k=3 k=4 k=5 k=6
SANTOS 0.0000* 0.0000* 0.0000* 0.0000* 0.0000* 0.0000*
Starmie 0.0011* 0.0000* 0.0000* 0.0000* 0.0000* 0.0000*
D3L 0.0000* 0.0000* 0.0000* 0.0000* 0.0000* 0.0000*
MJ (Freyja without K) 0.0000* 0.0000* 0.0000* 0.0000* 0.0000* 0.0000*
WarpGate 0.0000* 0.0000* 0.0000* 0.0000* 0.0000* 0.0000*
DeepJoin 0.0000* 0.0000* 0.0000* 0.0000* 0.0000* 0.0000*
KGLiDS 0.0011* 0.0000* 0.0000* 0.0000* 0.0000* 0.0000*
* Denotes statistical significance (p ≤ 0.0455). MJ is the ablation study baseline.
TUS Small Benchmark: Wilcoxon Test p-values for Recall (R@k)
Competitor k=1 k=2 k=3 k=4 k=5 k=6
SANTOS 0.0000* 0.0000* 0.0000* 0.0000* 0.0000* 0.0000*
Starmie 0.0013* 0.0000* 0.0000* 0.0000* 0.0000* 0.0000*
D3L 0.0000* 0.0000* 0.0000* 0.0000* 0.0000* 0.0000*
MJ (Freyja without K) 0.0014* 0.0000* 0.0000* 0.0000* 0.0000* 0.0000*
WarpGate 0.0000* 0.0000* 0.0000* 0.0000* 0.0000* 0.0000*
DeepJoin 0.0000* 0.0000* 0.0000* 0.0000* 0.0000* 0.0000*
KGLiDS 0.0014* 0.0000* 0.0000* 0.0000* 0.0000* 0.0000*
* Denotes statistical significance (p ≤ 0.0455). MJ is the ablation study baseline.
B.4 TUS Big Benchmark Effectiveness

This section provides the Wilcoxon signed-rank test results for the TUS Big benchmark (a synthetic data lake), which spans k=1 to k=6. The results indicate that the differences between Freyja and most systems are statistically significant. The exceptions occur at k=1, for Precision (P@k) against DeepJoin and for Recall (R@k) against Starmie, where the differences are not statistically significant.

TUS Big Benchmark: Wilcoxon Test p-values for Precision (P@k)
Competitor k=1 k=2 k=3 k=4 k=5 k=6
SANTOS 0.0000* 0.0000* 0.0000* 0.0000* 0.0000* 0.0000*
Starmie 0.0000* 0.0000* 0.0000* 0.0000* 0.0000* 0.0000*
D3L 0.0000* 0.0000* 0.0000* 0.0000* 0.0000* 0.0000*
MJ (Freyja without K) 0.0000* 0.0000* 0.0000* 0.0000* 0.0000* 0.0000*
WarpGate 0.0000* 0.0000* 0.0000* 0.0000* 0.0000* 0.0000*
DeepJoin 0.0479 0.0171* 0.0000* 0.0000* 0.0000* 0.0000*
KGLiDS 0.0000* 0.0000* 0.0000* 0.0000* 0.0000* 0.0000*
* Denotes statistical significance (p ≤ 0.0455). MJ is the ablation study baseline.
TUS Big Benchmark: Wilcoxon Test p-values for Recall (R@k)
Competitor k=1 k=2 k=3 k=4 k=5 k=6
SANTOS 0.0000* 0.0000* 0.0000* 0.0000* 0.0000* 0.0000*
Starmie 0.0479 0.0000* 0.0000* 0.0000* 0.0000* 0.0000*
D3L 0.0000* 0.0000* 0.0000* 0.0000* 0.0000* 0.0000*
MJ (Freyja without K) 0.0000* 0.0000* 0.0000* 0.0000* 0.0000* 0.0000*
WarpGate 0.0000* 0.0000* 0.0000* 0.0000* 0.0000* 0.0000*
DeepJoin 0.0449* 0.0171* 0.0000* 0.0000* 0.0000* 0.0000*
KGLiDS 0.0000* 0.0000* 0.0000* 0.0000* 0.0000* 0.0000*
* Denotes statistical significance (p ≤ 0.0455). MJ is the ablation study baseline.
B.5 D3L Benchmark Effectiveness

This section provides the Wilcoxon signed-rank test results comparing Freyja against all baselines on the D3L benchmark. As a non-synthetic benchmark whose ground-truth joins are manually annotated, D3LBM makes join detection harder than the synthetic datasets do. The evaluation spans k=1 to k=10. The results indicate that the performance differences between Freyja and all compared systems (SANTOS, Starmie, D3L, WarpGate, DeepJoin, KGLiDS, and the ablation baseline MJ) are statistically significant wherever the test is defined; for SANTOS, Starmie, and D3L, the entries from k=7 to k=10 are nan.

D3L Benchmark: Wilcoxon Test p-values for Precision (P@k)
Competitor k=1 k=2 k=3 k=4 k=5 k=6 k=7 k=8 k=9 k=10
SANTOS 0.0000* 0.0000* 0.0000* 0.0000* 0.0000* 0.0000* nan* nan* nan* nan*
Starmie 0.0000* 0.0000* 0.0000* 0.0000* 0.0000* 0.0000* nan* nan* nan* nan*
D3L 0.0000* 0.0000* 0.0000* 0.0000* 0.0000* 0.0000* nan* nan* nan* nan*
MJ (Freyja without K) 0.0000* 0.0000* 0.0000* 0.0000* 0.0000* 0.0000* 0.0000* 0.0000* 0.0000* 0.0000*
WarpGate 0.0000* 0.0000* 0.0000* 0.0000* 0.0000* 0.0000* 0.0000* 0.0000* 0.0000* 0.0000*
DeepJoin 0.0000* 0.0000* 0.0000* 0.0000* 0.0000* 0.0000* 0.0000* 0.0000* 0.0000* 0.0000*
KGLiDS 0.0000* 0.0000* 0.0000* 0.0000* 0.0000* 0.0000* 0.0000* 0.0000* 0.0000* 0.0000*
* Denotes statistical significance (p ≤ 0.0455). MJ is the ablation study baseline. nan indicates indeterminate rank sum.
D3L Benchmark: Wilcoxon Test p-values for Recall (R@k)
Competitor k=1 k=2 k=3 k=4 k=5 k=6 k=7 k=8 k=9 k=10
SANTOS 0.0000* 0.0000* 0.0000* 0.0000* 0.0000* 0.0000* nan* nan* nan* nan*
Starmie 0.0000* 0.0000* 0.0000* 0.0000* 0.0000* 0.0000* nan* nan* nan* nan*
D3L 0.0000* 0.0000* 0.0000* 0.0000* 0.0000* 0.0000* nan* nan* nan* nan*
MJ (Freyja without K) 0.0000* 0.0000* 0.0000* 0.0000* 0.0000* 0.0000* 0.0000* 0.0000* 0.0000* 0.0000*
WarpGate 0.0000* 0.0000* 0.0000* 0.0000* 0.0000* 0.0000* 0.0000* 0.0000* 0.0000* 0.0000*
DeepJoin 0.0000* 0.0000* 0.0000* 0.0000* 0.0000* 0.0000* 0.0000* 0.0000* 0.0000* 0.0000*
KGLiDS 0.0000* 0.0000* 0.0000* 0.0000* 0.0000* 0.0000* 0.0000* 0.0000* 0.0000* 0.0000*
* Denotes statistical significance (p ≤ 0.0455). MJ is the ablation study baseline. nan indicates indeterminate rank sum.
B.6 Omnimatch - City Government Benchmark Effectiveness

This section provides the Wilcoxon signed-rank test results for the OMCG benchmark (a synthetic data lake), which spans k=5 to k=30. The results indicate that the differences between Freyja and all systems are statistically significant.

Omnimatch - City Government Benchmark: Wilcoxon Test p-values for Precision (P@k)
Competitor k=5 k=10 k=15 k=20 k=25 k=30
SANTOS 0.0000* 0.0000* 0.0000* 0.0000* 0.0000* 0.0000*
Starmie 0.0000* 0.0000* 0.0000* 0.0000* 0.0000* 0.0000*
D3L 0.0000* 0.0000* 0.0000* 0.0000* 0.0000* 0.0000*
MJ (Freyja without K) 0.0000* 0.0000* 0.0000* 0.0000* 0.0000* 0.0000*
WarpGate 0.0000* 0.0000* 0.0000* 0.0000* 0.0000* 0.0000*
DeepJoin 0.0000* 0.0000* 0.0000* 0.0000* 0.0000* 0.0000*
KGLiDS 0.0000* 0.0000* 0.0000* 0.0000* 0.0000* 0.0000*
* Denotes statistical significance (p ≤ 0.0455). MJ is the ablation study baseline.
Omnimatch - City Government Benchmark: Wilcoxon Test p-values for Recall (R@k)
Competitor k=5 k=10 k=15 k=20 k=25 k=30
SANTOS 0.0000* 0.0000* 0.0000* 0.0000* 0.0000* 0.0000*
Starmie 0.0000* 0.0000* 0.0000* 0.0000* 0.0000* 0.0000*
D3L 0.0000* 0.0000* 0.0000* 0.0000* 0.0000* 0.0000*
MJ (Freyja without K) 0.0000* 0.0000* 0.0000* 0.0000* 0.0000* 0.0000*
WarpGate 0.0000* 0.0000* 0.0000* 0.0000* 0.0000* 0.0000*
DeepJoin 0.0000* 0.0000* 0.0000* 0.0000* 0.0000* 0.0000*
KGLiDS 0.0000* 0.0000* 0.0000* 0.0000* 0.0000* 0.0000*
* Denotes statistical significance (p ≤ 0.0455). MJ is the ablation study baseline.
B.7 Omnimatch - Culture Recreation Benchmark Effectiveness

This section provides the Wilcoxon signed-rank test results for the OMCR benchmark (a synthetic data lake), which spans k=5 to k=30. The results indicate that the differences between Freyja and all systems are statistically significant.

Omnimatch - Culture Recreation Benchmark: Wilcoxon Test p-values for Precision (P@k)
Competitor k=5 k=10 k=15 k=20 k=25 k=30
SANTOS 0.0000* 0.0000* 0.0000* 0.0000* 0.0000* 0.0000*
Starmie 0.0000* 0.0000* 0.0000* 0.0000* 0.0000* 0.0000*
D3L 0.0000* 0.0000* 0.0000* 0.0000* 0.0000* 0.0000*
MJ (Freyja without K) 0.0000* 0.0000* 0.0000* 0.0000* 0.0000* 0.0000*
WarpGate 0.0000* 0.0000* 0.0000* 0.0000* 0.0000* 0.0000*
DeepJoin 0.0000* 0.0000* 0.0000* 0.0000* 0.0000* 0.0000*
KGLiDS 0.0000* 0.0000* 0.0000* 0.0000* 0.0000* 0.0000*
* Denotes statistical significance (p ≤ 0.0455). MJ is the ablation study baseline.
Omnimatch - Culture Recreation Benchmark: Wilcoxon Test p-values for Recall (R@k)
Competitor k=5 k=10 k=15 k=20 k=25 k=30
SANTOS 0.0000* 0.0000* 0.0000* 0.0000* 0.0000* 0.0000*
Starmie 0.0000* 0.0000* 0.0000* 0.0000* 0.0000* 0.0000*
D3L 0.0000* 0.0000* 0.0000* 0.0000* 0.0000* 0.0000*
MJ (Freyja without K) 0.0000* 0.0000* 0.0000* 0.0000* 0.0000* 0.0000*
WarpGate 0.0000* 0.0000* 0.0000* 0.0000* 0.0000* 0.0000*
DeepJoin 0.0000* 0.0000* 0.0000* 0.0000* 0.0000* 0.0000*
KGLiDS 0.0000* 0.0000* 0.0000* 0.0000* 0.0000* 0.0000*
* Denotes statistical significance (p ≤ 0.0455). MJ is the ablation study baseline.

C. False Discovery Rate (FDR) Correction

This section details the results of the multiple comparison adjustment applied to the statistical tests performed across all benchmarks and metrics. To control the expected proportion of false positives resulting from performing numerous statistical comparisons, we applied the Benjamini-Hochberg (B-H) procedure. This procedure was conducted by aggregating all p-values obtained from the Tukey's HSD tests (Query Time) and the Wilcoxon signed-rank tests (Effectiveness P@k/R@k) into a single list. By choosing a standard False Discovery Rate level Q = 0.05, the resulting cutoff p-value for significance was determined to be 0.0455. In total, 765 tests (95.15% of all tests) were declared statistically significant after B-H correction, confirming the robustness of the derived conclusions.
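The procedure described above can be sketched in a few lines: sort the m p-values, find the largest one satisfying p_(i) ≤ (i/m)·Q, and reject every hypothesis whose p-value does not exceed it. The function name `benjamini_hochberg` and the sample p-values are hypothetical, and the sketch assumes nan entries have been removed beforehand; how the notebook treats them is not stated here.

```python
def benjamini_hochberg(p_values, q=0.05):
    """Benjamini-Hochberg step-up procedure.

    Returns (cutoff, rejected), where cutoff is the largest p-value p_(i)
    with p_(i) <= (i / m) * q (or None if there is none) and rejected is a
    list of booleans aligned with the input order.
    """
    m = len(p_values)
    cutoff = None
    for i, p in enumerate(sorted(p_values), start=1):
        if p <= i / m * q:
            cutoff = p  # keep the largest p-value that passes its threshold
    rejected = [cutoff is not None and p <= cutoff for p in p_values]
    return cutoff, rejected


# Hypothetical p-values (not the 804 values aggregated in this study).
sample = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205]
cutoff, rejected = benjamini_hochberg(sample, q=0.05)
print(cutoff, sum(rejected))
```

With m = 804 and 765 rejections, the procedure implies that the cutoff p-value can be at most (765/804)·0.05 ≈ 0.0476.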

Summary of Benjamini-Hochberg (B-H) Procedure Results
Metric Value
Total Number of Tests (m) 804
FDR Level (Q) 0.05
Resulting Cutoff p-value 0.0455
Number of Rejected Null Hypotheses (Declared Significant) 765
Percentage of Rejected Hypotheses 95.15%
Expected False Discovery Rate (FDR) ≤ 5%

last updated: 2026/01/07