Link to notebook. The notebook includes all the code needed to create the data and generate the plots. Nonetheless, one iteration of this data is already included here, under the k_justification folder.
Here we attach all the results obtained throughout the development of Freyja. This encompasses two main areas:
Next, we provide the raw results for all tests:
The Wilcoxon signed-rank test results below are provided without aggressive scaling, relying on the split format to improve readability.

Section B.1 presents the pairwise effectiveness comparison between Freyja and all competing systems (SANTOS, Starmie, D3L, WarpGate, DeepJoin, and KGLiDS) on the FreyjaBM benchmark (a non-synthetic data lake). The Wilcoxon signed-rank test results confirm that the performance differences between Freyja and all competitors are statistically significant for all tested values of k.

Section B.2 details the Wilcoxon signed-rank test results comparing Freyja against all baselines on the SANTOS Small benchmark (a synthetic data lake). The analysis for this benchmark, which spans k=1 to k=10, shows that the performance differences between Freyja and its competitors are generally statistically significant, particularly for higher values of k. Notable exceptions occur at the initial ranking positions (k={1, 2}) when comparing Freyja against Starmie, WarpGate, DeepJoin, and KGLiDS, where the differences were found to be not statistically significant.

Section B.3 contains the Wilcoxon signed-rank test results for the TUS Small benchmark (a synthetic data lake). The results confirm that the performance differences between Freyja and its competitors (including the Multiset Jaccard ablation baseline, MJ) are statistically significant for all tested values of k.

Section B.4 provides the Wilcoxon signed-rank test results for the TUS Big benchmark (a synthetic data lake), which spans k=1 to k=6. The results indicate that the differences between Freyja and most systems are statistically significant. The primary exception is at k=1 when comparing Precision (P@k) against DeepJoin and Recall (R@k) against Starmie, where the differences were not statistically significant.

Section B.5 provides the Wilcoxon signed-rank test results comparing Freyja against all baselines on the D3L benchmark. As a non-synthetic benchmark where ground truth joins are manually annotated, D3LBM makes join detection harder than synthetic datasets do. The evaluation spans k=1 to k=10. The Wilcoxon test results confirm that performance differences between Freyja and all compared systems (SANTOS, Starmie, D3L, and the ablation baseline MJ) are statistically significant for all tested values of k.

Section B.6 provides the Wilcoxon signed-rank test results for the OMCG benchmark (a synthetic data lake), which spans k=5 to k=30. The results indicate that the differences between Freyja and all systems are statistically significant.

Section B.7 provides the Wilcoxon signed-rank test results for the OMCR benchmark (a synthetic data lake), which spans k=5 to k=30. The results indicate that the differences between Freyja and all systems are statistically significant.

Section C details the results of the multiple comparison adjustment applied to the statistical tests performed across all benchmarks and metrics. To control the expected proportion of false positives resulting from performing numerous statistical comparisons, we applied the Benjamini-Hochberg (B-H) procedure. All p-values obtained from the Tukey's HSD tests (Query Time) and the Wilcoxon signed-rank tests (Effectiveness P@k/R@k) were aggregated into a single list. With a standard False Discovery Rate level Q = 0.05, the resulting cutoff p-value for significance was 0.0455. In total, 765 tests (95.15% of all tests) were declared statistically significant after B-H correction, which supports the robustness of the derived conclusions.

A. Query Time Efficiency Analysis (ANOVA and Tukey's HSD)

We performed N=30 independent runs of query execution time for each system-benchmark pair. A one-way ANOVA test was conducted to determine whether overall performance differences exist, followed by Tukey's Honest Significant Difference (HSD) post-hoc test for pairwise comparisons against Freyja.
One-way ANOVA (query time)

| Benchmark | F-Statistic | p-value | Result |
| --- | --- | --- | --- |
| FreyjaBM | 17799.8203 | 2.8223 · 10^{-273} | Significant differences exist |
| SANTOS Small | 20020.1715 | 1.8979 · 10^{-278} | Significant differences exist |
| TUS Small | 21659.8954 | 6.5143 · 10^{-282} | Significant differences exist |
| D3L | 21548.4121 | 1.0989 · 10^{-281} | Significant differences exist |
| TUS Big | 23554.9405 | 1.3253 · 10^{-285} | Significant differences exist |
| SANTOS Large | 26808.6888 | 2.6698 · 10^{-291} | Significant differences exist |
| OM CG | 44294010.4899 | 0 | Significant differences exist |
| OM CR | 46319696.2963 | 0 | Significant differences exist |
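
Each F-statistic above compares the query times of all systems on one benchmark. As a minimal sketch of how such a value is obtained, the standard-library Python below computes the one-way ANOVA F-statistic from per-system samples. The function name and the timings are hypothetical (they are not the measured data), and the p-value, which requires the F distribution, is left out; the notebook may well use a statistics library for this step.

```python
from statistics import mean


def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of samples, one per system."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    group_means = [mean(g) for g in groups]

    # Variation explained by the system (between groups)
    ss_between = sum(len(g) * (m - grand_mean) ** 2
                     for g, m in zip(groups, group_means))
    # Residual variation across runs of the same system (within groups)
    ss_within = sum((x - m) ** 2
                    for g, m in zip(groups, group_means) for x in g)

    df_between = k - 1
    df_within = n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Hypothetical query times in seconds, three runs per system
timings = {
    "system_a": [10.0, 11.0, 12.0],
    "system_b": [20.0, 21.0, 22.0],
    "system_c": [30.0, 31.0, 32.0],
}
f_stat, df_b, df_w = one_way_anova_f(list(timings.values()))
print(f"F({df_b}, {df_w}) = {f_stat:.1f}")
```

A large F means that the spread between system means dominates the run-to-run spread, which is why a pairwise post-hoc test is then needed to tell which systems actually differ.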
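Tukey's HSD pairwise comparisons (query time)

| Benchmark | Competitor | meandiff (s) | p-adj | lower (95% CI) | upper (95% CI) | Reject H0 |
| --- | --- | --- | --- | --- | --- | --- |
| FreyjaBM | D3L | -129.8883 | 0.0000 | -132.9585 | -126.8182 | True |
| FreyjaBM | DeepJoin | 0.0203 | 1.0000 | -3.0498 | 3.0905 | False |
| FreyjaBM | KGLiDS | 0.0307 | 1.0000 | -3.0395 | 3.1008 | False |
| FreyjaBM | SANTOS | 244.8883 | 0.0000 | 241.8182 | 247.9585 | True |
| FreyjaBM | Starmie | 0.2723 | 1.0000 | -2.7978 | 3.3425 | False |
| FreyjaBM | WarpGate | -0.0263 | 1.0000 | -3.0965 | 3.0438 | False |
| SANTOS Small | D3L | -141.4730 | 0.0000 | -144.5064 | -138.4396 | True |
| SANTOS Small | DeepJoin | 0.0693 | 1.0000 | -2.9641 | 3.1027 | False |
| SANTOS Small | KGLiDS | -0.0820 | 1.0000 | -3.1154 | 2.9514 | False |
| SANTOS Small | SANTOS | 254.6730 | 0.0000 | 251.6396 | 257.7064 | True |
| SANTOS Small | Starmie | 1.5473 | 0.7331 | -1.4861 | 4.5807 | False |
| SANTOS Small | WarpGate | -0.1040 | 1.0000 | -3.1374 | 2.9294 | False |
| TUS Small | D3L | -171.8287 | 0.0000 | -174.8276 | -168.8297 | True |
| TUS Small | DeepJoin | 0.4043 | 0.9997 | -2.5946 | 3.4033 | False |
| TUS Small | KGLiDS | -0.4323 | 0.9995 | -3.4313 | 2.5666 | False |
| TUS Small | SANTOS | 248.1953 | 0.0000 | 245.1964 | 251.1943 | True |
| TUS Small | Starmie | 1.7673 | 0.5799 | -1.2316 | 4.7663 | False |
| TUS Small | WarpGate | -0.6310 | 0.9959 | -3.6300 | 2.3680 | False |
| D3L | D3L | -152.2390 | 0.0000 | -155.3111 | -149.1669 | True |
| D3L | DeepJoin | -0.1383 | 1.0000 | -3.2104 | 2.9337 | False |
| D3L | KGLiDS | -0.1063 | 1.0000 | -3.1784 | 2.9657 | False |
| D3L | SANTOS | 266.5057 | 0.0000 | 263.4336 | 269.5777 | True |
| D3L | Starmie | 3.2457 | 0.0308 | 0.1736 | 6.3177 | True |
| D3L | WarpGate | -0.1460 | 1.0000 | -3.2181 | 2.9261 | False |
| TUS Big | D3L | -192.8350 | 0.0000 | -196.0143 | -189.6557 | True |
| TUS Big | DeepJoin | -0.6843 | 0.9953 | -3.8636 | 2.4949 | False |
| TUS Big | KGLiDS | -0.8090 | 0.9885 | -3.9883 | 2.3703 | False |
| TUS Big | SANTOS | 273.0350 | 0.0000 | 269.8557 | 276.2143 | True |
| TUS Big | Starmie | 2.9763 | 0.0830 | -0.2029 | 6.1556 | False |
| TUS Big | WarpGate | -1.2823 | 0.8931 | -4.4616 | 1.8969 | False |
| SANTOS Large | D3L | -238.9717 | 0.0000 | -242.2367 | -235.7067 | True |
| SANTOS Large | DeepJoin | 0.2030 | 1.0000 | -3.0620 | 3.4680 | False |
| SANTOS Large | KGLiDS | -1.0663 | 0.9593 | -4.3313 | 2.1987 | False |
| SANTOS Large | SANTOS | 282.9050 | 0.0000 | 279.6400 | 286.1700 | True |
| SANTOS Large | Starmie | 15.2267 | 0.0000 | 11.9617 | 18.4917 | True |
| SANTOS Large | WarpGate | -1.9793 | 0.5460 | -5.2443 | 1.2857 | False |
| OM CG | D3L | -111.9770 | 0.0000 | -112.0388 | -111.9152 | True |
| OM CG | DeepJoin | 0.0363 | 0.5832 | -0.0255 | 0.0982 | False |
| OM CG | KGLiDS | 0.0457 | 0.3003 | -0.0162 | 0.1075 | False |
| OM CG | SANTOS | 237.9240 | 0.0000 | 237.8622 | 237.9858 | True |
| OM CG | Starmie | 0.0347 | 0.5265 | -0.0216 | 0.0909 | False |
| OM CG | WarpGate | -0.0337 | 0.6685 | -0.0955 | 0.0282 | False |
| OM CR | D3L | -109.9077 | 0.0000 | -109.9680 | -109.8473 | True |
| OM CR | DeepJoin | 0.0243 | 0.8931 | -0.0360 | 0.0847 | False |
| OM CR | KGLiDS | 0.0543 | 0.1081 | -0.0060 | 0.1147 | False |
| OM CR | SANTOS | 238.9673 | 0.0000 | 238.9070 | 239.0277 | True |
| OM CR | Starmie | 0.0460 | 0.2377 | 0.2377 | 0.1049 | False |
| OM CR | WarpGate | -0.0303 | 0.7462 | -0.0907 | 0.0300 | False |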
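
One way to read the Tukey's HSD rows: at the 95% level, H0 is rejected exactly when the confidence interval for the mean difference excludes zero. The short sketch below checks that rule on four rows copied from the table above; the helper name `excludes_zero` is ours and is only meant as an illustration.

```python
# (benchmark, competitor, meandiff, lower, upper, reject) copied from the table above
rows = [
    ("FreyjaBM", "D3L", -129.8883, -132.9585, -126.8182, True),
    ("FreyjaBM", "DeepJoin", 0.0203, -3.0498, 3.0905, False),
    ("D3L", "Starmie", 3.2457, 0.1736, 6.3177, True),
    ("TUS Big", "Starmie", 2.9763, -0.2029, 6.1556, False),
]


def excludes_zero(lower, upper):
    """True when the interval [lower, upper] lies entirely on one side of zero."""
    return lower > 0 or upper < 0


decisions = []
for benchmark, competitor, meandiff, lower, upper, reject in rows:
    decision = excludes_zero(lower, upper)
    decisions.append(decision)
    half_width = (upper - lower) / 2  # margin around meandiff
    print(f"{benchmark:10s} {competitor:9s} meandiff={meandiff:9.4f} "
          f"+/- {half_width:.4f} -> reject H0: {decision} (table: {reject})")
```

The Starmie rows show why the interval matters more than the raw difference: a gap of about 3 seconds is significant on the D3L benchmark but not on TUS Big, where the interval still crosses zero.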
B. Effectiveness Analysis (Wilcoxon Signed-Rank Test)
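
The p-values in the tables of this section come from Wilcoxon signed-rank tests on paired samples. As a minimal sketch of how one such p-value could be obtained, the standard-library Python below implements the two-sided test with the normal approximation (no continuity correction). The function name, the sample scores, the assumption that pairs are per-query scores of Freyja and one competitor, and the choice to return nan when every paired difference is zero are all illustrative; the notebook may rely on a statistics library instead.

```python
import math


def wilcoxon_signed_rank(x, y):
    """Two-sided Wilcoxon signed-rank test (normal approximation).

    Zero differences are dropped. Returns (W, p) with W = min(W+, W-),
    or (nan, nan) when every paired difference is zero.
    """
    diffs = [a - b for a, b in zip(x, y) if a != b]
    n = len(diffs)
    if n == 0:
        return math.nan, math.nan

    # Rank |d| in ascending order, giving tied values their average rank
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    tie_term = 0.0
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        avg_rank = (i + j) / 2 + 1
        for t in range(i, j + 1):
            ranks[order[t]] = avg_rank
        size = j - i + 1
        tie_term += size ** 3 - size
        i = j + 1

    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    w = min(w_plus, w_minus)

    # Mean and tie-corrected variance of W under H0
    mu = n * (n + 1) / 4
    var = n * (n + 1) * (2 * n + 1) / 24 - tie_term / 48
    z = (w - mu) / math.sqrt(var)
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided tail probability
    return w, p


# Hypothetical paired scores for ten queries (not taken from the benchmarks)
freyja = [11, 12, 13, 14, 15, 16, 17, 18, 19, 20]
competitor = [10, 10, 10, 10, 10, 10, 10, 10, 10, 10]
w, p = wilcoxon_signed_rank(freyja, competitor)
print(f"W = {w}, p = {p:.4f}, significant at 0.0455: {p <= 0.0455}")

# Identical samples leave no non-zero differences, so the result is nan
w_same, p_same = wilcoxon_signed_rank([1, 1, 1], [1, 1, 1])
```

Under this sketch, a nan entry corresponds to a pair of systems whose scores coincide on every sample at that k, so no ranks can be assigned.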
B.1 Freyja Benchmark Effectiveness
Precision (P@k)

| Competitor | k=1 | k=2 | k=3 | k=4 | k=5 | k=6 | k=7 | k=8 | k=9 | k=10 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| SANTOS | 0.0000* | 0.0000* | 0.0000* | 0.0000* | 0.0000* | 0.0000* | 0.0000* | 0.0000* | 0.0000* | 0.0000* |
| Starmie | nan* | 0.0000* | 0.0000* | 0.0000* | 0.0000* | 0.0000* | 0.0000* | 0.0000* | 0.0000* | 0.0000* |
| D3L | nan* | 0.0000* | 0.0000* | 0.0000* | 0.0000* | 0.0000* | 0.0000* | 0.0000* | 0.0000* | 0.0000* |
| WarpGate | 0.0000* | 0.0000* | 0.0000* | 0.0000* | 0.0000* | 0.0000* | 0.0000* | 0.0000* | 0.0000* | 0.0000* |
| DeepJoin | nan* | 0.0000* | 0.0000* | 0.0000* | 0.0000* | 0.0000* | 0.0000* | 0.0000* | 0.0000* | 0.0000* |
| KGLiDS | nan* | 0.0000* | 0.0000* | 0.0000* | 0.0000* | 0.0000* | 0.0000* | 0.0000* | 0.0000* | 0.0000* |

* Denotes statistical significance (p ≤ 0.0455). nan indicates indeterminate rank sum.

Recall (R@k)

| Competitor | k=1 | k=2 | k=3 | k=4 | k=5 | k=6 | k=7 | k=8 | k=9 | k=10 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| SANTOS | 0.0000* | 0.0000* | 0.0000* | 0.0000* | 0.0000* | 0.0000* | 0.0000* | 0.0000* | 0.0000* | 0.0000* |
| Starmie | nan* | 0.0000* | 0.0000* | 0.0000* | 0.0000* | 0.0000* | 0.0000* | 0.0000* | 0.0000* | 0.0000* |
| D3L | nan* | 0.0000* | 0.0000* | 0.0000* | 0.0000* | 0.0000* | 0.0000* | 0.0000* | 0.0000* | 0.0000* |
| WarpGate | 0.0000* | 0.0000* | 0.0000* | 0.0000* | 0.0000* | 0.0000* | 0.0000* | 0.0000* | 0.0000* | 0.0000* |
| DeepJoin | nan* | 0.0000* | 0.0000* | 0.0000* | 0.0000* | 0.0000* | 0.0000* | 0.0000* | 0.0000* | 0.0000* |
| KGLiDS | nan* | 0.0000* | 0.0000* | 0.0000* | 0.0000* | 0.0000* | 0.0000* | 0.0000* | 0.0000* | 0.0000* |

* Denotes statistical significance (p ≤ 0.0455). nan indicates indeterminate rank sum.
B.2 SANTOS Small Benchmark Effectiveness
Precision (P@k)

| Competitor | k=1 | k=2 | k=3 | k=4 | k=5 | k=6 | k=7 | k=8 | k=9 | k=10 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| SANTOS | 0.0000* | 0.0000* | 0.0000* | 0.0000* | nan* | 0.0000* | 0.0000* | 0.0000* | 0.0000* | 0.0000* |
| Starmie | 0.0429* | 0.0588 | 0.0339* | 0.0167* | 0.0108* | 0.0112* | 0.0111* | 0.0111* | 0.0075* | 0.0000* |
| D3L | nan* | 0.0000* | 0.0000* | 0.0167* | 0.0108* | 0.0000* | 0.0000* | 0.0000* | 0.0000* | 0.0000* |
| MJ (Freyja without K) | 0.0412* | 0.0000* | 0.0000* | 0.0000* | 0.0000* | 0.0000* | 0.0000* | 0.0000* | 0.0000* | 0.0000* |
| WarpGate | 0.0438* | 0.0588 | 0.0339* | 0.0167* | 0.0108* | 0.0112* | 0.0111* | 0.0111* | 0.0075* | 0.0000* |
| DeepJoin | 0.0457* | 0.0455* | 0.0339* | 0.0167* | 0.0108* | 0.0112* | 0.0111* | 0.0111* | 0.0075* | 0.0049* |
| KGLiDS | 0.0000* | 0.0588 | 0.0339* | 0.0167* | 0.0108* | 0.0112* | 0.0111* | 0.0111* | 0.0000* | 0.0000* |

* Denotes statistical significance (p ≤ 0.0455). MJ is the ablation study baseline.

Recall (R@k)

| Competitor | k=1 | k=2 | k=3 | k=4 | k=5 | k=6 | k=7 | k=8 | k=9 | k=10 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| SANTOS | 0.0449* | 0.0000* | 0.0000* | 0.0000* | 0.0000* | 0.0000* | 0.0000* | 0.0000* | 0.0000* | 0.0000* |
| Starmie | 0.0422* | 0.0588 | 0.0339* | 0.0169* | 0.0112* | 0.0113* | 0.0114* | 0.0112* | 0.0075* | 0.0000* |
| D3L | 0.0000* | 0.0000* | 0.0000* | 0.0169* | 0.0112* | 0.0000* | 0.0000* | 0.0000* | 0.0000* | 0.0000* |
| MJ (Freyja without K) | 0.0414* | 0.0000* | 0.0000* | 0.0000* | 0.0000* | 0.0000* | 0.0000* | 0.0000* | 0.0000* | 0.0000* |
| WarpGate | 0.0435* | 0.0588 | 0.0394* | 0.0169* | 0.0112* | 0.0113* | 0.0114* | 0.0112* | 0.0000* | 0.0000* |
| DeepJoin | 0.0457* | 0.0588 | 0.0394* | 0.0169* | 0.0112* | 0.0113* | 0.0114* | 0.0112* | 0.0075* | 0.0049* |
| KGLiDS | 0.0449* | 0.0588 | 0.0394* | 0.0169* | 0.0112* | 0.0113* | 0.0114* | 0.0112* | 0.0076* | 0.0000* |

* Denotes statistical significance (p ≤ 0.0455). MJ is the ablation study baseline.
B.3 TUS Small Benchmark Effectiveness
Precision (P@k)

| Competitor | k=1 | k=2 | k=3 | k=4 | k=5 | k=6 |
| --- | --- | --- | --- | --- | --- | --- |
| SANTOS | 0.0000* | 0.0000* | 0.0000* | 0.0000* | 0.0000* | 0.0000* |
| Starmie | 0.0011* | 0.0000* | 0.0000* | 0.0000* | 0.0000* | 0.0000* |
| D3L | 0.0000* | 0.0000* | 0.0000* | 0.0000* | 0.0000* | 0.0000* |
| MJ (Freyja without K) | 0.0000* | 0.0000* | 0.0000* | 0.0000* | 0.0000* | 0.0000* |
| WarpGate | 0.0000* | 0.0000* | 0.0000* | 0.0000* | 0.0000* | 0.0000* |
| DeepJoin | 0.0000* | 0.0000* | 0.0000* | 0.0000* | 0.0000* | 0.0000* |
| KGLiDS | 0.0011* | 0.0000* | 0.0000* | 0.0000* | 0.0000* | 0.0000* |

* Denotes statistical significance (p ≤ 0.0455). MJ is the ablation study baseline.

Recall (R@k)

| Competitor | k=1 | k=2 | k=3 | k=4 | k=5 | k=6 |
| --- | --- | --- | --- | --- | --- | --- |
| SANTOS | 0.0000* | 0.0000* | 0.0000* | 0.0000* | 0.0000* | 0.0000* |
| Starmie | 0.0013* | 0.0000* | 0.0000* | 0.0000* | 0.0000* | 0.0000* |
| D3L | 0.0000* | 0.0000* | 0.0000* | 0.0000* | 0.0000* | 0.0000* |
| MJ (Freyja without K) | 0.0014* | 0.0000* | 0.0000* | 0.0000* | 0.0000* | 0.0000* |
| WarpGate | 0.0000* | 0.0000* | 0.0000* | 0.0000* | 0.0000* | 0.0000* |
| DeepJoin | 0.0000* | 0.0000* | 0.0000* | 0.0000* | 0.0000* | 0.0000* |
| KGLiDS | 0.0014* | 0.0000* | 0.0000* | 0.0000* | 0.0000* | 0.0000* |

* Denotes statistical significance (p ≤ 0.0455). MJ is the ablation study baseline.
B.4 TUS Big Benchmark Effectiveness
Precision (P@k)

| Competitor | k=1 | k=2 | k=3 | k=4 | k=5 | k=6 |
| --- | --- | --- | --- | --- | --- | --- |
| SANTOS | 0.0000* | 0.0000* | 0.0000* | 0.0000* | 0.0000* | 0.0000* |
| Starmie | 0.0000* | 0.0000* | 0.0000* | 0.0000* | 0.0000* | 0.0000* |
| D3L | 0.0000* | 0.0000* | 0.0000* | 0.0000* | 0.0000* | 0.0000* |
| MJ (Freyja without K) | 0.0000* | 0.0000* | 0.0000* | 0.0000* | 0.0000* | 0.0000* |
| WarpGate | 0.0000* | 0.0000* | 0.0000* | 0.0000* | 0.0000* | 0.0000* |
| DeepJoin | 0.0479 | 0.0171* | 0.0000* | 0.0000* | 0.0000* | 0.0000* |
| KGLiDS | 0.0000* | 0.0000* | 0.0000* | 0.0000* | 0.0000* | 0.0000* |

* Denotes statistical significance (p ≤ 0.0455). MJ is the ablation study baseline.

Recall (R@k)

| Competitor | k=1 | k=2 | k=3 | k=4 | k=5 | k=6 |
| --- | --- | --- | --- | --- | --- | --- |
| SANTOS | 0.0000* | 0.0000* | 0.0000* | 0.0000* | 0.0000* | 0.0000* |
| Starmie | 0.0479 | 0.0000* | 0.0000* | 0.0000* | 0.0000* | 0.0000* |
| D3L | 0.0000* | 0.0000* | 0.0000* | 0.0000* | 0.0000* | 0.0000* |
| MJ (Freyja without K) | 0.0000* | 0.0000* | 0.0000* | 0.0000* | 0.0000* | 0.0000* |
| WarpGate | 0.0000* | 0.0000* | 0.0000* | 0.0000* | 0.0000* | 0.0000* |
| DeepJoin | 0.0449* | 0.0171* | 0.0000* | 0.0000* | 0.0000* | 0.0000* |
| KGLiDS | 0.0000* | 0.0000* | 0.0000* | 0.0000* | 0.0000* | 0.0000* |

* Denotes statistical significance (p ≤ 0.0455). MJ is the ablation study baseline.
B.5 D3L Benchmark Effectiveness
Precision (P@k)

| Competitor | k=1 | k=2 | k=3 | k=4 | k=5 | k=6 | k=7 | k=8 | k=9 | k=10 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| SANTOS | 0.0000* | 0.0000* | 0.0000* | 0.0000* | 0.0000* | 0.0000* | nan* | nan* | nan* | nan* |
| Starmie | 0.0000* | 0.0000* | 0.0000* | 0.0000* | 0.0000* | 0.0000* | nan* | nan* | nan* | nan* |
| D3L | 0.0000* | 0.0000* | 0.0000* | 0.0000* | 0.0000* | 0.0000* | nan* | nan* | nan* | nan* |
| MJ (Freyja without K) | 0.0000* | 0.0000* | 0.0000* | 0.0000* | 0.0000* | 0.0000* | 0.0000* | 0.0000* | 0.0000* | 0.0000* |
| WarpGate | 0.0000* | 0.0000* | 0.0000* | 0.0000* | 0.0000* | 0.0000* | 0.0000* | 0.0000* | 0.0000* | 0.0000* |
| DeepJoin | 0.0000* | 0.0000* | 0.0000* | 0.0000* | 0.0000* | 0.0000* | 0.0000* | 0.0000* | 0.0000* | 0.0000* |
| KGLiDS | 0.0000* | 0.0000* | 0.0000* | 0.0000* | 0.0000* | 0.0000* | 0.0000* | 0.0000* | 0.0000* | 0.0000* |

* Denotes statistical significance (p ≤ 0.0455). MJ is the ablation study baseline.

Recall (R@k)

| Competitor | k=1 | k=2 | k=3 | k=4 | k=5 | k=6 | k=7 | k=8 | k=9 | k=10 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| SANTOS | 0.0000* | 0.0000* | 0.0000* | 0.0000* | 0.0000* | 0.0000* | nan* | nan* | nan* | nan* |
| Starmie | 0.0000* | 0.0000* | 0.0000* | 0.0000* | 0.0000* | 0.0000* | nan* | nan* | nan* | nan* |
| D3L | 0.0000* | 0.0000* | 0.0000* | 0.0000* | 0.0000* | 0.0000* | nan* | nan* | nan* | nan* |
| MJ (Freyja without K) | 0.0000* | 0.0000* | 0.0000* | 0.0000* | 0.0000* | 0.0000* | 0.0000* | 0.0000* | 0.0000* | 0.0000* |
| WarpGate | 0.0000* | 0.0000* | 0.0000* | 0.0000* | 0.0000* | 0.0000* | 0.0000* | 0.0000* | 0.0000* | 0.0000* |
| DeepJoin | 0.0000* | 0.0000* | 0.0000* | 0.0000* | 0.0000* | 0.0000* | 0.0000* | 0.0000* | 0.0000* | 0.0000* |
| KGLiDS | 0.0000* | 0.0000* | 0.0000* | 0.0000* | 0.0000* | 0.0000* | 0.0000* | 0.0000* | 0.0000* | 0.0000* |

* Denotes statistical significance (p ≤ 0.0455). MJ is the ablation study baseline.
B.6 Omnimatch - City Governance Benchmark Effectiveness
Precision (P@k)

| Competitor | k=1 | k=2 | k=3 | k=4 | k=5 | k=6 |
| --- | --- | --- | --- | --- | --- | --- |
| SANTOS | 0.0000* | 0.0000* | 0.0000* | 0.0000* | 0.0000* | 0.0000* |
| Starmie | 0.0000* | 0.0000* | 0.0000* | 0.0000* | 0.0000* | 0.0000* |
| D3L | 0.0000* | 0.0000* | 0.0000* | 0.0000* | 0.0000* | 0.0000* |
| MJ (Freyja without K) | 0.0000* | 0.0000* | 0.0000* | 0.0000* | 0.0000* | 0.0000* |
| WarpGate | 0.0000* | 0.0000* | 0.0000* | 0.0000* | 0.0000* | 0.0000* |
| DeepJoin | 0.0000* | 0.0000* | 0.0000* | 0.0000* | 0.0000* | 0.0000* |
| KGLiDS | 0.0000* | 0.0000* | 0.0000* | 0.0000* | 0.0000* | 0.0000* |

* Denotes statistical significance (p ≤ 0.0455). MJ is the ablation study baseline.

Recall (R@k)

| Competitor | k=1 | k=2 | k=3 | k=4 | k=5 | k=6 |
| --- | --- | --- | --- | --- | --- | --- |
| SANTOS | 0.0000* | 0.0000* | 0.0000* | 0.0000* | 0.0000* | 0.0000* |
| Starmie | 0.0000* | 0.0000* | 0.0000* | 0.0000* | 0.0000* | 0.0000* |
| D3L | 0.0000* | 0.0000* | 0.0000* | 0.0000* | 0.0000* | 0.0000* |
| MJ (Freyja without K) | 0.0000* | 0.0000* | 0.0000* | 0.0000* | 0.0000* | 0.0000* |
| WarpGate | 0.0000* | 0.0000* | 0.0000* | 0.0000* | 0.0000* | 0.0000* |
| DeepJoin | 0.0000* | 0.0000* | 0.0000* | 0.0000* | 0.0000* | 0.0000* |
| KGLiDS | 0.0000* | 0.0000* | 0.0000* | 0.0000* | 0.0000* | 0.0000* |

* Denotes statistical significance (p ≤ 0.0455). MJ is the ablation study baseline.
B.7 Omnimatch - Culture Recreation Benchmark Effectiveness
Precision (P@k)

| Competitor | k=1 | k=2 | k=3 | k=4 | k=5 | k=6 |
| --- | --- | --- | --- | --- | --- | --- |
| SANTOS | 0.0000* | 0.0000* | 0.0000* | 0.0000* | 0.0000* | 0.0000* |
| Starmie | 0.0000* | 0.0000* | 0.0000* | 0.0000* | 0.0000* | 0.0000* |
| D3L | 0.0000* | 0.0000* | 0.0000* | 0.0000* | 0.0000* | 0.0000* |
| MJ (Freyja without K) | 0.0000* | 0.0000* | 0.0000* | 0.0000* | 0.0000* | 0.0000* |
| WarpGate | 0.0000* | 0.0000* | 0.0000* | 0.0000* | 0.0000* | 0.0000* |
| DeepJoin | 0.0000* | 0.0000* | 0.0000* | 0.0000* | 0.0000* | 0.0000* |
| KGLiDS | 0.0000* | 0.0000* | 0.0000* | 0.0000* | 0.0000* | 0.0000* |

* Denotes statistical significance (p ≤ 0.0455). MJ is the ablation study baseline.

Recall (R@k)

| Competitor | k=1 | k=2 | k=3 | k=4 | k=5 | k=6 |
| --- | --- | --- | --- | --- | --- | --- |
| SANTOS | 0.0000* | 0.0000* | 0.0000* | 0.0000* | 0.0000* | 0.0000* |
| Starmie | 0.0000* | 0.0000* | 0.0000* | 0.0000* | 0.0000* | 0.0000* |
| D3L | 0.0000* | 0.0000* | 0.0000* | 0.0000* | 0.0000* | 0.0000* |
| MJ (Freyja without K) | 0.0000* | 0.0000* | 0.0000* | 0.0000* | 0.0000* | 0.0000* |
| WarpGate | 0.0000* | 0.0000* | 0.0000* | 0.0000* | 0.0000* | 0.0000* |
| DeepJoin | 0.0000* | 0.0000* | 0.0000* | 0.0000* | 0.0000* | 0.0000* |
| KGLiDS | 0.0000* | 0.0000* | 0.0000* | 0.0000* | 0.0000* | 0.0000* |

* Denotes statistical significance (p ≤ 0.0455). MJ is the ablation study baseline.
C. False Discovery Rate (FDR) Correction
| Metric | Value |
| --- | --- |
| Total Number of Tests (m) | 804 |
| FDR Level (Q) | 0.05 |
| Resulting Cutoff p-value | 0.0468 |
| Number of Rejected Null Hypotheses (Declared Significant) | 765 |
| Percentage of Rejected Hypotheses | 95.15% |
| Expected False Discovery Rate (FDR) | ≤ 5% |
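
The Benjamini-Hochberg step-up procedure behind this summary can be sketched in a few lines of standard-library Python: sort the m p-values, find the largest rank i whose p-value satisfies p(i) ≤ (i/m)·Q, and declare every test with a p-value at or below that cutoff significant. The function name and the sample p-values below are hypothetical and serve only as an illustration; the notebook may use a library routine for the same purpose.

```python
def benjamini_hochberg(p_values, q=0.05):
    """Return (cutoff, rejected) for the B-H step-up procedure at FDR level q.

    cutoff is the largest p-value declared significant (None if there is none);
    rejected[i] tells whether the null hypothesis of test i is rejected.
    """
    m = len(p_values)
    cutoff = None
    for rank, p in enumerate(sorted(p_values), start=1):
        # Keep the largest sorted p-value that stays under its own threshold
        if p <= rank / m * q:
            cutoff = p
    rejected = [cutoff is not None and p <= cutoff for p in p_values]
    return cutoff, rejected


# Hypothetical p-values, in the order the tests were run
p_values = [0.01, 0.04, 0.03, 0.035]
cutoff, rejected = benjamini_hochberg(p_values, q=0.05)
print(f"cutoff = {cutoff}, rejected = {sum(rejected)} of {len(p_values)}")

# A second hypothetical list where only the two smallest p-values survive
cutoff_b, rejected_b = benjamini_hochberg(
    [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
)
```

In the first list, 0.03 fails its own threshold (0.025) but is still rejected because a larger p-value further down the sorted list passes. This step-up behaviour is also why the resulting cutoff ends up at or below Q rather than exactly at Q.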
last updated: 2026/01/07