Datasets in LSCDBenchmark
The following table lists all data sets integrated into the benchmark.
| Data set | LGS | n | N/V/A | \|U\| | AN | JUD | Task | t1 | t2 | Reference | Version |
|---|---|---|---|---|---|---|---|---|---|---|---|
| DWUG | DE | 48 | 32/14/2 | 178 | 8 | 37k | WiC, WSI, LSCD (B,G,C) | 1800–1899 | 1946–1990 | | 2.2.0 |
| DWUG | EN | 40 | 36/4/0 | 189 | 9 | 29k | WiC, WSI, LSCD (B,G,C) | 1810–1860 | 1960–2010 | | 2.0.1 |
| DWUG | SV | 40 | 31/6/3 | 168 | 5 | 20k | WiC, WSI, LSCD (B,G,C) | 1790–1830 | 1895–1903 | | 2.0.1 |
| DWUG | ES | 100 | 51/24/25 | 40 | 12 | 62k | WiC, WSI, LSCD (B,G,C) | 1810–1906 | 1994–2020 | | 4.0.0 |
| DiscoWUG | DE | 75 | 39/16/20 | 49 | 8 | 24k | WiC, WSI, LSCD (B,G,C) | 1800–1899 | 1946–1990 | | 1.1.1 |
| RefWUG | DE | 22 | 15/1/6 | 19 | 5 | 4k | WiC, WSI, LSCD (B,G,C) | 1750–1800 | 1850–1900 | ? | 1.1.0 |
| NorDiaChange1 | NO | 40 | 40/0/0 | 21 | 3 | 14k | WiC, WSI, LSCD (B,G,C) | 1929–1965 | 1970–2013 | | 1.0.0 |
| NorDiaChange2 | NO | 40 | 40/0/0 | 21 | 3 | 15k | WiC, WSI, LSCD (B,G,C) | 1980–1990 | 2012–2019 | | 1.0.0 |
| DURel | DE | 22 | 15/1/6 | 104 | 5 | 6k | WiC, LSCD (C) | 1750–1800 | 1850–1900 | | 3.0.0 |
| SURel | DE | 22 | 19/3/0 | 104 | 4 | 5k | WiC, LSCD (C) | general | domain | | 3.0.0 |
| RuSemShift1 | RU | 71 | 65/6/0 | 119 | 5 | 21k | WiC, LSCD (C) | 1682–1916 | 1918–1990 | | 2.0.0 |
| RuSemShift2 | RU | 69 | 57/12/0 | 105 | 5 | 18k | WiC, LSCD (C) | 1918–1990 | 1991–2016 | | 2.0.0 |
| RuShiftEval1 | RU | 111 | 111/0/0 | 60 | 3 | 10k | WiC, LSCD (C) | 1682–1916 | 1918–1990 | | 2.0.0 |
| RuShiftEval2 | RU | 111 | 111/0/0 | 60 | 3 | 10k | WiC, LSCD (C) | 1918–1990 | 1991–2016 | | 2.0.0 |
| RuShiftEval3 | RU | 111 | 111/0/0 | 60 | 3 | 10k | WiC, LSCD (C) | 1682–1916 | 1991–2016 | | 2.0.0 |

LGS = language, n = number of target words, N/V/A = number of nouns/verbs/adjectives, |U| = average number of usages per word, AN = number of annotators, JUD = total number of judged usage pairs, Task = possible evaluation tasks, t1, t2 = time period 1/2, Reference = data set reference paper, Version = version used for experiments.
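Since n counts the target words and N/V/A breaks that count down by part of speech, the three N/V/A numbers should add up to n. The following is a minimal sketch of that check for a few rows copied from the table. The names `ROWS`, `pos_counts`, and `is_consistent` are illustrative and not part of LSCDBenchmark.

```python
from __future__ import annotations

# (data set, LGS, n, N/V/A) copied from the table; only a subset of rows.
ROWS = [
    ("DWUG", "DE", 48, "32/14/2"),
    ("DWUG", "ES", 100, "51/24/25"),
    ("DiscoWUG", "DE", 75, "39/16/20"),
    ("SURel", "DE", 22, "19/3/0"),
    ("RuSemShift2", "RU", 69, "57/12/0"),
]


def pos_counts(nva: str) -> tuple[int, int, int]:
    """Split an 'N/V/A' cell into (nouns, verbs, adjectives)."""
    nouns, verbs, adjectives = (int(part) for part in nva.split("/"))
    return nouns, verbs, adjectives


def is_consistent(n: int, nva: str) -> bool:
    """Check that the part-of-speech counts sum to the number of target words."""
    return sum(pos_counts(nva)) == n


for name, lgs, n, nva in ROWS:
    print(name, lgs, is_consistent(n, nva))
```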
WUGs
Word Usage Graphs (WUGs) are graphs that represent the relations between the usages of a word. Each usage of a word is a node, and nodes are connected by weighted edges. The edges depict the human-annotated semantic proximity of usage pairs. The WUG data sets can be found on the WUGsite.
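The graph structure described above can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the benchmark's own loader: the usage identifiers and judgment scores are invented, the 1 to 4 score range is assumed, and taking the median of repeated judgments as the edge weight is one plausible aggregation choice rather than something this page specifies.

```python
from collections import defaultdict
from statistics import median

# Hypothetical judgments, invented for illustration: (usage_a, usage_b, score).
# Higher scores are assumed to mean that the two usages are closer in meaning.
JUDGMENTS = [
    ("u1", "u2", 4), ("u1", "u2", 3),
    ("u1", "u3", 1),
    ("u2", "u3", 2), ("u2", "u3", 1), ("u2", "u3", 1),
]


def build_wug(judgments):
    """Return {frozenset({a, b}): weight}; weight is the median score of the pair."""
    scores = defaultdict(list)
    for a, b, score in judgments:
        # frozenset makes the edge undirected: (a, b) and (b, a) share one key.
        scores[frozenset((a, b))].append(score)
    return {pair: median(values) for pair, values in scores.items()}


def neighbours(wug, usage):
    """Map each usage connected to `usage` to the weight of the connecting edge.

    Assumes there are no self-pairs (a usage judged against itself).
    """
    result = {}
    for pair, weight in wug.items():
        if usage in pair:
            (other,) = pair - {usage}
            result[other] = weight
    return result


wug = build_wug(JUDGMENTS)
print(neighbours(wug, "u2"))
```

Keying edges by `frozenset` keeps the graph undirected without a third-party graph library. A library such as networkx would be a natural replacement for larger graphs.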
References
[1] Dominik Schlechtweg, Nina Tahmasebi, Simon Hengchen, Haim Dubossarsky, and Barbara McGillivray. 2021. DWUG: A large Resource of Diachronic Word Usage Graphs in Four Languages. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing.
[2] Frank D. Zamora-Reina, Felipe Bravo-Marquez, and Dominik Schlechtweg. 2022. LSCDiscovery: A shared task on semantic change discovery and detection in Spanish. In Proceedings of the 3rd International Workshop on Computational Approaches to Historical Language Change.
[3] Sinan Kurtyigit, Maike Park, Dominik Schlechtweg, Jonas Kuhn, and Sabine Schulte im Walde. 2021. Lexical Semantic Change Discovery. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers).
[4] Andrey Kutuzov, Samia Touileb, Petter Mæhlum, Tita Enstad, and Alexandra Wittemann. 2022. NorDiaChange: Diachronic Semantic Change Dataset for Norwegian. In Proceedings of the Thirteenth Language Resources and Evaluation Conference.
[5] Dominik Schlechtweg, Sabine Schulte im Walde, and Stefanie Eckmann. 2018. Diachronic Usage Relatedness (DURel): A framework for the annotation of lexical semantic change. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 169–174.
[6] Anna Hätty, Dominik Schlechtweg, and Sabine Schulte im Walde. 2019. SURel: A Gold Standard for Incorporating Meaning Shifts into Term Extraction. In Proceedings of the 8th Joint Conference on Lexical and Computational Semantics, pages 1–8.
[7] Julia Rodina and Andrey Kutuzov. 2020. RuSemShift: a dataset of historical lexical semantic change in Russian. In Proceedings of the 28th International Conference on Computational Linguistics (COLING 2020). Association for Computational Linguistics.
[8] Andrey Kutuzov and Lidia Pivovarova. 2021. RuShiftEval: a shared task on semantic shift detection for Russian. Komp’yuternaya Lingvistika i Intellektual’nye Tekhnologii: Dialog conference.