Datasets in LSCDBenchmark#

The following table lists all the data sets integrated into the benchmark.

Data set	LGS	n	N/V/A	|U|	AN	JUD	Task	t₁	t₂	Reference	Version
DWUG	DE	48	32/14/2	178	8	37k	WiC, WSI, LSCD (B,G,C)	1800–1899	1946–1990	Schlechtweg et al. (2021)	2.2.0
DWUG	EN	40	36/4/0	189	9	29k	WiC, WSI, LSCD (B,G,C)	1810–1860	1960–2010	Schlechtweg et al. (2021)	2.0.1
DWUG	SV	40	31/6/3	168	5	20k	WiC, WSI, LSCD (B,G,C)	1790–1830	1895–1903	Schlechtweg et al. (2021)	2.0.1
DWUG	ES	100	51/24/25	40	12	62k	WiC, WSI, LSCD (B,G,C)	1810–1906	1994–2020	Zamora-Reina et al. (2022)	4.0.0
DiscoWUG	DE	75	39/16/20	49	8	24k	WiC, WSI, LSCD (B,G,C)	1800–1899	1946–1990	Kurtyigit et al. (2021)	1.1.1
RefWUG	DE	22	15/1/6	19	5	4k	WiC, WSI, LSCD (B,G,C)	1750–1800	1850–1900	?	1.1.0
NorDiaChange1	NO	40	40/0/0	21	3	14k	WiC, WSI, LSCD (B,G,C)	1929–1965	1970–2013	Kutuzov et al. (2022)	1.0.0
NorDiaChange2	NO	40	40/0/0	21	3	15k	WiC, WSI, LSCD (B,G,C)	1980–1990	2012–2019	Kutuzov et al. (2022)	1.0.0
DURel	DE	22	15/1/6	104	5	6k	WiC, LSCD (C)	1750–1800	1850–1900	Schlechtweg et al. (2018)	3.0.0
SURel	DE	22	19/3/0	104	4	5k	WiC, LSCD (C)	general	domain	Hätty et al. (2019)	3.0.0
RuSemShift1	RU	71	65/6/0	119	5	21k	WiC, LSCD (C)	1682–1916	1918–1990	Rodina and Kutuzov (2020)	2.0.0
RuSemShift2	RU	69	57/12/0	105	5	18k	WiC, LSCD (C)	1918–1990	1991–2016	Rodina and Kutuzov (2020)	2.0.0
RuShiftEval1	RU	111	111/0/0	60	3	10k	WiC, LSCD (C)	1682–1916	1918–1990	Kutuzov and Pivovarova (2021)	2.0.0
RuShiftEval2	RU	111	111/0/0	60	3	10k	WiC, LSCD (C)	1918–1990	1991–2016	Kutuzov and Pivovarova (2021)	2.0.0
RuShiftEval3	RU	111	111/0/0	60	3	10k	WiC, LSCD (C)	1682–1916	1991–2016	Kutuzov and Pivovarova (2021)	2.0.0

LGS = language, n = number of target words, N/V/A = number of nouns/verbs/adjectives, |U| = average number of usages per word, AN = number of annotators, JUD = total number of judged usage pairs, Task = possible evaluation tasks, t₁, t₂ = time period 1/2, Reference = data set reference paper, Version = version used for experiments.
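As a rough illustration of how the columns can be used, the sketch below stores a few rows of the table in a small record type and filters them by task. The `DatasetEntry` class and the `supporting` helper are hypothetical names introduced for this example only; they are not part of the LSCDBenchmark API, and only four rows are copied from the table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DatasetEntry:
    """One table row (hypothetical record type, not a benchmark class)."""
    name: str        # "Data set" column
    language: str    # "LGS" column
    n_targets: int   # "n" column
    tasks: tuple     # "Task" column, without the (B,G,C) detail
    version: str     # "Version" column


# A subset of the rows above, copied by hand.
ENTRIES = [
    DatasetEntry("DWUG", "DE", 48, ("WiC", "WSI", "LSCD"), "2.2.0"),
    DatasetEntry("DWUG", "EN", 40, ("WiC", "WSI", "LSCD"), "2.0.1"),
    DatasetEntry("DURel", "DE", 22, ("WiC", "LSCD"), "3.0.0"),
    DatasetEntry("RuShiftEval1", "RU", 111, ("WiC", "LSCD"), "2.0.0"),
]


def supporting(task, entries=ENTRIES):
    """Return the entries whose Task column includes `task`."""
    return [e for e in entries if task in e.tasks]


# Data sets in this subset that can be used for WSI evaluation.
wsi = [(e.name, e.language) for e in supporting("WSI")]
print(wsi)
```

In this subset only the two DWUG rows support WSI, which matches the Task column of the table.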

WUGs#

Word Usage Graphs (WUGs) are graphs that display the relations between the usages of a word. Each usage of a word is a node, and nodes are connected by weighted edges. The edges depict the human-annotated semantic proximity of usage pairs. The WUG data sets can be found on the WUGsite.
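The node-and-edge structure described above can be sketched in a few lines of Python. This is a minimal illustration, not the format or code used by the WUG data sets: the usage identifiers, annotator labels and scores are made up, the 1 to 4 proximity scale is an assumption, and aggregating several annotators' judgments with the median is one possible choice.

```python
from collections import defaultdict
from statistics import median

# Hypothetical judgments: (usage_1, usage_2, annotator, proximity score).
# Scores are assumed to lie on a 1 (unrelated) to 4 (identical) scale.
judgments = [
    ("u1", "u2", "A", 4),
    ("u1", "u2", "B", 3),
    ("u1", "u3", "A", 1),
    ("u2", "u3", "A", 2),
    ("u2", "u3", "B", 1),
    ("u2", "u3", "C", 1),
]


def build_wug(judgments):
    """Map each unordered usage pair to the median of its judgments."""
    scores = defaultdict(list)
    for u, v, _annotator, score in judgments:
        # frozenset makes the edge undirected: (u, v) == (v, u).
        scores[frozenset((u, v))].append(score)
    return {pair: median(vals) for pair, vals in scores.items()}


wug = build_wug(judgments)
nodes = sorted({usage for pair in wug for usage in pair})

print(nodes)
for pair, weight in wug.items():
    print(sorted(pair), weight)
```

Each key of `wug` is an edge between two usages and each value is its weight; a clustering step for WSI or a change score for LSCD would operate on a graph of this kind.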

References#

[1] Dominik Schlechtweg, Nina Tahmasebi, Simon Hengchen, Haim Dubossarsky, and Barbara McGillivray. 2021. DWUG: A large Resource of Diachronic Word Usage Graphs in Four Languages. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing.

[2] Frank D. Zamora-Reina, Felipe Bravo-Marquez, and Dominik Schlechtweg. 2022. LSCDiscovery: A shared task on semantic change discovery and detection in Spanish. In Proceedings of the 3rd International Workshop on Computational Approaches to Historical Language Change.

[3] Sinan Kurtyigit, Maike Park, Dominik Schlechtweg, Jonas Kuhn, and Sabine Schulte im Walde. 2021. Lexical Semantic Change Discovery. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers).

[4] Andrey Kutuzov, Samia Touileb, Petter Mæhlum, Tita Enstad, and Alexandra Wittemann. 2022. NorDiaChange: Diachronic Semantic Change Dataset for Norwegian. In Proceedings of the Thirteenth Language Resources and Evaluation Conference.

[5] Dominik Schlechtweg, Sabine Schulte im Walde, and Stefanie Eckmann. 2018. Diachronic Usage Relatedness (DURel): A framework for the annotation of lexical semantic change. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 169–174.

[6] Anna Hätty, Dominik Schlechtweg, and Sabine Schulte im Walde. 2019. SURel: A Gold Standard for Incorporating Meaning Shifts into Term Extraction. In Proceedings of the 8th Joint Conference on Lexical and Computational Semantics, pages 1–8.

[7] Julia Rodina and Andrey Kutuzov. 2020. RuSemShift: a dataset of historical lexical semantic change in Russian. In Proceedings of the 28th International Conference on Computational Linguistics (COLING 2020). Association for Computational Linguistics.

[8] Andrey Kutuzov and Lidia Pivovarova. 2021. RuShiftEval: a shared task on semantic shift detection for Russian. Komp’yuternaya Lingvistika i Intellektual’nye Tekhnologii: Dialog conference.
