Datasets in LSCDBenchmark

The following table lists all the data sets integrated into the benchmark.

| Data set | LGS | n | N/V/A | \|U\| | AN | JUD | Task | t1 | t2 | Reference | Version |
|---|---|---|---|---|---|---|---|---|---|---|---|
| DWUG | DE | 48 | 32/14/2 | 178 | 8 | 37k | WiC, WSI, LSCD (B,G,C) | 1800–1899 | 1946–1990 | Schlechtweg et al. (2021) | 2.2.0 |
| DWUG | EN | 40 | 36/4/0 | 189 | 9 | 29k | WiC, WSI, LSCD (B,G,C) | 1810–1860 | 1960–2010 | Schlechtweg et al. (2021) | 2.0.1 |
| DWUG | SV | 40 | 31/6/3 | 168 | 5 | 20k | WiC, WSI, LSCD (B,G,C) | 1790–1830 | 1895–1903 | Schlechtweg et al. (2021) | 2.0.1 |
| DWUG | ES | 100 | 51/24/25 | 40 | 12 | 62k | WiC, WSI, LSCD (B,G,C) | 1810–1906 | 1994–2020 | Zamora-Reina et al. (2022) | 4.0.0 |
| DiscoWUG | DE | 75 | 39/16/20 | 49 | 8 | 24k | WiC, WSI, LSCD (B,G,C) | 1800–1899 | 1946–1990 | Kurtyigit et al. (2021) | 1.1.1 |
| RefWUG | DE | 22 | 15/1/6 | 19 | 5 | 4k | WiC, WSI, LSCD (B,G,C) | 1750–1800 | 1850–1900 | ? | 1.1.0 |
| NorDiaChange1 | NO | 40 | 40/0/0 | 21 | 3 | 14k | WiC, WSI, LSCD (B,G,C) | 1929–1965 | 1970–2013 | Kutuzov et al. (2022) | 1.0.0 |
| NorDiaChange2 | NO | 40 | 40/0/0 | 21 | 3 | 15k | WiC, WSI, LSCD (B,G,C) | 1980–1990 | 2012–2019 | Kutuzov et al. (2022) | 1.0.0 |
| DURel | DE | 22 | 15/1/6 | 104 | 5 | 6k | WiC, LSCD (C) | 1750–1800 | 1850–1900 | Schlechtweg et al. (2018) | 3.0.0 |
| SURel | DE | 22 | 19/3/0 | 104 | 4 | 5k | WiC, LSCD (C) | general | domain | Hätty et al. (2019) | 3.0.0 |
| RuSemShift1 | RU | 71 | 65/6/0 | 119 | 5 | 21k | WiC, LSCD (C) | 1682–1916 | 1918–1990 | Rodina and Kutuzov (2020) | 2.0.0 |
| RuSemShift2 | RU | 69 | 57/12/0 | 105 | 5 | 18k | WiC, LSCD (C) | 1918–1990 | 1991–2016 | Rodina and Kutuzov (2020) | 2.0.0 |
| RuShiftEval1 | RU | 111 | 111/0/0 | 60 | 3 | 10k | WiC, LSCD (C) | 1682–1916 | 1918–1990 | Kutuzov and Pivovarova (2021) | 2.0.0 |
| RuShiftEval2 | RU | 111 | 111/0/0 | 60 | 3 | 10k | WiC, LSCD (C) | 1918–1990 | 1991–2016 | Kutuzov and Pivovarova (2021) | 2.0.0 |
| RuShiftEval3 | RU | 111 | 111/0/0 | 60 | 3 | 10k | WiC, LSCD (C) | 1682–1916 | 1991–2016 | Kutuzov and Pivovarova (2021) | 2.0.0 |

LGS = language, n = number of target words, N/V/A = number of nouns/verbs/adjectives, |U| = average number of usages per word, AN = number of annotators, JUD = total number of judged usage pairs, Task = possible evaluation tasks, t1, t2 = time period 1/2, Reference = data set reference paper, Version = version used for experiments.
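Read together, the definitions of n and N/V/A suggest that the three part-of-speech counts in a row should add up to its number of target words. The following is a minimal sketch of that check, using a few rows copied from the table above. The tuple layout and the helper names `pos_counts` and `is_consistent` are hypothetical and not part of LSCDBenchmark.

```python
# A few rows copied from the table: (data set, language, n, "N/V/A").
rows = [
    ("DWUG", "DE", 48, "32/14/2"),
    ("DWUG", "ES", 100, "51/24/25"),
    ("DiscoWUG", "DE", 75, "39/16/20"),
    ("RuSemShift2", "RU", 69, "57/12/0"),
]


def pos_counts(nva):
    """Split an 'N/V/A' cell into (nouns, verbs, adjectives) as integers."""
    nouns, verbs, adjectives = (int(part) for part in nva.split("/"))
    return nouns, verbs, adjectives


def is_consistent(n, nva):
    """Return True if the part-of-speech counts sum to the number of target words."""
    return sum(pos_counts(nva)) == n


# Map each (data set, language) pair to the result of the check.
checks = {(name, lang): is_consistent(n, nva) for name, lang, n, nva in rows}

for key, ok in checks.items():
    print(key, "consistent" if ok else "inconsistent")
```

A check of this kind can be useful when a new data set version is added to the table and the counts are updated by hand.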

WUGs

Word Usage Graphs (WUGs) are graphs that represent the relations between the usages of a word. Each usage of a word is a node, and nodes are connected by weighted edges. The edges depict the human-annotated semantic proximity of usage pairs. The WUG data sets can be found on the WUGsite.
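The structure described above can be sketched with plain Python dictionaries. This is an illustration only, not the format used by the WUG data sets or by LSCDBenchmark: the usage identifiers, example sentences, and judgment values are invented, the 1 to 4 proximity scale is an assumption, and pooling several annotators' judgments by their median is likewise an assumed choice made for this sketch.

```python
from collections import defaultdict
from statistics import median

# Hypothetical usages of the target word "plane" (identifiers and sentences are invented).
usages = {
    "u1": "The plane landed at noon.",
    "u2": "She boarded the plane to Oslo.",
    "u3": "The carpenter smoothed the board with a plane.",
}

# Hypothetical annotator judgments of semantic proximity for usage pairs
# (assumed scale: 1 = unrelated ... 4 = identical).
judgments = [
    ("u1", "u2", 4),
    ("u1", "u2", 3),
    ("u1", "u3", 1),
    ("u2", "u3", 1),
    ("u2", "u3", 2),
]


def build_wug(usages, judgments):
    """Return an undirected weighted graph as {node: {neighbor: weight}}."""
    # Collect all judgments per unordered usage pair.
    pooled = defaultdict(list)
    for a, b, score in judgments:
        pooled[tuple(sorted((a, b)))].append(score)

    # Every usage is a node, even if it has no judged pair yet.
    graph = {node: {} for node in usages}
    for (a, b), scores in pooled.items():
        weight = median(scores)  # assumed aggregation over annotators
        graph[a][b] = weight
        graph[b][a] = weight
    return graph


wug = build_wug(usages, judgments)
for node, neighbors in wug.items():
    print(node, neighbors)
```

In this toy graph, the two aircraft usages are joined by a high-weight edge, while the tool usage is joined to both by low-weight edges.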

References

[1] Dominik Schlechtweg, Nina Tahmasebi, Simon Hengchen, Haim Dubossarsky, and Barbara McGillivray. 2021. DWUG: A large Resource of Diachronic Word Usage Graphs in Four Languages. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing.

[2] Frank D. Zamora-Reina, Felipe Bravo-Marquez, and Dominik Schlechtweg. 2022. LSCDiscovery: A shared task on semantic change discovery and detection in Spanish. In Proceedings of the 3rd International Workshop on Computational Approaches to Historical Language Change.

[3] Sinan Kurtyigit, Maike Park, Dominik Schlechtweg, Jonas Kuhn, and Sabine Schulte im Walde. 2021. Lexical Semantic Change Discovery. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers).

[4] Andrey Kutuzov, Samia Touileb, Petter Mæhlum, Tita Enstad, and Alexandra Wittemann. 2022. NorDiaChange: Diachronic Semantic Change Dataset for Norwegian. In Proceedings of the Thirteenth Language Resources and Evaluation Conference.

[5] Dominik Schlechtweg, Sabine Schulte im Walde, and Stefanie Eckmann. 2018. Diachronic Usage Relatedness (DURel): A framework for the annotation of lexical semantic change. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 169–174.

[6] Anna Hätty, Dominik Schlechtweg, and Sabine Schulte im Walde. 2019. SURel: A Gold Standard for Incorporating Meaning Shifts into Term Extraction. In Proceedings of the 8th Joint Conference on Lexical and Computational Semantics, pages 1–8.

[7] Julia Rodina and Andrey Kutuzov. 2020. RuSemShift: a dataset of historical lexical semantic change in Russian. In Proceedings of the 28th International Conference on Computational Linguistics (COLING 2020). Association for Computational Linguistics.

[8] Andrey Kutuzov and Lidia Pivovarova. 2021. RuShiftEval: a shared task on semantic shift detection for Russian. Komp’yuternaya Lingvistika i Intellektual’nye Tekhnologii: Dialog conference.