src.lemma

Module Contents

Classes

RandomSampling

UsePairOptions

Lemma

Class representing one lemma in a DWUG-like dataset

Attributes

Group

Sample

class src.lemma.RandomSampling

Bases: pydantic.BaseModel

n: int
replace: bool
src.lemma.Group
src.lemma.Sample
class src.lemma.UsePairOptions

Bases: pydantic.BaseModel

group: Group
sample: Sample
class src.lemma.Lemma

Bases: pydantic.BaseModel

Class representing one lemma in a DWUG-like dataset (i.e., one of the words represented as folders in the data/ directory)

property name: str

The name of the lemma, derived from the instance's path.

property uses_df: pandas.DataFrame

Cached property that collects the corresponding uses.csv files and preprocesses each use according to the provided configuration.

Returns:

The preprocessed DataFrame of uses for the corresponding lemma

Return type:

DataFrame

property uses_schema: pandera.DataFrameSchema
property annotated_pairs_df: pandas.DataFrame

Property that collects the annotated pairs of the corresponding lemma from its judgments.csv file. It performs validation based on annotated_pairs_schema.

Returns:

A DataFrame containing two columns (identifier1, identifier2)

Return type:

DataFrame

property augmented_annotated_pairs_df: pandas.DataFrame

A version of annotated_pairs_df that incorporates grouping information. The base annotated_pairs_df is expanded with the grouping of each of the two identifiers in each row.

Returns:

The expanded DataFrame

Return type:

DataFrame
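As a rough illustration of the expansion described above, the sketch below attaches a grouping to each identifier of a pair. It uses plain dictionaries instead of pandas, and the function name `augment_pairs`, the output keys `grouping1` and `grouping2`, and the sample identifiers and grouping labels are all assumptions made for illustration, not the module's actual column names.

```python
def augment_pairs(pairs, useid_to_grouping):
    """Attach the grouping of each identifier to every annotated pair.

    `pairs` is a list of (identifier1, identifier2) tuples and
    `useid_to_grouping` maps a use identifier to its grouping.
    The output keys `grouping1`/`grouping2` are hypothetical names.
    """
    augmented = []
    for id1, id2 in pairs:
        augmented.append(
            {
                "identifier1": id1,
                "identifier2": id2,
                "grouping1": useid_to_grouping[id1],
                "grouping2": useid_to_grouping[id2],
            }
        )
    return augmented


# Made-up identifiers and grouping labels, for illustration only.
groupings = {"use_a": "1", "use_b": "1", "use_c": "2"}
rows = augment_pairs([("use_a", "use_c"), ("use_a", "use_b")], groupings)
```

In a pandas setting the same result could presumably be obtained by mapping each identifier column through the dictionary.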

property annotated_pairs_schema: pandera.DataFrameSchema

Schema for validating that a judgments.csv file contains two columns (identifier1, identifier2)

Returns:

The schema

Return type:

DataFrameSchema
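The kind of check this schema performs can be sketched with the standard library alone. This is a stand-in for the pandera schema, not its implementation: the helper `read_annotated_pairs`, the tab delimiter, and the sample file contents are assumptions.

```python
import csv
import io

REQUIRED_COLUMNS = ("identifier1", "identifier2")


def read_annotated_pairs(text, delimiter="\t"):
    """Return (identifier1, identifier2) tuples, or raise if a column is missing."""
    reader = csv.DictReader(io.StringIO(text), delimiter=delimiter)
    header = reader.fieldnames or []
    missing = [column for column in REQUIRED_COLUMNS if column not in header]
    if missing:
        raise ValueError(f"missing required columns: {missing}")
    return [(row["identifier1"], row["identifier2"]) for row in reader]


# Made-up file contents; this sketch simply ignores any extra columns.
sample = "identifier1\tidentifier2\tjudgment\nuse_a\tuse_c\t4\n"
pairs = read_annotated_pairs(sample)
```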

property predefined_use_pairs_df: pandas.DataFrame
property augmented_predefined_use_pairs_df: pandas.DataFrame
groupings: tuple[str, str]

Each DWUG dataset contains word usages from multiple groups. Most datasets have exactly two groups, which represent time periods. Others have more than two, in which case the groups represent regional variations.

path: pydantic.DirectoryPath

The path to the directory containing the corresponding lemma within its dataset. Must be a valid existing directory.

preprocessing: src.preprocessing.ContextPreprocessor

A context preprocessing strategy

_uses_df: pandas.DataFrame
_annotated_pairs_df: pandas.DataFrame
_augmented_annotated_pairs: pandas.DataFrame
_predefined_use_pairs_df: pandas.DataFrame
_augmented_predefined_use_pairs_df: pandas.DataFrame
_clusters_df: pandas.DataFrame
useid_to_grouping() → Dict[src.use.UseID, str]

Method to generate a dictionary from use identifiers to their corresponding groupings

Returns:

A dictionary from use identifiers to use groupings

Return type:

Dict[UseID, str]

grouping_to_useid() → dict[str, list[src.use.UseID]]

Method to generate a dictionary from use groupings to a list of use identifiers corresponding to that grouping

Returns:

A dictionary from groupings to lists of use identifiers

Return type:

dict[str, list[UseID]]
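The two mappings described above are inverses of each other. A minimal sketch of deriving the grouping-to-identifiers mapping from the identifier-to-grouping one follows. The helper name `invert_groupings` and the sample data are hypothetical, and plain strings stand in for `UseID`.

```python
from collections import defaultdict


def invert_groupings(useid_to_grouping):
    """Build grouping -> list of use identifiers from use identifier -> grouping."""
    grouping_to_useid = defaultdict(list)
    for use_id, grouping in useid_to_grouping.items():
        grouping_to_useid[grouping].append(use_id)
    return dict(grouping_to_useid)


# Made-up identifiers and grouping labels, for illustration only.
mapping = {"use_a": "1", "use_b": "1", "use_c": "2"}
inverse = invert_groupings(mapping)
```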

_split_compare_uses() → tuple[list[src.use.UseID], list[src.use.UseID]]
_split_earlier_uses() → tuple[list[src.use.UseID], list[src.use.UseID]]
_split_later_uses() → tuple[list[src.use.UseID], list[src.use.UseID]]
split_uses(group: Group) → tuple[list[src.use.UseID], list[src.use.UseID]]

Splits the uses of a lemma into two separate lists of use identifiers, according to the given pairing strategy

Parameters:

group (Group) – A pairing strategy

Returns:

Two lists of use identifiers, one for each side of the pairing

Return type:

tuple[list[UseID], list[UseID]]
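The sketch below shows one way such a split could work. The values of `Group` are not documented on this page, so the labels "COMPARE", "EARLIER", and "LATER" (inferred from the private helper names) and their semantics are assumptions. `split_uses_sketch` is a hypothetical standalone function, not the method itself.

```python
def split_uses_sketch(group, grouping_to_useid, groupings):
    """Return two lists of use identifiers for the given pairing strategy.

    Assumed semantics (not confirmed by the source):
      "COMPARE": first grouping vs. second grouping
      "EARLIER": first grouping vs. itself
      "LATER":   second grouping vs. itself
    """
    earlier, later = groupings
    if group == "COMPARE":
        return grouping_to_useid[earlier], grouping_to_useid[later]
    if group == "EARLIER":
        return grouping_to_useid[earlier], grouping_to_useid[earlier]
    if group == "LATER":
        return grouping_to_useid[later], grouping_to_useid[later]
    raise ValueError(f"unknown group: {group!r}")


# Made-up identifiers and grouping labels, for illustration only.
by_grouping = {"1": ["use_a", "use_b"], "2": ["use_c"]}
left, right = split_uses_sketch("COMPARE", by_grouping, ("1", "2"))
```

Under these assumptions, use pairs would then be drawn from the two returned lists.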

get_uses() → list[src.use.Use]
use_pairs(group: Group, sample: Sample) → list[tuple[src.use.Use, src.use.Use]]
_split_augmented_uses(group: Group, augmented_uses: pandas.DataFrame) → tuple[list[src.use.UseID], list[src.use.UseID]]