:py:mod:`src.lemma`
===================

.. py:module:: src.lemma


Module Contents
---------------

Classes
~~~~~~~

.. autoapisummary::

   src.lemma.RandomSampling
   src.lemma.UsePairOptions
   src.lemma.Lemma


Attributes
~~~~~~~~~~

.. autoapisummary::

   src.lemma.Group
   src.lemma.Sample


.. py:class:: RandomSampling


   Bases: :py:obj:`pydantic.BaseModel`

   .. py:attribute:: n
      :type: int

      
   .. py:attribute:: replace
      :type: bool

      
.. py:data:: Group

   
.. py:data:: Sample

   
.. py:class:: UsePairOptions


   Bases: :py:obj:`pydantic.BaseModel`

   .. py:attribute:: group
      :type: Group

      
   .. py:attribute:: sample
      :type: Sample

      
.. py:class:: Lemma


   Bases: :py:obj:`pydantic.BaseModel`

   Represents one lemma in a DWUG-like dataset,
   i.e., one of the words stored as a folder in the ``data/`` directory.

   .. py:property:: name
      :type: str

      The name of the lemma, derived from the instance's :attr:`path`.


   .. py:property:: uses_df
      :type: pandas.DataFrame

      Cached property that collects the corresponding ``uses.csv`` files
      and preprocesses each use according to the provided configuration.

      :return: The preprocessed DataFrame of uses for the corresponding lemma
      :rtype: DataFrame


   .. py:property:: uses_schema
      :type: pandera.DataFrameSchema


   .. py:property:: annotated_pairs_df
      :type: pandas.DataFrame

      Property that collects the annotated pairs of the corresponding lemma
      from its ``judgments.csv`` file. The result is validated against :attr:`annotated_pairs_schema`.

      :return: A DataFrame containing two columns (identifier1, identifier2)
      :rtype: DataFrame


   .. py:property:: augmented_annotated_pairs_df
      :type: pandas.DataFrame

      A version of :attr:`annotated_pairs_df` that incorporates grouping information.
      The base :attr:`annotated_pairs_df` is expanded with the grouping of each identifier in each row.

      :return: The expanded DataFrame
      :rtype: DataFrame


   .. py:property:: annotated_pairs_schema
      :type: pandera.DataFrameSchema

      Schema for validating that a ``judgments.csv`` file contains two columns (identifier1, identifier2).


      :return: The schema
      :rtype: DataFrameSchema


   .. py:property:: predefined_use_pairs_df
      :type: pandas.DataFrame


   .. py:property:: augmented_predefined_use_pairs_df
      :type: pandas.DataFrame


   .. py:attribute:: groupings
      :type: tuple[str, str]

      Each of the DWUG datasets consists of word usages from multiple groups.
      In most cases, there are only two, which represent time periods. In other
      datasets, there are more than two, in which case they represent regional variations.


   .. py:attribute:: path
      :type: pydantic.DirectoryPath

      The path to the directory containing the corresponding lemma within its dataset.
      Must be a valid existing directory.


   .. py:attribute:: preprocessing
      :type: src.preprocessing.ContextPreprocessor

      A context preprocessing strategy


   .. py:attribute:: _uses_df
      :type: pandas.DataFrame

      
   .. py:attribute:: _annotated_pairs_df
      :type: pandas.DataFrame

      
   .. py:attribute:: _augmented_annotated_pairs
      :type: pandas.DataFrame

      
   .. py:attribute:: _predefined_use_pairs_df
      :type: pandas.DataFrame

      
   .. py:attribute:: _augmented_predefined_use_pairs_df
      :type: pandas.DataFrame

      
   .. py:attribute:: _clusters_df
      :type: pandas.DataFrame

      
   .. py:method:: useid_to_grouping() -> Dict[src.use.UseID, str]

      Builds a dictionary that maps each use identifier to its corresponding grouping.

      :return: A dictionary from use identifiers to use groupings
      :rtype: Dict[UseID, str]


   .. py:method:: grouping_to_useid() -> dict[str, list[src.use.UseID]]

      Builds a dictionary that maps each use grouping to the
      list of use identifiers belonging to that grouping.

      :return: A dictionary from groupings to lists of use identifiers
      :rtype: dict[str, list[UseID]]


   .. py:method:: _split_compare_uses() -> tuple[list[src.use.UseID], list[src.use.UseID]]


   .. py:method:: _split_earlier_uses() -> tuple[list[src.use.UseID], list[src.use.UseID]]


   .. py:method:: _split_later_uses() -> tuple[list[src.use.UseID], list[src.use.UseID]]


   .. py:method:: split_uses(group: Group) -> tuple[list[src.use.UseID], list[src.use.UseID]]

      Splits the uses of a lemma into two separate lists of use identifiers, according to ``group``.

      :param group: A pairing strategy
      :type group: Group
      :return: Two lists of use identifiers, one for each side of the split
      :rtype: tuple[list[UseID], list[UseID]]


   .. py:method:: get_uses() -> list[src.use.Use]


   .. py:method:: use_pairs(group: Group, sample: Sample) -> list[tuple[src.use.Use, src.use.Use]]


   .. py:method:: _split_augmented_uses(group: Group, augmented_uses: pandas.DataFrame) -> tuple[list[src.use.UseID], list[src.use.UseID]]
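
The two mapping methods documented above, ``useid_to_grouping`` and ``grouping_to_useid``, describe inverse views of the same information, and ``augmented_annotated_pairs_df`` attaches a grouping to each identifier of a pair. The following is a minimal sketch of that relationship using plain dictionaries and tuples rather than the ``Lemma`` class itself. The identifiers, grouping labels, and the helper names ``invert_grouping`` and ``augment_pairs`` are hypothetical and are not part of ``src.lemma``; the actual implementation may differ.

```python
from collections import defaultdict

# Hypothetical data: use identifiers mapped to the grouping they belong to.
useid_to_grouping = {
    "use-001": "1",
    "use-002": "1",
    "use-003": "2",
}


def invert_grouping(mapping: dict[str, str]) -> dict[str, list[str]]:
    """Invert a use-ID -> grouping mapping into grouping -> list of use IDs."""
    inverted: dict[str, list[str]] = defaultdict(list)
    for use_id, grouping in mapping.items():
        inverted[grouping].append(use_id)
    return dict(inverted)


def augment_pairs(
    pairs: list[tuple[str, str]], mapping: dict[str, str]
) -> list[tuple[str, str, str, str]]:
    """Attach the grouping of each identifier to every annotated pair."""
    return [(id1, id2, mapping[id1], mapping[id2]) for id1, id2 in pairs]


grouping_to_useid = invert_grouping(useid_to_grouping)

# Hypothetical annotated pairs (identifier1, identifier2).
annotated_pairs = [("use-001", "use-003"), ("use-001", "use-002")]
augmented = augment_pairs(annotated_pairs, useid_to_grouping)
```

In this sketch, a pair whose two groupings differ crosses groups, while a pair with equal groupings stays within one group. This is the kind of distinction a pairing strategy passed to ``split_uses`` could rely on, although the exact semantics of ``Group`` are not specified here.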