:py:mod:`src.preprocessing`
===========================

.. py:module:: src.preprocessing


Module Contents
---------------

Classes
~~~~~~~

.. autoapisummary::

   src.preprocessing.ContextPreprocessor
   src.preprocessing.Toklem
   src.preprocessing.Raw
   src.preprocessing.Lemmatize
   src.preprocessing.Tokenize
   src.preprocessing.Normalize


Attributes
~~~~~~~~~~

.. autoapisummary::

   src.preprocessing.log


.. py:data:: log

   
.. py:class:: ContextPreprocessor(**data)


   Bases: :py:obj:`pydantic.BaseModel`

   Base class for all kinds of context preprocessing strategies


   .. py:attribute:: spelling_normalization
      :type: dict[str, str] | None

      Dictionary of substring replacements to apply on the contexts


   .. py:method:: start_char_index(token_index: int, tokens: list[str]) -> int
      :staticmethod:

      Finds the index of the first character of the target token, i.e. ``tokens[token_index]``

      :param token_index: the index of the target word in the list of tokens
      :type token_index: int
      :param tokens: the list of tokens
      :type tokens: list[str]
      :raises ValueError: If the token is not found
      :return: the start character index of the target word
      :rtype: int


   .. py:method:: normalize_spelling(context: str, start: int) -> tuple[str, int]

      Applies the preprocessor's spelling normalization table to the context and
      returns the new start character index of the target word after all modifications

      :param context: Context sentence of the target word
      :type context: str
      :param start: Start character index of the target word
      :type start: int
      :return: A tuple consisting of the modified string and the new start character index
      :rtype: tuple[str, int]


   .. py:method:: fields_from_series(s: pandas.Series) -> dict[str, Any]
      :abstractmethod:

      Selects fields from a pandas Series

      :param s: A row in a uses.csv file
      :type s: Series
      :raises NotImplementedError: If a subclass does not override this method
      :return: A dictionary of parameters relevant to pass to the preprocess function
      :rtype: dict[str, Any]


   .. py:method:: preprocess(*args, **kwargs) -> tuple[str, int, int]
      :abstractmethod:


   .. py:method:: __call__(s: pandas.Series) -> pandas.Series

      Applies the preprocessing strategy based on a pandas.Series from a uses.csv file

      :param s: A row in a uses.csv file
      :type s: Series
      :return: A Series holding the result of applying the preprocessing strategy to the row
      :rtype: Series
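
The two helper methods of ``ContextPreprocessor`` can be illustrated with a minimal standalone sketch. It is not the module's implementation: it assumes that tokens are joined by single spaces, that the normalization table is applied one entry at a time with ``str.replace``, and that no replaced substring overlaps the start of the target word. The ``table`` parameter stands in for the ``spelling_normalization`` attribute.

```python
def start_char_index(token_index: int, tokens: list[str]) -> int:
    """Character offset of tokens[token_index] in " ".join(tokens)."""
    if not 0 <= token_index < len(tokens):
        raise ValueError(f"token index {token_index} is out of range")
    # Every preceding token contributes its length plus one separating space.
    return sum(len(token) + 1 for token in tokens[:token_index])


def normalize_spelling(
    context: str, start: int, table: dict[str, str]
) -> tuple[str, int]:
    """Apply each replacement and shift `start` for edits left of the target."""
    for old, new in table.items():
        if not old:
            continue  # an empty pattern would match at every position
        # Occurrences that lie entirely before the target move its start.
        n_before = context[:start].count(old)
        context = context.replace(old, new)
        start += n_before * (len(new) - len(old))
    return context, start


tokens = ["Der", "Thurm", "steht"]
print(start_char_index(1, tokens))  # offset of "Thurm"
print(normalize_spelling("daß der Thurm", 8, {"ß": "ss"}))
```

Tracking the offset during replacement matters because every substitution to the left of the target word that changes the string length would otherwise leave ``start`` pointing at the wrong character.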


.. py:class:: Toklem(**data)


   Bases: :py:obj:`ContextPreprocessor`

   Concrete context preprocessing strategy derived from :py:class:`ContextPreprocessor`.
   Its ``preprocess`` method additionally takes the lemma of the target word.

   .. py:method:: fields_from_series(s: pandas.Series) -> dict[str, str | int]

      Selects fields from a pandas Series

      :param s: A row in a uses.csv file
      :type s: Series
      :return: A dictionary of parameters relevant to pass to the preprocess function
      :rtype: dict[str, str | int]


   .. py:method:: preprocess(context: str, index: int, lemma: str) -> tuple[str, int, int]

      Applies the preprocessing strategy in a standalone manner

      :param context: The context sentence of the target word
      :type context: str
      :param index: The start character index of the target word
      :type index: int
      :param lemma: The lemma of the target word
      :type lemma: str
      :return: A tuple consisting of the modified string, and the start and end character indices of the target word
      :rtype: tuple[str, int, int]


.. py:class:: Raw(**data)


   Bases: :py:obj:`ContextPreprocessor`

   Context preprocessing strategy that returns the context unmodified.

   .. py:method:: fields_from_series(s: pandas.Series) -> dict[str, str | int]

      Selects fields from a pandas Series

      :param s: A row in a uses.csv file
      :type s: Series
      :return: A dictionary of parameters relevant to pass to the preprocess function
      :rtype: dict[str, str | int]


   .. py:method:: preprocess(context: str, start: int, end: int) -> tuple[str, int, int]

      Returns the unmodified context and the character indices of the target word

      :param context: The context sentence of the target word
      :type context: str
      :param start: The start character index of the target word
      :type start: int
      :param end: The end character index of the target word
      :type end: int
      :return: A tuple consisting of the unmodified string, and the start and end character indices of the target word
      :rtype: tuple[str, int, int]
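
How ``__call__``, ``fields_from_series`` and ``preprocess`` might fit together can be sketched with an identity strategy in the spirit of ``Raw``. This is an illustration under assumptions, not the module's code: a plain ``dict`` stands in for the ``pandas.Series`` row, the pydantic base class is replaced by ``abc.ABC``, and all column names and output keys (``context``, ``target_start``, ``target_end``, ``context_preprocessed``, ``target_index_begin``, ``target_index_end``) are hypothetical.

```python
from abc import ABC, abstractmethod
from typing import Any


class PreprocessorSketch(ABC):
    """Stand-in for ContextPreprocessor without the pydantic dependency."""

    @abstractmethod
    def fields_from_series(self, s: dict[str, Any]) -> dict[str, Any]:
        """Pick the row fields that `preprocess` needs."""

    @abstractmethod
    def preprocess(self, *args: Any, **kwargs: Any) -> tuple[str, int, int]:
        """Return the processed context and the target's start and end."""

    def __call__(self, s: dict[str, Any]) -> dict[str, Any]:
        # Select the relevant fields, then hand them to the strategy.
        context, start, end = self.preprocess(**self.fields_from_series(s))
        # Output keys are hypothetical.
        return {
            "context_preprocessed": context,
            "target_index_begin": start,
            "target_index_end": end,
        }


class RawSketch(PreprocessorSketch):
    def fields_from_series(self, s: dict[str, Any]) -> dict[str, Any]:
        # Hypothetical column names for a row of uses.csv.
        return {
            "context": s["context"],
            "start": s["target_start"],
            "end": s["target_end"],
        }

    def preprocess(self, context: str, start: int, end: int) -> tuple[str, int, int]:
        # Identity strategy: nothing is modified.
        return context, start, end


row = {"context": "The bank was steep", "target_start": 4, "target_end": 8}
print(RawSketch()(row))
```

Keeping field selection separate from the transformation lets each strategy declare which columns it needs while ``preprocess`` stays callable on plain values.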


.. py:class:: Lemmatize(**data)


   Bases: :py:obj:`ContextPreprocessor`

   Concrete context preprocessing strategy derived from :py:class:`ContextPreprocessor`.

   .. py:method:: fields_from_series(s: pandas.Series) -> dict[str, str | int]

      Selects fields from a pandas Series

      :param s: A row in a uses.csv file
      :type s: Series
      :return: A dictionary of parameters relevant to pass to the preprocess function
      :rtype: dict[str, str | int]


   .. py:method:: preprocess(context: str, index: int) -> tuple[str, int, int]

      Applies the preprocessing strategy in a standalone manner

      :param context: The context sentence of the target word
      :type context: str
      :param index: The start character index of the target word
      :type index: int
      :return: A tuple consisting of the modified string, and the start and end character indices of the target word
      :rtype: tuple[str, int, int]


.. py:class:: Tokenize(**data)


   Bases: :py:obj:`ContextPreprocessor`

   Concrete context preprocessing strategy derived from :py:class:`ContextPreprocessor`.

   .. py:method:: fields_from_series(s: pandas.Series) -> dict[str, str | int]

      Selects fields from a pandas Series

      :param s: A row in a uses.csv file
      :type s: Series
      :return: A dictionary of parameters relevant to pass to the preprocess function
      :rtype: dict[str, str | int]


   .. py:method:: preprocess(context: str, index: int) -> tuple[str, int, int]

      Applies the preprocessing strategy in a standalone manner

      :param context: The context sentence of the target word
      :type context: str
      :param index: The start character index of the target word
      :type index: int
      :return: A tuple consisting of the modified string, and the start and end character indices of the target word
      :rtype: tuple[str, int, int]


.. py:class:: Normalize(**data)


   Bases: :py:obj:`ContextPreprocessor`

   Context preprocessing strategy that works with pre-normalized contexts and falls back
   to the column named by ``default`` when a use does not contain one.

   .. py:attribute:: default
      :type: str

      Column to extract from a Series if a given use does not contain a pre-normalized context


   .. py:method:: fields_from_series(s: pandas.Series) -> dict[str, str | int]

      Selects fields from a pandas Series

      :param s: A row in a uses.csv file
      :type s: Series
      :return: A dictionary of parameters relevant to pass to the preprocess function
      :rtype: dict[str, str | int]


   .. py:method:: preprocess(context: str, index: int) -> tuple[str, int, int]

      Applies the preprocessing strategy in a standalone manner

      :param context: The context sentence of the target word
      :type context: str
      :param index: The start character index of the target word
      :type index: int
      :return: A tuple consisting of the modified string, and the start and end character indices of the target word
      :rtype: tuple[str, int, int]
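
The fallback described for the ``default`` attribute of ``Normalize`` could look roughly like the sketch below. It is only an illustration: the helper ``pick_context_column`` and the column name ``context_normalized`` are hypothetical, a plain ``dict`` stands in for the ``pandas.Series`` row, and a missing value is assumed to show up as an absent key, ``None``, an empty string, or ``NaN`` (which is how pandas reads an empty CSV cell by default).

```python
import math
from typing import Any


def pick_context_column(
    row: dict[str, Any], default: str, normalized: str = "context_normalized"
) -> str:
    """Return the name of the column to read the context from."""
    value = row.get(normalized)
    # A use has no pre-normalized context if the cell is absent or empty.
    missing = (
        value is None
        or value == ""
        or (isinstance(value, float) and math.isnan(value))
    )
    return default if missing else normalized


with_norm = {"context": "daß der Thurm", "context_normalized": "dass der Turm"}
without_norm = {"context": "daß der Thurm", "context_normalized": float("nan")}
print(pick_context_column(with_norm, default="context"))
print(pick_context_column(without_norm, default="context"))
```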