invert4geom.cross_validation
============================

.. py:module:: invert4geom.cross_validation


Functions
---------

.. autoapisummary::

   invert4geom.cross_validation.resample_with_test_points
   invert4geom.cross_validation.grav_cv_score
   invert4geom.cross_validation.grav_optimal_parameter
   invert4geom.cross_validation.constraints_cv_score
   invert4geom.cross_validation.zref_density_optimal_parameter
   invert4geom.cross_validation.random_split_test_train
   invert4geom.cross_validation.split_test_train
   invert4geom.cross_validation.kfold_df_to_lists
   invert4geom.cross_validation.eq_sources_score
   invert4geom.cross_validation.regional_separation_score


Module Contents
---------------

.. py:function:: resample_with_test_points(data_spacing, data, region)

   Take a dataframe of coordinates and label all rows that fall on the data_spacing
   grid as training points. Add rows at each point that falls on a grid of half the
   data_spacing and label these as testing points. If other data columns are present
   in the dataframe, they are sampled at each new location.

   :param data_spacing: full spacing size which will be halved
   :type data_spacing: float
   :param data: dataframe with coordinate columns "easting" and "northing", all other columns
                will be sampled at new grid spacing
   :type data: pandas.DataFrame
   :param region: region to create grid over, in the form (min_easting, max_easting, min_northing,
                  max_northing)
   :type region: tuple[float, float, float, float]

   :returns: a new dataframe with new column "test" of booleans which shows whether each row
             is a testing or training point.
   :rtype: pandas.DataFrame
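
   The half-spacing labelling described above can be sketched with NumPy and pandas
   alone. This is only an illustration of the idea, not the library's implementation:
   the helper name ``label_half_spacing_grid`` is hypothetical, and it only builds the
   labelled grid without sampling any other data columns.

   .. code-block:: python

      import numpy as np
      import pandas as pd


      def label_half_spacing_grid(data_spacing, region):
          """Build a grid at half of data_spacing and flag the added nodes as test points."""
          half = data_spacing / 2
          west, east, south, north = region
          # number of half-spacing nodes along each axis
          n_east = int(round((east - west) / half)) + 1
          n_north = int(round((north - south) / half)) + 1
          # integer node indices avoid floating point comparisons of coordinates
          i_east, i_north = np.meshgrid(np.arange(n_east), np.arange(n_north))
          i_east, i_north = i_east.ravel(), i_north.ravel()
          df = pd.DataFrame(
              {
                  "easting": west + i_east * half,
                  "northing": south + i_north * half,
              }
          )
          # nodes with even indices on both axes lie on the original full-spacing grid
          on_full_grid = (i_east % 2 == 0) & (i_north % 2 == 0)
          df["test"] = ~on_full_grid
          return df


      grid = label_half_spacing_grid(data_spacing=2.0, region=(0.0, 4.0, 0.0, 4.0))

   With a spacing of 2 over a 4 by 4 region, the sketch yields 25 nodes, of which the 9
   nodes on the original grid are training points and the remaining 16 are testing points.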


.. py:function:: grav_cv_score(training_data, testing_data, progressbar = True, rmse_as_median = False, plot = False, **kwargs)

   Find the score, represented by the root mean (or median) squared error (RMSE),
   between the testing gravity data and the predicted gravity data after an
   inversion. Follows methods of :footcite:t:`uiedafast2017`. Used in
   `optimization.optimize_inversion_damping()`.

   :param training_data: rows of the gravity data frame which are just the training data
   :type training_data: pandas.DataFrame
   :param testing_data: rows of the gravity data frame which are just the testing data
   :type testing_data: pandas.DataFrame
   :param rmse_as_median: calculate the RMSE as the median as opposed to the mean, by default False
   :type rmse_as_median: bool, optional
   :param progressbar: choose to show the progress bar for the forward gravity calculation, by default
                       True
   :type progressbar: bool, optional
   :param plot: choose to plot the observed and predicted data grids, and their difference,
                located at the testing points, by default False
   :type plot: bool, optional

   :returns: * **score** (*float*) -- the root mean squared error, between the testing gravity data and the predicted
               gravity data
             * **results** (*tuple[pandas.DataFrame, pandas.DataFrame, dict[str, typing.Any], float]*) -- a tuple of the inversion results.
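
   The difference between the mean and median variants of the score can be illustrated
   with a small NumPy sketch. The function ``rmse`` below is a stand-in written for this
   example and is assumed, not confirmed, to mirror how the score is computed internally.

   .. code-block:: python

      import numpy as np


      def rmse(residuals, as_median=False):
          """Root mean (or median) squared error of an array of misfits."""
          squared = np.asarray(residuals, dtype=float) ** 2
          center = np.median(squared) if as_median else np.mean(squared)
          return float(np.sqrt(center))


      observed = np.array([1.0, 2.0, 3.0, 4.0])
      predicted = np.array([1.0, 2.0, 3.0, 14.0])  # a single outlier of 10

      score_mean = rmse(observed - predicted)  # 5.0, dominated by the outlier
      score_median = rmse(observed - predicted, as_median=True)  # 0.0, robust to it

   The median variant is less sensitive to a few badly predicted testing points.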

   .. rubric:: References

   .. footbibliography::


.. py:function:: grav_optimal_parameter(training_data, testing_data, param_to_test, rmse_as_median = False, progressbar = True, plot_grids = False, plot_cv = False, results_fname = None, **kwargs)

   Calculate the cross validation scores for a set of parameter values and return the
   best score and value.

   :param training_data: just the training data rows
   :type training_data: pandas.DataFrame
   :param testing_data: just the testing data rows
   :type testing_data: pandas.DataFrame
   :param param_to_test: first value is a string of the parameter that is being tested, and the second
                         value is a list of the values to test
   :type param_to_test: tuple[str, list[float]]
   :param rmse_as_median: calculate the RMSE as the median as opposed to the mean, by default False
   :type rmse_as_median: bool, optional
   :param progressbar: display a progress bar for the number of tested values, by default True
   :type progressbar: bool, optional
   :param plot_grids: plot all the grids of observed and predicted data for each parameter value, by
                      default False
   :type plot_grids: bool, optional
   :param plot_cv: plot a graph of scores vs parameter values, by default False
   :type plot_cv: bool, optional
   :param results_fname: file name to save results to, by default "tmp" with an attached random number
   :type results_fname: str, optional

   :returns: the inversion results, the optimal parameter value, the score associated with
             it, the parameter values and the scores for each parameter value
   :rtype: tuple[tuple[pandas.DataFrame, pandas.DataFrame, dict[str, typing.Any], float], float, float, list[float], list[float]]
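
   The selection step (score every candidate value, keep the one with the lowest score)
   can be sketched in plain Python. Both ``best_parameter`` and ``toy_score`` are
   hypothetical names for this example; ``toy_score`` merely stands in for a real
   cross-validation RMSE.

   .. code-block:: python

      def best_parameter(values, score_func):
          """Score every candidate value and return (best_value, best_score, scores)."""
          scores = [score_func(value) for value in values]
          # a lower RMSE means a better fit, so pick the minimum
          best_index = min(range(len(scores)), key=scores.__getitem__)
          return values[best_index], scores[best_index], scores


      def toy_score(damping):
          # stand-in for a cross-validation RMSE with its minimum at damping = 0.1
          return (damping - 0.1) ** 2 + 1.0


      dampings = [0.001, 0.01, 0.1, 1.0]
      best_value, best_score, all_scores = best_parameter(dampings, toy_score)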


.. py:function:: constraints_cv_score(grav_df, constraints_df, rmse_as_median = False, **kwargs)

   Find the score, represented by the root mean (or median) squared error (RMSE),
   between the constraint point elevations and the inverted topography at the
   constraint points.
   Follows methods of :footcite:t:`uiedafast2017`. Used in
   `optimization.optimize_inversion_zref_density_contrast()`.

   :param grav_df: gravity dataframe with columns "res", "reg", and "gravity_anomaly"
   :type grav_df: pandas.DataFrame
   :param constraints_df: constraints dataframe with columns "easting", "northing", and "upward"
   :type constraints_df: pandas.DataFrame
   :param rmse_as_median: calculate the RMSE as the median as opposed to the mean, by default
                          False
   :type rmse_as_median: bool, optional

   :returns: * **score** (*float*) -- the root mean squared error, between the constraint point elevation and the
               inverted topography at the constraint points
             * **results** (*tuple[pandas.DataFrame, pandas.DataFrame, dict[str, typing.Any], float]*) -- a tuple of the inversion results.

   .. rubric:: References

   .. footbibliography::


.. py:function:: zref_density_optimal_parameter(grav_df, constraints_df, starting_topography = None, zref_values = None, density_contrast_values = None, starting_topography_kwargs = None, regional_grav_kwargs = None, rmse_as_median = False, progressbar = True, plot_cv = False, results_fname = None, **kwargs)

   Calculate the cross validation scores for a set of zref and density contrast values
   and return the best score and values. If only one parameter needs to be tested, pass
   a single value for the other parameter. This uses constraint points, where the target
   topography is known. The inverted topography at each of these points is compared to
   the known value and used to calculate the score.

   :param grav_df: dataframe with gravity data and coordinates, must have coordinate columns
                   "easting", "northing", and "upward", and gravity data column "gravity_anomaly"
   :type grav_df: pandas.DataFrame
   :param constraints_df: dataframe with points where the topography of interest has been previously
                          measured, to be used for score, must have coordinate columns "easting",
                          "northing", and "upward".
   :type constraints_df: pandas.DataFrame
   :param starting_topography: starting topography to use to create the starting prism model. If not provided,
                               will make a flat surface at each provided zref value using the region and
                               spacing values provided in starting_topography_kwargs.
   :type starting_topography: xarray.DataArray | None, optional
   :param zref_values: Reference level values to test, by default None
   :type zref_values: list[float] | None, optional
   :param density_contrast_values: Density contrast values to test, by default None
   :type density_contrast_values: list[float] | None, optional
   :param starting_topography_kwargs: region, spacing and dampings used to create a flat starting topography for each
                                      zref value, by default None.
   :type starting_topography_kwargs: dict[str, typing.Any] | None, optional
   :param regional_grav_kwargs: Keywords used to calculate the regional field, by default None. If method is
                                `constraints` for constraint point minimization, must separate the constraints
                                into testing and training sets and provide the training set to this argument and
                                the testing set to `constraints_df` to avoid biasing the scores.
   :type regional_grav_kwargs: dict[str, typing.Any] | None, optional
   :param rmse_as_median: calculate the RMSE as the median as opposed to the mean, by default
                          False
   :type rmse_as_median: bool, optional
   :param progressbar: display a progress bar for the number of tested values, by default True
   :type progressbar: bool, optional
   :param plot_cv: plot a graph of scores vs parameter values, by default False
   :type plot_cv: bool, optional
   :param results_fname: file name to save results to, by default "tmp" with an attached random number
   :type results_fname: str, optional

   :returns: the inversion results, the optimal parameter values, the score associated with
             them, the parameter values and the scores for each parameter value
   :rtype: tuple[tuple[pandas.DataFrame, pandas.DataFrame, dict[str, typing.Any], float], float, float, float, list[typing.Any], list[float]]
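
   Testing two parameters together amounts to scoring every combination of values. The
   sketch below shows that search with ``itertools.product``; ``best_pair`` and
   ``toy_score`` are hypothetical names, and ``toy_score`` stands in for the constraint
   point RMSE of a real inversion.

   .. code-block:: python

      import itertools


      def best_pair(zref_values, density_contrast_values, score_func):
          """Score all (zref, density contrast) pairs and return the best one."""
          pairs = list(itertools.product(zref_values, density_contrast_values))
          scores = [score_func(zref, density) for zref, density in pairs]
          best_index = min(range(len(scores)), key=scores.__getitem__)
          best_zref, best_density = pairs[best_index]
          return best_zref, best_density, scores[best_index], pairs, scores


      def toy_score(zref, density_contrast):
          # stand-in score with its minimum at zref = -500, density contrast = 400
          return abs(zref + 500) / 100 + abs(density_contrast - 400) / 100


      result = best_pair([-1000, -500, 0], [300, 400, 500], toy_score)
      # holding one parameter fixed is the same search with a one-element list
      zref_only = best_pair([-1000, -500, 0], [400], toy_score)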


.. py:function:: random_split_test_train(data_df, test_size = 0.3, random_state = 10, plot = False)

   Randomly split data into training and testing sets, with the fraction of points
   placed in the testing set given by test_size.

   :param data_df: data to be split, must have columns "easting" and "northing".
   :type data_df: pandas.DataFrame
   :param test_size: fraction (between 0 and 1) of points to put in the testing set, by default 0.3
   :type test_size: float, optional
   :param random_state: seed used for the random splitting, by default 10
   :type random_state: int, optional
   :param plot: choose to plot the results, by default False
   :type plot: bool, optional

   :returns: dataframe with a new column "test" which is a boolean value of whether the row
             is in the training or testing set.
   :rtype: pandas.DataFrame
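
   A random split that adds a boolean "test" column can be sketched as follows. The
   helper ``add_random_test_column`` is hypothetical and uses NumPy's random generator,
   which is an assumption and may differ from what the library uses internally.

   .. code-block:: python

      import numpy as np
      import pandas as pd


      def add_random_test_column(data_df, test_size=0.3, random_state=10):
          """Return a copy of data_df with a boolean "test" column."""
          rng = np.random.default_rng(random_state)
          n_test = int(round(test_size * len(data_df)))
          # draw row positions for the testing set without replacement
          test_rows = rng.choice(len(data_df), size=n_test, replace=False)
          mask = np.zeros(len(data_df), dtype=bool)
          mask[test_rows] = True
          out = data_df.copy()
          out["test"] = mask
          return out


      points = pd.DataFrame(
          {"easting": np.arange(10.0), "northing": np.zeros(10)}
      )
      split = add_random_test_column(points, test_size=0.3)

   Fixing ``random_state`` makes the split reproducible between runs.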


.. py:function:: split_test_train(data_df, method, spacing = None, shape = None, n_splits = 5, random_state = 10, plot = False)

   Split data into training and testing sets using either the KFold (optionally
   blocked) or LeaveOneOut method.

   :param data_df: dataframe with coordinate columns "easting" and "northing"
   :type data_df: pandas.DataFrame
   :param method: choose between "LeaveOneOut" or "KFold" methods.
   :type method: str
   :param spacing: grid spacing to use for Block K-Folds, by default None
   :type spacing: float | tuple[float, float] | None, optional
   :param shape: number of blocks to use for Block K-Folds, by default None
   :type shape: tuple[float, float] | None, optional
   :param n_splits: number of folds to make for the K-Folds method, by default 5
   :type n_splits: int, optional
   :param random_state: random state used for both methods, by default 10
   :type random_state: int, optional
   :param plot: plot the separated training and testing dataset, by default False
   :type plot: bool, optional

   :returns: a dataframe with a new column for each fold in the form fold_0, fold_1 etc., with
             the value "train" or "test"
   :rtype: pandas.DataFrame
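
   The output layout (one ``fold_i`` column per fold holding "train" or "test") can be
   illustrated with a plain, unblocked K-fold written in NumPy. The helper
   ``add_kfold_columns`` is hypothetical and does not reproduce the blocked or
   LeaveOneOut options.

   .. code-block:: python

      import numpy as np
      import pandas as pd


      def add_kfold_columns(data_df, n_splits=5, random_state=10):
          """Return a copy of data_df with one "train"/"test" column per fold."""
          rng = np.random.default_rng(random_state)
          shuffled = rng.permutation(len(data_df))
          out = data_df.copy()
          # each chunk of the shuffled row positions is the testing set of one fold
          for i, test_rows in enumerate(np.array_split(shuffled, n_splits)):
              labels = np.full(len(data_df), "train", dtype=object)
              labels[test_rows] = "test"
              out[f"fold_{i}"] = labels
          return out


      points = pd.DataFrame(
          {"easting": np.arange(10.0), "northing": np.zeros(10)}
      )
      folds = add_kfold_columns(points, n_splits=5)

   Every row is a testing point in exactly one fold and a training point in all others.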


.. py:function:: kfold_df_to_lists(df)

   Convert a single dataframe with fold columns in the form fold_0, fold_1 etc. into
   a list of testing dataframes for each fold and a list of training dataframes for
   each fold.

   :param df: dataframe with fold columns in the form fold_0, fold_1 etc., as output by
              function `split_test_train()`.
   :type df: pandas.DataFrame

   :returns: * **test_dfs** (*list[pandas.DataFrame]*) -- a list of testing dataframes for each fold
             * **train_dfs** (*list[pandas.DataFrame]*) -- a list of training dataframes for each fold
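
   The conversion can be sketched in a few lines of pandas. ``fold_columns_to_lists``
   is a hypothetical stand-in; whether the real function keeps or drops the fold
   columns in the returned dataframes is not stated here, and this sketch keeps them.

   .. code-block:: python

      import pandas as pd


      def fold_columns_to_lists(df):
          """Split a dataframe with fold_* columns into per-fold test and train lists."""
          fold_cols = [col for col in df.columns if col.startswith("fold_")]
          test_dfs = [df[df[col] == "test"] for col in fold_cols]
          train_dfs = [df[df[col] == "train"] for col in fold_cols]
          return test_dfs, train_dfs


      df = pd.DataFrame(
          {
              "easting": [0.0, 1.0, 2.0, 3.0],
              "northing": [0.0, 0.0, 0.0, 0.0],
              "fold_0": ["test", "test", "train", "train"],
              "fold_1": ["train", "train", "test", "test"],
          }
      )
      test_dfs, train_dfs = fold_columns_to_lists(df)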


.. py:function:: eq_sources_score(coordinates, data, delayed = False, weights = None, **kwargs)

   Calculate the cross-validation score for fitting gravity data to equivalent sources.
   Uses Verde's cross_val_score function to calculate the score.
   All kwargs are passed to the harmonica.EquivalentSources class.

   :param coordinates: tuple of easting, northing, and upward coordinates of the gravity data
   :type coordinates: tuple[numpy.ndarray, numpy.ndarray, numpy.ndarray]
   :param data: the gravity data
   :type data: pandas.Series | numpy.ndarray
   :param delayed: compute the scores in parallel if True, by default False
   :type delayed: bool, optional
   :param weights: optional weight values for each gravity data point, by default None
   :type weights: numpy.ndarray | None, optional

   :keyword damping: The positive damping regularization parameter. Controls how much
                     smoothness is imposed on the estimated coefficients.
                     If None, no regularization is used.
   :kwtype damping: float | None
   :keyword points: List containing the coordinates of the equivalent point sources.
                    Coordinates are assumed to be in the following order:
                    (``easting``, ``northing``, ``upward``).
                    If None, will place one point source below each observation point at
                    a fixed relative depth below the observation point.
                    Defaults to None.
   :kwtype points: list[numpy.ndarray] | None
   :keyword depth: Parameter used to control the depth at which the point sources will be
                   located.
                   If a value is provided, each source is located beneath each data point
                   (or block-averaged location) at a depth equal to its elevation minus
                   the ``depth`` value.
                   If set to ``"default"``, the depth of the sources will be estimated as
                   4.5 times the mean distance between first neighboring sources.
                   This parameter is ignored if *points* is specified.
                   Defaults to ``"default"``.
   :kwtype depth: float | str
   :keyword block_size: Size of the blocks used on block-averaged equivalent sources.
                        If a single value is passed, the blocks will have a square shape.
                        Alternatively, the dimensions of the blocks in the South-North and
                        West-East directions can be specified by passing a tuple.
                        If None, no block-averaging is applied.
                        This parameter is ignored if *points* is specified.
                        Defaults to None.
   :kwtype block_size: float | tuple[float, float] | None
   :keyword parallel: If True, predictions and Jacobian building are carried out in
                      parallel through Numba's ``jit.prange``, reducing the computation time.
                      If False, these tasks will be run on a single CPU. Defaults to True.
   :kwtype parallel: bool
   :keyword dtype: The desired data-type for the predictions and the Jacobian matrix.
                   Defaults to ``"float64"``.
   :kwtype dtype: str

   :returns: the score; the higher the value, the better the fit.
   :rtype: float


.. py:function:: regional_separation_score(testing_df, score_as_median = False, **kwargs)

   Evaluate the effectiveness of the gravity regional-residual separation.
   The optimal regional component is the one that results in a residual component that
   is lowest at constraint points while still containing a high amplitude elsewhere.

   :param testing_df: dataframe containing a priori measurements of the topography of interest with
                      columns "upward", "easting", and "northing"
   :type testing_df: pandas.DataFrame
   :param score_as_median: switch from using the root mean square to the root median square for the score,
                           by default False
   :type score_as_median: bool, optional
   :param \*\*kwargs: additional keyword arguments for the specified method.
   :type \*\*kwargs: typing.Any

   :returns: * **residual_constraint_score** (*float*) -- the RMS of the residual at constraint points
             * **residual_amplitude_score** (*float*) -- the RMS of the residual amplitude at all grid points
             * **true_reg_score** (*float | None*) -- the RMSE between the true regional field and the estimated field, if provided,
               otherwise None
             * **df_anomalies** (*pandas.DataFrame*) -- the dataframe of the regional and residual gravity anomalies
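
   The two residual scores can be illustrated with made-up numbers. The arrays below
   are invented for this example, and ``rms`` is a hypothetical helper; the sketch only
   shows what a favourable separation looks like, not how the function computes it.

   .. code-block:: python

      import numpy as np


      def rms(values):
          """Root mean square of an array."""
          return float(np.sqrt(np.mean(np.asarray(values, dtype=float) ** 2)))


      # residual anomaly at every grid point (invented values)
      residual_all = np.array([3.0, -4.0, 3.0, -4.0])
      # residual anomaly sampled at the constraint points (invented values)
      residual_at_constraints = np.array([0.5, -0.5])

      residual_constraint_score = rms(residual_at_constraints)  # 0.5
      residual_amplitude_score = rms(residual_all)

   A good separation keeps the first score low while the second remains high.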


