invert4geom.cross_validation

Functions

resample_with_test_points(data_spacing, data, region)

Take a dataframe of coordinates and make all rows that fall on the data_spacing grid training points.

grav_cv_score(training_data, testing_data[, ...])

Find the score, represented by the root mean (or median) squared error (RMSE), between the testing gravity data and the predicted gravity data after an inversion.

grav_optimal_parameter(training_data, testing_data, ...)

Calculate the cross validation scores for a set of parameter values and return the best score and value.

constraints_cv_score(grav_df, constraints_df[, ...])

Find the score, represented by the root mean squared error (RMSE), between the constraint point elevation and the inverted topography at the constraint points.

zref_density_optimal_parameter(grav_df, constraints_df)

Calculate the cross validation scores for a set of zref and density values and return the best score and values.

random_split_test_train(data_df[, test_size, ...])

Split data into training and testing sets randomly, with the fraction of points in the test set given by test_size.

split_test_train(data_df, method[, spacing, shape, ...])

Split data into training or testing sets using either KFold (optionally blocked) or LeaveOneOut methods.

kfold_df_to_lists(df)

Convert a single dataframe with fold columns in the form fold_0, fold_1 etc. into a list of testing dataframes for each fold and a list of training dataframes for each fold.

eq_sources_score(coordinates, data[, delayed, weights])

Calculate the cross-validation score for fitting gravity data to equivalent sources.

regional_separation_score(testing_df[, score_as_median])

Evaluate the effectiveness of the gravity regional-residual separation.

Module Contents

resample_with_test_points(data_spacing, data, region)

Take a dataframe of coordinates and label all rows that fall on the data_spacing grid as training points. Add a row at each point that falls on the grid with half the data_spacing, and label these “test”. If other data columns are present in the dataframe, they are sampled at each new location.

Parameters:
  • data_spacing (float) – full spacing size which will be halved

  • data (pandas.DataFrame) – dataframe with coordinate columns “easting” and “northing”, all other columns will be sampled at new grid spacing

  • region (tuple[float, float, float, float]) – region to create grid over, in the form (min_easting, max_easting, min_northing, max_northing)

Returns:

a new dataframe with new column “test” of booleans which shows whether each row is a testing or training point.

Return type:

pandas.DataFrame
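
The train/test labelling described above can be sketched without any dependencies. This is only an illustration, not the library's implementation: label_half_spacing_grid is a hypothetical helper, plain dicts stand in for DataFrame rows, and it assumes the grid starts at the region's minimum corner and that only the newly added half-spacing nodes are marked as test points.

```python
def label_half_spacing_grid(data_spacing, region):
    """Hypothetical helper: build a half-spacing grid and flag the new nodes as test."""
    min_e, max_e, min_n, max_n = region
    half = data_spacing / 2
    n_east = int(round((max_e - min_e) / half)) + 1
    n_north = int(round((max_n - min_n) / half)) + 1
    rows = []
    for j in range(n_north):
        for i in range(n_east):
            # Nodes with even indices in both directions lie on the full-spacing grid.
            on_full_grid = i % 2 == 0 and j % 2 == 0
            rows.append(
                {
                    "easting": min_e + i * half,
                    "northing": min_n + j * half,
                    "test": not on_full_grid,
                }
            )
    return rows


rows = label_half_spacing_grid(2.0, (0.0, 4.0, 0.0, 4.0))
n_train = sum(not row["test"] for row in rows)
n_test = sum(row["test"] for row in rows)
```

With a spacing of 2 over a 4 by 4 region, the half-spacing grid has 25 nodes, of which the 9 original nodes stay as training points.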

grav_cv_score(training_data, testing_data, progressbar=True, rmse_as_median=False, plot=False, **kwargs)

Find the score, represented by the root mean (or median) squared error (RMSE), between the testing gravity data and the predicted gravity data after an inversion. Follows methods of Uieda and Barbosa[1]. Used in optimization.optimize_inversion_damping().

Parameters:
  • training_data (pandas.DataFrame) – rows of the gravity data frame which are just the training data

  • testing_data (pandas.DataFrame) – rows of the gravity data frame which are just the testing data

  • rmse_as_median (bool, optional) – calculate the RMSE as the median as opposed to the mean, by default False

  • progressbar (bool, optional) – choose to show the progress bar for the forward gravity calculation, by default True

  • plot (bool, optional) – choose to plot the observed and predicted data grids, and their difference, located at the testing points, by default False

  • kwargs (Any)

Returns:

  • score (float) – the root mean (or median) squared error between the testing gravity data and the predicted gravity data

  • results (tuple[pandas.DataFrame, pandas.DataFrame, dict[str, typing.Any], float]) – a tuple of the inversion results.

Return type:

tuple[float, tuple[pandas.DataFrame, pandas.DataFrame, dict[str, Any], float]]

References

Uieda and Barbosa[1]
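
The difference between the mean and median variants of the score can be illustrated with a small sketch. The rmse function below is a hypothetical stand-in, not the library's code, and it assumes that rmse_as_median swaps the mean of the squared errors for their median before taking the square root.

```python
import math
import statistics


def rmse(observed, predicted, as_median=False):
    """Root mean (or median) squared error between two equal-length sequences."""
    squared = [(o - p) ** 2 for o, p in zip(observed, predicted)]
    central = statistics.median(squared) if as_median else statistics.mean(squared)
    return math.sqrt(central)


observed = [1.0, 2.0, 3.0, 4.0, 5.0]
predicted = [2.0, 3.0, 4.0, 5.0, 15.0]  # the last point is an outlier

mean_score = rmse(observed, predicted)
median_score = rmse(observed, predicted, as_median=True)
```

The single outlier dominates the mean-based score, while the median-based score reflects the typical misfit of 1.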

grav_optimal_parameter(training_data, testing_data, param_to_test, rmse_as_median=False, progressbar=True, plot_grids=False, plot_cv=False, results_fname=None, **kwargs)

Calculate the cross validation scores for a set of parameter values and return the best score and value.

Parameters:
  • training_data (pandas.DataFrame) – just the training data rows

  • testing_data (pandas.DataFrame) – just the testing data rows

  • param_to_test (tuple[str, list[float]]) – first value is a string of the parameter that is being tested, and the second value is a list of the values to test

  • rmse_as_median (bool, optional) – calculate the RMSE as the median as opposed to the mean, by default False

  • progressbar (bool, optional) – display a progress bar for the number of tested values, by default True

  • plot_grids (bool, optional) – plot all the grids of observed and predicted data for each parameter value, by default False

  • plot_cv (bool, optional) – plot a graph of scores vs parameter values, by default False

  • results_fname (str, optional) – file name to save results to, by default “tmp” with an attached random number

  • kwargs (Any)

Returns:

the inversion results, the optimal parameter value, the score associated with it, the parameter values and the scores for each parameter value

Return type:

tuple[tuple[pandas.DataFrame, pandas.DataFrame, dict[str, Any], float], float, float, list[float], list[float]]
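
The sweep described here amounts to scoring each candidate value and keeping the one with the lowest score. A minimal sketch follows, assuming lower scores are better; best_parameter and toy_score are hypothetical, the parameter name "damping" is only illustrative, and the toy score replaces the inversion and cross-validation that the real function runs.

```python
def best_parameter(param_to_test, score_fn):
    """Score every candidate value and return the best value, its score, and all results."""
    _name, values = param_to_test
    scores = [score_fn(value) for value in values]
    best_index = min(range(len(scores)), key=scores.__getitem__)
    return values[best_index], scores[best_index], list(values), scores


def toy_score(value):
    # Stand-in for a cross-validation score; smallest at value == 0.1.
    return (value - 0.1) ** 2


# Same (name, values) shape as the param_to_test argument.
param_to_test = ("damping", [0.001, 0.01, 0.1, 1.0])
best_value, best_score, values, scores = best_parameter(param_to_test, toy_score)
```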

constraints_cv_score(grav_df, constraints_df, rmse_as_median=False, **kwargs)

Find the score, represented by the root mean squared error (RMSE), between the constraint point elevation and the inverted topography at the constraint points. Follows methods of Uieda and Barbosa[1]. Used in optimization.optimize_inversion_zref_density_contrast().

Parameters:
  • grav_df (pandas.DataFrame) – gravity dataframe with columns “res”, “reg”, and “gravity_anomaly”

  • constraints_df (pandas.DataFrame) – constraints dataframe with columns “easting”, “northing”, and “upward”

  • rmse_as_median (bool, optional) – calculate the RMSE as the median, as opposed to the mean, by default False

  • kwargs (Any)

Returns:

  • score (float) – the root mean squared error between the constraint point elevation and the inverted topography at the constraint points

  • results (tuple[pandas.DataFrame, pandas.DataFrame, dict[str, typing.Any], float]) – a tuple of the inversion results.

Return type:

tuple[float, tuple[pandas.DataFrame, pandas.DataFrame, dict[str, Any], float]]

References

Uieda and Barbosa[1]
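
The comparison at constraint points can be sketched as sampling the inverted topography at each constraint location and taking the RMSE against the measured elevation. This is an illustration only: the helpers are hypothetical, a dict of grid nodes stands in for the inverted topography, and nearest-node sampling is an assumption, since the sampling method the library uses is not stated here.

```python
import math


def sample_nearest(grid, easting, northing):
    """Return the elevation of the grid node closest to (easting, northing)."""
    node = min(grid, key=lambda en: (en[0] - easting) ** 2 + (en[1] - northing) ** 2)
    return grid[node]


def constraints_rmse(grid, constraints):
    """RMSE between constraint elevations ("upward") and the sampled topography."""
    errors = [
        c["upward"] - sample_nearest(grid, c["easting"], c["northing"])
        for c in constraints
    ]
    return math.sqrt(sum(e * e for e in errors) / len(errors))


# Toy inverted topography on a 2 x 2 grid: (easting, northing) -> elevation
inverted = {(0.0, 0.0): 100.0, (1.0, 0.0): 110.0, (0.0, 1.0): 120.0, (1.0, 1.0): 130.0}
constraints = [
    {"easting": 0.1, "northing": 0.0, "upward": 103.0},  # nearest node 100, error +3
    {"easting": 0.9, "northing": 1.0, "upward": 126.0},  # nearest node 130, error -4
]
score = constraints_rmse(inverted, constraints)
```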

zref_density_optimal_parameter(grav_df, constraints_df, starting_topography=None, zref_values=None, density_contrast_values=None, starting_topography_kwargs=None, regional_grav_kwargs=None, rmse_as_median=False, progressbar=True, plot_cv=False, results_fname=None, **kwargs)

Calculate the cross validation scores for a set of zref and density values and return the best score and values. If only one parameter needs to be tested, pass a single value for the other parameter. This uses constraint points, where the target topography is known. The inverted topography at each of these points is compared to the known value and used to calculate the score.

Parameters:
  • grav_df (pandas.DataFrame) – dataframe with gravity data and coordinates, must have coordinate columns “easting”, “northing”, and “upward”, and gravity data column “gravity_anomaly”

  • constraints_df (pandas.DataFrame) – dataframe with points where the topography of interest has been previously measured, to be used for score, must have coordinate columns “easting”, “northing”, and “upward”.

  • starting_topography (xarray.DataArray | None, optional) – starting topography to use to create the starting prism model. If not provided, will make a flat surface at each provided zref value using the region and spacing values provided in starting_topography_kwargs.

  • zref_values (list[float] | None, optional) – Reference level values to test, by default None

  • density_contrast_values (list[float] | None, optional) – Density contrast values to test, by default None

  • starting_topography_kwargs (dict[str, Any] | None, optional) – region, spacing and dampings used to create a flat starting topography for each zref value, by default None.

  • regional_grav_kwargs (dict[str, Any] | None, optional) – Keywords used to calculate the regional field, by default None. If the method is constraints (constraint point minimization), the constraints must be separated into testing and training sets, with the training set provided to this argument and the testing set to constraints_df, to avoid biasing the scores.

  • rmse_as_median (bool, optional) – Use the median instead of the root mean square as the scoring metric, by default False

  • progressbar (bool, optional) – display a progress bar for the number of tested values, by default True

  • plot_cv (bool, optional) – plot a graph of scores vs parameter values, by default False

  • results_fname (str, optional) – file name to save results to, by default “tmp” with an attached random number

  • kwargs (Any)

Returns:

the inversion results, the optimal parameter value, the score associated with it, the parameter values and the scores for each parameter value

Return type:

tuple[tuple[pandas.DataFrame, pandas.DataFrame, dict[str, Any], float], float, float, float, list[Any], list[float]]
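
Testing two parameters together is a grid search over every (zref, density contrast) pair. The sketch below shows that pattern with itertools.product; grid_search and toy_score are hypothetical, lower scores are assumed to be better, and the toy score replaces the inversion and constraint-point scoring done by the real function.

```python
from itertools import product


def grid_search(zref_values, density_contrast_values, score_fn):
    """Score every (zref, density contrast) pair and return the best pair and all results."""
    combos = list(product(zref_values, density_contrast_values))
    scores = [score_fn(zref, density) for zref, density in combos]
    best_index = min(range(len(combos)), key=scores.__getitem__)
    best_zref, best_density = combos[best_index]
    return best_zref, best_density, scores[best_index], combos, scores


def toy_score(zref, density_contrast):
    # Stand-in score; smallest at zref == 0 and density_contrast == 2300.
    return abs(zref) / 100.0 + abs(density_contrast - 2300.0) / 100.0


best_zref, best_density, best_score, combos, scores = grid_search(
    [-100.0, 0.0, 100.0], [2200.0, 2300.0, 2400.0], toy_score
)

# Testing only zref: pass a single density contrast value.
zref_only = grid_search([-100.0, 0.0, 100.0], [2300.0], toy_score)
```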

random_split_test_train(data_df, test_size=0.3, random_state=10, plot=False)

Split data into training and testing sets randomly, with the fraction of points placed in the test set given by test_size.

Parameters:
  • data_df (pandas.DataFrame) – data to be split, must have columns “easting” and “northing”.

  • test_size (float, optional) – decimal percentage of points to put in the testing set, by default 0.3

  • random_state (int, optional) – number used to seed the random splitting, by default 10

  • plot (bool, optional) – choose to plot the results, by default False

Returns:

dataframe with a new column “test” which is a boolean value of whether the row is in the training or testing set.

Return type:

pandas.DataFrame
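
A seeded random split of this kind can be sketched with the standard library alone. This is not the library's implementation: random_split is a hypothetical helper, plain dicts stand in for DataFrame rows, and rounding the test count to the nearest integer is an assumption.

```python
import random


def random_split(rows, test_size=0.3, random_state=10):
    """Return copies of the rows with a boolean "test" key, chosen reproducibly."""
    n_test = round(len(rows) * test_size)
    rng = random.Random(random_state)  # fixed seed gives the same split every time
    test_indices = set(rng.sample(range(len(rows)), n_test))
    return [{**row, "test": i in test_indices} for i, row in enumerate(rows)]


rows = [{"easting": float(i), "northing": 0.0} for i in range(10)]
split = random_split(rows, test_size=0.3, random_state=10)
n_test = sum(row["test"] for row in split)
```

Calling random_split again with the same random_state returns an identical split, which is what makes cross-validation scores comparable between runs.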

split_test_train(data_df, method, spacing=None, shape=None, n_splits=5, random_state=10, plot=False)

Split data into training or testing sets using either KFold (optionally blocked) or LeaveOneOut methods.

Parameters:
  • data_df (pandas.DataFrame) – dataframe with coordinate columns “easting” and “northing”

  • method (str) – choose between “LeaveOneOut” or “KFold” methods.

  • spacing (float | tuple[float, float] | None, optional) – grid spacing to use for Block K-Folds, by default None

  • shape (tuple[float, float] | None, optional) – number of blocks to use for Block K-Folds, by default None

  • n_splits (int, optional) – number of folds to make for K-Folds method, by default 5

  • random_state (int, optional) – random state used for both methods, by default 10

  • plot (bool, optional) – plot the separated training and testing dataset, by default False

Returns:

a dataframe with a new column for each fold in the form fold_0, fold_1 etc., with the value “train” or “test”

Return type:

pandas.DataFrame
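
The fold-column layout of the output can be illustrated with a plain, non-blocked K-fold sketch. The helper add_fold_columns is hypothetical, plain dicts stand in for DataFrame rows, and the blocked and LeaveOneOut variants are not shown.

```python
import random


def add_fold_columns(rows, n_splits=5, random_state=10):
    """Return copies of the rows with fold_0 ... fold_{n-1} keys set to "train" or "test"."""
    indices = list(range(len(rows)))
    random.Random(random_state).shuffle(indices)
    # Deal the shuffled indices into n_splits groups; each group is one fold's test set.
    folds = [set(indices[k::n_splits]) for k in range(n_splits)]
    labelled = [dict(row) for row in rows]
    for k, test_indices in enumerate(folds):
        for i, row in enumerate(labelled):
            row[f"fold_{k}"] = "test" if i in test_indices else "train"
    return labelled


rows = [{"easting": float(i), "northing": 0.0} for i in range(10)]
labelled = add_fold_columns(rows, n_splits=5)
```

Each row is a test point in exactly one fold and a training point in all the others.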

kfold_df_to_lists(df)

Convert a single dataframe with fold columns in the form fold_0, fold_1 etc. into a list of testing dataframes for each fold and a list of training dataframes for each fold.

Parameters:

df (pandas.DataFrame) – dataframe with fold columns in the form fold_0, fold_1 etc., as output by function split_test_train().

Returns:

  • test_dfs (list[pandas.DataFrame]) – a list of testing dataframes for each fold

  • train_dfs (list[pandas.DataFrame]) – a list of training dataframes for each fold

Return type:

tuple[list[pandas.DataFrame], list[pandas.DataFrame]]
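
The conversion can be sketched as filtering the rows once per fold column. The helper fold_rows_to_lists is hypothetical and uses plain dicts in place of DataFrame rows; whether the real function keeps or drops the fold columns in its output is not stated here, so this sketch simply keeps them.

```python
def fold_rows_to_lists(rows):
    """Split rows into per-fold test and train lists, in fold order."""
    fold_columns = sorted(
        (key for key in rows[0] if key.startswith("fold_")),
        key=lambda name: int(name.split("_")[1]),
    )
    test_sets = [[r for r in rows if r[col] == "test"] for col in fold_columns]
    train_sets = [[r for r in rows if r[col] == "train"] for col in fold_columns]
    return test_sets, train_sets


rows = [
    {"easting": 0.0, "fold_0": "test", "fold_1": "train"},
    {"easting": 1.0, "fold_0": "train", "fold_1": "test"},
    {"easting": 2.0, "fold_0": "train", "fold_1": "train"},
]
test_sets, train_sets = fold_rows_to_lists(rows)
```

The two lists are index-aligned, so test_sets[k] and train_sets[k] together describe fold k.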

eq_sources_score(coordinates, data, delayed=False, weights=None, **kwargs)

Calculate the cross-validation score for fitting gravity data to equivalent sources. Uses Verde’s cross_val_score function to calculate the score. All kwargs are passed to the harmonica.EquivalentSources class.

Parameters:
Keyword Arguments:
  • damping (float | None) – The positive damping regularization parameter. Controls how much smoothness is imposed on the estimated coefficients. If None, no regularization is used.

  • points (list[numpy.ndarray] | None) – List containing the coordinates of the equivalent point sources. Coordinates are assumed to be in the following order: (easting, northing, upward). If None, will place one point source below each observation point at a fixed relative depth below the observation point. Defaults to None.

  • depth (float | str) – Parameter used to control the depth at which the point sources will be located. If a value is provided, each source is located beneath each data point (or block-averaged location) at a depth equal to its elevation minus the depth value. If set to "default", the depth of the sources will be estimated as 4.5 times the mean distance between first neighboring sources. This parameter is ignored if points is specified. Defaults to "default".

  • block_size (float | tuple[float, float] | None) – Size of the blocks used on block-averaged equivalent sources. If a single value is passed, the blocks will have a square shape. Alternatively, the dimensions of the blocks in the South-North and West-East directions can be specified by passing a tuple. If None, no block-averaging is applied. This parameter is ignored if points are specified. Defaults to None.

  • parallel (bool) – If True, any predictions and Jacobian building are carried out in parallel through Numba’s jit.prange, reducing the computation time. If False, these tasks will be run on a single CPU. Defaults to True.

  • dtype (str) – The desired data-type for the predictions and the Jacobian matrix. Defaults to "float64".

Returns:

the score as a float; the higher the value, the better the fit.

Return type:

float
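
To see why a higher score means a better fit, it helps to look at the kind of metric involved. The sketch below assumes the score is an R² coefficient of determination, which is believed to be the default scoring in Verde but is not confirmed by this page; r2_score here is a hand-written illustration, not a call into Verde or Harmonica.

```python
def r2_score(observed, predicted):
    """Coefficient of determination: 1 is a perfect fit, 0 matches predicting the mean."""
    mean_observed = sum(observed) / len(observed)
    ss_residual = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_total = sum((o - mean_observed) ** 2 for o in observed)
    return 1 - ss_residual / ss_total


observed = [1.0, 2.0, 3.0, 4.0]
perfect = r2_score(observed, [1.0, 2.0, 3.0, 4.0])
close = r2_score(observed, [1.1, 1.9, 3.2, 3.8])
mean_only = r2_score(observed, [2.5, 2.5, 2.5, 2.5])
```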

regional_separation_score(testing_df, score_as_median=False, **kwargs)

Evaluate the effectiveness of the gravity regional-residual separation. The optimal regional component is the one that results in a residual component which is lowest at constraint points while still containing a high amplitude elsewhere.

Parameters:
  • testing_df (pandas.DataFrame) – dataframe containing a priori measurements of the topography of interest with columns “upward”, “easting”, and “northing”

  • score_as_median (bool, optional) – switch from using the root mean square to the root median square for the score, by default False

  • **kwargs (Any) – additional keyword arguments for the specified method.

Returns:

  • residual_constraint_score (float) – the RMS of the residual at constraint points

  • residual_amplitude_score (float) – the RMS of the residual amplitude at all grid points

  • true_reg_score (float | None) – the RMSE between the true regional field and the estimated field, if provided, otherwise None

  • df_anomalies (pandas.DataFrame) – the dataframe of the regional and residual gravity anomalies

Return type:

tuple[float, float, float | None, pandas.DataFrame]
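
The two residual scores can be illustrated with a small sketch. The rms helper and the sample values are hypothetical, and how the library weighs the two scores against each other is not stated here; the sketch only shows what each number measures.

```python
import math


def rms(values):
    """Root mean square of a sequence of numbers."""
    return math.sqrt(sum(v * v for v in values) / len(values))


# Toy residual anomaly values (illustrative numbers only)
residual_at_constraints = [0.5, -0.5, 0.5, -0.5]  # should be small for a good separation
residual_everywhere = [3.0, -4.0, 0.0, 5.0]  # should keep a high amplitude

residual_constraint_score = rms(residual_at_constraints)
residual_amplitude_score = rms(residual_everywhere)
```

A good separation in this sense gives a low residual_constraint_score together with a high residual_amplitude_score.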