invert4geom.cross_validation

Functions

`resample_with_test_points`(data_spacing, data, region)	Take a dataframe of coordinates and label all rows that fall on the data_spacing grid as training points.
`grav_cv_score`(training_data, testing_data[, ...])	Find the score, represented by the root mean (or median) squared error (RMSE), between the testing gravity data and the predicted gravity data after an inversion.
`grav_optimal_parameter`(training_data, testing_data, ...)	Calculate the cross validation scores for a set of parameter values and return the best score and value.
`constraints_cv_score`(grav_df, constraints_df[, ...])	Find the score, represented by the root mean squared error (RMSE), between the constraint point elevation and the inverted topography at the constraint points.
`zref_density_optimal_parameter`(grav_df, constraints_df)	Calculate the cross validation scores for a set of zref and density values and return the best score and values.
`random_split_test_train`(data_df[, test_size, ...])	Split data randomly into training and testing sets, with the fraction of points in the test set given by test_size.
`split_test_train`(data_df, method[, spacing, shape, ...])	Split data into training or testing sets using either KFold (optionally blocked) or LeaveOneOut methods.
`kfold_df_to_lists`(df)	Convert a single dataframe with fold columns in the form fold_0, fold_1 etc. into a list of testing dataframes and a list of training dataframes, one per fold.
`eq_sources_score`(coordinates, data[, delayed, weights])	Calculate the cross-validation score for fitting gravity data to equivalent sources.
`regional_separation_score`(testing_df[, score_as_median])	Evaluate the effectiveness of the gravity regional-residual separation.

Module Contents

resample_with_test_points(data_spacing, data, region)

Take a dataframe of coordinates and label all rows that fall on the data_spacing grid as training points. Add rows at each point that falls on the grid at half the data_spacing, and label these as testing points. If other data columns are present in the dataframe, they are sampled at each new location.

Parameters:

data_spacing (float) – full spacing size which will be halved
data (pandas.DataFrame) – dataframe with coordinate columns “easting” and “northing”, all other columns will be sampled at new grid spacing
region (tuple[float, float, float, float]) – region to create grid over, in the form (min_easting, max_easting, min_northing, max_northing)

Returns:

a new dataframe with new column “test” of booleans which shows whether each row is a testing or training point.

Return type:

pandas.DataFrame
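
The labelling rule described above can be illustrated without the library. The following is a minimal pure-Python sketch, not the Invert4Geom implementation: it builds a grid at half of `data_spacing` over `region` and treats nodes that also lie on the full-spacing grid as training points. The helper name `label_half_spacing_grid` is hypothetical, and the sampling of other data columns at the new locations is left out.

```python
def label_half_spacing_grid(data_spacing, region):
    """Return (easting, northing, test) rows on a grid at half of data_spacing.

    Hypothetical helper for illustration only; not part of invert4geom.
    """
    min_e, max_e, min_n, max_n = region
    half = data_spacing / 2
    n_east = int(round((max_e - min_e) / half)) + 1
    n_north = int(round((max_n - min_n) / half)) + 1
    rows = []
    for j in range(n_north):
        for i in range(n_east):
            # Nodes with even indices in both directions lie on the full grid,
            # so they are training points; every other node is a testing point.
            on_full_grid = (i % 2 == 0) and (j % 2 == 0)
            rows.append((min_e + i * half, min_n + j * half, not on_full_grid))
    return rows


rows = label_half_spacing_grid(data_spacing=2.0, region=(0.0, 4.0, 0.0, 4.0))
n_train = sum(1 for _, _, test in rows if not test)
n_test = sum(1 for _, _, test in rows if test)
print(n_train, n_test)
```

In this sketch the last element of each row plays the role of the boolean “test” column in the returned dataframe.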

grav_cv_score(training_data, testing_data, progressbar=True, rmse_as_median=False, plot=False, **kwargs)

Find the score, represented by the root mean (or median) squared error (RMSE), between the testing gravity data and the predicted gravity data after an inversion. Follows the methods of Uieda and Barbosa[1]. Used in optimization.optimize_inversion_damping().

Parameters:

training_data (pandas.DataFrame) – rows of the gravity data frame which are just the training data
testing_data (pandas.DataFrame) – rows of the gravity data frame which are just the testing data
rmse_as_median (bool, optional) – calculate the RMSE as the median as opposed to the mean, by default False
progressbar (bool, optional) – choose to show the progress bar for the forward gravity calculation, by default True
plot (bool, optional) – choose to plot the observed and predicted data grids, and their difference, located at the testing points, by default False
kwargs (Any)

Returns:

score (float) – the root mean squared error, between the testing gravity data and the predicted gravity data
results (tuple[pandas.DataFrame, pandas.DataFrame, dict[str, typing.Any], float]) – a tuple of the inversion results.

Return type:

tuple[float, tuple[pandas.DataFrame, pandas.DataFrame, dict[str, Any], float]]

References

Uieda and Barbosa[1]
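
The difference between the mean and median variants of the score can be sketched as follows. This is an illustration under the assumption that `rmse_as_median=True` takes the square root of the median (rather than the mean) of the squared misfits; the function name `rmse` and the sample values are made up for the example and are not taken from the library.

```python
import math
import statistics


def rmse(observed, predicted, as_median=False):
    """Root mean (or median) squared error between two equal-length sequences."""
    squared = [(o - p) ** 2 for o, p in zip(observed, predicted)]
    # The median is far less sensitive to a few large misfits than the mean.
    central = statistics.median(squared) if as_median else statistics.fmean(squared)
    return math.sqrt(central)


observed = [1.0, 2.0, 3.0, 4.0]
predicted = [1.0, 2.0, 3.0, 8.0]  # one outlier at the last testing point
score_mean = rmse(observed, predicted)
score_median = rmse(observed, predicted, as_median=True)
print(score_mean, score_median)
```

With a single outlier among the testing points, the mean-based score is dominated by that point while the median-based score ignores it.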

grav_optimal_parameter(training_data, testing_data, param_to_test, rmse_as_median=False, progressbar=True, plot_grids=False, plot_cv=False, results_fname=None, **kwargs)

Calculate the cross validation scores for a set of parameter values and return the best score and value.

Parameters:

training_data (pandas.DataFrame) – just the training data rows
testing_data (pandas.DataFrame) – just the testing data rows
param_to_test (tuple[str, list[float]]) – first value is a string of the parameter that is being tested, and the second value is a list of the values to test
rmse_as_median (bool, optional) – calculate the RMSE as the median as opposed to the mean, by default False
progressbar (bool, optional) – display a progress bar for the number of tested values, by default True
plot_grids (bool, optional) – plot all the grids of observed and predicted data for each parameter value, by default False
plot_cv (bool, optional) – plot a graph of scores vs parameter values, by default False
results_fname (str, optional) – file name to save results to, by default “tmp” with an attached random number
kwargs (Any)

Returns:

the inversion results, the optimal parameter value, the score associated with it, the parameter values and the scores for each parameter value

Return type:

tuple[tuple[pandas.DataFrame, pandas.DataFrame, dict[str, Any], float], float, float, list[float], list[float]]
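
The general pattern of scoring each candidate value and keeping the one with the lowest score might look like the sketch below. It is not the library's code: `best_parameter` and `toy_score` are hypothetical stand-ins (the real function runs an inversion for each value), and the parameter name `"damping"` is only a placeholder for the string passed in `param_to_test`.

```python
def best_parameter(param_to_test, score_func):
    """Score every candidate value and return the one with the lowest score."""
    name, values = param_to_test
    # Pass each candidate as a keyword argument named after the tested parameter.
    scores = [score_func(**{name: value}) for value in values]
    best_index = min(range(len(scores)), key=scores.__getitem__)
    return values[best_index], scores[best_index], values, scores


def toy_score(damping):
    """Stand-in for a cross-validation score; smallest at damping = 0.1."""
    return (damping - 0.1) ** 2 + 1.0


param_to_test = ("damping", [0.001, 0.01, 0.1, 1.0])
best_value, best_score, values, scores = best_parameter(param_to_test, toy_score)
print(best_value, best_score)
```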

constraints_cv_score(grav_df, constraints_df, rmse_as_median=False, **kwargs)

Find the score, represented by the root mean squared error (RMSE), between the constraint point elevation and the inverted topography at the constraint points. Follows the methods of Uieda and Barbosa[1]. Used in optimization.optimize_inversion_zref_density_contrast().

Parameters:

grav_df (pandas.DataFrame) – gravity dataframe with columns “res”, “reg”, and “gravity_anomaly”
constraints_df (pandas.DataFrame) – constraints dataframe with columns “easting”, “northing”, and “upward”
rmse_as_median (bool, optional) – calculate the RMSE as the median, as opposed to the mean, by default False
kwargs (Any)

Returns:

score (float) – the root mean squared error, between the constraint point elevation and the inverted topography at the constraint points
results (tuple[pandas.DataFrame, pandas.DataFrame, dict[str, typing.Any], float]) – a tuple of the inversion results.

Return type:

tuple[float, tuple[pandas.DataFrame, pandas.DataFrame, dict[str, Any], float]]

References

Uieda and Barbosa[1]

zref_density_optimal_parameter(grav_df, constraints_df, starting_topography=None, zref_values=None, density_contrast_values=None, starting_topography_kwargs=None, regional_grav_kwargs=None, rmse_as_median=False, progressbar=True, plot_cv=False, results_fname=None, **kwargs)

Calculate the cross validation scores for a set of zref and density values and return the best score and values. If only one parameter needs to be tested, pass a single value for the other parameter. This uses constraint points, where the target topography is known. The inverted topography at each of these points is compared to the known value and used to calculate the score.

Parameters:

grav_df (pandas.DataFrame) – dataframe with gravity data and coordinates, must have coordinate columns “easting”, “northing”, and “upward”, and gravity data column “gravity_anomaly”
constraints_df (pandas.DataFrame) – dataframe with points where the topography of interest has been previously measured, to be used for score, must have coordinate columns “easting”, “northing”, and “upward”.
starting_topography (xarray.DataArray | None, optional) – starting topography to use to create the starting prism model. If not provided, will make a flat surface at each provided zref value using the region and spacing values provided in starting_topography_kwargs.
zref_values (list[float] | None, optional) – Reference level values to test, by default None
density_contrast_values (list[float] | None, optional) – Density contrast values to test, by default None
starting_topography_kwargs (dict[str, Any] | None, optional) – region, spacing and dampings used to create a flat starting topography for each zref value, by default None.
regional_grav_kwargs (dict[str, Any] | None, optional) – Keywords used to calculate the regional field, by default None. If method is “constraints” (constraint point minimization), the constraints must be separated into testing and training sets, with the training set provided in this argument and the testing set provided to constraints_df, to avoid biasing the scores.
rmse_as_median (bool, optional) – Use the median instead of the root mean square as the scoring metric, by default False
progressbar (bool, optional) – display a progress bar for the number of tested values, by default True
plot_cv (bool, optional) – plot a graph of scores vs parameter values, by default False
results_fname (str, optional) – file name to save results to, by default “tmp” with an attached random number
kwargs (Any)

Returns:

the inversion results, the optimal parameter value, the score associated with it, the parameter values and the scores for each parameter value

Return type:

tuple[tuple[pandas.DataFrame, pandas.DataFrame, dict[str, Any], float], float, float, float, list[typing.Any], list[float]]
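
A two-parameter search can be sketched in the same spirit. The example below assumes that every combination of zref and density contrast is scored and that the lowest score wins; whether the library enumerates the combinations in exactly this way is an assumption. `best_pair` and `toy_score` are hypothetical, and passing a single-element list shows how only one of the two parameters can be varied.

```python
import itertools


def best_pair(zref_values, density_contrast_values, score_func):
    """Score every (zref, density contrast) pair and return the best one."""
    pairs = list(itertools.product(zref_values, density_contrast_values))
    scores = [score_func(zref, density) for zref, density in pairs]
    best = min(range(len(pairs)), key=scores.__getitem__)
    return pairs[best][0], pairs[best][1], scores[best], pairs, scores


def toy_score(zref, density_contrast):
    """Stand-in for the constraint-point score; zero at (100, 300)."""
    return abs(zref - 100.0) + abs(density_contrast - 300.0)


# Only zref is varied here; the density contrast is held at a single value.
best_zref, best_density, best_score, pairs, scores = best_pair(
    [0.0, 100.0, 200.0], [300.0], toy_score
)
print(best_zref, best_density, best_score)
```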

random_split_test_train(data_df, test_size=0.3, random_state=10, plot=False)

Split data randomly into training and testing sets, with the fraction of points in the test set given by test_size.

Parameters:

data_df (pandas.DataFrame) – data to be split, must have columns “easting” and “northing”.
test_size (float, optional) – fraction of points to put in the testing set, given as a decimal, by default 0.3
random_state (int, optional) – seed used for the random splitting, by default 10
plot (bool, optional) – choose to plot the results, by default False

Returns:

dataframe with a new column “test” which is a boolean value of whether the row is in the training or testing set.

Return type:

pandas.DataFrame
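
As a rough illustration of a seeded random split, the sketch below draws a fixed fraction of row indices as testing points and returns one boolean per row, mirroring the “test” column. It uses only the standard library and a hypothetical helper name, so the exact points selected and the rounding of the test-set size will not match the library's output.

```python
import random


def random_test_flags(n_points, test_size=0.3, random_state=10):
    """Return a list of booleans, True where a row belongs to the testing set."""
    rng = random.Random(random_state)  # seeded, so the split is reproducible
    n_test = round(n_points * test_size)
    test_indices = set(rng.sample(range(n_points), n_test))
    return [i in test_indices for i in range(n_points)]


flags = random_test_flags(10)
print(sum(flags), "testing points out of", len(flags))
```

Calling the helper again with the same `random_state` gives the same flags, which is the purpose of fixing the seed.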

split_test_train(data_df, method, spacing=None, shape=None, n_splits=5, random_state=10, plot=False)

Split data into training or testing sets using either KFold (optionally blocked) or LeaveOneOut methods.

Parameters:

data_df (pandas.DataFrame) – dataframe with coordinate columns “easting” and “northing”
method (str) – choose between “LeaveOneOut” or “KFold” methods.
spacing (float | tuple[float, float] | None, optional) – grid spacing to use for Block K-Folds, by default None
shape (tuple[float, float] | None, optional) – number of blocks to use for Block K-Folds, by default None
n_splits (int, optional) – number of folds to make for the K-Folds method, by default 5
random_state (int, optional) – random state used for both methods, by default 10
plot (bool, optional) – plot the separated training and testing dataset, by default False

Returns:

a dataframe with a new column for each fold in the form fold_0, fold_1 etc., with the value “train” or “test”

Return type:

pandas.DataFrame
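
The fold-column layout of the output can be illustrated with a plain (non-blocked) K-Fold sketch. This is a stdlib approximation with a hypothetical helper name, not the library's splitting logic, and it ignores the spatial blocking options. Setting `n_splits` equal to the number of points gives a leave-one-out style split.

```python
import random


def kfold_labels(n_points, n_splits=5, random_state=10):
    """Return one dict per point with keys fold_0, fold_1, ... set to 'train' or 'test'."""
    indices = list(range(n_points))
    random.Random(random_state).shuffle(indices)
    # Deal the shuffled indices into n_splits groups; each group is the
    # testing set of one fold and part of the training set of all others.
    folds = [set(indices[k::n_splits]) for k in range(n_splits)]
    return [
        {f"fold_{k}": "test" if i in folds[k] else "train" for k in range(n_splits)}
        for i in range(n_points)
    ]


table = kfold_labels(10, n_splits=5)
print(table[0])
```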

kfold_df_to_lists(df)

convert a single dataframe with fold columns in the form fold_0, fold_1 etc. into a list of testing dataframes for each fold and a list of training dataframes for each fold.

Parameters:

df (pandas.DataFrame) – dataframe with fold columns in the form fold_0, fold_1 etc., as output by function split_test_train().

Returns:

test_dfs (list[pandas.DataFrame]) – a list of testing dataframes for each fold
train_dfs (list[pandas.DataFrame]) – a list of training dataframes for each fold

Return type:

tuple[list[pandas.DataFrame], list[pandas.DataFrame]]
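
The conversion from fold columns to per-fold lists can be sketched with rows stored as plain dictionaries instead of a dataframe. The helper name `fold_rows_to_lists` and the sample rows are hypothetical; the return order (testing sets first, then training sets) follows the description above.

```python
def fold_rows_to_lists(rows):
    """Split rows with fold_0, fold_1, ... keys into per-fold test and train lists."""
    fold_names = sorted(
        (key for key in rows[0] if key.startswith("fold_")),
        key=lambda name: int(name.split("_")[1]),  # numeric order, so fold_10 follows fold_9
    )
    test_sets = [[r for r in rows if r[name] == "test"] for name in fold_names]
    train_sets = [[r for r in rows if r[name] == "train"] for name in fold_names]
    return test_sets, train_sets


rows = [
    {"easting": 0.0, "northing": 0.0, "fold_0": "test", "fold_1": "train"},
    {"easting": 1.0, "northing": 0.0, "fold_0": "train", "fold_1": "test"},
    {"easting": 2.0, "northing": 0.0, "fold_0": "test", "fold_1": "train"},
    {"easting": 3.0, "northing": 0.0, "fold_0": "train", "fold_1": "test"},
]
test_sets, train_sets = fold_rows_to_lists(rows)
print(len(test_sets), len(train_sets))
```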

eq_sources_score(coordinates, data, delayed=False, weights=None, **kwargs)

Calculate the cross-validation score for fitting gravity data to equivalent sources. Uses Verde’s cross_val_score function to calculate the score. All kwargs are passed to the harmonica.EquivalentSources class.

Parameters:

coordinates (tuple[numpy.ndarray, numpy.ndarray, numpy.ndarray]) – tuple of easting, northing, and upward coordinates of the gravity data
data (pandas.Series | numpy.ndarray) – the gravity data
delayed (bool, optional) – compute the scores in parallel if True, by default False
weights (numpy.ndarray | None, optional) – optional weight values for each gravity data point, by default None
kwargs (Any)

Keyword Arguments:

damping (float | None) – The positive damping regularization parameter. Controls how much smoothness is imposed on the estimated coefficients. If None, no regularization is used.
points (list[numpy.ndarray] | None) – List containing the coordinates of the equivalent point sources. Coordinates are assumed to be in the following order: (easting, northing, upward). If None, will place one point source below each observation point at a fixed relative depth below the observation point. Defaults to None.
depth (float | str) – Parameter used to control the depth at which the point sources will be located. If a value is provided, each source is located beneath each data point (or block-averaged location) at a depth equal to its elevation minus the depth value. If set to "default", the depth of the sources will be estimated as 4.5 times the mean distance between first neighboring sources. This parameter is ignored if points is specified. Defaults to "default".
block_size (float | tuple[float, float] | None) – Size of the blocks used on block-averaged equivalent sources. If a single value is passed, the blocks will have a square shape. Alternatively, the dimensions of the blocks in the South-North and West-East directions can be specified by passing a tuple. If None, no block-averaging is applied. This parameter is ignored if points are specified. Defaults to None.
parallel (bool) – If True, any predictions and Jacobian building are carried out in parallel through Numba’s jit.prange, reducing the computation time. If False, these tasks will be run on a single CPU. Defaults to True.
dtype (str) – The desired data-type for the predictions and the Jacobian matrix. Defaults to "float64".

Returns:

the score as a float; the higher the value, the better the fit.

Return type:

float

regional_separation_score(testing_df, score_as_median=False, **kwargs)

Evaluate the effectiveness of the gravity regional-residual separation. The optimal regional component is the one that results in a residual component which is lowest at constraint points, while still containing a high amplitude elsewhere.

Parameters:

testing_df (pandas.DataFrame) – dataframe containing a priori measurements of the topography of interest with columns “upward”, “easting”, and “northing”
score_as_median (bool, optional) – switch from using the root mean square to the root median square for the score, by default False
kwargs (Any) – additional keyword arguments for the specified method.

Returns:

residual_constraint_score (float) – the RMS of the residual at constraint points
residual_amplitude_score (float) – the RMS of the residual amplitude at all grid points
true_reg_score (float | None) – the RMSE between the true regional field and the estimated field, if provided, otherwise None
df_anomalies (pandas.DataFrame) – the dataframe of the regional and residual gravity anomalies

Return type:

tuple[float, float, float | None, pandas.DataFrame]