QC checks on grouped reports
- marine_qc.do_mds_buddy_check(lat, lon, date, value, climatology, standard_deviation, limits, number_of_obs_thresholds, multipliers, ignore_indexes=None)
Do the old-style MDS buddy check.
The buddy check compares an observation to the average of its near neighbours (called the buddy mean). Depending on how many neighbours there are and their proximity to the observation being tested a multiplier is set. If the difference between the observation and the buddy mean is larger than the multiplier times the standard deviation then the observation fails the buddy check. If no buddy observations are found within the specified limits, then the limits are expanded until the check runs out of specified limits or observations are found within the limits.
- Parameters:
lat (SequenceNumberType) – 1-dimensional latitude array.
lon (SequenceNumberType) – 1-dimensional longitude array.
date (SequenceDatetimeType) – 1-dimensional date array.
value (SequenceNumberType) – 1-dimensional anomaly array.
climatology (ClimArgType) – The climatological average(s) used to calculate anomalies. Can be a scalar, a sequence, a one-dimensional NumPy array, a pandas Series, a Climatology, a path-like string on disk, an xarray Dataset, or an xarray DataArray.
standard_deviation (Climatology) – Field of 1x1xpentad standard deviations.
limits (list[list]) – A list of lists. Each member is a three-membered list specifying the longitudinal, latitudinal, and time range within which buddies are sought at each level of search.
number_of_obs_thresholds (list[list]) – Number of observations corresponding to each multiplier in multipliers. The outer list should be the same length as the limits list.
multipliers (list[list]) – Multiplier, x, used for the buddy check mu +- x * sigma. The list should have the same structure as number_of_obs_thresholds.
ignore_indexes (list[int], optional) – List of row numbers to be skipped.
- Returns:
SequenceIntType – Same type as the input, but with integer values: 1 where the MDS buddy check fails and 0 otherwise.
- Raises:
TypeError – If inspect_arrays does not return np.ndarrays.
Notes
The limits, number_of_obs_thresholds, and multipliers parameters are rather complex. The buddy check basically looks within a lat-lon-time range specified by the first element in limits. If there are more than zero observations in the search range then a multiplier is chosen based on how many observations there are.
If the first element of limits is [1, 1, 2], then we first look within a distance equivalent to 1 degree of latitude and longitude at the equator and 2 pentads in time. If there are more than zero observations, we calculate the buddy mean and consult the corresponding entry of number_of_obs_thresholds. If, for example, this is [0, 5, 15, 100], then we look for the highest threshold that the number of observations exceeds and take the multiplier at the same position in the appropriate list (say [4, 3.5, 3.0, 2.5]). So, if there were 10 observations, the threshold would be 5 and the multiplier would be 3.5. If the difference between an observation and the buddy mean is greater than the multiplier times the standard deviation at that point, then the observation fails the buddy check.
Previous versions had default values for the parameters of:
limits = [[1, 1, 2], [2, 2, 2], [1, 1, 4], [2, 2, 4]]
number_of_obs_thresholds = [[0, 5, 15, 100], [0], [0, 5, 15, 100], [0]]
multipliers = [[4.0, 3.5, 3.0, 2.5], [4.0], [4.0, 3.5, 3.0, 2.5], [4.0]]
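The lookup described in these Notes can be sketched in plain Python. This is an illustration, not the library's implementation. The helper names `select_multiplier` and `fails_buddy_check` are hypothetical, and so is the `buddy_counts` input, which stands in for the number of buddies the real check finds at each level of limits. It also assumes that the multiplier is taken from the highest threshold that the count exceeds, which is what the 10-observation example implies.

```python
from typing import Optional, Sequence


def select_multiplier(
    buddy_counts: Sequence[int],
    number_of_obs_thresholds: Sequence[Sequence[int]],
    multipliers: Sequence[Sequence[float]],
) -> Optional[float]:
    """Pick a multiplier from the first search level that contains buddies.

    buddy_counts[i] is a hypothetical, precomputed number of buddies found
    within limits[i]; the real check derives it from lat, lon and date.
    """
    for count, thresholds, level_multipliers in zip(
        buddy_counts, number_of_obs_thresholds, multipliers
    ):
        if count == 0:
            continue  # no buddies at this level: expand to the next limits
        chosen = None
        for threshold, multiplier in zip(thresholds, level_multipliers):
            if count > threshold:
                chosen = multiplier  # keep the highest threshold exceeded
        return chosen
    return None  # ran out of limits without finding any buddies


def fails_buddy_check(
    anomaly: float, buddy_mean: float, stdev: float, multiplier: float
) -> bool:
    """Fail when |anomaly - buddy mean| exceeds multiplier * stdev."""
    return abs(anomaly - buddy_mean) > multiplier * stdev


# Values quoted above as the defaults of previous versions.
number_of_obs_thresholds = [[0, 5, 15, 100], [0], [0, 5, 15, 100], [0]]
multipliers = [[4.0, 3.5, 3.0, 2.5], [4.0], [4.0, 3.5, 3.0, 2.5], [4.0]]

# Ten buddies found at the first search level.
multiplier = select_multiplier([10, 0, 0, 0], number_of_obs_thresholds, multipliers)
print(multiplier)
print(fails_buddy_check(4.2, 0.5, 1.0, multiplier))
```

Returning None when every level is empty is a choice made for this sketch only. The source does not say which flag the real function assigns in that case.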
- marine_qc.do_bayesian_buddy_check(lat, lon, date, value, climatology, stdev1, stdev2, stdev3, prior_probability_of_gross_error, quantization_interval, one_sigma_measurement_uncertainty, limits, noise_scaling, maximum_anomaly, fail_probability, ignore_indexes=None)
Do the Bayesian buddy check.
The Bayesian buddy check assigns a probability of gross error to each observation. This probability is rounded down to the nearest tenth and then multiplied by 10 to yield a flag between 0 and 9.
- Parameters:
lat (SequenceNumberType) – 1-dimensional latitude array.
lon (SequenceNumberType) – 1-dimensional longitude array.
date (SequenceDatetimeType) – 1-dimensional date array.
value (SequenceNumberType) – 1-dimensional anomaly array.
climatology (ClimArgType) – The climatological average(s) used to calculate anomalies. Can be a scalar, a sequence, a one-dimensional NumPy array, a pandas Series, a Climatology, a path-like string on disk, an xarray Dataset, or an xarray DataArray.
stdev1 (Climatology) – Field of standard deviations of the difference between the target gridcell and the complete neighbour average (grid area to neighbourhood difference).
stdev2 (Climatology) – Field of standard deviations of the difference between a single observation and the target gridcell average (point to grid area difference).
stdev3 (Climatology) – Field of standard deviations of the difference between a random neighbour gridcell and the full neighbour average (uncertainty in neighbour average).
prior_probability_of_gross_error (float) – Prior probability of gross error, which is the background rate of gross errors.
quantization_interval (float) – Smallest possible increment in the input values.
one_sigma_measurement_uncertainty (float) – Estimated one sigma measurement uncertainty.
limits (list[int]) – List with three members which specify the search range for the buddy check.
noise_scaling (float) – Tuning parameter used to multiply stdev2. This was determined to be approximately 3.0 by comparison with observed point data. stdev2 was estimated from OSTIA data and typically underestimates the point to area-average difference by this factor.
maximum_anomaly (float) – Largest absolute anomaly. Assumes that the maximum and minimum anomalies have the same magnitude.
fail_probability (float) – Probability of gross error that corresponds to a failed test. Anything with a probability of gross error greater than fail_probability will be considered failing.
ignore_indexes (list[int], optional) – List of row numbers to be skipped.
- Returns:
SequenceIntType – Same type as the input, but with integer values: 2 where there are no buddies in the specified limits, 1 where the Bayesian buddy check fails, and 0 otherwise.
- Raises:
TypeError – If inspect_arrays does not return np.ndarrays.
Notes
In previous versions the default values for the parameters were:
prior_probability_of_gross_error = 0.05
quantization_interval = 0.1
limits = [2, 2, 4]
noise_scaling = 3.0
one_sigma_measurement_uncertainty = 1.0
maximum_anomaly = 8.0
fail_probability = 0.3
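The two rules stated for the Bayesian check, turning a probability of gross error into a 0 to 9 flag and comparing it against fail_probability, can be sketched as follows. The helper names are hypothetical and this is not the library's code. Clamping the flag to 9 so that a probability of exactly 1.0 stays in the documented range is an assumption of this sketch.

```python
import math


def probability_to_flag(p_gross_error: float) -> int:
    """Round a probability down to the tenth and scale it to a 0-9 flag."""
    flag = math.floor(p_gross_error * 10)
    # Assumption of this sketch: clamp so that p == 1.0 maps to 9, not 10.
    return min(max(flag, 0), 9)


def is_failure(p_gross_error: float, fail_probability: float = 0.3) -> int:
    """Return 1 if the probability strictly exceeds fail_probability, else 0."""
    return 1 if p_gross_error > fail_probability else 0


for p in (0.05, 0.37, 0.99):
    print(p, probability_to_flag(p), is_failure(p))
```

The default of 0.3 for fail_probability mirrors the value listed above for previous versions. The comparison is strict, following the wording "greater than fail_probability".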
- marine_qc.do_multiple_grouped_check(data, qc_dict=None, preproc_dict=None, return_method='all')
Apply one or more buddy-check quality-control (QC) functions to a DataFrame or Series.
- Parameters:
data (pd.Series or pd.DataFrame) – Hashable input data.
qc_dict (Mapping, optional) – Nested QC dictionary. Keys represent arbitrary user-specified names for the checks. The values are dictionaries which contain the keys “func” (name of the QC function), “names” (input data names as keyword arguments, which will be retrieved from data) and, if necessary, “arguments” (the corresponding keyword arguments). For more information see Examples.
preproc_dict (Mapping, optional) – Nested pre-processing dictionary. Keys represent variable names that can be used by qc_dict. The values are dictionaries which contain the keys “func” (name of the pre-processing function), “names” (input data names as keyword arguments, which will be retrieved from data), and “inputs” (list of input-given variables). For more information see Examples.
return_method ({"all", "passed", "failed"}, default: "all") – If “all”, return a QC dictionary containing all requested QC check flags. If “passed”, return a QC dictionary containing all requested QC check flags until the first check passes; the remaining QC checks are flagged as untested (3). If “failed”, return a QC dictionary containing all requested QC check flags until the first check fails; the remaining QC checks are flagged as untested (3).
- Returns:
pd.DataFrame or pd.Series – A DataFrame (or Series if the input was a Series) whose columns correspond to the QC names in qc_dict and whose values contain QC flags for each row. Flags depend on the QC functions used.
- Raises:
NameError – If a function listed in qc_dict or preproc_dict is not defined, or if columns listed in qc_dict or preproc_dict are not available in data.
ValueError – If return_method is not one of [“all”, “passed”, “failed”], or if variable names listed in qc_dict or preproc_dict are not valid parameters of the QC function.
Notes
If a variable is pre-processed using preproc_dict, mark the variable name as “__preprocessed__” in qc_dict. For example: “climatology”: “__preprocessed__”.
For more information, see do_multiple_individual_checks().
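The effect of return_method on a single row can be illustrated with a small stand-alone sketch. It is not the library's implementation. The helper `apply_return_method` is hypothetical, and the sketch assumes that checks are evaluated in dictionary order, that 0 means passed and 1 means failed, and that the check which triggers the stop keeps its own flag.

```python
UNTESTED = 3  # flag value the documentation gives for untested checks


def apply_return_method(flags: dict, return_method: str = "all") -> dict:
    """Mask the flags of one row according to return_method.

    flags maps user-chosen check names to their flags, in evaluation order.
    """
    if return_method not in ("all", "passed", "failed"):
        raise ValueError(f"Invalid return_method: {return_method!r}")
    if return_method == "all":
        return dict(flags)

    # Assumption of this sketch: 0 means passed and 1 means failed.
    stop_flag = 0 if return_method == "passed" else 1
    result = {}
    stopped = False
    for name, flag in flags.items():
        if stopped:
            result[name] = UNTESTED  # checks after the stop are not reported
        else:
            result[name] = flag
            stopped = flag == stop_flag
    return result


row_flags = {"check_a": 0, "check_b": 1, "check_c": 0}
for method in ("all", "passed", "failed"):
    print(method, apply_return_method(row_flags, method))
```

With "failed", check_c is reported as untested because check_b has already failed. With "passed", everything after check_a is reported as untested.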