QC checks on sequential reports

marine_qc.do_few_check(value)[source]

Check whether the number of observations is less than 3.

Parameters:

value (SequenceNumberType) – One-dimensional array of values to be analyzed. Can be a sequence (e.g., list or tuple), a one-dimensional NumPy array, or a pandas Series.

Return type:

SequenceIntType

Returns:

SequenceIntType – Same type as input, but with integer values

  • Returns array/sequence/Series of 1s if number of observations is less than 3.

  • Returns array/sequence/Series of 0s otherwise.

Raises:
  • ValueError – If the input is not 1-dimensional.

  • TypeError – If inspect_arrays does not return np.ndarrays.
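
As a rough illustration of the flagging rule described above, the following is a minimal pure-Python sketch. It is not the library's implementation: the name few_check_sketch is hypothetical, and it returns a plain list rather than preserving the input type.

```python
from typing import List, Sequence


def few_check_sketch(value: Sequence[float]) -> List[int]:
    """Flag every report with 1 when fewer than 3 observations exist, else 0.

    Hypothetical helper for illustration only; it does not call marine_qc.
    """
    flag = 1 if len(value) < 3 else 0
    return [flag] * len(value)


# A two-report voyage fails the check; a four-report voyage passes it.
short_voyage = few_check_sketch([14.2, 14.5])
long_voyage = few_check_sketch([14.2, 14.5, 14.1, 13.9])
```

The flag is the same for every element because the test depends only on the length of the sequence, not on the individual values.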

marine_qc.do_spike_check(value, lat, lon, date, max_gradient_space, max_gradient_time, delta_t, n_neighbours)[source]

Perform IQUAM-like spike check.

Parameters:
  • value (SequenceNumberType) – One-dimensional array of values to be analyzed. Can be a sequence (e.g., list or tuple), a one-dimensional NumPy array, or a pandas Series.

  • lat (SequenceNumberType) – One-dimensional array of latitudes in degrees. Can be a sequence (e.g., list or tuple), a one-dimensional NumPy array, or a pandas Series.

  • lon (SequenceNumberType) – One-dimensional array of longitudes in degrees. Can be a sequence (e.g., list or tuple), a one-dimensional NumPy array, or a pandas Series.

  • date (SequenceDatetimeType) – One-dimensional array of datetime values. Can be a sequence (e.g., list or tuple), a one-dimensional NumPy array, or a pandas Series.

  • max_gradient_space (float, default: 0.5) – Maximum allowed spatial gradient. The unit is “units of value” per kilometer.

  • max_gradient_time (float, default: 1.0) – Maximum allowed temporal gradient. The unit is “units of value” per hour.

  • delta_t (float, default: 2.0) – Temperature delta used in the comparison. Typically set to 2.0 for ships and 1.0 for drifting buoys.

  • n_neighbours (int, default: 5) – Number of neighboring points considered in the analysis.

Return type:

SequenceIntType

Returns:

SequenceIntType – Same type as input, but with integer values

  • Returns array/sequence/Series of 1s if the spike check fails.

  • Returns array/sequence/Series of 0s otherwise.

Raises:
  • ValueError – If any input is not 1-dimensional or if the input lengths do not match.

  • TypeError – If inspect_arrays does not return np.ndarrays.

Notes

In previous versions, default values for the parameters were:

  • max_gradient_space: float = 0.5

  • max_gradient_time: float = 1.0

  • delta_t: float = 2.0

  • n_neighbours: int = 5

marine_qc.do_track_check(vsi, dsi, lat, lon, date, max_direction_change, max_speed_change, max_absolute_speed, max_midpoint_discrepancy)[source]

Perform one pass of the track check.

This is an implementation of the MDS track check code, which was originally written in the 1990s.

Parameters:
  • vsi (SequenceNumberType) – One-dimensional reported speed array in km/h. Can be a sequence (e.g., list or tuple), a one-dimensional NumPy array, or a pandas Series.

  • dsi (SequenceNumberType) – One-dimensional reported heading array in degrees. Can be a sequence (e.g., list or tuple), a one-dimensional NumPy array, or a pandas Series.

  • lat (SequenceNumberType) – One-dimensional latitude array in degrees. Can be a sequence (e.g., list or tuple), a one-dimensional NumPy array, or a pandas Series.

  • lon (SequenceNumberType) – One-dimensional longitude array in degrees. Can be a sequence (e.g., list or tuple), a one-dimensional NumPy array, or a pandas Series.

  • date (SequenceDatetimeType) – One-dimensional date array. Can be a sequence (e.g., list or tuple), a one-dimensional NumPy array, or a pandas Series.

  • max_direction_change (float, default: 60.0) – Maximum valid direction change in degrees.

  • max_speed_change (float, default: 10.0) – Maximum valid speed change in km/h.

  • max_absolute_speed (float, default: 40.0) – Maximum valid absolute speed in km/h.

  • max_midpoint_discrepancy (float, default: 150.0) – Maximum valid midpoint discrepancy in meters.

Return type:

SequenceIntType

Returns:

SequenceIntType – Same type as input, but with integer values

  • Returns array/sequence/Series of 1s if the track check fails.

  • Returns array/sequence/Series of 0s otherwise.

Raises:
  • ValueError – If any input is not 1-dimensional or if the input lengths do not match.

  • TypeError – If inspect_arrays does not return np.ndarrays.

Notes

If the number of observations is less than three, the track check always passes.

In previous versions, the default values of the parameters were:

  • max_direction_change = 60.0

  • max_speed_change = 10.0

  • max_absolute_speed = 40.0

  • max_midpoint_discrepancy = 150.0

marine_qc.do_iquam_track_check(lat, lon, date, speed_limit, delta_d, delta_t, n_neighbours)[source]

Perform the IQUAM track check as detailed in Xu and Ignatov 2013.

The track check calculates speeds between pairs of observations and counts how many exceed a threshold speed. The observation with the most violations of this limit is flagged as bad and removed from the calculation. Then the next worst is found and removed, and so on until no violations remain.

Parameters:
  • lat (SequenceNumberType) – One-dimensional latitude array in degrees. Can be a sequence (e.g., list or tuple), a one-dimensional NumPy array, or a pandas Series.

  • lon (SequenceNumberType) – One-dimensional longitude array in degrees. Can be a sequence (e.g., list or tuple), a one-dimensional NumPy array, or a pandas Series.

  • date (SequenceDatetimeType) – One-dimensional date array. Can be a sequence (e.g., list or tuple), a one-dimensional NumPy array, or a pandas Series.

  • speed_limit (float) – Speed limit of platform in kilometers per hour. Typically, 60.0 for ships and 15.0 for drifting buoys.

  • delta_d (float) – Latitude tolerance in degrees.

  • delta_t (float) – Time tolerance in hundredths of an hour.

  • n_neighbours (int) – Number of neighbouring points considered in the analysis.

Return type:

SequenceIntType

Returns:

SequenceIntType – Same type as input, but with integer values

  • Returns array/sequence/Series of 1s if the IQUAM QC fails.

  • Returns array/sequence/Series of 0s otherwise.

Raises:
  • ValueError – If any input is not 1-dimensional or if the input lengths do not match.

  • TypeError – If inspect_arrays does not return np.ndarrays.

Notes

Previous versions had default values for the parameters of:

  • speed_limit = 60.0 for ships and 15.0 for drifting buoys

  • delta_d = 1.11

  • delta_t = 0.01

  • n_neighbours = 5
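
The iterative removal described for this check can be sketched in plain Python as below. This is a simplified illustration under stated assumptions, not the library's code: the names haversine_km and iquam_track_sketch are hypothetical, all pairs are compared rather than only n_neighbours neighbours, the delta_d and delta_t tolerances are left out, and pairs with identical timestamps are skipped.

```python
import math
from datetime import datetime, timedelta
from typing import List, Sequence

EARTH_RADIUS_KM = 6371.0


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance between two points in kilometres."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def iquam_track_sketch(
    lat: Sequence[float],
    lon: Sequence[float],
    date: Sequence[datetime],
    speed_limit: float,
) -> List[int]:
    """Repeatedly flag the report with the most speed violations (hypothetical helper)."""
    flags = [0] * len(lat)
    active = list(range(len(lat)))  # indices still taking part in the comparison
    while True:
        counts = {i: 0 for i in active}
        for pos, i in enumerate(active):
            for j in active[pos + 1:]:
                hours = abs((date[j] - date[i]).total_seconds()) / 3600.0
                if hours == 0.0:
                    continue  # simplification: skip simultaneous reports
                speed = haversine_km(lat[i], lon[i], lat[j], lon[j]) / hours
                if speed > speed_limit:
                    counts[i] += 1
                    counts[j] += 1
        if not counts or max(counts.values()) == 0:
            break  # no violations remain
        worst = max(counts, key=counts.get)  # ties resolve to the earliest index
        flags[worst] = 1
        active.remove(worst)
    return flags


start = datetime(2020, 1, 1)
dates = [start + timedelta(hours=h) for h in range(4)]
# The third report sits roughly 1500 km away from its neighbours.
track_flags = iquam_track_sketch(
    lat=[0.0, 0.0, 10.0, 0.0],
    lon=[0.0, 0.1, 10.0, 0.3],
    date=dates,
    speed_limit=60.0,
)
```

In this example the outlying third report violates the speed limit against all three other reports, so it is flagged and removed first, after which the remaining pairs are all within the limit.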

marine_qc.find_saturated_runs(at, dpt, lat, lon, date, min_time_threshold, shortest_run)[source]

Perform checks on persistence of 100% rh while going through the voyage.

While going through the voyage, repeated strings of 100% rh (AT == DPT) are noted. If a string extends beyond 20 reports and two days (48 hours) in time, then all values are set to fail the repsat QC flag.

Parameters:
  • at (SequenceNumberType) – One-dimensional air temperature array. Can be a sequence (e.g., list or tuple), a one-dimensional NumPy array, or a pandas Series.

  • dpt (SequenceNumberType) – One-dimensional dew point temperature array. Can be a sequence (e.g., list or tuple), a one-dimensional NumPy array, or a pandas Series.

  • lat (SequenceNumberType) – One-dimensional latitude array in degrees. Can be a sequence (e.g., list or tuple), a one-dimensional NumPy array, or a pandas Series.

  • lon (SequenceNumberType) – One-dimensional longitude array in degrees. Can be a sequence (e.g., list or tuple), a one-dimensional NumPy array, or a pandas Series.

  • date (SequenceDatetimeType) – One-dimensional date array. Can be a sequence (e.g., list or tuple), a one-dimensional NumPy array, or a pandas Series.

  • min_time_threshold (float, default: 48.0) – Minimum time threshold in hours.

  • shortest_run (int, default: 4) – Shortest number of observations.

Return type:

SequenceIntType

Returns:

SequenceIntType – Same type as input, but with integer values

  • Returns array/sequence/Series of 1s if a saturated run is found.

  • Returns array/sequence/Series of 0s otherwise.

Raises:
  • ValueError – If any input is not 1-dimensional or if the input lengths do not match.

  • TypeError – If inspect_arrays does not return np.ndarrays.

Notes

In previous versions, default values for the parameters were:

  • min_time_threshold = 48.0

  • shortest_run = 4
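
One way to picture the run detection is the pure-Python sketch below. It is an illustration under assumptions rather than the library's implementation: saturated_runs_sketch is a hypothetical name, lat and lon are omitted, and a run is flagged only when it is strictly longer than both shortest_run reports and min_time_threshold hours, which is a guess at how "extends beyond" is applied.

```python
from datetime import datetime, timedelta
from typing import List, Sequence


def saturated_runs_sketch(
    at: Sequence[float],
    dpt: Sequence[float],
    date: Sequence[datetime],
    min_time_threshold: float = 48.0,
    shortest_run: int = 4,
) -> List[int]:
    """Flag reports inside long runs where AT == DPT (hypothetical helper)."""
    flags = [0] * len(at)
    run_start = None
    # Iterate one step past the end so a run reaching the last report is closed too.
    for i in range(len(at) + 1):
        saturated = i < len(at) and at[i] == dpt[i]
        if saturated and run_start is None:
            run_start = i
        elif not saturated and run_start is not None:
            run_end = i - 1
            n_reports = run_end - run_start + 1
            hours = (date[run_end] - date[run_start]).total_seconds() / 3600.0
            # Assumption: both limits must be strictly exceeded.
            if n_reports > shortest_run and hours > min_time_threshold:
                for k in range(run_start, i):
                    flags[k] = 1
            run_start = None
    return flags


start = datetime(2020, 1, 1)
daily = [start + timedelta(hours=24 * i) for i in range(6)]
# Five saturated daily reports (96 hours) followed by one unsaturated report.
long_run_flags = saturated_runs_sketch([20.0] * 5 + [21.0], [20.0] * 5 + [18.0], daily)
# Two saturated reports are too few to count as a run.
short_run_flags = saturated_runs_sketch([20.0, 20.0, 21.0], [20.0, 20.0, 18.0], daily[:3])
```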

marine_qc.find_multiple_rounded_values(value, min_count, threshold)[source]

Find instances when more than “threshold” of the observations are whole numbers and set the ‘round’ flag.

Used in the humidity QC where there are times when the values are rounded and this may have caused a bias.

Parameters:
  • value (SequenceNumberType) – One-dimensional array of values. Can be a sequence (e.g., list or tuple), a one-dimensional NumPy array, or a pandas Series.

  • min_count (int, default: 20) – Minimum number of rounded figures that will trigger the test.

  • threshold (float, default: 0.5) – Minimum fraction of all observations that will trigger the test.

Return type:

SequenceIntType

Returns:

SequenceIntType – Same type as input, but with integer values

  • Returns array/sequence/Series of 1s if the value is a whole number.

  • Returns array/sequence/Series of 0s otherwise.

Raises:
  • ValueError – If threshold is not between 0.0 and 1.0.

  • TypeError – If inspect_arrays does not return np.ndarrays.

Notes

Previous versions had default values for the parameters of:

  • min_count = 20

  • threshold = 0.5
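
The interplay of min_count and threshold can be sketched as follows. This is a hedged illustration, not the library's code: rounded_values_sketch is a hypothetical name, and the choice of "at least min_count" and "more than threshold" as the comparison operators is an assumption.

```python
from typing import List, Sequence


def rounded_values_sketch(
    value: Sequence[float],
    min_count: int = 20,
    threshold: float = 0.5,
) -> List[int]:
    """Flag whole-number values when they dominate the record (hypothetical helper)."""
    if not 0.0 <= threshold <= 1.0:
        raise ValueError("threshold must be between 0.0 and 1.0")
    is_whole = [float(v).is_integer() for v in value]
    n_whole = sum(is_whole)
    # Assumption: the test triggers on at least min_count whole numbers
    # that make up more than the threshold fraction of all observations.
    triggered = (
        len(value) > 0
        and n_whole >= min_count
        and n_whole / len(value) > threshold
    )
    return [1 if triggered and whole else 0 for whole in is_whole]


humidity = [10.0, 10.0, 10.0, 10.5]
# Three of four values are whole numbers, so a low min_count triggers the test.
rounded_flags = rounded_values_sketch(humidity, min_count=2, threshold=0.5)
# With the documented default min_count of 20, this short record is never flagged.
default_flags = rounded_values_sketch(humidity)
```

Only the whole-number values receive a 1; the fractional value stays at 0 even when the test triggers.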

marine_qc.find_repeated_values(value, min_count, threshold)[source]

Find cases where more than a given proportion of SSTs have the same value.

This function goes through a voyage and finds any cases where more than a threshold fraction of the observations have the same values for a specified variable.

Parameters:
  • value (SequenceNumberType) – One-dimensional array of values. Can be a sequence (e.g., list or tuple), a one-dimensional NumPy array, or a pandas Series.

  • min_count (int, default: 20) – Minimum number of repeated values that will trigger the test.

  • threshold (float, default: 0.7) – Smallest fraction of all observations that will trigger the test.

Return type:

SequenceIntType

Returns:

SequenceIntType – Same type as input, but with integer values

  • Returns array/sequence/Series of 1s if the value is repeated.

  • Returns array/sequence/Series of 0s otherwise.

Raises:
  • ValueError – If threshold is not between 0.0 and 1.0.

  • TypeError – If inspect_arrays does not return np.ndarrays.

Notes

Previous versions had default values for the parameters of:

  • min_count = 20

  • threshold = 0.7
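
A minimal pure-Python sketch of the repeated-value test is shown below. It is illustrative only: repeated_values_sketch is a hypothetical name, and the comparison operators ("at least min_count" and "more than threshold") are assumptions about how the limits are applied.

```python
from collections import Counter
from typing import List, Sequence


def repeated_values_sketch(
    value: Sequence[float],
    min_count: int = 20,
    threshold: float = 0.7,
) -> List[int]:
    """Flag values that repeat across most of a voyage (hypothetical helper)."""
    if not 0.0 <= threshold <= 1.0:
        raise ValueError("threshold must be between 0.0 and 1.0")
    counts = Counter(value)
    n_obs = len(value)
    # Assumption: a value counts as repeated when it occurs at least min_count
    # times and accounts for more than the threshold fraction of observations.
    repeated = {
        v for v, c in counts.items() if c >= min_count and c / n_obs > threshold
    }
    return [1 if v in repeated else 0 for v in value]


sst = [15.0, 15.0, 15.0, 15.0, 15.3]
# Four of five values are identical (fraction 0.8), so they are flagged.
repeat_flags = repeated_values_sketch(sst, min_count=3, threshold=0.7)
# With the documented default min_count of 20, this short record is never flagged.
default_repeat_flags = repeated_values_sketch(sst)
```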

marine_qc.do_multiple_sequential_check(data, groupby=None, qc_dict=None, preproc_dict=None, return_method='all')[source]

Apply one or more sequential quality-control (QC) functions to groups of a DataFrame or Series.

Typically for time-ordered or track-based checks.

Parameters:
  • data (pd.Series or pd.DataFrame) – Hashable input data.

  • groupby (str, iterable of str, or pandas GroupBy, optional) – Specifies how the data should be grouped before applying QC functions. If a string or iterable of strings, data.groupby is called on those keys. If a pandas.DataFrameGroupBy object is provided, its groups are used directly. Any groups that contain indices not present in data are automatically trimmed. If None, the entire input data is treated as a single group. For more information see Examples.

  • qc_dict (Mapping, optional) – Nested QC dictionary. Keys represent arbitrary user-specified names for the checks. The values are dictionaries which contain the keys “func” (name of the QC function), “names” (input data names as keyword arguments, that will be retrieved from data) and, if necessary, “arguments” (the corresponding keyword arguments).

  • preproc_dict (Mapping, optional) – Nested pre-processing dictionary. Keys represent variable names that can be used by qc_dict. The values are dictionaries which contain the keys “func” (name of the pre-processing function), “names” (input data names as keyword arguments, that will be retrieved from data), and “inputs” (list of input-given variables). For more information see Examples.

  • return_method ({"all", "passed", "failed"}, default: "all") – If “all”, return QC dictionary containing all requested QC check flags. If “passed”, return QC dictionary containing all requested QC check flags until the first check passes. Other QC checks are flagged as untested (3). If “failed”, return QC dictionary containing all requested QC check flags until the first check fails. Other QC checks are flagged as untested (3).

Return type:

DataFrame | Series

Returns:

pd.DataFrame or pd.Series – A DataFrame (or Series if the input was a Series) whose columns correspond to the QC names in qc_dict and whose values contain QC flags for each row. Flags depend on the QC functions used.

Raises:
  • NameError – If a function listed in qc_dict or preproc_dict is not defined. If columns listed in qc_dict or preproc_dict are not available in data.

  • ValueError – If return_method is not one of [“all”, “passed”, “failed”]. If variable names listed in qc_dict or preproc_dict are not valid parameters of the QC function.

Notes

If a variable is pre-processed using preproc_dict, mark the variable name as “__preprocessed__” in qc_dict. For example: “climatology”: “__preprocessed__”.

For more information, see do_multiple_individual_checks().
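
To make the nested layout of qc_dict more concrete, here is a hedged sketch of how such a dictionary might be assembled. The check names ("few", "iquam_track") and the column names ("sst", "lat", "lon", "date") are hypothetical, and the assumption that "names" maps function parameters to column names is an interpretation of the description above. The argument values are the previous defaults listed for do_iquam_track_check.

```python
# Hypothetical QC configuration; keys of the outer dict are user-chosen labels.
qc_dict = {
    "few": {
        "func": "do_few_check",
        # Assumed form: QC function parameter -> column name in `data`.
        "names": {"value": "sst"},
    },
    "iquam_track": {
        "func": "do_iquam_track_check",
        "names": {"lat": "lat", "lon": "lon", "date": "date"},
        # Extra keyword arguments passed to the QC function.
        "arguments": {
            "speed_limit": 60.0,
            "delta_d": 1.11,
            "delta_t": 0.01,
            "n_neighbours": 5,
        },
    },
}

# The call itself is not executed here; "platform_id" is a hypothetical grouping column.
# flags = marine_qc.do_multiple_sequential_check(
#     data, groupby="platform_id", qc_dict=qc_dict, return_method="all"
# )
```

Under these assumptions, the result would have one column per outer key of qc_dict, each holding the flags produced by the corresponding check within each group.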