Roslin Bioinformatics - VIPER

Inheritance Checking: Exploring and Cleaning Inconsistent Data (using 'Masking')

The Genotype Checking algorithm is initially run when a genotype file is loaded. Thereafter the user controls when inheritance inconsistencies are recalculated after changes have been made to the data, which limits the amount of distracting and time-consuming recalculation and redrawing. Several interactive controls allow the user to 'Mask' (i.e. reversibly remove) suspect data points.


Masking Data

In order to confirm and then delete suspect data points, VIPER uses the concept of 'Masking' data. Masking means the reversible deletion of data; it allows the effect of a deletion to be tested by reapplying the inheritance algorithm to the altered dataset and confirming whether the data has been cleaned satisfactorily. By applying masking to successive suspect data points highlighted in the interface, VIPER can be used to iteratively explore and remove all the inheritance inconsistencies reported for a dataset.
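The idea of masking as reversible deletion can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class `MaskedGenotypes`, the function `is_consistent` and the individual and marker names are hypothetical, and the single-trio Mendelian check is a simple stand-in, not VIPER's own genotype checking algorithm.

```python
from dataclasses import dataclass, field


@dataclass
class MaskedGenotypes:
    """Genotype calls plus a reversible mask (hypothetical structure).

    Masked calls are hidden from the checker but never deleted,
    so removing the mask restores the original data.
    """
    calls: dict                      # (individual, marker) -> (allele, allele)
    masked: set = field(default_factory=set)

    def mask(self, individual, marker):
        self.masked.add((individual, marker))

    def unmask(self, individual, marker):
        self.masked.discard((individual, marker))

    def get(self, individual, marker):
        key = (individual, marker)
        return None if key in self.masked else self.calls.get(key)


def is_consistent(child, sire, dam):
    """Simple trio check for one marker; a missing or masked genotype (None)
    places no constraint on the result."""
    if child is None:
        return True

    def can_give(parent, allele):
        return parent is None or allele in parent

    a, b = child
    return (can_give(sire, a) and can_give(dam, b)) or \
           (can_give(sire, b) and can_give(dam, a))


data = MaskedGenotypes({
    ("sire", "M1"): ("A", "A"),
    ("dam", "M1"): ("A", "B"),
    ("kid", "M1"): ("B", "B"),   # the sire cannot supply a B allele
})


def trio_ok():
    return is_consistent(data.get("kid", "M1"),
                         data.get("sire", "M1"),
                         data.get("dam", "M1"))


print(trio_ok())            # False: inconsistency reported
data.mask("sire", "M1")
print(trio_ok())            # True: masking the sire genotype clears it
data.unmask("sire", "M1")
print(trio_ok())            # False: the mask was reversible
```

Because the mask is kept separately from the calls, testing a suspected error costs nothing: the original data is untouched whichever way the test comes out.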

There are four reversible mechanisms for removing data.

  1. Filtering unreliable markers from the dataset (using Errorgram Filters).
  2. Masking markers individually by selection in the 'Marker Table'.
  3. Masking genotypes for a specific individual, either for a single marker or for all markers, as described below.
  4. Removing pedigree relationships for individuals, i.e. paternal and/or maternal relationships, as described below.

Data cleaning is complete when the minimal set of maskings has been applied that removes all sources of inheritance inconsistency reported by the genotype checking algorithm. The cleaned dataset, from which all masked data points are excluded, can then be exported as separate pedigree genotype (and log) files.

Errorgram Filters

Applying either of the 'Marker Errorgram' filters applies masks to exclude markers above or below the chosen threshold (Figure 1). Recalculation and redrawing of the pedigree and tabulated data is automatically synchronized with the sliders, but can be decoupled by un-ticking the right-hand selection box, so that the algorithm is only updated once the mouse is released. (This is preferable for large datasets, where recalculation and redrawing take several seconds.) The two filters are coupled: filtering with the upper 'Master Marker Errorgram' changes the window of markers displayed in the middle 'Filtered Marker Errorgram'. In general the Master filter is used to delete poor quality markers from the dataset, whilst the Filtered control is just used to temporarily select particular marker sets for detailed analysis.

The lower errorgram simply controls the threshold at which errors are highlighted on the individuals and families in the pedigree, and repainting should be responsive even with large datasets.

Figure 1. Errorgram Sliders act as filters.

These errorgram filter settings have excluded markers with more than 140 or fewer than 30 errors from the analysis. The bottom errorgram has set the error colour map to highlight individuals with more than 3 errors. Note that algorithm recalculation has been decoupled from the action of the middle errorgram by unchecking the repaint box on the right-hand side.
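The effect of the slider settings in Figure 1 can be expressed as a simple threshold filter. The following Python sketch is illustrative only: the function names `filter_markers` and `highlighted_individuals`, the marker and individual names and the error counts are all invented for the example and do not come from VIPER.

```python
def filter_markers(error_counts, lower, upper):
    """Split markers into kept and masked sets by error count.

    Markers with fewer than `lower` or more than `upper` errors are masked;
    counts equal to either threshold are kept.
    """
    kept = {m for m, n in error_counts.items() if lower <= n <= upper}
    masked = set(error_counts) - kept
    return kept, masked


def highlighted_individuals(errors_per_individual, threshold):
    """Individuals with more than `threshold` errors get the error colour."""
    return {i for i, n in errors_per_individual.items() if n > threshold}


# Invented error counts per marker, filtered with the Figure 1 thresholds.
marker_errors = {"M1": 12, "M2": 30, "M3": 75, "M4": 140, "M5": 163}
kept, masked = filter_markers(marker_errors, lower=30, upper=140)
print(sorted(kept))     # ['M2', 'M3', 'M4']
print(sorted(masked))   # ['M1', 'M5']

# Invented error counts per individual, highlighted above 3 errors.
individual_errors = {"I1": 0, "I2": 3, "I3": 7}
print(sorted(highlighted_individuals(individual_errors, threshold=3)))  # ['I3']
```

Note that the two operations differ in kind: the marker filter changes which data the inheritance algorithm sees, whereas the highlighting threshold only changes how existing results are painted.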

Masking on Selected Individuals

Right mouse-clicking on an offspring or family (hexagon) or parent (rectangle) icon in the pedigree displays a menu which allows the user to mask (remove) all genotypes for the individual, to mask the asserted sire (paternity) link or to mask the asserted dam (maternity) link (Figure 2a). [If the pedigree data is currently focused on a single marker rather than aggregate data, only the genotype for that marker will be masked, and it is not possible to remove parent links in this view.]

Figure 2a: Pedigree before selecting Mask operations

The right-mouse-click context menu is shown for individual G722, with the three masking possibilities for this individual. The action of 'Mask Individual' depends on whether the aggregate marker view or a focused marker view is active.

Recalculation is not immediately applied when masks are applied to an individual or family (nor when masks are applied by selecting markers in the 'Marker Table', see below, Figure 4a). Once any object is masked, the 'Recalculate Errors' button becomes active and highlighted (Figure 2b), allowing the user to force error recalculation once they have completed the desired masking operations. This allows the user to apply (or remove) several maskings in one operation. Maskings are reverted using the same menu controls, or the current set of changes can be abandoned by using the 'Undo Recent Masking's' button.

Figure 2b: Pedigree with a set of Mask operations awaiting recalculation

Blue glyphs show which masks have been selected but not yet applied. A blue upwards triangle represents a broken paternity link, and a blue downwards triangle a broken maternity link. Blue hatching shows genotype masking. Both the 'Recalculate Errors' and 'Undo Recent Masking's' buttons are now activated, and the Recalculate button is highlighted to draw attention to the need for error recalculation.

Once the 'Recalculate Errors' button is pressed, these changes to the data are applied, and a new checkpoint box is created in the tree of operations in the 'History' panel (Figure 2c). Maskings can be reverted on individuals using the right-click context menu controls, or the user can step back to the previous state of the data by clicking on a node in the 'History' panel. The maskings that have been applied to individuals are also listed in the 'Masked Ind Table' (Figure 2d), from where they may also be reverted.

Figure 2c: Pedigree after applying a set of Mask operations

The pedigree has been redrawn following reapplication of the inheritance algorithm and recalculation of the error metrics. Pedigree changes result in the layout being altered, and G703 and G707 now belong to new families with unknown dam or sire. A new checkpoint box in the 'History' panel provides a navigable summary of the changes applied.

Figure 2d: Masked Individuals Table
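The workflow shown in Figures 2a to 2d (select several masks, recalculate once, then step back through checkpoints if needed) can be modelled roughly as follows. This Python sketch is an assumption about how such a workflow might be structured, not a description of VIPER's internals: the class `MaskSession`, its method names and the toy error function are hypothetical, and the choice of which link is masked for G703 and G707 is arbitrary.

```python
class MaskSession:
    """Pending masks are collected first and only applied on recalculation."""

    def __init__(self, recalculate):
        self.recalculate = recalculate   # callable: frozenset of masks -> error count
        self.applied = frozenset()       # masks currently in force
        self.pending = set()             # toggles awaiting recalculation
        # Each checkpoint stores the applied masks and the resulting error count.
        self.history = [(self.applied, recalculate(self.applied))]

    def toggle(self, mask):
        """Select or deselect a mask without triggering recalculation."""
        self.pending ^= {mask}

    @property
    def needs_recalculation(self):
        """True while changes are waiting (the highlighted button state)."""
        return bool(self.pending)

    def undo_recent(self):
        """Abandon the pending changes, leaving applied masks untouched."""
        self.pending.clear()

    def recalculate_errors(self):
        """Apply all pending toggles in one step and record a checkpoint."""
        self.applied = self.applied ^ frozenset(self.pending)
        self.pending.clear()
        errors = self.recalculate(self.applied)
        self.history.append((self.applied, errors))
        return errors

    def restore(self, index):
        """Step back to an earlier checkpoint."""
        self.applied, errors = self.history[index]
        self.pending.clear()
        return errors


# Toy error function: each unmasked cause counts as one error.
causes = {("sire", "G703"), ("dam", "G707")}
session = MaskSession(lambda masks: len(causes - masks))

session.toggle(("sire", "G703"))
session.toggle(("dam", "G707"))
print(session.needs_recalculation)    # True
print(session.recalculate_errors())   # 0
print(len(session.history))           # 2
print(session.restore(0))             # 2
```

Collecting toggles in a pending set means the costly recalculation runs once per batch of masks rather than once per click, which is the motivation given above for deferring it.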

Some pedigree changes will cause major reorganisation of the layout, and offspring and families may disappear from a generation. This is shown in Figure 3, where removing both parents from G558 has the effect of removing the family from Generation 1. Note that a 'broken link' icon is used to indicate 'orphaning' where G558 occurs as a parent of Generation 2.

Figure 3: Orphaning an individual can remove a family from a generation
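The layout changes described above follow naturally if families are thought of as groups of offspring sharing the same asserted parent pair. The Python sketch below illustrates this under that assumption; the function `group_families` and the parent identifiers (S1, D1, S2, D2) are hypothetical, and VIPER's actual layout logic may differ.

```python
from collections import defaultdict


def group_families(pedigree, masked_links=frozenset()):
    """Group offspring into families keyed by their (sire, dam) pair.

    pedigree:     individual -> (sire, dam)
    masked_links: set of (individual, "sire") or (individual, "dam")
    A masked link is treated as an unknown parent (None). An individual
    with both links masked is orphaned and joins no family.
    """
    families = defaultdict(set)
    for individual, (sire, dam) in pedigree.items():
        if (individual, "sire") in masked_links:
            sire = None
        if (individual, "dam") in masked_links:
            dam = None
        if sire is None and dam is None:
            continue
        families[(sire, dam)].add(individual)
    return dict(families)


pedigree = {
    "G703": ("S1", "D1"),
    "G704": ("S1", "D1"),
    "G558": ("S2", "D2"),
}

# Masking one link moves the individual into a new family with an unknown parent.
one_link = group_families(pedigree, {("G703", "dam")})
print(one_link[("S1", None)])        # {'G703'}

# Masking both links of the only offspring removes the family altogether.
orphaned = group_families(pedigree, {("G558", "sire"), ("G558", "dam")})
print(("S2", "D2") in orphaned)      # False
```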

Masking Selected Markers

In addition to the Errorgram filtering of markers described above, it is possible to mask markers by selection in the 'Marker Table'. In this table the error rates for each marker are listed in sortable columns. Masked markers are greyed out, and it is possible to select individual markers for masking by ticking the selection box in the 'Hand Mask' column (Figure 4a).

Figure 4a: Marker Table detailing the current error metrics and the masking applied.

The Marker Table details the Markers as filtered in Figure 1 above, together with 3 further individual maskings applied manually. Masked markers are greyed out.

Selecting a single marker (by clicking on its name, Figure 4b) focuses and highlights that marker, and toggles the main pedigree display from aggregate metrics to report metrics for that single marker. Focus can be changed by selecting a different marker, or reverted to the aggregate view by using the 'Clear Focus Marker' button. [NOTE: from v1.01 this button has been renamed 'Show All Markers' and it is also possible to toggle between views by clicking on marker names.]

Figure 4b: Marker Table showing a single marker highlighted for display

Like the other tables, the Marker Table can be resized, minimized or unpinned from the application.

Mask Remaining Errors

The 'Mask Remaining Errors' button (see Figures 2a, 2b and 2c) can be used to apply maskings to all remaining genotypes reported as being in error. In order to correctly identify the minimal set of genetic inconsistencies for deletion, this should only be performed after all the major systematic errors in the dataset have been identified. After error recalculation, this operation may have to be repeated if the masked genotypes are not themselves the true causal errors and their removal exposes errors in relatives.
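The need to repeat the operation can be pictured as a loop that masks every reported genotype, recalculates, and continues until nothing more is reported. The Python sketch below is a rough illustration only: `mask_remaining_errors` and `toy_checker` are hypothetical names, and the checker is a hard-coded stand-in for the genotype checking algorithm in which masking one genotype exposes an error in a relative.

```python
def mask_remaining_errors(find_errors, masked=(), max_rounds=10):
    """Repeatedly mask all genotypes reported in error.

    find_errors: callable taking a frozenset of masked (individual, marker)
                 pairs and returning the set of pairs still in error.
    Returns (masked set, rounds used, errors still outstanding).
    """
    masked = set(masked)
    rounds = 0
    errors = find_errors(frozenset(masked))
    while errors and rounds < max_rounds:
        masked |= errors          # mask everything currently reported
        rounds += 1
        errors = find_errors(frozenset(masked))   # recalculate
    return masked, rounds, errors


def toy_checker(masked):
    """Stand-in checker: masking G2's genotype exposes an error in G3."""
    if ("G2", "M1") not in masked:
        return {("G2", "M1")}
    if ("G3", "M1") not in masked:
        return {("G3", "M1")}
    return set()


masked, rounds, remaining = mask_remaining_errors(toy_checker)
print(sorted(masked))   # [('G2', 'M1'), ('G3', 'M1')]
print(rounds)           # 2
print(remaining)        # set()
```

A loop like this always terminates with a consistent dataset, but not necessarily with the minimal set of maskings, which is why the systematic errors should be dealt with by hand first.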