Inferring ongoing cancer evolution from single tumour biopsies using synthetic supervised learning

Tom W. Ouellette and Philip Awadalla

Supplementary

Table of Contents


I. Description of synthetic tumour generation methods
    1. Pseudo-algorithms for stochastic simulations and synthetically sampled tumours
      • Supplementary Figure 1. Comparison of single population genetic statistics and deep learning models for differentiating between positive selection and neutral evolution
      • Supplementary Figure 2. Predicting the number of subclones (0, 1, 2) in 2.8 million synthetic tumours
      • Pseudo-algorithms for stochastic branching process (positive selection), generative sampling process (neutral evolution), and paired synthetic data generation
    2. Checking simulation model specification relative to sequenced patient tumours
      • Supplementary Figure 3. Evaluating validity of synthetic data generation scheme with respect to real patient data (removal of low frequency variants based on mean sequencing depth)
      • Supplementary Figure 4. Evaluating validity of synthetic data generation scheme with respect to real patient data (removal of low frequency variants based on mean effective coverage)
      • Supplementary Figure 5. Comparison of nearest neighbour search when using mean effective coverage versus mean sequencing depth.

II. Deep learning model performance for base evolutionary inference tasks
    1. Examining probability estimates for detecting selection under varying subclone characteristics
      • Supplementary Figure 6. Accurately detecting positive selection and subclonality is dependent on the sequencing depth, number of subclonal mutations, and subclone frequency at time of biopsy
    2. Evaluating model performance in 2.8 million simulated tumours
      • Supplementary Figure 7. Comparison of single population genetic statistics and deep learning models for differentiating between positive selection and neutral evolution
      • Supplementary Figure 8. Predicting the number of subclones (0, 1, 2) in 2.8 million synthetic tumours
      • Supplementary Figure 9. Correlation between true subclone frequency and predicted subclone frequency using synthetic supervised learning (TumE) and a population genetics informed mixture model (MOBSTER)
      • Supplementary Figure 10. Error in predicting frequency of 2 detectable subclones with synthetic supervised learning (TumE)
      • Supplementary Figure 11. Relationship between frequencies of subclones in the 2 subclone setting and the mean percentage error for the highest frequency subclone (1st subclone).
      • Supplementary Figure 12. Relationship between frequencies of subclones in the 2 subclone setting and the mean percentage error for the lowest frequency subclone (2nd subclone).
    3. Testing generalizability of selection estimates using an orthogonal cancer evolution simulator
      • Supplementary Figure 13. Evaluation of TumE evolutionary classification estimates in an orthogonally simulated dataset of 900 synthetic tumours
      • Supplementary Figure 14. Subclone frequency estimates are only accurate at detectable frequency ranges
    4. Impact of variable birth and death rates on estimating the number of subclones and subclone frequency
      • Supplementary Figure 15. TumE performance (precision and recall) for predicting the number of subclones across 26 different birth rate and death rate combinations in 6.7 million synthetic tumours.
      • Supplementary Figure 16. TumE performance (precision and recall) for predicting the frequency of a single subclone across 26 different birth rate and death rate combinations in 6.7 million synthetic tumours.
    5. Robustness to variable tumour purity and incorrect purity estimates
      • Supplementary Figure 17. Evaluation of the peak-finding/heuristic VAF adjustment method.
      • Supplementary Figure 18. False positive rate for positive selection (>= 1 subclone) at variable sequencing depths, tumour purities, and errors in purity estimates in 6000 synthetic tumours.

III. Analysis in high-quality whole-genome and whole-exome sequenced tumour biopsies
    1. Application of TumE to high-quality PCAWG samples
      • Supplementary Figure 19. VAF distributions with annotated TumE fits for 75 PCAWG samples with either zero or one detected subclone.

IV. Transfer learning with TumE
    1. Inferring additional evolutionary parameters using an alternative simulation framework, TEMULATOR
      • Supplementary Figure 20. Viable fitness and emergence time parameter combinations for detectable subclones (~10-40% VAF) in the TEMULATOR simulation framework
      • Supplementary Figure 21. Comparison of predictive performance for inferring evolutionary parameters with and without pre-trained TumE models.
      • Supplementary Figure 22. Comparison of mean percentage error with and without post-hoc mutation rate correction.
      • Supplementary Figure 23. Mean percentage error for inferring parameters from TEMULATOR simulations (mutation rate, subclone cellular fraction, subclone emergence time, and subclone fitness).