LoF gene
Loss-of-function Genes Associated with aGVHD Risk
We defined a gene carrying any truncating or frameshift variant as a LoF gene (variants annotated by ANNOVAR as "stopgain", "startloss", "frameshift insertion", "frameshift deletion" or "frameshift block substitution"). A LoF gene carrying any recessive LoF mutation was defined as a recessive LoF gene; otherwise it was defined as dominant.
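The definition above can be sketched in Python as follows. This is only an illustration, not the pipeline used in the analysis: the function name `classify_lof_gene`, the input layout and the "het"/"hom" genotype encoding are hypothetical. In particular, treating a homozygous LoF genotype as a "recessive LoF mutation" is an assumption, since the text does not state how recessive mutations were identified.

```python
# Variant classes treated as loss-of-function, as listed in the text.
LOF_LABELS = {
    "stopgain",
    "startloss",
    "frameshift insertion",
    "frameshift deletion",
    "frameshift block substitution",
}


def classify_lof_gene(variants):
    """Classify one gene in one individual.

    `variants` is a list of (exonic_function, genotype) tuples, where
    genotype is "het" or "hom" (hypothetical encoding).
    Returns None if the gene carries no LoF variant, otherwise
    "recessive" or "dominant".
    """
    lof = [(func, gt) for func, gt in variants if func in LOF_LABELS]
    if not lof:
        return None
    # ASSUMPTION: a homozygous LoF genotype counts as a recessive LoF mutation.
    if any(gt == "hom" for _, gt in lof):
        return "recessive"
    return "dominant"
```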
Statistical methods
The reported p-value of a LoF gene is the p-value of its coefficient in a logistic regression fitted to the aGVHD outcome. To avoid the problem of separation in logistic regression when the minor allele frequency is low, we used Firth logistic regression, a penalized-likelihood method, as implemented in the "logistf" package in R (Heinze and Schemper, 2002; Puhr et al., 2017).
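To illustrate why the Firth penalty helps under separation, the following pure-Python sketch fits a Firth-penalized logistic regression with an intercept and a single predictor. It is not the "logistf" implementation used in the analysis: the function name `firth_logistic` and the toy data are hypothetical, no covariates are included, and only coefficients are returned (the p-value calculation is not reproduced here).

```python
import math


def firth_logistic(x, y, max_iter=100, tol=1e-8):
    """Firth-penalized fit of logit P(y=1) = b0 + b1 * x (one predictor)."""
    b0, b1 = 0.0, 0.0
    for _ in range(max_iter):
        p = [1.0 / (1.0 + math.exp(-(b0 + b1 * xi))) for xi in x]
        w = [pi * (1.0 - pi) for pi in p]
        # Fisher information I = X^T W X for the design columns (1, x).
        i00 = sum(w)
        i01 = sum(wi * xi for wi, xi in zip(w, x))
        i11 = sum(wi * xi * xi for wi, xi in zip(w, x))
        det = i00 * i11 - i01 * i01
        v00, v01, v11 = i11 / det, -i01 / det, i00 / det  # inverse of I
        # Diagonal of the hat matrix: h_i = w_i * (1, x_i) I^-1 (1, x_i)^T.
        h = [wi * (v00 + 2.0 * v01 * xi + v11 * xi * xi) for wi, xi in zip(w, x)]
        # Firth-modified score residuals: y - p + h * (1/2 - p).
        r = [yi - pi + hi * (0.5 - pi) for yi, pi, hi in zip(y, p, h)]
        u0 = sum(r)
        u1 = sum(ri * xi for ri, xi in zip(r, x))
        # Scoring step: beta <- beta + I^-1 * U*.
        d0 = v00 * u0 + v01 * u1
        d1 = v01 * u0 + v11 * u1
        b0, b1 = b0 + d0, b1 + d1
        if max(abs(d0), abs(d1)) < tol:
            break
    return b0, b1


# Toy data (hypothetical): all 3 carriers are cases, so ordinary maximum
# likelihood would push b1 towards infinity (separation).
x = [1] * 3 + [0] * 15
y = [1] * 3 + [1] * 5 + [0] * 10
b0, b1 = firth_logistic(x, y)
```

With a single binary predictor, the Firth estimate of the log odds ratio is the same as the one obtained after adding 0.5 to each cell of the 2x2 table, so `b1` stays finite even though no carrier is a control.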
LoF gene pair
Donor and Recipient Loss-of-function Gene Pairs Associated with aGVHD Risk
When a specific donor gene G1 and a specific recipient gene G2 simultaneously carried LoF mutations, we defined "G1-G2" as a donor-recipient LoF gene pair.
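As a minimal sketch of this definition, the pairs for one transplant can be enumerated from the donor's and the recipient's sets of LoF genes. The function name `lof_gene_pairs` and the gene names are placeholders, not taken from the analysis.

```python
from itertools import product


def lof_gene_pairs(donor_lof_genes, recipient_lof_genes):
    """Return every "G1-G2" pair for one donor-recipient transplant.

    G1 ranges over the donor's LoF genes and G2 over the recipient's.
    """
    return {f"{g1}-{g2}" for g1, g2 in product(donor_lof_genes, recipient_lof_genes)}


# Placeholder gene names, for illustration only.
pairs = lof_gene_pairs({"GENE_A", "GENE_B"}, {"GENE_C"})
print(sorted(pairs))  # ['GENE_A-GENE_C', 'GENE_B-GENE_C']
```

Each pair can then serve as a 0/1 indicator per transplant in the regression described below.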
Statistical methods
The reported p-value of a LoF gene pair is the p-value of its coefficient in a logistic regression fitted to the aGVHD outcome. To avoid the problem of separation in logistic regression when the minor allele frequency is low, we used Firth logistic regression, a penalized-likelihood method, as implemented in the "logistf" package in R (Heinze and Schemper, 2002; Puhr et al., 2017).
SNP or inDel
Short Variants Associated with aGVHD Risk
Short variants were called by GATK in individual mode. A "mismatch" was calculated as the number of alleles that differ between donor and recipient at the same locus, and encoded as 0, 1 or 2, meaning that the alleles are the same, that only one allele differs, or that both alleles differ, respectively.
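The mismatch encoding can be sketched as a multiset comparison of the two diploid genotypes. This is one plausible reading of the definition, not the code used in the analysis: the function name `allele_mismatch` is hypothetical, and genotypes are assumed to be unphased, so allele order is ignored.

```python
from collections import Counter


def allele_mismatch(donor_gt, recipient_gt):
    """Count alleles that differ between two diploid genotypes (0, 1 or 2).

    Genotypes are given as 2-tuples of alleles, e.g. ("A", "G").
    ASSUMPTION: genotypes are unphased, so ("A", "G") equals ("G", "A").
    """
    donor = Counter(donor_gt)
    recipient = Counter(recipient_gt)
    # Counter subtraction keeps only the donor alleles not matched in the
    # recipient; for two diploid genotypes the count is symmetric.
    return sum((donor - recipient).values())


print(allele_mismatch(("A", "G"), ("G", "A")))  # 0: same alleles
print(allele_mismatch(("A", "A"), ("A", "G")))  # 1: one allele differs
print(allele_mismatch(("A", "A"), ("G", "G")))  # 2: both alleles differ
```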
Statistical methods
Associations of the short variants (SNPs, inDels and mismatches) with aGVHD were calculated by PLINK (logistic regression under the additive, dominant and recessive models).
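The three genetic models differ only in how the genotype enters the regression as a predictor. The sketch below shows the conventional coding of a minor allele count under each model. It illustrates the idea and is not PLINK's code, and the function name `encode_genotype` is hypothetical.

```python
def encode_genotype(minor_allele_count, model):
    """Code a genotype (0, 1 or 2 copies of the minor allele) as a predictor."""
    if minor_allele_count not in (0, 1, 2):
        raise ValueError("minor_allele_count must be 0, 1 or 2")
    if model == "additive":
        return minor_allele_count            # 0, 1, 2: effect per allele copy
    if model == "dominant":
        return int(minor_allele_count >= 1)  # 0, 1, 1: one copy is enough
    if model == "recessive":
        return int(minor_allele_count == 2)  # 0, 0, 1: two copies are required
    raise ValueError(f"unknown model: {model}")


for model in ("additive", "dominant", "recessive"):
    print(model, [encode_genotype(count, model) for count in (0, 1, 2)])
```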
SV
Structural Variations Associated with aGVHD Risk
SVs were called by Lumpy. SVs supported by fewer than 5 paired-end (PE) reads or fewer than 5 split reads (SR) were filtered out.
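One reading of this filter is that an SV is kept only when both kinds of evidence reach the threshold. The sketch below applies that reading to the INFO column of a VCF record. It assumes that the INFO field carries single-integer `PE` and `SR` entries; the helper names and the example record are hypothetical.

```python
def parse_info(info):
    """Split a VCF INFO string into a dict; flag entries get an empty value."""
    fields = {}
    for item in info.split(";"):
        key, _, value = item.partition("=")
        fields[key] = value
    return fields


def passes_support_filter(info, min_reads=5):
    """Keep an SV only if both PE and SR support reach `min_reads`.

    ASSUMPTION: PE and SR are single integers; a missing entry counts as 0.
    """
    fields = parse_info(info)
    pe = int(fields.get("PE", 0))
    sr = int(fields.get("SR", 0))
    return pe >= min_reads and sr >= min_reads


# Hypothetical INFO strings, for illustration only.
print(passes_support_filter("SVTYPE=DEL;END=12345;SU=14;PE=8;SR=6"))           # True
print(passes_support_filter("SVTYPE=DEL;END=12345;SU=9;PE=8;SR=1;IMPRECISE"))  # False
```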
Statistical methods
The reported p-value of an SV is the p-value of its coefficient in a logistic regression fitted to the aGVHD outcome. To avoid the problem of separation in logistic regression when the minor allele frequency is low, we used Firth logistic regression, a penalized-likelihood method, as implemented in the "logistf" package in R (Heinze and Schemper, 2002; Puhr et al., 2017).
CNV
Genome Copy Number Variations Associated with aGVHD Risk
The copy number of each 50,000 bp window along the whole genome was estimated by Control-FREEC, and the association was calculated from the estimated copy number in each genomic window.
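The fixed-size windowing can be sketched as a mapping between genomic positions and window indices. The coordinate convention (1-based positions, windows starting at position 1) and the function names are assumptions made for illustration, not Control-FREEC's internal convention.

```python
WINDOW_SIZE = 50_000  # window size in bp, as stated in the text


def window_index(position, window_size=WINDOW_SIZE):
    """0-based index of the window containing a 1-based genomic position."""
    return (position - 1) // window_size


def window_bounds(index, window_size=WINDOW_SIZE):
    """1-based inclusive (start, end) coordinates of a window."""
    start = index * window_size + 1
    return start, start + window_size - 1


print(window_index(50_000))  # 0: last base of the first window
print(window_index(50_001))  # 1: first base of the second window
print(window_bounds(1))      # (50001, 100000)
```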
Statistical methods
The reported p-value of a genome copy number is the p-value of its coefficient in a logistic regression fitted to the aGVHD outcome. To avoid the problem of separation in logistic regression when the minor allele frequency is low, we used Firth logistic regression, a penalized-likelihood method, as implemented in the "logistf" package in R (Heinze and Schemper, 2002; Puhr et al., 2017).