This document accompanies our paper, Calculating Polygenic Risk Scores (PRS) in UK Biobank: A Practical Guide for Epidemiologists, and describes the pipeline for computing an existing polygenic risk score (PRS) in UK Biobank (UKB) imputed genotype data, including the various quality control (QC) metrics that should be inspected.
We demonstrate the process using a PRS for low-density lipoprotein (LDL) cholesterol (LDL-C PRS) developed in (Klarin et al. 2018).
The complete script produced by this pipeline is ‘demo.sh’.
Data and software required are listed below:
Pre-requisite: Researchers need to have access to UKB genetic data through an approved UK Biobank application.
Operating system: The commands introduced below are written for Linux and Mac OS operating systems.
Please note that we attempt to be path-agnostic throughout this tutorial. See Appendix 1 for details on how to run command line programs, and Appendix 2 for tips on managing filepaths using variables to reduce typo risk.
Additional online resources:
Structure: We start by outlining how to use bgenix to quickly extract SNPs from the UK Biobank imputation data. Next we outline the commands needed to produce and filter on standard quality control metrics in PLINK 2, with particular attention to their suitability for imputed data. Finally we show how to compute PRS.
The Betas file is the per-SNP summary statistics for the PRS, including the list of SNPs in the PRS and their corresponding betas (effect sizes, or weightings). This information can typically be found in the supplementary materials of the paper where a PRS was derived or from online repositories such as PGS Catalog.
The first thing to do is ensure that the list of SNPs and betas is suitably formatted. Scores from online repositories may be consistently formatted according to a schema, but those from the supplementary materials of papers often have additional information in the form of header of footer rows, or may have been formatted in ways that can’t easily be read by software tools. In this case, it is generally easier to manually arrange the betas into a suitable format than to try and tweak the code to read a particular niche layout you may never encounter again.
For our working example, LDL-C PRS, the SNPs and betas were given in supplementary table 11 of (Klarin et al. 2018). We checked that this PRS was developed using the same genome build as UKB (GRCh37), manually removed header rows and footnotes, ensured consistent formatting and saved the relevant information from the original Excel sheets (.xlsx) as ‘Betas.csv’.
Tip: We recommend saving the Betas file as .csv format because
The first few rows of our example ‘Betas.csv’ file are shown below.
Where:
Note that the following code expects the columns to have these names and be arranged in this order.
In order to extract these SNPs from the UKB genetic data using bgenix, we need to make a file containing white-space separated SNP identifiers.
Bgenix expects either of the following:
We can use awk
to produce a list of rsID stored in ‘rsidlist.txt’ and chromosome
position identifiers in the form chr:pos1-pos2
stored in
‘chrposlist.txt’ from ‘Betas.csv’:
awk -F, '{ if (NR>1) { print $1 }}' Betas.csv > rsidlist.txt
awk -F, '{ if (NR>1) { print sprintf("%02d", $2)":"$3"-"$3 }}' Betas.csv > chrposlist.txt
Notes on the command:
-F,
informs awk
the field separator is
comma ,
NR
is an awk
variable for the number of
lines in the file. if (NR>1)
means the command acts on
all except the first line, skipping the header row$1
means the first field in the line, $2
is the second and so on. This command therefore relies on the ordering
of the columns in the filesprintf
returns a formatted string, in this case
sprintf("%02d", $2)
formats the values from column 2 to
have 2 digits (i.e. leading zeroes for chromosomes 1-9)>
is the pipe operator, which redirects the output
from awk
(which would by default go to stdout
)
to the specified fileOutputs would look like the following (both without header):
rsidlist.txt
rs6511720
rs4420638
rs629301
rs2328223
…
chrposlist.txt
19:11202306-11202306
19:45422946-45422946
01:109818306-109818306
20:17845921-17845921
…
Note: Bgenix can only specify up to 9980 chr:position ranges in a query at a time. If more than 9980 variants are required, then either rsID identifiers must be used or the extraction must be split into multiple sections.
We extract SNPs from UKB imputation data on autosomes (non-sex chromosomes), which are stored in BGEN V1.2 format format (.bgen, .sample, .bgen.bgi):
Where:
The full description of these files can be found in UK Biobank documentation.
The process of SNP extraction is outlined in Figure 2:
We iterate over the 22 autosome files in bash, using bgenix to extract the variants listed in ‘rsidlist.txt’ and/or ‘chrposlist.txt’ from each one. The extracted SNPs are saved as single-chromosome files, ‘chr_N.bgen’ where N = 1,…,22.
Then cat-bgen is used to concatenate the extracted data into one single file, here called ‘initial_chr.bgen’. An index for this .bgen file, ‘initial_chr.bgen.bgi’ is created using bgenix for efficient access to specific variants.
Finally, another bash loop deletes the intermediate single-chromosome files for tidiness.
cmd=""
for i in {1..22}
do
bgenix -g ukb_imp_chr${i}_v3.bgen -incl-rsids rsidlist.txt -incl-range chrposlist.txt > chr_${i}.bgen
cmd=$cmd"chr_${i}.bgen "
done
# Combine the .bgen files for each chromosome into one
cat-bgen -g $cmd -og initial_chr.bgen -clobber
# Write index file .bgen.bgi
bgenix -g initial_chr.bgen -index -clobber
# Remove the individual chromosome files
for i in {1..22}
do
rm chr_${i}.bgen
done
Notes on Bash commands:
{1..22}
produces a sequence of integers from 1 to 22
(bash version 3.0+ required)${i}
accesses the current value of the bash variable
i
cmd=$cmd"chr_${i}.bgen"
concatenates strings, in this
case the names of the files so that at the end of the loop,
cmd
contains the names of all the files createdrm
remove (delete) fileNotes on bgenix options:
-clobber
instructs bgenix and cat-bgen to overwrite
this file if it already exists-index
creates a new index file from the specified
.bgen fileIn the UK Biobank imputed data, multi-allelic SNPs are represented as multiple rows in the .bgen file, all with the same rsID, chromosome number and position. This means the rsID (or chromosome and position) is not sufficient to uniquely identify the variant.
To visually inspect the data, use the -list
command in
bgenix, which prints a list of the variants in the file to stdout. A SNP
with 3 alleles [A, C, G], for example, would be given by the following
two rows:
alternate_ids | rsID | chromosome | position | number_of_alleles | first_allele | alternative_alleles |
---|---|---|---|---|---|---|
1:5678_A_G | rs1234 | 1 | 5678 | 2 | G | A |
1:5678_C_G | rs1234 | 1 | 5678 | 2 | G | C |
The alternate_id
column here is a unique identifier from
UK Biobank that incorporates allele information where needed to provide
a unique identifier for each row.
Note: Although the UKB “alternate_id” is visible in the
data when we use the -list
command, we cannot filter the
data by this ID.
Filtering our variants by rsID or chr:pos will extract (or exclude) all alleles of the corresponding multi-allelic SNP. In order to uniquely identify the alleles of interest we need to include allelic information in our filtering. Since the .bgen.bgi index file that corresponds to each .bgen file is actually a sqlite3 database file, we can open this file using sqlite3, and query it directly using the SQL database language.
The index is stored in a table called Variants (to view the full schema, see the bgenix index documentation). By loading our table of SNPs into the database, and joining this to the Variants table on both the identifier (rsID, chr:pos or both) and the allele pairs, we can create a new table filtered to just the alleles of interest.
When doing this we need to take into account the alignment of the SNPs. We know that the UK Biobank data is aligned to GrCh37, with the first allele in the bgen files (‘allele1’ column) the reference allele on the forwards strand. Our betas file, on the other hand, has the alleles labelled as effect and noneffect - and this labelling might be determined by which allele is more common, or has a particular direction of assocation with the trait of interest.
For this reason, when we do our join in SQL, we can’t just assume that all our effect alleles will be in the same column of the index file. To account for this, we use two join statements, one that matches up variants where ‘allele1’ is our noneffect allele (and ‘allele2’ is our effect), and one with the opposite alignment. Notice that we flip the sign of the beta on the second join, as discussed in Section 2.4.1 of our companion paper.
We then take the union of these two joins in order to get a single table with all our variants, where we have aligned the allele labelling such that our effect allele is the one in the ‘allele2’ column.
# Import the betas into the sqlite database as a table called Betas
sqlite3 initial_chr.bgen.bgi "DROP TABLE IF EXISTS Betas;"
sqlite3 -separator "," initial_chr.bgen.bgi ".import Betas.csv Betas"
sqlite3 initial_chr.bgen.bgi "DROP TABLE IF EXISTS Joined;"
# And inner join it to the index table (Variants), making a new table (Joined)
# By joining on alleles as well as chromosome and position
# we can ensure only the relevant alleles from any multi-allelic SNPs are retained
sqlite3 -header -csv initial_chr.bgen.bgi \
"CREATE TABLE Joined AS
SELECT Variant.*, Betas.chr_name, Betas.Beta FROM Variant INNER JOIN Betas
ON Variant.chromosome = printf('%02d', Betas.chr_name)
AND Variant.position = Betas.chr_position
AND Variant.allele1 = Betas.noneffect_allele
AND Variant.allele2 = Betas.effect_allele
UNION
SELECT Variant.*, Betas.chr_name, -Betas.Beta FROM Variant INNER JOIN Betas
ON Variant.chromosome = printf('%02d', Betas.chr_name)
AND Variant.position = Betas.chr_position
AND Variant.allele1 = Betas.effect_allele AND
Variant.allele2 = Betas.noneffect_allele;"
# Filter the .bgen file to include only the alleles specified in the Betas for each SNP
bgenix -g initial_chr.bgen -table Joined > single_allelic.bgen
# And produce an index file for the new .bgen
bgenix -g single_allelic.bgen -index
Notes on the SQL commands:
DROP TABLE IF EXISTS Betas;
Will drop (delete) the
table if it already exists, otherwise we will get an error when we
attempt to create it.import Betas.csv Betas
imports the data from the file
Betas.csv into a table called ‘Betas’CREATE TABLE Joined AS ...
creates a table called
‘Joined’ containing the data returned from the following querySELECT Variant.*
selects all columns from the Variant
table (*
is a wildcard)Betas.chr_name
selects the ‘chr_name’ column from the
Betas tableprintf('%02d', Betas.chr_name)
formats the chromosome
number from the Betas table to be two digits, with a leading zero if
necessary, to match the format in the Variant tableAnd the bgenix commands:
-table Joined
tells bgenix to output only the variants
in the ‘Joined’ tableWhile bgenix offers fast and simple variable extraction, it cannot be used for computation of summary statistics. We will describe how to use PLINK 2 for this, as after an initial data conversion step it is very fast and can compute statistics and apply filters in a single command. An alternative is QCTOOL, which does not require file conversion, but is much slower - for further details on how to compute summary statistics in QCTOOL, see Appendix 3
When PLINK 2 reads in a .bgen file it will automatically convert it to PLINK 2 binary format. For large data files, this can be time-consuming, so it is often faster to save the results of this conversion and run future PLINK 2 commands on the converted files, rather than performing file conversion again for each operation.
The PLINK 2 binary format is a set of three files:
A draft specification for these files is available in the PLINK 2 GitHub repo. Example .pvar and .psam files are displayed below.
CHROM | POS | ID | REF | ALT |
---|---|---|---|---|
1 | 123456 | 1:123456 | G | T |
2 | 234567 | 2:234567 | T | G |
3 | 345678 | 3:345678 | C | T |
FID | IID | SEX |
---|---|---|
1 | 1 | 1 |
2 | 2 | 2 |
3 | 3 | 2 |
Where:
For more information, see the PLINK 2 documentation.
Note:
PLINK and Bgen: PLINK 2 collapses the genotype probability triples stored in .bgen files down to allelic dosages (stored with 4 decimal place precision), which means this conversion is “lossy” or irreversible. For the calculation of polygenic risk scores, we are only interested in the dosages, and therefore this does not affect us. An example was given in PLINK 2 google forum by Christopher Chang.
Hard-calls in PLINK 2: In addition to allelic dosages, the PLINK 2 binary file format also records the hard-calls (calculated according to a given hard-call threshold) because some of the PLINK 2’s commands can only be used on hard-calls.
Here we load the data from .bgen and .sample files, and convert it to PLINK 2 format. In addition, since we can supply multiple options to PLINK 2 in one command, we relabel the variant IDs to ensure they have unique identifiers, and print out an allele frequency report that we will use to identify ambiguous SNPs.
Sample file: Note that although we have filtered to a subset of variants we still use our original .sample file, as our data still contains all the participants and they are arranged in the same order in the .bgen file.
plink2 --bgen single_allelic.bgen ref-first \
\
--hard-call-threshold 0.1 \
--sample ukbA_imp_chrN_v3_sP.sample \
--memory 15000 \$r_\$a \
--set-all-var-ids @:#_\
--freq \
--make-pgen --out raw
Notes on the command:
--bgen
and --sample
: are how we specify
the .bgen and .sample files we want to read in--hard-call-threshold
: this is a dosage import setting,
that lets us specify the threshold value at which we want to define
hard-calls. Here we have used the default threshold of 0.1--memory
: Manually specify memory usage, in this case
to 15000MB--set-all-var-ids
: sets SNPs identifiers according to
the specified template string, in this case chr:pos_ref_alt. See PLINK
2 docs for more info--freq
: computes allele frequencies and outputs them to
.afreq file--make-pgen
: generates files in PLINK 2 binary
format--out
: specifies the prefix for all output files - here
we have used ‘raw’Computational consideration: When using a single node on a computing cluster, PLINK 2 can see all RAM on the cluster not only what’s available in the nodes selected to run the job. In such cases, it can be necessary to manually specify the amount of memory PLINK 2 will use, or it may attempt to access more than permitted.
This command outputs the genetic data in PLINK 2 binary format (raw.pgen, raw.pvar, raw.psam), and an allele frequency report printed to raw.afreq.
CHROM | ID | REF | ALT | ALT_FREQS | OBS_CT |
---|---|---|---|---|---|
1 | 1:1687482 | G | T | 0.486 | 974818 |
1 | 1:2187085 | T | G | 0.382 | 974818 |
1 | 1:3328659 | C | T | 0.139 | 974818 |
Where:
Having printed the list of allele frequencies, we can use
awk
to quickly identify palindromic SNPs (ie SNPs where
both alleles are (A, T) or (C, G)) which have effect allele frequencies
close enough to 0.5 that we cannot unambiguously distinguish between the
effect and non-effect alleles.
In this case, we will consider “ambiguous” SNPs to be those with EAF between 0.4 and 0.6.
awk '/^[^#]/ { if( $5>0.4 && $5<0.6 && ( ($3=="A" && $4=="T") || ($4=="T" && $3=="A") || ($3=="C" && $4=="G") || ($4=="G" && $3=="C") ) ) { print $0 }}' \
> exclrsIDs_ambiguous.txt raw.afreq
Notes on command:
/^[^#]/
instructs awk
to ignore lines that
start with #, as these are header lines or commentsprint $0
means print all fields in each lineLater we will use the list we have printed here to exclude the ambiguous SNPs we have identified.
It is good practice to perform some quality checks on the variants in our data before using them in analyses.
In the UK Biobank imputed data, there is no missing data since any missing calls have been imputed. The imputation quality of each variant is quantified by the imputation information, and variants with a low imputation information score (for example, < 0.4) may be considered unreliable and removed from analyses. In addition, SNPs that are very rare and have low minor allele frequency (MAF) will not give adequate power to make meaningful statistical statements, and may be removed from analyses.
In order to filter on imputation information in PLINK 2, it is easiest to use a single file containing the imputation information score for each variant in the UK Biobank imputed data.
Here we achieve this by concatenating the UKB Imputation MAF+Info
files (documented in UKB
Resource 531) using awk
, adding in the chromosome
number as the first column and the unique chr:pos_ref_alt identifier as
the last column.
Note that this command will only need to be run once, and all future work can use the resulting file.
for i in {1..22}
do
awk -v chr=$i 'BEGIN {FS="\t"; OFS="\t"} { print chr,$0,chr":"$3"_"$4"_"$5 }' ukb_mfi_chr${i}_v3.txt
done > ukb_mfi_all_v3.tsv
Having produced a single tab-separated file containing the imputation
information score for each variant, we can now use the
--extract-col-cond
flag in PLINK 2 to exclude any variants
with imputation information below our chosen minimum value.
We already saw that PLINK 2 can generate an allele frequency report -
but instead of having to filter this with another software tool (eg
awk
or R) to produce a list of SNPs to exclude, we can
apply a minor allele frequency threshold in a single PLINK 2 command:
--maf [minimum frequency]
Finally, we can apply all these exclusions in one command:
plink2 --pfile raw \
\
--memory 15000 \
--exclude exclrsIDs_ambiguous.txt --extract-col-cond-min 0.4 \
--extract-col-cond ukb_mfi_all_v3.tsv 9 10 \
--maf 0.005 \
--write-snplist \
--make-pgen --out snpQC
Notes on command:
--exclude
takes a list of variant IDs and excludes
them. In this case, we have supplied the list of ambiguous SNPs we identified earlier.--extract-col-cond
takes the name of the tab-separated
file containing the scores, the column number to condition on and the
column number of the variant IDs. We specify the minimum value using
--extract-col-cond-min
(see Plink
documentation). In this case, we are excluding SNPs with imputation information less than 0.4, using the
ukb_mfi_all_v3.tsv file we created previously.--maf
will exclude variants with minor allele frequency
less than the specified value, in this case 0.005The outputs of the command above are:
--write-snplist
)Typically, participants with poor quality genotyping data are also excluded from genetic analyses.
The UK Biobank provides a field (Field 22020) which indicates whether an individual met their internal quality control checks for inclusion in the calculation of principal components, and we have saved the IDs of such individuals in a text file called “usedinpca.txt”.
This field restricts to individuals who:
Have missing rate in autosomes \(\le\) 0.02
Are not outliers for missingness or heterozygosity
Are in a maximal set of unrelated individuals
Are not sex-discordant
as detailed in Bycroft2018 Supplementary Materials
It is common for analysis of UK Biobank data to use this pre-curated selection of participants, so for straightforward sample QC we just keep individuals recorded in ‘usedinpca.txt’.
plink2 --pfile raw \
\
--memory 15000 \
--extract snpQC.snplist \
--keep-fam usedinpca.txt \
--write-samples --out sampleQC
Note:
We use --keep-fam
because our list of IDs has only
one column, so PLINK 2 thinks they are family IDs. If we provided a file
with two columns, then PLINK 2 would read the first column as the family
ID and the second as the individual ID - in this case, to filter based
on individual IDs we would use --keep
. In the UK Biobank,
there is no family information, so the family and individual IDs are
identical.
We run SNP QC and sample QC in separate commands due to the
default order of
operation in PLINK 2. If we put SNP and sample QC in the same
command, PLINK 2 would run our sample QC (--keep-fam
)
before SNP QC (e.g. --maf
, --geno
,…), which is
not what we want.
The output is sampleQC.id (from
--write-samples
) which contains the resulting individual
IDs and family IDs (i.e. same format as the first 2 columns of
‘raw.psam’).
The formula for computing PRS for individual \(j\), \(PRS_{j}\) is specified below:
\[PRS_{j}=\displaystyle \frac{\sum_{i=1}^{N} {\beta_{i}*dosage_{ij}}}{P*M_{j}}\]
where:
Note:
To compute PRS in PLINK 2, we need to rearrange the SNPs and betas
from the Betas.csv file into the format the PLINK 2 --score
command expects.
We need to have whitespace-separated columns in the following order (with no header):
We will use chr:pos_ref_alt identifiers as variant IDs, as this is what we used in the .pgen file. In order to be sure that the IDs match those in our data, with the effect and noneffect alleles labelled the same way round, we will extract the SNPs and betas from the ‘Joined’ table we created in the .bgen.bgi index file earlier.
Recall that we decided to align this table with noneffect alleles in the ‘allele1’ column, and effect alleles in ‘allele2’.
sqlite3 -separator " " -list initial_chr.bgen.bgi \
"SELECT chr_name || ':' || position || '_' ||allele1 || '_' || allele2, allele2, Beta FROM Joined;" > score.txt
Note on command:
||
is the concatenation operator in SQL - we are using
it to construct the unique variant ID by concatenating
chr:pos_eff_neffOnce we have prepared the score.txt file we can run the
--score
command in PLINK 2.
plink2 --pfile raw \
\
--memory 15000 \
--extract snpQC.snplist \
--keep sampleQC.id \
--score score.txt no-mean-imputation --out PRS
Notes on the command:
--extract
)
and samples (--keep
) that passed our QC.--score
) with betas stored
in ‘score.txt’ and specify no-mean-imputation
option to
scale each individual’s score according to the number of non-missing
SNPs without imputing missing SNPs.The output is PRS.sscore which takes the following form:
FID | IID | NMISS_ALLELE_CT | NAMED_ALLELE_DOSAGE_SUM | SCORE1_AVG |
---|---|---|---|---|
1 | 1 | 446 | 221.369 | -0.001 |
2 | 2 | 446 | 294.455 | -0.001 |
3 | 3 | 446 | 214.173 | 0.001 |
Where for each individual:
Note: PLINK 2 may throw the following warning: “warning: X –score file entries were skipped due to missing variant IDs”.
This might be fine, since the betas file (and therefore the score.txt file) may contain some SNPs that were removed in QC, or unavailable in the UK Biobank data. If these numbers don’t align, then something might have gone wrong!
Linux OS, Intel CPU on computing cluster.
Throughout this tutorial we have assumed that all the required tools
(eg bgenix, sqlite3, plink) are on your PATH
to avoid any
considerations of which directory the executables are located in.
To run these you will need to do one of the following:
$PATH
variable and
then just use the tool name
export PATH=$PATH:/place/with/sqlite/executable
then
sqlite3 -separator "," initial_chr.bgen.bgi ".import Betas.csv Betas"
/path/to/sqlite3 -separator "," initial_chr.bgen.bgi ".import Betas.csv Betas"
./
./sqlite3 -separator "," initial_chr.bgen.bgi ".import Betas.csv Betas"
In these demo code chunks we have assumed all files are within the same working directory, to simplify the code chunks and avoid imposing our ideas of how to organise the files.
If you wish to access files in other locations, you can prespecify the filepaths once at the beginning of the script as a bash variable
UKBDATAPATH="path/to/data/"
Note that there is no whitespace around the =
sign.
You can then use this bash variable as a shorthand to avoid repeatedly typing out long paths, for example:
bgenix -g ${UKBDATAPATH}ukb_imp_chr1_v3.bgen -incl-rsids rsidlist.txt -incl-range chrposlist.txt > chr_1.bgen
An additional benefit is that if you ever wish to change the path where the data is stored, you only need to update it once where the bash variable is defined.
QCTOOL can also be used to compute the per-variant and per-sample QC metrics and filter data. While QCTOOL can handle imputed data better in some ways (eg can be used to compute the imputation information score) is is very slow for large data sets.
# Produce the SNP summary statistics
qctool -g initial_chr.bgen -filetype bgen -snp-stats -osnp snps-info-initial.txt
# Produce the sample summary statistics
qctool -g initial_chr.bgen -filetype bgen -s ukbA_imp_chrN_v3_sP.sample -sample-stats -osample sample-info-initial.txt
Although QCTOOL can compute these statistics it cannot directly
filter on them, so any desired filters (eg impute info < 0.4) must be
applied using another software tool, such as awk
or R.
# Produce the SNP summary statistics
qctool -g initial_chr.bgen -filetype bgen -s ukbA_imp_chrN_v3_sP.sample -os outputs/mace/sampleQC.sample -incl-samples usedinpca.txt -incl-rsids outputs/mace/exclHWErsIDs.txt
Linkage disequilibrium is a measure of the correlation between neighbouring genetic variants, .
Typically, during the construction of a PRS, variants in high LD will have been identified and removed by methods such as clumping and pruning.
In order to verify this, or apply more stringent thresholds, the
PLINK 2 command --indep-pairwise
can be used to prune the
data, resulting in a subset of variants with LD below a specified
threshold.
plink2 --pfile raw \
\
--memory 15000 \
--indep-pairwise 200 1 0.3 --out ld
The command is described in more detail in the PLINK 2 documentation, but in brief:
This command will output a pair of variant ID lists, one containing variants to be kept (below specified LD threshold) and one containing variants to be removed.
For more detailed investigation of LD, including pairwise comparison reports or identification of “tag” variants, we recommend PLINK 1.9.
The UK Biobank population is predominantly White (around 95% of participants self-reported white ethnicity). Since Hardy-Weinberg Equilbrium is sensitive to population structure and we may expect to see differences in SNP frequency by ethnicity, we wish to restrict to White British participants prior to testing for HWE.
In internal QC, Bycroft2018 used principal components analysis to determine the genetic ethnicity of participants, and identified those who self-reported as “White British” and had very similar ancestral backgrounds.
UK Biobank provided a field (Field 22006) indicating individuals who were categorised as genetically “White British”, and we have saved the IDs of these individuals in a text file called “whitebrit.txt”.
plink2 --pfile raw \
\
--memory 15000 \
--extract snpQC.snplist \
--keep sampleQC.id \
--keep-fam whitebrit.txt \
--hwe 0.000001 midp \
--write-snplist --out HWE
Note on the command:
--keep sampleQC.id
)--keep-fam whitebrit.txt
)--hwe 0.000001 midp
filter out SNPs that have HWE exact
test p-value less than the chosen threshold (in this case
10-6) with midp
modifier which is recommended in
the Plink
docs.The output is HWE.snplist which contains a list of SNP IDs that have passed HWE criteria.