agricola CLI
agricola provides a command-line interface for running local-ancestry-aware GWAS.
The whole-genome regression and single-variant tests can be run in
two steps (step1, step2) or as a single pipeline (all-steps).
Basic Usage
To get help, use the --help flag:
agricola --help
agricola step2 --help
Examples
A typical agricola run may look like:
agricola all-steps \
--plink-list tests/data/plinks.txt \
--lanc-list tests/data/lancs.txt \
--pheno-file tests/data/pheno.tsv \
--covar-file tests/data/covar.tsv \
--output-list tests/data/outs.txt \
--variant-file1 tests/data/variants.txt \
--trait-type qt
Alternatively, steps 0/1 and 2 can be performed separately:
agricola step1 \
--plink-list tests/data/plinks.txt \
--lanc-list tests/data/lancs.txt \
--pheno-file tests/data/pheno.tsv \
--covar-file tests/data/covar.tsv \
--output step1_preds \
--variant-file tests/data/variants.txt \
--trait-type qt
agricola step2 \
--plink-list tests/data/plinks.txt \
--lanc-list tests/data/lancs.txt \
--pheno-file tests/data/pheno.tsv \
--covar-file tests/data/covar.tsv \
--step1-prefix step1_preds \
--output-list tests/data/outs.txt \
--trait-type qt
File Formats
Inputs
Genotype
agricola accepts genotype files in plink2 .pgen format (with corresponding .pvar and .psam files). Please see the plink2 documentation for further details. Input can be either 1) a single .pgen file containing multiple chromosomes, or 2) a set of plink2 files, one per chromosome.
Info
Extension to other formats such as bgen or vcf is possible but is not currently a priority. agricola requires phasing information, so unphased formats such as .bed cannot be used.
Warning
It is assumed that all pgen/pvar files are sorted by chromosome and position.
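A quick pre-flight check for this assumption can be sketched in Python. The helper `pvar_is_sorted` is hypothetical and not part of agricola. It assumes a .pvar file with a `#CHROM POS ...` header, so that the first two columns are chromosome and position, and it interprets "sorted" as each chromosome forming one contiguous block with non-decreasing positions, since the exact chromosome ordering agricola expects is not stated here.

```python
def pvar_is_sorted(lines):
    """Return True if variants are grouped by chromosome with non-decreasing positions."""
    seen = set()      # chromosomes encountered so far
    current = None    # chromosome of the block being read
    last_pos = -1
    for line in lines:
        # Skip blank lines, '##' meta lines, and the '#CHROM' header.
        if not line.strip() or line.startswith("#"):
            continue
        chrom, pos = line.split()[:2]
        pos = int(pos)
        if chrom != current:
            if chrom in seen:
                return False  # chromosome reappears after its block ended
            seen.add(chrom)
            current, last_pos = chrom, pos
        elif pos < last_pos:
            return False      # position decreases within a chromosome
        else:
            last_pos = pos
    return True
```

Because the function accepts any iterable of lines, an open file handle can be passed directly, e.g. `with open("chr1.pvar") as fh: pvar_is_sorted(fh)`.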
Local Ancestry
agricola accepts only .lanc files, as defined by admix-kit, for local ancestry. This format was chosen for its flexibility, low memory overhead, and simplicity. Please see the admix-kit documentation for further details.
To make working with this format easier, we introduce the lanctools Python package and CLI tool. lanctools can convert RFMix msp.tsv files or FLARE vcf.gz files into .lanc format. We provide an example below. Please see the lanctools documentation for further details.
# convert RFMix msp.tsv output to .lanc format
lanctools convert-rfmix --file chr1.msp.tsv --plink-prefix chr1 --output chr1.lanc
Although most local ancestry inference algorithms produce chromosome-specific files,
you may prefer to work with a single file containing multiple chromosomes.
To do so, first merge pgen files using plink2, then use the lanctools merge
command to combine the .lanc files.
# merge multiple .lanc files
lanctools merge --input chr1.lanc --input chr2.lanc --input chr3.lanc --output chr1_3.lanc
Phenotypes and Covariates
Phenotype and covariate files are expected to be whitespace-delimited text files.
This means that column names may not contain whitespace.
If a header is not provided, it is assumed that the first column is for family IDs
(FID) and the second column is for individual IDs (IID). If a header line
is provided, it must begin with a # character and must include "IID" as a
column name. Two valid examples are given below, the first with a header and the second without:
#IID height crp_irnt
sample1 165 -1.23
sample2 175 -2.04
sample3 161 0.81

sample1 sample1 165 -1.23
sample2 sample2 175 -2.04
sample3 sample3 161 0.81
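The header rules above can be sketched as a small Python parser. This is an illustration of the format only, not agricola's own reader. The function name `read_pheno_table` and the placeholder names (`col1`, `col2`, ...) given to unnamed columns in headerless files are assumptions made for this example.

```python
def read_pheno_table(lines):
    """Parse whitespace-delimited phenotype/covariate lines into (header, records)."""
    rows = [line.split() for line in lines if line.strip()]
    if not rows:
        raise ValueError("no data lines found")
    if rows[0][0].startswith("#"):
        # Header line: drop the leading '#' and require an IID column.
        header = " ".join(rows[0]).lstrip("#").split()
        if "IID" not in header:
            raise ValueError("header must include an 'IID' column")
        data = rows[1:]
    else:
        # No header: column 1 is FID, column 2 is IID.
        # Remaining columns get placeholder names (hypothetical).
        header = ["FID", "IID"] + [f"col{i}" for i in range(1, len(rows[0]) - 1)]
        data = rows
    if any(len(row) != len(header) for row in data):
        raise ValueError("every row must have one value per column")
    # Values are kept as strings; convert to float downstream as needed.
    return header, [dict(zip(header, row)) for row in data]
```

Splitting on arbitrary whitespace is what makes whitespace inside a column name impossible to represent, which is why the format forbids it.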
Outputs
Step 1 Intermediate
Whole-genome leave-one-chromosome-out (LOCO) predictions from step 1 are saved
to the file {prefix}.pkl. This file consists of a serialized dictionary,
where keys are chromosomes and values are \((N, P)\) pandas DataFrames with
predictions for each sample and phenotype.
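For inspection, the file can be opened with Python's standard `pickle` module. The sketch below assumes only what is described above (a dictionary keyed by chromosome whose values expose a `.shape`); the function names are hypothetical, and unpickling the real file requires pandas to be installed. As with any pickle file, load it only if it comes from a source you trust.

```python
import pickle


def load_loco_predictions(path):
    """Load the {chromosome: DataFrame} dictionary written by step 1."""
    with open(path, "rb") as fh:
        return pickle.load(fh)


def summarize_shapes(preds):
    """Map each chromosome key to the (N, P) shape of its prediction table."""
    return {chrom: tuple(table.shape) for chrom, table in preds.items()}
```

For example, `summarize_shapes(load_loco_predictions("step1_preds.pkl"))` should report the same (N, P) shape for every chromosome if all samples and phenotypes were processed.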
Step 2 Results
Summary statistics from step 2 of agricola are saved in Apache Parquet format, one output file per input plink file and phenotype. These files have the following schema:
| Field | Type | Description |
|---|---|---|
| CHR | string | Chromosome |
| BP | int | Genomic position |
| REF | string | Reference allele |
| ALT | string | Alternate allele |
| ID | string | Variant ID |
| LOG10P_HET | double | P-value for test \(\beta_{\text{anc}_0}=\cdots=\beta_{\text{anc}_k}=0\) |
| BETA_{anc} | double | Effect estimate for \(\beta_{\text{anc}}\) |
| LOG10P_HOM | double | P-value for \(\beta=0\) under homogeneous model (all ancestry effects equal) |
| N | int | Sample size |
| AF_{anc} | double | Ancestry-specific allele frequency |
| LA_PROP_{anc} | double | Proportion of haplotypes from ancestry anc |
| LOG10P_LRT | double | P-value for likelihood ratio test of heterogeneous vs. homogeneous model (only output for Wald test) |
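A minimal sketch for reading one of these files in Python follows. It relies on `pandas.read_parquet`, which needs pandas plus a Parquet engine such as pyarrow. The function names are hypothetical, and the helper assumes that the `{anc}` placeholder in the schema is replaced by an ancestry label joined with an underscore (for example `BETA_AFR`), which should be checked against the actual output.

```python
def read_sumstats(path):
    """Read one step 2 Parquet file into a pandas DataFrame."""
    import pandas as pd  # imported lazily; requires pandas and a Parquet engine

    return pd.read_parquet(path)


def ancestry_columns(columns, prefix):
    """Return the per-ancestry columns for a schema prefix such as 'BETA' or 'AF'."""
    return [name for name in columns if name.startswith(prefix + "_")]
```

With a DataFrame `df` returned by `read_sumstats`, `ancestry_columns(df.columns, "BETA")` lists the effect-estimate columns without hard-coding ancestry names.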
Options
Global Options
These options are common to the step1, step2, and all-steps commands.
| Option | Argument | Type | Description |
|---|---|---|---|
| --plink | TEXT | optional | Plink2 file prefix. This option can be repeated to specify multiple files |
| --plink-list | TEXT | optional | File containing plink2 prefixes, one per line |
| --lanc | TEXT | optional | Local ancestry .lanc file. This option can be repeated to specify multiple files |
| --lanc-list | TEXT | optional | File containing .lanc file paths, one per line |
| --ancestries | TEXT | optional | Ancestry names, comma-separated and ordered as in .lanc files |
| --pheno-file | TEXT | required | Phenotype file |
| --pheno | TEXT | optional | Phenotype to include in the analysis. This option can be repeated to specify multiple phenotypes or omitted to use all phenotypes |
| --pheno-list | TEXT | optional | File containing phenotypes to include in the analysis, one per line. This option can be omitted to use all phenotypes |
| --covar-file | TEXT | optional | Covariates file |
| --covar | TEXT | optional | Covariate to include in the analysis. This option can be repeated to specify multiple covariates or omitted to use all covariates |
| --covar-list | TEXT | optional | File containing covariates to include in the analysis, one per line. This option can be omitted to use all covariates |
| --catcovar | TEXT | optional | Categorical covariate to include in the analysis. This option can be repeated to specify multiple categorical covariates |
| --catcovar-list | TEXT | optional | File containing categorical covariates to include in the analysis, one per line |
| --samples-file | TEXT | optional | Samples file |
| --trait-type | TEXT | optional | Trait type: quantitative (qt) or binary (bt) [default: qt] |
Info
--plink and --lanc can be repeated to specify multiple files.
E.g., --plink tests/data/chr20 --plink tests/data/chr21 --plink tests/data/chr22
Warning
Plink2 and .lanc files must match, meaning you must provide the same number of plink2/.lanc files in the same order.
Warning
Either --plink or --plink-list must be provided, but not both.
The same applies to --lanc and --lanc-list. For --pheno and
--pheno-list and --covar and --covar-list, either one may be provided
or neither (to use all phenotypes/covariates).
Warning
Categorical covariates specified through --catcovar or --catcovar-list must
be a subset of the full list provided through --covar or --covar-list.
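The three constraints above can be summarized in a short Python sketch. It is not agricola's own validation code; the function names are hypothetical and only illustrate the rules as stated: exactly one of a repeated option and its list file, equal-length and equally ordered plink2/.lanc inputs, and categorical covariates drawn from the selected covariates.

```python
def read_list(path):
    """Read a list file with one entry per line, ignoring blank lines."""
    with open(path) as fh:
        return [line.strip() for line in fh if line.strip()]


def resolve_option(repeated, list_file, name):
    """Require exactly one of --<name> (repeated) or --<name>-list."""
    if bool(repeated) == bool(list_file):
        raise ValueError(f"provide exactly one of --{name} or --{name}-list")
    return list(repeated) if repeated else read_list(list_file)


def check_inputs(plinks, lancs, covars=None, catcovars=()):
    """Pair plink2 prefixes with .lanc files and check categorical covariates."""
    if len(plinks) != len(lancs):
        raise ValueError("need the same number of plink2 and .lanc files")
    if covars is not None:
        # covars=None means "use all covariates", so there is nothing to compare against.
        unknown = set(catcovars) - set(covars)
        if unknown:
            raise ValueError(f"categorical covariates not in covariate list: {sorted(unknown)}")
    # Files are matched purely by position, so the order must agree.
    return list(zip(plinks, lancs))
```

Note that the pairing is positional: nothing in the file names is inspected, which is why both lists must be given in the same order.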
Step 1 Options
These are the non-global options for step1:
| Option | Argument | Type | Description |
|---|---|---|---|
| --output | TEXT | required | Step 1 predictions will be serialized and written to {prefix}.pkl |
| --level0-dir | TEXT | optional | Directory where level 0 predictions are saved (use temp dir if not provided) |
| --variant-file | TEXT | optional | File with variants to include, one per line |
| --h2-prior | TEXT | optional | SNP heritability priors, comma-separated [default: 0.01,0.255,0.5,0.745,0.99] |
| --block-size | INTEGER | optional | Number of variants per block [default: 2000] |
| --seed | INTEGER | optional | Random seed [default: 100] |
| --loocv | | optional | Use leave-one-out cross-validation (only for rare binary traits) [default: no-loocv] |
Step 2 Options
These are the non-global options for step2:
| Option | Argument | Type | Description |
|---|---|---|---|
| --output | TEXT | optional | Output prefix, one per plink2 prefix |
| --output-list | TEXT | optional | File containing output file prefixes, one per line and plink2 prefix |
| --level0-dir | TEXT | optional | Directory where level 0 predictions are saved (use temp dir if not provided) |
| --variant-file | TEXT | optional | File with variants to include, one per line |
| --chrom | TEXT | optional | Specify a single chromosome for step 2 |
| --test-type | TEXT | optional | Either "score" or "wald" [default: score] |
| --adjust-lanc | | optional | Either --adjust-lanc or --no-adjust-lanc [default: --adjust-lanc] |
| --impute | | optional | Either --impute or --no-impute. This must be --no-impute for binary traits. [default: --no-impute] |
| --block-size | INTEGER | optional | Number of variants per block [default: 1000] |
| --min-ac | INTEGER | optional | Minimum allele count [default: 1] |
Info
--output can be repeated to specify multiple files, like --plink and --lanc.
Warning
Either --output or --output-list must be provided, but not both.
Info
--no-impute must be used for binary traits. If any quantitative traits have
missing values, computational performance can be (often greatly) improved
by using --impute, which mean-imputes all missing phenotype values.
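To make the term concrete, mean imputation for a single phenotype column can be sketched as follows. This is a generic illustration and not agricola's implementation; the function name `mean_impute` and the choice to treat both `None` and NaN as missing are assumptions of this example.

```python
import math


def is_missing(value):
    """Treat None and NaN as missing."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def mean_impute(values):
    """Replace missing entries with the mean of the observed entries."""
    observed = [v for v in values if not is_missing(v)]
    if not observed:
        raise ValueError("cannot impute a phenotype with no observed values")
    mean = sum(observed) / len(observed)
    return [mean if is_missing(v) else v for v in values]
```

After imputation every phenotype is observed for the same set of samples. Replacing a case/control label with a fractional mean would no longer be a valid binary outcome, which is consistent with the requirement to use --no-impute for binary traits.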
All Steps Options
These are the non-global options for all-steps:
| Option | Argument | Type | Description |
|---|---|---|---|
| --output | TEXT | optional | Output prefix, one per plink2 prefix |
| --output-list | TEXT | optional | File containing output file prefixes, one per line and plink2 prefix |
| --variant-file1 | TEXT | optional | File with variants to include for step 0/1, one per line |
| --variant-file2 | TEXT | optional | File with variants to include for step 2, one per line |
| --test-type | TEXT | optional | Either "score" or "wald" [default: score] |
| --adjust-lanc | | optional | Either --adjust-lanc or --no-adjust-lanc [default: --adjust-lanc] |
| --impute | | optional | Either --impute or --no-impute. This must be --no-impute for binary traits. [default: --no-impute] |
| --block-size0 | INTEGER | optional | Number of variants per block in step 0 [default: 2000] |
| --block-size2 | INTEGER | optional | Number of variants per block in step 2 [default: 1000] |
| --min-ac | INTEGER | optional | Minimum allele count [default: 1] |
| --seed | INTEGER | optional | Random seed [default: 100] |
| --loocv | | optional | Use leave-one-out cross-validation (only for rare binary traits) [default: no-loocv] |
Info
--output can be repeated to specify multiple files, like --plink and --lanc.
Warning
Either --output or --output-list must be provided, but not both.
Info
--no-impute must be used for binary traits. If any quantitative traits have
missing values, computational performance can be (often greatly) improved
by using --impute, which mean-imputes all missing phenotype values.