agricola CLI

agricola provides a command-line interface for running local-ancestry-aware GWAS. The whole-genome regression and single-variant tests can be run as two separate steps (step1, step2) or as a single pipeline (all-steps).


Basic Usage

To get help, use the --help flag:

agricola --help
agricola step2 --help

Examples

A typical agricola run may look like:

agricola all-steps \
  --plink-list tests/data/plinks.txt \
  --lanc-list tests/data/lancs.txt \
  --pheno-file tests/data/pheno.tsv \
  --covar-file tests/data/covar.tsv \
  --output-list tests/data/outs.txt \
  --variant-file1 tests/data/variants.txt \
  --trait-type qt

Alternatively, steps 0/1 and 2 can be performed separately:

agricola step1 \
  --plink-list tests/data/plinks.txt \
  --lanc-list tests/data/lancs.txt \
  --pheno-file tests/data/pheno.tsv \
  --covar-file tests/data/covar.tsv \
  --output step1_preds \
  --variant-file tests/data/variants.txt \
  --trait-type qt

agricola step2 \
  --plink-list tests/data/plinks.txt \
  --lanc-list tests/data/lancs.txt \
  --pheno-file tests/data/pheno.tsv \
  --covar-file tests/data/covar.tsv \
  --step1-prefix step1_preds \
  --output-list tests/data/outs.txt \
  --trait-type qt

File Formats

Inputs

Genotype

agricola accepts genotype files in plink2 .pgen format (with corresponding .pvar and .psam files). Please see the plink2 documentation for further details. agricola accepts either 1) a single pgen file containing multiple chromosomes, or 2) a set of plink2 files, each corresponding to a separate chromosome.

Info

Extension to other formats such as bgen or vcf is possible but is not currently a priority. agricola requires phasing information, so unphased formats such as .bed are not supported.

Warning

It is assumed that all pgen/pvar files are sorted by chromosome and position.

Local Ancestry

agricola accepts only .lanc files, as defined by admix-kit, for local ancestry. This format was chosen for its flexibility, low memory overhead, and simplicity. Please see the admix-kit documentation for further details.

To make working with this format easier, we introduce the lanctools Python package and CLI tool. lanctools can convert RFMix msp.tsv files or FLARE vcf.gz files into .lanc format. We provide an example below. Please see the lanctools documentation for further details.

# convert RFMix msp.tsv output to .lanc format
lanctools convert-rfmix --file chr1.msp.tsv --plink-prefix chr1 --output chr1.lanc

Although most local ancestry inference algorithms produce chromosome-specific files, you may prefer to work with a single file containing multiple chromosomes. To do so, first merge pgen files using plink2, then use the lanctools merge command to combine the .lanc files.

# merge multiple .lanc files
lanctools merge --input chr1.lanc --input chr2.lanc --input chr3.lanc --output chr1_3.lanc

Phenotypes and Covariates

Phenotype and covariate files are expected to be whitespace-delimited text files, so column names may not contain whitespace. If a header line is provided, it must begin with a # character and must include "IID" as a column name. If no header is provided, the first column is assumed to hold family IDs (FID) and the second column individual IDs (IID). Two valid examples are given below, the first with a header and the second without:

#IID    height  crp_irnt
sample1 165 -1.23
sample2 175 -2.04
sample3 161 0.81

sample1 sample1 165 -1.23
sample2 sample2 175 -2.04
sample3 sample3 161 0.81
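
The header rules for phenotype and covariate files can be sketched in a few lines of Python. This is only an illustration of the rules as stated in this section, not agricola's actual parser; the helper name `read_sample_ids` is hypothetical.

```python
def read_sample_ids(lines):
    """Return the IID of every data row.

    Hypothetical helper for illustration only; not part of agricola.
    """
    rows = [line.split() for line in lines if line.strip()]
    if not rows:
        return []
    if rows[0][0].startswith("#"):
        # Header present: drop the leading '#' and locate the IID column.
        header = " ".join(rows[0]).lstrip("#").split()
        if "IID" not in header:
            raise ValueError('header must include an "IID" column')
        iid_col = header.index("IID")
        data = rows[1:]
    else:
        # No header: the first column is FID and the second is IID.
        iid_col = 1
        data = rows
    return [row[iid_col] for row in data]


with_header = ["#IID    height  crp_irnt", "sample1 165 -1.23", "sample2 175 -2.04"]
without_header = ["sample1 sample1 165 -1.23", "sample2 sample2 175 -2.04"]

print(read_sample_ids(with_header))
print(read_sample_ids(without_header))
```

Both calls return the same list of sample IDs, which is a quick way to confirm that a file will be interpreted the way you expect before starting a long run.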

Outputs

Step 1 Intermediate

Whole-genome leave-one-chromosome-out (LOCO) predictions from step 1 are saved to the file {prefix}.pkl. This file consists of a serialized dictionary, where keys are chromosomes and values are \((N, P)\) pandas DataFrames with predictions for each sample and phenotype.
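
To inspect a step 1 file, the dictionary can be loaded with Python's standard `pickle` module. The sketch below is a minimal example under stated assumptions: `loco_shapes` is a hypothetical helper, the chromosome keys are shown as strings although their actual type is not specified here, and a small stand-in file is written first so that the snippet runs without real agricola output. Reading a real file requires pandas to be installed, since the values are DataFrames, and pickle files should only be loaded from sources you trust.

```python
import pickle
import tempfile
from pathlib import Path
from types import SimpleNamespace


def loco_shapes(pkl_path):
    """Map each chromosome key to the (N, P) shape of its prediction table."""
    with open(pkl_path, "rb") as handle:
        preds = pickle.load(handle)
    return {chrom: table.shape for chrom, table in preds.items()}


# Stand-in for real step 1 output: objects exposing only a .shape attribute
# replace the pandas DataFrames (3 samples, 2 phenotypes per chromosome).
toy = {"20": SimpleNamespace(shape=(3, 2)), "21": SimpleNamespace(shape=(3, 2))}
path = Path(tempfile.mkdtemp()) / "step1_preds.pkl"
with open(path, "wb") as handle:
    pickle.dump(toy, handle)

shapes = loco_shapes(path)
print(shapes)
```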

Step 2 Results

Summary statistics from step 2 of agricola are saved in Apache Parquet format, one output file per input plink file and phenotype. These files have the following schema:

| Field | Type | Description |
| --- | --- | --- |
| CHR | string | Chromosome |
| BP | int | Genomic position |
| REF | string | Reference allele |
| ALT | string | Alternate allele |
| ID | string | Variant ID |
| LOG10P_HET | double | P-value for test \(\beta_{\text{anc}_0}=\cdots=\beta_{\text{anc}_k}=0\) |
| BETA_{anc} | double | Effect estimate for \(\beta_{\text{anc}}\) |
| LOG10P_HOM | double | P-value for \(\beta=0\) under homogeneous model (all ancestry effects equal) |
| N | int | Sample size |
| AF_{anc} | double | Ancestry-specific allele frequency |
| LA_PROP_{anc} | double | Proportion of haplotypes from ancestry anc |
| LOG10P_LRT | double | P-value for likelihood ratio test of heterogeneous vs. homogeneous model (only output for Wald test) |
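
Because the p-value fields are named LOG10P_*, a conversion to and from plain p-values is often useful. The sketch below assumes these columns store \(-\log_{10}(p)\), the convention used by tools such as regenie; the schema above does not state the sign, so verify this against your own output before relying on it. The function names are hypothetical.

```python
import math


def log10p_to_p(log10p):
    """Convert -log10(p) back to a p-value (assumes the negative-log convention)."""
    return 10.0 ** (-log10p)


def p_to_log10p(p):
    """Convert a p-value to -log10(p)."""
    return -math.log10(p)


# The commonly used genome-wide significance threshold p = 5e-8,
# expressed on the -log10 scale.
threshold = p_to_log10p(5e-8)
print(round(threshold, 2))
print(log10p_to_p(2.0))
```

Working on the log scale avoids floating-point underflow for very small p-values, so filtering rows with a LOG10P value above the threshold is usually preferable to converting every value back to a p-value.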

Options

Global Options

These options are common to the step1, step2, and all-steps commands.

| Option | Argument | Type | Description |
| --- | --- | --- | --- |
| --plink | TEXT | optional | Plink2 file prefix. This option can be repeated to specify multiple files |
| --plink-list | TEXT | optional | File containing plink2 prefixes, one per line |
| --lanc | TEXT | optional | Local ancestry .lanc file. This option can be repeated to specify multiple files |
| --lanc-list | TEXT | optional | File containing .lanc file paths, one per line |
| --ancestries | TEXT | optional | Ancestry names, comma-separated and ordered as in .lanc files |
| --pheno-file | TEXT | required | Phenotype file |
| --pheno | TEXT | optional | Phenotype to include in the analysis. This option can be repeated to specify multiple phenotypes or omitted to use all phenotypes |
| --pheno-list | TEXT | optional | File containing phenotypes to include in the analysis, one per line. This option can be omitted to use all phenotypes |
| --covar-file | TEXT | optional | Covariates file |
| --covar | TEXT | optional | Covariate to include in the analysis. This option can be repeated to specify multiple covariates or omitted to use all covariates |
| --covar-list | TEXT | optional | File containing covariates to include in the analysis, one per line. This option can be omitted to use all covariates |
| --catcovar | TEXT | optional | Categorical covariate to include in the analysis. This option can be repeated to specify multiple covariates or omitted to use all covariates |
| --catcovar-list | TEXT | optional | File containing categorical covariates to include in the analysis, one per line. This option can be omitted to use all covariates |
| --samples-file | TEXT | optional | Samples file |
| --trait-type | TEXT | optional | Trait type: quantitative (qt) or binary (bt) [default: qt] |

Info

--plink and --lanc can be repeated to specify multiple files. E.g., --plink tests/data/chr20 --plink tests/data/chr21 --plink tests/data/chr22

Warning

Plink2 and .lanc files must match, meaning you must provide the same number of plink2/.lanc files in the same order.
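
A quick pre-flight check can catch mismatched list files before a run starts. The following is a minimal sketch, assuming list files with one entry per line as passed to --plink-list and --lanc-list; `read_list` and `pair_inputs` are hypothetical helpers, and the paths are made up for the example.

```python
import tempfile
from pathlib import Path


def read_list(path):
    """Read one entry per line, skipping blank lines."""
    text = Path(path).read_text()
    return [line.strip() for line in text.splitlines() if line.strip()]


def pair_inputs(plink_list, lanc_list):
    """Pair each plink2 prefix with the .lanc file on the same line number."""
    plinks = read_list(plink_list)
    lancs = read_list(lanc_list)
    if len(plinks) != len(lancs):
        raise ValueError(
            f"{len(plinks)} plink2 prefixes but {len(lancs)} .lanc files"
        )
    return list(zip(plinks, lancs))


# Toy list files written to a temporary directory.
tmp = Path(tempfile.mkdtemp())
(tmp / "plinks.txt").write_text("data/chr20\ndata/chr21\n")
(tmp / "lancs.txt").write_text("data/chr20.lanc\ndata/chr21.lanc\n")

pairs = pair_inputs(tmp / "plinks.txt", tmp / "lancs.txt")
for plink_prefix, lanc_file in pairs:
    print(plink_prefix, "<->", lanc_file)
```

This check only confirms that the counts agree. Printing the pairs lets you verify by eye that each plink2 prefix is matched with the .lanc file for the same chromosome.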

Warning

Either --plink or --plink-list must be provided, but not both. The same applies to --lanc and --lanc-list. For --pheno and --pheno-list and --covar and --covar-list, either one may be provided or neither (to use all phenotypes/covariates).

Warning

Categorical covariates specified through --catcovar or --catcovar-list must be a subset of the full list provided through --covar or --covar-list.
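
The subset requirement for categorical covariates can be checked ahead of time with a few lines of Python. This is an illustrative sketch only; the helper `check_catcovars` and the covariate names are hypothetical, and agricola's own validation may differ.

```python
def check_catcovars(covars, catcovars):
    """Raise if any categorical covariate is missing from the covariate list."""
    known = set(covars)
    missing = [name for name in catcovars if name not in known]
    if missing:
        raise ValueError(f"categorical covariates not in covariate list: {missing}")
    return True


# Made-up covariate names for the example.
covars = ["age", "sex", "site", "PC1", "PC2"]
print(check_catcovars(covars, ["sex", "site"]))

try:
    check_catcovars(covars, ["sex", "batch"])
except ValueError as err:
    print(err)
```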

Step 1 Options

These are the non-global options for step1:

| Option | Argument | Type | Description |
| --- | --- | --- | --- |
| --output | TEXT | required | Output prefix. Step 1 predictions will be serialized and written to {prefix}.pkl |
| --level0-dir | TEXT | optional | Directory where level 0 predictions are saved (use temp dir if not provided) |
| --variant-file | TEXT | optional | File with variants to include, one per line |
| --h2-prior | TEXT | optional | SNP heritability priors, comma-separated [default: 0.01,0.255,0.5,0.745,0.99] |
| --block-size | INTEGER | optional | Number of variants per block [default: 2000] |
| --seed | INTEGER | optional | Random seed [default: 100] |
| --loocv | | optional | Use leave-one-out cross-validation (only for rare binary traits) [default: no-loocv] |

Step 2 Options

These are the non-global options for step2:

| Option | Argument | Type | Description |
| --- | --- | --- | --- |
| --output | TEXT | optional | Output prefix, one per plink2 prefix |
| --output-list | TEXT | optional | File containing output file prefixes, one per line and plink2 prefix |
| --level0-dir | TEXT | optional | Directory where level 0 predictions are saved (use temp dir if not provided) |
| --variant-file | TEXT | optional | File with variants to include, one per line |
| --chrom | TEXT | optional | Specify a single chromosome for step 2 |
| --test-type | TEXT | optional | Either "score" or "wald" [default: score] |
| --adjust-lanc | | optional | Either --adjust-lanc or --no-adjust-lanc [default: --adjust-lanc] |
| --impute | | optional | Either --impute or --no-impute. This must be --no-impute for binary traits. [default: --no-impute] |
| --block-size | INTEGER | optional | Number of variants per block [default: 1000] |
| --min-ac | INTEGER | optional | Minimum allele count [default: 1] |

Info

--output can be repeated to specify multiple files, like --plink and --lanc.

Warning

Either --output or --output-list must be provided, but not both.

Info

--no-impute must be used for binary traits. If any quantitative traits have missing values, computational performance can be (often greatly) improved by using --impute, which mean-imputes all missing phenotype values.
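
Mean imputation replaces each missing phenotype value with the mean of the observed values for that phenotype. The sketch below illustrates the idea on a single phenotype; it is not agricola's implementation, `mean_impute` is a hypothetical helper, and missing values are represented as `None` purely for the example.

```python
def mean_impute(values):
    """Replace missing entries (None) with the mean of the observed entries."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot impute a phenotype with no observed values")
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]


# One phenotype with a single missing value.
height = [165.0, None, 161.0, 175.0]
imputed = mean_impute(height)
print(imputed)
```

After imputation every phenotype is defined on the same set of samples. A mean has no sensible interpretation for a case/control label, which is consistent with the requirement to use --no-impute for binary traits.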

All Steps Options

These are the non-global options for all-steps:

| Option | Argument | Type | Description |
| --- | --- | --- | --- |
| --output | TEXT | optional | Output prefix, one per plink2 prefix |
| --output-list | TEXT | optional | File containing output file prefixes, one per line and plink2 prefix |
| --variant-file1 | TEXT | optional | File with variants to include for step 0/1, one per line |
| --variant-file2 | TEXT | optional | File with variants to include for step 2, one per line |
| --test-type | TEXT | optional | Either "score" or "wald" [default: score] |
| --adjust-lanc | | optional | Either --adjust-lanc or --no-adjust-lanc [default: --adjust-lanc] |
| --impute | | optional | Either --impute or --no-impute. This must be --no-impute for binary traits. [default: --no-impute] |
| --block-size0 | INTEGER | optional | Number of variants per block in step 0 [default: 2000] |
| --block-size2 | INTEGER | optional | Number of variants per block in step 2 [default: 1000] |
| --min-ac | INTEGER | optional | Minimum allele count [default: 1] |
| --seed | INTEGER | optional | Random seed [default: 100] |
| --loocv | | optional | Use leave-one-out cross-validation (only for rare binary traits) [default: no-loocv] |

Info

--output can be repeated to specify multiple files, like --plink and --lanc.

Warning

Either --output or --output-list must be provided, but not both.

Info

--no-impute must be used for binary traits. If any quantitative traits have missing values, computational performance can be (often greatly) improved by using --impute, which mean-imputes all missing phenotype values.