Introduction

Interpreting genetic variation data is a crucial step in understanding the relationship between gene sequence changes and biological function. Several annotation tools, such as ANNOVAR, VEP, and vcfanno, have been developed. These tools make annotating genetic variation data faster and more convenient than before. However, each tool has its own usage conventions and design architecture, which makes them harder for bioinformatics beginners to use. In addition, many existing database resources and annotation scripts have not been well integrated or shared.

It is therefore worthwhile to develop an integrated annotation system that brings together not only the different annotation tools but also the relevant database resources. Here, we present annovarR, an R package for integrated annotation. It provides a series of R functions that integrate external annotation tools and annotation databases.

Installation

To install annovarR, you first need an R interpreter (Linux, macOS, and Windows are supported). The package is available on the Comprehensive R Archive Network (CRAN, https://cran.r-project.org), so it can be installed from R with install.packages("annovarR").

If you want the latest development version, use the install_github function from devtools.

annovarR can also be installed from the source code archive (R CMD INSTALL). In that case, you need to resolve its many package dependencies manually.

Tip: If the RMySQL or RSQLite package cannot be installed directly from R, conda is an alternative: conda install -c r r-rmysql r-rsqlite. Otherwise, you need root permissions to install the corresponding system dependencies.

Download database resource

To simplify downloading databases and other materials, annovarR provides a single function, download.database, that downloads various annotation databases for ANNOVAR, AnnotationDbi, vcfanno, VEP, oncotator and giggle. Moreover, you can share a database configuration so that others can download your database in protected (license code required) or unprotected mode.

Basic annotation

anno.name is the key for shortcut annotation. You can list the available annotation names with the function get.annotation.names(), or look them up in the vignette.

In addition, download.name is required to download the database resource associated with an anno.name. The function get.download.name returns it.

Notably, annovarR is not a command-line program; it provides only R functions and a Shiny app (in development) for the annotation steps.

However, unlike ANNOVAR, annovarR accepts both R objects and external files as input, and you can get the columns required by the matched database with get.annotation.needcols().

To facilitate annotation with SQLite databases, annovarR also provides a simple function, sqlite.build, that takes a text file as input and outputs a SQLite database.

This is helpful for annotation databases with very large files: a SQLite or other SQL database with indexes can significantly reduce search and analysis time, and it requires no other bio-annotation tools, only a capable SQL database client.
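As a rough illustration of why an indexed SQLite table helps, the sketch below loads a small variant table into an in-memory SQLite database with the DBI and RSQLite packages, adds an index on the lookup columns, and queries one position. It is not the implementation of sqlite.build; the table name, column names, and identifiers are invented for this example.

```r
library(DBI)

# Toy annotation table; the column names and IDs are illustrative only.
variants <- data.frame(
  chr   = c("chr1", "chr1", "chr2"),
  start = c(10020L, 10056L, 10020L),
  ref   = c("A", "C", "A"),
  alt   = c("-", "T", "G"),
  id    = c("rs_demo1", "rs_demo2", "rs_demo3"),
  stringsAsFactors = FALSE
)

# In-memory database, so nothing is written to disk.
con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con, "demo_db", variants)

# An index on the lookup columns lets SQLite avoid scanning the whole table.
dbExecute(con, "CREATE INDEX idx_demo_chr_start ON demo_db (chr, start)")

# Parameterised lookup of a single position.
hit <- dbGetQuery(
  con,
  "SELECT id FROM demo_db WHERE chr = ? AND start = ?",
  params = list("chr1", 10056L)
)
print(hit$id)

dbDisconnect(con)
```

On a table of three rows the index makes no measurable difference; the benefit appears with databases of millions of rows, where position lookups would otherwise require a full scan.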

The function annotation is the main interface to the various annotation systems, including ANNOVAR, vcfanno, VEP, and R-based annotation systems such as AnnotationDbi and other annotation scripts written in R.

For example, if the anno.name contains perl_annovar, annovarR uses the external ANNOVAR program to perform the annotation step and reads the output file (VCF output is read with vcfR) as a data.table object.
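The idea of choosing a backend from the annotation name can be pictured with a small base R sketch. The helper pick_backend is hypothetical and is not how annovarR is implemented; only the perl_annovar rule comes from the text above, while the vep_ and vcfanno_ prefixes are assumptions based on the example names used later in this document.

```r
# Hypothetical helper: map an anno.name to the backend that would handle it.
pick_backend <- function(anno.name) {
  if (grepl("perl_annovar", anno.name, fixed = TRUE)) {
    "annovar"          # external Perl ANNOVAR
  } else if (startsWith(anno.name, "vep_")) {
    "vep"              # assumed prefix
  } else if (startsWith(anno.name, "vcfanno_")) {
    "vcfanno"          # assumed prefix
  } else {
    "r"                # fall back to an R-based annotation
  }
}

# "some_r_annotation" is a placeholder name, not a real anno.name.
anno.names <- c("perl_annovar_ensGene", "vep_all", "vcfanno_demo",
                "some_r_annotation")
backends <- vapply(anno.names, pick_backend, character(1), USE.NAMES = FALSE)
print(backends)
```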

Advanced annotation

Besides the predefined shortcut anno.name values, annovarR also provides the customizable functions annotation.cols.match and annotation.region.match for full matching and region matching, respectively.
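The difference between the two matching modes can be sketched in base R. This example does not call annotation.cols.match or annotation.region.match, whose arguments are not described here. It only illustrates the two concepts on toy data, and it assumes closed intervals for the region match.

```r
# Query variants to annotate (toy data).
query <- data.frame(
  chr   = c("chr1", "chr1", "chr2"),
  start = c(100L, 250L, 100L),
  ref   = c("A", "G", "C"),
  alt   = c("T", "A", "G"),
  stringsAsFactors = FALSE
)

# Full match: a database row applies only if chr, start, ref and alt all agree.
db_exact <- data.frame(
  chr   = c("chr1", "chr2"),
  start = c(100L, 100L),
  ref   = c("A", "C"),
  alt   = c("T", "T"),
  label = c("known_1", "known_2"),
  stringsAsFactors = FALSE
)
variant_key <- function(d) paste(d$chr, d$start, d$ref, d$alt, sep = ":")
query$label <- db_exact$label[match(variant_key(query), variant_key(db_exact))]

# Region match: a database row applies if the position falls inside its
# [start, end] interval on the same chromosome.
db_region <- data.frame(
  chr    = c("chr1", "chr1", "chr2"),
  start  = c(50L, 200L, 500L),
  end    = c(150L, 300L, 600L),
  region = c("region_a", "region_b", "region_c"),
  stringsAsFactors = FALSE
)
region_of <- function(chr, pos) {
  hit <- db_region$region[db_region$chr == chr &
                          db_region$start <= pos &
                          db_region$end >= pos]
  if (length(hit) == 0) NA_character_ else paste(hit, collapse = ";")
}
query$region <- mapply(region_of, query$chr, query$start, USE.NAMES = FALSE)

print(query)
```

The third query variant shares its position with a database row but differs in alt, so the full match leaves it unannotated; region matching ignores the alleles and depends only on the coordinates.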

Various other small annotation functions and scripts are also provided, and the number of annotation functions will continue to grow.

External annotation system

If you want to use the Perl version of ANNOVAR, you need to download the ANNOVAR source code using the R package BioInstaller and prepare your input in the avinput format, following the tutorial:

chr start end ref alt
chr1 100000 100000 A T
chr1 100000 100001 AA -
chr1 100000 100000 - T
chr1 100000 100000 - CAC

# Download ANNOVAR with BioInstaller; annovar.dir is the destination directory
library(BioInstaller)
install.bioinfo('annovar', annovar.dir)
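The avinput rows shown above follow a simple coordinate rule: for substitutions and deletions the end is start plus the length of ref minus one, and for insertions ref is "-" and end equals start. The sketch below rebuilds those four rows in base R and writes them to a tab-separated file. The helper avinput_end is hypothetical, and the file is written without a header row, which is how plain ANNOVAR avinput files are usually laid out; whether annovarR itself expects the header line is not stated here.

```r
# Hypothetical helper: derive the avinput end coordinate from start and ref.
avinput_end <- function(start, ref) {
  # Insertions use "-" as ref and keep end equal to start.
  ifelse(ref == "-", start, start + nchar(ref) - 1L)
}

variants <- data.frame(
  chr   = "chr1",
  start = c(100000L, 100000L, 100000L, 100000L),  # integers avoid 1e+05 output
  ref   = c("A", "AA", "-", "-"),
  alt   = c("T", "-", "T", "CAC"),
  stringsAsFactors = FALSE
)
variants$end <- avinput_end(variants$start, variants$ref)
variants <- variants[, c("chr", "start", "end", "ref", "alt")]

# Write a tab-separated file without quotes, row names or header.
avinput_file <- tempfile(fileext = ".avinput")
write.table(variants, avinput_file, sep = "\t", quote = FALSE,
            row.names = FALSE, col.names = FALSE)

print(variants)
```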

To shorten the test time, we set debug to TRUE, so that the command to run ANNOVAR is returned instead of being executed.

VCF files can be processed by ANNOVAR, VEP, and vcfanno. If you don’t want to read the output files in R, you can set the parameter debug to TRUE and paste the returned command into a shell to run it.

# Annotate a VCF file using ANNOVAR
# Setting debug to TRUE returns the command without running it
x <- annotation(anno.name = "perl_annovar_ensGene", input.file = "/tmp/test.vcf",
             annovar.dir = "/opt/bin/annovar/", database.dir = "{{annovar.dir}}/humandb", 
             out = tempfile(), vcfinput = TRUE, debug = TRUE)
#> /usr/bin/perl /opt/bin/annovar//table_annovar.pl /tmp/test.vcf {annovar.dir}/humandb -buildver hg19 -out /var/folders/nc/yl5qhkkn6vxf_m7s_yz2kzvh0000gn/T//RtmpaAQ3JM/file6b9f7629d739 -remove -protocol ensGene -operation g -nastring .  -vcfinput

# Annotate a VCF file using VEP
vep(debug = TRUE)
#> vep --cache_version 91 --assembly GRCh37 --dir /Users/ljf/.vep --output_file variant_effect_output.txt --cache --offline --everything
#> [1] "vep --cache_version 91 --assembly GRCh37 --dir /Users/ljf/.vep --output_file variant_effect_output.txt --cache --offline --everything "
x <- annotation(anno.name = "vep_all", input.file = "/tmp/test.vcf",
             out = tempfile(), debug = TRUE)
#> vep --cache_version 91 --assembly hg19 --dir /Users/ljf/.vep --output_file /var/folders/nc/yl5qhkkn6vxf_m7s_yz2kzvh0000gn/T//RtmpaAQ3JM/file6b9f34f11bca --input_file /tmp/test.vcf --cache --offline --everything

# Annotate a VCF file using vcfanno
vcfanno(debug = TRUE)
#> vcfanno_linux64 -p 2 /Library/Frameworks/R.framework/Versions/3.5/Resources/library/annovarR/extdata/demo/vcfanno_demo/conf.toml input.vcf > output.vcf
#> [1] "vcfanno_linux64 -p 2 /Library/Frameworks/R.framework/Versions/3.5/Resources/library/annovarR/extdata/demo/vcfanno_demo/conf.toml input.vcf > output.vcf"
x <- annotation(anno.name = "vcfanno_demo", input.file = system.file("extdata", "demo/vcfanno_demo/query.vcf.gz", 
                   package = "annovarR"), out = "test.vcf", vcfanno = "/path/vcfanno", debug = TRUE)
#> vcfanno_linux64 -base-path /Library/Frameworks/R.framework/Versions/3.5/Resources/library/annovarR/extdata/demo/vcfanno_demo/ -lua /Library/Frameworks/R.framework/Versions/3.5/Resources/library/annovarR/extdata/demo/vcfanno_demo/custom.lua -p 2 /Library/Frameworks/R.framework/Versions/3.5/Resources/library/annovarR/extdata/demo/vcfanno_demo/conf.toml /Library/Frameworks/R.framework/Versions/3.5/Resources/library/annovarR/extdata/demo/vcfanno_demo/query.vcf.gz > test.vcf
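Instead of copying a command returned in debug mode by hand, it can be saved as a small shell script and run later outside R. The sketch below assumes the returned value is a single character string, as the vep(debug = TRUE) output above suggests; the command used here is a shortened placeholder, not real output.

```r
# Placeholder command; in practice this would be the string returned by a
# debug = TRUE call.
cmd <- "vep --cache --offline --input_file /tmp/test.vcf --output_file out.txt"

# Write the command into an executable shell script.
script <- tempfile(fileext = ".sh")
writeLines(c("#!/bin/sh", "set -e", cmd), script)
Sys.chmod(script, mode = "0755")

# Show the script; it can then be run from a terminal with: sh <script path>
cat(readLines(script), sep = "\n")
```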

Databases in annovarR

Details about the supported annotation databases can be found in another vignette.

Session info

Here is the output of sessionInfo() on the system on which this document was compiled:

#> R version 3.5.0 (2018-04-23)
#> Platform: x86_64-apple-darwin15.6.0 (64-bit)
#> Running under: macOS High Sierra 10.13.4
#> 
#> Matrix products: default
#> BLAS: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRblas.0.dylib
#> LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib
#> 
#> locale:
#> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] rvest_0.3.2          xml2_1.2.0           stringr_1.3.1       
#> [4] RCurl_1.95-4.10      bitops_1.0-6         data.table_1.11.4   
#> [7] annovarR_1.1.1       BioInstaller_0.3.3.2
#> 
#> loaded via a namespace (and not attached):
#>  [1] Biobase_2.40.0       httr_1.3.1           RMySQL_0.10.15      
#>  [4] viridisLite_0.3.0    bit64_0.9-7          jsonlite_1.5        
#>  [7] prettydoc_0.2.1      R.utils_2.6.0        vcfR_1.8.0          
#> [10] shiny_1.1.0          assertthat_0.2.0     stats4_3.5.0        
#> [13] blob_1.1.1           yaml_2.1.19          RSQLite_2.1.1       
#> [16] backports_1.1.2      lattice_0.20-35      glue_1.2.0          
#> [19] digest_0.6.15        promises_1.0.1       Matrix_1.2-14       
#> [22] htmltools_0.3.6      httpuv_1.4.3         R.oo_1.22.0         
#> [25] pkgconfig_2.0.1      devtools_1.13.5      xtable_1.8-2        
#> [28] later_0.7.2          git2r_0.21.0.9002    mgcv_1.8-23         
#> [31] IRanges_2.14.10      DT_0.4               withr_2.1.2         
#> [34] RcppTOML_0.1.3       BiocGenerics_0.26.0  magrittr_1.5        
#> [37] crayon_1.3.4         mime_0.5             memoise_1.1.0       
#> [40] evaluate_0.10.1      R.methodsS3_1.7.1    fs_1.2.2            
#> [43] nlme_3.1-137         MASS_7.3-50          vegan_2.5-2         
#> [46] tools_3.5.0          org.Hs.eg.db_3.6.0   formatR_1.5         
#> [49] S4Vectors_0.18.2     ngstk_0.2.1.1        cluster_2.0.7-1     
#> [52] configr_0.3.2.3      AnnotationDbi_1.42.1 lambda.r_1.2.3      
#> [55] compiler_3.5.0       pkgdown_1.1.0        rlang_0.2.0.9001    
#> [58] futile.logger_1.4.3  grid_3.5.0           rstudioapi_0.7      
#> [61] htmlwidgets_1.2      rappdirs_0.3.1       crosstalk_1.0.0     
#> [64] rmarkdown_1.9        DBI_1.0.0            roxygen2_6.0.1      
#> [67] R6_2.2.2             ini_0.3.1            knitr_1.20          
#> [70] pinfsc50_1.1.0       bit_1.1-13           commonmark_1.5      
#> [73] rprojroot_1.3-2      liteq_1.0.1          futile.options_1.0.1
#> [76] permute_0.9-4        desc_1.2.0           ape_5.1             
#> [79] stringi_1.2.2        parallel_3.5.0       Rcpp_0.12.17