pkgr overview
pkgr is a rethinking of the way packages are managed in R. Namely, it embraces
the declarative philosophy of defining the ideal state of the entire system and working
towards achieving that objective. Furthermore, pkgr is built with a focus on reproducibility
and auditability of what is going on, a vital component for the pharmaceutical sciences and enterprises.
Why pkgr?
install.packages and friends such as remotes::install_github have a subtle weakness:
they are not good at controlling desired global state. There are some knobs that
can be turned, but overall their APIs are generally not what the user actually needs. Rather, they
are the mechanism by which the user can strive towards their needs, in a forcibly iterative fashion.
With pkgr, you can, in a parallel-processed manner, do things like:
- Install a number of packages from various repositories, when specific packages must be pulled from specific repositories
- Install Suggested packages only for a subset of all packages you'd like to install
- Customize the installation behavior of a single package in a documentable and reproducible way
- Set custom Makevars for a package that persist across system installations
- Install source versions of some packages but binaries for others
- Understand how your R environment will be changed before performing an installation or action.
Today, packages are highly interwoven. Best practices have pushed towards small, well-scoped packages that each do a few things well. For example, rather than just having plyr, we now use dplyr and purrr to cover the same set of responsibilities (dealing with data frames and dealing with other list/vector objects in an iterative way). As such, it is becoming increasingly difficult to manage the set of packages in a transparent and robust way.
pkgr in action
Getting started
pkgr is a command line utility with several top level commands. The two primary commands are:
pkgr plan # show what would happen if install is run
pkgr install # install the packages specified in pkgr.yml
The actions are controlled by a configuration file that specifies the desired global state, namely, by defining the top level packages a user cares about, as well as specific configuration customizations.
For example, a pkgr configuration file might look like:
Version: 1
# top level packages
Packages:
- rmarkdown
- bitops
- caTools
- knitr
- tidyverse
- shiny
- logrrr
# any repositories, order matters
Repos:
- MPN: "https://mpn.metworx.com/snapshots/stable/2020-12-21"
# path to install packages to
Library: "<path/to/install/library>"
# package specific customizations
Customizations:
  Packages:
    - shiny:
        Suggests: true
When you run pkgr install with this as your pkgr.yml file, pkgr will download and
install the packages listed in the Packages array, and any dependencies that those packages require.
If you want to see everything that pkgr is going to install before actually installing, simply run pkgr plan and take a look.
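To make "the listed packages and any dependencies" concrete, here is a minimal Python sketch of a transitive dependency walk with Suggests enabled for selected packages only. It is an illustration under stated assumptions, not pkgr's actual resolver: the package names and the DEPENDS and SUGGESTS tables are made up, and a real tool would read this metadata from the repositories.

```python
from collections import deque

# Hypothetical metadata for illustration only; these are not real packages.
DEPENDS = {
    "app": ["web", "util"],
    "web": ["util"],
    "util": [],
    "testkit": [],
}
SUGGESTS = {
    "app": ["testkit"],
    "web": ["testkit"],
}


def required_packages(top_level, suggests_for=()):
    """Return every package needed to install `top_level`.

    Dependencies are followed transitively. Suggested packages are added
    only for the names in `suggests_for`, loosely mirroring a per-package
    `Suggests: true` customization.
    """
    needed = set()
    queue = deque(top_level)
    while queue:
        pkg = queue.popleft()
        if pkg in needed:
            continue  # already visited
        needed.add(pkg)
        queue.extend(DEPENDS.get(pkg, []))
        if pkg in suggests_for:
            queue.extend(SUGGESTS.get(pkg, []))
    return needed


if __name__ == "__main__":
    print(sorted(required_packages(["app"])))
    print(sorted(required_packages(["app"], suggests_for={"app"})))
```

Note that "web" also suggests "testkit" in this toy data, yet "testkit" is pulled in only when Suggests is enabled for a package explicitly.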
How about a more complex example? One such situation is the need to install from multiple repositories.
Here is a configuration that also pulls from Bioconductor, which is split across multiple CRAN-like repositories:
Version: 1
# top level packages
Packages:
- magrittr
- rlang
- ggplot2
- dplyr
- tidyr
- plotly
- VennDiagram
- aws.s3
- data.table
- forcats
- preprocessCore
- loomR
- ggthemes
- reshape
# any repositories, order matters
Repos:
- MPN: "https://mpn.metworx.com/snapshots/stable/2020-12-21"
- BioCsoft: "https://bioconductor.org/packages/3.12/bioc"
- BioCann: "https://bioconductor.org/packages/3.12/data/annotation"
- BioCexp: "https://bioconductor.org/packages/3.12/data/experiment"
- BioCworkflows: "https://bioconductor.org/packages/3.12/workflows"
# path to install packages to
Library: pkgs
Cache: pkgcache
Logging:
  all: pkgr-log.log
  install: install-only-log.log
  overwrite: true
The default behavior of pkgr is to find the first repository that contains the given package and use that. You
can use Customizations to control that behavior at the Repos and Packages level.
For example, in the following configuration dplyr is available in both repositories, so it would default to MPN. Setting Repo in the package customization forces dplyr to be installed from CRAN.
Version: 1
# top level packages
Packages:
- dplyr
- ggplot2
Repos:
- MPN: "https://mpn.metworx.com/snapshots/stable/2020-12-21"
- CRAN: "https://cran.rstudio.com"
Library: "test-library"
Customizations:
  Packages:
    - dplyr:
        Repo: CRAN
You can confirm this behavior by inspecting the debug output of the plan:
pkgr plan --loglevel=debug
INFO[0000] Installation would launch 16 workers
INFO[0000] R Version 3.6.3
INFO[0000] OS Platform x86_64-apple-darwin15.6.0
INFO[0000] Package Library will be created path=test-library
INFO[0000] Default package installation type: binary
INFO[0000] 1072:1073 (binary:source) packages available in for MPN from https://mpn.metworx.com/snapshots/stable/2020-12-21
INFO[0000] 16593:16772 (binary:source) packages available in for CRAN from https://cran.rstudio.com
INFO[0000] Package installation cache directory: /Users/devinp/Library/Caches/pkgr
INFO[0000] Database cache directory: /Users/devinp/Library/Caches/pkgr/r_packagedb_caches
DEBU[0000] package repository set pkg=dplyr relationship="user package" repo=CRAN type=binary version=1.0.2
DEBU[0000] package repository set pkg=ggplot2 relationship="user package" repo=MPN type=binary version=3.3.2
DEBU[0000] package repository set pkg=labeling relationship=dependency repo=MPN type=binary version=0.4.2
DEBU[0000] package repository set pkg=rematch2 relationship=dependency repo=MPN type=binary version=2.1.2
DEBU[0000] package repository set pkg=isoband relationship=dependency repo=MPN type=binary version=0.2.3
DEBU[0000] package repository set pkg=lifecycle relationship=dependency repo=MPN type=binary version=0.2.0
DEBU[0000] package repository set pkg=mgcv relationship=dependency repo=MPN type=binary version=1.8-33
.... TRUNCATED FOR WEBSITE
INFO[0000] package installation status installed=0 not_from_pkgr=0 outdated=0 total_packages_required=53
INFO[0000] package installation sources CRAN=1 MPN=52 tarballs=0
INFO[0000] package installation plan to_install=53 to_update=0
INFO[0000] Library path to install packages: test-library
INFO[0000] resolution time 223.09698ms
Notice that dplyr resolved to the CRAN repo, while all other packages resolved to MPN.
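The first-match rule and the per-package override can be sketched in a few lines of Python. This is only an illustration of the behavior described above, not pkgr's implementation; the function name resolve_repo, its arguments, and the availability table are hypothetical.

```python
def resolve_repo(pkg, repos, available, pinned=None):
    """Pick the repository a package would be installed from.

    repos:     repository names in configuration order (order matters)
    available: dict mapping repository name -> set of package names it hosts
    pinned:    optional dict mapping package name -> repository name,
               playing the role of a per-package `Repo:` customization
    """
    pinned = pinned or {}
    if pkg in pinned:
        repo = pinned[pkg]
        if pkg not in available.get(repo, set()):
            raise LookupError(f"{pkg} is pinned to {repo} but not available there")
        return repo
    # Default: the first repository in order that contains the package wins.
    for repo in repos:
        if pkg in available.get(repo, set()):
            return repo
    raise LookupError(f"{pkg} not found in any repository")


# Toy data mirroring the example configuration above.
REPOS = ["MPN", "CRAN"]
AVAILABLE = {
    "MPN": {"dplyr", "ggplot2"},
    "CRAN": {"dplyr", "ggplot2"},
}

if __name__ == "__main__":
    for name in ("dplyr", "ggplot2"):
        print(name, resolve_repo(name, REPOS, AVAILABLE, pinned={"dplyr": "CRAN"}))
```

Without the pin, both packages resolve to MPN because it is listed first; with it, only dplyr moves to CRAN.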
Once a package has been installed, pkgr will not touch that package unless you explicitly request it using the --update flag.
Therefore, if you change your configuration after already installing a package (for example, changing the repository),
pkgr will not override the installed version, even if it detects a different version under the new plan, unless --update is passed.
pkgr install --update
Be careful about leveraging this "feature" to manually build up a combination of package versions. It is much better to be explicit about your intent, namely by adjusting the Customizations to reflect the environment you want to maintain, so others can reproduce your environment.
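The effect of the update flag on a plan can be modeled roughly as follows. This Python sketch is an assumption-laden illustration rather than pkgr's logic: plan_actions is a hypothetical helper, and a plain version inequality stands in for whatever comparison pkgr actually performs.

```python
def plan_actions(installed, desired, update=False):
    """Classify packages for an installation plan.

    installed: dict of package -> version currently in the library
    desired:   dict of package -> version resolved from the configuration
    update:    whether already-installed packages may be replaced
    """
    to_install = sorted(p for p in desired if p not in installed)
    # Assumption: any version mismatch counts as "outdated".
    outdated = sorted(
        p for p in desired if p in installed and installed[p] != desired[p]
    )
    # Installed packages are left alone unless an update was requested.
    to_update = outdated if update else []
    return {"to_install": to_install, "outdated": outdated, "to_update": to_update}


if __name__ == "__main__":
    installed = {"dplyr": "1.0.0"}
    desired = {"dplyr": "1.0.2", "ggplot2": "3.3.2"}
    print(plan_actions(installed, desired))
    print(plan_actions(installed, desired, update=True))
```

In the first call dplyr is reported as outdated but is not scheduled for replacement; only the second call, with update=True, schedules it.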
There are many other controls for pkgr, which can be seen in pkgr's configuration.
Installing stand-alone packages
pkgr can also install single packages that are not attached to a repository. This can be convenient
when you have a package internal to you or your company that is not hosted anywhere; listing its
tarball lets you include it in the installation. pkgr will also automatically reconcile the
dependencies needed to install it.
Tarballs:
- path/to/pkg.tar.gz
pkgr and packrat and renv and pak
How pkgr compares with pak is covered in a separate document.
pkgr is not a replacement for packrat/renv; it is complementary to them.
packrat and renv are tools to capture the state
of your R environment and isolate it from outside modification.
Where packrat often falls short, however, is in the restoration of said environment.
Running packrat::restore() restores packages in an iterative fashion, which is a
time-consuming process that doesn't always play nice with packages hosted outside
of CRAN (such as packages hosted on GitHub). Additionally, since renv uses install.packages
under the hood, each call to install.packages is still treated as an isolated procedure
rather than as a part of a holistic effort. This means that the installation process does
not stop and inform the user when a package fails to install properly. In this situation,
packrat/renv continues to install what packages it can without regard for how this might
affect the package ecosystem when those individual installation failures are later resolved.
pkgr solves these issues by:
- Installing packages quickly in a parallelized graph (determined by the dependency tree)
- Allowing users to control things like what repo a given package is retrieved from and what Makevars it is built with
- Showing users a holistic view of their R environment (pkgr inspect --deps --tree) and how that environment would be changed on another install (pkgr plan)
- Providing timely error messages and halting the installation process immediately when something goes wrong during the installation process (such as a package not being available, a repository being unreachable, etc.)
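The idea of installing in parallel along the dependency graph can be illustrated with Python's standard graphlib module (Python 3.9+). This is a sketch of the general technique, not pkgr's scheduler: the package names are made up, and install_waves is a hypothetical helper that only computes which packages could be installed concurrently.

```python
from graphlib import TopologicalSorter


def install_waves(deps):
    """Group packages into waves that could be installed concurrently.

    deps maps each package to the set of packages it depends on. Every
    package in a wave has all of its dependencies in earlier waves, so the
    members of one wave are independent of each other.
    """
    sorter = TopologicalSorter(deps)
    sorter.prepare()  # raises graphlib.CycleError on circular dependencies
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # everything currently unblocked
        waves.append(ready)
        sorter.done(*ready)  # mark as installed, unblocking dependents
    return waves


if __name__ == "__main__":
    # Hypothetical dependency graph for illustration.
    deps = {
        "app": {"web", "util"},
        "web": {"util"},
        "util": set(),
        "cli": set(),
    }
    for number, wave in enumerate(install_waves(deps), start=1):
        print(number, wave)
```

A real installer would hand each wave (or each package as soon as it becomes ready) to a pool of workers, and would stop scheduling further work as soon as one installation fails.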