CRAN Task View proposal: CompositionalData
There was a little hiccup at the start with a task view on compositional data analysis. It seemed that the initial authors were having trouble getting on board with the new authors, did not agree with comments, or simply did not have time to proceed, which is totally understandable. After about a year, it was decided to write this view with new authors.
Following a review of the initial proposal, we have removed a significant number of packages that were not aligned with the requirements of compositional data analysis. We have also created a new task view from scratch: it incorporates new packages and sections, restructures the task view itself, and does not reuse text from the old proposal (which was mostly copied and pasted from package descriptions).
This is the new proposal:
name: CompositionalData
topic: Compositional Data Analysis
maintainer: Karel Hron, Javier Palarea-Albaladejo, Matthias Templ, Alessandra Menafoglio
email: [email protected]
version: 2024-11-18
source: https://github.com/cran-task-views/CoDa/
In general terms, compositional data refers to multivariate, positive and scale invariant data that convey relative information. Although not necessarily, they are often closed or normalised to be expressed in proportions adding up to 1, percentages adding up to 100, or the like; but the scale invariance property implies that the value of any normalisation constant is in fact irrelevant. That is, compositional methods are applicable whenever the researcher recognises that the relevant information in the data is relative and there is an intrinsic interdependence between the parts configuring the composition. These particular characteristics are missed by ordinary statistical methods generally designed for unconstrained real-valued data.
Awareness of the issues with compositional data dates back to the end of the 19th century, when the renowned statistician Karl Pearson identified the problem of spurious correlation between variables representing ratios with respect to a common denominator, caused by the scaling of the data. When closed, compositional data formally live on a simplex sample space, and this can be a convenient representation in a practical setting. The simplex is a constrained space with its own internal operations and geometry. However, any coherent approach to analysing compositional data should depend neither on the particular representation chosen nor on any preliminary normalisation.
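The spurious correlation effect can be illustrated with a small simulation. This is only a sketch in base R under assumed settings (uniform variables, an arbitrary seed and sample size), not an example taken from any of the packages listed here: three mutually independent positive variables are generated, and two of them are then divided by the third.

```r
# Three mutually independent positive variables (simulated, hypothetical data)
set.seed(1)
n <- 10000
x <- runif(n, 1, 2)
y <- runif(n, 1, 2)
z <- runif(n, 1, 2)

# The raw variables are essentially uncorrelated ...
cor_raw <- cor(x, y)

# ... but dividing both by the common denominator z induces a clear
# positive correlation between the two ratios
cor_ratio <- cor(x / z, y / z)

print(c(raw = cor_raw, ratio = cor_ratio))
```

The correlation between the ratios arises purely from the shared denominator, which is the kind of artefact that log-ratio methods are designed to avoid.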
The mainstream approach to analysing compositional data, as originally formulated by [Aitchison (1982)](https://doi.org/10.1111/j.2517-6161.1982.tb01195.x), involves the use of log-ratio transformations, or log-ratio coordinates using a more modern terminology, that project the data onto the real space. Nowadays, the literature offers a wide range of methods and tools to tackle the analysis of compositional data within this methodological framework, many of which are implemented in R packages.
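As a minimal illustration of a log-ratio representation and of scale invariance, the following base R sketch computes centered log-ratio (clr) coordinates by hand. The helper names `closure` and `clr` and the example values are assumptions made for this illustration; the packages below provide their own, more complete implementations.

```r
# Closure: rescale a positive vector so that its parts add up to a constant
closure <- function(x, total = 1) total * x / sum(x)

# Centered log-ratio (clr) coordinates:
# log of each part relative to the geometric mean of all parts
clr <- function(x) log(x) - mean(log(x))

x <- c(sand = 12, silt = 30, clay = 58)  # hypothetical composition in percent

clr_raw    <- clr(x)
clr_closed <- clr(closure(x))            # proportions adding up to 1
clr_scaled <- clr(1000 * x)              # arbitrary change of units

# All three representations yield the same coordinates (scale invariance)
print(all.equal(clr_raw, clr_closed))
print(all.equal(clr_raw, clr_scaled))
```

Because the clr coordinates only depend on ratios between parts, the normalisation constant drops out, which is the property described above.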
Compositional data are common in diverse scientific areas, including the chemical, biological and environmental sciences, where they typically represent portions of a total sample weight or volume and are expressed in units such as percentages, parts per million, mg/l, mmol/mol or similar. Examples include chemical compositions of soil, water or air, food compositions, behavioural time-use profiles, and species relative abundances. They are also common in social sciences such as economics, for example as market shares, investment portfolios, or household budgets.
In recent years, the popularity of compositional methods has increased notably. This brings new methodological challenges and requires new ways to transfer and formulate compositional knowledge to meet the needs of diverse scientific fields.
Thus, this task view provides a curated collection of R packages aimed at supporting compositional data analysis, with the purpose of serving as a guide for practitioners interested in applying such methods. The packages can be broadly categorised into the following topics, although many offer functionalities that span multiple categories:
General purpose packages
This refers to packages that provide a general platform for compositional data analysis in R, implementing functions to conduct basic operations and calculations, log-ratio representations, data visualisation and some common statistical analyses. Typically accompanying a published monograph, they offer a platform compatible with the basic properties of compositional data to those approaching the methodology from diverse domains.
- r pkg("compositions", priority = "core"): once the adequate class for the data is set, which corresponds to the assumed underlying geometry (either compositional, acomp class, or multivariate positive data, aplus class), the package provides functions for their consistent analysis and modelling; including descriptive statistics, visualisation, statistical testing, and multivariate analysis (e.g. principal component analysis, clustering, MANOVA and regression). It supports the monograph van den Boogaart and Tolosana-Delgado (2013) and allows reproducing the analyses therein.
- r pkg("robCompositions", priority = "core"): with a main focus on robust statistical methods, this package includes a wide range of tools for the manipulation and analysis of compositional data within the log-ratio framework: transformations, handling of irregular data, robust implementations of multivariate methods such as principal component analysis, factor analysis and discriminant analysis, robust regression with compositional predictors, two-factorial compositions (compositional tables) and functional-compositional analysis of density data. The main reference monograph is Filzmoser, Hron and Templ (2018).
- r pkg("easyCODA", priority = "core"): univariate and multivariate methods for compositional data analysis following the approach in Greenacre (2018), emphasising the use of simple pairwise log-ratios. Particular features include procedures for selecting log-ratios that explain maximum log-ratio variance and conducting various multivariate analyses.
- r pkg("Compositional"): regression, classification, contour plots, hypothesis testing and fitting of distributions for compositional data are some of the functions included in this package. Further, some functions for percentages (or proportions) are included. The package deals, however, with compositional data mostly as with constrained data, thus avoiding the scale invariance property and, more generally, the principles of compositional data analysis characterizing the log-ratio approach (despite referring to the classical reference Aitchison (1986)). This admits direct incorporation of zero components for the analysis, but leads to some conceptual inconsistencies with respect to the log-ratio approach. For these reasons, analyses run using this package may be incoherent with those run with the packages r pkg("compositions", priority = "core") or r pkg("robCompositions", priority = "core").
Compositional tables
Compositional tables (i.e. ordinary contingency tables in their discrete version) represent frequencies or proportions structured across multiple categories. The compositional nature of these tables, often constrained by row or column sums, requires specialised methods to analyse relationships, dependencies, and patterns while respecting their relative nature. This section refers to packages implementing tools for their analysis, including log-ratio representation and selected multivariate methods.
r pkg("robCompositions", priority = "core"): log-ratio coordinate representation of compositional tables and methods for their statistical processing using principal component analysis and regression analysis with a real response and the compositional table as predictor.
Density data analysis
Probability density functions are essentially scale invariant data objects which are commonly subject to a unit integral constraint. They can, therefore, be considered as infinite dimensional compositional data and embedded into a Hilbert space, a so-called Bayes space. This section refers to packages implementing methods and tools for density data analysis from this perspective.
Note that these methods differ from those included in the CRAN Task View: Functional Data Analysis, in that they assume the Bayes space as sample space for density functions.
r pkg("robCompositions", priority = "core"): some methods for representation of probability density functions using compositional smoothing splines, grounded on the theory of Bayes spaces. Additionally, the package includes a functional version of the centered log-ratio transformation.
Irregular data: zeros, censoring, missing and outliers
As with ordinary data sets, compositional data sets are often affected by different issues in real-world applications, which might prevent the immediate use of statistical methods and should be treated in a data preprocessing stage. Of particular relevance for the application of the log-ratio methodology is the presence of zeros, which require careful handling to preserve the basic properties of the data and the integrity of the subsequent analyses.
Traditionally, three types of zeros have been distinguished in the compositional data literature: rounded zeros, count zeros and essential zeros. Briefly, rounded zeros appear in continuous-valued compositions, and are commonly associated with small values that have been rounded off or have fallen below the detection limit of the measuring device; count zeros refer to zeros occurring in discrete compositions derived from a counting process, and are generally associated with limited sampling; and, finally, essential zeros refer to genuine zero values, i.e. the case of parts of the composition that are truly absent. Rounded zeros are the case that has received the most attention in the literature and represent a class of left-censored data.
Moreover, the presence of both missing values and outliers poses practical challenges, and a proper handling of them is required to ensure consistency, validity and robustness of a compositional data analysis.
This section focuses on specialised packages including methods to address the above issues while respecting the compositional nature of the data.
- r pkg("zCompositions", priority = "core"): integrated suite of data imputation methods applicable to zeros, nondetects, missing data, and combinations of them, following the principles of the log-ratio approach as described in Palarea-Albaladejo and Martín-Fernández (2015). This includes a consistent treatment of closed and non-closed compositions, unique or varying detection limits, parametric and nonparametric imputation, single and multiple imputation, maximum likelihood and robust estimation, as well as some tools for the exploration of zero patterns and statistical testing of grouping structure.
- r pkg("mvoutlier"): specific tools for visualising and identifying multivariate outliers in compositional data.
- r pkg("robCompositions", priority = "core"): routines included to impute rounded zeros and missing data, as well as tools to detect outlying samples.
- r pkg("compositions", priority = "core"): routines included to detect, represent, and provide some analysis of irregular data, either missing values, zeros or outliers.
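To make the idea of treating rounded zeros concrete, here is a minimal base R sketch of a simple multiplicative replacement. It is an illustration only and not the interface of any package listed above: the function name `mult_replace`, the detection limits, and the choice of imputing 65% of the detection limit are all assumptions made for this example.

```r
# Simple multiplicative replacement of rounded zeros (illustrative sketch).
# x:    a single composition (non-negative vector) possibly containing zeros
# dl:   detection limit for each part (hypothetical values)
# frac: fraction of the detection limit used as imputed value (assumed 0.65)
mult_replace <- function(x, dl, frac = 0.65) {
  total <- sum(x)
  zero  <- x == 0
  delta <- frac * dl
  out <- x
  # Zeros are replaced by a small value below the detection limit
  out[zero] <- delta[zero]
  # Observed parts are shrunk multiplicatively so that the total is preserved
  out[!zero] <- x[!zero] * (1 - sum(delta[zero]) / total)
  out
}

x  <- c(0.60, 0.25, 0.15, 0)  # hypothetical composition closed to 1
dl <- rep(0.01, 4)            # hypothetical common detection limit
r  <- mult_replace(x, dl)

print(r)
print(sum(r))                          # closure constant is preserved
print(c(r[1] / r[2], x[1] / x[2]))     # ratios between observed parts unchanged
```

The multiplicative adjustment keeps the ratios between the observed parts intact, which is why this kind of replacement is compatible with subsequent log-ratio analysis; the dedicated packages offer model-based alternatives.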
Visualisation
Visualisation is a crucial component of compositional data analysis, allowing researchers to explore patterns, relationships, and distributions within the constrained simplex geometry. This section includes tools for creating ternary diagrams, biplots or pairwise log-ratio plots, among others. In addition to the visualisation functionality included in the general purpose packages above, the packages listed below are particularly devoted to this purpose.
- r pkg("ggtern"): extends ggplot2 to create ternary diagrams, supporting standard and additional custom geometries, including specialised ternary visualisations.
- r pkg("Ternary"): ternary diagrams and Holdridge life zone plots using base graphics. Features include custom annotation, interpolation, contouring, scaling, and a Shiny interface for interactive plotting.
- r pkg("isopleuros"): tools for visualising data in ternary space, customising graphical elements, and displaying statistical summaries. Includes specialised diagrams for fields like archaeology (e.g., soil texture charts, ceramic phase diagrams).
- r pkg("provenance"): tools for plotting compositional and count data on ternary diagrams and point-counting data on radial plots. Calculations of sample size required for specified levels of statistical precision, and to assess the effects of hydraulic sorting on detrital compositions. Intuitive query-based user interface for users who are not proficient in R.
Regression modelling
Specialised regression modelling with compositional data allows researchers to explore associations between compositions and other variables, either acting as predictors/covariates or as response, and also between compositions on both sides of the regression model. Packages specifically devoted to compositional regression analysis are listed in the following. It might be mentioned that r pkg("complmrob") and r pkg("robregcc") offer nothing essential beyond, e.g., r pkg("robCompositions").
- r pkg("complmrob"): robust linear regression models for compositional data, where the response variable is a real-valued vector and the covariates are compositional data. See also Hron, Filzmoser and Thompson (2012).
- r pkg("robregcc"): algorithm estimating the parameters of the robust regression model with compositional covariates. The model simultaneously treats outliers and parameter estimates as described in Mishra and Mueller (2019).
- r pkg("codaredistlm"): linear regression models with compositional predictors, providing predictions and confidence intervals for outcome changes based on reallocations of compositional values, see Dumuid et al. (2017a) and Dumuid et al. (2017b).
- r pkg("DirichletReg"): functions to analyse compositional data using Dirichlet regression models.
- r pkg("multilevelcoda"): Bayesian multilevel modelling with compositional data, both as predictors and outcomes, and post hoc isotemporal substitution analysis.
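As a rough illustration of regression with a real response and compositional covariates, the following base R sketch expresses the composition in pivot (isometric log-ratio) coordinates and fits an ordinary linear model. The helper `pivot_coord`, the simulated data and the model are assumptions made for this example and do not reproduce the interface or the estimators of the packages above.

```r
# Pivot (isometric log-ratio) coordinates of compositions stored row-wise.
# Coordinate i contrasts part i with the geometric mean of the parts after it.
pivot_coord <- function(X) {
  X <- as.matrix(X)
  D <- ncol(X)
  L <- log(X)
  Z <- sapply(seq_len(D - 1), function(i) {
    rest <- L[, (i + 1):D, drop = FALSE]
    sqrt((D - i) / (D - i + 1)) * (L[, i] - rowMeans(rest))
  })
  colnames(Z) <- paste0("z", seq_len(D - 1))
  Z
}

set.seed(42)
n <- 200
X <- matrix(rexp(n * 3), ncol = 3)  # simulated positive parts (hypothetical)
X <- X / rowSums(X)                 # closed to proportions
Z <- pivot_coord(X)

# Simulated real response depending on the first pivot coordinate only
y <- 1 + 2 * Z[, "z1"] + rnorm(n, sd = 0.1)

# Ordinary least squares on the log-ratio coordinates
fit <- lm(y ~ Z)
print(coef(fit))
```

Working in orthonormal coordinates lets standard regression tools be applied directly; robust or Bayesian variants, as offered by the listed packages, replace the least squares step.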
High-dimensional compositional data: with applications to omics data
In recent years, compositional data analysis has had a notable impact on the omics sciences and bioinformatics, where types of data such as microbiome compositions, gene expression or metabolomic profiles have been recognised as inherently compositional. Applications in this area require methods that address unique challenges, including high dimensionality, zero inflation, overdispersion or the integration of phylogenetic information. This section highlights packages providing tools specifically designed for omics data, although most of them could equally be considered for the statistical processing of high-dimensional compositional data in general.
- r pkg("FLORAL"): log-ratio lasso regression for continuous, binary, and survival outcomes with compositional features as described in Fei et al. (2023).
- r pkg("BRACoD.R"): implements Bayesian regression to identify associations between microbiome compositional data and environmental variables. Corrects for compositional distortions by treating total abundance as an unknown variable. Integrates with Python via reticulate.
- r pkg("coda4microbiome"): provides tools for microbiome data analysis while accounting for its compositional nature. Includes penalised regression methods for variable selection in cross-sectional and longitudinal studies with binary or continuous outcomes.
- r pkg("codacore"): identification of sparse log-ratios of a composition acting as predictor in regression problems. Scale-invariant log-ratios are derived and optimised to account for association with the response variable.
- r pkg("lnmCluster"): logistic normal-multinomial clustering for microbiome compositional data, including extensions for factor analysis, bi-clustering, and sparse covariance estimation.
- r pkg("MicrobiomeStat"): robust methods for analysing microbiome compositional data, addressing zero-inflation, phylogenetic structure, and compositional effects. Applicable to other high-dimensional compositional datasets from sequencing experiments.
- r pkg("QFASA"): implements quantitative fatty acid signature analysis to estimate predator diets, leveraging fatty acid diversity, biosynthesis limitations, and digestion properties in monogastric animals. Methods for both compositional and constrained data are used.
Special applications in geostatistics and geochemistry
Compositional data analysis is integral to geostatistics and geochemistry, areas where the methodology found its early successful applications. Data sets here often represent proportions of elements, minerals, or isotopes, and are subject to spatial dependencies. These applications require methods that respect the relative nature of compositions while addressing spatial structures and relationships. This section features packages which implement methods for specialised areas such as geostatistical modelling, spatial interpolation, variogram analysis, and compositional kriging; as well as techniques for analysing geochemical compositions in the context of spatial data. In any case, some methods would be equally applicable to any data sets sharing analogous structures in any other application fields.
- r pkg("provenance"): statistical tools for sedimentary provenance analysis, including kernel density estimation, principal component analysis, correspondence analysis, and multidimensional scaling. Comparison of univariate proxies (e.g., single-grain ages, isotopic compositions) and categorical data is supported using distances like Kolmogorov-Smirnov, Wasserstein, Aitchison, and Bray-Curtis. Tools for visualising data on ternary and radial plots, calculating sample sizes, and assessing hydraulic sorting effects are included. Additionally, a user-friendly interface for non-R users is provided.
- r pkg("ArArRedux"): data reduction and error propagation for Ar$^{40}$/Ar$^{39}$ geochronology, processing isotopic compositions from noble gas mass spectrometer data. Methods for regression to $t=0$, blank and decay corrections, detector intercalibration, interference corrections, and age calculation. Argon isotope ratios are treated as compositional data for accurate statistical handling.
- r pkg("gmGeostats"): geostatistical tools for multivariate data with restrictions, including compositions and positive amounts. Descriptive analysis and modelling using two-point Gaussian and multipoint perspectives. Compositional variograms and compositional kriging.
- r pkg("compositions"): includes a compositional variogram and kriging methods.
Other packages
- r pkg("aIc"): statistical tests for identifying compositional pathologies in datasets, including coherence of correlations, dominance of distance, perturbation invariance, and singularity of the covariation matrix. Supports multiple data transformations such as proportional, centred log-ratio (clr), and others from common R packages.
- r pkg("CMMs"): compositional mediation models for continuous and binary outcomes, handling mediators represented as compositional data.
- r pkg("ccmm"): compositional mediation models to estimate direct and indirect effects of a treatment on an outcome, designed for the case in which mediators are high-dimensional microbiome data.
- r pkg("coda.base"): optimal implementation of friendly functions to compute log-ratio coordinate representations of various types, including principal component and principal balance coordinates and coordinates from tailored orthonormal bases; as well as some basic compositional statistics.
- r pkg("countprop"): model-based metrics of proportionality using the logit-normal multinomial model. It can also provide empirical and plugin estimates of these metrics.
- r pkg("FlexDir"): tools for using the flexible Dirichlet distribution, including maximum likelihood estimation via EM algorithm, variance-covariance matrix estimation, random data generation, and visualisation. Supports applications to compositional data.
- r pkg("ToolsForCoDa"): selected multivariate analysis tools for compositional data, including compositional canonical correlation analysis, log-ratio principal component analysis with condition number computations, and log-ratio discriminant analysis.
Links
- CRAN Task View: Functional Data Analysis
- CRAN Task View: Bayesian
- CRAN Task View: Cluster
- CRAN Task View: MachineLearning
- CRAN Task View: Robust
- CRAN Task View: SpatioTemporal
References
Matthias @matthias-da, thanks for following up on this, very much appreciated. I've had a quick look and think that this is in good-enough shape to endorse you to move ahead. If the others agree, then you could set up a GitHub project (or I can do that for you) and start polishing the proposal a bit more.
Some first quick feedback:
- The intro is useful but rather long. Maybe have a shorter intro first and then a "Background" section later?
- The inclusion/exclusion criteria could be sharpened a bit more.
- Including a table of contents at the beginning might be useful to know what the "multiple categories" are.
- Please use AE (rather than BE) spelling.
- Include the task view cross references with r view("...") in the text and not as hyperlinks (and not just mostly at the end).
This looks very useful! Agreed with comments from @zeileis. Additionally, shorter descriptions of each package will make this task view easier to maintain over time.
Thank you for the proposal! In addition to previous comments, I am under the impression that line breaks are sometimes a bit weird and non homogeneous within the task view. Manual line breaks should probably be removed except for the end of paragraphs maybe.
Dear Achim, Julia and Nathalie
Thanks a lot for your valuable feedback.
We hope that we have incorporated all suggested changes.
We have put the text on GitHub: https://github.com/matthias-da/coda-ctv-draft/blob/main/CompositionalData.md
Achim: Please give us feedback about the further actions. We are happy to further fine-tune the text if needed.
Kind regards Matthias
Dear @matthias-da : Thank you for your work (and sorry for the late reply). I should be able to check this before the end of the week.
Hello @matthias-da I finally read it and found three minor issues:
- There's a problem with the link related to the first citation of Aitchison:
[Aitchison (1982)] (https://doi.org/10.1111/j.2517-6161.1982.tb01195.x)
instead of
[Aitchison (1982)](https://doi.org/10.1111/j.2517-6161.1982.tb01195.x)
(remove the space in the middle)
- In the description of ArArRedux, you have a $\LaTeX$ equation improperly displayed (I think that you just need to add the proper $).
- Achim has suggested the addition of a table of contents. I think that it would be good to have it at the end of the Background section.
Thanks for the update. Regarding the points that I raised above there was some progress but some still need further work:
- The intro is now very short but immediately followed by a long background section. For those who want to get a quick overview of R packages for compositional data this is too long IMO. I would rather have a short introduction text (similar to most other task views), then a table of contents, and then the section on general-purpose packages. The background explanations could follow later, maybe even at the end?
- Some of the task views are now included in the main text, others are just in the links section at the end. The latter can be excluded as such links will be auto-generated when rendering the HTML. Thus, mention all r view(...) that you want included somewhere in the text and shortly explain what the interested readers can find there. Example: "See r view("Robust") for an overview of other robust statistical methods available in R." (or something along those lines)
Dear Achim, Julia and Nathalie
Thanks a lot for your constructive feedback, which improved the CTV. We did our homework and incorporated all your comments.
Here is the latest version (see also below): https://github.com/matthias-da/coda-ctv-draft/blob/main/CompositionalData.md
Can you please go ahead with it and take all necessary steps to go online? Otherwise, we are happy to consider additional comments from your side. Thanks a lot for all your help, support, and work.
Best wishes Matthias
name: CompositionalData
topic: Compositional Data Analysis
maintainer: Karel Hron, Javier Palarea-Albaladejo, Matthias Templ, Alessandra Menafoglio
email: [email protected]
version: 2025-02-28
source: https://github.com/cran-task-views/CompositionalData/
In general, compositional data refers to multivariate, positive and scale-invariant data that convey relative information. Although not necessarily, they are often closed or normalized to be expressed in proportions adding up to 1, percentages adding to 100, or the like; but the scale-invariance property implies that the normalization constant used is actually irrelevant. That is, compositional methods are applicable whenever the researcher recognizes that the relevant information in the data is relative and, thus, there is an intrinsic interdependence between the parts that make up the composition. These particularities are not considered by ordinary statistical methods, which are generally designed for unconstrained real-valued data.
This task view provides a curated collection of R packages to support compositional data analysis within the log-ratio coordinate framework. The main goal is to serve as a guide to practitioners interested in applying such methods. The packages can be broadly categorized into the following topics, although many provide functionalities spanning multiple categories, as detailed below.
Table of Contents
- General purpose packages
- Irregular data: zeros, censoring, missing and outliers
- Visualization
- Compositional tables
- Density data analysis
- Regression modelling
- High-dimensional compositional data with applications to omics data
- Special applications in geostatistics and geochemistry
- Other packages
- Background
General purpose packages
This section refers to packages that provide a general platform for compositional data analysis in R, implementing functions to conduct basic operations and calculations, log-ratio representations, data visualization and some common statistical analyses. They typically accompany a published monograph and provide an environment for analysis compatible with the basic properties of compositional data for those approaching the methodology from diverse domains.
- r pkg("compositions", priority = "core"): once a class for the data is set, which corresponds to the assumed underlying geometry (either compositional, acomp class, or multivariate positive data, aplus class), the package operates through functions for their consistent analysis and modelling; including descriptive statistics, visualization, statistical testing, and multivariate analysis (e.g. principal component analysis, clustering, MANOVA and regression). It also implements some geostatistical tools such as a variogram for compositions and compositional ordinary kriging. This package is linked to the monograph van den Boogaart and Tolosana-Delgado (2013) and supports the analyses and examples therein.
- r pkg("robCompositions", priority = "core"): has a main focus on robust statistical methods (see r view("Robust") for an overview of other robust statistical methods available in R). The package includes a wide range of tools for the manipulation and analysis of compositional data within the log-ratio framework. This includes ordinary transformations, handling of irregular data, and robust versions of methods such as principal component analysis, factor analysis, discriminant analysis and regression with compositional predictors. Additionally, it implements specialized methods for two-factorial compositions (a.k.a. compositional tables) and functional-compositional analysis of density data. The main reference monograph is Filzmoser, Hron and Templ (2018).
- r pkg("easyCODA", priority = "core"): provides methods and graphics for univariate and multivariate analysis of compositional data in the spirit of Greenacre (2018), emphasizing the use of basic pairwise log-ratios. Particular features include procedures for stepwise selection of log-ratios, correspondence analysis and redundancy analysis.
Irregular data: zeros, censoring, missing and outliers
As with ordinary data sets, compositional data are often affected by issues which may prevent the straightforward application of statistical methods in real-world applications.
Of particular relevance for the log-ratio approach is the treatment of zeros, which requires careful handling to avoid distorting basic properties of the data and to preserve the consistency of subsequent analyses. Traditionally, three types of zeros have been distinguished: rounded zeros, count zeros and essential zeros. Briefly, rounded zeros occur in continuous-valued compositions, and are generally associated with small values that have been rounded or have fallen below the detection limit of the measuring device; count zeros refer to zeros that occur in discrete compositions derived from a counting process, generally associated with limited sampling; and, finally, essential zeros refer to truly zero values, i.e. parts of the composition that are genuinely absent. Rounded zeros, which essentially correspond to a class of left-censored data, have received the most attention in the literature.
Moreover, analogously to ordinary statistical methods, the presence of either missing values or outliers poses practical challenges. Again, coherent handling is required for consistent data analysis within the compositional framework.
The following are specialized packages focused on addressing these issues while respecting the compositional nature of the data.
- r pkg("zCompositions", priority = "core"): a suite of data imputation methods applicable to zeros, nondetects, missing data, and combinations of them, following the principles of the log-ratio approach (Palarea-Albaladejo and Martín-Fernández (2015)). This includes a consistent treatment of closed and non-closed compositions, unique or varying detection limits, parametric and nonparametric imputation, single and multiple imputation, maximum likelihood and robust estimation, as well as some tools for the exploration of zero patterns and statistical testing of grouping structures.
- r pkg("mvoutlier"): specific tools for visualizing and identifying multivariate outliers in compositional data.
Visualization
Visualization is a critical component of compositional data analysis, allowing researchers to explore patterns, relationships, and distributions within the constrained simplicial geometry.
On top of the functionality provided with the general purpose packages cited above, this section compiles specialized tools for producing ternary plots, compositional biplots, or pairwise log-ratio plots, among others.
- r pkg("ggtern"): enables plotting and managing ternary diagrams following the style and syntax of the graphical package ggplot2; supporting both standard and additional geometries with a high level of customization.
- r pkg("Ternary"): produces ternary diagrams and Holdridge life zone plots using base graphics. Particular features include custom annotation, interpolation, contouring, scaling, and a Shiny interface for interactive plotting.
- r pkg("isopleuros"): visualization of data in ternary space, customizing graphical elements, and displaying statistical summaries. Includes specialized diagrams in archaeology, e.g. soil texture charts and ceramic phase diagrams.
- r pkg("provenance"): representation of compositional and count data on ternary diagrams and point-counting data on radial plots. Allows computing the sample size required for specified levels of statistical precision, and assessing the effects of hydraulic sorting on detrital compositions. Provides an intuitive query-based user interface for users who are not proficient in R.
Compositional tables
Compositional tables (i.e., ordinary contingency tables in their discrete version) represent frequencies or proportions structured across multiple categories. The compositional nature of these tables, often constrained by row or column sums, requires specialized methods to analyze relationships, dependencies, and patterns while respecting their relative nature. This section refers to tools for their analysis, including log-ratio representation and selected multivariate methods.
r pkg("robCompositions", priority = "core"): log-ratio coordinate representation of compositional tables and methods for their statistical processing using principal component analysis and regression analysis with a real response and the compositional table as predictor. See Filzmoser, Hron and Templ (2018) (Chapter 12) and Nesrstová et al. (2023) for more details.
Density data analysis
Probability density functions are essentially scale invariant data objects, usually subject to a unit integral constraint. Therefore, they can be considered as infinite dimensional compositional data and embedded in a Hilbert space, the so-called Bayes space (see van den Boogaart, Egozcue and Pawlowsky-Glahn (2014) for details).
The packages listed below implement methods for density data analysis from this perspective.
Unlike the methods in the r view("FunctionalData") task view, here it is assumed that the sample space for density functions is the Bayes space.
r pkg("robCompositions", priority = "core"): methods for representation of probability density functions using compositional smoothing splines, grounded on the theory of Bayes spaces. Additionally, a functional version of the centered log-ratio transformation is included.
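As a rough sketch of what a functional centered log-ratio transformation does, the base R code below discretises a density on a grid and subtracts the average of its logarithm over the interval. This is an illustration under simplifying assumptions (trapezoidal integration, a strictly positive density on the grid) and not the robCompositions implementation; the helper names `trapz` and `clr_density` are made up for this example.

```r
# Trapezoidal rule on a grid (helper for this sketch)
trapz <- function(t, y) sum(diff(t) * (head(y, -1) + tail(y, -1)) / 2)

# Discretised centred log-ratio transformation of a density on [min(t), max(t)]
clr_density <- function(t, f) {
  log_f <- log(f)
  log_f - trapz(t, log_f) / (max(t) - min(t))
}

t <- seq(0.01, 0.99, length.out = 200)
f <- dbeta(t, 2, 5)                     # example density, strictly positive on the grid
g <- clr_density(t, f)

trapz(t, g)                             # numerically zero: the clr image integrates to zero
all.equal(g, clr_density(t, 3 * f))     # rescaling the density leaves the result unchanged
```

The last line reflects the scale invariance mentioned above: multiplying the density by a constant only shifts its logarithm, and the shift is removed by the centring step.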
Regression modelling
Regression modelling with compositional data allows researchers to explore associations between compositions and other variables, either as predictors/covariates or response; and also between compositions on both sides of the regression model. Packages specifically devoted to compositional regression analysis are listed below. It should be noted that r pkg("complmrob") and r pkg("robregcc") do not offer anything essential beyond, for example, r pkg("robCompositions").
- r pkg("complmrob"): robust linear regression models for compositional data, where the response variable is a real-valued vector and the covariates are compositional data. See also Hron, Filzmoser and Thompson (2012).
- r pkg("robregcc"): algorithm estimating the parameters of the robust regression model with compositional covariates. The model simultaneously treats outliers and parameter estimates as described in Mishra and Mueller (2019).
- r pkg("codaredistlm"): linear regression models with compositional predictors, providing predictions and confidence intervals for outcome changes based on reallocations of compositional values, see Dumuid et al. (2017a) and Dumuid et al. (2017b).
- r pkg("DirichletReg"): functions to analyse compositional data using Dirichlet regression models.
- r pkg("multilevelcoda"): Bayesian multilevel modelling with compositional data, both as predictors and outcomes, and post hoc isotemporal substitution analysis.
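For readers unfamiliar with regression on compositional covariates, the following base R sketch shows the basic mechanics: the composition is first expressed in orthonormal log-ratio (pivot) coordinates, which are then used as ordinary regressors in `lm()`. The data are simulated and the helper `pivot_coords` is a hypothetical name; this is not the method of any particular package listed above, which add robustness, reallocation predictions, or multilevel structure on top of this idea.

```r
set.seed(1)
n <- 50
raw <- matrix(rexp(n * 3), ncol = 3)    # simulated positive amounts for three parts

# Pivot (orthonormal log-ratio) coordinates of a three-part composition
pivot_coords <- function(x) {
  cbind(z1 = sqrt(2 / 3) * log(x[, 1] / sqrt(x[, 2] * x[, 3])),
        z2 = sqrt(1 / 2) * log(x[, 2] / x[, 3]))
}

z <- pivot_coords(raw)
y <- 1 + 0.8 * z[, "z1"] + rnorm(n, sd = 0.1)   # simulated real-valued response

# Ordinary least squares on the log-ratio coordinates
fit <- lm(y ~ z1 + z2, data = data.frame(y = y, z))
coef(fit)
```

In this coordinate system, `z1` carries all the relative information about the first part with respect to the other two, so its coefficient can be read as the effect of the relative dominance of that part.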
High-dimensional compositional data with applications to the omics sciences
In recent years, compositional data analysis has had a notable impact on the omics sciences and bioinformatics, where data types such as microbiome compositions, gene expression, or metabolomic profiles have been recognized as inherently compositional. Applications in this area require methods that address unique challenges such as high dimensionality, zero inflation, overdispersion or the integration of phylogenetic information.
This section highlights packages that provide compositional tools designed for omics data, but certainly most of them could also be considered for the statistical processing of high-dimensional compositional data in general.
- r pkg("FLORAL"): log-ratio lasso regression for continuous, binary, and survival outcomes with compositional features as described in Fei et al. (2023).
- r pkg("coda4microbiome"): tools for microbiome data analysis while accounting for its compositional nature. Includes penalized regression methods for variable selection in cross-sectional and longitudinal studies with binary or continuous outcomes.
- r pkg("codacore"): identification of sparse log-ratios of a composition acting as predictor in regression problems. Scale invariant log-ratios are derived which are optimized to account for association with the response variable.
- r pkg("lnmCluster"): logistic normal-multinomial clustering for microbiome compositional data, including extensions for factor analysis, bi-clustering, and sparse covariance estimation.
- r pkg("MicrobiomeStat"): robust methods for analysing microbiome compositional data, addressing zero inflation, phylogenetic structure, and compositional effects. Applicable to other high-dimensional compositional data sets from sequencing experiments.
- r pkg("QFASA"): quantitative fatty acid signature analysis to estimate predator diets, leveraging fatty acid diversity, biosynthesis limitations, and digestion properties in monogastric animals. Methods for both compositional and constrained data are used.
Special applications in geostatistics and geochemistry
Compositional data analysis is an integral part of geostatistics and geochemistry, areas where the methodology found its first successful applications. Data sets here often represent proportions of elements, minerals, or isotopes, and are subject to spatial dependencies (see r view("SpatioTemporal") for a task view focused on spatiotemporal methods). In the case of compositional data, these applications require methods that respect their relative nature while taking into account any spatial structures and relationships.
Thus, this section refers to packages for geostatistical modelling, spatial interpolation, variogram analysis, and compositional kriging; as well as techniques for analyzing spatial geochemical compositions. Note that some methods would be equally applicable to any data set with analogous structures in any other application area.
- r pkg("provenance"): statistical tools for sedimentary provenance analysis, including kernel density estimation, principal component analysis, correspondence analysis, and multidimensional scaling. Comparison of univariate proxies (e.g., single-grain ages, isotopic compositions) and categorical data is supported using distances like Kolmogorov-Smirnov, Wasserstein, Aitchison, and Bray-Curtis. Tools for visualizing data on ternary and radial plots, calculating sample sizes, and assessing hydraulic sorting effects are included. Additionally, a user-friendly interface is provided for R beginners.
- r pkg("ArArRedux"): data reduction and error propagation for $^{40}$Ar/$^{39}$Ar geochronology, processing isotopic compositions from noble gas mass spectrometer data. Includes methods for regression, blank and decay corrections, detector intercalibration, interference corrections, and age calculation. Argon isotope ratios are treated as compositional data for accurate statistical handling.
- r pkg("gmGeostats"): tools for multivariate data with restrictions, including compositions and positive amounts. Descriptive analysis and modelling using two-point Gaussian and multipoint perspectives. Compositional variograms and compositional kriging.
Other packages
This collection is meant to include other useful small packages, typically having a fairly specific purpose. In accordance with the log-ratio framework considered here, the condition for inclusion is that the scale invariance property of compositional data is, at least partially, respected.
- r pkg("coda.base"): provides optimized and user-friendly implementations of functions to compute log-ratio coordinate representations of various types, including principal component and principal balance coordinates, as well as log-ratio coordinates from tailored orthonormal bases. It also allows computing some basic compositional statistics.
- r pkg("aIc"): statistical tests to identify compositional pathologies in data. Namely, coherence of correlations, dominance of distances, perturbation invariance, and singularity of the covariation matrix. Supports multiple data transformations such as proportional, centred log-ratio (clr), and others from common R packages.
- r pkg("SARP.compo"): tools for network-based interpretation of changes in compositional data, including computation of pairwise ratios, statistical testing between conditions, and network representation of results.
- r pkg("ToolsForCoDa"): selected multivariate analysis tools for compositional data, including compositional canonical correlation analysis, log-ratio principal component analysis with condition number computations, and log-ratio discriminant analysis.
Background
Awareness of the problems with compositional data dates back to the end of the 19th century, when the renowned statistician Karl Pearson recognized the problem of spurious correlations between variables scaled with respect to a common denominator. When closed to add up to a constant value, compositional data are formally projected on a simplex sample space, and this is often a convenient representation in a practical setting. The simplex is a constrained space with its own internal operations and geometry. However, any coherent approach to analyzing compositional data should not depend on the chosen representation, nor require any preliminary normalization.
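The spurious correlation effect is easy to reproduce by simulation. The following base R sketch, using made-up independent variables with hypothetical names `a`, `b` and `d`, shows that dividing two unrelated variables by a common denominator is enough to induce a clear correlation between the resulting ratios.

```r
set.seed(42)
n <- 1000
a <- rnorm(n, mean = 10, sd = 1)   # three mutually independent,
b <- rnorm(n, mean = 10, sd = 1)   # practically positive variables
d <- rnorm(n, mean = 10, sd = 1)

cor(a, b)            # close to zero, as expected for independent variables
cor(a / d, b / d)    # noticeably positive, induced only by the shared denominator
```

Nothing in the raw variables links `a` and `b`; the association appears solely because both ratios move together whenever `d` changes.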
The mainstream approach to compositional data analysis, as originally formulated by Aitchison (1982), involves the use of log-ratio transformations (or log-ratio coordinates, to use more modern terminology) which project the data into real space. Nowadays, the compositional literature offers a wide range of methods within this methodological framework, many of which are implemented in R packages.
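As a minimal illustration of a log-ratio transformation, the base R sketch below computes the centred log-ratio (clr) of a single made-up composition. The helper name `clr` is defined here for the example only; the packages in this task view provide their own, more complete implementations.

```r
# Centred log-ratio transformation of one composition
clr <- function(x) log(x) - mean(log(x))

x <- c(12, 30, 58)             # hypothetical three-part composition in percentages
z <- clr(x)

sum(z)                         # zero up to rounding error
all.equal(z, clr(x / 100))     # percentages and proportions give the same result
```

The second check shows why the choice of normalisation constant is irrelevant in this framework: only the ratios between parts enter the transformation.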
Compositional data are common in diverse scientific fields, including the chemical, biological, and environmental sciences, where they typically represent portions of a total sample weight or volume and are expressed in units such as percent, parts per million, mg/l, mmol/mol, or similar. Some examples include chemical compositions of soil, water, or air, food compositions, behavioral or time-use profiles, and relative abundances of species. They are also common in the socio-economic sciences, for example when dealing with market shares, investment portfolios, or household budgets.
In recent years, the popularity of compositional methods has grown significantly. Simultaneously, new methodological challenges have arisen requiring novel ways to transfer and formulate compositional knowledge to meet the needs of different scientific fields.
Dear @matthias-da : Thanks a lot! I checked it and I have nothing else to add. Thanks a lot for your contribution to the TV! If the other editors agree as well, you have my endorsement for the next step (i.e., the publication of the task view).
Hello, my apologies for the delay in responding.
This looks very useful; it is well organized; the packages look highly relevant. I think this will benefit R users!
There is still quite a bit of extra text that does not add much to a CTV. I don't think you need to expend space convincing people these are techniques worthy of usage, give intellectual background or provide historical context. You could publish a manuscript that highlights all these details informing readers of this new resource you have created.
Some sections that could be substantially trimmed or eliminated entirely (this is not an exhaustive list):
Compositional data analysis is an integral part of geostatistics and geochemistry, areas where the methodology found its first successful applications. Data sets here often represent proportions of elements, minerals, or isotopes, and are subject to spatial dependencies (see r view("SpatioTemporal") for a task view focused on spatiotemporal methods). In the case of compositional data, these applications require methods that respect their relative nature while taking into account any spatial structures and relationships.
(also it looks like there are still some issues with manual line breaks where they are not needed, but perhaps this is already corrected in your main repository for this proposal)
In recent years, compositional data analysis has had a notable impact on the omics sciences and bioinformatics, where data types such as microbiome compositions, gene expression, or metabolomic profiles have been recognized as inherently compositional. Applications in this area require methods that address unique challenges such as high dimensionality, zero inflation, overdispersion or the integration of phylogenetic information.
As with ordinary data sets, compositional data are often affected by issues which may prevent from straightforward application of statistical methods in real-world applications. Of particular relevance for the log-ratio approach is the treatment zeros, which requires careful handling to not distorting basic properties of the data and preserve the consistency of subsequent analyses. Traditionally, three types of zeros have been distinguished: rounded zeros, count zeros and essential zeros. Briefly, rounded zeros occur in continuous-valued compositions, and are generally associated with small values that have been rounded or have fallen below the detection limit of the measuring device; count zeros refer to zeros that occur in discrete compositions derived from a counting processes, generally associated with limited sampling; and, finally, essential zeros refer to truly zero values, i.e. parts of the composition that are genuinely absent. Rounded zeros, which essentially correspond to a class of left-censored data, have received the most attention in the literature.
There are also a few issues with grammar and incorrect punctuation, e.g.:
They, typically accompany a published monograph and provide an environment for analysis compatible with the basic properties of compositional data for those approaching the methodology from diverse domains.
The term 'ordinary statistics' is a bit broad and ill defined. As mentioned in your draft, there are already tools for dealing with censored and bound data that are not compositional. Perhaps just call these methods 'non-compositional' unless there is a consensus term?
Thanks, Matthias et al. for the update. I agree with Nathalie that this is close to publication now. Julia raised a couple of points, most of which should be straightforward to check or incorporate. As for the lengthy sections that could be trimmed: I had raised this before and you addressed some of this by moving the background section to the end. So I think that we should move on soon now but maybe you can have one more look and check whether you can deflate some of the sections Julia indicated.
I also just looked at the source code of your Markdown file and noticed that there is lots of commented text in <!-- ... -->. This is problematic if it contains r ... code chunks because those are still evaluated and the corresponding packages registered for listing in the task view. Hence please drop these parts. If you keep any comments, make sure that there are no R code chunks in them.
Thanks for your patience going through these revisions! I hope that the above is not that much work and I look forward to publishing the task view on CRAN afterwards.
Dear Achim, Julia and Nathalie
Thanks for all your valuable input. We considered Julia's comments and have deleted all the commented text. In my opinion, it would be beneficial to retain the background information text, given the various concepts surrounding compositional data methods. We adhere to the log-ratio approach, and the text provides a clear rationale for this.
@zeileis : thank you for publishing this task view and all work related to it.
Matthias, thanks for the quick update. Please transfer the repository with your task view to me in the following steps: Go to
Settings > Collaborators > Public Repository > Manage
and then
Danger Zone > Transfer ownership
to "zeileis".
I will then transfer it to the cran-task-views org, do some finishing touch-ups and release it.
Achim, the ownership is now transferred to you. Thanks for all the follow-up steps.
🎉 Fantastic, thanks, Matthias @matthias-da, Karel @hronkare, Javier @Japal and Alessandra (who I couldn't find on GitHub, yet)!
The task view has now been incorporated into the cran-task-views organization and released on CRAN:
https://CRAN.R-project.org/view=CompositionalData
https://github.com/cran-task-views/CompositionalData/
On GitHub Karel is the administrator of the repository (who can also add new members, e.g., Alessandra if she joins) and Matthias and Javier are maintainers.
I have also just announced the task view on social media:
https://fosstodon.org/@zeileis/114222265087888737
https://bsky.app/profile/zeileis.org/post/3ll6wgi3vu22s
Feel free to promote this further. 😇
This looks great, thank you for all your work and support.
Dear Achim,
many thanks to you for the initiative and support and to the collaborators for the nice output of our joint work!
I'm convinced that it will help the R-users to get oriented in relevant packages for compositional data analysis.
Best wishes,
Karel
Likewise, many thanks for the support, and thanks to the colleagues for the nice collaboration to produce this. It looks great!
Javier
Dear all,
Thanks for your work.
There are some packages missing in the view, see for example
> library(RWsearch)
> crandb_down()
> tvdb_down()
> p <- sort(unique(c(s_crandb("compositional", "model", mode="and", select="T"), s_crandb("compositional", "data", mode="and", select="T"))))
> p2 <- p[!p %in% tvdb_list()$CompositionalData]
> p2
[1] "BRACoD.R" "ccmm" "CMMs" "coda.plot"
[5] "CoDaLoMic" "Compositional" "CompositionalML" "countprop"
[9] "DImodelsVis" "lba" "rrcov3way"
Regards, Christophe
Dear Christophe
Thank you for your helpful overview. We’ve reviewed most of these packages already. That said, the mere presence of the term compositional in a package name or description doesn’t necessarily mean the package implements methods consistent with compositional data analysis principles.
This was also the key issue we encountered with the initial version of the Task View proposed by others: it was based on an automatic keyword search rather than a content-driven assessment of whether the methods align with the core ideas of compositional data analysis.