Friday, September 25, 2015

Awesome R

A curated list of awesome R frameworks, packages and software. Inspired by awesome-machine-learning.

Integrated Development Environment

  • RStudio - A powerful and productive user interface for R. Works great on Windows, Mac, and Linux.
  • JGR - JGR (pronounced ‘Jaguar’) is a Java GUI for R.
  • Emacs + ESS - Emacs Speaks Statistics is an add-on package for Emacs.
  • StatET - An Eclipse-based IDE (integrated development environment) for R.
  • Revolution R Enterprise - Revolution R is offered free to academic users; the commercial edition focuses on big data and large-scale multiprocessor functionality.
  • R Commander - A package that provides a basic graphical user interface.
  • IPython - An interactive Python interpreter that supports execution of R code while capturing both output and figures.
  • Deducer - A menu-driven data analysis GUI with a spreadsheet-like data editor.

Data Manipulation

Packages for cooking data.

  • dplyr - Blazingly fast data frame manipulation and database queries.
  • data.table - Fast data manipulation with a short, flexible syntax.
  • reshape2 - Flexibly rearrange, reshape and aggregate data.
  • tidyr - Easily tidy data with spread and gather functions.
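
The dplyr style can be sketched in a few lines (a minimal example, assuming dplyr is installed from CRAN):

```r
library(dplyr)

# mean miles-per-gallon by cylinder count in the built-in mtcars data
by_cyl <- mtcars %>%
  group_by(cyl) %>%
  summarise(mean_mpg = mean(mpg))

by_cyl  # one row per distinct value of cyl (4, 6, 8)
```

The same aggregation in data.table would read `as.data.table(mtcars)[, .(mean_mpg = mean(mpg)), by = cyl]`.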

Graphic Displays

Packages for showing data.

  • ggplot2 - An implementation of the Grammar of Graphics.
  • ggvis - Interactive grammar of graphics for R.
  • rCharts - Interactive JS Charts from R.
  • lattice - A powerful and elegant high-level data visualization system.
  • rgl - 3D visualization device system for R.
  • Cairo - R graphics device using cairo graphics library for creating high-quality display output.
  • extrafont - Tools for using fonts in R graphics.
  • showtext - Enable R graphics device to show text using system fonts.
  • dygraphs - Charting time-series data in R.
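
As a minimal sketch of the ggplot2 approach (assuming the package is installed from CRAN):

```r
library(ggplot2)

# scatterplot of car weight vs. fuel economy from the built-in mtcars data,
# with points coloured by number of cylinders
p <- ggplot(mtcars, aes(x = wt, y = mpg, colour = factor(cyl))) +
  geom_point()

# p is a plot object; printing it renders the chart
class(p)
```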

Reproducible Research

Packages for literate programming.

  • knitr - Easy dynamic report generation in R.
  • xtable - Export tables to LaTeX or HTML.
  • rapport - An R templating system.
  • rmarkdown - Dynamic documents for R.

Web Technologies and Services

Packages to surf the web.

  • shiny - Easy interactive web applications with R.
  • RCurl - General network (HTTP/FTP/…) client interface for R.
  • httpuv - HTTP and WebSocket server library.
  • XML - Tools for parsing and generating XML within R.
  • rvest - Simple web scraping for R.

Parallel Computing

Packages for parallel computing.

  • parallel - Included with R since release 2.14.0; incorporates (slightly revised) copies of the multicore and snow packages.
  • Rmpi - Provides an interface (wrapper) to MPI APIs, along with an interactive R worker environment.
  • foreach - Execute looping constructs in parallel.
  • SparkR - R frontend for Spark.
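
For example, the base parallel package offers mclapply as a drop-in parallel replacement for lapply (a minimal sketch; note that on Windows, mclapply only runs with mc.cores = 1):

```r
library(parallel)

# square each element using two forked worker processes
res <- mclapply(1:4, function(x) x^2, mc.cores = 2)
unlist(res)  # 1 4 9 16
```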

High Performance

Packages for making R faster.

  • Rcpp - Provides a powerful C++ API on top of R, making R functions dramatically faster.
  • Rcpp11 - A complete redesign of Rcpp, targeting C++11.
  • compiler - Speed up your R code with byte-code compilation.
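
A minimal sketch of the compiler package, which ships with base R:

```r
library(compiler)

# byte-compile an ordinary R function; cmpfun returns a compiled version
# that computes the same result, typically faster for loop-heavy code
f  <- function(n) { s <- 0; for (i in 1:n) s <- s + i; s }
cf <- cmpfun(f)

cf(100)  # 5050, identical to f(100)
```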

Language API

Packages for other languages.

  • rJava - Low-level R to Java interface.
  • jvmr - Integration of R, Java, and Scala.
  • rJython - R interface to Python via Jython.
  • rPython - Package allowing R to call Python.
  • runr - Run Julia and Bash from R.
  • RJulia - R package to call Julia.
  • RinRuby - A Ruby library that integrates the R interpreter in Ruby.
  • R.matlab - Reading and writing of MAT files, together with R-to-MATLAB connectivity.
  • RcppOctave - Seamless interface to Octave and MATLAB.
  • RSPerl - A bidirectional interface for calling R from Perl and Perl from R.
  • V8 - Embedded JavaScript Engine.

Database Management

Packages for managing data.

  • RODBC - ODBC database access for R.
  • DBI - Defines a common interface between R and database management systems.
  • RMySQL - R interface to the MySQL database.
  • ROracle - OCI based Oracle database interface for R.
  • RPostgreSQL - R interface to the PostgreSQL database system.
  • RSQLite - SQLite interface for R.
  • RJDBC - Provides access to databases through the JDBC interface.
  • rmongodb - R driver for MongoDB.
  • rredis - Redis client for R.
  • RCassandra - Direct interface (not Java) to the most basic functionality of Apache Cassandra.
  • RHive - R extension facilitating distributed computing via Apache Hive.
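
The DBI interface looks the same regardless of backend; here is a minimal sketch using an in-memory SQLite database (assumes the RSQLite package is installed):

```r
library(DBI)

# connect to a throwaway in-memory SQLite database
con <- dbConnect(RSQLite::SQLite(), ":memory:")

# copy a data frame in, then query it with SQL
dbWriteTable(con, "mtcars", mtcars)
res <- dbGetQuery(con, "SELECT cyl, COUNT(*) AS n FROM mtcars GROUP BY cyl")
res  # one row per cylinder count

dbDisconnect(con)
```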

Machine Learning

Packages for making R cleverer.

  • h2o - Deep learning, random forests, GBM, K-means, PCA, GLM
  • Clever Algorithms For Machine Learning
  • Machine Learning For Hackers
  • e1071 - Misc Functions of the Department of Statistics (e1071), TU Wien
  • rattle - Graphical user interface for data mining in R
  • ahaz - Regularization for semiparametric additive hazards regression
  • arules - Mining Association Rules and Frequent Itemsets
  • bigrf - Big Random Forests: Classification and Regression Forests for
    Large Data Sets
  • bigRR - Generalized Ridge Regression (with special advantage for p >> n
    cases)
  • bmrm - Bundle Methods for Regularized Risk Minimization Package
  • Boruta - A wrapper algorithm for all-relevant feature selection
  • bst - Gradient Boosting
  • C50 - C5.0 Decision Trees and Rule-Based Models
  • caret - Classification and Regression Training
  • CORElearn - Classification, regression, feature evaluation and ordinal
    evaluation
  • CoxBoost - Cox models by likelihood based boosting for a single survival
    endpoint or competing risks
  • Cubist - Rule- and Instance-Based Regression Modeling
  • earth - Multivariate Adaptive Regression Spline Models
  • elasticnet - Elastic-Net for Sparse Estimation and Sparse PCA
  • ElemStatLearn - Data sets, functions and examples from the book: “The Elements
    of Statistical Learning, Data Mining, Inference, and
    Prediction” by Trevor Hastie, Robert Tibshirani and Jerome
    Friedman
  • evtree - Evolutionary Learning of Globally Optimal Trees
  • frbs - Fuzzy Rule-based Systems for Classification and Regression Tasks
  • GAMBoost - Generalized linear and additive models by likelihood based
    boosting
  • gamboostLSS - Boosting Methods for GAMLSS
  • gbm - Generalized Boosted Regression Models
  • glmnet - Lasso and elastic-net regularized generalized linear models
  • glmpath - L1 Regularization Path for Generalized Linear Models and Cox
    Proportional Hazards Model
  • GMMBoost - Likelihood-based Boosting for Generalized mixed models
  • grplasso - Fitting user specified models with Group Lasso penalty
  • grpreg - Regularization paths for regression models with grouped
    covariates
  • hda - Heteroscedastic Discriminant Analysis
  • ipred - Improved Predictors
  • kernlab - Kernel-based Machine Learning Lab
  • klaR - Classification and visualization
  • lars - Least Angle Regression, Lasso and Forward Stagewise
  • lasso2 - L1 constrained estimation aka ‘lasso’
  • LiblineaR - Linear Predictive Models Based On The Liblinear C/C++ Library
  • LogicReg - Logic Regression
  • maptree - Mapping, pruning, and graphing tree models
  • mboost - Model-Based Boosting
  • mvpart - Multivariate partitioning
  • ncvreg - Regularization paths for SCAD- and MCP-penalized regression
    models
  • nnet - Feed-forward Neural Networks and Multinomial Log-Linear Models
  • oblique.tree - Oblique Trees for Classification Data
  • pamr - Pam: prediction analysis for microarrays
  • party - A Laboratory for Recursive Partytioning
  • partykit - A Toolkit for Recursive Partytioning
  • penalized - L1 (lasso and fused lasso) and L2 (ridge) penalized estimation
    in GLMs and in the Cox model
  • penalizedLDA - Penalized classification using Fisher’s linear discriminant
  • penalizedSVM - Feature Selection SVM using penalty functions
  • quantregForest - Quantile Regression Forests
  • randomForest - Breiman and Cutler’s random forests for classification and
    regression
  • randomForestSRC - Random Forests for Survival, Regression and Classification
    (RF-SRC)
  • rda - Shrunken Centroids Regularized Discriminant Analysis
  • rdetools - Relevant Dimension Estimation (RDE) in Feature Spaces
  • REEMtree - Regression Trees with Random Effects for Longitudinal (Panel)
    Data
  • relaxo - Relaxed Lasso
  • rgenoud - R version of GENetic Optimization Using Derivatives
  • rgp - R genetic programming framework
  • Rmalschains - Continuous Optimization using Memetic Algorithms with Local
    Search Chains (MA-LS-Chains) in R
  • rminer - Simpler use of data mining methods (e.g. NN and SVM) in
    classification and regression
  • ROCR - Visualizing the performance of scoring classifiers
  • RoughSets - Data Analysis Using Rough Set and Fuzzy Rough Set Theories
  • rpart - Recursive Partitioning and Regression Trees
  • RPMM - Recursively Partitioned Mixture Model
  • RSNNS - Neural Networks in R using the Stuttgart Neural Network
    Simulator (SNNS)
  • RWeka - R/Weka interface
  • RXshrink - Maximum Likelihood Shrinkage via Generalized Ridge or Least
    Angle Regression
  • sda - Shrinkage Discriminant Analysis and CAT Score Variable Selection
  • SDDA - Stepwise Diagonal Discriminant Analysis
  • svmpath - The SVM Path algorithm
  • tgp - Bayesian treed Gaussian process models
  • tree - Classification and regression trees
  • varSelRF - Variable selection using random forests
  • xgboost - eXtreme Gradient Boosting Tree model, well known for its speed and performance.
  • SuperLearner and subsemble - Multi-algorithm ensemble learning packages.
  • Introduction to Statistical Learning
  • BreakoutDetection - Breakout Detection via Robust E-Statistics from Twitter.
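
To give a flavour of the list above, here is a minimal classification-tree sketch with rpart, a recommended package shipped with most R installations:

```r
library(rpart)

# fit a classification tree predicting iris species from the four measurements
fit <- rpart(Species ~ ., data = iris)

# in-sample predictions; a rough sanity check, not an honest error estimate
pred <- predict(fit, iris, type = "class")
mean(pred == iris$Species)
```

For honest accuracy estimates, packages like caret wrap models such as this in cross-validation.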

Natural Language Processing

Packages for Natural Language Processing.

  • tm - A comprehensive text mining framework for R.
  • openNLP - Apache OpenNLP Tools Interface.
  • koRpus - An R Package for Text Analysis.
  • zipfR - Statistical models for word frequency distributions.
  • tmcn - A text mining toolkit for international characters, especially Chinese.
  • rmmseg4j - R interface to the Java Chinese word segmentation system of mmseg4j.
  • Rwordseg - Chinese word segmentation.

Bayesian

Packages for Bayesian Inference.

  • coda - Output analysis and diagnostics for MCMC.
  • mcmc - Markov Chain Monte Carlo.
  • MCMCpack - Markov chain Monte Carlo (MCMC) Package.
  • R2WinBUGS - Running WinBUGS and OpenBUGS from R / S-PLUS.
  • BRugs - R interface to the OpenBUGS MCMC software.
  • rjags - R interface to the JAGS MCMC library.
  • rstan - R interface to the Stan MCMC software.

Finance

Packages for dealing with money.

  • quantmod - Quantitative Financial Modelling & Trading Framework for R.
  • TTR - Functions and data to construct technical trading rules with R.
  • PerformanceAnalytics - Econometric tools for performance and risk analysis.
  • zoo - S3 Infrastructure for Regular and Irregular Time Series.
  • xts - eXtensible Time Series.
  • tseries - Time series analysis and computational finance.
  • fAssets - Analysing and Modelling Financial Assets.

Genetics

Packages for Statistical Genetics.

  • Bioconductor - Tools for the analysis and comprehension of high-throughput genomic data.
  • genetics - Classes and methods for handling genetic data.
  • gap - An integrated package for genetic data analysis of both population and family data.
  • ape - Analyses of Phylogenetics and Evolution.

R Development

Packages for packages.

  • devtools - Tools to make an R developer’s life easier.
  • testthat - An R package to make testing fun.
  • R6 - A simpler, faster, lighter-weight alternative to R’s built-in classes.
  • pryr - Make it easier to understand what’s going on in R.
  • roxygen - Describe your functions in comments next to their definitions.
  • lineprof - Visualise line profiling results in R.
  • packrat - Make your R projects more isolated, portable, and reproducible.
  • installr - Functions for installing software from within R.
  • Rocker - R configurations for Docker.

Other Interpreter

Alternative R engines.

  • renjin - A JVM-based interpreter for R.
  • pqR - A “pretty quick” implementation of R.
  • fastR - An implementation of the R language in Java atop Truffle and Graal.
  • riposte - A fast interpreter and JIT for R.
  • TERR - TIBCO Enterprise Runtime for R.
  • RRE - Revolution R Enterprise.
  • CXXR - Refactorising R into C++.

Resources

Where to discover new R-esources.

Websites

  • R-project - The R Project for Statistical Computing.
  • R Bloggers - There are people scattered across the Web who blog about R. This is simply an aggregator of many of those feeds.
  • DataCamp - Learn R data analytics online.
  • Quick-R - An excellent quick reference.
  • Advanced R - An in-progress book site for Advanced R.
  • CRAN Task Views - Task Views for CRAN packages.

Books

  • The Art of R Programming - It’s a good resource for systematically learning fundamentals such as types of objects, control statements, variable scope, classes and debugging in R.
  • R in Action - This book aims at all levels of users, with sections for beginning, intermediate and advanced R ranging from “Exploring R data structures” to running regressions and conducting factor analyses.
  • Use R! - This series of inexpensive, focused books from Springer publishes shorter volumes aimed at practitioners, each covering the use of R in a particular subject area, such as Bayesian networks, ggplot2 or Rcpp.

Reference Card

Other Awesome Lists

Contributing

Your contributions are always welcome!

Monday, September 21, 2015

Hadley Wickham: The Man Who Changed R



Source: http://mp.weixin.qq.com/s?__biz=MjM5NDQ3NTkwMA==&mid=210298680&idx=1&sn=e64368a007ecc0b7ae1bb1f77d6bc080 (2015-09-21)
[COS editors’ note] This translation was authorized by the original author. Translators: 冯俊晨, 王小宁; reviewers: 邱怡轩, 朱雪宁; editor: 王小宁.

Hadley Wickham is Chief Scientist at RStudio and an assistant professor of statistics at Rice University. He is the developer of the celebrated visualization package ggplot2 and the author of many other widely used packages, such as plyr and reshape2. This article is taken from PRICEONOMICS.

“To be able to fundamentally understand the world through data is a really, really cool thing.”
— Hadley Wickham, prolific R developer
If you don’t spend a lot of time writing code in R, the open-source statistical programming language, his name may be unfamiliar — but statistician Hadley Wickham is, in his own words, “nerd famous.” He is the kind of person people line up to photograph and ask for autographs at statistics conferences, and he is held in enormous respect. He admits, “It’s utterly bizarre. Famous for writing R programs? That’s crazy.”
R is a programming language designed for data analysis, and Wickham earned his reputation by becoming a preeminent developer of R packages — programming tools that simplify the code for common tasks such as aggregating and plotting data. Wickham has helped tens of thousands of people work more productively, which has earned him their gratitude and, at times, outright adulation. Users of his packages include tech giants such as Google, Facebook and Twitter, news organizations such as the New York Times and FiveThirtyEight, and government agencies such as the Food and Drug Administration (FDA) and the Drug Enforcement Administration (DEA).
He is, in short, a giant among nerds.
***
Wickham was born into a family of statisticians in Hamilton, New Zealand. His father, Brian Wickham, earned a PhD from Cornell University in animal breeding, a field that makes heavy use of statistics; his sister holds a PhD in statistics from UC Berkeley. If the world has data-structure prodigies, Wickham may well be one of them. He recalls his early experience with some pride: “My first job, when I was 15, was building a Microsoft Access database. I thought it was pretty interesting. I wrote documentation for the database, and they are still using it today.”
From that first job onward, Wickham kept asking whether there might be a better way to store and manipulate data. “I’ve always been pretty confident I could find a better way,” he explains, “and that it could benefit other people.” Though he was still feeling his way at the time, it was then that he “internalized” the concept of Third Normal Form, which would play an important role in his later work. Third Normal Form is, in essence, a way of structuring data that reduces redundancy and ensures consistency. Wickham calls data in this form “tidy” data, and his tools both popularize and depend on this structure.

The R logo; the language’s revolutionary evolution is due in part to Hadley Wickham
Wickham first encountered R while studying for his undergraduate degree in statistics at the University of Auckland, New Zealand. He describes it as “a programming language for understanding data.” Alongside SQL and Python, R is one of the most popular languages among data scientists.
Like Wickham, the language he would go on to revolutionize comes from New Zealand. R was created in 1993 by Ross Ihaka and Robert Gentleman, statisticians at the University of Auckland. Because the language is tailor-made for data analysis, and because parts of it are unusual (for example, how data structures are indexed, and the requirement that data be held in memory), programmers familiar with other languages often find R strange. Having written Java, VBA and PHP, Wickham saw it differently: “[Many programmers] come to R and find it weird, but I never did. I thought it was really interesting.”
Wickham began developing R packages while working on his PhD at Iowa State University. In his words, writing a package means “writing some code that helps people solve a problem, and then writing documentation that helps people understand how to use it.” His first package, written as part of a course project, was meant to visualize biochemical data. Although it was never released, it firmly planted in him the idea of sharing his tools.
In 2005, Wickham released the reshape package, the first in a string of hits; it has been downloaded hundreds of thousands of times since. reshape set out to make aggregating and manipulating data less “tedious and frustrating.” To non-programmers, simplifying data reshaping may not sound like much, but for data scientists and statisticians it is often the most time-consuming part of the job.
Wickham was clearly encouraged by reshape’s success. He built it precisely because he believed he could do better than what came before. Though not given to boasting, he does not lack confidence. “I have a strong belief that I know the right way to solve a problem,” he repeats. “Whether that’s a good thing or a bad thing, I don’t know.”
***
Even as reshape and several other packages took off, Wickham was growing disillusioned with statistics. During his PhD he noticed that “what was taught in school had almost nothing to do with what people actually need in order to understand data.” Unlike statisticians devoted to abstruse variants of the central limit theorem, Wickham set out to make data analysis easier for ordinary people. As he puts it: “There will surely be ivory-tower statisticians who deny that what I do is statistics, but I think they’re wrong. What I do goes back to the roots of statistics. The very existence of data science as a discipline shows that orthodox statistics has a huge gap. To me, this is what statistics is: gaining insight from data through modeling and visualization. Data cleaning and manipulation is dirty, tedious work, and orthodox statistics has simply walked away and declared it not their problem.”
Along this road of disillusionment, Wickham developed ggplot2. Downloaded millions of times to date, it is not only his most successful work but has also changed how many people think about data visualization. Its enormous success prompted him to leave academia and become Chief Scientist at RStudio (the for-profit developer of R’s most popular integrated development environment), so that he could devote himself to improving R.
Hadley Wickham presenting a chart drawn with ggplot2. Photo: David Kahle and Garrett Grolemund
The ggplot2 package implements a form of data visualization based on statistician Leland Wilkinson’s “grammar of graphics.” Wickham describes ggplot2 and the grammar of graphics as “a way of thinking about visualization not as a series of mechanical operations (draw a line from here to there, put a point here, color this rectangle) but as a mapping from data to the things you can see.”
The ideas behind the grammar of graphics are quite abstract. The biggest one is that a chart is composed of “geoms” (the graphical elements we see on a chart, such as a point or a bar) and “aesthetics” (how those geoms are placed and styled). That may not sound revolutionary, but Wickham’s implementation of the idea has made plotting far easier for hundreds of thousands of people. Nearly 9,000 questions on the Q&A site Stack Overflow are tagged ggplot2, and some even say ggplot2 makes plotting in R “fun.” Charts drawn with ggplot2 have appeared in Nature, on FiveThirtyEight and in the New York Times.

Hadley Wickham holding a Chinese translation of a book about his visualization package ggplot2; photo from statr
Beyond developing ggplot2 and reshape, Wickham has designed a number of other widely used packages that solve other important problems for data scientists. Want to manipulate data easily in the form of words (strings)? Want to scrape data from the web? Need to write your own package without pain? Wickham has you covered.
On Quora [a question-and-answer site — translator’s note], one R user asked: “How is Hadley Wickham able to contribute so much to R, particularly in the form of packages? I still cannot work out in detail just how much Hadley has produced. It seems impossible for one person to build so much…” Eduardo Arino de la Rubia, an active member of the R community, says every successful programming language needs a “celebrity” like Hadley. He compares Hadley to David Heinemeier Hansson (creator of the web application framework Ruby on Rails) and Tatsuhiko Miyagawa (a major developer of the programming language Perl).
The chart below shows the release dates and download counts of Hadley’s 17 packages with more than 2,000 downloads (sometimes jokingly called the “Hadleyverse”). The download figures are a severe undercount, since they reflect only one popular download source starting in late 2012. And yes, the chart was drawn with one of Hadley’s packages (ggvis).
Dan Kopf, Priceonomics; data source: cranlogs
So why did Hadley create all of this? R is free to download and all of the packages are free, so money is at best a secondary incentive. Simply put, when a problem is harder to solve than it ought to be, it gnaws at Wickham. While “most other people can accept that life is hard,” Wickham cannot. As he says: “One of the things that makes me successful is that I am extraordinarily sensitive to frustration.” That sensitivity has earned him a “peculiar fame.”
Most of the time Wickham goes unnoticed, but at R meetups and statistics conferences he becomes a rock star. “I can see my fame reaching a level that makes me uncomfortable,” he says. He wishes someone would write a book on “how to be a celebrity in a very particular niche,” and he worries about how to behave when people gush over him. Though by now used to the notoriety, he still gets excited that people use the tools he created. He enjoys checking how many people use them at “Facebook, Google, Twitter, Tumblr…” Only in San Francisco, he says, are the odds of being recognized on the street somewhat higher. He also mentions that a recent visit to the news outlet FiveThirtyEight delighted him; he finds it cool to see how others use his tools (they draw their charts with a highly customized ggplot2).
***
Above all, Wickham delights in empowering people who love playing with data. As he explains: “To be able to fundamentally understand the world through data is a really, really cool thing. The analysis that excites me is not Google crawling a terabyte of web-advertising data to optimize revenue, [but] the biologist with absolute passion who can now use, and understand, R.”



Saturday, September 19, 2015

ŷhat | 10 R packages I wish I knew about earlier

A brief introduction to “apply” in R


https://nsaunders.wordpress.com/2010/08/20/a-brief-introduction-to-apply-in-r
At any R Q&A site, you’ll frequently see an exchange like this one:
Q: How can I use a loop to […insert task here…] ?
A: Don’t. Use one of the apply functions.
So, what are these wondrous apply functions and how do they work? I think the best way to figure out anything in R is to learn by experimentation, using embarrassingly trivial data and functions.

If you fire up your R console, type “??apply” and scroll down to the functions in the base package, you’ll see something like this:
base::apply             Apply Functions Over Array Margins
base::by                Apply a Function to a Data Frame Split by Factors
base::eapply            Apply a Function Over Values in an Environment
base::lapply            Apply a Function over a List or Vector
base::mapply            Apply a Function to Multiple List or Vector Arguments
base::rapply            Recursively Apply a Function to a List
base::tapply            Apply a Function Over a Ragged Array
Let’s examine each of those.
1. apply
Description: “Returns a vector or array or list of values obtained by applying a function to margins of an array or matrix.”
OK – we know about vectors/arrays and functions, but what are these “margins”? Simple: either the rows (1), the columns (2) or both (1:2). By “both”, we mean “apply the function to each individual value.” An example:
# create a matrix of 10 rows x 2 columns
m <- matrix(c(1:10, 11:20), nrow = 10, ncol = 2)
# mean of the rows
apply(m, 1, mean)
 [1]  6  7  8  9 10 11 12 13 14 15
# mean of the columns
apply(m, 2, mean)
[1]  5.5 15.5
# divide all values by 2
apply(m, 1:2, function(x) x/2)
      [,1] [,2]
 [1,]  0.5  5.5
 [2,]  1.0  6.0
 [3,]  1.5  6.5
 [4,]  2.0  7.0
 [5,]  2.5  7.5
 [6,]  3.0  8.0
 [7,]  3.5  8.5
 [8,]  4.0  9.0
 [9,]  4.5  9.5
[10,]  5.0 10.0
That last example was rather trivial; you could just as easily do “m[, 1:2]/2” – but you get the idea.
2. by
Updated 27/2/14: note that the original example in this section no longer works; use colMeans now instead of mean.
Description: “Function ‘by’ is an object-oriented wrapper for ‘tapply’ applied to data frames.”
The by function is a little more complex than that. Read a little further and the documentation tells you that “a data frame is split by row into data frames subsetted by the values of one or more factors, and function ‘FUN’ is applied to each subset in turn.” So, we use this one where factors are involved.
To illustrate, we can load up the classic R dataset “iris”, which contains a bunch of flower measurements:
attach(iris)
head(iris)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa
 
# get the mean of the first 4 variables, by species
by(iris[, 1:4], Species, colMeans)
Species: setosa
Sepal.Length  Sepal.Width Petal.Length  Petal.Width
       5.006        3.428        1.462        0.246
------------------------------------------------------------
Species: versicolor
Sepal.Length  Sepal.Width Petal.Length  Petal.Width
       5.936        2.770        4.260        1.326
------------------------------------------------------------
Species: virginica
Sepal.Length  Sepal.Width Petal.Length  Petal.Width
       6.588        2.974        5.552        2.026
Essentially, by provides a way to split your data by factors and do calculations on each subset. It returns an object of class “by” and there are many, more complex ways to use it.
3. eapply
Description: “eapply applies FUN to the named values from an environment and returns the results as a list.”
This one is a little trickier, since you need to know something about environments in R. An environment, as the name suggests, is a self-contained object with its own variables and functions. To continue using our very simple example:
# a new environment
e <- new.env()
# two environment variables, a and b
e$a <- 1:10
e$b <- 11:20
# mean of the variables
eapply(e, mean)
$b
[1] 15.5
 
$a
[1] 5.5
I don’t often create my own environments, but they’re commonly used by R packages such as Bioconductor so it’s good to know how to handle them.
4. lapply
Description: “lapply returns a list of the same length as X, each element of which is the result of applying FUN to the corresponding element of X.”
That’s a nice, clear description which makes lapply one of the easier apply functions to understand. A simple example:
# create a list with 2 elements
l <- list(a = 1:10, b = 11:20)
# the mean of the values in each element
lapply(l, mean)
$a
[1] 5.5
 
$b
[1] 15.5
 
# the sum of the values in each element
lapply(l, sum)
$a
[1] 55
 
$b
[1] 155
The lapply documentation tells us to consult further documentation for sapply, vapply and replicate. Let’s do that.
    4.1 sapply
Description: “sapply is a user-friendly version of lapply by default returning a vector or matrix if appropriate.”
That simply means that if lapply would have returned a list with elements $a and $b, sapply will return either a vector, with elements [[‘a’]] and [[‘b’]], or a matrix with column names “a” and “b”. Returning to our previous simple example:
# create a list with 2 elements
l <- list(a = 1:10, b = 11:20)
# mean of values using sapply
l.mean <- sapply(l, mean)
# what type of object was returned?
class(l.mean)
[1] "numeric"
# it's a numeric vector, so we can get element "a" like this
l.mean[['a']]
[1] 5.5
    4.2 vapply
Description: “vapply is similar to sapply, but has a pre-specified type of return value, so it can be safer (and sometimes faster) to use.”
A third argument is supplied to vapply, which you can think of as a kind of template for the output. The documentation uses the fivenum function as an example, so let’s go with that:
l <- list(a = 1:10, b = 11:20)
# fivenum of values using vapply
l.fivenum <- vapply(l, fivenum, c(Min.=0, "1st Qu."=0, Median=0, "3rd Qu."=0, Max.=0))
class(l.fivenum)
[1] "matrix"
# let's see it
l.fivenum
           a    b
Min.     1.0 11.0
1st Qu.  3.0 13.0
Median   5.5 15.5
3rd Qu.  8.0 18.0
Max.    10.0 20.0
So, vapply returned a matrix, where the column names correspond to the original list elements and the row names to the output template. Nice.
    4.3 replicate
Description: “replicate is a wrapper for the common use of sapply for repeated evaluation of an expression (which will usually involve random number generation).”
The replicate function is very useful. Give it two mandatory arguments: the number of replications and the function to replicate; a third optional argument, simplify = T, tries to simplify the result to a vector or matrix. An example – let’s simulate 10 normal distributions, each with 10 observations:
replicate(10, rnorm(10))
             [,1]        [,2]        [,3]       [,4]        [,5]         [,6]
 [1,]  0.67947001 -1.94649409  0.28144696  0.5872913  2.22715085 -0.275918282
 [2,]  1.17298643 -0.01529898 -1.47314092 -1.3274354 -0.04105249  0.528666264
 [3,]  0.77272662 -2.36122644  0.06397576  1.5870779 -0.33926083  1.121164338
 [4,] -0.42702542 -0.90613885  0.83645668 -0.5462608 -0.87458396 -0.723858258
 [5,] -0.73892937 -0.57486661 -0.04418200 -0.1120936  0.08253614  1.319095242
 [6,]  2.93827883 -0.33363446  0.55405024 -0.4942736  0.66407615 -0.153623614
 [7,]  1.30037496 -0.26207115  0.49818215  1.0774543 -0.28206908  0.825488436
 [8,] -0.04153545 -0.23621632 -1.01192741  0.4364413 -2.28991601 -0.002867193
 [9,]  0.01262547  0.40247248  0.65816829  0.9541927 -1.63770154  0.328180660
[10,]  0.96525278 -0.37850821 -0.85869035 -0.6055622  1.13756753 -0.371977151
             [,7]        [,8]       [,9]       [,10]
 [1,]  0.03928297  0.34990909 -0.3159794  1.08871657
 [2,] -0.79258805 -0.30329668 -1.0902070  0.73356542
 [3,]  0.10673459 -0.02849216  0.8094840  0.06446245
 [4,] -0.84584079 -0.57308461 -1.3570979 -0.89801330
 [5,] -1.50226560 -2.35751419  1.2104163  0.74650696
 [6,] -0.32790991  0.80144695 -0.0071844  0.05742356
 [7,]  1.36719970  2.34148354  0.9148911  0.20451421
 [8,] -0.51112579 -0.53658159  1.5194130 -0.94250069
 [9,]  0.52017814 -1.22252527  0.4519702  0.08779704
[10,]  1.35908918  1.09024342  0.5912627 -0.20709053
5. mapply
Description: “mapply is a multivariate version of sapply. mapply applies FUN to the first elements of each (…) argument, the second elements, the third elements, and so on.”
The mapply documentation is full of quite complex examples, but here’s a simple, silly one:
l1 <- list(a = c(1:10), b = c(11:20))
l2 <- list(c = c(21:30), d = c(31:40))
# sum the corresponding elements of l1 and l2
mapply(sum, l1$a, l1$b, l2$c, l2$d)
 [1]  64  68  72  76  80  84  88  92  96 100
Here, we sum l1$a[1] + l1$b[1] + l2$c[1] + l2$d[1] (1 + 11 + 21 + 31) to get 64, the first element of the returned list. All the way through to l1$a[10] + l1$b[10] + l2$c[10] + l2$d[10] (10 + 20 + 30 + 40) = 100, the last element.
6. rapply
Description: “rapply is a recursive version of lapply.”
I think “recursive” is a little misleading. What rapply does is apply functions to lists in different ways, depending on the arguments supplied. Best illustrated by examples:
# let's start with our usual simple list example
l <- list(a = 1:10, b = 11:20)
# log2 of each value in the list
rapply(l, log2)
      a1       a2       a3       a4       a5       a6       a7       a8
0.000000 1.000000 1.584963 2.000000 2.321928 2.584963 2.807355 3.000000
      a9      a10       b1       b2       b3       b4       b5       b6
3.169925 3.321928 3.459432 3.584963 3.700440 3.807355 3.906891 4.000000
      b7       b8       b9      b10
4.087463 4.169925 4.247928 4.321928
# log2 of each value in each list
rapply(l, log2, how = "list")
$a
 [1] 0.000000 1.000000 1.584963 2.000000 2.321928 2.584963 2.807355 3.000000
 [9] 3.169925 3.321928
 
$b
 [1] 3.459432 3.584963 3.700440 3.807355 3.906891 4.000000 4.087463 4.169925
 [9] 4.247928 4.321928
 
# what if the function is the mean?
rapply(l, mean)
   a    b
 5.5 15.5
 
rapply(l, mean, how = "list")
$a
[1] 5.5
 
$b
[1] 15.5
So, the output of rapply depends on both the function and the how argument. When how = “list” (or “replace”), the original list structure is preserved. Otherwise, the default is to unlist, which results in a vector.
You can also pass a “classes=” argument to rapply. For example, in a mixed list of numeric and character variables, you could specify that the function act only on the numeric values with “classes = numeric”.
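For example:

```r
# a mixed list of numeric and character elements
l <- list(a = c(1, 2, 3), b = "hello", d = c(2.5, 3.5))

# double only the numeric elements; how = "replace" preserves the list
# structure and leaves the character element untouched
rapply(l, function(x) x * 2, classes = "numeric", how = "replace")
```

Note that classes are matched literally here, so an integer vector created with the `:` operator has class "integer" and would not be touched by classes = "numeric".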
7. tapply
Description: “Apply a function to each cell of a ragged array, that is to each (non-empty) group of values given by a unique combination of the levels of certain factors.”
Woah there. That sounds complicated. Don’t panic though, it becomes clearer when the required arguments are described. Usage is “tapply(X, INDEX, FUN = NULL, …, simplify = TRUE)”, where X is “an atomic object, typically a vector” and INDEX is “a list of factors, each of same length as X”.
So, to go back to the famous iris data, “Species” might be a factor and “iris$Petal.Width” would give us a vector of values. We could then run something like:
attach(iris)
# mean petal length by species
tapply(iris$Petal.Length, Species, mean)
    setosa versicolor  virginica
     1.462      4.260      5.552
Summary
I’ve used very simple examples here, with contrived data and standard functions (such as mean and sum). For me, this is the easiest way to learn what a function does: I can look at the original data, then the result, and figure out what happened. However, the “apply” family is much more powerful than these illustrations suggest – I encourage you to play around with it.
The things to consider when choosing an apply function are basically:
  • What class is my input data? – vector, matrix, data frame…
  • On which subsets of that data do I want the function to act? – rows, columns, all values…
  • What class will the function return? How is the original data structure transformed?
It’s the usual input-process-output story: what do I have, what do I want, and what lies in between?