Friday, September 25, 2015

Awesome R

A curated list of awesome R frameworks, packages and software. Inspired by awesome-machine-learning.

Integrated Development Environment

  • RStudio - A powerful and productive user interface for R. Works great on Windows, Mac, and Linux.
  • JGR - JGR (pronounced ‘Jaguar’) is a Java GUI for R.
  • Emacs + ESS - Emacs Speaks Statistics is an add-on package for Emacs.
  • StatET - An Eclipse-based IDE (integrated development environment) for R.
  • Revolution R Enterprise - Revolution R is offered free to academic users; the commercial edition focuses on big data and large-scale multiprocessor functionality.
  • R Commander - A package that provides a basic graphical user interface.
  • IPython - An interactive Python interpreter that supports execution of R code while capturing both output and figures.
  • Deducer - A menu-driven data analysis GUI with a spreadsheet-like data editor.

Data Manipulation

Packages for cooking data.

  • dplyr - Blazingly fast data frame manipulation and database queries.
  • data.table - Fast data manipulation with a short, flexible syntax.
  • reshape2 - Flexibly rearrange, reshape and aggregate data.
  • tidyr - Easily tidy data with spread and gather functions.
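
The dplyr style can be sketched in a few lines (a minimal example, assuming dplyr is installed from CRAN):

```r
library(dplyr)

# mean miles-per-gallon by cylinder count in the built-in mtcars data
by_cyl <- mtcars %>%
  group_by(cyl) %>%
  summarise(mean_mpg = mean(mpg))

by_cyl  # one row per distinct value of cyl (4, 6, 8)
```

The same aggregation in data.table would read `as.data.table(mtcars)[, .(mean_mpg = mean(mpg)), by = cyl]`.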

Graphic Displays

Packages for showing data.

  • ggplot2 - An implementation of the Grammar of Graphics.
  • ggvis - Interactive grammar of graphics for R.
  • rCharts - Interactive JS Charts from R.
  • lattice - A powerful and elegant high-level data visualization system.
  • rgl - 3D visualization device system for R.
  • Cairo - R graphics device using cairo graphics library for creating high-quality display output.
  • extrafont - Tools for using fonts in R graphics.
  • showtext - Enable R graphics device to show text using system fonts.
  • dygraphs - Charting time-series data in R.
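
As a minimal sketch of the ggplot2 approach (assuming the package is installed from CRAN):

```r
library(ggplot2)

# scatterplot of car weight vs. fuel economy from the built-in mtcars data,
# with points coloured by number of cylinders
p <- ggplot(mtcars, aes(x = wt, y = mpg, colour = factor(cyl))) +
  geom_point()

# p is a plot object; printing it renders the chart
class(p)
```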

Reproducible Research

Packages for literate programming.

  • knitr - Easy dynamic report generation in R.
  • xtable - Export tables to LaTeX or HTML.
  • rapport - An R templating system.
  • rmarkdown - Dynamic documents for R.

Web Technologies and Services

Packages to surf the web.

  • shiny - Easy interactive web applications with R.
  • RCurl - General network (HTTP/FTP/…) client interface for R.
  • httpuv - HTTP and WebSocket server library.
  • XML - Tools for parsing and generating XML within R.
  • rvest - Simple web scraping for R.

Parallel Computing

Packages for parallel computing.

  • parallel - Included with R since release 2.14.0; incorporates (slightly revised) copies of the multicore and snow packages.
  • Rmpi - Provides an interface (wrapper) to MPI APIs, along with an interactive R worker environment.
  • foreach - Execute looping constructs in parallel.
  • SparkR - R frontend for Spark.
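
For example, the base parallel package offers mclapply as a drop-in parallel replacement for lapply (a minimal sketch; note that on Windows, mclapply only runs with mc.cores = 1):

```r
library(parallel)

# square each element using two forked worker processes
res <- mclapply(1:4, function(x) x^2, mc.cores = 2)
unlist(res)  # 1 4 9 16
```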

High Performance

Packages for making R faster.

  • Rcpp - Provides a powerful C++ API on top of R, making R functions dramatically faster.
  • Rcpp11 - A complete redesign of Rcpp, targeting C++11.
  • compiler - Speed up your R code with byte-code compilation.
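
A minimal sketch of the compiler package, which ships with base R:

```r
library(compiler)

# byte-compile an ordinary R function; cmpfun returns a compiled version
# that computes the same result, typically faster for loop-heavy code
f  <- function(n) { s <- 0; for (i in 1:n) s <- s + i; s }
cf <- cmpfun(f)

cf(100)  # 5050, identical to f(100)
```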

Language API

Packages for other languages.

  • rJava - Low-level R to Java interface.
  • jvmr - Integration of R, Java, and Scala.
  • rJython - R interface to Python via Jython.
  • rPython - Package allowing R to call Python.
  • runr - Run Julia and Bash from R.
  • RJulia - R package to call Julia.
  • RinRuby - A Ruby library that integrates the R interpreter in Ruby.
  • R.matlab - Reading and writing of MAT files, together with R-to-MATLAB connectivity.
  • RcppOctave - Seamless interface to Octave and MATLAB.
  • RSPerl - A bidirectional interface for calling R from Perl and Perl from R.
  • V8 - Embedded JavaScript Engine.

Database Management

Packages for managing data.

  • RODBC - ODBC database access for R.
  • DBI - Defines a common interface between R and database management systems.
  • RMySQL - R interface to the MySQL database.
  • ROracle - OCI based Oracle database interface for R.
  • RPostgreSQL - R interface to the PostgreSQL database system.
  • RSQLite - SQLite interface for R.
  • RJDBC - Provides access to databases through the JDBC interface.
  • rmongodb - R driver for MongoDB.
  • rredis - Redis client for R.
  • RCassandra - Direct interface (not Java) to the most basic functionality of Apache Cassandra.
  • RHive - R extension facilitating distributed computing via Apache Hive.
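
The DBI interface looks the same regardless of backend; here is a minimal sketch using an in-memory SQLite database (assumes the RSQLite package is installed):

```r
library(DBI)

# connect to a throwaway in-memory SQLite database
con <- dbConnect(RSQLite::SQLite(), ":memory:")

# copy a data frame in, then query it with SQL
dbWriteTable(con, "mtcars", mtcars)
res <- dbGetQuery(con, "SELECT cyl, COUNT(*) AS n FROM mtcars GROUP BY cyl")
res  # one row per cylinder count

dbDisconnect(con)
```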

Machine Learning

Packages for making R cleverer.

  • h2o - Deep learning, random forests, GBM, K-means, PCA, GLM
  • Clever Algorithms For Machine Learning
  • Machine Learning For Hackers
  • e1071 - Misc Functions of the Department of Statistics (e1071), TU Wien
  • rattle - Graphical user interface for data mining in R
  • ahaz - Regularization for semiparametric additive hazards regression
  • arules - Mining Association Rules and Frequent Itemsets
  • bigrf - Big Random Forests: Classification and Regression Forests for
    Large Data Sets
  • bigRR - Generalized Ridge Regression (with special advantage for p >> n
    cases)
  • bmrm - Bundle Methods for Regularized Risk Minimization Package
  • Boruta - A wrapper algorithm for all-relevant feature selection
  • bst - Gradient Boosting
  • C50 - C5.0 Decision Trees and Rule-Based Models
  • caret - Classification and Regression Training
  • CORElearn - Classification, regression, feature evaluation and ordinal
    evaluation
  • CoxBoost - Cox models by likelihood based boosting for a single survival
    endpoint or competing risks
  • Cubist - Rule- and Instance-Based Regression Modeling
  • earth - Multivariate Adaptive Regression Spline Models
  • elasticnet - Elastic-Net for Sparse Estimation and Sparse PCA
  • ElemStatLearn - Data sets, functions and examples from the book: “The Elements
    of Statistical Learning, Data Mining, Inference, and
    Prediction” by Trevor Hastie, Robert Tibshirani and Jerome
    Friedman
  • evtree - Evolutionary Learning of Globally Optimal Trees
  • frbs - Fuzzy Rule-based Systems for Classification and Regression Tasks
  • GAMBoost - Generalized linear and additive models by likelihood based
    boosting
  • gamboostLSS - Boosting Methods for GAMLSS
  • gbm - Generalized Boosted Regression Models
  • glmnet - Lasso and elastic-net regularized generalized linear models
  • glmpath - L1 Regularization Path for Generalized Linear Models and Cox
    Proportional Hazards Model
  • GMMBoost - Likelihood-based Boosting for Generalized mixed models
  • grplasso - Fitting user specified models with Group Lasso penalty
  • grpreg - Regularization paths for regression models with grouped
    covariates
  • hda - Heteroscedastic Discriminant Analysis
  • ipred - Improved Predictors
  • kernlab - Kernel-based Machine Learning Lab
  • klaR - Classification and visualization
  • lars - Least Angle Regression, Lasso and Forward Stagewise
  • lasso2 - L1 constrained estimation aka ‘lasso’
  • LiblineaR - Linear Predictive Models Based On The Liblinear C/C++ Library
  • LogicReg - Logic Regression
  • maptree - Mapping, pruning, and graphing tree models
  • mboost - Model-Based Boosting
  • mvpart - Multivariate partitioning
  • ncvreg - Regularization paths for SCAD- and MCP-penalized regression
    models
  • nnet - Feed-forward Neural Networks and Multinomial Log-Linear Models
  • oblique.tree - Oblique Trees for Classification Data
  • pamr - Pam: prediction analysis for microarrays
  • party - A Laboratory for Recursive Partytioning
  • partykit - A Toolkit for Recursive Partytioning
  • penalized - L1 (lasso and fused lasso) and L2 (ridge) penalized estimation
    in GLMs and in the Cox model
  • penalizedLDA - Penalized classification using Fisher’s linear discriminant
  • penalizedSVM - Feature Selection SVM using penalty functions
  • quantregForest - Quantile Regression Forests
  • randomForest - Breiman and Cutler’s random forests for classification and
    regression
  • randomForestSRC - Random Forests for Survival, Regression and Classification
    (RF-SRC)
  • rda - Shrunken Centroids Regularized Discriminant Analysis
  • rdetools - Relevant Dimension Estimation (RDE) in Feature Spaces
  • REEMtree - Regression Trees with Random Effects for Longitudinal (Panel)
    Data
  • relaxo - Relaxed Lasso
  • rgenoud - R version of GENetic Optimization Using Derivatives
  • rgp - R genetic programming framework
  • Rmalschains - Continuous Optimization using Memetic Algorithms with Local
    Search Chains (MA-LS-Chains) in R
  • rminer - Simpler use of data mining methods (e.g. NN and SVM) in
    classification and regression
  • ROCR - Visualizing the performance of scoring classifiers
  • RoughSets - Data Analysis Using Rough Set and Fuzzy Rough Set Theories
  • rpart - Recursive Partitioning and Regression Trees
  • RPMM - Recursively Partitioned Mixture Model
  • RSNNS - Neural Networks in R using the Stuttgart Neural Network
    Simulator (SNNS)
  • RWeka - R/Weka interface
  • RXshrink - Maximum Likelihood Shrinkage via Generalized Ridge or Least
    Angle Regression
  • sda - Shrinkage Discriminant Analysis and CAT Score Variable Selection
  • SDDA - Stepwise Diagonal Discriminant Analysis
  • svmpath - The SVM Path algorithm
  • tgp - Bayesian treed Gaussian process models
  • tree - Classification and regression trees
  • varSelRF - Variable selection using random forests
  • xgboost - eXtreme Gradient Boosting Tree model, well known for its speed and performance.
  • SuperLearner and subsemble - Multi-algorithm ensemble learning packages.
  • Introduction to Statistical Learning
  • BreakoutDetection - Breakout Detection via Robust E-Statistics from Twitter.
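
To give a flavour of the list above, here is a minimal classification-tree sketch with rpart, a recommended package shipped with most R installations:

```r
library(rpart)

# fit a classification tree predicting iris species from the four measurements
fit <- rpart(Species ~ ., data = iris)

# in-sample predictions; a rough sanity check, not an honest error estimate
pred <- predict(fit, iris, type = "class")
mean(pred == iris$Species)
```

For honest accuracy estimates, packages like caret wrap models such as this in cross-validation.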

Natural Language Processing

Packages for Natural Language Processing.

  • tm - A comprehensive text mining framework for R.
  • openNLP - Apache OpenNLP Tools Interface.
  • koRpus - An R Package for Text Analysis.
  • zipfR - Statistical models for word frequency distributions.
  • tmcn - A text mining toolkit for international characters, especially Chinese.
  • rmmseg4j - R interface to the Java Chinese word segmentation system of mmseg4j.
  • Rwordseg - Chinese word segmentation.

Bayesian

Packages for Bayesian Inference.

  • coda - Output analysis and diagnostics for MCMC.
  • mcmc - Markov Chain Monte Carlo.
  • MCMCpack - Markov chain Monte Carlo (MCMC) Package.
  • R2WinBUGS - Running WinBUGS and OpenBUGS from R / S-PLUS.
  • BRugs - R interface to the OpenBUGS MCMC software.
  • rjags - R interface to the JAGS MCMC library.
  • rstan - R interface to the Stan MCMC software.

Finance

Packages for dealing with money.

  • quantmod - Quantitative Financial Modelling & Trading Framework for R.
  • TTR - Functions and data to construct technical trading rules with R.
  • PerformanceAnalytics - Econometric tools for performance and risk analysis.
  • zoo - S3 Infrastructure for Regular and Irregular Time Series.
  • xts - eXtensible Time Series.
  • tseries - Time series analysis and computational finance.
  • fAssets - Analysing and Modelling Financial Assets.

Genetics

Packages for Statistical Genetics.

  • Bioconductor - Tools for the analysis and comprehension of high-throughput genomic data.
  • genetics - Classes and methods for handling genetic data.
  • gap - An integrated package for genetic data analysis of both population and family data.
  • ape - Analyses of Phylogenetics and Evolution.

R Development

Packages for packages.

  • devtools - Tools to make an R developer’s life easier.
  • testthat - An R package to make testing fun.
  • R6 - A simpler, faster, lighter-weight alternative to R’s built-in classes.
  • pryr - Make it easier to understand what’s going on in R.
  • roxygen - Describe your functions in comments next to their definitions.
  • lineprof - Visualise line profiling results in R.
  • packrat - Make your R projects more isolated, portable, and reproducible.
  • installr - Functions for installing software from within R.
  • Rocker - R configurations for Docker.

Other Interpreter

Alternative R engines.

  • renjin - A JVM-based interpreter for R.
  • pqR - A “pretty quick” implementation of R.
  • fastR - An implementation of the R language in Java atop Truffle and Graal.
  • riposte - A fast interpreter and JIT for R.
  • TERR - TIBCO Enterprise Runtime for R.
  • RRE - Revolution R Enterprise.
  • CXXR - Refactorising R into C++.

Resources

Where to discover new R-esources.

Websites

  • R-project - The R Project for Statistical Computing.
  • R Bloggers - There are people scattered across the Web who blog about R. This is simply an aggregator of many of those feeds.
  • DataCamp - Learn R data analytics online.
  • Quick-R - An excellent quick reference.
  • Advanced R - An in-progress book site for Advanced R.
  • CRAN Task Views - Task Views for CRAN packages.

Books

  • The Art of R Programming - It’s a good resource for systematically learning fundamentals such as types of objects, control statements, variable scope, classes and debugging in R.
  • R in Action - This book aims at all levels of users, with sections for beginning, intermediate and advanced R ranging from “Exploring R data structures” to running regressions and conducting factor analyses.
  • Use R! - This series of inexpensive, focused books from Springer publishes shorter volumes aimed at practitioners, each covering the use of R in a particular subject area, such as Bayesian networks, ggplot2 or Rcpp.

Reference Card

Other Awesome Lists

Contributing

Your contributions are always welcome!

Monday, September 21, 2015

Hadley Wickham: The Man Who Changed R



Source: http://mp.weixin.qq.com/s?__biz=MjM5NDQ3NTkwMA==&mid=210298680&idx=1&sn=e64368a007ecc0b7ae1bb1f77d6bc080 (2015-09-21)
[COS editors’ note] This translation was authorized by the original author. Translators: 冯俊晨, 王小宁; reviewers: 邱怡轩, 朱雪宁; editor: 王小宁.

Hadley Wickham is Chief Scientist at RStudio and an assistant professor of statistics at Rice University. He is the developer of the celebrated visualization package ggplot2 and the author of many other widely used packages, such as plyr and reshape2. This article is taken from PRICEONOMICS.

“To be able to fundamentally understand the world through data is a really, really cool thing.”
— Hadley Wickham, prolific R developer
If you don’t spend a lot of time writing code in R, the open-source statistical programming language, his name may be unfamiliar — but statistician Hadley Wickham is, in his own words, “nerd famous.” He is the kind of person people line up to photograph and ask for autographs at statistics conferences, and he is held in enormous respect. He admits, “It’s utterly bizarre. Famous for writing R programs? That’s crazy.”
R is a programming language designed for data analysis, and Wickham earned his reputation by becoming a preeminent developer of R packages — programming tools that simplify the code for common tasks such as aggregating and plotting data. Wickham has helped tens of thousands of people work more productively, which has earned him their gratitude and, at times, outright adulation. Users of his packages include tech giants such as Google, Facebook and Twitter, news organizations such as the New York Times and FiveThirtyEight, and government agencies such as the Food and Drug Administration (FDA) and the Drug Enforcement Administration (DEA).
He is, in short, a giant among nerds.
***
Wickham was born into a family of statisticians in Hamilton, New Zealand. His father, Brian Wickham, earned a PhD from Cornell University in animal breeding, a field that makes heavy use of statistics; his sister holds a PhD in statistics from UC Berkeley. If the world has data-structure prodigies, Wickham may well be one of them. He recalls his early experience with some pride: “My first job, when I was 15, was building a Microsoft Access database. I thought it was pretty interesting. I wrote documentation for the database, and they are still using it today.”
From that first job onward, Wickham kept asking whether there might be a better way to store and manipulate data. “I’ve always been pretty confident I could find a better way,” he explains, “and that it could benefit other people.” Though he was still feeling his way at the time, it was then that he “internalized” the concept of Third Normal Form, which would play an important role in his later work. Third Normal Form is, in essence, a way of structuring data that reduces redundancy and ensures consistency. Wickham calls data in this form “tidy” data, and his tools both popularize and depend on this structure.

The R logo; the language’s revolutionary evolution is due in part to Hadley Wickham
Wickham first encountered R while studying for his undergraduate degree in statistics at the University of Auckland, New Zealand. He describes it as “a programming language for understanding data.” Alongside SQL and Python, R is one of the most popular languages among data scientists.
Like Wickham, the language he would go on to revolutionize comes from New Zealand. R was created in 1993 by Ross Ihaka and Robert Gentleman, statisticians at the University of Auckland. Because the language is tailor-made for data analysis, and because parts of it are unusual (for example, how data structures are indexed, and the requirement that data be held in memory), programmers familiar with other languages often find R strange. Having written Java, VBA and PHP, Wickham saw it differently: “[Many programmers] come to R and find it weird, but I never did. I thought it was really interesting.”
Wickham began developing R packages while working on his PhD at Iowa State University. In his words, writing a package means “writing some code that helps people solve a problem, and then writing documentation that helps people understand how to use it.” His first package, written as part of a course project, was meant to visualize biochemical data. Although it was never released, it firmly planted in him the idea of sharing his tools.
In 2005, Wickham released the reshape package, the first in a string of hits; it has been downloaded hundreds of thousands of times since. reshape set out to make aggregating and manipulating data less “tedious and frustrating.” To non-programmers, simplifying data reshaping may not sound like much, but for data scientists and statisticians it is often the most time-consuming part of the job.
Wickham was clearly encouraged by reshape’s success. He built it precisely because he believed he could do better than what came before. Though not given to boasting, he does not lack confidence. “I have a strong belief that I know the right way to solve a problem,” he repeats. “Whether that’s a good thing or a bad thing, I don’t know.”
***
Even as reshape and several other packages took off, Wickham was growing disillusioned with statistics. During his PhD he noticed that “what was taught in school had almost nothing to do with what people actually need in order to understand data.” Unlike statisticians devoted to abstruse variants of the central limit theorem, Wickham set out to make data analysis easier for ordinary people. As he puts it: “There will surely be ivory-tower statisticians who deny that what I do is statistics, but I think they’re wrong. What I do goes back to the roots of statistics. The very existence of data science as a discipline shows that orthodox statistics has a huge gap. To me, this is what statistics is: gaining insight from data through modeling and visualization. Data cleaning and manipulation is dirty, tedious work, and orthodox statistics has simply walked away and declared it not their problem.”
Along this road of disillusionment, Wickham developed ggplot2. Downloaded millions of times to date, it is not only his most successful work but has also changed how many people think about data visualization. Its enormous success prompted him to leave academia and become Chief Scientist at RStudio (the for-profit developer of R’s most popular integrated development environment), so that he could devote himself to improving R.
Hadley Wickham presenting a chart drawn with ggplot2. Photo: David Kahle and Garrett Grolemund
The ggplot2 package implements a form of data visualization based on statistician Leland Wilkinson’s “grammar of graphics.” Wickham describes ggplot2 and the grammar of graphics as “a way of thinking about visualization not as a series of mechanical operations (draw a line from here to there, put a point here, color this rectangle) but as a mapping from data to the things you can see.”
The ideas behind the grammar of graphics are quite abstract. The biggest one is that a chart is composed of “geoms” (the graphical elements we see on a chart, such as a point or a bar) and “aesthetics” (how those geoms are placed and styled). That may not sound revolutionary, but Wickham’s implementation of the idea has made plotting far easier for hundreds of thousands of people. Nearly 9,000 questions on the Q&A site Stack Overflow are tagged ggplot2, and some even say ggplot2 makes plotting in R “fun.” Charts drawn with ggplot2 have appeared in Nature, on FiveThirtyEight and in the New York Times.

Hadley Wickham holding a Chinese translation of a book about his visualization package ggplot2; photo from statr
Beyond developing ggplot2 and reshape, Wickham has designed a number of other widely used packages that solve other important problems for data scientists. Want to manipulate data easily in the form of words (strings)? Want to scrape data from the web? Need to write your own package without pain? Wickham has you covered.
On Quora [a question-and-answer site — translator’s note], one R user asked: “How is Hadley Wickham able to contribute so much to R, particularly in the form of packages? I still cannot work out in detail just how much Hadley has produced. It seems impossible for one person to build so much…” Eduardo Arino de la Rubia, an active member of the R community, says every successful programming language needs a “celebrity” like Hadley. He compares Hadley to David Heinemeier Hansson (creator of the web application framework Ruby on Rails) and Tatsuhiko Miyagawa (a major developer of the programming language Perl).
The chart below shows the release dates and download counts of Hadley’s 17 packages with more than 2,000 downloads (sometimes jokingly called the “Hadleyverse”). The download figures are a severe undercount, since they reflect only one popular download source starting in late 2012. And yes, the chart was drawn with one of Hadley’s packages (ggvis).
Dan Kopf, Priceonomics; data source: cranlogs
So why did Hadley create all of this? R is free to download and all of the packages are free, so money is at best a secondary incentive. Simply put, when a problem is harder to solve than it ought to be, it gnaws at Wickham. While “most other people can accept that life is hard,” Wickham cannot. As he says: “One of the things that makes me successful is that I am extraordinarily sensitive to frustration.” That sensitivity has earned him a “peculiar fame.”
Most of the time Wickham goes unnoticed, but at R meetups and statistics conferences he becomes a rock star. “I can see my fame reaching a level that makes me uncomfortable,” he says. He wishes someone would write a book on “how to be a celebrity in a very particular niche,” and he worries about how to behave when people gush over him. Though by now used to the notoriety, he still gets excited that people use the tools he created. He enjoys checking how many people use them at “Facebook, Google, Twitter, Tumblr…” Only in San Francisco, he says, are the odds of being recognized on the street somewhat higher. He also mentions that a recent visit to the news outlet FiveThirtyEight delighted him; he finds it cool to see how others use his tools (they draw their charts with a highly customized ggplot2).
***
Above all, Wickham delights in empowering people who love playing with data. As he explains: “To be able to fundamentally understand the world through data is a really, really cool thing. The analysis that excites me is not Google crawling a terabyte of web-advertising data to optimize revenue, [but] the biologist with absolute passion who can now use, and understand, R.”



Saturday, September 19, 2015

ŷhat | 10 R packages I wish I knew about earlier

A brief introduction to “apply” in R


https://nsaunders.wordpress.com/2010/08/20/a-brief-introduction-to-apply-in-r
At any R Q&A site, you’ll frequently see an exchange like this one:
Q: How can I use a loop to […insert task here…] ?
A: Don’t. Use one of the apply functions.
So, what are these wondrous apply functions and how do they work? I think the best way to figure out anything in R is to learn by experimentation, using embarrassingly trivial data and functions.

If you fire up your R console, type “??apply” and scroll down to the functions in the base package, you’ll see something like this:
base::apply             Apply Functions Over Array Margins
base::by                Apply a Function to a Data Frame Split by Factors
base::eapply            Apply a Function Over Values in an Environment
base::lapply            Apply a Function over a List or Vector
base::mapply            Apply a Function to Multiple List or Vector Arguments
base::rapply            Recursively Apply a Function to a List
base::tapply            Apply a Function Over a Ragged Array
Let’s examine each of those.
1. apply
Description: “Returns a vector or array or list of values obtained by applying a function to margins of an array or matrix.”
OK – we know about vectors/arrays and functions, but what are these “margins”? Simple: either the rows (1), the columns (2) or both (1:2). By “both”, we mean “apply the function to each individual value.” An example:
# create a matrix of 10 rows x 2 columns
m <- matrix(c(1:10, 11:20), nrow = 10, ncol = 2)
# mean of the rows
apply(m, 1, mean)
 [1]  6  7  8  9 10 11 12 13 14 15
# mean of the columns
apply(m, 2, mean)
[1]  5.5 15.5
# divide all values by 2
apply(m, 1:2, function(x) x/2)
      [,1] [,2]
 [1,]  0.5  5.5
 [2,]  1.0  6.0
 [3,]  1.5  6.5
 [4,]  2.0  7.0
 [5,]  2.5  7.5
 [6,]  3.0  8.0
 [7,]  3.5  8.5
 [8,]  4.0  9.0
 [9,]  4.5  9.5
[10,]  5.0 10.0
That last example was rather trivial; you could just as easily do “m[, 1:2]/2” – but you get the idea.
2. by
Updated 27/2/14: note that the original example in this section no longer works; use colMeans now instead of mean.
Description: “Function ‘by’ is an object-oriented wrapper for ‘tapply’ applied to data frames.”
The by function is a little more complex than that. Read a little further and the documentation tells you that “a data frame is split by row into data frames subsetted by the values of one or more factors, and function ‘FUN’ is applied to each subset in turn.” So, we use this one where factors are involved.
To illustrate, we can load up the classic R dataset “iris”, which contains a bunch of flower measurements:
attach(iris)
head(iris)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa
 
# get the mean of the first 4 variables, by species
by(iris[, 1:4], Species, colMeans)
Species: setosa
Sepal.Length  Sepal.Width Petal.Length  Petal.Width
       5.006        3.428        1.462        0.246
------------------------------------------------------------
Species: versicolor
Sepal.Length  Sepal.Width Petal.Length  Petal.Width
       5.936        2.770        4.260        1.326
------------------------------------------------------------
Species: virginica
Sepal.Length  Sepal.Width Petal.Length  Petal.Width
       6.588        2.974        5.552        2.026
Essentially, by provides a way to split your data by factors and do calculations on each subset. It returns an object of class “by” and there are many, more complex ways to use it.
3. eapply
Description: “eapply applies FUN to the named values from an environment and returns the results as a list.”
This one is a little trickier, since you need to know something about environments in R. An environment, as the name suggests, is a self-contained object with its own variables and functions. To continue using our very simple example:
# a new environment
e <- new.env()
# two environment variables, a and b
e$a <- 1:10
e$b <- 11:20
# mean of the variables
eapply(e, mean)
$b
[1] 15.5
 
$a
[1] 5.5
I don’t often create my own environments, but they’re commonly used by R packages such as Bioconductor so it’s good to know how to handle them.
4. lapply
Description: “lapply returns a list of the same length as X, each element of which is the result of applying FUN to the corresponding element of X.”
That’s a nice, clear description which makes lapply one of the easier apply functions to understand. A simple example:
# create a list with 2 elements
l <- list(a = 1:10, b = 11:20)
# the mean of the values in each element
lapply(l, mean)
$a
[1] 5.5
 
$b
[1] 15.5
 
# the sum of the values in each element
lapply(l, sum)
$a
[1] 55
 
$b
[1] 155
The lapply documentation tells us to consult further documentation for sapply, vapply and replicate. Let’s do that.
    4.1 sapply
Description: “sapply is a user-friendly version of lapply by default returning a vector or matrix if appropriate.”
That simply means that if lapply would have returned a list with elements $a and $b, sapply will return either a vector, with elements [[‘a’]] and [[‘b’]], or a matrix with column names “a” and “b”. Returning to our previous simple example:
# create a list with 2 elements
l <- list(a = 1:10, b = 11:20)
# mean of values using sapply
l.mean <- sapply(l, mean)
# what type of object was returned?
class(l.mean)
[1] "numeric"
# it's a numeric vector, so we can get element "a" like this
l.mean[['a']]
[1] 5.5
    4.2 vapply
Description: “vapply is similar to sapply, but has a pre-specified type of return value, so it can be safer (and sometimes faster) to use.”
A third argument is supplied to vapply, which you can think of as a kind of template for the output. The documentation uses the fivenum function as an example, so let’s go with that:
l <- list(a = 1:10, b = 11:20)
# fivenum of values using vapply
l.fivenum <- vapply(l, fivenum, c(Min.=0, "1st Qu."=0, Median=0, "3rd Qu."=0, Max.=0))
class(l.fivenum)
[1] "matrix"
# let's see it
l.fivenum
           a    b
Min.     1.0 11.0
1st Qu.  3.0 13.0
Median   5.5 15.5
3rd Qu.  8.0 18.0
Max.    10.0 20.0
So, vapply returned a matrix, where the column names correspond to the original list elements and the row names to the output template. Nice.
    4.3 replicate
Description: “replicate is a wrapper for the common use of sapply for repeated evaluation of an expression (which will usually involve random number generation).”
The replicate function is very useful. Give it two mandatory arguments: the number of replications and the function to replicate; a third optional argument, simplify = T, tries to simplify the result to a vector or matrix. An example – let’s simulate 10 normal distributions, each with 10 observations:
replicate(10, rnorm(10))
             [,1]        [,2]        [,3]       [,4]        [,5]         [,6]
 [1,]  0.67947001 -1.94649409  0.28144696  0.5872913  2.22715085 -0.275918282
 [2,]  1.17298643 -0.01529898 -1.47314092 -1.3274354 -0.04105249  0.528666264
 [3,]  0.77272662 -2.36122644  0.06397576  1.5870779 -0.33926083  1.121164338
 [4,] -0.42702542 -0.90613885  0.83645668 -0.5462608 -0.87458396 -0.723858258
 [5,] -0.73892937 -0.57486661 -0.04418200 -0.1120936  0.08253614  1.319095242
 [6,]  2.93827883 -0.33363446  0.55405024 -0.4942736  0.66407615 -0.153623614
 [7,]  1.30037496 -0.26207115  0.49818215  1.0774543 -0.28206908  0.825488436
 [8,] -0.04153545 -0.23621632 -1.01192741  0.4364413 -2.28991601 -0.002867193
 [9,]  0.01262547  0.40247248  0.65816829  0.9541927 -1.63770154  0.328180660
[10,]  0.96525278 -0.37850821 -0.85869035 -0.6055622  1.13756753 -0.371977151
             [,7]        [,8]       [,9]       [,10]
 [1,]  0.03928297  0.34990909 -0.3159794  1.08871657
 [2,] -0.79258805 -0.30329668 -1.0902070  0.73356542
 [3,]  0.10673459 -0.02849216  0.8094840  0.06446245
 [4,] -0.84584079 -0.57308461 -1.3570979 -0.89801330
 [5,] -1.50226560 -2.35751419  1.2104163  0.74650696
 [6,] -0.32790991  0.80144695 -0.0071844  0.05742356
 [7,]  1.36719970  2.34148354  0.9148911  0.20451421
 [8,] -0.51112579 -0.53658159  1.5194130 -0.94250069
 [9,]  0.52017814 -1.22252527  0.4519702  0.08779704
[10,]  1.35908918  1.09024342  0.5912627 -0.20709053
5. mapply
Description: “mapply is a multivariate version of sapply. mapply applies FUN to the first elements of each (…) argument, the second elements, the third elements, and so on.”
The mapply documentation is full of quite complex examples, but here’s a simple, silly one:
l1 <- list(a = c(1:10), b = c(11:20))
l2 <- list(c = c(21:30), d = c(31:40))
# sum the corresponding elements of l1 and l2
mapply(sum, l1$a, l1$b, l2$c, l2$d)
 [1]  64  68  72  76  80  84  88  92  96 100
Here, we sum l1$a[1] + l1$b[1] + l2$c[1] + l2$d[1] (1 + 11 + 21 + 31) to get 64, the first element of the returned list. All the way through to l1$a[10] + l1$b[10] + l2$c[10] + l2$d[10] (10 + 20 + 30 + 40) = 100, the last element.
6. rapply
Description: “rapply is a recursive version of lapply.”
I think “recursive” is a little misleading. What rapply does is apply functions to lists in different ways, depending on the arguments supplied. Best illustrated by examples:
# let's start with our usual simple list example
l <- list(a = 1:10, b = 11:20)
# log2 of each value in the list
rapply(l, log2)
      a1       a2       a3       a4       a5       a6       a7       a8
0.000000 1.000000 1.584963 2.000000 2.321928 2.584963 2.807355 3.000000
      a9      a10       b1       b2       b3       b4       b5       b6
3.169925 3.321928 3.459432 3.584963 3.700440 3.807355 3.906891 4.000000
      b7       b8       b9      b10
4.087463 4.169925 4.247928 4.321928
# log2 of each value in each list
rapply(l, log2, how = "list")
$a
 [1] 0.000000 1.000000 1.584963 2.000000 2.321928 2.584963 2.807355 3.000000
 [9] 3.169925 3.321928
 
$b
 [1] 3.459432 3.584963 3.700440 3.807355 3.906891 4.000000 4.087463 4.169925
 [9] 4.247928 4.321928
 
# what if the function is the mean?
rapply(l, mean)
   a    b
 5.5 15.5
 
rapply(l, mean, how = "list")
$a
[1] 5.5
 
$b
[1] 15.5
So, the output of rapply depends on both the function and the how argument. When how = “list” (or “replace”), the original list structure is preserved. Otherwise, the default is to unlist, which results in a vector.
You can also pass a “classes=” argument to rapply. For example, in a mixed list of numeric and character variables, you could specify that the function act only on the numeric values with “classes = numeric”.
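For example:

```r
# a mixed list of numeric and character elements
l <- list(a = c(1, 2, 3), b = "hello", d = c(2.5, 3.5))

# double only the numeric elements; how = "replace" preserves the list
# structure and leaves the character element untouched
rapply(l, function(x) x * 2, classes = "numeric", how = "replace")
```

Note that classes are matched literally here, so an integer vector created with the `:` operator has class "integer" and would not be touched by classes = "numeric".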
7. tapply
Description: “Apply a function to each cell of a ragged array, that is to each (non-empty) group of values given by a unique combination of the levels of certain factors.”
Woah there. That sounds complicated. Don’t panic though, it becomes clearer when the required arguments are described. Usage is “tapply(X, INDEX, FUN = NULL, …, simplify = TRUE)”, where X is “an atomic object, typically a vector” and INDEX is “a list of factors, each of same length as X”.
So, to go back to the famous iris data, “Species” might be a factor and “iris$Petal.Width” would give us a vector of values. We could then run something like:
attach(iris)
# mean petal length by species
tapply(iris$Petal.Length, Species, mean)
    setosa versicolor  virginica
     1.462      4.260      5.552
Summary
I’ve used very simple examples here, with contrived data and standard functions (such as mean and sum). For me, this is the easiest way to learn what a function does: I can look at the original data, then the result, and figure out what happened. However, the “apply” family is much more powerful than these illustrations suggest – I encourage you to play around with it.
The things to consider when choosing an apply function are basically:
  • What class is my input data? – vector, matrix, data frame…
  • On which subsets of that data do I want the function to act? – rows, columns, all values…
  • What class will the function return? How is the original data structure transformed?
It’s the usual input-process-output story: what do I have, what do I want, and what lies in between?