Happy birthday, data.table!

New analysis of authors and contributors

Since this is the 20th anniversary of Matt’s original CRAN submission, I wanted to do some analysis of contributors over time, to emphasize the great community that has been working to improve data.table in recent years. To do that, we first download data on all releases, using code from my previous post about the release history of data.table.

Download Archive web page

We can download the Archive web page for data.table via the code below,

Archive <- "https://cloud.r-project.org/src/contrib/Archive/"
get_Archive <- function(Package, releases.dir="~/releases"){
  dir.create(releases.dir, showWarnings = FALSE)
  pkg.html <- file.path(releases.dir, paste0(Package, ".html"))
  if(!file.exists(pkg.html)){
    u <- paste0(Archive, Package)
    download.file(u, pkg.html)
  }
  readLines(pkg.html)
}
(Archive.data.table <- get_Archive("data.table"))

 [1] "<!DOCTYPE HTML PUBLIC \"-//W3C//DTD HTML 3.2 Final//EN\">"                                                                                                                                                                                     
 [2] "<html>"                                                                                                                                                                                                                                        
 [3] " <head>"                                                                                                                                                                                                                                       
 [4] "  <title>Index of /src/contrib/Archive/data.table</title>"                                                                                                                                                                                     
 [5] " </head>"                                                                                                                                                                                                                                      
 [6] " <body>"                                                                                                                                                                                                                                       
 [7] "<h1>Index of /src/contrib/Archive/data.table</h1>"                                                                                                                                                                                             
 [8] "<pre>      <a href=\"?C=N;O=D\">Name</a>                       <a href=\"?C=M;O=A\">Last modified</a>      <a href=\"?C=S;O=A\">Size</a>  <hr>      <a href=\"/src/contrib/Archive/\">Parent Directory</a>                                -   "
 [9] "      <a href=\"data.table_1.0.tar.gz\">data.table_1.0.tar.gz</a>      2006-04-14 22:03   16K  "                                                                                                                                               
[10] "      <a href=\"data.table_1.1.tar.gz\">data.table_1.1.tar.gz</a>      2008-08-27 07:35   40K  "                                                                                                                                               
[11] "      <a href=\"data.table_1.10.0.tar.gz\">data.table_1.10.0.tar.gz</a>   2016-12-03 10:05  2.9M  "                                                                                                                                            
[12] "      <a href=\"data.table_1.10.2.tar.gz\">data.table_1.10.2.tar.gz</a>   2017-01-31 15:09  2.9M  "                                                                                                                                            
[13] "      <a href=\"data.table_1.10.4-1.tar.gz\">data.table_1.10.4-1.tar.gz</a> 2017-10-09 22:36  2.9M  "                                                                                                                                          
[14] "      <a href=\"data.table_1.10.4-2.tar.gz\">data.table_1.10.4-2.tar.gz</a> 2017-10-12 14:03  2.9M  "                                                                                                                                          
[15] "      <a href=\"data.table_1.10.4-3.tar.gz\">data.table_1.10.4-3.tar.gz</a> 2017-10-27 07:40  2.9M  "                                                                                                                                          
[16] "      <a href=\"data.table_1.10.4.tar.gz\">data.table_1.10.4.tar.gz</a>   2017-02-01 14:52  2.9M  "                                                                                                                                            
[17] "      <a href=\"data.table_1.11.0.tar.gz\">data.table_1.11.0.tar.gz</a>   2018-05-01 17:00  3.1M  "                                                                                                                                            
[18] "      <a href=\"data.table_1.11.2.tar.gz\">data.table_1.11.2.tar.gz</a>   2018-05-08 16:16  3.1M  "                                                                                                                                            
[19] "      <a href=\"data.table_1.11.4.tar.gz\">data.table_1.11.4.tar.gz</a>   2018-05-27 16:34  3.1M  "                                                                                                                                            
[20] "      <a href=\"data.table_1.11.6.tar.gz\">data.table_1.11.6.tar.gz</a>   2018-09-19 22:10  3.2M  "                                                                                                                                            
[21] "      <a href=\"data.table_1.11.8.tar.gz\">data.table_1.11.8.tar.gz</a>   2018-09-30 13:30  3.1M  "                                                                                                                                            
[22] "      <a href=\"data.table_1.12.0.tar.gz\">data.table_1.12.0.tar.gz</a>   2019-01-13 11:50  3.2M  "                                                                                                                                            
[23] "      <a href=\"data.table_1.12.2.tar.gz\">data.table_1.12.2.tar.gz</a>   2019-04-07 10:06  3.2M  "                                                                                                                                            
[24] "      <a href=\"data.table_1.12.4.tar.gz\">data.table_1.12.4.tar.gz</a>   2019-10-03 09:10  4.8M  "                                                                                                                                            
[25] "      <a href=\"data.table_1.12.6.tar.gz\">data.table_1.12.6.tar.gz</a>   2019-10-18 22:20  4.7M  "                                                                                                                                            
[26] "      <a href=\"data.table_1.12.8.tar.gz\">data.table_1.12.8.tar.gz</a>   2019-12-09 10:30  4.7M  "                                                                                                                                            
[27] "      <a href=\"data.table_1.13.0.tar.gz\">data.table_1.13.0.tar.gz</a>   2020-07-24 09:40  5.0M  "                                                                                                                                            
[28] "      <a href=\"data.table_1.13.2.tar.gz\">data.table_1.13.2.tar.gz</a>   2020-10-19 18:50  5.0M  "                                                                                                                                            
[29] "      <a href=\"data.table_1.13.4.tar.gz\">data.table_1.13.4.tar.gz</a>   2020-12-08 10:10  5.0M  "                                                                                                                                            
[30] "      <a href=\"data.table_1.13.6.tar.gz\">data.table_1.13.6.tar.gz</a>   2020-12-30 15:50  5.1M  "                                                                                                                                            
[31] "      <a href=\"data.table_1.14.0.tar.gz\">data.table_1.14.0.tar.gz</a>   2021-02-21 06:00  5.1M  "                                                                                                                                            
[32] "      <a href=\"data.table_1.14.10.tar.gz\">data.table_1.14.10.tar.gz</a>  2023-12-08 11:20  5.1M  "                                                                                                                                           
[33] "      <a href=\"data.table_1.14.2.tar.gz\">data.table_1.14.2.tar.gz</a>   2021-09-27 16:30  5.1M  "                                                                                                                                            
[34] "      <a href=\"data.table_1.14.4.tar.gz\">data.table_1.14.4.tar.gz</a>   2022-10-17 10:32  5.1M  "                                                                                                                                            
[35] "      <a href=\"data.table_1.14.6.tar.gz\">data.table_1.14.6.tar.gz</a>   2022-11-16 21:30  5.0M  "                                                                                                                                            
[36] "      <a href=\"data.table_1.14.8.tar.gz\">data.table_1.14.8.tar.gz</a>   2023-02-17 12:20  5.1M  "                                                                                                                                            
[37] "      <a href=\"data.table_1.15.0.tar.gz\">data.table_1.15.0.tar.gz</a>   2024-01-30 07:40  5.1M  "                                                                                                                                            
[38] "      <a href=\"data.table_1.15.2.tar.gz\">data.table_1.15.2.tar.gz</a>   2024-02-29 07:10  5.1M  "                                                                                                                                            
[39] "      <a href=\"data.table_1.15.4.tar.gz\">data.table_1.15.4.tar.gz</a>   2024-03-30 23:50  5.1M  "                                                                                                                                            
[40] "      <a href=\"data.table_1.16.0.tar.gz\">data.table_1.16.0.tar.gz</a>   2024-08-27 04:20  5.1M  "                                                                                                                                            
[41] "      <a href=\"data.table_1.16.2.tar.gz\">data.table_1.16.2.tar.gz</a>   2024-10-10 16:10  5.2M  "                                                                                                                                            
[42] "      <a href=\"data.table_1.16.4.tar.gz\">data.table_1.16.4.tar.gz</a>   2024-12-06 15:10  5.2M  "                                                                                                                                            
[43] "      <a href=\"data.table_1.17.0.tar.gz\">data.table_1.17.0.tar.gz</a>   2025-02-22 06:10  5.6M  "                                                                                                                                            
[44] "      <a href=\"data.table_1.17.2.tar.gz\">data.table_1.17.2.tar.gz</a>   2025-05-12 11:10  5.6M  "                                                                                                                                            
[45] "      <a href=\"data.table_1.17.4.tar.gz\">data.table_1.17.4.tar.gz</a>   2025-05-26 12:40  5.6M  "                                                                                                                                            
[46] "      <a href=\"data.table_1.17.6.tar.gz\">data.table_1.17.6.tar.gz</a>   2025-06-17 03:40  5.6M  "                                                                                                                                            
[47] "      <a href=\"data.table_1.17.8.tar.gz\">data.table_1.17.8.tar.gz</a>   2025-07-10 10:30  5.5M  "                                                                                                                                            
[48] "      <a href=\"data.table_1.18.0.tar.gz\">data.table_1.18.0.tar.gz</a>   2025-12-24 12:05  5.7M  "                                                                                                                                            
[49] "      <a href=\"data.table_1.2.tar.gz\">data.table_1.2.tar.gz</a>      2008-09-01 06:59   40K  "                                                                                                                                               
[50] "      <a href=\"data.table_1.4.1.tar.gz\">data.table_1.4.1.tar.gz</a>    2010-05-03 08:40  344K  "                                                                                                                                             
[51] "      <a href=\"data.table_1.5.1.tar.gz\">data.table_1.5.1.tar.gz</a>    2011-01-08 08:31  589K  "                                                                                                                                             
[52] "      <a href=\"data.table_1.5.2.tar.gz\">data.table_1.5.2.tar.gz</a>    2011-01-21 09:03  607K  "                                                                                                                                             
[53] "      <a href=\"data.table_1.5.3.tar.gz\">data.table_1.5.3.tar.gz</a>    2011-02-11 08:49  623K  "                                                                                                                                             
[54] "      <a href=\"data.table_1.5.tar.gz\">data.table_1.5.tar.gz</a>      2010-09-14 06:23  589K  "                                                                                                                                               
[55] "      <a href=\"data.table_1.6.1.tar.gz\">data.table_1.6.1.tar.gz</a>    2011-06-29 09:41  692K  "                                                                                                                                             
[56] "      <a href=\"data.table_1.6.2.tar.gz\">data.table_1.6.2.tar.gz</a>    2011-07-02 14:21  693K  "                                                                                                                                             
[57] "      <a href=\"data.table_1.6.3.tar.gz\">data.table_1.6.3.tar.gz</a>    2011-08-04 11:28  698K  "                                                                                                                                             
[58] "      <a href=\"data.table_1.6.4.tar.gz\">data.table_1.6.4.tar.gz</a>    2011-08-10 05:50  705K  "                                                                                                                                             
[59] "      <a href=\"data.table_1.6.5.tar.gz\">data.table_1.6.5.tar.gz</a>    2011-08-25 04:35  711K  "                                                                                                                                             
[60] "      <a href=\"data.table_1.6.6.tar.gz\">data.table_1.6.6.tar.gz</a>    2011-08-25 20:08  712K  "                                                                                                                                             
[61] "      <a href=\"data.table_1.6.tar.gz\">data.table_1.6.tar.gz</a>      2011-04-24 06:07  684K  "                                                                                                                                               
[62] "      <a href=\"data.table_1.7.1.tar.gz\">data.table_1.7.1.tar.gz</a>    2011-10-22 12:05  728K  "                                                                                                                                             
[63] "      <a href=\"data.table_1.7.10.tar.gz\">data.table_1.7.10.tar.gz</a>   2012-02-07 08:43  758K  "                                                                                                                                            
[64] "      <a href=\"data.table_1.7.2.tar.gz\">data.table_1.7.2.tar.gz</a>    2011-11-07 14:05  735K  "                                                                                                                                             
[65] "      <a href=\"data.table_1.7.3.tar.gz\">data.table_1.7.3.tar.gz</a>    2011-11-25 07:12  741K  "                                                                                                                                             
[66] "      <a href=\"data.table_1.7.4.tar.gz\">data.table_1.7.4.tar.gz</a>    2011-11-29 06:57  741K  "                                                                                                                                             
[67] "      <a href=\"data.table_1.7.5.tar.gz\">data.table_1.7.5.tar.gz</a>    2011-12-04 12:51  742K  "                                                                                                                                             
[68] "      <a href=\"data.table_1.7.6.tar.gz\">data.table_1.7.6.tar.gz</a>    2011-12-13 08:36  743K  "                                                                                                                                             
[69] "      <a href=\"data.table_1.7.7.tar.gz\">data.table_1.7.7.tar.gz</a>    2011-12-15 10:07  744K  "                                                                                                                                             
[70] "      <a href=\"data.table_1.7.8.tar.gz\">data.table_1.7.8.tar.gz</a>    2012-01-25 07:53  754K  "                                                                                                                                             
[71] "      <a href=\"data.table_1.7.9.tar.gz\">data.table_1.7.9.tar.gz</a>    2012-01-31 07:30  756K  "                                                                                                                                             
[72] "      <a href=\"data.table_1.8.0.tar.gz\">data.table_1.8.0.tar.gz</a>    2012-07-16 08:21  768K  "                                                                                                                                             
[73] "      <a href=\"data.table_1.8.10.tar.gz\">data.table_1.8.10.tar.gz</a>   2013-09-03 04:41  914K  "                                                                                                                                            
[74] "      <a href=\"data.table_1.8.2.tar.gz\">data.table_1.8.2.tar.gz</a>    2012-07-17 19:51  799K  "                                                                                                                                             
[75] "      <a href=\"data.table_1.8.4.tar.gz\">data.table_1.8.4.tar.gz</a>    2012-11-09 15:23  820K  "                                                                                                                                             
[76] "      <a href=\"data.table_1.8.6.tar.gz\">data.table_1.8.6.tar.gz</a>    2012-11-13 13:28  821K  "                                                                                                                                             
[77] "      <a href=\"data.table_1.8.8.tar.gz\">data.table_1.8.8.tar.gz</a>    2013-03-06 06:31  874K  "                                                                                                                                             
[78] "      <a href=\"data.table_1.9.2.tar.gz\">data.table_1.9.2.tar.gz</a>    2014-02-27 13:49  1.0M  "                                                                                                                                             
[79] "      <a href=\"data.table_1.9.4.tar.gz\">data.table_1.9.4.tar.gz</a>    2014-10-02 06:41  927K  "                                                                                                                                             
[80] "      <a href=\"data.table_1.9.6.tar.gz\">data.table_1.9.6.tar.gz</a>    2015-09-19 20:13  3.5M  "                                                                                                                                             
[81] "      <a href=\"data.table_1.9.8.tar.gz\">data.table_1.9.8.tar.gz</a>    2016-11-25 11:55  2.9M  "                                                                                                                                             
[82] "<hr></pre>"                                                                                                                                                                                                                                    
[83] "<address>Apache/2.4.65 (Unix) Server at cloud.r-project.org Port 80</address>"                                                                                                                                                                 
[84] "</body></html>"

The output above shows that the Archive web page has a regular structure, which we can convert into a data table using the regular expression pattern below.

file.pattern <- list(
  '(?<=>)',
  package=".*?",
  "_",
  version="[0-9.-]+",
  "[.]tar[.]gz")

The code above specifies a regular expression:

'(?<=>)' is a lookbehind assertion. It means the match must be immediately preceded by a greater-than sign, which is not itself included in the match,
package=".*?" means to match zero or more of any character except newline (non-greedy, as few as possible), and output the match in the package column,
"_" means to match an underscore,
version="[0-9.-]+" means to match one or more digits/dots/dashes, and output them in the version column,
"[.]tar[.]gz" means to match the .tar.gz file name suffix.

Below we use that pattern to convert the web page into a data table with two columns,

options(datatable.print.nrows=20) # instead of default 100.
nc::capture_all_str(Archive.data.table, file.pattern)

       package  version
        <char>   <char>
 1: data.table      1.0
 2: data.table      1.1
 3: data.table   1.10.0
 4: data.table   1.10.2
 5: data.table 1.10.4-1
---                    
69: data.table    1.8.8
70: data.table    1.9.2
71: data.table    1.9.4
72: data.table    1.9.6
73: data.table    1.9.8
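
To see what the pattern matches without nc, here is a minimal base R sketch. It assumes that nc roughly pastes the pattern pieces together and wraps each named piece in a capture group; the hand-written pattern and the variable names (one.line, base.pattern, matched) are illustrative, not part of the original analysis.

```r
# One line copied from the Archive listing shown earlier.
one.line <- paste0(
  '      <a href="data.table_1.10.4-1.tar.gz">',
  'data.table_1.10.4-1.tar.gz</a> 2017-10-09 22:36  2.9M  ')
# Hand-written approximation of file.pattern (an assumption about
# what nc builds): lookbehind, package group, underscore,
# version group, then the .tar.gz suffix.
base.pattern <- "(?<=>)(.*?)_([0-9.-]+)[.]tar[.]gz"
match.info <- regexec(base.pattern, one.line, perl=TRUE)
# First element is the full match, then the package and version groups.
(matched <- regmatches(one.line, match.info)[[1]])
```

The full match starts right after the first greater-than sign, so the href attribute value is skipped and only the link text is captured.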

Next, we add to the pattern to match the release date,

library(data.table)
Archive.pattern <- list(
  file=file.pattern,
  "</a>",
  "\\s+",
  IDate=".*?", as.IDate,
  "\\s")

The code above has

file=file.pattern which means to apply the previous regex, and put the matching text in the file column,
"</a>" which matches the closing </a> tag
"\\s+" which matches one or more white space characters,
IDate=".*?", as.IDate, which matches zero or more characters (non-greedy, as few as possible), then uses as.IDate to convert the text to an efficient integer date, saved in the IDate column,
"\\s" which matches one white space character.

The end result is a table with one row for each matched package version, and one column for each of the named arguments:

(Archive.dt <- nc::capture_all_str(Archive.data.table, Archive.pattern))

                          file    package  version      IDate
                        <char>     <char>   <char>     <IDat>
 1:      data.table_1.0.tar.gz data.table      1.0 2006-04-14
 2:      data.table_1.1.tar.gz data.table      1.1 2008-08-27
 3:   data.table_1.10.0.tar.gz data.table   1.10.0 2016-12-03
 4:   data.table_1.10.2.tar.gz data.table   1.10.2 2017-01-31
 5: data.table_1.10.4-1.tar.gz data.table 1.10.4-1 2017-10-09
---                                                          
69:    data.table_1.8.8.tar.gz data.table    1.8.8 2013-03-06
70:    data.table_1.9.2.tar.gz data.table    1.9.2 2014-02-27
71:    data.table_1.9.4.tar.gz data.table    1.9.4 2014-10-02
72:    data.table_1.9.6.tar.gz data.table    1.9.6 2015-09-19
73:    data.table_1.9.8.tar.gz data.table    1.9.8 2016-11-25

The table above shows all matches, in the same order as the original Archive web page. Below we key the table by date, which sorts the rows by reference (without copying the table), and enables fast joins.

setkey(Archive.dt, IDate)
Archive.dt

Key: <IDate>
                        file    package version      IDate
                      <char>     <char>  <char>     <IDat>
 1:    data.table_1.0.tar.gz data.table     1.0 2006-04-14
 2:    data.table_1.1.tar.gz data.table     1.1 2008-08-27
 3:    data.table_1.2.tar.gz data.table     1.2 2008-09-01
 4:  data.table_1.4.1.tar.gz data.table   1.4.1 2010-05-03
 5:    data.table_1.5.tar.gz data.table     1.5 2010-09-14
---                                                       
69: data.table_1.17.2.tar.gz data.table  1.17.2 2025-05-12
70: data.table_1.17.4.tar.gz data.table  1.17.4 2025-05-26
71: data.table_1.17.6.tar.gz data.table  1.17.6 2025-06-17
72: data.table_1.17.8.tar.gz data.table  1.17.8 2025-07-10
73: data.table_1.18.0.tar.gz data.table  1.18.0 2025-12-24

We see the table above has been sorted by release date. Next, we define a grid of dates; for each one, we will search for the nearest release.

every.year.since.2016 <- seq(
  as.IDate("2016-04-14"),
  Sys.Date(),
  by="year")
(grid.dt <- setkey(data.table(
  grid.IDate=c(
    as.IDate("2006-04-14"), # first release.
    as.IDate("2011-04-14"), # fifth anniversary.
    every.year.since.2016))))

Key: <grid.IDate>
    grid.IDate
        <IDat>
 1: 2006-04-14
 2: 2011-04-14
 3: 2016-04-14
 4: 2017-04-14
 5: 2018-04-14
 6: 2019-04-14
 7: 2020-04-14
 8: 2021-04-14
 9: 2022-04-14
10: 2023-04-14
11: 2024-04-14
12: 2025-04-14
13: 2026-04-14

The code above sets the key of the grid, which sorts and enables fast joins. No variables were specified to setkey(); the default is to use all columns, in this case just one. Note that setkey() sets the key by reference, then returns the table.
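
Keyed tables like these can be used in rolling joins. As a toy sketch (the tables, versions, and dates below are made up for illustration), roll="nearest" matches each row of the table given as i to the row of x with the closest key value:

```r
library(data.table)
# Toy releases table, keyed by date (hypothetical versions and dates).
toy.releases <- data.table(
  version=c("a", "b", "c"),
  IDate=as.IDate(c("2020-01-01", "2020-06-01", "2021-03-01")),
  key="IDate")
# Toy grid of dates to look up.
toy.grid <- data.table(
  grid.IDate=as.IDate(c("2020-02-01", "2021-01-01")),
  key="grid.IDate")
# For each grid date, take the row of toy.releases with the closest IDate.
# x.IDate is the date from toy.releases, i.grid.IDate the date from toy.grid.
(toy.nearest <- toy.releases[toy.grid, .(
  grid=i.grid.IDate, version, release=x.IDate
), roll="nearest"])
```

Here 2020-02-01 is closest to release a (31 days earlier), and 2021-01-01 is closest to release c (59 days later), so release b is not selected.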

Next, we do a rolling join to find which releases are nearest to each date in the grid.

(nearest.dt <- unique(Archive.dt[grid.dt, .(
  file, version, package, release=x.IDate
), roll="nearest"]))

                        file version    package    release
                      <char>  <char>     <char>     <IDat>
 1:    data.table_1.0.tar.gz     1.0 data.table 2006-04-14
 2:    data.table_1.6.tar.gz     1.6 data.table 2011-04-24
 3:  data.table_1.9.6.tar.gz   1.9.6 data.table 2015-09-19
 4: data.table_1.10.4.tar.gz  1.10.4 data.table 2017-02-01
 5: data.table_1.11.0.tar.gz  1.11.0 data.table 2018-05-01
 6: data.table_1.12.2.tar.gz  1.12.2 data.table 2019-04-07
 7: data.table_1.13.0.tar.gz  1.13.0 data.table 2020-07-24
 8: data.table_1.14.0.tar.gz  1.14.0 data.table 2021-02-21
 9: data.table_1.14.4.tar.gz  1.14.4 data.table 2022-10-17
10: data.table_1.14.8.tar.gz  1.14.8 data.table 2023-02-17
11: data.table_1.15.4.tar.gz  1.15.4 data.table 2024-03-30
12: data.table_1.17.2.tar.gz  1.17.2 data.table 2025-05-12
13: data.table_1.18.0.tar.gz  1.18.0 data.table 2025-12-24

The output above shows one row per release we will analyze. For each release, we download the package sources from the Archive, and extract the Author field of DESCRIPTION.

desc.dt <- nearest.dt[, {
  ## cache downloaded tarballs, so each release is downloaded only once.
  cache.dir <- "~/Archive"
  dir.create(cache.dir, showWarnings = FALSE)
  dt.tar.gz <- file.path(cache.dir, file)
  if(!file.exists(dt.tar.gz)){
    url.tar.gz <- paste0(Archive, package, "/", file)
    download.file(url.tar.gz, dt.tar.gz)
  }
  ## extract only DESCRIPTION (under the working directory).
  conn <- gzfile(dt.tar.gz, "rb")
  DESCRIPTION <- file.path(package, "DESCRIPTION")
  untar(conn, files=DESCRIPTION)
  close(conn)
  ## read.dcf returns a character matrix; keep just the Author column.
  as.data.table(read.dcf(DESCRIPTION)[,"Author",drop=FALSE])
}, by=.(version, release)]
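
The read.dcf() step above can be tried in isolation. The sketch below writes a toy DESCRIPTION file to a temporary path (the package name and authors are hypothetical), reads the Author field, and collapses runs of white space, which is a slightly more defensive variant of replacing newlines only.

```r
# Write a toy DESCRIPTION file (hypothetical package and authors).
toy.DESCRIPTION <- tempfile()
writeLines(c(
  "Package: toy",
  "Version: 0.1",
  "Author: Ann Author [aut, cre],",
  "    Bob Helper [ctb]"),
  toy.DESCRIPTION)
# read.dcf returns a character matrix with one column per field;
# drop=FALSE keeps the matrix shape, so the column name is preserved.
toy.mat <- read.dcf(toy.DESCRIPTION)[, "Author", drop=FALSE]
# Collapse any white space (including a line break from the
# continuation line) to a single space.
(toy.author <- gsub("\\s+", " ", toy.mat[, "Author"]))
```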

To avoid printing the full Author column (a long string), we can set options that truncate long strings and limit the print width:

options(
  datatable.prettyprint.char=30, # print ... after this many characters.
  width=100) # max characters before wrapping columns to next line.
desc.dt

    version    release                              Author
     <char>     <IDat>                              <char>
 1:     1.0 2006-04-14                          Matt Dowle
 2:     1.6 2011-04-24   Matthew Dowle with many contri...
 3:   1.9.6 2015-09-19   M Dowle, A Srinivasan, T Short...
 4:  1.10.4 2017-02-01 Matt Dowle [aut, cre],\nArun Sri...
 5:  1.11.0 2018-05-01 Matt Dowle [aut, cre],\nArun Sri...
 6:  1.12.2 2019-04-07 Matt Dowle [aut, cre],\nArun Sri...
 7:  1.13.0 2020-07-24 Matt Dowle [aut, cre],\nArun Sri...
 8:  1.14.0 2021-02-21 Matt Dowle [aut, cre],\nArun Sri...
 9:  1.14.4 2022-10-17 Matt Dowle [aut, cre],\nArun Sri...
10:  1.14.8 2023-02-17 Matt Dowle [aut, cre],\nArun Sri...
11:  1.15.4 2024-03-30 Tyson Barrett [aut, cre],\nMatt ...
12:  1.17.2 2025-05-12   Tyson Barrett [aut, cre] (ORCI...
13:  1.18.0 2025-12-24   Tyson Barrett [aut, cre] (ORCI...

We see above that the Author field can contain newlines (after the comma), which we remove below, to make later parsing easier:

desc.dt[, no.newlines := gsub("\n", " ", Author)][]

    version    release                              Author                       no.newlines
     <char>     <IDat>                              <char>                            <char>
 1:     1.0 2006-04-14                          Matt Dowle                        Matt Dowle
 2:     1.6 2011-04-24   Matthew Dowle with many contri... Matthew Dowle with many contri...
 3:   1.9.6 2015-09-19   M Dowle, A Srinivasan, T Short... M Dowle, A Srinivasan, T Short...
 4:  1.10.4 2017-02-01 Matt Dowle [aut, cre],\nArun Sri... Matt Dowle [aut, cre], Arun Sr...
 5:  1.11.0 2018-05-01 Matt Dowle [aut, cre],\nArun Sri... Matt Dowle [aut, cre], Arun Sr...
 6:  1.12.2 2019-04-07 Matt Dowle [aut, cre],\nArun Sri... Matt Dowle [aut, cre], Arun Sr...
 7:  1.13.0 2020-07-24 Matt Dowle [aut, cre],\nArun Sri... Matt Dowle [aut, cre], Arun Sr...
 8:  1.14.0 2021-02-21 Matt Dowle [aut, cre],\nArun Sri... Matt Dowle [aut, cre], Arun Sr...
 9:  1.14.4 2022-10-17 Matt Dowle [aut, cre],\nArun Sri... Matt Dowle [aut, cre], Arun Sr...
10:  1.14.8 2023-02-17 Matt Dowle [aut, cre],\nArun Sri... Matt Dowle [aut, cre], Arun Sr...
11:  1.15.4 2024-03-30 Tyson Barrett [aut, cre],\nMatt ... Tyson Barrett [aut, cre], Matt...
12:  1.17.2 2025-05-12   Tyson Barrett [aut, cre] (ORCI... Tyson Barrett [aut, cre] (ORCI...
13:  1.18.0 2025-12-24   Tyson Barrett [aut, cre] (ORCI... Tyson Barrett [aut, cre] (ORCI...

The output above has a new column of comma-separated authors per release (with no newlines). We would like to convert these data to a table with one row per author in each release. A simple approach would be

head(sapply(strsplit(desc.dt$no.newlines, ", "), head))

[[1]]
[1] "Matt Dowle"

[[2]]
[1] "Matthew Dowle with many contributions from Tom Short.  See SVN logs on R-Forge."

[[3]]
[1] "M Dowle"                                       "A Srinivasan"                                 
[3] "T Short"                                       "S Lianoglou with contributions from R Saporta"
[5] "E Antonyan"                                   

[[4]]
[1] "Matt Dowle [aut"       "cre]"                  "Arun Srinivasan [aut]" "Jan Gorecki [ctb]"    
[5] "Tom Short [ctb]"       "Steve Lianoglou [ctb]"

[[5]]
[1] "Matt Dowle [aut"       "cre]"                  "Arun Srinivasan [aut]" "Jan Gorecki [ctb]"    
[5] "Michael Chirico [ctb]" "Pasha Stetsenko [ctb]"

[[6]]
[1] "Matt Dowle [aut"       "cre]"                  "Arun Srinivasan [aut]" "Jan Gorecki [ctb]"    
[5] "Michael Chirico [ctb]" "Pasha Stetsenko [ctb]"

It is clear that the result above does not quite work (Matt’s aut, cre role contains a comma, so it is broken into two entries). Instead we can use

author.pattern <- list(
  name=".+?",
  nc::quantifier(
    " \\[",
    roles=".+?",
    "\\]",
    "?"),
  nc::quantifier(
    " \\(", 
    paren=".+?",
    "\\)",
    "?"),
  ## each author ends with one of these (\z means end of string).
  nc::alternatives(" with (?:many )?contributions from ", ", ", "\\z"))
(author.dt <- desc.dt[, nc::capture_all_str(
  no.newlines, author.pattern
), by=.(version, release)])

     version    release                              name  roles       paren
      <char>     <IDat>                            <char> <char>      <char>
  1:     1.0 2006-04-14                        Matt Dowle                   
  2:     1.6 2011-04-24                     Matthew Dowle                   
  3:     1.6 2011-04-24 Tom Short.  See SVN logs on R-...                   
  4:   1.9.6 2015-09-19                           M Dowle                   
  5:   1.9.6 2015-09-19                      A Srinivasan                   
 ---                                                                        
550:  1.18.0 2025-12-24                      Reino Bruner    ctb            
551:  1.18.0 2025-12-24                        @badasahog    ctb GitHub user
552:  1.18.0 2025-12-24                      Vinit Thakur    ctb            
553:  1.18.0 2025-12-24                       Mukul Kumar    ctb            
554:  1.18.0 2025-12-24                    Ildikó Czeller    ctb
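
As a quick check of how this kind of pattern behaves, here is a minimal sketch that applies a simplified version of it to a single made-up Author string. Both toy.authors and toy.pattern are hypothetical names used only for illustration, and the code assumes the nc package is installed.

```r
library(data.table)
## Hypothetical Author field, for illustration only.
toy.authors <- "Matt Dowle [aut, cre], Arun Srinivasan [aut], Tom Short [ctb]"
toy.pattern <- list(
  name=".+?",
  ## optional roles in square brackets.
  nc::quantifier(" \\[", roles=".+?", "\\]", "?"),
  ## each author ends with a comma or the end of the string.
  nc::alternatives(", ", "\\z"))
(toy.dt <- nc::capture_all_str(toy.authors, toy.pattern))
```

Because the comma inside [aut, cre] is consumed by the roles group, it is not treated as a separator, so the result should have three rows rather than four.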

The table above has one row for each time a person appears in the Author field of one of the releases. Next we analyze the roles, starting with the entries that have none.

author.dt[roles==""]

   version    release                              name  roles  paren
    <char>     <IDat>                            <char> <char> <char>
1:     1.0 2006-04-14                        Matt Dowle              
2:     1.6 2011-04-24                     Matthew Dowle              
3:     1.6 2011-04-24 Tom Short.  See SVN logs on R-...              
4:   1.9.6 2015-09-19                           M Dowle              
5:   1.9.6 2015-09-19                      A Srinivasan              
6:   1.9.6 2015-09-19                           T Short              
7:   1.9.6 2015-09-19                       S Lianoglou              
8:   1.9.6 2015-09-19                         R Saporta              
9:   1.9.6 2015-09-19                        E Antonyan

We see some old entries above with missing roles, which we fill in below.

linewidth.values <- c(
  ctb=2,
  aut=1)
author.dt[
, Role := factor(fcase(
  roles=="aut, cre" | grepl("Dowle|Srinivasan", name), "aut",
  roles=="", "ctb",
  default=roles), names(linewidth.values))
][
, table(roles, Role, useNA="always")
]

          Role
roles      ctb aut <NA>
             5   4    0
  aut        0  26    0
  aut, cre   0  10    0
  ctb      509   0    0
  <NA>       0   0    0

Above we use fcase() to create a new Role column, with factor levels in a non-default order (to control legend entry display order below). Then we chain square brackets to display a table which shows how roles are mapped to Role. The counts look reasonable, so the next step is to count how many people have each role in each release:

(count.dt <- author.dt[, .(people=.N), by=.(release, version, Role)])

       release version   Role people
        <IDat>  <char> <fctr>  <int>
 1: 2006-04-14     1.0    aut      1
 2: 2011-04-24     1.6    aut      1
 3: 2011-04-24     1.6    ctb      1
 4: 2015-09-19   1.9.6    aut      2
 5: 2015-09-19   1.9.6    ctb      4
---                                 
21: 2024-03-30  1.15.4    ctb     65
22: 2025-05-12  1.17.2    aut      8
23: 2025-05-12  1.17.2    ctb     81
24: 2025-12-24  1.18.0    aut      8
25: 2025-12-24  1.18.0    ctb     86
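
The role mapping and the per-release count can be tried on a small scale. The sketch below uses a made-up table (toy.dt and its rows are hypothetical), and assumes a data.table version in which fcase() accepts a vector default, as in the code above.

```r
library(data.table)
## Hypothetical author entries, for illustration only.
toy.dt <- data.table(
  version=c("1.0", "1.0", "2.0", "2.0", "2.0"),
  name=c("Matt Dowle", "Tom Short", "Matt Dowle", "Jan Gorecki", "Some Person"),
  roles=c("", "", "aut, cre", "aut", "ctb"))
toy.dt[, Role := factor(fcase(
  ## first matching condition wins.
  roles=="aut, cre" | grepl("Dowle", name), "aut",
  roles=="", "ctb",
  default=roles), c("ctb", "aut"))]
## one row per version and Role, with the number of people.
(toy.count <- toy.dt[, .(people=.N), by=.(version, Role)])
```

The conditions in fcase() are checked in order, so an entry with an empty role is only mapped to ctb if the name did not already match the first condition.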

How has this evolved over the past twenty years?

library(ggplot2)
gg <- ggplot(count.dt, aes(
  release, people, color=Role))+
  ggtitle("data.table contributor and author counts for selected releases")+
  theme(
    panel.grid.minor=element_blank(),
    axis.text.x=element_text(hjust=1, angle=40))+
  geom_line(aes(linewidth=Role))+
  geom_point(shape=21, fill="white")+
  scale_x_date(breaks="year")+
  scale_linewidth_manual(values=linewidth.values)+
  scale_y_log10(limits=c(0.2, 500))
gg

Above we see a time series showing how the numbers of authors and contributors have increased over time. To emphasize the values at each release, we add direct labels below:

pp <- function(num)sprintf("%d %s", num, ifelse(num==1, "person", "people"))
## To define upper limit of X scale, we use prop.
## prop=0 means no extra space.
## prop=0.1 means 10% more space, etc.
prop <- 0.1
space.cm <- 0.2 # space between polygon point and data point.
poly.method <- function(position, direction)substitute(list(
  directlabels::dl.trans(
    cex=0.7, # text size of direct labels.
    y=y+YSPACE),
  directlabels::polygon.method(
    POSITION, offset.cm=0.5)), #space between polygon point and text.
  list(YSPACE=direction*space.cm, POSITION=position))
directlabels::direct.label(
  gg, list(directlabels::dl.trans(x=x+space.cm), "right.polygons"))+
  scale_x_date(
    breaks=grid.dt$grid.IDate,
    limits=grid.dt[, {
      i <- as.integer(grid.IDate)
      as.IDate(c(min(i), (1+prop)*max(i)-prop*min(i)))
    }])+
  directlabels::geom_dl(aes(
    label=sprintf("%s\n%s", version, pp(people))),
    data=count.dt[Role=="ctb"],
    method=poly.method("top", 1))+
  directlabels::geom_dl(aes(
    label=sprintf("%s\n%s", pp(people), version)),
    data=count.dt[Role=="aut"],
    method=poly.method("bottom", -1))

Warning: `aes_string()` was deprecated in ggplot2 3.0.0.
ℹ Please use tidy evaluation idioms with `aes()`.
ℹ See also `vignette("ggplot2-in-packages")` for more information.
ℹ The deprecated feature was likely used in the directlabels package.
  Please report the issue at <https://github.com/tdhock/directlabels/issues>.

Warning in geom_dl(mapping = a2, method = method, stat = L$stat, debug = debug, : Ignoring unknown
aesthetics: linewidth

Scale for x is already present.
Adding another scale for x, which will replace the existing scale.
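
The upper X limit in the code above, (1+prop)*max(i)-prop*min(i), is the same as max(i)+prop*(max(i)-min(i)), so prop is the fraction of the date range that is added on the right. A minimal sketch of that arithmetic is below; expand_upper and toy.dates are hypothetical names, and base Date is used instead of IDate to keep the example free of dependencies.

```r
## Hypothetical helper: expand the upper limit of a date range by prop.
expand_upper <- function(dates, prop=0.1){
  i <- as.integer(dates)
  ## same formula as above; equal to max(i) + prop*(max(i)-min(i)).
  as.Date(c(min(i), (1+prop)*max(i)-prop*min(i)), origin="1970-01-01")
}
toy.dates <- as.Date(c("2020-01-01", "2020-01-11"))
(toy.limits <- expand_upper(toy.dates))
```

With a 10-day range and prop=0.1, the upper limit should land about one day past the last date, which is the extra room used for the labels on the right.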

The figure above shows that the number of authors and contributors has greatly expanded in the second decade of data.table. I’m looking forward to the third decade!

Update of the previous blog

The rest of this post is copied from my previous post, with an update based on recent data.

Analyze several packages for comparison

The code below defines a set of six packages for which we would like to analyze the release history (tidyverse and deprecated packages for comparison with data.table).

compare.pkg.dt <- rbind(
  data.table(project="tidyverse", Package=c("readr","tidyr","dplyr")),
  data.table(project="deprecated", Package=c("reshape2", "plyr")),
  data.table(project="data.table", Package="data.table"))

In the code below, we download and parse the Archive web page for each package,

(release.dt <- compare.pkg.dt[, {
  Archive.pkg <- get_Archive(Package)
  nc::capture_all_str(Archive.pkg, Archive.pattern)
}, by=names(compare.pkg.dt)])

        project    Package                    file    package version      IDate
         <char>     <char>                  <char>     <char>  <char>     <IDat>
  1:  tidyverse      readr      readr_0.1.0.tar.gz      readr   0.1.0 2015-04-08
  2:  tidyverse      readr      readr_0.1.1.tar.gz      readr   0.1.1 2015-05-29
  3:  tidyverse      readr      readr_0.2.0.tar.gz      readr   0.2.0 2015-10-20
  4:  tidyverse      readr      readr_0.2.1.tar.gz      readr   0.2.1 2015-10-21
  5:  tidyverse      readr      readr_0.2.2.tar.gz      readr   0.2.2 2015-10-22
 ---                                                                            
213: data.table data.table data.table_1.8.8.tar.gz data.table   1.8.8 2013-03-06
214: data.table data.table data.table_1.9.2.tar.gz data.table   1.9.2 2014-02-27
215: data.table data.table data.table_1.9.4.tar.gz data.table   1.9.4 2014-10-02
216: data.table data.table data.table_1.9.6.tar.gz data.table   1.9.6 2015-09-19
217: data.table data.table data.table_1.9.8.tar.gz data.table   1.9.8 2016-11-25

The result above is a data table with one row for each package version. Note that the code sets by to all column names, so that the code is run once for each row/package.
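
To illustrate the by=names(...) idiom in isolation, here is a minimal sketch with a made-up table (toy.pkgs and the file names are hypothetical, and nothing is downloaded). Each row forms its own group, and the data table returned for each group is stacked into the result.

```r
library(data.table)
## Hypothetical packages, for illustration only.
toy.pkgs <- data.table(project=c("a", "b"), Package=c("pkgA", "pkgB"))
(toy.result <- toy.pkgs[, {
  ## this block runs once per row, because by uses all columns.
  data.table(file=paste0(Package, "_", c("0.1", "0.2"), ".tar.gz"))
}, by=names(toy.pkgs)])
```

The grouping columns are kept in the result, which is why project and Package appear next to the parsed columns in the output above.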

Add columns for plotting

For plotting we add a few more columns,

release.dt[, `:=`(
  year = as.integer(sub("-.*", "", IDate)),
  package = factor(Package, compare.pkg.dt$Package),
  Project = paste0('\n', project))]
setkey(release.dt, Project, Package, IDate)
release.dt

Key: <Project, Package, IDate>
        project    Package                    file    package version      IDate  year      Project
         <char>     <char>                  <char>     <fctr>  <char>     <IDat> <int>       <char>
  1: data.table data.table   data.table_1.0.tar.gz data.table     1.0 2006-04-14  2006 \ndata.table
  2: data.table data.table   data.table_1.1.tar.gz data.table     1.1 2008-08-27  2008 \ndata.table
  3: data.table data.table   data.table_1.2.tar.gz data.table     1.2 2008-09-01  2008 \ndata.table
  4: data.table data.table data.table_1.4.1.tar.gz data.table   1.4.1 2010-05-03  2010 \ndata.table
  5: data.table data.table   data.table_1.5.tar.gz data.table     1.5 2010-09-14  2010 \ndata.table
 ---                                                                                               
213:  tidyverse      tidyr      tidyr_1.1.4.tar.gz      tidyr   1.1.4 2021-09-27  2021  \ntidyverse
214:  tidyverse      tidyr      tidyr_1.2.0.tar.gz      tidyr   1.2.0 2022-02-01  2022  \ntidyverse
215:  tidyverse      tidyr      tidyr_1.2.1.tar.gz      tidyr   1.2.1 2022-09-08  2022  \ntidyverse
216:  tidyverse      tidyr      tidyr_1.3.0.tar.gz      tidyr   1.3.0 2023-01-24  2023  \ntidyverse
217:  tidyverse      tidyr      tidyr_1.3.1.tar.gz      tidyr   1.3.1 2024-01-24  2024  \ntidyverse

To explain the columns used for plotting,

IDate is for the date to display on the X axis,
year is for labeling the first released version each year,
package is for displaying the Y axis in a particular order (defined by the factor levels),
Project is for the facet/panel titles (newline so that minimal vertical space is used).
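
As a side note on the year column, the sub() call above works because the IDate is converted to text such as "2015-04-08" before the pattern is applied. The sketch below (toy.date is a hypothetical example) compares that with data.table's year() function, which is one possible alternative.

```r
library(data.table)
## Hypothetical dates, for illustration only.
toy.date <- as.IDate(c("2015-04-08", "2016-11-25"))
## remove everything from the first dash onward, as in the code above.
(year.sub <- as.integer(sub("-.*", "", toy.date)))
## alternative: extract the year directly from the date.
(year.fun <- year(toy.date))
```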

Basic plot

The code below creates a basic version history plot,

(gg.points <- ggplot()+
  theme(
    axis.text.x=element_text(hjust=1, angle=40))+
  facet_grid(Project ~ ., labeller=label_both, scales="free")+
  geom_point(aes(
    IDate, package),
    shape=1,
    data=release.dt)+
  scale_x_date("Date", breaks="year"))

The plot above shows a point for every release to CRAN, so you can see the distribution of releases over time.

Add direct labels

Before plotting we make a new table which contains only the first release of data.table in each year (for direct labels),

(labeled.releases <- release.dt[Package=="data.table", .SD[1], by=year])

     year    project    Package                     file    package version      IDate      Project
    <int>     <char>     <char>                   <char>     <fctr>  <char>     <IDat>       <char>
 1:  2006 data.table data.table    data.table_1.0.tar.gz data.table     1.0 2006-04-14 \ndata.table
 2:  2008 data.table data.table    data.table_1.1.tar.gz data.table     1.1 2008-08-27 \ndata.table
 3:  2010 data.table data.table  data.table_1.4.1.tar.gz data.table   1.4.1 2010-05-03 \ndata.table
 4:  2011 data.table data.table  data.table_1.5.1.tar.gz data.table   1.5.1 2011-01-08 \ndata.table
 5:  2012 data.table data.table  data.table_1.7.8.tar.gz data.table   1.7.8 2012-01-25 \ndata.table
 6:  2013 data.table data.table  data.table_1.8.8.tar.gz data.table   1.8.8 2013-03-06 \ndata.table
 7:  2014 data.table data.table  data.table_1.9.2.tar.gz data.table   1.9.2 2014-02-27 \ndata.table
 8:  2015 data.table data.table  data.table_1.9.6.tar.gz data.table   1.9.6 2015-09-19 \ndata.table
 9:  2016 data.table data.table  data.table_1.9.8.tar.gz data.table   1.9.8 2016-11-25 \ndata.table
10:  2017 data.table data.table data.table_1.10.2.tar.gz data.table  1.10.2 2017-01-31 \ndata.table
11:  2018 data.table data.table data.table_1.11.0.tar.gz data.table  1.11.0 2018-05-01 \ndata.table
12:  2019 data.table data.table data.table_1.12.0.tar.gz data.table  1.12.0 2019-01-13 \ndata.table
13:  2020 data.table data.table data.table_1.13.0.tar.gz data.table  1.13.0 2020-07-24 \ndata.table
14:  2021 data.table data.table data.table_1.14.0.tar.gz data.table  1.14.0 2021-02-21 \ndata.table
15:  2022 data.table data.table data.table_1.14.4.tar.gz data.table  1.14.4 2022-10-17 \ndata.table
16:  2023 data.table data.table data.table_1.14.8.tar.gz data.table  1.14.8 2023-02-17 \ndata.table
17:  2024 data.table data.table data.table_1.15.0.tar.gz data.table  1.15.0 2024-01-30 \ndata.table
18:  2025 data.table data.table data.table_1.17.0.tar.gz data.table  1.17.0 2025-02-22 \ndata.table

gg.points+
  directlabels::geom_dl(aes(
    IDate, package, label=paste0(year, "\n", version)),
    method=list(
      cex=0.7,
      directlabels::polygon.method(
        "top", offset.cm=0.2, custom.colors=list(
          colour="white",
          box.color="black",
          text.color="black"))),
    data=labeled.releases)

The plot above shows a label for the first version released each year.
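
Note that .SD[1] takes the first row in each group, so it gives the earliest release only if rows are sorted by date within each year, which the earlier setkey() call ensures. A minimal sketch with made-up, deliberately unsorted rows (toy.rel is hypothetical):

```r
library(data.table)
## Hypothetical releases, not in date order.
toy.rel <- data.table(
  version=c("1.1", "1.0", "2.0"),
  IDate=as.IDate(c("2008-09-01", "2008-08-27", "2010-05-03")))
toy.rel[, year := data.table::year(IDate)]
## sort by date, so the first row per year is the earliest release.
setkey(toy.rel, IDate)
(first.per.year <- toy.rel[, .SD[1], by=year])
```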

Releases per year

One way to compute releases per year would be to add up the total number of releases, then divide by the number of years from the first release to the last release (inclusive),

(overall.stats <- dcast(
  release.dt, 
  project + Package ~ ., 
  list(min,max,length), 
  value.var="year"
)[, releases.per.year := year_length/(year_max-year_min+1)][])

Key: <project, Package>
      project    Package year_min year_max year_length releases.per.year
       <char>     <char>    <int>    <int>       <int>             <num>
1: data.table data.table     2006     2025          73         3.6500000
2: deprecated       plyr     2008     2022          34         2.2666667
3: deprecated   reshape2     2010     2020          10         0.9090909
4:  tidyverse      dplyr     2014     2026          46         3.5384615
5:  tidyverse      readr     2015     2025          23         2.0909091
6:  tidyverse      tidyr     2014     2024          31         2.8181818
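
To make the column names and the division concrete, here is a minimal sketch of the same dcast() call on a made-up table (toy.years is hypothetical). With a list of aggregation functions, the result has one column per function, named after value.var and the function.

```r
library(data.table)
## Hypothetical release years, for illustration only.
toy.years <- data.table(
  Package=c("a", "a", "a", "b"),
  year=c(2006L, 2006L, 2009L, 2020L))
toy.stats <- dcast(
  toy.years,
  Package ~ .,
  list(min, max, length),
  value.var="year")
## package a: 3 releases over 2006 to 2009 inclusive (4 years).
toy.stats[, releases.per.year := year_length/(year_max-year_min+1)][]
```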

Another way to do it would be to compute the number of releases in each year since the first release of the package. To do that we first compute, for each package, a set of years for which we want to count releases.

(max.year <- max(release.dt$year))

[1] 2026

(years.since.release <- release.dt[, .(
  year=seq(min(year), max.year)
), by=.(Project, project, Package, package)])

         Project    project    Package    package  year
          <char>     <char>     <char>     <fctr> <int>
 1: \ndata.table data.table data.table data.table  2006
 2: \ndata.table data.table data.table data.table  2007
 3: \ndata.table data.table data.table data.table  2008
 4: \ndata.table data.table data.table data.table  2009
 5: \ndata.table data.table data.table data.table  2010
---                                                    
91:  \ntidyverse  tidyverse      tidyr      tidyr  2022
92:  \ntidyverse  tidyverse      tidyr      tidyr  2023
93:  \ntidyverse  tidyverse      tidyr      tidyr  2024
94:  \ntidyverse  tidyverse      tidyr      tidyr  2025
95:  \ntidyverse  tidyverse      tidyr      tidyr  2026

Then we can do a join and summarize to count the number of releases in each year, for each package,

(releases.per.year <- release.dt[years.since.release, .(
  N=as.numeric(.N)
), on=.NATURAL, by=.EACHI])

       project    Package    package  year      Project     N
        <char>     <char>     <fctr> <int>       <char> <num>
 1: data.table data.table data.table  2006 \ndata.table     1
 2: data.table data.table data.table  2007 \ndata.table     0
 3: data.table data.table data.table  2008 \ndata.table     2
 4: data.table data.table data.table  2009 \ndata.table     0
 5: data.table data.table data.table  2010 \ndata.table     2
---                                                          
91:  tidyverse      tidyr      tidyr  2022  \ntidyverse     2
92:  tidyverse      tidyr      tidyr  2023  \ntidyverse     1
93:  tidyverse      tidyr      tidyr  2024  \ntidyverse     1
94:  tidyverse      tidyr      tidyr  2025  \ntidyverse     0
95:  tidyverse      tidyr      tidyr  2026  \ntidyverse     0
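
The key feature of this join is that years without any release still get a row, with a count of zero. A minimal sketch on made-up data (toy.releases and toy.grid are hypothetical):

```r
library(data.table)
## Hypothetical releases: one in 2006, two in 2008.
toy.releases <- data.table(Package="a", year=c(2006L, 2008L, 2008L))
## every year for which we want a count.
toy.grid <- data.table(Package="a", year=2006:2009)
## for each row of toy.grid, count the matching rows of toy.releases.
(toy.counts <- toy.releases[toy.grid, .(N=.N), on=.NATURAL, by=.EACHI])
```

Counting with a plain by=year on toy.releases would instead drop 2007 and 2009, which would inflate the yearly mean.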

Note that on=.NATURAL above means to join on the common columns between the two tables, and by=.EACHI means to compute a summary for each value specified in i (the first argument in the square bracket). We can plot these data as a heat map via

this.year <- as.integer(strftime(Sys.time(), "%Y"))
ggplot()+
  theme_bw()+
  theme(
    panel.spacing=grid::unit(0, "lines"),
    axis.text.x=element_text(hjust=1, angle=40))+
  geom_tile(aes(
    year, package, fill=N),
    data=releases.per.year)+
  geom_text(aes(
    year, package, label=N),
    data=releases.per.year)+
  facet_grid(Project ~ ., labeller=label_both, scales="free", space="free")+
  scale_fill_gradient(
    "releases",
    low="white",
    high="red",
    breaks=c(0, 2^seq(0, 4)),
    transform=scales::transform_log1p())+
  scale_x_continuous(breaks=seq(2006, this.year))+
  coord_cartesian(expand=FALSE)

The heat map above shows a summarized display of the release data we saw earlier in the dot plot.
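
The fill scale in the code above uses a log1p transform, that is log(1+N), presumably because many yearly counts are zero. The small base R sketch below (count.breaks is a hypothetical name) shows that the legend breaks stay finite under log1p, whereas a plain log would send zero to negative infinity.

```r
## same break values as in the fill scale above.
count.breaks <- c(0, 2^seq(0, 4))
data.frame(
  count=count.breaks,
  log=log(count.breaks),     # -Inf for count 0.
  log1p=log1p(count.breaks)) # 0 for count 0.
```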

Next, we can apply a list of summary functions over all of the yearly counts, for each package, via

(per.year.stats <- dcast(
  releases.per.year,
  project + Package ~ .,
  list(min, max, mean, sd, length),
  value.var = "N"))

Key: <project, Package>
      project    Package N_min N_max    N_mean      N_sd N_length
       <char>     <char> <num> <num>     <num>     <num>    <int>
1: data.table data.table     0    17 3.4761905 3.7499206       21
2: deprecated       plyr     0     7 1.7894737 2.3939495       19
3: deprecated   reshape2     0     2 0.5882353 0.7952062       17
4:  tidyverse      dplyr     0     8 3.5384615 2.4019223       13
5:  tidyverse      readr     0     5 1.9166667 1.7816404       12
6:  tidyverse      tidyr     0     6 2.3846154 1.9381461       13

Finally, the code below creates a table to compare the two different ways of computing the number of releases per year,

per.year.stats[overall.stats, .(
  Package, 
  overall.mean=N_mean, 
  mean.per.year=releases.per.year
), on="Package"]

      Package overall.mean mean.per.year
       <char>        <num>         <num>
1: data.table    3.4761905     3.6500000
2:       plyr    1.7894737     2.2666667
3:   reshape2    0.5882353     0.9090909
4:      dplyr    3.5384615     3.5384615
5:      readr    1.9166667     2.0909091
6:      tidyr    2.3846154     2.8181818

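The table above shows similar numbers for the two methods of computing the number of releases per year. The overall.mean column is never larger than mean.per.year, because the yearly counts extend through 2026 for every package, including years with zero releases after a package's last release. The two methods agree exactly for dplyr, whose most recent release is in 2026.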
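
As a worked example of the difference, the sketch below redoes the arithmetic for data.table using the numbers from the tables above (73 releases, first in 2006, last in 2025, and yearly counts through 2026). The variable names are hypothetical.

```r
n.releases <- 73
first.year <- 2006
last.year <- 2025
max.year <- 2026
## total releases divided by years from first to last release.
(from.totals <- n.releases/(last.year-first.year+1))
## mean of yearly counts, which include a zero for 2026.
(from.yearly <- n.releases/(max.year-first.year+1))
```

The only difference is the denominator, 20 years versus 21 years, which accounts for the gap between 3.65 and about 3.48.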

Conclusion

We have shown how to download CRAN package release data, how to parse the web pages using the nc package and regular expressions, how to summarize/analyze using data.table, and how to visualize using ggplot2.