Scala in Numbers
The Ecosystem Census
Source code statistics of common Scala libraries
2014-06-18
Scala Days 2014 Berlin
Johannes Rudolph / @virtualvoid
Scientific Approach
- Start with hypothesis
- Gather evidence
- Try to prevent biases
- Make sure data is correct
- Make experiments reproducible
- Choose relevant features
- Goal: discover knowledge
Ad Hoc Approach
- Start with rough idea
- Technology first
- Run some experiments
- Look for the interesting needles in the data haystack
- Hope to learn something in the process
Data basis
- Scala 2.10
- Mostly libraries from the 2.11 release announcement list
- libraries
- ~ 200 MB json files of analysis output
(*) Lines the compiler reported an AST for
Doing statistics is hard
- Start with hypotheses?
- Start with data?
- Start with visualizations?
Local vals
Rank | Name | Occurrences | Top Library | % of Total* |
(*) % of occurrences the top library contributed
Example
GenericCompanion.scala:
def apply[A](elems: A*): CC[A] = {
  if (elems.isEmpty) empty[A]
  else {
    val b = newBuilder[A]
    b ++= elems
    b.result()
  }
}
Local vars
Rank | Name | Occurrences | Top Library | % of Total* |
(*) % of occurrences the top library contributed
Example
rapture-io / streams.scala
override def foreach[U](fn: Data => U): Unit = {
  var next: Option[Data] = read()
  while(next != None) {
    fn(next.get)
    next = read()
  }
}
scala-stm / TxnHashTrie.scala
def mapForeach[U](f: ((A, B)) => U) {
  var i = 0
  while (i < hashes.length) {
    f(getKeyValue(i))
    i += 1
  }
}
Parameter names
Rank | Name | Occurrences | Top Library | % of Total* |
(*) % of occurrences the top library contributed
Example
scalaz-core / Lens.scala
def map[C](f: B1 => C): State[A1, C]
def >-[C](f: B1 => C): State[A1, C]
def flatMap[C](f: B1 => IndexedState[A1, A2, C]): IndexedState[A1, A2, C]
Lengths of Local Identifiers
Longest: linearizedTargetColumnsForOriginalTargetTable from slick
Lengths of Local Identifiers by Library
Library | Avg. Length | Median | Shortest | Longest |
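The columns of this table are plain descriptive statistics over identifier names. A minimal sketch of how they could be computed for one library follows; `IdentifierLengths`, `LengthStats` and `stats` are hypothetical names chosen for illustration, not taken from the talk's code.

```scala
object IdentifierLengths {
  // Hypothetical result type mirroring the table columns.
  final case class LengthStats(avg: Double, median: Double, shortest: String, longest: String)

  // Returns None for empty input, where none of the statistics are defined.
  def stats(names: Seq[String]): Option[LengthStats] =
    if (names.isEmpty) None
    else {
      val lengths = names.map(_.length).sorted.toVector
      val n = lengths.size
      // Median: middle element, or the mean of the two middle elements.
      val median =
        if (n % 2 == 1) lengths(n / 2).toDouble
        else (lengths(n / 2 - 1) + lengths(n / 2)) / 2.0
      Some(LengthStats(
        avg = lengths.sum.toDouble / n,
        median = median,
        shortest = names.minBy(_.length), // first name of minimal length
        longest = names.maxBy(_.length))) // first name of maximal length
    }
}
```

Usage: `IdentifierLengths.stats(Seq("b", "i", "next", "elems"))` yields an average of 2.75 and a median of 2.5.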
Lengths of Local Identifiers
Scala Library Usage (12 - 24)
Scala Library Usage Top 12
Scala Library Usage (scala.collection)
Predef Enhancements Usages
Implicit parameter definitions by type
Implicit params from scala-library
Parts
- Crawler
- Compiler
- Feature extraction
- Analysis
- Frontend
Crawler
- Input: Maven ModuleID
- Output: source jar and binary dependency jars
- Uses ivy/sbt-ivy to access Maven repositories
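To illustrate what "ModuleID in, source jar out" means, here is a sketch that derives the conventional Maven Central location of a source jar. This is not how the crawler works (it resolves through ivy); `ModuleId`, `sourceJarUrl` and the example coordinates are assumptions made for illustration only.

```scala
object SourceJarUrl {
  // Hypothetical stand-in for a Maven module ID; not the crawler's actual type.
  final case class ModuleId(organization: String, name: String, version: String)

  val MavenCentral = "https://repo1.maven.org/maven2"

  // Builds the URL following the standard Maven repository layout:
  //   <org as path>/<artifactId>/<version>/<artifactId>-<version>-sources.jar
  // Cross-built Scala libraries append the binary version to the artifact id.
  def sourceJarUrl(m: ModuleId, scalaBinaryVersion: Option[String] = Some("2.10")): String = {
    val artifactId = scalaBinaryVersion.fold(m.name)(v => s"${m.name}_$v")
    val orgPath = m.organization.replace('.', '/')
    s"$MavenCentral/$orgPath/$artifactId/${m.version}/$artifactId-${m.version}-sources.jar"
  }
}
```

A real crawler also has to resolve the transitive binary dependencies, which is the part that ivy handles.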
Compiler
- Input: sources and dependency jars
- Initializes presentation compiler with all sources
- Allows running queries over source trees
trait AnalyzingCompiler {
  def analyze[T](factory: AnalyzerFactory[Option[T]]): Seq[T]
}
trait AnalyzerFactory[T] {
  def create(u: Universe): Analyzer[T] { type U = u.type }
}
trait Analyzer[T] {
  type U <: Universe
  def analyze(tree: U#Tree): T
}
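new AnalyzerFactory[Option[PredefUsage]] {
  def create(u: Universe) =
    new Analyzer[Option[PredefUsage]] {
      // Ties the analyzer to the universe it was created for,
      // as required by the return type of `create`.
      type U = u.type
      def analyze(tree: u.Tree): Option[PredefUsage] =
        tree match {
          // Matches any selection of a member on `<prefix>.Predef`
          case q"${ _ }.Predef.$method" =>
            Some(
              PredefUsage(method.decoded,
                pos(library, tree.pos)))
          case _ => None
        }
    }
}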
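The signature `analyze[T](factory: AnalyzerFactory[Option[T]]): Seq[T]` suggests that the analyzer is applied to every subtree and the `Some` results are collected. The following sketch shows that pattern on a toy tree type so that it runs without the compiler on the classpath; `Tree`, `Ident`, `Select`, `Apply`, `subtrees` and `analyzeAll` are hypothetical stand-ins, not the compiler's or the talk's actual types.

```scala
object PredefScan {
  // Toy AST, a stand-in for the compiler's trees.
  sealed trait Tree
  final case class Ident(name: String) extends Tree
  final case class Select(qualifier: Tree, name: String) extends Tree
  final case class Apply(fun: Tree, args: List[Tree]) extends Tree

  // All subtrees in pre-order, including the root itself.
  def subtrees(t: Tree): List[Tree] = t :: (t match {
    case Ident(_)        => Nil
    case Select(q, _)    => subtrees(q)
    case Apply(f, args)  => subtrees(f) ::: args.flatMap(subtrees)
  })

  // Runs a partial analysis on every subtree and keeps only the hits.
  def analyzeAll[T](root: Tree)(analyze: Tree => Option[T]): Seq[T] =
    subtrees(root).flatMap(t => analyze(t).toList)

  // Toy counterpart of the quasiquote pattern: any `<prefix>.Predef.<method>`.
  val predefUsage: Tree => Option[String] = {
    case Select(Select(_, "Predef"), method) => Some(method)
    case _                                   => None
  }
}
```

Applied to a tree for `scala.Predef.println(scala.Predef.identity(x))`, `analyzeAll` with `predefUsage` collects `println` and `identity`.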
Extraction
- Extracts one aspect of code
- Examples: collect-identifiers, scala-library-references
- Runs per library
- Results serialized to json
Analysis
- Aggregates per-library data into statistics
- Examples: common-identifiers, local-identifier histogram
- Results serialized to json
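As a sketch of what such an aggregation could look like, the code below turns per-library identifier counts into the ranked rows used in the identifier tables (Rank, Name, Occurrences, Top Library, % of Total). The names `CommonIdentifiers`, `Row` and `rank` and the input shape are assumptions for illustration; the actual analysis code may be structured differently.

```scala
object CommonIdentifiers {
  // Hypothetical row type mirroring the table columns.
  final case class Row(rank: Int, name: String, occurrences: Int,
                       topLibrary: String, topLibraryPercent: Double)

  // perLibrary: library name -> (identifier -> occurrences in that library).
  // Assumes all counts are positive.
  def rank(perLibrary: Map[String, Map[String, Int]], top: Int): Seq[Row] = {
    // Regroup the counts by identifier name: name -> Seq((library, count)).
    val byName: Map[String, Seq[(String, Int)]] =
      perLibrary.toSeq
        .flatMap { case (lib, counts) => counts.toSeq.map { case (name, n) => (name, lib, n) } }
        .groupBy(_._1)
        .map { case (name, entries) => name -> entries.map(e => (e._2, e._3)) }

    byName.toSeq
      .map { case (name, libs) =>
        val total = libs.map(_._2).sum
        val (topLib, topCount) = libs.maxBy(_._2)
        // Share of all occurrences contributed by the top library.
        (name, total, topLib, 100.0 * topCount / total)
      }
      .sortBy { case (name, total, _, _) => (-total, name) } // most frequent first
      .take(top)
      .zipWithIndex
      .map { case ((name, total, lib, pct), i) => Row(i + 1, name, total, lib, pct) }
  }
}
```

With the made-up input `Map("libA" -> Map("b" -> 3), "libB" -> Map("b" -> 1))`, the single resulting row reports 4 occurrences of `b`, with `libA` contributing 75%.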
Frontend
- Fetch statistics as JSON
- Render stats in the presentation on the fly
- Problem: Data may still change
- Tools used
Techniques and Libraries used
- sbt/ivy for fetching dependencies
- presentation compiler for providing data structures
- quasi-quotes for matching on interesting trees
- scala-js for doing some client side data manipulation
- d3 for creating visualizations
Issues with the data
- Only library code was considered
- Not corrected for code size
- Correctness wasn't validated rigorously
- Libraries weren't properly pre-screened for relevancy
- Some libraries had minor compilation issues
- Multi-module libraries weren't aggregated
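One way to address the "not corrected for code size" point would be to report occurrences per 1,000 lines instead of raw counts, so that large libraries do not dominate the rankings. This is a sketch of that idea only, not something the analysis did; `SizeNormalization` and `perThousandLines` are hypothetical names.

```scala
object SizeNormalization {
  // Occurrences per 1,000 lines of code.
  // Returns None when the line count is not positive.
  def perThousandLines(occurrences: Long, linesOfCode: Long): Option[Double] =
    if (linesOfCode <= 0) None
    else Some(occurrences * 1000.0 / linesOfCode)
}
```

For example, 50 occurrences in a 20,000-line library give 2.5 per 1,000 lines, while the same 50 occurrences in a 2,000-line library give 25.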
Last but not least: symbolic operators
Top 2 (instead of Top 10) Libraries with Unicode operators
Executive Summary
Code is data
Use it to your advantage.