Scala in Numbers

The Ecosystem Census

Source code statistics of common Scala libraries

2014-06-18 Scala Days 2014 Berlin

Johannes Rudolph / @virtualvoid

Scientific Approach

  • Start with hypothesis
  • Gather evidence
  • Try to prevent biases
  • Make sure data is correct
  • Make experiments reproducible
  • Choose relevant features
  • Goal: discover knowledge

Ad Hoc Approach

  • Start with rough idea
  • Technology first
  • Run some experiments
  • Look for the interesting needles in the data haystack
  • Hope to learn something in the process

Data basis

  • Scala 2.10
  • Mostly libraries from the 2.11 release announcement list
  • libraries
  • ~ 200 MB json files of analysis output

Libraries

Size ~ LoC
Name | LoC (*) | Source files
(*) Lines the compiler reported an AST for
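
One way to read this footnote, as a minimal sketch: a line counts once if at least one syntax-tree node is positioned on it. `NodePos` and `linesWithAst` are hypothetical names for illustration, not part of the tool.

```scala
object LocSketch {
  // Hypothetical position of one syntax-tree node: source file plus 1-based line.
  final case class NodePos(file: String, line: Int)

  // Count each (file, line) pair once, however many tree nodes sit on that line.
  def linesWithAst(positions: Seq[NodePos]): Int =
    positions.distinct.size
}
```

Under this reading, blank lines and comment-only lines would not contribute, since no tree node is positioned on them.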

Doing statistics is hard

  • Start with hypotheses?
  • Start with data?
  • Start with visualizations?

Identifiers

Local vals

Rank | Name | Occurrences | Top Library | % of Total*
(*) % of occurrences the top library contributed
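
A ranking like this could be derived from per-library counts roughly as follows. This is a sketch, not the talk's actual analysis code: `IdentifierRanking`, `Row`, and the input shape (library name to identifier counts) are assumptions.

```scala
object IdentifierRanking {
  // One row of the ranking table.
  final case class Row(rank: Int, name: String, occurrences: Int,
                       topLibrary: String, topLibraryPercent: Double)

  // counts: library -> (identifier name -> occurrences in that library)
  def rank(counts: Map[String, Map[String, Int]], top: Int): Seq[Row] = {
    // Regroup the data by identifier name: name -> Seq((library, count)).
    val byName: Map[String, Seq[(String, Int)]] =
      counts.toSeq
        .flatMap { case (lib, names) => names.toSeq.map { case (n, c) => (n, lib, c) } }
        .groupBy(_._1)
        .map { case (n, triples) => n -> triples.map(t => (t._2, t._3)) }

    byName.toSeq
      .map { case (name, perLib) =>
        val total = perLib.map(_._2).sum
        // The library contributing the most occurrences of this name.
        val (topLib, topCount) = perLib.maxBy(_._2)
        (name, total, topLib, 100.0 * topCount / total)
      }
      .sortBy { case (name, total, _, _) => (-total, name) } // most frequent first
      .take(top)
      .zipWithIndex
      .map { case ((name, total, lib, pct), i) => Row(i + 1, name, total, lib, pct) }
  }
}
```

The last column is the top library's count divided by the total across all libraries, so a high value means one library dominates that name.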

Example

GenericCompanion.scala:

def apply[A](elems: A*): CC[A] = {
    if (elems.isEmpty) empty[A]
    else {
        val b = newBuilder[A]
        b ++= elems
        b.result()
    }
}
    

Local vars

Rank | Name | Occurrences | Top Library | % of Total*
(*) % of occurrences the top library contributed

Vals vs. Vars

Name | Vars | Vals | % Vars
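
The slide does not define the last column. A plausible reading, assumed here, is the share of `var`s among all local definitions (`var`s plus `val`s). `percentVars` is a hypothetical helper.

```scala
object ValVarRatio {
  // Share of local definitions that are `var`s, in percent.
  // Assumption: the denominator is vars + vals.
  // Returns None for a library without any local definitions.
  def percentVars(vars: Int, vals: Int): Option[Double] = {
    val total = vars + vals
    if (total == 0) None else Some(100.0 * vars / total)
  }
}
```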

Example

rapture-io / streams.scala

override def foreach[U](fn: Data => U): Unit = {
    var next: Option[Data] = read()
    while(next != None) {
        fn(next.get)
        next = read()
    }
}

scala-stm / TxnHashTrie.scala

def mapForeach[U](f: ((A, B)) => U) {
    var i = 0
    while (i < hashes.length) {
        f(getKeyValue(i))
        i += 1
    }
}
    

Parameter names

Rank | Name | Occurrences | Top Library | % of Total*
(*) % of occurrences the top library contributed

Example

scalaz-core / Lens.scala

def map[C](f: B1 => C): State[A1, C]
def >-[C](f: B1 => C): State[A1, C]
def flatMap[C](f: B1 => IndexedState[A1, A2, C]): IndexedState[A1, A2, C]
    

Lengths of Local Identifiers

Longest: linearizedTargetColumnsForOriginalTargetTable from slick

Lengths of Local Identifiers by Library

Library | Avg. Length | Median | Shortest | Longest
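
The per-library columns could be computed along these lines. This is a sketch: `IdentifierLengths` and `Stats` are hypothetical, and it assumes every occurrence of an identifier counts (the talk may have counted distinct names instead).

```scala
object IdentifierLengths {
  final case class Stats(average: Double, median: Double,
                         shortest: String, longest: String)

  // identifiers: all local identifier occurrences found in one library.
  def stats(identifiers: Seq[String]): Option[Stats] =
    if (identifiers.isEmpty) None
    else {
      val lengths = identifiers.map(_.length).sorted
      val n = lengths.size
      // Median: middle element, or the mean of the two middle elements.
      val median =
        if (n % 2 == 1) lengths(n / 2).toDouble
        else (lengths(n / 2 - 1) + lengths(n / 2)) / 2.0
      Some(Stats(
        average = lengths.sum.toDouble / n,
        median = median,
        shortest = identifiers.minBy(_.length),
        longest = identifiers.maxBy(_.length)))
    }
}
```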

Lengths of Local Identifiers

Scala Library Usage

Scala Library Usage (12 - 24)

Name | % | Users

Scala Library Usage Top 12

Name | % | Users

Scala Library Usage (scala.collection)

Name | % | Users

Predef Usages

Name | %

Predef Enhancements Usages

Method | Extension | %

Implicit usage

Implicit parameter definitions by type

Type | # | Top user | %

Implicit params from scala-library

Type | # | Top user | %

Making-Of

Parts

  • Crawler
  • Compiler
  • Feature extraction
  • Analysis
  • Frontend

Crawler

  • Input: Maven ModuleID
  • Output: source jar and binary dependency jars
  • Uses ivy/sbt-ivy to access Maven repositories

Compiler

  • Input: sources and dependency jars
  • Initializes presentation compiler with all sources
  • Allows running queries over source trees

trait AnalyzingCompiler {
  def analyze[T](factory: AnalyzerFactory[Option[T]]): Seq[T]
}
trait AnalyzerFactory[T] {
  def create(u: Universe): Analyzer[T] { type U = u.type }
}
trait Analyzer[T] {
  type U <: Universe
  def analyze(tree: U#Tree): T
}

new AnalyzerFactory[Option[PredefUsage]] {
  def create(u: Universe) =
    new Analyzer[Option[PredefUsage]] {
      def analyze(tree: u.Tree): Option[PredefUsage] =
        tree match {
          case q"${ _ }.Predef.$method" =>
            Some(
              PredefUsage(method.decoded,
                          pos(library, tree.pos)))
          case _ => None
        }
    }
}
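
The signature `analyze[T](factory: AnalyzerFactory[Option[T]]): Seq[T]` suggests that a query runs on every tree node and only the hits are kept. The sketch below shows that idea on a toy tree type. It does not use the compiler's trees or quasiquotes, and `TreeQuerySketch`, `subtrees`, and `analyzeAll` are hypothetical names.

```scala
object TreeQuerySketch {
  // Toy syntax tree; the real tool works on the Scala compiler's trees instead.
  sealed trait Tree { def children: List[Tree] }
  final case class Ident(name: String) extends Tree {
    def children: List[Tree] = Nil
  }
  final case class Select(qualifier: Tree, name: String) extends Tree {
    def children: List[Tree] = List(qualifier)
  }
  final case class Apply(fun: Tree, args: List[Tree]) extends Tree {
    def children: List[Tree] = fun :: args
  }

  // All nodes of a tree in pre-order.
  def subtrees(tree: Tree): List[Tree] =
    tree :: tree.children.flatMap(subtrees)

  // Run a per-node query over every node of every root, keeping only the hits.
  def analyzeAll[T](roots: Seq[Tree])(query: Tree => Option[T]): Seq[T] =
    roots.flatMap(subtrees).flatMap(t => query(t).toList)

  // Toy counterpart of the Predef query: report the member selected on `Predef`.
  val predefUsage: Tree => Option[String] = {
    case Select(Select(_, "Predef"), method) => Some(method)
    case _ => None
  }
}
```

For example, a tree modelling `scala.Predef.println(x)` yields the single hit `println`:

```scala
import TreeQuerySketch._
val call = Apply(Select(Select(Ident("scala"), "Predef"), "println"), List(Ident("x")))
analyzeAll(Seq(call))(predefUsage)
```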

Demo: Code Search

Extraction

  • Extracts one aspect of code
  • Examples: collect-identifiers, scala-library-references
  • Runs per library
  • Results serialized to json

Analysis

  • Aggregates per-library data into statistics
  • Examples: common-identifiers, local-identifier histogram
  • Results serialized to json
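
As an illustration of this aggregation step, a length histogram over all libraries might look like the sketch below. The input shape and the name `LengthHistogram` are assumptions, and JSON serialization is left out.

```scala
object LengthHistogram {
  // identifiersByLibrary: library -> local identifier occurrences found in it.
  // Result: (identifier length, number of occurrences), sorted by length.
  def histogram(identifiersByLibrary: Map[String, Seq[String]]): Seq[(Int, Int)] =
    identifiersByLibrary.values.flatten
      .groupBy(_.length)
      .map { case (len, names) => len -> names.size }
      .toSeq
      .sortBy(_._1)
}
```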

Frontend

  • Fetch statistics as JSON
  • Render stats in the presentation on the fly
  • Problem: Data may still change
  • Tools used
    • reveal.js
    • d3.js
    • scala-js

Techniques and Libraries used

  • sbt/ivy for fetching dependencies
  • presentation compiler for providing data structures
  • quasi-quotes for matching on interesting trees
  • scala-js for doing some client side data manipulation
  • d3 for creating visualizations

Issues with the data

  • Only library code was considered
  • Not corrected for code size
  • Correctness wasn't validated rigorously
  • Libraries weren't properly pre-screened for relevancy
  • Some libraries had minor compilation issues
  • Multi-module libraries weren't aggregated

Last but not least: symbolic operators

Name | # | Top 5

Top 2 Libraries with Unicode operators

Name | # | Operators

Executive Summary

Code is data

Use it to your advantage.

Thank you for listening!

@virtualvoid

This presentation (soon):
2014.sca.land