Monday, April 18, 2011

Extracting top words from titles and abstracts in MIX11 presentations


The videos of the MIX11 conference (Apr. 11-14) are available online. To get a sense of what’s going on recently, I wrote an F# script to count the most frequent words in the titles and abstracts of all the presentations.

Title statistics:

"windows" : 41
"phone" : 26
"azure" : 14
"web" : 14
"html5" : 11
"silverlight" : 9
"net" : 9
"7" : 8
"applications" : 8
"data" : 8
"platform" : 7
"application" : 7
"javascript" : 7
"new" : 7
"ux" : 6
"what’s" : 6
"asp" : 6
"boot" : 5
"camp" : 5
"building" : 5

From the title statistics, we can see that Windows Phone and the Azure cloud platform are the hottest topics. HTML5, the next standard for web pages, has been a continuing focus for Microsoft. Silverlight is still getting attention. It is a good companion to HTML5: I think it will find many uses in in-house web applications, while HTML5 has broader support and is available in browsers across platforms.

Abstract statistics:

"web" : 62
"session" : 55
"windows" : 49
"new" : 40
"applications" : 36
"phone" : 36
"learn" : 32
"we’ll" : 29
"use" : 25
"using" : 24
"silverlight" : 23
"data" : 23
"net" : 22
"javascript" : 21
"azure" : 21
"developers" : 20
"come" : 17
"microsoft" : 17
"features" : 17
"one" : 16


The whole F# program is actually short, only about 40 lines! Everything is there: downloading the web page, extracting the titles and abstracts, removing stop words, counting words, and sorting.


Code Snippet
  open System
  open System.Net
  open System.Text.RegularExpressions

  // Download a web page and return its content as a string.
  let fetchUrlSimple (url:string) =
      let req = WebRequest.Create(url)
      use response = req.GetResponse()
      use stream = response.GetResponseStream()
      use streamreader = new System.IO.StreamReader(stream)
      streamreader.ReadToEnd()

  // Return the K most frequent non-stop-words in docs, with their counts.
  let topKWords (docs:string seq) K =
      let separator = [|' '; '\r'; '\n'; '-'; '.'; ',' ; '\t'; '!'; '?'; '\''; ';'; '/' |]
      let stopwords = Set("a,able,about,across,after,all,almost,also,am,among,an,and,any,are,as,at,be,because,been,but,by,can,cannot,could,dear,did,do,does,either,else,ever,every,for,from,get,got,had,has,have,he,her,hers,him,his,how,however,i,if,in,into,is,it,its,just,least,let,like,likely,may,me,might,most,must,my,neither,no,nor,not,of,off,often,on,only,or,other,our,own,rather,said,say,says,she,should,since,so,some,than,that,the,their,them,then,there,these,they,this,tis,to,too,twas,us,wants,was,we,were,what,when,where,which,while,who,whom,why,will,with,would,yet,you,your".Split ',')

      docs
      |> Seq.map (fun doc ->
          // split each document into lower-case words and drop the stop words
          doc.Split(separator, StringSplitOptions.RemoveEmptyEntries)
          |> Seq.map (fun word -> word.ToLower())
          |> Seq.filter (fun word -> not (stopwords.Contains(word)))
          )
      |> Seq.concat
      |> Seq.groupBy (fun x->x)
      |> Seq.map (fun (word, wordSeq) -> (word, wordSeq |> Seq.length))
      |> Seq.sortBy (fun (_, wordCnt) -> - wordCnt)   // descending by count
      |> Seq.take K
      |> Seq.toList

  let rawpage = fetchUrlSimple @""
  let titles, abstracts =
      // replace HTML entities with spaces before matching
      let page = Regex.Replace(rawpage, "&#?[a-z0-9]+;", " ")
      let titleMatches = Regex.Matches(page, "class=\"title\">(.*?)</a>")
      let abstractMatches = Regex.Matches(page.Replace("\n"," "), "class=\"description\">(.*?)</div>")
      let matchesToSeq (matches: MatchCollection) =
          seq {
              for m in matches do
                  yield m.Groups.[1].Value
          }
      matchesToSeq titleMatches, matchesToSeq abstractMatches

  topKWords titles 20
  topKWords abstracts 20
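Since the script above depends on a downloaded page, here is a minimal, self-contained sketch of the same counting step run on a few in-memory strings. The sample titles, the short stop-word list and the name topWords are made up for illustration. It uses Seq.countBy in place of groupBy plus length, and Seq.truncate so that asking for more words than exist does not throw.

```fsharp
open System

// Hypothetical sample titles standing in for the downloaded page.
let sampleTitles =
    [ "Windows Phone 7 Boot Camp"
      "What's New in Windows Azure"
      "HTML5 for Silverlight Developers"
      "Windows Phone Performance" ]

// A deliberately tiny stop-word list, just for this example.
let sampleStopwords = Set [ "in"; "for"; "the"; "a" ]

// Count words across all documents and keep the k most frequent ones.
let topWords (stopwords: Set<string>) (k: int) (docs: seq<string>) =
    docs
    |> Seq.collect (fun doc -> doc.Split([| ' ' |], StringSplitOptions.RemoveEmptyEntries))
    |> Seq.map (fun word -> word.ToLowerInvariant())
    |> Seq.filter (fun word -> not (stopwords.Contains word))
    |> Seq.countBy id
    |> Seq.sortBy (fun (_, count) -> -count)   // descending by count
    |> Seq.truncate k                          // unlike Seq.take, tolerates short input
    |> Seq.toList

let top2 = topWords sampleStopwords 2 sampleTitles
printfn "%A" top2
```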


My colleague Defu Lian wrote a C# version of the topKWords function using LINQ:

Code Snippet
  // Requires: using System; using System.Collections.Generic; using System.Linq;
  static List<Tuple<string, int>> topKWords(IEnumerable<string> docs,int K)
  {
      char[] separator = { ' ', '\r', '\n', '-', '.', ',', '\t', '!', '?', '\'', ';', '/' };
      var stopwords = new HashSet<string>("a,able,about,across,after,all,almost,also,am,among,an,and,any,are,as,at,be,because,been,but,by,can,cannot,could,dear,did,do,does,either,else,ever,every,for,from,get,got,had,has,have,he,her,hers,him,his,how,however,i,if,in,into,is,it,its,just,least,let,like,likely,may,me,might,most,must,my,neither,no,nor,not,of,off,often,on,only,or,other,our,own,rather,said,say,says,she,should,since,so,some,than,that,the,their,them,then,there,these,they,this,tis,to,too,twas,us,wants,was,we,were,what,when,where,which,while,who,whom,why,will,with,would,yet,you,your".Split(','));
      return docs.SelectMany(doc => doc.Split(separator, StringSplitOptions.RemoveEmptyEntries)
                          .Select(word => word.ToLower())
                          .Where(word => !stopwords.Contains(word)))
          .GroupBy(a => a)
          .Select(wordgroup => Tuple.Create<string,int>(wordgroup.Key, wordgroup.Count()))
          .OrderByDescending(w2c => w2c.Item2)
          .Take(K)
          .ToList();
  }

Sunday, April 10, 2011

A Note on F# Quotations

I worked on extending the F# ODSL (Optimization Domain Specific Language, link1 and link2) over the weekend. This was the first time I had used F# quotations in a non-trivial way. Before that, I had tried some quotation examples from Programming F# while learning the language, and had read code in the F# PowerPack (for LINQ integration) and other libraries.

Only after actually writing some programs with quotations could I appreciate this F# feature more and form some thoughts of my own. In this post, I’d like to share these thoughts. I don’t intend to go into detailed F# code, but maybe in the future I will write a separate post about my work on extending the F# ODSL.

General idea of F# Quotations

The idea of quotation goes back at least to Lisp, where a program is also a kind of data: the execution behavior of a piece of code is completely controllable by the user, who can treat it as input data and write a custom evaluator for it. The default Lisp evaluator is eval, and we can easily write a custom one to change the default behavior [LispEval].

In F#, we can also treat a piece of F# code as data by quoting it:

let quotedProgramAsData =
    <@
        let y = 1 + 2 * 30
        20 * y
        // code here:
        // an F# program using a subset of F# language features
    @>

When we create the value quotedProgramAsData, the expression inside the quotation is not computed; instead the following value is generated:

val quotedProgramAsData : Quotations.Expr<int> =
  Let (y,
       Call (None, Int32 op_Addition[Int32,Int32,Int32](Int32, Int32),
             [Value (1),
              Call (None, Int32 op_Multiply[Int32,Int32,Int32](Int32, Int32),
                    [Value (2), Value (30)])]),
       Call (None, Int32 op_Multiply[Int32,Int32,Int32](Int32, Int32),
             [Value (20), y]))


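Let’s visualize it in a tree form:

Let
├─ y
├─ Call op_Addition
│   ├─ Value 1
│   └─ Call op_Multiply
│       ├─ Value 2
│       └─ Value 30
└─ Call op_Multiply
    ├─ Value 20
    └─ y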


The F# compiler generates this tree structure for free; the rest is up to how you process or evaluate the tree. If the above code were not in a quotation, the F# compiler would generate code that evaluates the second argument of Let (1 + 2 * 30), binds the result to the first argument (y), and then continues with the third argument (20 * y). Because the code is in a quotation, only the expression tree is generated, without any execution behavior attached to it.

For different purposes, we can write different evaluators for the quoted F# code. Below I list some of the application areas.
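Before turning to those areas, here is a minimal sketch of what "writing an evaluator" can look like for the small quotation above. It is an illustration only, not how any particular library does it: the names eval and show are made up, and only integer constants, variables, let, + and * are handled. The patterns come from the standard Quotations.Patterns and Quotations.DerivedPatterns modules.

```fsharp
open Microsoft.FSharp.Quotations
open Microsoft.FSharp.Quotations.Patterns
open Microsoft.FSharp.Quotations.DerivedPatterns

// Evaluator 1: compute the integer value of the quoted expression.
let rec eval (env: Map<string, int>) (expr: Expr) : int =
    match expr with
    | Int32 n -> n
    | Var v -> env.[v.Name]
    | Let (v, bound, body) -> eval (env.Add(v.Name, eval env bound)) body
    | SpecificCall <@@ (+) @@> (_, _, [a; b]) -> eval env a + eval env b
    | SpecificCall <@@ ( * ) @@> (_, _, [a; b]) -> eval env a * eval env b
    | _ -> failwithf "unsupported expression: %A" expr

// Evaluator 2: walk the same tree, but render it as text instead of computing it.
let rec show (expr: Expr) : string =
    match expr with
    | Int32 n -> string n
    | Var v -> v.Name
    | Let (v, bound, body) -> sprintf "let %s = %s in %s" v.Name (show bound) (show body)
    | SpecificCall <@@ (+) @@> (_, _, [a; b]) -> sprintf "(%s + %s)" (show a) (show b)
    | SpecificCall <@@ ( * ) @@> (_, _, [a; b]) -> sprintf "(%s * %s)" (show a) (show b)
    | _ -> failwithf "unsupported expression: %A" expr

let quoted =
    <@
        let y = 1 + 2 * 30
        20 * y
    @>

let computed = eval Map.empty quoted.Raw   // the value of the quoted program
let rendered = show quoted.Raw             // the quoted program as text
printfn "%d" computed
printfn "%s" rendered
```

The two functions traverse exactly the same tree; only the meaning given to each node differs, which is the point of treating the program as data.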

Domain specific language (DSL)

There are different kinds of domain specific languages. In his book Domain Specific Languages, Martin Fowler summarizes them into external and internal ones:

DSLs come in two main forms: external and internal. An external DSL is a language that's parsed independently of the host general purpose language: good examples include regular expressions and CSS. External DSLs have a strong tradition in the Unix community. Internal DSLs are a particular form of API in a host general purpose language, often referred to as a fluent interface. The way mocking libraries, such as JMock, define expectations for tests are good examples of this, as are many of the mechanisms used by Ruby on Rails. Internal DSLs also have a long tradition of usage, particularly in the Lisp community.

Quotations can be used as a very good language feature for implementing internal domain specific languages.

As stated before, the quoted program is still an F# program. How can it look like a domain-specific one? Remember that F# has a rich syntax, while a small subset of it is usually expressive enough for a domain language.

Take a look at the following program, which expresses the quadratic optimization for Support Vector Machines:

// solve the SVM using ODSL
let dsl_solver =
    <@
        qp()
        let index = [ 0..n-1]
        let alpha = vararray1 (index)
        maximise (sum index (fun i -> alpha.[i]) -
                  (sum index (fun i -> sum index (fun j ->
                      coef.[i,j] * alpha.[i] * alpha.[j])))) // Eq. (1)
        foreach index (fun i -> 0.0 <= alpha.[i]) // Eq. (2)
        foreach index (fun i -> alpha.[i] <= C) // Eq. (2)
        sum index (fun i -> alpha.[i] * y.[i]) = 0. // Eq. (3)
    @>

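and its optimization formulation, written with the same symbols as the code (coef, alpha, y, C):

$$\max_{\alpha}\; \sum_{i=0}^{n-1} \alpha_i \;-\; \sum_{i=0}^{n-1} \sum_{j=0}^{n-1} \mathrm{coef}_{ij}\, \alpha_i\, \alpha_j \qquad (1)$$

Subject to constraints:

$$0 \le \alpha_i \le C, \quad i = 0, \dots, n-1 \qquad (2)$$

$$\sum_{i=0}^{n-1} \alpha_i\, y_i = 0 \qquad (3)$$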



The correspondence between the DSL code and the mathematical formulas is very clear. In future work, we could even vectorize this DSL to eliminate the sum function (range -> lambda (int -> value) -> value), which would make the program and the formulas even more similar.

Note that in the above example qp, maximise, sum and foreach are not magic; they are all ordinary F# functions whose behavior is defined by the DSL. For example, the F# function qp() only indicates that the code that follows is a quadratic program.

Compare this code with the code in my SVM post: it is clearer and thus easier to write.
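To make the "not magic" point concrete, here is a small sketch of how DSL vocabulary can be plain F# functions that only get their meaning from whoever interprets the quotation. This is not the ODSL implementation: maximise and sum below are hypothetical stand-ins with made-up signatures, and calledFunctions is an illustrative helper that just lists the functions a quoted model calls.

```fsharp
open Microsoft.FSharp.Quotations
open Microsoft.FSharp.Quotations.Patterns
open Microsoft.FSharp.Quotations.ExprShape

// Hypothetical DSL vocabulary: ordinary F# functions with no useful body.
// They are never executed; they only give the quoted code something to refer to.
let maximise (objective: float) : unit =
    failwith "maximise is only meaningful inside a quotation"

let sum (index: int list) (term: int -> float) : float =
    failwith "sum is only meaningful inside a quotation"

// Walk the expression tree and list the functions it calls, outermost first.
// (The receiver of instance calls is ignored to keep the sketch short.)
let rec calledFunctions (expr: Expr) : string list =
    match expr with
    | Call (_, methodInfo, args) -> methodInfo.Name :: List.collect calledFunctions args
    | ShapeVar _ -> []
    | ShapeLambda (_, body) -> calledFunctions body
    | ShapeCombination (_, args) -> List.collect calledFunctions args

// Quoting the model does not run maximise or sum, so nothing throws here.
let model = <@@ maximise (sum [ 0; 1; 2 ] (fun i -> 1.0)) @@>
let names = calledFunctions model
printfn "%A" names
```

A real DSL interpreter would do more at each recognised call, for example turning the argument of maximise into an objective for a solver, but the recognition step has this shape.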

The restriction of such a DSL inside F# is that its syntax must follow that of F#: it has to be a valid program recognizable by the F# lexer and parser. This is why DSL people like Lisp, which has such a simple and flexible syntax (just brackets), although it can also be tedious and confusing when one gets lost in the brackets.

High performance computing

We know that F#’s Seq module and C#’s LINQ share some common features, e.g. chained operations and laziness. Because these operations are so fundamental, the LINQ team has spent an enormous amount of time optimizing LINQ. F#’s implementation, on the other hand, is a fairly standard one without heavy optimization, because the F# compiler team is not big and has other important tasks to do. (F#’s Seq module does do some optimization, e.g. special routines are called for arrays and lists instead of the general ones for IEnumerable<T> objects. But LINQ does more!)

So one idea for speeding up sequence operations in F# is to use LINQ’s equivalent functions. This can be implemented by putting F#’s sequence operations in a quotation and rewriting them as LINQ expressions, as done in the F# PowerPack. Interested readers can read the source code of LINQ for F# in the PowerPack.
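To show what "LINQ’s equivalent functions" means at the call level, here is a small hand-written sketch of the same pipeline in both styles. It does not use quotations or the PowerPack; it only illustrates the correspondence (Seq.filter / Where, Seq.map / Select, Seq.sum / Sum) that an automatic rewriter would rely on. The sample data is made up.

```fsharp
open System.Linq

let numbers = [| 1 .. 10 |]

// F# Seq pipeline: chained, lazy operations.
let viaSeq =
    numbers
    |> Seq.filter (fun x -> x % 2 = 0)
    |> Seq.map (fun x -> x * x)
    |> Seq.sum

// The same computation written against LINQ's extension methods.
let viaLinq =
    numbers.Where(fun x -> x % 2 = 0).Select(fun x -> x * x).Sum()

printfn "%d %d" viaSeq viaLinq
```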

The above example is about basic data structures. As a data mining guy, let me move on to numerical computing: we can quote a piece of numerical code in F# and translate it into Fortran/C, and a JIT Fortran/C compiler then compiles the translated code into native machine code that uses the special instruction set of the platform (e.g. vectorized instructions on a P4 CPU). We can even compile this piece of F# for the GPU and use the parallel computing power there [Syme_ML06].
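As a rough sketch of the translation step only (no JIT compiler and no GPU involved), the following turns a quoted numerical F# function into the text of a C function. The names toC and toCFunction are made up for this example, and only float constants, variables, +, * and sqrt are supported; a real translator would have to cover far more of the language.

```fsharp
open System.Globalization
open Microsoft.FSharp.Quotations
open Microsoft.FSharp.Quotations.Patterns
open Microsoft.FSharp.Quotations.DerivedPatterns

// Translate a quoted float expression into a C expression.
let rec toC (expr: Expr) : string =
    match expr with
    | Double x -> x.ToString("R", CultureInfo.InvariantCulture)
    | Var v -> v.Name
    | SpecificCall <@@ (+) @@> (_, _, [a; b]) -> sprintf "(%s + %s)" (toC a) (toC b)
    | SpecificCall <@@ ( * ) @@> (_, _, [a; b]) -> sprintf "(%s * %s)" (toC a) (toC b)
    | SpecificCall <@@ sqrt @@> (_, _, [a]) -> sprintf "sqrt(%s)" (toC a)
    | _ -> failwithf "unsupported expression: %A" expr

// Translate a quoted lambda over floats into a C function definition.
let toCFunction (name: string) (expr: Expr) : string =
    match expr with
    | Lambdas (args, body) ->
        let parameters =
            args
            |> List.concat
            |> List.map (fun v -> "double " + v.Name)
            |> String.concat ", "
        sprintf "double %s(%s) { return %s; }" name parameters (toC body)
    | _ -> failwith "expected a quoted lambda"

let cSource =
    toCFunction "norm" <@@ fun (x: float) (y: float) -> sqrt (x * x + y * y) @@>
printfn "%s" cSource
```

The generated text would still have to be handed to a C compiler; that part is outside the scope of this sketch.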

But any fancy optimization has some overhead too. For example, the compilation itself takes time.

Education: Compiler courses

The last application area is education. Few CS departments teach functional programming; some mention it when teaching programming language concepts. F# is a push for FP into industry and the mainstream. I hope that in the near future schools will offer short or elective courses on F#. See the recent F# in Education Workshop [Education] for details.

OK, let’s focus on compiler courses. Compilers, unlike functional programming, are taught in nearly all CS departments. However, the projects in compiler courses are a huge pain for students. They are so complicated! The first half of the course (lexing and parsing) and the second half (code generation and optimization) are rather separate, and the theories behind the two halves are different. Sometimes the instructor focuses too much on lexing and parsing and does not leave enough time for students to work on code generation and optimization, which in my mind are more important in a compiler course. Lexing and parsing also occur in other CS courses, e.g. theory of computation, while code generation and optimization are only covered in a compiler course.

With F# quotations, the F# compiler generates the expression tree for free. So convenient and straightforward! Since the F# syntax allowed in a quotation covers a capable subset of imperative and functional features, the expression trees available for code generation and optimization are non-trivial. With tools like .NET Reflector, we can also study how the standard F# compiler compiles the same piece of code and learn the tricks there. Students can work on code generation for closures and tail-recursive calls, important language features that are not implemented, or not fully implemented, in many mainstream languages.
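As an idea of what a first exercise in such a course could look like, here is a minimal sketch that compiles a quoted integer expression to instructions for a tiny stack machine and then runs them. The instruction set (Instr) and the functions compile and run are invented for this illustration; closures and tail calls, the interesting parts, are left out.

```fsharp
open Microsoft.FSharp.Quotations
open Microsoft.FSharp.Quotations.Patterns
open Microsoft.FSharp.Quotations.DerivedPatterns

// A made-up instruction set for a tiny stack machine.
type Instr =
    | Push of int
    | Load of string
    | Store of string
    | Add
    | Mul

// Code generation: the parser's job is already done by the quotation.
let rec compile (expr: Expr) : Instr list =
    match expr with
    | Int32 n -> [ Push n ]
    | Var v -> [ Load v.Name ]
    | Let (v, bound, body) -> compile bound @ [ Store v.Name ] @ compile body
    | SpecificCall <@@ (+) @@> (_, _, [a; b]) -> compile a @ compile b @ [ Add ]
    | SpecificCall <@@ ( * ) @@> (_, _, [a; b]) -> compile a @ compile b @ [ Mul ]
    | _ -> failwithf "unsupported expression: %A" expr

// A reference interpreter for the generated code.
let run (code: Instr list) : int =
    let step (stack: int list, env: Map<string, int>) instr =
        match instr, stack with
        | Push n, _ -> (n :: stack, env)
        | Load name, _ -> (env.[name] :: stack, env)
        | Store name, top :: rest -> (rest, env.Add(name, top))
        | Add, b :: a :: rest -> ((a + b) :: rest, env)
        | Mul, b :: a :: rest -> ((a * b) :: rest, env)
        | _ -> failwith "stack underflow"
    let finalStack, _ = List.fold step ([], Map.empty) code
    List.head finalStack

let quotedProgram =
    <@
        let y = 1 + 2 * 30
        20 * y
    @>

let instructions = compile quotedProgram.Raw
let output = run instructions
printfn "%A" instructions
printfn "%d" output
```

An optimization pass, such as folding 1 + 2 * 30 into a constant before emitting code, would be a natural next step on the same tree.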


Some references in the above text:

[LispEval] Christian Queinnec, Lisp in Small Pieces.

[Syme_ML06] Leveraging .NET Meta-programming Components from F#, ML Workshop, 2006.

[Education] F# in Education Workshop.

A very good paper by Leijen and Meijer:

[Leijen&Meijer] Domain specific embedded compilers, Sigplan Notices, vol. 35, no. 1, pp. 109-122, 2000.

F# quotations make extensive use of active patterns for expression tree pattern matching:

[Syme_Active] Extensible Pattern Matching via a Lightweight Language Extension, ICFP 2007.

Here are some relevant blog posts:

[Petricek] F# Overview (IV.) - Language Oriented Programming.

[RubyDSL] Building a DSL in Ruby.

There are also two nice answers by Petricek & Harrop on Stack Overflow discussing quotations (F#, OCaml and Lisp) and DSLs.


I’d like to thank Nathan Brixius for discussions on F# ODSL!