Saturday, May 16, 2015

Is Rust good for data mining?

 

Rust 1.0 has just been released. This is a great achievement for the Rust team! I have watched Rust for some time. One phrase on its website summarizes the language rather well: “zero-cost abstraction”. I was attracted to Rust by Poss’s article, Rust for functional programmers. But the more I learn about the language, the more I find that its syntax-level similarity to Haskell and ML is only superficial; what Rust really wants to be is a systems programming language with safe, fine-grained memory management.

C has the “zero-cost” part, but it is hard to build high-level abstractions in it. If we want concurrency in C, we must explicitly use a threading library (e.g. OpenMP or Pthreads), or go even lower and build threads out of more primitive OS-level functions.

At the other extreme, languages such as Haskell and Scala have the “abstraction” part. They let us build beautiful APIs that others can use without caring how they are implemented. But abstraction has a cost. The memory usage of Haskell and Scala programs is harder to predict than that of C programs. When we chain functions such as map/filter, we don’t know exactly how many intermediate objects are created; it depends on the underlying library implementation and the optimization ability of the compiler, neither of which is common knowledge to the application programmer. Extra cost also hides in the constants inside the big O: going through an iterator interface pays for every .next() call, while iterating over a bare array in C is much faster.
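To make the intermediate-object question concrete, here is a minimal sketch in Rust (the language this post is about, rather than Haskell or Scala) contrasting an eager style that materializes a whole collection at every step with a lazy iterator chain that makes a single pass. The vector data is just made-up input.

fn main() {
    let data: Vec<i64> = (1..1001).collect();

    // Eager style: every step allocates an intermediate Vec,
    // roughly the situation the paragraph above worries about.
    let squared: Vec<i64> = data.iter().map(|e| e * e).collect();
    let kept: Vec<i64> = squared.into_iter().filter(|e| e % 3 == 0).collect();
    let sum_eager: i64 = kept.iter().sum();

    // Lazy style: the adapters are fused into one pass with no
    // intermediate collections; the compiler typically lowers this
    // to a plain loop over the original data.
    let sum_lazy: i64 = data.iter().map(|e| e * e).filter(|e| e % 3 == 0).sum();

    assert_eq!(sum_eager, sum_lazy);
    println!("{}", sum_lazy);
}

Which of these two behaviours a Haskell or Scala chain ends up with depends on the library and the compiler, which is exactly why the memory behaviour is hard to predict from the call site.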

How can Rust offer zero-cost abstraction? Essentially, through its memory management system. This is the genuinely innovative part of Rust: the underlying theory was worked out in academic papers and earlier small languages, but Rust is the first to turn it into solid engineering. In F#, we can easily write

col |> Seq.map (fun e -> e * e) |> Seq.filter (fun e -> e % 3 = 0) |> Seq.sum

and we don’t know how memory is managed. Will map and filter create many small objects that must be garbage collected? The F# programmer can ignore these questions. He can, of course, dig into the Seq module’s source code and the final compiled IL to learn the details. But by default, the F# programmer need not care.

If the same program is written in Rust, the programmer has to control exactly how each object is created, and he expresses that control by writing extra annotations in the code. This adds a burden to the programmer’s mind!
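As a rough illustration (a sketch, not the only way to write it), a Rust version of the F# pipeline above might look like the following; the borrow annotations (&[i64], &col) are the kind of extra decision the previous paragraph refers to, and col is a hypothetical input vector.

// Borrowing version: the function takes a shared slice reference,
// so the caller keeps ownership and nothing is moved or copied.
fn sum_of_squares_div3(col: &[i64]) -> i64 {
    col.iter()
        .map(|e| e * e)          // square each element, lazily
        .filter(|e| e % 3 == 0)  // keep multiples of three
        .sum()                   // one pass, no intermediate Vec
}

fn main() {
    let col: Vec<i64> = vec![1, 2, 3, 4, 5, 6];

    // Passing &col borrows the data; col is still usable afterwards.
    let total = sum_of_squares_div3(&col);
    println!("{} (col still has {} elements)", total, col.len());
}

Whether to take &[i64], a Vec<i64> by value, or something else entirely is precisely the sort of ownership decision the F# programmer never has to make.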

In data mining, writing correct code for the numerics and the algorithmic logic is already hard; why would a data miner want to keep these programming issues in his mind as well? I would not. This is partly because I am not familiar with Rust’s borrow system. I believe that after enough training, I could reach a state of caring less and less about memory when programming data mining applications. But why should I in the first place? Fine-grained memory control is not the primary issue in data mining applications. If performance is not that critical, any static language such as F# or Scala performs well enough. Need more performance? Code in C++, allocate all the memory deterministically, and avoid big-object copies and dynamic heap allocation while the critical components are running!