Wednesday, November 16, 2011

Installing Ruby & Rails & RVM on Windows 7

I am writing this post in case someone runs into the same issues I had when installing Ruby & Rails on Windows; maybe it will save somebody a few hours.

One limitation I had: I needed to work with several versions of Ruby, so I needed RVM (Ruby Version Manager), which is not available for Windows.

My config: Win 7 64 bit

In this case, you have two options:
  • Use Cygwin
  • Install on Virtual Machine

Use Cygwin - did not work for me!

  • Install RVM.
    I started with this option. To install RVM, there is some great help here
  • Install RubyGems. First download RubyGems, then run ruby setup.rb
  • Install Rails by typing: gem install rails. And I just got the following errors:
Building native extensions.  This could take a while...
      0 [main] ruby 1192 C:\cygwin\bin\ruby.exe: *** fatal error - unable to rem
ap \\?\C:\cygwin\lib\ruby\1.8\i386-cygwin\ to same address as parent: 0x1B
0000 != 0x210000
      Stack trace:
Frame     Function  Args
023F9BB8  6102796B  (023F9BB8, 00000000, 00000000, 00000000)
023F9EA8  6102796B  (6117EC60, 00008000, 00000000, 61180977)
023FAED8  61004F1B  (611A7FAC, 61243684, 001A0000, 00210000)
End of stack trace
      1 [main] ruby 3856 fork: child 1188 - died waiting for dll loading, errno
      0 [main] collect2 3220 fork: child -1 - died waiting for longjmp before in
itialization, retry 10, exit code 0xC0000135, errno 11
ERROR:  Error installing rails:
        ERROR: Failed to build gem native extension.

        /usr/bin/ruby.exe extconf.rb
checking for re.h... yes
checking for ruby/st.h... no
creating Makefile

gcc -I. -I/usr/lib/ruby/1.8/i386-cygwin -I/usr/lib/ruby/1.8/i386-cygwin -I. -DHA
VE_RE_H    -g -O3   -Wall  -c parser.c
gcc -shared -s -o parser.o -L. -L/usr/lib -L.  -Wl,--enable-auto-image
-base,--enable-auto-import,--export-all   -lruby  -ldl -lcrypt
collect2: fork: Resource temporarily unavailable
      0 [main] collect2 3220 fork: child -1 - died waiting for longjmp before in
itialization, retry 10, exit code 0xC0000135, errno 11
make: *** [] Error 1
I never got past this issue, and I tried for quite long enough. So if you are running the same configuration as I was, be aware that you might end up like this...

Installing on Ubuntu 11 in VMWare

This should be just a piece of cake, I thought.
  • Ubuntu comes with Ruby already installed.
  • Install RVM: sudo apt-get install ruby-rvm
  • Install RubyGems: sudo apt-get install rubygems
  • Install Rails: sudo gem install rails
  • And Bundle: sudo gem install bundle
Our application needed some special libraries, which had prerequisites I did not have installed:
To install "nokogiri" I had to do:
sudo apt-get install libxslt-dev
sudo gem install nokogiri

To install "rmagick":
sudo apt-get install libmagickwand-dev
sudo gem install rmagick

Then I ran "bundle" on the application, which did actually finish, but with some warnings, and the application did not run. So I started by cleaning up the warnings:

First warning:

Invalid gemspec in [/var/lib/gems/1.8/specifications/capybara-1.1.1.gemspec]: invalid date format in specification: "2011-09-04 00:00:00.000000000Z"
Invalid gemspec in [/var/lib/gems/1.8/specifications/polyamorous-0.5.0.gemspec]: invalid date format in specification: "2011-09-03 00:00:00.000000000Z"
Apparently this is quite common, and apparently there is a different solution for everyone. What worked for me was:

sudo gem install rubygems-update
sudo update_rubygems 

Another strange warning I was getting:
ERROR:  While executing gem ... (Gem::DocumentError)
    ERROR: RDoc documentation generator not installed: no such file to load -- json

Solved by:

gem install rdoc-data
rdoc-data --install

After that I had to reinstall "bundle", and gem actually reinstalled all the dependencies. But after that, it ran!

I also needed to access my application from our network. VMWare has two options for setting up the network: NAT and Bridged. The bridged interface did not work with Ubuntu 11 (I don't know why).
So when you are in NAT mode and need to access your VM, you will have to configure port forwarding.
Here is a good way to set it up.

I get that Ruby is not Windows friendly, but on Ubuntu it was not much better: until I solved those strange warnings, nothing worked correctly. And that was a clean install, I mean a clean machine, with Ruby pre-installed! I was just adding RubyGems, and it took me half a day...

Saturday, November 12, 2011

Universal Naive Bayes Classifier for C#

This post is dedicated to describe the internal structure and the possible use of Naive Bayes classifier implemented in C#.

I was searching for a machine learning library for C#, something that would be equivalent to what WEKA is to Java. I found one, but it did not include Bayesian classification (the one I was interested in), so I decided to implement it into the library.

How to use it

One of the aims of the library is to allow users to work with simple POCOs for classification. This can be achieved by using C# attributes. Take a look at the following example, which treats categorization of payments based on two features: Amount and Description.
First, this is the Payment POCO with the added attribute:
public class Payment
{
    [StringFeature(SplitType = StringType.Word)]
    public String Description { get; set; }

    public Decimal Amount { get; set; }

    public String Category { get; set; }
}
And here is how to train the Naive Bayes classifier using a set of payments and then classify a new payment:
var data = Payment.GetData();            
NaiveBayesModel<Payment> model = new NaiveBayesModel<Payment>();
var predictor = model.Generate(data);
var item = predictor.Predict(new Payment { Amount = 110, Description = "SPORT SF - PARIS 18 Rue Fleurus" });

After execution, the item.Category property should be set to a value based on the analysis of the previously supplied payments.

About Naive Bayes classifier

This is just a small and simplified introduction; refer to the Wikipedia article for more details about Bayesian classification.

Naive Bayes is a very simple classifier based on the premise that all the features (or characteristics) of the classified items are independent. This is not really true in real life, which is why the model is called naive.
The probability of an item with features F1, F2, F3 being of category "C1" can be expressed (up to a normalizing constant) as:

P(C1|F1,F2,F3) ∝ P(C1)*P(F1|C1)*P(F2|C1)*P(F3|C1)
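To make the multiplication concrete, here is a tiny Python sketch with invented prior and likelihood values (none of these numbers come from real data):

```python
# Hypothetical priors and per-feature likelihoods for two categories
# and three binary features. All numbers are made up for illustration.
priors = {"C1": 0.6, "C2": 0.4}
likelihoods = {
    "C1": [0.8, 0.5, 0.1],   # P(F1|C1), P(F2|C1), P(F3|C1)
    "C2": [0.2, 0.7, 0.9],   # P(F1|C2), P(F2|C2), P(F3|C2)
}

def score(category):
    # prior multiplied by each feature likelihood
    s = priors[category]
    for p in likelihoods[category]:
        s *= p
    return s

scores = {c: score(c) for c in priors}
best = max(scores, key=scores.get)
# C1: 0.6*0.8*0.5*0.1 = 0.024, C2: 0.4*0.2*0.7*0.9 = 0.0504 -> C2 wins
```

The scores are not normalized probabilities, but since we only pick the maximum, normalization does not matter.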

Where P(C1) is the a priori probability of the item being of category C1, and P(F1|C1) is the Posteriori (conditional) probability of an item having feature F1 when it is of category C1.
That is simple for binary features (like "Tall", "Rich"...). For example, P(Tall|UngulateAnimal) = 0.8 says that the probability of an animal being tall, given that it is an ungulate, is 0.8.

If we have continuous features (like the "Amount" in the payment example), the Posteriori probability is expressed slightly differently. For example, P(Amount=123|Household) = 0.4 can be read as: the probability (density) of the amount being 123$, given that the payment is one of my household payments, is 0.4.

When we classify, we compute the total probability for each category (or class, if you want) and select the category with the maximal probability. We thus have to iterate over all the categories and all the features of each item, multiplying the probabilities to obtain the probability of the item belonging to each class.

How it works inside

After calling the Generate method on the model, a NaiveBayesPredictor class is created. This class contains the Predict method used to classify new objects.
My model can work with three types of features (or characteristics, or properties):
  • String properties. These properties have to be converted to binary vectors based on the words they contain. The classifier builds a list of all words existing in the set, and then each String feature can be represented as a set of binary features. For example, if the bag of all words contains four words (Hello, World, Is, Cool), then the vector [0,1,0,1] represents the text "World Cool".
  • Binary properties. Simple true or false properties.
  • Continuous properties. By default these are Double or Decimal values, but the list could be extended to other types.
After converting the String features to binary features, we have two types of features:
  • Binary features
  • Continuous features
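The String-to-binary-vector conversion described above can be sketched in a few lines of Python (the to_binary_vector helper name is mine; the vocabulary and text are the ones from the example):

```python
def to_binary_vector(text, vocabulary):
    # 1 if the word occurs in the text, 0 otherwise
    words = set(text.split())
    return [1 if w in words else 0 for w in vocabulary]

vocabulary = ["Hello", "World", "Is", "Cool"]
print(to_binary_vector("World Cool", vocabulary))  # [0, 1, 0, 1]
```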
As mentioned in the introduction, for each feature of the item we have to compute the A priori and Posteriori probabilities. The following pseudocode shows how to estimate both. I use array-like notation, simply because I have used arrays in the implementation as well.

Apriori probability

The computation of the Apriori probability is the same for both types of features.

Apriori[i] = #ItemsOfCategory[i] / #Items
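In code this estimate is just a count ratio; a minimal Python sketch with made-up category labels:

```python
# Hypothetical training labels, one per item.
labels = ["Food", "Food", "Household", "Food", "Travel"]

def apriori(category, labels):
    # fraction of training items belonging to the category
    return labels.count(category) / len(labels)

print(apriori("Food", labels))  # 0.6
```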

Posteriori probability

The Posteriori probability for binary features is estimated as:

Posteriori[i][j] = #ItemsHavingFeature[j]AndCategory[i] / #ItemsOfCategory[i]

And the Posteriori probability for continuous features:

Posteriori[i][j] = Normal(Avg[i][j],Variance[i][j],value)

Where Normal is the normal (Gaussian) probability density. Avg[i][j] is the average value of feature "j" over items of category "i", and Variance[i][j] is the variance of feature "j" over items of category "i".
If we want to know the probability of a payment with Amount=123 being of category "Food", and the average of all payments of that category is, let's say, Avg[Food][Amount] = 80 with Variance[Food][Amount] = 24, then the Posteriori probability is Normal(80, 24, 123).
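A quick Python sketch of the Normal(Avg, Variance, value) computation, using the numbers from the example above (the normal helper name is mine):

```python
import math

def normal(avg, variance, value):
    # Gaussian probability density with the given mean and variance
    return math.exp(-((value - avg) ** 2) / (2 * variance)) / math.sqrt(2 * math.pi * variance)

p = normal(80, 24, 123)
# An amount of 123 is far from the mean 80 relative to such a small
# variance, so the resulting density is vanishingly small.
```

Note that for continuous features this is a density, not a probability; it can even exceed 1 for very small variances, but that is fine since we only compare the per-category products.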

What does the classifier need?

The answer to this question is quite simple: we need at least 4 structures; their meaning should be clear from the previous explanation.

public double[][] Posteriori { get; set; }
public double[] Apriori { get; set; }
public double[][] CategoryFeatureAvg { get; set; }
public double[][] CategoryFeatureVariance { get; set; }
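A minimal sketch of how these four structures can be filled, written in Python with dictionaries instead of arrays, for items with one binary and one continuous feature (the train helper and the sample data are mine, just to illustrate the counting):

```python
def train(items):
    # items: list of (binary_feature, continuous_value, category) tuples
    categories = sorted({c for _, _, c in items})
    n = len(items)
    apriori, posteriori, avg, variance = {}, {}, {}, {}
    for cat in categories:
        subset = [(b, v) for b, v, c in items if c == cat]
        apriori[cat] = len(subset) / n
        # binary feature: fraction of the category's items having the feature
        posteriori[cat] = sum(b for b, _ in subset) / len(subset)
        # continuous feature: sample mean and variance within the category
        values = [v for _, v in subset]
        avg[cat] = sum(values) / len(values)
        variance[cat] = sum((v - avg[cat]) ** 2 for v in values) / len(values)
    return apriori, posteriori, avg, variance

data = [(1, 10.0, "A"), (0, 12.0, "A"), (1, 100.0, "B"), (1, 110.0, "B")]
apriori, posteriori, avg, variance = train(data)
```

In the real implementation the same counting is done per feature index "j", which is where the two-dimensional arrays come from.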

And how does it classify?

As said before, classification is a loop over all the categories in the set. For each category we compute the probability by multiplying the Apriori probability with the Posteriori probability of each feature. As we have two types of features, the computation differs between them. Take a look at this quite simplified code:

public T Predict(T item)
{
    Vector values = GetValues(item); // represents the item as a vector
    foreach (var category in Categories)
    {
        var probability = Apriori[category];
        foreach (var feature in Features)
        {
            var j = feature.Index;
            if (NaiveBayesModel<T>.ContinuesTypes.Contains(feature.Type))
            {
                // continuous feature: use the normal density
                var normalProbability = Helper.Gauss(values[j], CategoryFeatureAvg[category][j], CategoryFeatureVariance[category][j]);
                probability = probability * normalProbability;
            }
            else if (feature.Type == typeof(bool)) // String properties are also converted to binary
            {
                probability = probability * Posteriori[category][j];
            }
        }
        if (probability > maxProbability)
        {
            maxProbability = probability;
            maxCategory = category;
        }
    }
    // ...set the item's category to maxCategory and return it
}
That's all there is to it. Once you understand that we need just 4 arrays, it is only a question of how to fill them. That is not hard (it should be clear from the previous explanation), but it takes some plumbing and looping over all the items in the learning collection.
If you would like to see the source code, check my fork.