The Wall: February 2012

Saturday, 25 February 2012

Applications of Graph Theory

Graph theory is a beautiful part of mathematics. Computer science is heavily based on it, and it is not alone: graph theory has many applications in operations research, combinatorial optimization, and bioinformatics.
For my personal classification, I have separated the tasks that you can solve using graph theory into two groups:

Obvious applications – the analogy between the graph and the problem is quite easy to imagine (maps, cities, relations etc.). In this post you can find the following:
- Vehicle Routing Problem
- Graph coloring
- Map coloring
Hidden applications - tasks which you would never assume can be solved using graph theory. Then you see one of them, and you think: “Wow, I wonder who came up with that one…”. I will provide the following ones in this post:
- Image or 3D model reconstruction from projections
- Proof of NP-hardness of Integer Linear Programming
- Register allocation
- Approximation of data, data compression
- Task scheduling

Obvious applications

Here are some examples of problems for which the creation of the graph to model the problem is quite easy and obvious.

Vehicle routing problem and other variants of TSP

There is a whole set of problems which are just variations of the Traveling Salesman Problem (TSP). The Vehicle Routing Problem (VRP) can be characterized by the following description:

A set of trucks, each with a certain capacity
A set of customers, each with a certain demand for goods
A central depot, where the goods are stored

The task is to route the trucks so that the goods are distributed to the customers and the total distance is minimized. I have written a blog post and a web utility which is able to solve the problem using two algorithms:

Clarke & Wright Savings Algorithm
The Sweep Algorithm

You can find more algorithms for solving VRP here.
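As a rough illustration of the savings idea, here is a minimal sketch in JavaScript. It assumes a symmetric distance matrix in which index 0 is the depot, a single truck capacity, and the usual saving s(i, j) = d(0, i) + d(0, j) - d(i, j). The function name savingsRoutes and its parameters are made up for this example, and it is a simplified variant, not the code behind the web utility.

```javascript
// Clarke & Wright savings heuristic (simplified sketch).
// dist: symmetric matrix, index 0 is the depot; demand[i] is the demand of customer i.
function savingsRoutes(dist, demand, capacity) {
  const n = dist.length;
  // Start with one route per customer: depot -> i -> depot.
  let routes = [];
  for (let i = 1; i < n; i++) routes.push([i]);

  // Saving gained by serving i and j on one route instead of two separate ones.
  const savings = [];
  for (let i = 1; i < n; i++) {
    for (let j = i + 1; j < n; j++) {
      savings.push({ i, j, s: dist[0][i] + dist[0][j] - dist[i][j] });
    }
  }
  savings.sort((a, b) => b.s - a.s);

  const load = (route) => route.reduce((sum, c) => sum + demand[c], 0);

  for (const { i, j, s } of savings) {
    if (s <= 0) break; // merging would not shorten anything
    const ri = routes.find((r) => r.includes(i));
    const rj = routes.find((r) => r.includes(j));
    if (ri === rj) continue;
    // Only customers at the ends of a route can be linked.
    const iIsEnd = ri[0] === i || ri[ri.length - 1] === i;
    const jIsEnd = rj[0] === j || rj[rj.length - 1] === j;
    if (!iIsEnd || !jIsEnd) continue;
    if (load(ri) + load(rj) > capacity) continue;
    // Orient the routes so that i is last in the left part and j is first in the right part.
    const left = ri[ri.length - 1] === i ? ri : [...ri].reverse();
    const right = rj[0] === j ? rj : [...rj].reverse();
    routes = routes.filter((r) => r !== ri && r !== rj);
    routes.push([...left, ...right]);
  }
  return routes; // each route lists customers; the depot is implied at both ends
}
```

Each returned array is one truck tour. Reversing a route is only harmless here because the distances are assumed to be symmetric.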

Graph coloring problem

Given a graph and a number of colors, we want to decide whether it is possible to color each of the vertices in such a way that no two vertices sharing an edge have the same color. Many real-world problems can be formulated as a graph coloring problem. The first one is map coloring.
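To make the decision problem concrete, here is a small backtracking sketch in JavaScript. The adjacency-list representation and the name colorGraph are assumptions made for this example. The search is exponential in the worst case, which is consistent with the problem being hard in general.

```javascript
// Decide whether a graph can be colored with k colors (backtracking sketch).
// adjacency[v] lists the neighbours of vertex v; vertices are numbered from 0.
function colorGraph(adjacency, k) {
  const colors = new Array(adjacency.length).fill(-1);

  function assign(v) {
    if (v === adjacency.length) return true; // every vertex has a color
    for (let c = 0; c < k; c++) {
      // A color is allowed if no already-colored neighbour uses it.
      if (adjacency[v].every((u) => colors[u] !== c)) {
        colors[v] = c;
        if (assign(v + 1)) return true;
      }
    }
    colors[v] = -1; // undo and backtrack
    return false;
  }

  return assign(0) ? colors : null;
}

// A triangle needs three colors.
console.log(colorGraph([[1, 2], [0, 2], [0, 1]], 2)); // null
console.log(colorGraph([[1, 2], [0, 2], [0, 1]], 3)); // [ 0, 1, 2 ]
```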

Map coloring

One of the first applications is the map coloring problem. It has been proven that every map can be colored using 4 colors so that no two neighboring regions share a color. This problem can be converted to a graph coloring problem by placing a vertex inside each country or region in the map. Two vertices are connected if and only if the two countries have a common border. More on this here.
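The conversion from a map to a graph can be sketched in a few lines of JavaScript. The function names mapToGraph and greedyColoring are made up for this example. The greedy strategy is a simplification: it always produces a valid coloring, but it is not guaranteed to stay within 4 colors on every map.

```javascript
// Build a graph from a list of borders: one vertex per region, one edge per shared border.
function mapToGraph(borders) {
  const graph = new Map();
  for (const [a, b] of borders) {
    if (!graph.has(a)) graph.set(a, new Set());
    if (!graph.has(b)) graph.set(b, new Set());
    graph.get(a).add(b);
    graph.get(b).add(a);
  }
  return graph;
}

// Give every region the smallest color number not used by an already-colored neighbour.
function greedyColoring(graph) {
  const color = new Map();
  for (const [region, neighbours] of graph) {
    const used = new Set();
    for (const n of neighbours) if (color.has(n)) used.add(color.get(n));
    let c = 0;
    while (used.has(c)) c++;
    color.set(region, c);
  }
  return color;
}

// The Czech Republic and its four neighbours.
const borders = [
  ["Czechia", "Germany"], ["Czechia", "Poland"], ["Czechia", "Slovakia"], ["Czechia", "Austria"],
  ["Germany", "Poland"], ["Poland", "Slovakia"], ["Slovakia", "Austria"], ["Austria", "Germany"],
];
const coloring = greedyColoring(mapToGraph(borders));
console.log(coloring);
```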

Hidden applications

There are tasks and problems for which you would not intuitively search the solution by applying graph theory. Here are some of them:

Image reconstruction from X-rays – computed tomography

Tomography is a technique used to reconstruct an image or 3D model from a series of projections taken from different angles. When using technologies such as X-rays, the image taken from an angle gives, for each pixel, the total thickness of the scanned object. The question is then how to reconstruct the image from several images which contain only the thicknesses.

As described in the great book “Network Flows – Theory, Algorithms and Applications”, a concrete example of computed tomography is the “Reconstruction of the Left Ventricle from X-ray Projections”. This problem can be solved using network flow theory. The method is applicable only to problems where the scanned object has a uniform structure. As mentioned in the book, this assumes that a well-working ventricle is filled uniformly with a mixture of blood and dye.

The following graphic was taken from the book. It explains the method on a two-dimensional image. Using two projections of the object, we obtain vectors which contain, for each pixel (or other unit), the probable mass hidden behind this pixel. Now it is up to us to find out how this mass is distributed, that is, where the ‘1’s are in the picture. The more projections we have, the better results we can obtain.

The problem is thus simplified to the problem of constructing a binary matrix from the projection sums. This problem is a special case of the feasible flow problem.

The following image shows a similar, very simplified task, which I have taken from the Combinatorial Optimization course offered as part of the Open Informatics program at CTU Prague.

The whole problem can be seen as the question of finding a feasible flow in a network (G, b, u, l, c). So what does the network consist of?

Graph G
s – sources – the nodes which provide the fluid into the network – the nodes with positive values of b
t – appliances (or sinks) – the nodes which consume the fluid – the nodes with negative values of b
u – upper bound for the flow on each edge
l – lower bound for the flow on each edge
c – the actual flow on each edge – the one we are looking for. The task is to find the values of c for each edge so that all sinks are satisfied.

Here is the graph G which corresponds to the projections sumR and sumC from the previous image. Each edge in the graph corresponds to one pixel, connecting the two projections. The sumR nodes are the sources in this network and the sumC nodes are the sinks.

For each edge the lower bound is l(e) = 0 and the upper bound is u(e) = 1, and we are looking for the values of c(e) for which the flow is feasible and also minimal. The edges which are used in the feasible and minimal flow are the pixels which will have the value ‘1’.
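The construction can be tried out with a short JavaScript sketch. It makes two assumptions that are not in the text above: a single super-source and super-sink are added in front of the sumR and sumC nodes (a common way to turn a feasible flow question into a maximum flow one), and the flow is found with a plain depth-first augmenting-path search. The name reconstruct is made up for this example.

```javascript
// Reconstruct a 0/1 matrix from its row sums and column sums using maximum flow.
// Nodes: 0 = super-source, 1..R = rows, R+1..R+C = columns, R+C+1 = super-sink.
function reconstruct(sumR, sumC) {
  const R = sumR.length;
  const C = sumC.length;
  const n = R + C + 2;
  const source = 0;
  const sink = n - 1;
  const cap = Array.from({ length: n }, () => new Array(n).fill(0));
  sumR.forEach((s, i) => { cap[source][1 + i] = s; });
  sumC.forEach((s, j) => { cap[1 + R + j][sink] = s; });
  for (let i = 0; i < R; i++) {
    for (let j = 0; j < C; j++) cap[1 + i][1 + R + j] = 1; // one pixel: l = 0, u = 1
  }

  // Find one augmenting path by depth-first search and push one unit along it.
  function augment(u, visited) {
    if (u === sink) return true;
    visited[u] = true;
    for (let v = 0; v < n; v++) {
      if (!visited[v] && cap[u][v] > 0 && augment(v, visited)) {
        cap[u][v] -= 1;
        cap[v][u] += 1; // residual edge, allows undoing this unit later
        return true;
      }
    }
    return false;
  }

  let flow = 0;
  while (augment(source, new Array(n).fill(false))) flow++;

  const totalR = sumR.reduce((a, b) => a + b, 0);
  const totalC = sumC.reduce((a, b) => a + b, 0);
  if (totalR !== totalC || flow !== totalR) return null; // no feasible flow

  // A pixel is 1 when its unit capacity was used up.
  return Array.from({ length: R }, (_, i) =>
    Array.from({ length: C }, (_, j) => (cap[1 + i][1 + R + j] === 0 ? 1 : 0)));
}
```

When several matrices fit the projections, the sketch returns just one of them, which is why more projections give better results.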

Proving NP-hardness of some problems such as Integer Linear Programming

The graph coloring problem has already been mentioned above. We are trying to color each node of the graph in such a way that nodes with the same color are not connected by an edge.

Integer Linear Programming (ILP) is an NP-hard problem. This can be proven by a polynomial reduction of the graph coloring problem to the ILP problem. Concretely, for each graph we are able to construct an ILP problem which has a solution exactly when the graph can be colored using 3 colors. From the theoretical point of view, saying “we are able to construct” means that there is a polynomial reduction of the graph coloring problem to the ILP problem. The polynomial reduction proves that:

If graph coloring is an NP-hard problem, then ILP is also an NP-hard problem.

The polynomial reduction has to satisfy two conditions in order to prove the NP-hardness:

The reduction algorithm – the construction of one problem from the other – has to run in polynomial time
For each graph an instance of ILP can be constructed which has a solution if and only if the graph can be colored with 3 colors

Here is the reduction algorithm (the algorithm which explains how to define an ILP problem to given graph):

In the beginning we have a graph which we want to color using 3 colors. We will create an instance of ILP out of this graph. That means we have to define the variables and the equations which build the ILP problem. We can do this in 3 steps.

Create 3N binary variables, where N is the number of nodes: x_{n,c} = 1 <=> the node n has the color c, for c in {red, blue, green}.
For each node n in the graph add an equation to the ILP system:
- x_{n,red} + x_{n,blue} + x_{n,green} = 1
For each edge e = {n_i, n_j} in the graph add the following three inequalities to the system:
- x_{n_i,red} + x_{n_j,red} <= 1
- x_{n_i,blue} + x_{n_j,blue} <= 1
- x_{n_i,green} + x_{n_j,green} <= 1
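The three steps can be written down as a short generator in JavaScript. This is only a sketch: the variable naming scheme x_<node>_<color> and the functions buildIlp and satisfies are assumptions made for this example, and no ILP solver is involved. The second function just checks a given 0/1 assignment against the generated constraints. The construction clearly runs in polynomial time: it emits N equations and 3 inequalities per edge.

```javascript
const COLORS = ["red", "blue", "green"];

// Build the constraints of the ILP for a graph with nodeCount nodes and a list of edges.
function buildIlp(nodeCount, edges) {
  const constraints = [];
  for (let n = 0; n < nodeCount; n++) {
    // Each node has exactly one color.
    constraints.push({ vars: COLORS.map((c) => `x_${n}_${c}`), op: "=", rhs: 1 });
  }
  for (const [i, j] of edges) {
    for (const c of COLORS) {
      // Two nodes sharing an edge must not both have color c.
      constraints.push({ vars: [`x_${i}_${c}`, `x_${j}_${c}`], op: "<=", rhs: 1 });
    }
  }
  return constraints;
}

// Check a 0/1 assignment (variable name -> 0 or 1, missing means 0) against the constraints.
function satisfies(constraints, assignment) {
  return constraints.every(({ vars, op, rhs }) => {
    const sum = vars.reduce((s, v) => s + (assignment[v] || 0), 0);
    return op === "=" ? sum === rhs : sum <= rhs;
  });
}
```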

Here is an example. We have a simple graph:

Now the first set of equations, which states that each node has exactly one color:

The following set of inequalities states that nodes sharing an edge cannot have the same color:

Because the graph coloring problem has been reduced to the ILP problem, we know that this ILP has a solution exactly when the graph can be colored with three colors. Here is the solution:

Which corresponds to:

Graph coloring is NP-hard, so ILP is NP-hard as well. If you wonder how to prove that graph coloring is NP-hard: there is a polynomial reduction from 3-SAT, a special type of SAT problem.

Register allocation

Register allocation is the process of assigning a possibly unbounded set of variables of an assembly program to the finite set of registers which are available in the processor. Not all variables are used at the same time, so several variables can share a register (otherwise this mapping would not be possible). This problem, too, is solved using graph coloring. A vertex is created for each variable. Two vertices are connected if the variables are “live” in the program at the same time. The number of colors available to color the graph is equal to the number of registers.
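A minimal JavaScript sketch of this idea follows. It assumes that the liveness of each variable is given as a single interval [firstUse, lastUse], which is a simplification of what a real compiler computes, and it colors the graph greedily, so it can give up in cases where a cleverer coloring exists. The names interferenceGraph and allocateRegisters are made up for this example.

```javascript
// liveRanges: { variableName: [firstUse, lastUse] }
// Two variables interfere when their live ranges overlap.
function interferenceGraph(liveRanges) {
  const names = Object.keys(liveRanges);
  const graph = {};
  for (const v of names) graph[v] = [];
  for (let i = 0; i < names.length; i++) {
    for (let j = i + 1; j < names.length; j++) {
      const a = liveRanges[names[i]];
      const b = liveRanges[names[j]];
      if (a[0] <= b[1] && b[0] <= a[1]) {
        graph[names[i]].push(names[j]);
        graph[names[j]].push(names[i]);
      }
    }
  }
  return graph;
}

// Color the interference graph greedily; a color is a register number.
function allocateRegisters(liveRanges, registerCount) {
  const graph = interferenceGraph(liveRanges);
  const register = {};
  for (const v of Object.keys(graph)) {
    const taken = new Set(graph[v].filter((u) => u in register).map((u) => register[u]));
    let r = 0;
    while (taken.has(r)) r++;
    if (r >= registerCount) return null; // out of registers, a variable would have to be spilled
    register[v] = r;
  }
  return register;
}

// a and b are live together, c and d are live together, so two registers are enough.
console.log(allocateRegisters({ a: [0, 3], b: [1, 2], c: [4, 6], d: [5, 7] }, 2));
```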

Approximation of the data – data compression

This technique is used to approximate the data which has to be stored while minimizing the loss of precision.

Take, for example, data which represents temperatures measured over time and builds a nice graph. If this data was taken at a high frequency, there might be too many records. The idea is to minimize the number of records while keeping most of the information about the evolution of the temperature.

The shortest path algorithm can be used to solve this problem. For instance, the blue line in the following graphic represents the data to be stored. It is 10 values: for each of the 5 points the time x and Temperature(x) at the time x. The green and red lines represent possible approximations, where we leave out one or two nodes. Of course there are several nodes which can be left out, and the shortest path algorithm can help us find which ones.

We can construct a full graph which contains 5 nodes, representing the 5 data points (the times x). Each edge represents the “precision loss” which we pay when we take the direct path between the two nodes of the edge instead of passing the traditional way. The following picture represents a partial graph – the skipping edges start only in the first node (A). The edge with value x1 corresponds to the red line in the graph etc. The graph should also be filled with other edges starting in B and C (the only edge going from D to E is already present and there are no edges starting in E), but I have left them out for simplicity.

So without compression we have the simple path: A,B,C,D,E = 1 + 1 + 1 + 1 = 4

Taking the red edge and the rest of the path: A,C,D,E = x1 + 1 + 1

Taking the green edge and the rest of the path: A,D,E = x2 + 1

The values of the edges in the standard path should be the lowest (here all of them have value 1). On the other hand, the values of edges which make us lose more precision should be the greatest. Then of course we can introduce some bonus to motivate the algorithm to take fewer edges (better compression). All these constraints can be modeled using heuristics.

One possible heuristic to evaluate the value of an edge is to measure the distance between the real data and the estimated data. For instance, the value of the second point is 5. If we estimate the value using the red line (leaving out the second point), the corresponding value on the red line is 3. The distance between these two values is 2.

If we use the green line instead, then the distance between the estimated value f’(x) and the real value f(x) at the second point is 1. On the other hand, the green line also estimates the third point, and we see that the distance for the third point will be more or less 1.5. So we should add these distances together. We get:

x1 = 2

x2 = 2.5

This is just a proposal. We could also multiply it by some coefficient to obtain more reasonable results.

With the developed and evaluated graph, finding the shortest path in the full graph from the node A to the node E will give us the best “size/precision” ratio.
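Here is a minimal JavaScript sketch of the whole procedure. It uses the heuristic described above with one possible weighting, which is an assumption: an edge between neighbouring points costs stepCost (1 by default), and a skipping edge costs the sum of the distances between the skipped points and the straight line. Because all edges point forward in time, the shortest path can be found in a single left-to-right pass. The name compress and the sample data are made up for this example.

```javascript
// points: array of [x, y] sorted by x. Returns the indices of the points to keep.
function compress(points, stepCost = 1) {
  const n = points.length;

  // Cost of going directly from point i to point j.
  function edgeCost(i, j) {
    if (j === i + 1) return stepCost; // edge of the standard path
    const [x0, y0] = points[i];
    const [x1, y1] = points[j];
    let loss = 0;
    for (let k = i + 1; k < j; k++) {
      // Value estimated on the straight line versus the real value.
      const estimate = y0 + ((y1 - y0) * (points[k][0] - x0)) / (x1 - x0);
      loss += Math.abs(points[k][1] - estimate);
    }
    return loss;
  }

  // Shortest path from the first to the last point in an acyclic graph.
  const best = new Array(n).fill(Infinity);
  const prev = new Array(n).fill(-1);
  best[0] = 0;
  for (let j = 1; j < n; j++) {
    for (let i = 0; i < j; i++) {
      const cost = best[i] + edgeCost(i, j);
      if (cost < best[j]) {
        best[j] = cost;
        prev[j] = i;
      }
    }
  }

  // Walk back from the last point to collect the kept points.
  const kept = [];
  for (let v = n - 1; v !== -1; v = prev[v]) kept.unshift(v);
  return { kept, cost: best[n - 1] };
}

// The second point lies exactly on the line between the first and the third one.
console.log(compress([[0, 0], [1, 1], [2, 2], [3, 10], [4, 0]]));
// { kept: [ 0, 2, 3, 4 ], cost: 2 }
```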

Task scheduling

In this problem we have a group of workers and a group of tasks. Each task can be processed by each worker. However, the workers do not have the same performance on all tasks – the processing time of each task differs for each worker.

Let’s take a look at a very simple example: we have two workers (A, B) and two tasks (T1, T2). The values in the table represent the processing time that the worker needs for the given task.

This can be solved as finding the cheapest flow in the following graph.

Note that each edge has two values: u/c. The ‘u’ represents the capacity of the edge – it is always one. The ‘c’ represents the cost of the edge. Finding the cheapest flow in this graph from S to F will give us the best assignment of workers to tasks.
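A possible way to compute such a flow is sketched below in JavaScript. It repeatedly looks for the cheapest path from S to F in the residual graph (with the Bellman-Ford algorithm, since residual edges have negative costs) and pushes one unit along it. This choice of algorithm, the name assignByMinCostFlow and the processing times in the usage example are assumptions made for illustration; they are not taken from the table above.

```javascript
// cost[w][t] = processing time of task t when done by worker w.
// Nodes: 0 = S, 1..W = workers, W+1..W+T = tasks, W+T+1 = F.
function assignByMinCostFlow(cost) {
  const W = cost.length;
  const T = cost[0].length;
  const S = 0;
  const F = W + T + 1;
  const n = W + T + 2;
  const edges = []; // edges k and k ^ 1 form a forward/backward pair
  const addEdge = (from, to, cap, c) => {
    edges.push({ from, to, cap, cost: c });
    edges.push({ from: to, to: from, cap: 0, cost: -c });
  };
  for (let w = 0; w < W; w++) addEdge(S, 1 + w, 1, 0);
  for (let w = 0; w < W; w++) {
    for (let t = 0; t < T; t++) addEdge(1 + w, 1 + W + t, 1, cost[w][t]);
  }
  for (let t = 0; t < T; t++) addEdge(1 + W + t, F, 1, 0);

  let total = 0;
  for (;;) {
    // Bellman-Ford: cheapest path from S to F using edges with free capacity.
    const dist = new Array(n).fill(Infinity);
    const via = new Array(n).fill(-1);
    dist[S] = 0;
    for (let round = 0; round < n - 1; round++) {
      edges.forEach((e, k) => {
        if (e.cap > 0 && dist[e.from] + e.cost < dist[e.to]) {
          dist[e.to] = dist[e.from] + e.cost;
          via[e.to] = k;
        }
      });
    }
    if (dist[F] === Infinity) break; // no more flow can be sent
    // Push one unit along the path (all capacities are 1).
    for (let v = F; v !== S; v = edges[via[v]].from) {
      edges[via[v]].cap -= 1;
      edges[via[v] ^ 1].cap += 1;
    }
    total += dist[F];
  }

  // A worker -> task edge with no capacity left is part of the assignment.
  const assignment = {};
  for (const e of edges) {
    const workerToTask = e.from >= 1 && e.from <= W && e.to > W && e.to <= W + T;
    if (workerToTask && e.cap === 0) assignment[e.from - 1] = e.to - 1 - W;
  }
  return { total, assignment };
}

// Hypothetical times: worker A needs 4 and 2, worker B needs 3 and 5.
console.log(assignByMinCostFlow([[4, 2], [3, 5]]));
```

In the result, assignment maps the index of a worker to the index of the task given to that worker.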

Other interesting applications

Development of custom processors for minimization of resources – data path optimization.
Scheduling on parallel processors – application of maximal flow finding problem.

Saturday, 18 February 2012

JavaScript asynchronously uploading files

At the beginning I thought that it had to be easy: just make a POST to the server using jQuery, and the only question is how to get the data. Well, I found out that it is not that easy, and googling around I found quite a lot of pre-built components and plugins, which makes it quite difficult to decide on one of them.

Why is it not possible to use a simple JavaScript POST?

Because of security restrictions. The browser is not allowed to post the file content asynchronously. This is however about to change thanks to HTML 5.

Workarounds

HTML 5 - has support for file uploading. Use the File API. Follow this how-to. This does not work in current versions of IE (7, 8, 9).
Create a hidden iframe on the page and redirect the return of the post to this iframe
Use Flash, Silverlight, or a Java applet
Use some component or jQuery plugin, which usually makes use of the preceding ones (usually the iframe hack)

jQuery plugins

There are quite a few of those:

I have tested jQuery File Upload, which is cool and comes with a nice GUI, but at the time of writing I found it a little hard to customize. I actually struggled to use a simple form which would upload just one file, instead of the predefined GUI with its behavior.

The second one that I have tested is jQuery Form Plugin which, contrary to the previous one, is simple to use in a single-file upload scenario. However, it does not provide the nice UI, ready for multiple file uploads etc.

Using jQuery Form Plugin in ASP.NET

Client side

On the client side you need jQuery and the Plugin js file. Then with one jQuery call you can set up the form, to use the plugin.

<form id="uploadForm" action="Upload.ashx" method="POST" enctype="multipart/form-data">
  <input type="hidden" name="MAX_FILE_SIZE" value="100000" />
  File:
  <input type="file" name="file" />
  <input type="submit" value="Submit" />
</form>

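// Let the Form Plugin submit the form asynchronously instead of reloading the page.
$('#uploadForm').ajaxForm({
    // Called just before the submit: formData, form and options are passed in by the plugin.
    beforeSubmit: function (formData, form, options) {
        options.dataType = 'html';
    },
    // Called with the response of the server once the upload is finished.
    success: function (data) {
        alert('upload OK:' + data);
    }
});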

The dataType property, which is set to 'html', tells the Form Plugin what kind of response it should expect. To check the other options see the documentation.

You can see that the form action is set to "Upload.ashx". This is the server-side script, an HTTP handler in the case of an ASP.NET application. It could probably also be a WCF service, but let's keep it simple when we can.

Server side

On the server side you have to define an HTTP handler which will take care of the upload functionality.

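using System.IO;
using System.Web;

public class Upload : IHttpHandler
{
    public void ProcessRequest(HttpContext context)
    {
        // The first (and here the only) file posted by the form.
        HttpPostedFile file = context.Request.Files[0];
        // Keep the extension of the uploaded file, without the leading dot.
        string extension = Path.GetExtension(file.FileName).TrimStart('.');
        // RandomString is a helper (not shown here) which returns a random file name.
        string filePath = "Uploads" + "\\" + RandomString(10) + "." + extension;
        string linkFile = Path.Combine(context.Server.MapPath("~"), filePath);

        file.SaveAs(linkFile);
        context.Response.StatusCode = 200;
        context.Response.Write(filePath);
    }

    // Required by IHttpHandler; the handler keeps no state between requests.
    public bool IsReusable { get { return true; } }
}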

And that's it. The handler will save the file and send back the address of the file.

Saturday, 4 February 2012

Choosing technologies for .NET project

Our latest research and development project was an online banking application. While choosing the building blocks of this application, we tried to pick state-of-the-art frameworks and technologies. This is not an easy task, since there are always several alternatives for each component. I have decided to create this post, which sums up the technologies available for different parts of the application, and I will try to update it regularly to keep up with the changes.

Here is the structure of this post, according to which the technologies are grouped.

DataAccess - ORM, data generation
Platform - Dependency Injection, Aspect Oriented Programming
Integration - SOAP/REST, messaging, distributed objects...
Testing - Unit testing and Mocking, Parametrized testing, Functional testing
Presentation layer
Security
Logging

Typical application

Our application was a classical 3-tier application with database, business and presentation layers.
Data is stored in SQL Server 2008. The data access layer is implemented using the Repository pattern and an ORM. Dependency Injection and Aspect Oriented Programming are used to put the application pieces together. Services are exposed using WCF, and there are two types of client applications: mobile and web.

So the technologies presented here are the ones mostly used in this scenario. However, as said before, I would like to update the post with more information any time I come across another technology, which might happen while working on different architectures.

Data Access

The most important part of the data access layer is the framework used for Object Relational Mapping (ORM). There are currently two major ORM frameworks in .NET: NHibernate and Entity Framework. Both provide similar ORM functionality (code-only approach, lazy loading, use of POCOs as persistence classes).

Entity Framework 4.0 has brought a lot of improvements over the previous version (named EF 1.0), which did not provide the above-mentioned functionality, and it is now comparable to NHibernate. Crucial for an ORM framework in the .NET environment is the integration of LINQ (Language Integrated Query). Entity Framework was the first to offer this functionality, but the implementation in NHibernate followed shortly after.

NHibernate still has several advantages, among them its better support for batch processing and the fact that, as an open source product, it can be customized. On the other hand, Entity Framework provides better tools integrated into Visual Studio.
One last thing which can justify the choice of NHibernate is the possibility of using Fluent NHibernate.

Fluent NHibernate
NHibernate uses its XML-based HBM format to define the mappings between database tables and POCOs. While the separation of code and configuration in XML can be seen as a nice approach, it gets complicated once the XML configuration files grow larger and once we start introducing changes into the POCOs. The XML is not checked during compilation, so potential errors can be detected only at run time and are generally hard to localize.
Fluent NHibernate allows us to define the mappings in strongly-typed C#, which practically eliminates these issues. If there is an error in the configuration, it will most likely be discovered during compilation. Currently Fluent NHibernate provides almost full compatibility with HBM files, which means that what can be defined in HBM can also be defined in Fluent NHibernate.

Data Generation
AutoPoco is a simple framework which allows the generation of POCOs (Plain Old CLR Objects) with meaningful values. When building an enterprise application we often need to generate initial data for the database. This can of course be done using SQL scripts or in the imperative language which we are using, but it involves a lot of repetitive code and for loops in order to create a sufficient amount of data. AutoPoco provides an easy way to generate the starting data. It also provides several built-in sources for common properties which are stored in databases, such as phone numbers, birth dates, names and credit card numbers.

Platform

There are two design patterns (or approaches) which are very often present among the several layers of enterprise applications: Dependency Injection and Aspect Oriented Programming.

Dependency Injection is used to assemble a complex system from existing blocks. There are several Dependency Injection containers available for the .NET framework: Spring.NET, Castle Windsor, StructureMap, Autofac, Ninject, Unity (by Microsoft), LinFu.

Aspect Oriented Programming allows developers to separate cross-cutting concerns from the application's blocks. This is usually done by injecting code into an object's existing methods.
There are several ways to implement AOP, two of them being the most common: proxy-based AOP and IL-weaving-based AOP.

Proxy-based AOP is easily achieved by wrapping the targeted object in a proxy class. Then it is easy to intercept the calls to the target object in the proxy class and call the code which should be injected. It just so happens that Dependency Injection containers use proxy classes, and therefore most of them also offer AOP (Spring.NET, Castle Windsor).
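The interception idea is independent of .NET, so here is a tiny sketch of it in JavaScript, using the built-in Proxy object. It only illustrates the principle of a proxy running injected code around a call; it is not how Spring.NET or Castle Windsor are implemented. The function withLogging and the account object are made up for this example.

```javascript
// Wrap target in a proxy which records every method call and its result in the log array.
function withLogging(target, log) {
  return new Proxy(target, {
    get(obj, prop, receiver) {
      const value = Reflect.get(obj, prop, receiver);
      if (typeof value !== "function") return value; // plain properties pass through
      return function (...args) {
        log.push(`call ${String(prop)}(${args.join(", ")})`); // injected code before the call
        const result = value.apply(obj, args); // the real method of the target
        log.push(`return ${result}`); // injected code after the call
        return result;
      };
    },
  });
}

const account = {
  balance: 100,
  withdraw(amount) {
    this.balance -= amount;
    return this.balance;
  },
};
const callLog = [];
const loggedAccount = withLogging(account, callLog);
loggedAccount.withdraw(30);
console.log(callLog); // [ 'call withdraw(30)', 'return 70' ]
```

The account object itself knows nothing about logging, which is the separation of the cross-cutting concern.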

IL weaving is the injection of IL code into the compiled assembly, after compilation and before the code is JIT-compiled and executed.

There are two frameworks which provide AOP through IL weaving: PostSharp and LinFu. PostSharp has a commercial licence, however at the time of writing this post (July 2011), there is also a 45-day free trial. LinFu is an open source project under the LGPL licence which covers both IoC and AOP.

I used to choose Spring.NET because of its maturity, the fact that it is well documented, works great with NHibernate and allows both AOP and Dependency Injection. One of the disadvantages of Spring.NET is the XML configuration, which as always can become too large to maintain. Other frameworks use C# as the language to configure AOP or Dependency Injection (PostSharp makes use of attributes, and frameworks such as Ninject or StructureMap use strongly typed classes to configure the dependency injection container).

I have however decided to use Ninject on my last project, as it seems to have a bit of momentum right now, and I will post the pros and cons here later.

Code Verification (Code Contracts)
Design by contract is a software design approach which implies that developers define clear interfaces for each software component, specifying its exact behavior. The interfaces are defined by contracts and extend the possibilities of code verification and validation.
The term was first used by Bertrand Meyer, who made it part of his Eiffel programming language.

Code Contracts is a language agnostic framework which enables the Design-by-Contract approach by allowing the programmer to define three types of conditions for each method:
Pre-condition - states in what forms the arguments of the method should be.
Post-condition - states what forms the outputs of the method will have.
Invariants - conditions which will always be true during the execution of the method.

These conditions can be later verified by two types of checks:
Static checking - is done at compilation time. At this time the compiler does not know what values will be passed as arguments to the methods, but from the execution tree it can determine which method calls might potentially be invoked with non-compliant parameters.
Runtime checking - the code contracts are compiled as conditions directly into .NET byte-code. This saves the programmer from writing the conditions manually inside the method bodies.

Note that Code Contracts are not a language feature. They are composed of a class library and the checking tools, which are available as plugins for Visual Studio.
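To show what a runtime-checked contract boils down to, here is a small sketch in JavaScript. It is not the Code Contracts API and it covers only pre-conditions and post-conditions at run time, with no invariants and no static checking. The helper withContract and the example function safeSqrt are made up for this illustration.

```javascript
// Wrap fn so that its pre-condition and post-condition are checked on every call.
function withContract(fn, { pre, post }) {
  return function (...args) {
    // Pre-condition: in what form the arguments have to be.
    if (pre && !pre(...args)) throw new Error("Precondition failed");
    const result = fn(...args);
    // Post-condition: what form the output has to have.
    if (post && !post(result, ...args)) throw new Error("Postcondition failed");
    return result;
  };
}

const safeSqrt = withContract(Math.sqrt, {
  pre: (x) => typeof x === "number" && x >= 0,
  post: (result, x) => Math.abs(result * result - x) < 1e-9,
});

console.log(safeSqrt(9)); // 3
```

The body of the wrapped function stays free of the checks, which is the point made above about runtime checking.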

Integration

Distributed applications need a way of communicating between their components. Remote Procedure Call (RPC) was the first technology used in distributed systems, back in the 1970s. The choice here surely depends on the architecture of the application (client-server, publish-subscribe, ESB, and more...)

WCF
Flexible platform which provides abstraction of transport layer configuration (security, transport format, message patterns).
WCF options and choices:
Transport protocol: WCF can use HTTP, TCP, MSMQ
Transport format: XML, JSON, or binary

One service can expose several Endpoints (URIs). Each Endpoint can be configured to use a different Binding. Each Binding can have different transport protocol and format options. The same services can thus be exposed using different protocols and formats. In our application we can use this advantage and expose different endpoints for different clients.

Testing

Several types of tests can be used to confirm the correct behavior of the application: Unit Tests, Integration tests, smoke tests, functional tests (or acceptance tests).

Unit Testing
Mocking frameworks
When it comes to isolating the unit tests there are several mocking frameworks available: NMock, EasyMock, Moq, JustMock (commercial), TypeMock (commercial), RhinoMocks, NSubstitute, FakeItEasy and Moles.

In our application we have decided on RhinoMocks and Moles. Moles is used in connection with Pex, a test generation framework which will be described later.
Most of the mocking frameworks provide more or less the same functionality, thus the decision is quite complicated. RhinoMocks has the following characteristics:

Free and Open Source
Easy to use
Active community
Compatible with Silverlight (existing port to Silverlight)

Possible disadvantage: three types of syntax, which might be confusing for beginners
The current version is 3.6; version 4, which should break backwards compatibility, is in development, but if I have not missed something, there are no releases so far.

Pex & Moles - Parametrized Unit Testing
Pex & Moles are used in order to build unit tests for the back-end part. Pex is a tool which helps generate inputs for unit tests, while Moles enables the isolation of the tested code. In order for Pex to generate the inputs, the test cases have to be parametrized.

Instead of writing concrete test cases, the test method is just a wrapper which takes the same arguments as the tested method, performs the necessary set-up and then passes the arguments to the tested method. Pex analyses the execution tree of the tested method, suggests the parameters which should be passed to the method and builds concrete test cases.

The aim of Pex is to obtain maximal code coverage. In order to achieve that, it uses a constraint solver (Microsoft’s Z3) to determine the values of the variables used in the method which will lead to the execution of each branch. Then it varies the parameters to obtain these values.

Moles is a stubbing framework. It allows you to isolate the parts of the code which you want to test from other layers. There are basically two reasons to use Moles:
Moles works great with Pex. Because Pex explores the execution tree of your code, it also tries to enter all the mocking frameworks which you might use. This can be problematic, since Pex will generate inputs which will cause exceptions inside the mocking frameworks. By contrast, Moles generates simple stubs of classes containing delegates for each method, which are completely customizable and transparent.
Moles allows you to stub static classes, including the ones of the .NET framework which are usually problematic to mock (typically DateTime, File, etc.)

As it says on the official web: "Moles allows you to replace any .NET method by delegate". So before writing your unit test, you can ask Moles to generate the needed stubs for any assembly (yours or other) and then use these “moles” in your tests.

Presentation Layer

The presentation layer is quite a large topic with several choices: ASP.NET, ASP.NET MVC + JavaScript, pure HTML5 + JavaScript, some JS frameworks (jQuery, Knockout.js), Silverlight - and all of these technologies can be combined.

Silverlight
Here is a list of characteristics which can be seen as advantages or disadvantages:

Intended for developing Rich Internet Applications.
Supports separation of the view and the logic using the MVVM pattern.
Possibility to use a declarative language (XAML) to design the user interface and an imperative language to define the application logic.
Data visualization support using the open source Silverlight Toolkit (charts, line series)
Re-usability of code on .NET compliant platform.
Possibility to access audio and video devices on client side.
Plug-in based technology. Requires the plug-in to be run inside the browser. The plug-in is not available for all possible combinations of platform and browser. This lowers the availability of the developed application and brings also higher requirements on hardware.
Standard web features are missing such as navigation.
Limited testability. Silverlight cannot be tested with traditional functional testing frameworks such as Selenium. On the other hand, when the MVVM pattern is applied, the ViewModels can be tested as simple classes, using traditional unit testing technologies.

HTML + JavaScript

No plug-in needed, HTML 5 is supported on the majority of the current browsers.
Naturally comes with web standard features: navigation, bookmarking.
Developers have to handle the "all browsers compatibility" issue.
Compared to C#, JavaScript is a dynamic language, not compiled before execution. This may be seen as both an advantage and a disadvantage.

Knockout.js seems to me a great way to use the MVVM pattern with JavaScript; I will be checking it out and writing about it later.

Logging

Logging is an essential part of each application. Following frameworks are available in .NET:

Log4Net - easily configurable framework.
Logging in MS Enterprise library
NLog - version 2.0 released 7/2011 including logging framework for Windows Phone 7 and Silverlight - seems very nice, but I have never tried it.
The Objects Guy Logging Framework - lightweight logging framework
.NET built-in tracing - alternative approach of using the System.Diagnostics namespace, which enables output of the standard Trace and Debug Write methods to an XML file.

A good recapitulation of logging is available in this Stack Overflow thread.

Security

There is usually a need to handle user authentication in enterprise applications. When using ASP.NET I have found that the standard Forms Authentication usually satisfies my needs. To handle OpenID authentication, DotNetOpenAuth is an excellent choice.

Forms Authentication
The Forms Authentication scheme works by issuing a token to the user the first time that he authenticates. The user can be authenticated against a database or any other information source.

This token, in the form of a cookie, is added to the response which follows the authentication request. This way the cookie is added to the next request by the same client. Forms Authentication then takes care of revoking the cookie (after a configured time) as well as of checking the cookie in each request.

Forms Authentication works automatically with browser-based clients. When it is used from different clients, some additional work has to be done on the client in order to add the authentication cookie to each request.

DotNetOpenAuth

I have previously used this library for two tasks: integrating OpenID authentication and creating an OAuth provider.

Integration of OpenID works hand in hand with Forms Authentication. The DotNetOpenAuth library provides a means to authenticate the user against any OpenID provider. Once the user is authenticated, the authentication cookie can be generated using Forms Authentication.

Conclusion

When a new application is being developed, there are several decisions that have to be taken regarding the frameworks and technologies which might be used. This article does not give direct answers to these questions, but rather lists the possible frameworks which should be taken into account.

New frameworks are being delivered by Microsoft and by the open source community, and it is hard to see which technologies will last and which will be forgotten. I hope this overview can help to make the right decision. Any suggestions are welcome.