Tuesday, August 4, 2009

Convert Word-Documents to PDF on an ASP.NET Server

Introduction

This PDFConverter converts Microsoft Word documents (*.doc) into PDF files on a web server. A simple web form lets you upload a Word document, which is converted and sent back as a PDF file.

Prerequisites

Background

Many people try to do this Word => PDF conversion using COM Interop directly from their ASP.NET code. As Microsoft states, automating Word or any other Office product on a server is not recommended. The Office products are optimized to run as client applications that use a desktop for interaction. This is why a user must be logged in if you want to use an Office product in any way on a server. That leads to the next problem: when COM Interop is used directly from ASP.NET, the call is made by the ASP.NET user, which is not allowed to interact with Word. Even if this setting is changed in the DCOM configuration, many access-rights problems remain with this solution. That's why I settled on the following approach:

PDFConverter

Explanation

PDFConverter.exe is an executable that hosts remotable objects and runs all the time on the server. For this, a user must be logged in. On startup, PDFConverter.exe checks whether Word 2007 is available. I configured Word to be visible for this check, so if you see Microsoft Word open briefly and close again, everything works fine.

Then there is a normal website with a file upload. Store the uploaded file somewhere (don't forget to give the ASP.NET and IIS users the appropriate rights on this folder so that the uploaded file can be saved there).

When the file is saved, you call the convert() method of the PDFConverter.RemoteConverter instance, which you obtain using Remoting (see code). The conversion itself then runs inside PDFConverter.exe, which runs on a "Desktop" with the appropriate rights to interact with Microsoft Word.

When the conversion is finished, you can do whatever you want with the PDF file. In the example, it is streamed back to the client as a file download.
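
The hand-off described above (the web page saves the file, then asks a separately running host process to convert it) can be illustrated with a small sketch. This is not the article's .NET Remoting code: it is a Python stand-in that uses XML-RPC from the standard library in place of Remoting, and the names start_converter_host and convert_via_host are made up for this illustration. The convert function here only derives the output path; the real conversion through Word is not shown.

```python
import threading
from xmlrpc.client import ServerProxy
from xmlrpc.server import SimpleXMLRPCServer


def convert(doc_path):
    """Placeholder for the real conversion: it only derives the PDF path."""
    base = doc_path[:-4] if doc_path.lower().endswith(".doc") else doc_path
    return base + ".pdf"


def start_converter_host():
    """Play the role of the always-running host process (assumption: XML-RPC)."""
    # Port 0 asks the operating system for any free local port.
    server = SimpleXMLRPCServer(("127.0.0.1", 0), logRequests=False)
    server.register_function(convert, "convert")
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def convert_via_host(port, doc_path):
    """What the web page would do once the uploaded file has been saved."""
    with ServerProxy(f"http://127.0.0.1:{port}") as proxy:
        return proxy.convert(doc_path)
```

The point of the design is that the web code never touches Word itself; it only passes a file path to a process that already runs under a logged-in user.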

Using the Code

Remoting

Serverside PDFConverter.exe (app.config):


See full detail: http://www.codeproject.com/KB/aspnet/word2pdf_serverconvert.aspx

ASP.NET Providers for the ADO.NET Entity Framework

Introduction

One of the most powerful improvements in ASP.NET 2.0 was the introduction of the membership, role and profile providers. They let you rapidly integrate user management, role-based security and visitor-based page customization into your ASP.NET application. As the name indicates, they all implement the provider model design pattern.

Unfortunately, Microsoft doesn't offer a ready-to-download version of these providers that leverages the ADO.NET Entity Framework.

The downloadable source for this article provides a ready-to-use implementation of the above-mentioned providers. This article gives a quick overview of providers and describes what it takes to get the provided source up and running.

To understand the source, sound knowledge of LINQ is essential. Furthermore, it is important that readers understand the philosophy behind ASP.NET providers. The following link gives a good overview of the ASP.NET providers.

Background

Provider Model Design Pattern

provider_overview.JPG

The provider model pattern was designed to provide a configurable data access component that is defined in the web.config. The provider sits between the business logic and the data access layer. The concrete implementation of the provider is specified in the web.config, so custom providers can be built and configured without changing the application design. Providers are subclasses of the ProviderBase class and are typically instantiated using a factory method.

Various providers for the same purpose can co-exist and easily be configured in the web.config.
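
To make the pattern concrete, here is a minimal sketch in Python rather than in the article's .NET code. A plain dictionary stands in for the web.config section, and every name in it (MembershipProvider, InMemoryProvider, DenyAllProvider, create_provider, the configuration keys) is invented for this illustration and is not part of the ASP.NET API.

```python
from abc import ABC, abstractmethod


class MembershipProvider(ABC):
    """Plays the role of the abstract provider the business logic codes against."""

    @abstractmethod
    def validate_user(self, name, password):
        """Return True if the credentials are accepted."""


class InMemoryProvider(MembershipProvider):
    def __init__(self, users):
        self._users = dict(users)

    def validate_user(self, name, password):
        return name in self._users and self._users[name] == password


class DenyAllProvider(MembershipProvider):
    def validate_user(self, name, password):
        return False


# Maps the "type" string from the configuration to a concrete class.
PROVIDER_TYPES = {"memory": InMemoryProvider, "deny": DenyAllProvider}


def create_provider(config):
    """Factory method: build whichever provider the configuration names."""
    section = config["membership"]
    settings = section["providers"][section["defaultProvider"]]
    provider_class = PROVIDER_TYPES[settings["type"]]
    return provider_class(**settings.get("options", {}))
```

Several providers can sit side by side in the configuration; changing the defaultProvider entry swaps the implementation without touching the calling code.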

Database Schema

The database schema is closely related to the one that gets created by aspnet_regsql.exe:

provider_database_schema.JPG

The resulting Entity Model looks like the following:

provider_entity_model.JPG

Unlike Microsoft's ASP.NET providers for SQL, the solution presented here does not use stored procedures. Instead, all the queries are implemented in the respective provider's source using LINQ.

Exposed APIs


See full detail: http://www.codeproject.com/KB/aspnet/AspNetEFProviders.aspx

A Custom UpdateProgress Control

UpdateProgress

Introduction

In this article, I’d like to show you how to build a custom update progress control with the following aspects:

  • The whole page will be protected from further requests caused by clicking buttons located on the page.
  • A progress bar will be displayed.
  • A button for aborting the process will be available.
  • The control doesn’t need to have a specified UpdatePanel. It will work for each UpdatePanel located on a page by default.
  • The update progress control will be available from JavaScript as an object. It means you can use it either for an UpdatePanel or for custom AJAX requests.

To understand this article, you should know what the UpdatePanel control is and what the UpdateProgress control is.

Let’s have a look at the following screenshot to find out what the result will be:

Pic 1

As you can see, there is a yellow box approximately in the middle of the page. The rest of the page is covered by a shadow that prevents users from clicking the buttons.

The article is divided into two parts: first the theory, then the examples and how to use the control. If you are not interested in the theory, you can just skip it.

Theory

Imagine the following situation: we have an UpdatePanel control on the page and we want to display a progress bar while this UpdatePanel is being updated. The update of the panel is made by an invisible HTTP request. What does ‘invisible’ mean? It means that the HTTP request is made by the XMLHttpRequest object. The point is that browsers give the user no feedback during such a request. For this reason, we should display a progress bar to let users know that their command is in progress.

Firstly, you need to be informed about two stages:

  • The beginning of the request
  • The end of the request

This is very easy to achieve with the MS AJAX JavaScript library. It has an object called PageRequestManager, which lets you register event handlers for the two stages mentioned above. Let’s have a look at how to get the instance of the PageRequestManager.


See full detail: http://www.codeproject.com/KB/ajax/UpdateProgressControl.aspx

A custom DecimalBox for accepting only Digits and a Decimal Point

Introduction

In a recent project for evaluating investment properties, I found I was going to need 20 or so text boxes to accept monetary amounts: the purchase price, down payment, taxes, rental income, miscellaneous income, and a number of expenses beyond just the mortgage payment. I initially searched the web for examples of a control that would only accept digits, but was disappointed with what I found; likewise when I considered formatting the user input as currency. The solutions I encountered did nothing to ensure that a user's mistake would be caught and immediately rejected by the code, and the formatting examples I found were inspirational, but not quite what I wanted to accomplish. So I decided to use pieces of what I'd found to create a custom control that did exactly what I wanted it to do. The criteria were as follows:

  1. Any character other than a digit or a single decimal point was to be simply rejected as user input, and not even displayed in the textbox. In this way, since users couldn't possibly make any input errors, there'd be no need for error handling, and mercifully, no annoying pop-up error messages.
  2. On entering the control, the control's background color should change so there's no question about which field the user is in; when exiting the control, the background color changes back.
  3. Since all user input was to be "money", I wanted to format the user input on leaving the control as currency: the appropriate monetary symbol and thousands separator would be added automatically to the text display, precision would be set to two decimal places, and all the figures would line up neatly on the form.
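
The first and third criteria can be sketched independently of any UI toolkit. The following is a Python illustration, not the article's C# control: accept_char, type_text and format_currency are hypothetical helper names, and the "$" symbol is hard-coded here as an assumption, whereas the control described above takes the symbol and separators from the current culture.

```python
from decimal import Decimal, ROUND_HALF_UP

DIGITS = "0123456789"


def accept_char(current_text, ch):
    """Criterion 1: allow only digits and a single decimal point."""
    if len(ch) != 1:
        return False
    if ch in DIGITS:
        return True
    # A decimal point is accepted only if the text does not contain one yet.
    return ch == "." and "." not in current_text


def type_text(keys):
    """Simulate typing: rejected characters never reach the text box."""
    text = ""
    for ch in keys:
        if accept_char(text, ch):
            text += ch
    return text


def format_currency(text, symbol="$"):
    """Criterion 3: on leaving the control, show two decimals and separators."""
    # An empty box or a lone decimal point is treated as zero in this sketch.
    amount = Decimal(0) if text in ("", ".") else Decimal(text)
    amount = amount.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)
    return f"{symbol}{amount:,.2f}"
```

Because invalid characters are dropped at the keystroke, the formatting step only ever sees text made of digits and at most one decimal point, which is why it needs no error handling beyond the empty case.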

Testing turned out to be amazingly simple once I had two instances of Visual Studio 2008 running, one for developing my "DecimalBox" and one for testing it. As I found during testing, anything that can possibly go wrong, will. You'll find code that at first blush seems unnecessary, or seems to complicate things needlessly, but that actually prevents the unhandled errors I managed to introduce, somewhat accidentally, during testing. I'll explain them as we go.

If you've tended to shy away from custom controls in the past, thinking them either unnecessary or too time-consuming to develop, I hope the simple code here will show you how, in the long run, they'll actually cut development time the next time you need something like a "DecimalBox" on your form. C# encourages reuse, encapsulation and code efficiency, and using custom controls can help you achieve that in your own code. In much the same way that using Microsoft-provided controls saves you development and support time, the one-time creation of a custom control will pay off not only in your current project, but the next time you need to accept, for example, only decimal user input, and the time after that, and the time after that as well. Developing custom controls is a great skill to acquire. I hope this simple control inspires you to consider developing your own custom controls and contributing your successes to this forum. Truly, the entire community benefits from every contribution, and we all become better programmers.

Background

One reviewer asked a wonderful question: "Why would I use your control rather than a numeric up-down counter?" which led me to consider the implications of why use a custom control at all. My response follows:

For any control I think there are three interwoven issues: functionality, maintainability, and user perception or experience. Remember, I needed 20 or more controls capable of:

  1. accepting user input in the form of a decimal value, excluding any and all extraneous characters
  2. formatting the user input as currency on leaving the control, adding the appropriate culture-specific monetary symbol and the appropriate thousands separator
  3. changing the background color on entering (to focus user attention on the appropriate control) and changing it back on leaving.

While requirement 1 could be handled by the numeric up-down counter, requirement 2 couldn't, because that control only displays digits or decimal values and not text (it's a 'numeric' up-down counter).

But more important, and this also applies to requirement 3, is the question of managing 20 such identical controls. Should they be managed individually, that is, with 20 separate "Focus - Leave" event handlers, each containing identical code that either formats the user's decimal input as currency or calls a method in a separate class that does so? Or could I create a custom control in which that format-as-currency code is written and maintained in one place, with the result displayed in one, 20, 100, or 10,000 controls if needed? From the perspective of maintainability, and not forgetting the development cycle of creating this behavior, a single place, that is, a single custom control, made the most sense.

Further, during development my initial effort was simply to add a '$' character (and later, to detect the presence of one to prevent a formatting exception from occurring), and in testing I found that this simple approach didn't prevent or resolve problems. The point is that doing development on one (custom) control allowed me to modify its code in one place and have the behavior propagate to the 20-or-so instances of the control on my form. The alternative, had it been necessary, would have been to change the code either in 20 separate locations (very inefficient) or in a single method in a separate class (more efficient, but requiring dedicated code in my project, whereas with a custom control that code is kept separately). Additionally, of course, I now have a custom control that I can use in subsequent projects with no further development necessary.

I hope it's clear that focusing the development efforts in one place made both development and subsequent maintenance far simpler. Philosophically, this results in code re-usability, one of the supposed hallmarks of the C# language. From another philosophical angle, this isn't much different from using a Microsoft-provided control that has certain functionality already built in, much like your question of why not use the already-provided numeric up-down control, and the answer should be obvious: though it has its advantages, it does not do exactly what I want the control to do. The solution is a custom control, which, as I found out, wasn't as difficult to develop as I initially thought, and which allowed me to add a control to my form that does exactly what I want it to do.

The user experience is not to be ignored either. Users see text boxes all the time and are used to entering values in them, whether text or numeric. When they see a numeric up-down, they see a control with arrows they have to use to alter the displayed value. Unless they get a visual cue, say a change in background color, they may not be aware that they can change the value directly by typing it in. In my project, real estate evaluation, values could vary from a few hundred dollars in one field to tens of millions of dollars in another. Clearly, users should be encouraged to enter the value they wish rather than use arrow keys to select it.

To take this one step further in encouraging the use of custom controls, I also have dozens of TextBoxes where users can enter text information: names, addresses, expense categories, etc. Suppose I want the background color of each of these controls to change as the user tabs through my forms. Should I add an event handler for "Focus - Enter" and "Focus - Leave" for each individual instance of these dozens of TextBoxes, changing the background color on entering and changing it back on leaving? Or should I create a custom TextBox control with the appropriate background color changes "built in", so that adding an instance of that control to my form, or to any other project for that matter, carries with it the background color changes I want the control to exhibit as the user tabs through the form? I hope the answer is obvious: the little effort required to create a custom control saves tremendous amounts of development and maintenance time in the longer term, even though it costs a bit of time on the front end to develop it.

Using the Code

  1. Open Visual Studio 2008, click the “File” pull-down at the top of the screen, then “New” and “Project…”
  2. Click on “Windows Forms Control Library”, then name it “DecimalBox”.
  3. In the “Solution Explorer” on the right of the screen, right-click the entry ending in “.cs” (by default this will be “UserControl1”) and delete it.

See full detail: http://www.codeproject.com/KB/miscctrl/Formatting_User_Input.aspx

Making a parsable text box in .Net 3.5

Introduction

The standard WPF controls provided by the .NET 3.5 SP1 framework do not include controls such as numerical text boxes. Developers very often need such text boxes to limit user input to integers, floating point numbers and so on. I developed these controls while working on a hobby project of mine. While at it, I thought it would be nice to extend these controls beyond numerical data types and, in general, parse any text into strongly typed .NET objects. This article describes how to extend the library to target any .NET data type that you want to create, starting from plain text input.

Note: Unless mentioned otherwise, we will be referring to input that has to be parsed as a System.Double object. We will call its text counterpart the floating point representation.

Background

The objective is to get text boxes that accept text input that can be parsed into a strongly typed .NET object. Now, most people enter text character by character (possibly pressing backspace or the cursor keys to edit or remove previously entered characters, or to add new characters at any location), so it is not sufficient to intercept the TextInput event and check whether Double.TryParse returns true or false. That said, we also want to prevent users from entering characters that, once entered, will never allow the input to become a valid floating point number, no matter which characters are appended. We would also like to notify users when they need to append characters to make their input valid.

For example, in the following screenshot, the top row shows examples of input that can become valid (bottom row) after some specific characters are appended to the input.

MyControls

Some input can never be made valid, meaning no character combination can be appended to rectify it. Examples: "-1.2.", "-1..", "--", "12p", "12e-1." and so on. We shall not allow the text box to reach such a state.

Since the main purpose of this article is to demonstrate how to extend this control, let's review its design in some detail. If I asked you to write a program that accepts a string character by character and determines whether the input received so far constitutes a valid floating point number, chances are that you would start thinking in terms of a finite automaton. That is, you would define some states and jump from one state to another depending on some pre-defined (i.e., design-time) rules. Some of these states would be final, meaning that if you are in one of them the input is valid, and otherwise it is invalid. While there is nothing wrong with this approach, and you can build correct programs with it, the code will look a bit messy. Furthermore, if you were to extend this code to accept variations in the float representation (like accepting the thousands separator or the exponential part), you would have to add new states, new jumps, etc.

Now, let's utilize something known as a regular expression. Regular expressions are equivalent to finite automata, meaning that for every finite automaton there exists a regular expression (over the same alphabet) and vice versa. So, if we can define a regular expression for a valid floating point pattern, all we need to find out is whether the input matches the regular expression. Doing that matching ourselves would mean writing automaton code again, which would defeat the whole point. But some good people have already written code to match any text against any regular expression, and it is available under the System.Text.RegularExpressions namespace in .NET.

So first and foremost we need two regular expressions: one for patterns that are valid, and one for patterns for which a character combination exists that, when appended, makes them valid. For System.Double objects these can be:
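
As a rough illustration of the idea, here is a Python sketch that uses the standard re module instead of System.Text.RegularExpressions. The two patterns are written for this sketch and are not the article's own expressions; they cover a simple floating point syntax only (optional sign, digits, optional decimal point, optional exponent) and ignore thousands separators and culture settings. The second, looser pattern is meant to accept exactly those inputs that could still become valid if more characters were appended.

```python
import re

# Complete input: "12", "-1.2", "1.", ".5", "12e-1" (sketch-specific syntax).
VALID = re.compile(r"[+-]?(?:\d+\.?\d*|\.\d+)(?:[eE][+-]?\d+)?")

# Input that is a prefix of some complete input: "", "-", ".", "1e", "1e-".
PARTIAL = re.compile(r"[+-]?(?:(?:\d+\.?\d*|\.\d+)(?:[eE][+-]?\d*)?|\.)?")


def classify(text):
    """Return 'valid', 'partial' (can still become valid) or 'invalid'."""
    if VALID.fullmatch(text):
        return "valid"
    if PARTIAL.fullmatch(text):
        return "partial"
    return "invalid"


def accept_edit(new_text):
    """A text box would refuse any edit that leaves the text 'invalid'."""
    return classify(new_text) != "invalid"
```

With this split, a text box can accept "partial" text while signalling that more characters are needed, and simply refuse any keystroke that would make the text "invalid".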


See full detail: http://www.codeproject.com/KB/edit/ParsableTextBox.aspx

Saturday, August 1, 2009

Building a web crawler

In October’s issue I showed how to develop an HTML Container class. This month, we will use that class to develop a general purpose Web Crawler class. The HTML Container project, including a VB.NET version, can be downloaded from the VSJ web site.

Before getting started you will need to add the HTML Container class (WebWagon.dll) to your project. From the menu, choose Projects|Add Reference. Click the Projects tab and then the Browse button. Navigate to the location of WebWagon.dll and click OK.

A Web Crawler (sometimes referred to as a spider or robot) is a process that visits a number of web pages programmatically, usually to extract some sort of information. For example, the popular search engine Google has a robot called googlebot that sooner or later visits virtually every page on the Internet for the purpose of indexing the words on that page. We are going to develop a general-purpose class that can be used as a basis for writing any type of robot. This class will be simple yet powerful. The heart of the class is a method called CrawlURL, which accepts a beginning URL. The contents of this URL will be loaded into the HTML Container class. For each link found on this page, the method calls itself again, repeating the process for each of those pages and so on.

The basic process is pretty simple, but we must add a few more features in order to avoid spiralling into an infinite loop. First of all, we want to keep track of pages that we’ve already seen. Many sites are such that Page A points to Page B, which points back again to Page A, and the routine will soon be chasing its own tail if we don’t prevent it from doing so. A related problem has to do with the allowable recursion level. If left unbounded, many sites will cause the crawler to dig itself into a hopelessly deep hole, using large amounts of stack space and memory. It is also possible to encounter a submission form that points back to itself, using a parameterized URL that fools the check for already encountered links. For the sake of usability, we want to restrict the range so we can target a specific site or group of sites, while ignoring everything else. Finally, as we will see later, robot etiquette requires that we maintain an Exclude list, so we might as well expose this function to the object caller as well.

The first problem could be solved in a number of ways. We could have an array or collection representing the known URLs and a routine that would check to see if a given URL was already in the list. Actually, we can save a bit of work by using the .NET Queue class. This class has two methods that will be useful for this purpose: the Enqueue method, which adds an item to a queue, and the Contains method, which returns True if a given item is in the queue or False if not. These are exactly the functions needed for the task at hand.

A property will be added to set or indicate the maximum recursion level. We will write a private version of CrawlURL, which accepts a URL as does the public method, but also expects a recursion level parameter. The public interface will simply invoke the private method, passing a zero, and the process will continue from there. When the maximum level has been reached, we will stop the process. When this happens, we don’t want to just throw away the links; if we did, we might drop pages that are not accessible from another path. We will use another queue that is populated with links that would have been visited had we not exceeded the maximum level. These links will in turn be visited once the initial recursion has completed, and the process will start again. A lower recursion level will save memory without causing any links to be missed completely.
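
The interplay of the seen list, the recursion limit and the second queue can be sketched without any networking. This is a Python illustration rather than the article's .NET class: crawl and get_links are hypothetical names, get_links stands in for loading a page and extracting its links, and a set replaces the Queue used for the seen list.

```python
from collections import deque


def crawl(start_url, get_links, max_depth):
    """Visit every reachable page once; defer links beyond max_depth."""
    seen = {start_url}   # URLs already visited or already scheduled
    deferred = deque()   # links found too deep; revisited from level 0 later
    order = []           # visit order, returned so the behavior can be checked

    def visit(url, level):
        order.append(url)
        for link in get_links(url):
            if link in seen:
                continue              # avoids the A -> B -> A loop
            seen.add(link)
            if level + 1 > max_depth:
                deferred.append(link)  # too deep for now, keep it for later
            else:
                visit(link, level + 1)

    visit(start_url, 0)
    # Each deferred link starts a fresh recursion at level 0.
    while deferred:
        visit(deferred.popleft(), 0)
    return order
```

Marking a link as seen at the moment it is discovered, rather than when it is visited, keeps the same URL from being queued twice.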

The Include list will be optional. If none is specified, any URL will be allowed unless found on the Exclude list. If one or more are present, the URL in question will be required to match the Include. A partial path will be allowed, so the path:

http://www.bbc.co.uk/sports
…will include or exclude (depending on the list) anything that matches to that point.
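
A minimal sketch of this check, in Python and with a hypothetical function name (is_allowed), might look as follows. It assumes that a match on the Exclude list wins over a match on the Include list, which the text above does not spell out.

```python
def is_allowed(url, include=(), exclude=()):
    """Decide whether a URL may be visited, using partial-path matching."""
    # Assumption of this sketch: an Exclude match always wins.
    if any(url.startswith(prefix) for prefix in exclude):
        return False
    # With an Include list present, the URL must match one of its entries.
    if include:
        return any(url.startswith(prefix) for prefix in include)
    # No Include list: everything not excluded is allowed.
    return True
```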

A well-mannered robot

As with just about everything in life, there are certain rules that should be followed when writing a robot. There are two important bits of information that every robot should check before visiting a specific link. The first of these is the robots.txt file. This file may or may not be present, but if it is, it will be found in the root directory of the server. For example, the BBC’s robots.txt file can be found at:
http://www.bbc.co.uk/robots.txt
A robots.txt file consists of two parts: the robot name and the list of excludes meant for that specific robot. An asterisk (*) is used to indicate any robot not explicitly named (us, for example). Here is a sample entry from the BBC robots.txt file:
User-agent: *
Disallow: /cgi-bin
This tells us that we should not index any files found in the /cgi-bin path, which for this site would mean everything that begins with:
http://www.bbc.co.uk/cgi-bin/
Two important things should be noted about the robots.txt file. First of all, it has nothing to do with security. You can, if you wish, ignore the robots.txt file and traverse to your heart’s content, but it is not considered polite to do so. Secondly, it is often the case that the excluded paths contain duplicated or otherwise uninteresting information that you probably didn’t want to visit anyway. Often the robots.txt file is maintained by the webmaster as a courtesy to the robot writer rather than a hindrance.
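
For comparison, here is how the sample entry can be checked in Python with the standard library's urllib.robotparser module. This is only an illustration of the rule, not the article's .NET code; the agent name "ExampleCrawler" and the helper names build_rules and may_visit are made up for this sketch, and the rules are parsed from a string instead of being downloaded.

```python
from urllib.robotparser import RobotFileParser

# The sample entry quoted above.
SAMPLE_ROBOTS_TXT = "User-agent: *\nDisallow: /cgi-bin\n"


def build_rules(robots_txt):
    """Parse the text of a robots.txt file that has already been fetched."""
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())
    return rules


def may_visit(rules, url, agent="ExampleCrawler"):
    """Return True if the rules allow this agent to fetch the URL."""
    return rules.can_fetch(agent, url)
```

Because the entry names the agent "*", it applies to any robot that is not listed explicitly, including the hypothetical one used here.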

While the robots.txt file applies to the entire site, a given page can also have meta-tags that contain robot information. There are two of these that we should be concerned with here, NoIndex and NoFollow, which may appear together, separately or not at all. Here is an example of a meta-tag that contains both values:

<META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW">

The NoIndex attribute requests that text not be indexed from that page. The NoFollow attribute requests that none of the links on that page be crawled.

We are just about ready to write the CrawlURL method, but first let’s take a look at the HTML Container class. There are two methods that we will find particularly useful: LoadSource and GetHRefs. The first of these will grab a document given a URL. The second will extract every HRef element from the anchor tags on that page. Perfect – except that it would have been nice if the HTML Document class had included NoIndex and NoFollow properties to save us the trouble of checking them ourselves.

Extending the HTML container class

Well, of course there is an easy solution: inheritance. We will create a new HTML Container class that inherits from the original, while adding two additional properties: NoIndex, True if a NoIndex meta-tag is present, and NoFollow, True if a NoFollow meta-tag is present. Recalling the HTML Document class again, we find the LoadStatus event, which is raised with a Description of either “Complete” or “Error” when the page has completed loading or failed to load. This sounds like a good place to check the document’s meta-tags. We will write a private routine, SetRobotsFlags, which will set module-level Boolean variables corresponding to the NoIndex and NoFollow properties, depending on the presence or absence of their corresponding meta-tags. When a page has been loaded successfully, this routine will be called to set the flags. In the event of an error, we will set both of these flags to True.
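
The flag-setting logic can be sketched on its own. The following Python illustration uses the standard html.parser module and is not the article's SetRobotsFlags routine; RobotsMetaParser and robots_flags are hypothetical names, and the load_ok parameter stands in for the "Complete" or "Error" outcome of the LoadStatus event.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects NoIndex / NoFollow from <meta name="robots"> tags."""

    def __init__(self):
        super().__init__()
        self.no_index = False
        self.no_follow = False

    def handle_starttag(self, tag, attrs):
        # HTMLParser lower-cases tag and attribute names for us.
        if tag != "meta":
            return
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("name", "").strip().lower() != "robots":
            return
        values = {v.strip().lower() for v in attr.get("content", "").split(",")}
        self.no_index = self.no_index or "noindex" in values
        self.no_follow = self.no_follow or "nofollow" in values


def robots_flags(html, load_ok=True):
    """Return (no_index, no_follow); both are True if the page failed to load."""
    if not load_ok:
        return True, True
    parser = RobotsMetaParser()
    parser.feed(html)
    parser.close()
    return parser.no_index, parser.no_follow
```

Setting both flags to True after a failed load is the conservative choice: a page that could not be read is neither indexed nor used as a source of links.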

Defining the classes and events

To create a new class based on an existing class, we declare the class and then use the Inherits keyword in VB.NET, or suffix the class name with a colon followed by the base class in C#. The modified HTML Container will be a stand-alone class, but contained in the same namespace as the WebCrawler class:

See full detail: http://www.vsj.co.uk/articles/display.asp?id=402