HTML Screen Scraping using C#
.Net WebClient
What is Screen Scraping?
Screen scraping means reading the contents of a web page programmatically.
Suppose you go to yahoo.com: what you see is the interface, which includes
buttons, links, images and so on. What you don't see is the target URL of each
link, the file name of each image, or the method used by a button, which can be
POST or GET. In other words, you don't see the HTML behind the page. Screen
scraping pulls the HTML of the web page, and this HTML includes every tag that
is used to make up the page.
Why use screen scraping?
The question that comes to mind is why we would ever want the HTML of a web
page. Screen scraping does not stop at pulling out the HTML; it can also
display it. In other words, you can pull the HTML from any web page and display
that page inside your own page, much as you would with frames. The good thing
about screen scraping is that it is supported by all browsers, while frames
unfortunately are not.
Also, sometimes you visit a website with many links labelled image1, image2,
image3 and so on. To see those images you have to click each link, and the
image opens in the parent window or a new window. By using screen scraping you
can pull all the images from a particular web page and display them on your
own page.
Displaying a web page on your own page using Screen Scraping:
Let's see a small code snippet which you can use to display any page on your
own page. First make a small interface like the one I have made below. As you
can see, the interface is quite simple. It has a button which says "Display
WebPages below", and the web page, believe it or not, will be displayed in
place of the label. All the code is written in the Button Click event. Below
you can see the "Button Click Code".
C# Button Click Code:
// Requires: using System.Net; using System.Text;
private void Button1_Click(object sender, System.EventArgs e)
{
    WebClient webClient = new WebClient();
    const string strUrl = "http://www.yahoo.com/";
    byte[] reqHTML;
    // download the raw bytes of the page
    reqHTML = webClient.DownloadData(strUrl);
    UTF8Encoding objUTF8 = new UTF8Encoding();
    // decode the bytes as UTF-8 and show the HTML in the label
    lblWebpage.Text = objUTF8.GetString(reqHTML);
}
Explanation of the Code Snippet in C#:
As you can see, the code is only a few lines long. This is because the
Microsoft .NET Framework has a very strong set of class libraries that makes
the task easier for the developer. If you were trying to achieve the same
result in classic ASP you would probably have to write a lot more code, so
that's good news for all the coders out there in the programming world.
In the first line I create an instance of the WebClient class. The WebClient
class provides common methods for sending data to or receiving data from any
local, intranet, or Internet resource identified by a URI.
In the next line we define a constant string, strUrl, which holds the URL of
the web page we wish to use in our example.
Then we declare a byte array, reqHTML, which will hold the bytes transferred
from the web page.
The next line downloads the data in the form of bytes and puts them in the
reqHTML byte array.
The UTF8Encoding class represents the UTF-8 encoding of Unicode characters.
In the last line we use the GetString method of the UTF8Encoding class to
convert the bytes to a string, and finally we assign the result to the Text
property of the label.
This code fetches the www.yahoo.com homepage. Once the label is bound to the
HTML of the Yahoo page, the whole page is displayed.
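The download-and-decode steps described above can also be sketched outside a web form. This is a minimal sketch, not the article's code: the class name PageFetcher and the methods DecodeUtf8 and FetchHtml are hypothetical names chosen for illustration, and it assumes the page is served as UTF-8 (a page in another encoding would come out garbled).

```csharp
using System;
using System.Net;
using System.Text;

// Hypothetical helper class, introduced only for this illustration.
public static class PageFetcher
{
    // Converts raw response bytes to a string, assuming UTF-8 content.
    public static string DecodeUtf8(byte[] body)
    {
        return Encoding.UTF8.GetString(body);
    }

    // Downloads the page at the given URL and returns its HTML.
    public static string FetchHtml(string url)
    {
        // WebClient is disposable, so release it when the download is done.
        using (WebClient client = new WebClient())
        {
            byte[] body = client.DownloadData(url);
            return DecodeUtf8(body);
        }
    }

    public static void Main()
    {
        // Exercise the decoding step on in-memory bytes, so no network is needed.
        byte[] sample = Encoding.UTF8.GetBytes("<b>hello</b>");
        Console.WriteLine(DecodeUtf8(sample));
    }
}
```

In a page like the one above, the button handler would reduce to something like lblWebpage.Text = PageFetcher.FetchHtml(strUrl).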
The Generated HTML:
For those curious people who want to see what HTML was returned when the
request was made: you can easily view it by viewing the source of the Yahoo
page. In Internet Explorer go to View -> Source. Notepad will open with the
complete HTML of the page. Let's see a small screen shot of the HTML generated
when we visit yahoo.com. As you can see, the generated HTML is quite complex.
Wouldn't it be really cool if you could extract all the links from the
generated source? Let's try to do that :)
Extracting Urls:
The first thing you need in order to extract all the URLs from a web page is a
regular expression. I am not saying you cannot do this without a regular
expression; you can, but it will be much harder.
Regular Expression for Extracting Urls:
First you need to import System.Text.RegularExpressions. Next you need to make
a regular expression that can extract all URLs from the generated HTML. There
are many regular expressions already made for you, which you can view at
http://www.regexlib.com/ . Your regular expression would look like this:
Regex r = new Regex("href\\s*=\\s*(?:(?:\\\"(?<url>[^\\\"]*)\\\")|(?<url>[^\\s]*))");
This matches every href attribute in the page source: the text href, an equals
sign with optional whitespace around it, and then the link target, either
enclosed in double quotes or as an unquoted run of non-whitespace characters.
The target is captured in a named group (called url here).
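To see what such a pattern picks up, here is a small sketch that runs an href pattern against a hard-coded HTML fragment instead of a downloaded page. The class name LinkExtractor, the method ExtractUrls and the group name url are assumptions made for this illustration. The pattern is a slight variation on the one above: the unquoted branch also stops at ">" so that an unquoted link does not swallow the rest of the tag.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper class, introduced only for this illustration.
public static class LinkExtractor
{
    // Matches href="..." (quoted) or href=... (unquoted).
    // The group name "url" is an illustrative choice.
    private static readonly Regex HrefPattern = new Regex(
        "href\\s*=\\s*(?:\"(?<url>[^\"]*)\"|(?<url>[^\\s>]*))",
        RegexOptions.IgnoreCase);

    // Returns the link target of every href attribute found in the HTML.
    public static List<string> ExtractUrls(string html)
    {
        List<string> urls = new List<string>();
        foreach (Match m in HrefPattern.Matches(html))
        {
            urls.Add(m.Groups["url"].Value);
        }
        return urls;
    }

    public static void Main()
    {
        string html = "<a href=\"http://example.com/a\">A</a> <a href=/b>B</a>";
        foreach (string url in ExtractUrls(html))
        {
            Console.WriteLine(url);
        }
        // prints:
        // http://example.com/a
        // /b
    }
}
```

Reading the named group rather than the whole match gives the link target alone, without the leading href= or the surrounding quotes.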
User Interface in Visual Studio .Net:
I am keeping the user interface pretty simple. It consists of a TextBox, a
DataGrid and a Button. The DataGrid will be used to display all the extracted
URLs.
Here is a screen shot of the user interface.
The Code:
The code is implemented in the button click event. But before that, let's see
the important declarations. You also need to include the following namespaces:
System.Net;
System.Text;
System.Text.RegularExpressions;
System.Collections; // for ArrayList
System.IO; // if you plan to write to a file
// creates a button
protected System.Web.UI.WebControls.Button Button1;
// creates a byte array
private byte[] aRequestHTML;
// creates a string
private string myString = null;
// creates a datagrid
protected System.Web.UI.WebControls.DataGrid DataGrid1;
// creates a textbox
protected System.Web.UI.WebControls.TextBox TextBox1;
// creates the label
protected System.Web.UI.WebControls.Label Label1;
// creates the arraylist
private ArrayList a = new ArrayList();
Now let's see the button click code that does the actual work.
private void Button1_Click(object sender, System.EventArgs e)
{
    // make an object of the WebClient class
    WebClient objWebClient = new WebClient();
    // get the HTML from the URL written in the textbox
    aRequestHTML = objWebClient.DownloadData(TextBox1.Text);
    // create a UTF-8 encoding object
    UTF8Encoding utf8 = new UTF8Encoding();
    // decode the downloaded bytes as UTF-8 text
    myString = utf8.GetString(aRequestHTML);
    // regular expression that matches href attributes; the named group
    // (called url here) captures the link target
    Regex r = new Regex("href\\s*=\\s*(?:(?:\\\"(?<url>[^\\\"]*)\\\")|(?<url>[^\\s]*))");
    // get all the matches for the regular expression
    MatchCollection mcl = r.Matches(myString);
    foreach (Match ml in mcl)
    {
        // add the extracted URL to the array list
        a.Add(ml.Groups["url"].Value);
    }
    // assign the array list as the data source
    DataGrid1.DataSource = a;
    // bind the data to the grid
    DataGrid1.DataBind();
    // write the extracted URLs, one per line, to the file named test.txt
    StreamWriter sw = new StreamWriter(Server.MapPath("test.txt"));
    foreach (string url in a)
    {
        sw.WriteLine(url);
    }
    sw.Close();
}
The MatchCollection mcl holds all the extracted URLs, and you can iterate
through the collection to get each of them. Once you enter a URL in the
textbox and press the button, the DataGrid will be populated with the
extracted URLs. Here is a screen shot of the DataGrid. The screen shot shows
only a few of the extracted URLs; there are at least 50 of them.
Final Note:
As you can see, it is simple to extract URLs from any web page. You can also
make the column in the DataGrid a hyperlink column so that you can browse to
each extracted URL.