In a previous post I promised to go through a real example of how we can automate Internet Explorer to retrieve data from a website. We'll also enumerate some of the members of the IE Document Object as well as see how we can incorporate XMLHTTP into the mix.
The EPA's SunWise Program publishes a UV Index, which is the amount of UV exposure you can expect for the day. Their website has some neat graphs of how the UV Index works, and what each value means. I'm obliged to remind you that you should cover up in the sun, wear at least SPF 15 sunblock, and remember that just because it isn't sunny out doesn't mean you're not getting exposed to UV rays.
You can lookup the UV Index directly on their site. But we don't want to do that. We want to extract the UV index programmatically from the EPA's website.
The first thing we do is look at the source of the page (View > Source in Internet Explorer, Ctrl+U in FireFox) and see that the UV Index lookup is wrapped in HTML form tags. Wonderful! Here is the code.
<form action="http://oaspub.epa.gov/enviro/uv_search" method="post" id="uviform" onsubmit="return validateInput();">
<table border="0" cellpadding="1" cellspacing="0" summary="The UV Index query form." style="width:auto; margin-right:60px;">
<tr>
<td style="text-align:right; padding-right:0.5em; font-weight:bold;"><label for="label">ZIP Code:</label></td>
<td><input name="zipcode" type="text" id="zipcode" size="5" maxlength="5" /></td>
</tr>
<tr>
<td> </td>
<td style="padding:0.7em 0;">or</td>
</tr>
<tr>
<td style="text-align:right; padding-right:0.5em; font-weight:bold;"><label for="label2">City:</label></td>
<td><input id="city_name" type="text" name="city_name" size="30" /></td>
</tr>
<tr>
<td style="text-align:right; padding-right:0.5em; font-weight:bold;"><label for="label3">State:</label></td>
<td><input name="state_code" type="text" id="state_code" title="Postal abbreviation" size="2" maxlength="2" /></td>
</tr>
<tr>
<td> </td>
<td style="padding-top:20px;"><input type="submit" id="submit" style="font-weight:bold;" value="Get UV Index" />
<input type="reset" id="reset" style="font-size:0.9em;" value="Reset" /></td>
</tr>
</table>
</form>This markup tells me:
- The data you enter into this form is submitted to a URL for processing.
- The base URL for my query is "http://oaspub.epa.gov/enviro/uv_search".
- The form takes three inputs, named zipcode, city_name and state_code.
How did I figure that out?
- The "action" attribute of the form tag gives you the base URL for any query using that form's data. So any IE navigation or XMLHTTP query I want must begin with "http://oaspub.epa.gov/enviro/uv_search".
- There are three input fields for this particular form. Those are the fields that will be sent to the URL in the action attribute of the form tag. The name attribute for each one gives you the parameters you'll need to add to the query.
Any HTTP GET request must be formatted like this:
base URL?first_parameter=value&second_parameter=value&third_parameter=value
In other words, to start the parameter list, add a question mark to the end of the base URL. Each subsequent parameter is prepended with an ampersand ('&') and added to the end of the parameter list. Here's a great example: HTTP Client Methods – GET and POST
Since we can submit the form using only zip code, the URL for the query would look like this:
http://oaspub.epa.gov/enviro/uv_search?zipcode=11103
(It would actually be http://oaspub.epa.gov/enviro/uv_search?zipcode=11103&city_name=&state_code= but we can leave off the other parameters since they are blank anyway)
If you enter either URL into your web browser, and view the source of the resulting page, the tag we're looking for looks like this:
<img src="//www.epa.gov/enviro/gif/C_UV08.gif" alt="UVI 8" border=0>
This is the graphic that displays the UV Index. The index number itself is in the alt attribute of the IMG tag. So now we'll need to move to the HTML Object Model to parse the HTML from this page and extract the UV Index number.
The easy way
We'll use a simple trick: use the XMLHTTP object to retrieve the result of the query, instead of instantiating Internet Explorer. This will make our code run much faster, since we figured out how to get our result without having to actually visit the site.
No early bound references required!
Function GetUVIndex(zipCode As String) As String
Dim xml As Object ' MSXML2.XMLHTTP
Dim html As Object ' MSHTML.HTMLDocument
Dim imgLinks As Object ' MSHTML.IHTMLElementCollection
Dim imgLink As Object ' MSHTML.IHTMLElement
Dim result As String
Set xml = CreateObject("MSXML2.XMLHTTP.6.0")
With xml
.Open "GET", "http://oaspub.epa.gov/enviro/uv_search?zipcode=" & zipCode, False
.Send
End With
result = xml.responseText
' create HTML document
Set html = CreateObject("htmlfile") ' New MSHTML.HTMLDocument
html.body.innerHTML = result
Set imgLinks = html.getElementsByTagName("img")
For Each imgLink In imgLinks
If Left$(imgLink.getAttribute("alt"), 3) = "UVI" Then
GetUVIndex = Right$(imgLink.getAttribute("alt"), _
Len(imgLink.getAttribute("alt")) - InStrRev(imgLink.getAttribute("alt"), " "))
Exit Function
End If
Next imgLink
End FunctionSample usage
Sub TestGetUVIndex()
Debug.Print GetUVIndex("11103")
End SubWe use XMLHTTP to run the query, but instead of using the MSXML Object Model, we use the HTML Object Model because the result is HTML, not XML.
After grabbing the result page, we put it into a HTMLDocument Object. The getElementsByTagName Method returns a collection of all the elements with the given tag name (for example: input, a, img). The getAttribute Method will return the given attribute name for each tag element. Then it's a simple matter of looking for the correct tag information.
We're looking for the IMG tag that contains the UV Index in its alt attribute, so we use html.getElementsByTagName("img") and then loop through the resulting collection.
The hard way
Let's explore how we can do this using only IE automation (no XMLHTTP).
I call this "the hard way" because you will see how much more is required of this method.
You'll need to study the source code of the first webpage much more closely, to figure out the field names you want to fill out along with the appropriate methods for each. We didn't need to do that when we pointed XMLHTTP directly at the endpoint URL.
Here's the function and then we'll discuss further.
Function GetUVIndexWithIE(zipCode As String) As String
Dim ie As Object ' InternetExplorer
Dim html As Object ' MSHTML.HTMLDocument
Dim imgLinks As Object ' MSHTML.IHTMLElementCollection
Dim imgLink As Object ' MSHTML.IHTMLElement
Dim inputCollection As Object ' MSHTML.IHTMLElementCollection
Dim inputElement As Object ' MSHTML.IHTMLElement
Dim result As String
Set ie = CreateObject("InternetExplorer.Application")
' go to target page
ie.Navigate "http://www.epa.gov/sunwise/uvindex.html"
'ie.Visible = True
' loop until page is loaded
Do Until ie.READYSTATE = 4 ' READYSTATE_COMPLETE
DoEvents
Loop
' enter zip code
' there's only one page element with id="zipcode", so we
' can enter its value directly
ie.document.getElementById("zipcode").value = zipCode
' click submit button
' there are multiple input elements named "submit", so we
' need to loop
Set inputCollection = ie.document.getElementsByTagName("input")
For Each inputElement In inputCollection
If inputElement.getAttribute("value") = "Get UV Index" Then
inputElement.Click
Exit For
End If
Next inputElement
' loop until page is loaded
Do Until ie.READYSTATE = 4 ' READYSTATE_COMPLETE
DoEvents
Loop
' grab screenshot
result = ie.document.body.innerHTML
' close IE
ie.Quit
' create HTML document
Set html = CreateObject("htmlfile") ' New MSHTML.HTMLDocument
html.body.innerHTML = result
' loop through IMG tags
Set imgLinks = html.getElementsByTagName("img")
For Each imgLink In imgLinks
If Left$(imgLink.getAttribute("alt"), 3) = "UVI" Then
GetUVIndexWithIE = Right$(imgLink.getAttribute("alt"), _
Len(imgLink.getAttribute("alt")) - InStrRev(imgLink.getAttribute("alt"), " "))
Exit Function
End If
Next imgLink
End FunctionThe first thing we do is instantiate Internet Explorer, which itself can take a few seconds. After navigating to the first page, we need to loop to wait until the page is loaded. This will usually happen quickly, but if the page never loads, or loads very slowly, we could be here a while. In the declarations section as well as inline, I provide the early-bound versions in case you want to set a reference to Microsoft Internet Controls and take advantage of early-binding, or if you just want to look up each object on MSDN to see its members.
Now we'll need to locate the input element for zip code. By studying the underlying HTML source for the page, I learned that the zip code field is the only one named "zipcode":
<input name="zipcode" type="text" id="zipcode" size="5" maxlength="5" />
That means it is the only field with the name attribute "zipcode". I can therefore input the zip code directly into the field like this:
ie.document.getElementById("zipcode").value = zipCodeTake note: this is how you input anything on a webpage using IE automation where the field has a unique name. Unfortunately, this is the only place where we're able to do this — we'll need to loop to find the Submit button because there are multiple Submit buttons on this page.
The Submit button is an input element; that is to say, the tag name is "input":
<input type="submit" id="submit" style="font-weight:bold;" value="Get UV Index" />
So we set an object reference to the collection of input elements using this syntax:
ie.document.getElementsByTagName("input")Late bound, you would declare the object reference As Object; early bound it is a MSHTML.IHTMLElementCollection Object.
As we saw earlier, the getElementsByTagName Method returns a collection of all the elements with the given tag name. So if I needed all the links (<a> tag), I would use ie.document.getElementsByTagName("a") and if I wanted all the text boxes (<textarea> tag), I would write ie.document.getElementsByTagName("textarea"). For a list of HTML tags you can grab, see HTML Reference.
The Submit button's value attribute DOES have a unique name, so we'll check for that in our loop:
' use Click Method on appropriate Submit button
For Each inputElement In inputCollection
If inputElement.getAttribute("value") = "Get UV Index" Then
inputElement.Click
Exit For
End If
Next inputElementWe loop again to make sure the next page is fully loaded, then "screen scrape" it into a String variable. The next step is very important. Once we're done with IE, we call the Quit Method to make sure it doesn't hang around after the function ends.
After that, the code is identical to the function that uses XMLHTTP. The resulting page is put into an HTMLDocument Object and parsed for the UV Index.
The Results
How do the two functions stack up?
In my tests, GetUVIndexWithIE took anywhere from 1.5 to 2.5 seconds, where GetUVIndex took less than 1/10 of a second. Try this test procedure and report your findings.
Sub TestGetUVIndex()
Dim starttime As Single
Dim endtime As Single
starttime = Timer
Debug.Print GetUVIndexWithIE("11103")
endtime = Timer
Debug.Print "GetUVIndexWithIE took " & endtime - starttime & " seconds."
starttime = Timer
Debug.Print GetUVIndex("11103")
endtime = Timer
Debug.Print "GetUVIndex took " & endtime - starttime & " seconds."
End SubHTML Document Object Members
The Document Object is one of the more important Objects in the HTML Object Model because it represents the document being displayed by the browser at any given time. This is the typical object you'll be working with when automating IE, filling in forms and clicking web buttons. Here are some of its members.
All of these functions assume that you have already navigated to a URL.
Document Object Properties
The Charset Property returns something like "utf-8" for Unicode support.
Function GetCharSet(ie As Object) As String GetCharSet = ie.document.Charset End Function
The last modified date of the current page.
Function GetLastModified(ie As Object) As Date GetLastModified = ie.document.LastModified End Function
ReadyState Property
Function GetReadyState(ie As Object) As String GetReadyState = ie.document.READYSTATE End Function
Current document URL
Function GetURL(ie As Object) As String GetURL = ie.document.URL End Function
Document Object Collections
The Document Object has several important collections you can iterate through pretty easily.
All Elements
Function GetAllElements(ie As Object) As Object Set GetAllElements = ie.document.all End Function
Anchors Collection
From MSDN:
Retrieves a collection of all a objects that have a name and/or id property. Objects in this collection are in HTML source order.
Just like getElementByID, this function returns all the A tags (links) that have a name or ID attribute. The important thing to note is the "source order" which is the order in which the links appear in the document.
Function GetAllAnchors(ie As Object) As Object Set GetAllAnchors = ie.document.anchors End Function
Images Collection
From MSDN:
Retrieves a collection, in source order, of img objects in the document.
Function GetAllImages(ie As Object) As Object Set GetAllImages = ie.document.images End Function
Links Collection
From MSDN:
Retrieves a collection of all a objects that specify the href property and all area objects in the document.
Function GetAllLinks(ie As Object) As Object Set GetAllLinks = ie.document.Links End Function
Document Object Methods
Some of the more popular methods are listed here. Each of these returns an object or collection of objects, which you iterate through in your code to fill in forms, click buttons, follow links, and so on.
Function GetSpecificElement(ie As Object, elementName As String) As Object Set GetSpecificElement = ie.document.getElementById(elementName) End Function
Function GetElementsByName(ie As Object, elementName As String) As Object Set GetElementsByName = ie.document.getElementsByName(elementName) End Function
Function GetElementsByTagName(ie As Object, tagName As String) As Object Set GetElementsByTagName = ie.document.getElementsByTagName(tagName) End Function
Sample usage
This function returns an instance of IE, and optionally visits a URL. It uses a simple RegEx pattern to validate the URL passed to the function.
Function GetIE(Optional URL As String) As Object
Dim regExp As Object ' VBScript_RegExp_55.regExp
On Error Resume Next
Set GetIE = CreateObject("InternetExplorer.Application")
On Error GoTo 0
If Not GetIE Is Nothing Then
If Len(URL) > 0 Then
' validate URL
' http://www.dailydoseofexcel.com/archives/2005/08/13/pattern-matching/
Set regExp = CreateObject("VBScript.RegExp")
With regExp
.MultiLine = False
.Global = False
.IgnoreCase = True
' http://regexlib.com/Search.aspx?k=url
.Pattern = "^https?://[a-zA-Z0-9\-\.]+\.[a-zA-Z]{2,3}(/\S*)?$"
End With
' match URL against pattern
If regExp.Test(URL) Then
GetIE.Navigate URL
End If
End If
End If
End FunctionSub TestIEDocumentObject()
Dim ie As Object
Set ie = GetIE("http://www.google.com/")
If Not ie Is Nothing Then
Debug.Print GetCharSet(ie)
End If
End Sub




Hi JP,
Thanks for your code on entering a password into a webpage, it works a treat. I now need to select from a drop-down box several times, then copy a datasheet into excel. I'm also a nebie to HTML so slow instuctions would be appreciated!
Many thanks,
Rebecca
Hi JP,
I am trying to adapt your example to this site http://www.bjstrat.net/cgi-bin/cdca_web.exe
The initial objective is to input, say 3, in field txt2, hit Compute (which is no problem) and return the overall EV
Thanks for your hints
Cheers
James
See http://www.jpsoftwaretech.com/blog/2010/06/an-exploration-of-ie-browser-methods-part-ii/
To put the number "3" into the input field that has an ID of "txt2", try
ie.document.getElementById("txt2").value = 3
That should get you started.
Hi there,
Thanks for your answer …
Sorry I did not express myself clearly enough …
Thanks to you, I have already managed to get my code
working, with the "hard way" approach …
My interest is to get the XMLHTTP method working …
Thanks in advance for your insight.
Cheers
James
If you view the source of the page, it's a form using POST method to http://www.bjstrat.net/cgi-bin/cdca_web.exe. So all you need to do is send a POST request using the following URL:
http://www.bjstrat.net/cgi-bin/cdca_web.exe?txt2=3
For an example of sending a POST request and scraping the result, see http://www.jpsoftwaretech.com/blog/2011/01/another-way-to-hide-your-email/
HTH
Hi JP,
Many thanks for your answer …
Sincerely I did not even know about the existence of the "POST" method …
Thanks also for the very "inspiring link" … which will help me out …
Cheers
James
Hi JP,
I have stumbled upon some XBRL type of a XML document (from the Exchange Commission), I tried your method but it did not work very well? Could you offer me some pointers?
Here is the link, I'd like to scrape out the value for "Gross Profit", possible?
http://www.sec.gov/Archives/edgar/data/789019/000119312510171791/0001193125-10-171791.txt
Your insight would be appreciated.
That file is too big to be assigned to a String variable. You'll need to download the file, open and read parts of it into a String, then parse it for the value you want. You might also ask the SEC if they have some kind of API to request this information.
Sounds good, I will give that a try!
What technique do you recommend for trying to locate this data, since it would be extremely fragmented? I was thinking about finding a reference point in each file then offsetting a certain amount of characters to retrieve the data. Unfortunately, each file has a different character count to where the data is located, what do you recommend?
I really don't have anything to recommend other than what I mentioned, the file is terrible. Did you try saving the file as .xml and reading it into a MSXML document?
What would be your recommended method of reading this document as an MSXML file? Can this be done via Excel and some VBA? I'd like to try what you suggested, but am a novice with XML. The document I a trying to scrap, I have read, is a derivative of the XML structure (XBRL), hopefully what you suggest may just expose the data I am looking to capture?
Download the file and change the filename to .xml. Generally I do this in the local temp folder. Then load the xml into a MSXML document like this:
Set xmlDoc = CreateObject("MSXML2.DOMDocument.6.0") With xmlDoc .async = False .validateOnParse = False .Load "path to your XML file" End WithOnce you load the XML file into the document, you'll need to use MSXML's native methods to search for the information you want. For more code samples, search for "MSXML" on this blog.
I loaded the file as you suggested but I do not see anything within the xmlDoc.text? I saw the MSXML sample on this blog, but they all look like they are searching the .text area? Am I doing something wrong?
I am loving this "Easy Way"! But it seems there are some limitations for me. For some reason, I can use the command GetElementsByTagName( for 'img' and 'a' elements, but not for 'meta' or 'link'.
I've tried other GetElement functions (GetElementByID, getElementsByName(, etc), but nothing seems to make it work. The strange thing is that I can use the "Hard Way" and GetElementsByTagName( works just fine with 'meta' and 'link'. It just takes longer loading the page.
Any suggestions to make this work for 'meta' and 'link' with the "Easy Way"?
'Meta' and 'link' tags are HEAD elements that, I assume, aren't part of the official response. So they may not be available using the faster method. However when you automate Internet Explorer ("the hard way") you have direct access to all page elements.
I have vba InternetExplorer automation and a i get one page and in this page exist one form with attribute target="_blank", how to change this attribute to "_self" before subit the form?
What is the code you are using and what is the form source code?
I still haven't written the code because I don't know if I can make this type of modification in an attribute of the form and also don't know how to code attribute modification
OK. The question you asked cannot be answered without at least a link to the online form. Can you provide that?
Hi JP! I'd like to use xmlhttp on a page where authentication is needed, and after the authentication, the user has to choose a profile using a dropdown list. After the choose, I can programmatically navigate to a Url using POST method, and the page contains data that I download in an Access table. Acutally I'm using Internetexplorer object, but i wonder if I can speed up the process using xmlhttp (I have to load about 150 pages every day). How can I bypass or set the profile? Sorry for my poor English, if you want I can post my code.
Another question: how about GET method where the variables are not inserted in the url's querystring? Is it possible to pass parameters to xmlhttp object? Maybe I'm a bit confused about that…