An exploration of IE browser methods, part II

In a previous post I promised to go through a real example of how we can automate Internet Explorer to retrieve data from a website. We'll also enumerate some of the members of the IE Document Object as well as see how we can incorporate XMLHTTP into the mix.

The EPA's SunWise Program publishes a UV Index, which is the amount of UV exposure you can expect for the day. Their website has some neat graphs of how the UV Index works, and what each value means. I'm obliged to remind you that you should cover up in the sun, wear at least SPF 15 sunblock, and remember that just because it isn't sunny out doesn't mean you're not getting exposed to UV rays.

You can lookup the UV Index directly on their site. But we don't want to do that. We want to extract the UV index programmatically from the EPA's website.

The first thing we do is look at the source of the page (View > Source in Internet Explorer, Ctrl+U in FireFox) and see that the UV Index lookup is wrapped in HTML form tags. Wonderful! Here is the code.

<form action="http://oaspub.epa.gov/enviro/uv_search" method="post" id="uviform" onsubmit="return validateInput();">
  <table border="0" cellpadding="1" cellspacing="0" summary="The UV Index query form." style="width:auto; margin-right:60px;">
    <tr>
      <td style="text-align:right; padding-right:0.5em; font-weight:bold;"><label for="label">ZIP Code:</label></td>
      <td><input name="zipcode" type="text" id="zipcode" size="5" maxlength="5" /></td>
    </tr>
    <tr>
      <td>&nbsp;</td>
      <td style="padding:0.7em 0;">or</td>
    </tr>
    <tr>
      <td style="text-align:right; padding-right:0.5em; font-weight:bold;"><label for="label2">City:</label></td>
      <td><input id="city_name" type="text" name="city_name" size="30" /></td>
    </tr>
    <tr>
      <td style="text-align:right; padding-right:0.5em; font-weight:bold;"><label for="label3">State:</label></td>
      <td><input name="state_code" type="text" id="state_code" title="Postal abbreviation" size="2" maxlength="2" /></td>
    </tr>
    <tr>
      <td>&nbsp;</td>
      <td style="padding-top:20px;"><input type="submit" id="submit" style="font-weight:bold;" value="Get UV Index" />
        <input type="reset" id="reset" style="font-size:0.9em;" value="Reset" /></td>
    </tr>
  </table>
</form>

This markup tells me:

  • The data you enter into this form is submitted to a URL for processing.
  • The base URL for my query is "http://oaspub.epa.gov/enviro/uv_search".
  • The form takes three inputs, named zipcode, city_name and state_code.

How did I figure that out?

  • The "action" attribute of the form tag gives you the base URL for any query using that form's data. So any IE navigation or XMLHTTP query I want must begin with "http://oaspub.epa.gov/enviro/uv_search".
  • There are three input fields for this particular form. Those are the fields that will be sent to the URL in the action attribute of the form tag. The name attribute for each one gives you the parameters you'll need to add to the query.

Any HTTP GET request must be formatted like this:

base URL?first_parameter=value&second_parameter=value&third_parameter=value

In other words, to start the parameter list, add a question mark to the end of the base URL. Each subsequent parameter is prepended with an ampersand ('&') and added to the end of the parameter list. Here's a great example: HTTP Client Methods – GET and POST

Since we can submit the form using only zip code, the URL for the query would look like this:

http://oaspub.epa.gov/enviro/uv_search?zipcode=11103

(It would actually be http://oaspub.epa.gov/enviro/uv_search?zipcode=11103&city_name=&state_code= but we can leave off the other parameters since they are blank anyway)

If you enter either URL into your web browser, and view the source of the resulting page, the tag we're looking for looks like this:

<img src="//www.epa.gov/enviro/gif/C_UV08.gif" alt="UVI 8" border=0>

This is the graphic that displays the UV Index. The index number itself is in the alt attribute of the IMG tag. So now we'll need to move to the HTML Object Model to parse the HTML from this page and extract the UV Index number.

The easy way

We'll use a simple trick: use the XMLHTTP object to retrieve the result of the query, instead of instantiating Internet Explorer. This will make our code run much faster, since we figured out how to get our result without having to actually visit the site.

No early bound references required!

Function GetUVIndex(zipCode As String) As String

Dim xml As Object ' MSXML2.XMLHTTP
Dim html As Object ' MSHTML.HTMLDocument
Dim imgLinks As Object ' MSHTML.IHTMLElementCollection
Dim imgLink As Object ' MSHTML.IHTMLElement
Dim result As String

  Set xml = CreateObject("MSXML2.XMLHTTP.6.0")

  With xml
    .Open "GET", "http://oaspub.epa.gov/enviro/uv_search?zipcode=" & zipCode, False
    .Send
  End With

  result = xml.responseText

  ' create HTML document
  Set html = CreateObject("htmlfile") ' New MSHTML.HTMLDocument

  html.body.innerHTML = result

  Set imgLinks = html.getElementsByTagName("img")

  For Each imgLink In imgLinks
    If Left$(imgLink.getAttribute("alt"), 3) = "UVI" Then
      GetUVIndex = Right$(imgLink.getAttribute("alt"), _
          Len(imgLink.getAttribute("alt")) - InStrRev(imgLink.getAttribute("alt"), " "))
      Exit Function
    End If
  Next imgLink

End Function

Sample usage

Sub TestGetUVIndex()

Debug.Print GetUVIndex("11103")

End Sub

We use XMLHTTP to run the query, but instead of using the MSXML Object Model, we use the HTML Object Model because the result is HTML, not XML.

After grabbing the result page, we put it into a HTMLDocument Object. The getElementsByTagName Method returns a collection of all the elements with the given tag name (for example: input, a, img). The getAttribute Method will return the given attribute name for each tag element. Then it's a simple matter of looking for the correct tag information.

We're looking for the IMG tag that contains the UV Index in its alt attribute, so we use html.getElementsByTagName("img") and then loop through the resulting collection.

The hard way

Let's explore how we can do this using only IE automation (no XMLHTTP).

I call this "the hard way" because you will see how much more is required of this method.

You'll need to study the source code of the first webpage much more closely, to figure out the field names you want to fill out along with the appropriate methods for each. We didn't need to do that when we pointed XMLHTTP directly at the endpoint URL.

Here's the function and then we'll discuss further.

Function GetUVIndexWithIE(zipCode As String) As String

Dim ie As Object ' InternetExplorer
Dim html As Object ' MSHTML.HTMLDocument
Dim imgLinks As Object ' MSHTML.IHTMLElementCollection
Dim imgLink As Object ' MSHTML.IHTMLElement
Dim inputCollection As Object ' MSHTML.IHTMLElementCollection
Dim inputElement As Object ' MSHTML.IHTMLElement
Dim result As String

  Set ie = CreateObject("InternetExplorer.Application")

  ' go to target page
  ie.Navigate "http://www.epa.gov/sunwise/uvindex.html"
  'ie.Visible = True

  ' loop until page is loaded
  Do Until ie.READYSTATE = 4  ' READYSTATE_COMPLETE
    DoEvents
  Loop

  ' enter zip code
  ' there's only one page element with id="zipcode", so we
  ' can enter its value directly
  ie.document.getElementById("zipcode").value = zipCode

  ' click submit button
  ' there are multiple input elements named "submit", so we
  ' need to loop
  Set inputCollection = ie.document.getElementsByTagName("input")

  For Each inputElement In inputCollection
    If inputElement.getAttribute("value") = "Get UV Index" Then
      inputElement.Click
      Exit For
    End If
  Next inputElement

  ' loop until page is loaded
  Do Until ie.READYSTATE = 4  ' READYSTATE_COMPLETE
    DoEvents
  Loop

  ' grab screenshot
  result = ie.document.body.innerHTML

  ' close IE
  ie.Quit

  ' create HTML document
  Set html = CreateObject("htmlfile") ' New MSHTML.HTMLDocument

  html.body.innerHTML = result

  ' loop through IMG tags
  Set imgLinks = html.getElementsByTagName("img")

  For Each imgLink In imgLinks
    If Left$(imgLink.getAttribute("alt"), 3) = "UVI" Then
      GetUVIndexWithIE = Right$(imgLink.getAttribute("alt"), _
                Len(imgLink.getAttribute("alt")) - InStrRev(imgLink.getAttribute("alt"), " "))
      Exit Function
    End If
  Next imgLink

End Function

The first thing we do is instantiate Internet Explorer, which itself can take a few seconds. After navigating to the first page, we need to loop to wait until the page is loaded. This will usually happen quickly, but if the page never loads, or loads very slowly, we could be here a while. In the declarations section as well as inline, I provide the early-bound versions in case you want to set a reference to Microsoft Internet Controls and take advantage of early-binding, or if you just want to look up each object on MSDN to see its members.

Now we'll need to locate the input element for zip code. By studying the underlying HTML source for the page, I learned that the zip code field is the only one named "zipcode":

<input name="zipcode" type="text" id="zipcode" size="5" maxlength="5" />

That means it is the only field with the name attribute "zipcode". I can therefore input the zip code directly into the field like this:

ie.document.getElementById("zipcode").value = zipCode

Take note: this is how you input anything on a webpage using IE automation where the field has a unique name. Unfortunately, this is the only place where we're able to do this — we'll need to loop to find the Submit button because there are multiple Submit buttons on this page.

The Submit button is an input element; that is to say, the tag name is "input":

<input type="submit" id="submit" style="font-weight:bold;" value="Get UV Index" />

So we set an object reference to the collection of input elements using this syntax:

ie.document.getElementsByTagName("input")

Late bound, you would declare the object reference As Object; early bound it is a MSHTML.IHTMLElementCollection Object.

As we saw earlier, the getElementsByTagName Method returns a collection of all the elements with the given tag name. So if I needed all the links (<a> tag), I would use ie.document.getElementsByTagName("a") and if I wanted all the text boxes (<textarea> tag), I would write ie.document.getElementsByTagName("textarea"). For a list of HTML tags you can grab, see HTML Reference.

The Submit button's value attribute DOES have a unique name, so we'll check for that in our loop:

' use Click Method on appropriate Submit button
For Each inputElement In inputCollection
  If inputElement.getAttribute("value") = "Get UV Index" Then
    inputElement.Click
    Exit For
  End If
Next inputElement

We loop again to make sure the next page is fully loaded, then "screen scrape" it into a String variable. The next step is very important. Once we're done with IE, we call the Quit Method to make sure it doesn't hang around after the function ends.

After that, the code is identical to the function that uses XMLHTTP. The resulting page is put into an HTMLDocument Object and parsed for the UV Index.

The Results

How do the two functions stack up?

In my tests, GetUVIndexWithIE took anywhere from 1.5 to 2.5 seconds, where GetUVIndex took less than 1/10 of a second. Try this test procedure and report your findings.

Sub TestGetUVIndex()

Dim starttime As Single
Dim endtime As Single

  starttime = Timer
  Debug.Print GetUVIndexWithIE("11103")
  endtime = Timer

  Debug.Print "GetUVIndexWithIE took " & endtime - starttime & " seconds."

  starttime = Timer
  Debug.Print GetUVIndex("11103")
  endtime = Timer

  Debug.Print "GetUVIndex took " & endtime - starttime & " seconds."

End Sub

HTML Document Object Members

The Document Object is one of the more important Objects in the HTML Object Model because it represents the document being displayed by the browser at any given time. This is the typical object you'll be working with when automating IE, filling in forms and clicking web buttons. Here are some of its members.

All of these functions assume that you have already navigated to a URL.

Document Object Properties

The Charset Property returns something like "utf-8" for Unicode support.

Function GetCharSet(ie As Object) As String
  GetCharSet = ie.document.Charset
End Function

The last modified date of the current page.

Function GetLastModified(ie As Object) As Date
  GetLastModified = ie.document.LastModified
End Function

ReadyState Property

Function GetReadyState(ie As Object) As String
  GetReadyState = ie.document.READYSTATE
End Function

Current document URL

Function GetURL(ie As Object) As String
  GetURL = ie.document.URL
End Function

Document Object Collections

The Document Object has several important collections you can iterate through pretty easily.

All Elements

Function GetAllElements(ie As Object) As Object
  Set GetAllElements = ie.document.all
End Function

Anchors Collection

From MSDN:

Retrieves a collection of all a objects that have a name and/or id property. Objects in this collection are in HTML source order.

Just like getElementByID, this function returns all the A tags (links) that have a name or ID attribute. The important thing to note is the "source order" which is the order in which the links appear in the document.

Function GetAllAnchors(ie As Object) As Object
  Set GetAllAnchors = ie.document.anchors
End Function

Images Collection

From MSDN:

Retrieves a collection, in source order, of img objects in the document.

Function GetAllImages(ie As Object) As Object
  Set GetAllImages = ie.document.images
End Function

Links Collection

From MSDN:

Retrieves a collection of all a objects that specify the href property and all area objects in the document.

Function GetAllLinks(ie As Object) As Object
  Set GetAllLinks = ie.document.Links
End Function

Document Object Methods

Some of the more popular methods are listed here. Each of these returns an object or collection of objects, which you iterate through in your code to fill in forms, click buttons, follow links, and so on.

Function GetSpecificElement(ie As Object, elementName As String) As Object
  Set GetSpecificElement = ie.document.getElementById(elementName)
End Function
Function GetElementsByName(ie As Object, elementName As String) As Object
  Set GetElementsByName = ie.document.getElementsByName(elementName)
End Function
Function GetElementsByTagName(ie As Object, tagName As String) As Object
  Set GetElementsByTagName = ie.document.getElementsByTagName(tagName)
End Function

Sample usage

This function returns an instance of IE, and optionally visits a URL. It uses a simple RegEx pattern to validate the URL passed to the function.

Function GetIE(Optional URL As String) As Object

Dim regExp As Object ' VBScript_RegExp_55.regExp

  On Error Resume Next
  Set GetIE = CreateObject("InternetExplorer.Application")
  On Error GoTo 0

  If Not GetIE Is Nothing Then
    If Len(URL) > 0 Then
      ' validate URL

      ' http://www.dailydoseofexcel.com/archives/2005/08/13/pattern-matching/
      Set regExp = CreateObject("VBScript.RegExp")

      With regExp
        .MultiLine = False
        .Global = False
        .IgnoreCase = True
        ' http://regexlib.com/Search.aspx?k=url
        .Pattern = "^https?://[a-zA-Z0-9\-\.]+\.[a-zA-Z]{2,3}(/\S*)?$"
      End With

      ' match URL against pattern
      If regExp.Test(URL) Then
        GetIE.Navigate URL
      End If
    End If
  End If
End Function
Sub TestIEDocumentObject()

Dim ie As Object

  Set ie = GetIE("http://www.google.com/")

  If Not ie Is Nothing Then
    Debug.Print GetCharSet(ie)
  End If
End Sub

Related Articles:

About JP

I'm just an average guy who writes VBA code for a living. This is my personal blog. Excel and Outlook are my thing, with a sprinkle of Access and Word here and there. Follow this space to learn more about VBA. Keep Reading »

Share This Article:

Share and bookmark this articledelicious buttonfacebook buttonlinkedin buttonstumbleupon buttontwitter button

comment bubble 20 Comments:

  1. Rebecca writes:

    Hi JP,

    Thanks for your code on entering a password into a webpage, it works a treat. I now need to select from a drop-down box several times, then copy a datasheet into excel. I'm also a nebie to HTML so slow instuctions would be appreciated!

    Many thanks,

    Rebecca

  2. James writes:

    Hi JP,

    I am trying to adapt your example to this site http://www.bjstrat.net/cgi-bin/cdca_web.exe

    The initial objective is to input, say 3, in field txt2, hit Compute (which is no problem) and return the overall EV

    Thanks for your hints

    Cheers
    James

  3. JP writes:

    If you view the source of the page, it's a form using POST method to http://www.bjstrat.net/cgi-bin/cdca_web.exe. So all you need to do is send a POST request using the following URL:

    http://www.bjstrat.net/cgi-bin/cdca_web.exe?txt2=3

    For an example of sending a POST request and scraping the result, see http://www.jpsoftwaretech.com/blog/2011/01/another-way-to-hide-your-email/

    HTH

  4. James writes:

    Hi JP,

    Many thanks for your answer …

    Sincerely I did not even know about the existence of the "POST" method …

    Thanks also for the very "inspiring link" … which will help me out …

    Cheers

    James

  5. 99bobster99 writes:

    Hi JP,

    I have stumbled upon some XBRL type of a XML document (from the Exchange Commission), I tried your method but it did not work very well? Could you offer me some pointers?

    Here is the link, I'd like to scrape out the value for "Gross Profit", possible?

    http://www.sec.gov/Archives/edgar/data/789019/000119312510171791/0001193125-10-171791.txt

    Your insight would be appreciated.

    • JP writes:

      That file is too big to be assigned to a String variable. You'll need to download the file, open and read parts of it into a String, then parse it for the value you want. You might also ask the SEC if they have some kind of API to request this information.

      • 99bobster99 writes:

        Sounds good, I will give that a try!

        What technique do you recommend for trying to locate this data, since it would be extremely fragmented? I was thinking about finding a reference point in each file then offsetting a certain amount of characters to retrieve the data. Unfortunately, each file has a different character count to where the data is located, what do you recommend?

        • JP writes:

          I really don't have anything to recommend other than what I mentioned, the file is terrible. Did you try saving the file as .xml and reading it into a MSXML document?

        • 99bobster99 writes:

          What would be your recommended method of reading this document as an MSXML file? Can this be done via Excel and some VBA? I'd like to try what you suggested, but am a novice with XML. The document I a trying to scrap, I have read, is a derivative of the XML structure (XBRL), hopefully what you suggest may just expose the data I am looking to capture?

        • JP writes:

          Download the file and change the filename to .xml. Generally I do this in the local temp folder. Then load the xml into a MSXML document like this:

          Set xmlDoc = CreateObject("MSXML2.DOMDocument.6.0")
          
            With xmlDoc
              .async = False
              .validateOnParse = False
              .Load "path to your XML file"
            End With

          Once you load the XML file into the document, you'll need to use MSXML's native methods to search for the information you want. For more code samples, search for "MSXML" on this blog.

  6. 99bobster99 writes:

    I loaded the file as you suggested but I do not see anything within the xmlDoc.text? I saw the MSXML sample on this blog, but they all look like they are searching the .text area? Am I doing something wrong?

  7. Joe Bergevin writes:

    I am loving this "Easy Way"! But it seems there are some limitations for me. For some reason, I can use the command GetElementsByTagName( for 'img' and 'a' elements, but not for 'meta' or 'link'.
    I've tried other GetElement functions (GetElementByID, getElementsByName(, etc), but nothing seems to make it work. The strange thing is that I can use the "Hard Way" and GetElementsByTagName( works just fine with 'meta' and 'link'. It just takes longer loading the page.
    Any suggestions to make this work for 'meta' and 'link' with the "Easy Way"?

    • JP writes:

      'Meta' and 'link' tags are HEAD elements that, I assume, aren't part of the official response. So they may not be available using the faster method. However when you automate Internet Explorer ("the hard way") you have direct access to all page elements.

  8. José A M Pacheco writes:

    I have vba InternetExplorer automation and a i get one page and in this page exist one form with attribute target="_blank", how to change this attribute to "_self" before subit the form?

    • JP writes:

      What is the code you are using and what is the form source code?

      • José A M Pacheco writes:

        I still haven't written the code because I don't know if I can make this type of modification in an attribute of the form and also don't know how to code attribute modification

        • JP writes:

          OK. The question you asked cannot be answered without at least a link to the online form. Can you provide that?

  9. Giò writes:

    Hi JP! I'd like to use xmlhttp on a page where authentication is needed, and after the authentication, the user has to choose a profile using a dropdown list. After the choose, I can programmatically navigate to a Url using POST method, and the page contains data that I download in an Access table. Acutally I'm using Internetexplorer object, but i wonder if I can speed up the process using xmlhttp (I have to load about 150 pages every day). How can I bypass or set the profile? Sorry for my poor English, if you want I can post my code.

    Another question: how about GET method where the variables are not inserted in the url's querystring? Is it possible to pass parameters to xmlhttp object? Maybe I'm a bit confused about that… :-)

Comments on this article are closed. Why?

Site last updated: February 12, 2012