Scrape Website Data into Excel using VBA

2:14 PM Unknown 0 Comments

I'll be showing you an example on how to Scrape Data from a Website into Excel Worksheet using VBA. We'll be scraping data from www(dot)renewableuk(dot)com. Please also read the privacy policy of the website before mining data.

Goal:
Get all data under all column headings which can be found on this website i.e.
Wind Project, Region, ..., Type of Project

Requirements:
You need to add a reference, Microsoft HTML Object Library on your VBA project.

Usage:
You can call the ProcessWeb() sub directly by pressing F5 on the Microsoft Visual Basic Window.
Or you can add a button on your excel worksheet then assign ProcessWeb() as the macro.

VBA CODE:

Function ScrapeWebPage(ByVal URL As String)
Dim HTMLDoc As New HTMLDocument
Dim tmpDoc As New HTMLDocument

Dim WS As Worksheet

Dim i As Integer, row As Integer
Dim File As Integer
Dim Filename As String
Dim DataLine As String
File = FreeFile

Filename = ActiveWorkbook.Path & "\html.log"

Set WS = Sheets("DATA")

'create new XMLHTTP Object
Set XMLHttpRequest = CreateObject("MSXML2.XMLHTTP")
XMLHttpRequest.Open "GET", URL, False
XMLHttpRequest.send

While XMLHttpRequest.readyState <> 4
DoEvents
Wend

With HTMLDoc.body
'Set HTML Document
.innerHTML = XMLHttpRequest.responseText

'Get only Order List Tag of HTML Document
Set orderedlists = .getElementsByTagName("ol")

'Reset the Document to the HTML of the second ordered list element
'where we only need to extract the data
.innerHTML = orderedlists(1).innerHTML

'Now, we'll get the list items
Set ListItems = .getElementsByTagName("li")

'Open our log file for output stream
Open Filename For Output As #File
For Each li In ListItems

With tmpDoc.body
'Set the temp doc
.innerHTML = li.innerHTML

'There are about 10 columns, so there are 10 p's
Set ps = .getElementsByTagName("p")

For Each p In ps
'Print only the text, excluding the tags
Print #File, p.innerText
Next

End With
Next
'close the file
Close #File

End With

'Open the file again, we'll use it to retrieve each data lines
Open Filename For Input As #File

'Last row of the worksheet
row = lastRow + 1

While Not EOF(File)
For i = 1 To 10
'read the data from the log file
Line Input #File, DataLine

'Put the data on the 1st to 10th column
WS.Cells(row, i).Value = DataLine

Next i
row = row + 1
Wend
Close #File

End Function

'Get the total number pages we need to scrape
Function totalPage() As Integer
Dim HTMLDoc As New HTMLDocument
Dim tmpDoc As New HTMLDocument
Dim html As String
Dim mask As String
Dim URL As String

URL = "http://www.renewableuk.com/en/renewable-energy/wind-energy/uk-wind-energy-database/index.cfm"

Set XMLHttpRequest = CreateObject("MSXML2.XMLHTTP")
XMLHttpRequest.Open "GET", URL, False
XMLHttpRequest.send

html = XMLHttpRequest.responseText

With HTMLDoc.body
.innerHTML = Mid(html, InStr(1, html, ""), 300)
mask = Mid(.innerHTML, InStr(1, LCase(.innerHTML), "
") - 2, 2)
End With

totalPage = mask

End Function

Function lastRow() As Long
lastRow = Range("A65536").End(xlUp).row
End Function

Sub ProcessWeb()
Dim URL As String
Dim i As Integer

Range("2:2", Selection.End(xlDown)).ClearContents
Range("A2").Select

Application.ScreenUpdating = False
Application.Cursor = xlWait

URL = "http://www.renewableuk.com/en/renewable-energy/wind-energy/uk-wind-energy-database/index.cfm/page/"

For i = 1 To totalPage
ScrapeWebPage URL & i
Application.StatusBar = "Please wait while processing page " & i & " of " & totalPage & "..."
Next i

Application.ScreenUpdating = True
Application.Cursor = xlDefault
Application.StatusBar = ""

MsgBox "Data Extraction is Done!"

End Sub

0 comments: