Web Scraping Services
Web scraping (also known as web harvesting or web data extraction) is the automated extraction of data (numbers or text) or files from websites. This usually involves writing a program that reads a website's HTML code and then extracts the relevant information from that code.
Data Extraction

Pulling data from websites and placing it into spreadsheets can be very time-consuming if done manually for large amounts of data. Statistical Consultants Ltd can provide web scraping services that greatly speed up such tasks. This would involve writing a program that takes the following steps:
- Downloads the HTML code of a webpage.
- Picks the relevant information from the HTML code.
- Repeats the two steps above for any other webpages.
- Saves the relevant information into a spreadsheet (or text file).
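The steps above can be sketched in Python using only the standard library. This is a minimal illustration rather than a production scraper: it assumes the relevant information sits in the `<td>` cells of HTML table rows, and the names `CellParser`, `download_html`, `extract_rows` and `scrape_to_csv` are hypothetical names chosen for this example.

```python
import csv
import urllib.request
from html.parser import HTMLParser


class CellParser(HTMLParser):
    """Collects the text of every <td> cell, grouped by <tr> row (an assumed layout)."""

    def __init__(self):
        super().__init__()
        self.rows = []         # completed rows
        self._row = None       # row currently being read, if any
        self._in_cell = False  # True while inside a <td> element

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._in_cell = True
            self._row.append("")

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False
        elif tag == "tr" and self._row is not None:
            if self._row:  # skip rows without <td> cells (e.g. header rows)
                self.rows.append([cell.strip() for cell in self._row])
            self._row = None
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self._row[-1] += data


def download_html(url):
    """Step 1: download the HTML code of a webpage."""
    with urllib.request.urlopen(url) as response:
        return response.read().decode("utf-8", errors="replace")


def extract_rows(html):
    """Step 2: pick the relevant information out of the HTML code."""
    parser = CellParser()
    parser.feed(html)
    parser.close()
    return parser.rows


def scrape_to_csv(urls, csv_path, fetch=download_html):
    """Steps 3 and 4: repeat for every webpage and save the result as a CSV file."""
    with open(csv_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        for url in urls:
            writer.writerows(extract_rows(fetch(url)))
```

The `fetch` parameter is an assumption of this sketch; it allows the download step to be swapped out (for example, to read saved HTML files) without changing the extraction logic. A CSV file is used here because it opens directly in spreadsheet software.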
Sometimes the program would have several stages, e.g. a first stage that finds the URLs of the webpages on a website, and a second stage that extracts the information from those webpages.
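The first stage of such a program might look like the sketch below, which collects the webpage URLs linked from an index page. The function name `find_page_urls`, the `must_contain` filter and the `/listing/` URL pattern are all assumptions made for illustration; a real site would need its own pattern.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkParser(HTMLParser):
    """Records the absolute URL of every <a href="..."> link on a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the page they came from.
                self.links.append(urljoin(self.base_url, href))


def find_page_urls(index_html, base_url, must_contain="/listing/"):
    """Stage 1: return the unique linked URLs that match a (hypothetical) pattern."""
    parser = LinkParser(base_url)
    parser.feed(index_html)
    parser.close()
    urls = []
    for link in parser.links:
        if must_contain in link and link not in urls:
            urls.append(link)
    return urls
```

Each URL returned by this stage would then be passed to the second stage, which downloads the page and extracts the information from it.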
Some website scraping examples:
- Business Directories – It may be desirable to extract data from one or more business directory sites and store it in a spreadsheet. Data stored in a spreadsheet is easier to sort and annotate (e.g. noting which businesses have been considered, contacted, or classified). A business listings spreadsheet would usually have one business per row, with each type of information (such as company name, phone number, fax, address, website, description, and type) in its own column. If more than one directory is scraped, it may be possible to create a combined super list with duplicates removed.
- Ratings and Reviews Data – Some websites contain many ratings and reviews. Ratings and reviews data has uses in academic and market research. If the data is stored in a spreadsheet, it is easier to analyse with statistical software (or simply to sort or summarise within the spreadsheet itself).
- Price and Product Lists – Many online stores (and ‘bricks-and-mortar’ businesses that provide information about their products online) list many products, but those listings often lack the flexibility needed for the product data to be analysed thoroughly. A spreadsheet of product data could include information such as the product name, product code/number, product category, price, description, product webpage URL, and the parent company or brand. Scraping multiple sites could make it much easier to compare prices and could give an indication of pricing variability. Scraped product/pricing lists could be used for business intelligence (e.g. gaining information about a competitor’s prices) or procurement.
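For the business directory example, combining several scraped directories into one list with duplicates removed could be sketched as follows. The matching rule used here (same name after lower-casing and stripping punctuation, plus the same phone digits) is only an assumption for illustration, as are the `name` and `phone` field names and the functions `normalise` and `merge_directories`; real listings often need fuzzier matching.

```python
import re


def normalise(record):
    """Build a comparison key from a listing's name and phone number."""
    name = re.sub(r"[^a-z0-9]+", " ", record["name"].lower()).strip()
    phone = re.sub(r"\D", "", record.get("phone", ""))  # keep digits only
    return (name, phone)


def merge_directories(*directories):
    """Combine several lists of listings, keeping the first copy of each duplicate."""
    merged = {}
    for directory in directories:
        for record in directory:
            merged.setdefault(normalise(record), record)
    return list(merged.values())
```

With this rule, "Acme Ltd." with phone "(09) 555-0100" in one directory and "ACME LTD" with phone "09 555 0100" in another would be treated as the same business.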
After the data has been extracted, further programming might be performed to enrich the data set, e.g. adding a binary variable stating whether or not an entry contains a particular keyword.
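A keyword indicator of this kind could be added with a few lines of Python. The sketch below assumes each entry is a dictionary with a `description` field; that field name, the `has_<keyword>` column name and the function `add_keyword_flag` are hypothetical choices for this example.

```python
def add_keyword_flag(rows, keyword, text_field="description", flag_field=None):
    """Add a 0/1 column stating whether each entry's text contains the keyword."""
    flag_field = flag_field or f"has_{keyword.lower()}"
    needle = keyword.lower()
    for row in rows:
        text = row.get(text_field, "").lower()
        row[flag_field] = 1 if needle in text else 0
    return rows
```

The match is a case-insensitive substring test, so searching for "organic" would also flag "inorganic"; a stricter whole-word match may be preferable depending on the data.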
See the Statistical Programming / Data Processing Services page for more information.
Automated File Downloading
Some websites may have a large number of files available for downloading. To download all of the files, it may be more efficient to automate the process than to download the files manually.
The downloading could be automated with a program that takes the following steps:
- Downloads the HTML code of the webpages where the files are stored.
- Searches the HTML code and records the URLs where the files are stored.
- Downloads the files.
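A minimal sketch of these steps, using only the Python standard library, is shown below. It assumes the files are linked with ordinary `<a href="...">` tags and can be recognised by their extension; the extension list and the names `FileLinkParser`, `find_file_urls` and `download_files` are assumptions made for this example.

```python
import os
import shutil
import urllib.request
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class FileLinkParser(HTMLParser):
    """Records links whose URL path ends with one of the given extensions."""

    def __init__(self, base_url, extensions):
        super().__init__()
        self.base_url = base_url
        self.extensions = tuple(e.lower() for e in extensions)
        self.file_urls = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if not href:
            return
        url = urljoin(self.base_url, href)  # resolve relative links
        path = urlparse(url).path.lower()   # ignore any ?query part
        if path.endswith(self.extensions) and url not in self.file_urls:
            self.file_urls.append(url)


def find_file_urls(html, base_url, extensions=(".pdf", ".csv", ".xls", ".xlsx")):
    """Search the HTML code and record the URLs where the files are stored."""
    parser = FileLinkParser(base_url, extensions)
    parser.feed(html)
    parser.close()
    return parser.file_urls


def download_files(file_urls, target_dir):
    """Download each file into target_dir, named after the last part of its URL path."""
    os.makedirs(target_dir, exist_ok=True)
    for url in file_urls:
        filename = os.path.basename(urlparse(url).path)
        with urllib.request.urlopen(url) as response, \
                open(os.path.join(target_dir, filename), "wb") as out:
            shutil.copyfileobj(response, out)
```

Files that share the same name on different pages would overwrite one another in this sketch, so a real program may need to add a prefix or subfolder per source page.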