Statistical Consultants Ltd

Web Scraping Services

Web scraping (also known as web harvesting or web data extraction) involves the automated extraction of data (numbers or text) or files from websites. This usually involves writing a program which reads a website's HTML code, and then extracts the relevant information from that code.

Data Extraction

[Diagram: webpages → webpages' HTML code → files]

Pulling data from websites and placing it into spreadsheets can be a very time-consuming task if done manually for large amounts of data. Statistical Consultants Ltd can provide web scraping services which greatly speed up such tasks.

This would involve writing a program which takes the following steps:

1. Downloads the HTML code of a webpage.
2. Picks the relevant information from the HTML code.
3. Repeats steps 1 and 2, if there are other webpages.
4. Saves the relevant information into a spreadsheet (or text file).
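
As an illustration only, the steps above could be sketched in Python using just the standard library. The choice of fields (the page title and the first-level heading) and the names `download_html`, `extract_fields` and `scrape` are assumptions made for this example; a real job would pick whatever information the client needs from each page.

```python
import csv
import urllib.request
from html.parser import HTMLParser


class TitleAndHeadingParser(HTMLParser):
    """Collects the text inside <title> and <h1> (an illustrative choice of fields)."""

    def __init__(self):
        super().__init__()
        self._current = None
        self.fields = {"title": "", "h1": ""}

    def handle_starttag(self, tag, attrs):
        if tag in self.fields:
            self._current = tag

    def handle_endtag(self, tag):
        if tag == self._current:
            self._current = None

    def handle_data(self, data):
        if self._current:
            self.fields[self._current] += data


def download_html(url):
    # Step 1: download the HTML code of a webpage.
    with urllib.request.urlopen(url, timeout=30) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def extract_fields(html):
    # Step 2: pick the relevant information from the HTML code.
    parser = TitleAndHeadingParser()
    parser.feed(html)
    parser.close()
    # Collapse runs of whitespace so each value fits on one spreadsheet row.
    return {name: " ".join(text.split()) for name, text in parser.fields.items()}


def scrape(urls, out_file, fetch=download_html):
    # Step 3: repeat steps 1 and 2 for every webpage.
    # Step 4: save the relevant information as CSV (readable by spreadsheet software).
    writer = csv.DictWriter(out_file, fieldnames=["url", "title", "h1"])
    writer.writeheader()
    for url in urls:
        row = extract_fields(fetch(url))
        row["url"] = url
        writer.writerow(row)
```

To write a file, the function could be called as `scrape(list_of_urls, open("output.csv", "w", newline="", encoding="utf-8"))`, where `list_of_urls` and the file name are placeholders. The `fetch` parameter is there so that the extraction logic can be tried on saved HTML without touching the network.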

Sometimes the program has several stages, e.g. the first stage finds the URLs of the webpages on a website, and the second stage extracts the information from those webpages.
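
A first stage of this kind might look like the following Python sketch, which collects the links found on an index page so that a second stage can visit each one. The names `LinkParser` and `find_page_urls` are hypothetical, and the sketch assumes the pages of interest are reachable through ordinary `<a href="...">` links.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkParser(HTMLParser):
    """Records the absolute URL of every <a href="..."> link, without duplicates."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if not href:
            return
        # Relative links are resolved against the URL of the index page.
        absolute = urljoin(self.base_url, href)
        if absolute not in self.links:
            self.links.append(absolute)


def find_page_urls(index_html, base_url):
    # Stage one: find the URLs of the webpages to be scraped.
    parser = LinkParser(base_url)
    parser.feed(index_html)
    parser.close()
    return parser.links
```

In practice the returned list would usually be filtered (for example, keeping only URLs that match the pattern of a listing page) before being handed to the extraction stage.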

Some website scraping examples:

Business Directories – It may be desirable to extract data from one or more business directory sites, and store the data in a spreadsheet. If the data is stored as a spreadsheet, it would be easier to sort and append notes to (e.g. which businesses have been considered, contacted, classified etc). A business listings spreadsheet would usually have one business per row, and the types of information about the businesses separated by columns (information such as company name, phone number, fax, address, website, description, and type). If more than one directory is scraped, it may be possible to create a super list (with duplicates removed).
Ratings and Reviews Data – Some websites contain many ratings and reviews. Ratings and reviews data has its uses in academic and market research. Storing the data in a spreadsheet makes it easier to analyse with statistical software (or simply to sort or summarise within the spreadsheet itself).
Price and Product Lists – Many online stores (and ‘bricks-and-mortar’ businesses that provide information about their products online) have many products listed, but those listings often lack the flexibility for the products data to be analysed thoroughly. A spreadsheet of products data could include information such as the product name, product code/number, product category, price, description, product webpage URL, the parent company or brand etc. The scraping of multiple sites could make it much easier to make price comparisons and give an indication of pricing variability. Scraped product/pricing lists could be used for business intelligence (e.g. gaining information about a competitor’s prices) or procurement.
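
For the business directory case, the "super list" with duplicates removed could be built along the lines of the Python sketch below. It assumes each scraped business is a dictionary with `name` and `phone` keys, and it treats two entries as the same business when their normalised name and phone digits match. Both the field names and the matching rule are assumptions for illustration; real directories often need fuzzier matching.

```python
def normalise_key(business):
    # Lower-case the name and collapse whitespace; keep only the digits of the phone number.
    name = " ".join(business["name"].lower().split())
    phone = "".join(ch for ch in business.get("phone", "") if ch.isdigit())
    return (name, phone)


def merge_directories(*directories):
    # Keep the first occurrence of each business, in the order encountered.
    merged = {}
    for directory in directories:
        for business in directory:
            merged.setdefault(normalise_key(business), business)
    return list(merged.values())


# Example with made-up entries from two hypothetical directories.
directory_a = [{"name": "Acme Ltd", "phone": "09 123 4567"}]
directory_b = [
    {"name": "acme  ltd", "phone": "(09) 123-4567"},
    {"name": "Beta Co", "phone": "555"},
]
super_list = merge_directories(directory_a, directory_b)
```

Here `super_list` holds two businesses, because the two Acme entries share the same normalised key.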

After the data has been extracted, further programming might be performed to enrich the data set, e.g. by adding a binary variable stating whether or not an entry contains a particular keyword.
See the Statistical Programming / Data Processing Services page for more information.
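
The keyword variable mentioned above could be added with a few lines of Python. This is a minimal sketch that assumes the scraped rows are dictionaries; the function name `add_keyword_flag` and the column names in the example are hypothetical.

```python
def add_keyword_flag(rows, text_field, keyword, flag_name):
    # Returns new rows with an extra 0/1 column: 1 if the keyword occurs
    # in the chosen text field (case-insensitive), otherwise 0.
    needle = keyword.lower()
    return [
        {**row, flag_name: int(needle in row.get(text_field, "").lower())}
        for row in rows
    ]


# Example with made-up rows.
rows = [
    {"name": "A", "description": "Organic coffee roasters"},
    {"name": "B", "description": "Tea merchants"},
]
flagged = add_keyword_flag(rows, "description", "coffee", "mentions_coffee")
```

Note that this is a plain substring match, so a keyword such as "tea" would also match "steam"; whole-word matching would need a regular expression or tokenisation.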

Automated File Downloading

Some websites may have a large number of files available for downloading. To download all of the files, it may be more efficient to automate the process rather than download the files manually.

The downloading could be automated by writing a program which takes the following steps:

1. Downloads the HTML code of the webpages where the files are stored.
2. Searches the HTML code, and records the URLs where the files are stored.
3. Downloads the files.
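
A rough Python sketch of these steps, using only the standard library, is given below. It assumes the files are linked through ordinary `<a href="...">` tags and can be recognised by their file extension; the function names and the default list of extensions are illustrative assumptions rather than a fixed method.

```python
import os
import urllib.request
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class FileLinkParser(HTMLParser):
    """Records links whose path ends with one of the wanted file extensions."""

    def __init__(self, base_url, extensions):
        super().__init__()
        self.base_url = base_url
        self.extensions = tuple(ext.lower() for ext in extensions)
        self.file_urls = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if not href:
            return
        url = urljoin(self.base_url, href)
        # Compare against the path only, so query strings do not hide the extension.
        is_file = urlparse(url).path.lower().endswith(self.extensions)
        if is_file and url not in self.file_urls:
            self.file_urls.append(url)


def download_html(url):
    # Download the HTML code of a webpage where the files are listed.
    with urllib.request.urlopen(url, timeout=30) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def find_file_urls(html, base_url, extensions=(".pdf", ".csv", ".xls", ".xlsx")):
    # Search the HTML code and record the URLs where the files are stored.
    parser = FileLinkParser(base_url, extensions)
    parser.feed(html)
    parser.close()
    return parser.file_urls


def download_files(file_urls, target_dir):
    # Download each file into target_dir, named after the last part of its URL path.
    os.makedirs(target_dir, exist_ok=True)
    for url in file_urls:
        filename = os.path.basename(urlparse(url).path)
        urllib.request.urlretrieve(url, os.path.join(target_dir, filename))
```

Files that share the same name would overwrite one another in this sketch, so a real program would need a naming scheme that avoids collisions.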