Scraping Web Data

Appian Analytics Is Your Outsourced Solution

Sometimes valuable information is available on web sites that do not have a download feature. Appian Analytics can help you get information from a website in high volume, across multiple pages, and/or through an automated process. Web scraping, or "spidering," a web site can be done for you on a one-time or ongoing basis as needed.

There is a range of applications for this type of approach. When valuable data is stored in web pages at one or more web sites, automated scraping can be a vastly superior alternative to time-consuming and error-prone data entry. These web pages might contain data that you want to include in an analysis, spreadsheet, or database. Getting the data into a correctly formatted spreadsheet, database, or flat file can be time consuming, and clients often resort to manually inputting the data.

Appian Analytics can create the database for you by scraping the web sites and parsing the data into data fields. The type of data that can be scraped includes tabular and formatted data but can also include unformatted data. Sometimes the data might be stored across a number of web pages or in an irregular format. Regardless, there are ways to access that data, and to do so one time or every week, every month, or at whatever interval your project requires.

Web Scraping Examples:

  • Extract Tabular Data - This is the easiest scenario, where data is formatted on screen in a tabular or spreadsheet-like format. In these cases, automated software can query one or more web pages as needed and generate a data file that you can use. What's more, that data file can be further formatted to include label or text alterations, data grouping, calculations, sentiment scoring and more.
  • Text Data - There is a wide range of needs to extract text comments for high volume analysis. The comments might come from a product or service review site, a blog, or many other types of sites. In addition to scraping text from websites, you can also reformat, categorize, and tag those comments according to whatever methodology you need. Additionally, there are ways to automatically categorize this type of data using Natural Language Processing software, which can assign sentiment or high level categorical attributes.
  • Competitive Product Pricing and Packaging - A very important application of web scraping is tracking competitive product pricing and packaging that is advertised on web sites. By automatically monitoring the data related to competitive product offerings, your business can keep tabs on your own product or service positioning in the marketplace. Also, scraping and storing this competitive pricing and product or service data lets you track changes over time and better understand the direction of the market, or how competitive pricing, bundling, and positioning change throughout a calendar year.
  • For Sale Listings - Another common application of web site scraping is to record sales listings for the wide range of products that are now sold online. This can be done to better understand pricing for a product or service or to track how pricing changes over time for certain products. Changes in price can be a key indicator of product or service demand in the marketplace as well as of the discount strategy of retailers and other sellers.
  • Financial Data - There is an enormous wealth of financial data available online. While much of this data CAN be downloaded and used in a spreadsheet, a huge amount of data is still not available for simple download. Additionally, while data MAY be available for download, it may not be in a format that can be easily integrated into a database and then used for high volume analysis. These issues and a host of others are a prime reason to seek outsourced help in this highly specialized field.
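As a rough illustration of the pricing and listing examples above, the sketch below compares stored price snapshots and reports what changed between consecutive dates. It is a minimal sketch only: the function name `price_changes`, the snapshot layout, and the products and prices are all hypothetical, and it assumes the prices have already been scraped and stored.

```python
from datetime import date

def price_changes(snapshots):
    """List price changes between consecutive dated snapshots.

    `snapshots` is a list of (date, {product: price}) pairs. This layout
    is an assumption made for the example, not a fixed format.
    """
    ordered = sorted(snapshots, key=lambda snap: snap[0])
    changes = []
    # Walk through each pair of neighbouring snapshots in date order.
    for (_, earlier), (later_date, later) in zip(ordered, ordered[1:]):
        # Only products present in both snapshots can be compared.
        for product in sorted(earlier.keys() & later.keys()):
            old, new = earlier[product], later[product]
            if old and old != new:
                pct = round((new - old) / old * 100, 1)
                changes.append((product, later_date, old, new, pct))
    return changes

# Made-up sample data standing in for three monthly scrapes.
SNAPSHOTS = [
    (date(2024, 1, 1), {"Widget A": 20.00, "Widget B": 15.00}),
    (date(2024, 2, 1), {"Widget A": 18.00, "Widget B": 15.00}),
    (date(2024, 3, 1), {"Widget A": 18.00, "Widget B": 16.50}),
]

if __name__ == "__main__":
    for product, when, old, new, pct in price_changes(SNAPSHOTS):
        print(f"{when} {product}: {old:.2f} -> {new:.2f} ({pct:+.1f}%)")
```

Storing each scrape with its date is what makes this kind of comparison possible later, so a recurring scrape would typically append a new snapshot rather than overwrite the previous one.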

Keep in mind that not ONLY simple, single web sites can be monitored and scraped or mined for information; very large, multi-page websites and data from a large variety of websites can be scraped as well. This process can create a very easy to use spreadsheet or database for your analysis or information gathering project.

Example Screen Shots of Web Site Data Scraping:

Tabular Census Bureau Data

Web site data that is already formatted in a table on the website is the easiest data to scrape. The tabular structure provides the organizational elements needed to effectively categorize the data once it's scraped from the site. Even so, it's important to understand that if the data is contained on multiple pages, it must be "fit" together properly for use in a spreadsheet or database. As a result, simply scraping the data itself is only the first step in making the data usable for analytic applications, reporting, and other decision making.
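To make the idea of "fitting" pages together more concrete, here is a minimal sketch that reads an HTML table from each page and combines the rows under one shared header. It uses only Python's standard `html.parser` module. The class and function names, the sample pages, and the figures in them are hypothetical, and the sketch assumes one simple, non-nested table per page whose first row is the header.

```python
from html.parser import HTMLParser

class TableParser(HTMLParser):
    """Collect the cell text of every table row on a page."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the current cell

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None

def parse_table(html):
    parser = TableParser()
    parser.feed(html)
    parser.close()
    return parser.rows

def combine_pages(pages):
    """Merge rows from several pages, keeping a single header row."""
    header, records = None, []
    for html in pages:
        rows = parse_table(html)
        if not rows:
            continue
        if header is None:
            header = rows[0]
        elif rows[0] != header:
            # Pages with different columns cannot simply be stacked.
            raise ValueError("page header does not match the first page")
        records.extend(rows[1:])
    return header, records

# Made-up pages standing in for two pages of a paginated table.
PAGE_1 = ("<table><tr><th>Area</th><th>Population</th></tr>"
          "<tr><td>County A</td><td>1,000</td></tr></table>")
PAGE_2 = ("<table><tr><th>Area</th><th>Population</th></tr>"
          "<tr><td>County B</td><td>2,500</td></tr></table>")

if __name__ == "__main__":
    header, records = combine_pages([PAGE_1, PAGE_2])
    print(header)
    print(records)
```

Checking that every page repeats the same header is one simple way to catch a page whose layout differs before its rows are mixed into the combined file.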

Additionally, you may require some type of data reformatting, text labeling, data organization, or other calculations or analysis on the data. These processes can be performed in addition to the web scrape itself and therefore give you a simpler dataset to use in the application of your choice.
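The kind of follow-up processing described above might look something like the sketch below, which relabels columns, converts text such as "1,000" into numbers, and totals a value by group. Everything in it is illustrative: the `LABELS` mapping, the function names, and the sample rows are hypothetical, and it assumes the scraped values arrive as plain strings.

```python
from collections import defaultdict

# Hypothetical mapping from on-screen column labels to field names.
LABELS = {"Area": "area_name", "Region": "region", "Population": "population"}

def clean_rows(header, rows):
    """Relabel columns and turn the population text into an integer."""
    keys = [LABELS.get(label, label) for label in header]
    cleaned = []
    for row in rows:
        record = dict(zip(keys, row))
        # Scraped numbers are text; drop thousands separators first.
        record["population"] = int(record["population"].replace(",", ""))
        cleaned.append(record)
    return cleaned

def total_by(records, group_key, value_key):
    """Sum one numeric field for each distinct value of another field."""
    totals = defaultdict(int)
    for record in records:
        totals[record[group_key]] += record[value_key]
    return dict(totals)

# Made-up rows standing in for scraped table data.
HEADER = ["Area", "Region", "Population"]
ROWS = [
    ["County A", "North", "1,000"],
    ["County B", "North", "2,500"],
    ["County C", "South", "700"],
]

if __name__ == "__main__":
    records = clean_rows(HEADER, ROWS)
    print(total_by(records, "region", "population"))
```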

Example Print Image - Government Subcontracting Opportunities

A print image is a report file that is created for presentation and printing. Often, older systems, especially mainframes, will output these types of reports. Generally, they are created using ASCII characters laid out in the file so that it looks exactly as it will print. This was the original way of creating reports for distribution on paper.

The issue is that there may be valuable information stored in these print image files that is not easy to analyze in separate spreadsheets and databases. Often, to accomplish an analysis with this data, the user needs to type the data from the print file into a spreadsheet or database before being able to perform any work.
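Because a print image lines its values up in fixed character positions, those positions can often be used to pull the values out automatically instead of retyping them. The sketch below is one possible approach, assuming the report has a ruler line of dashes under its column headings. The sample report, the column names, and the function name `parse_print_image` are all made up for illustration.

```python
import re

# Made-up print image; real reports will have their own layout.
REPORT = """\
SUBCONTRACTING OPPORTUNITIES REPORT

AGENCY      PROJECT                 AMOUNT
----------  --------------------  --------
Agency One  Road resurfacing        125000
Agency Two  Network upgrade          48500
"""

def parse_print_image(text, columns):
    """Slice each data line at the positions marked by the dash ruler."""
    lines = text.splitlines()
    # The ruler is the first non-empty line made only of dashes and spaces.
    ruler_index = next(
        i for i, line in enumerate(lines)
        if line.strip() and set(line) <= {"-", " "}
    )
    # Each run of dashes marks the start and end of one column.
    spans = [m.span() for m in re.finditer(r"-+", lines[ruler_index])]
    if len(spans) != len(columns):
        raise ValueError("column names do not match the ruler line")
    records = []
    for line in lines[ruler_index + 1:]:
        if not line.strip():
            continue  # skip blank lines between data rows
        values = [line[start:end].strip() for start, end in spans]
        records.append(dict(zip(columns, values)))
    return records

if __name__ == "__main__":
    for record in parse_print_image(REPORT, ["agency", "project", "amount"]):
        print(record)
```

Reports that repeat page headers or totals lines in the middle of the data would need extra rules to skip those lines, which this sketch does not attempt.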

What is Unstructured Content:

Unstructured content is a common name for information or data that lacks a consistent and systematic organization. This information may be text comments such as product reviews, service reviews, product ratings, or even blog entries and comments. Additionally, unstructured content could be actual data that is just not organized clearly in table structures that are easy to move into a spreadsheet or database table.

Without consistent and systematic organization, the information or data cannot be analyzed effectively with common data analysis tools such as spreadsheets, databases, and other analytical applications. Therefore, to gain maximum value from unstructured content, you need to either use special text mining and analysis tools or extract the content and transform it into a structure that will facilitate analysis.
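As a simple illustration of transforming unstructured content into a structure, the sketch below turns free-text review comments into records with a rating and topic tags. It relies on plain pattern and keyword matching rather than a text mining or Natural Language Processing tool, and the rating pattern, the `TOPICS` keyword list, and the sample comments are hypothetical.

```python
import re

# Assumes ratings appear in the text as "N/5"; other sites will differ.
RATING = re.compile(r"(\d)\s*/\s*5")

# Hypothetical keyword-to-topic mapping used for tagging.
TOPICS = {"battery": "battery", "shipping": "delivery", "screen": "display"}

def structure_review(text):
    """Turn one free-text comment into a record with fixed fields."""
    match = RATING.search(text)
    lowered = text.lower()
    topics = sorted({topic for word, topic in TOPICS.items() if word in lowered})
    return {
        "rating": int(match.group(1)) if match else None,
        "topics": topics,
        "text": text,
    }

# Made-up comments standing in for scraped reviews.
REVIEWS = [
    "Great battery life, but shipping was slow. 4/5 stars",
    "Screen cracked after a week. 1/5 stars. Would not recommend.",
]

if __name__ == "__main__":
    for review in REVIEWS:
        print(structure_review(review))
```

Once each comment is a record with the same fields, it can be loaded into a spreadsheet or database table and analyzed like any other structured data.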

Stop wasting time on data and focus on your business!

Outsourcing to Appian Analytics is a snap

Copyright © Appian Analytics. All rights reserved.