Identifying and interpreting context on the Web : an application-driven approach

Kelleher, Daniel

This work describes the use of contextual information in Web document processing. Contextual information is defined as the contents of a 'context set' of a document of interest on the Web. The context set of a document is determined by the hyperlink structure of the Web around that document. This thesis suggests that, as in other text media, contextual information on the Web is a vital component of the information content of a document, and should be taken into account when interpreting or processing that information. Most existing hypertext document processing applications either ignore hyperlinks or use them restrictively, to identify one particular form of contextual information. For example, many information retrieval applications use hyperlinks as indicators of document prestige or authority, conferred by the referring document on the referred document. Others use them to locate (or help to locate) similar content in order to augment a document index or provide additional information about a document's relevance. A related method is the clustering of Web documents, which uses the Web hyperlink structure to identify clusters of related documents, either to generate an aggregate that can be used in document indexing, or to simplify a visual representation of the graph structure in order to aid browsing. These approaches typically apply the information provided by hyperlinks to a single application, such as information retrieval or Web browsing. In contrast, this thesis proposes a flexible method for including hypertext contextual information in a number of document processing applications. The method adapts a term weighting measure derived from the frequency of a term in a set of documents. The resultant non-linear measure can be incorporated into Web applications using a probabilistic model trained on pre-annotated data suitable for the domain of the application.
In this work, the measure is implemented and evaluated on a number of Web document content processing applications. Specifically, three applications are presented: an adaptation of an existing automatic keyphrase extraction application, a Web document retrieval ranking algorithm, and a document collection homogeneity measure with a related homogeneous corpus generation application.
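
The idea of a context set and a frequency-based term weight can be illustrated with a minimal Python sketch. It is not the measure defined in the thesis, whose exact form the abstract does not give. The sketch assumes that the context set consists of documents one hyperlink away (in either direction) and that the weight is a log-scaled share of context documents containing the term. The names `context_set`, `context_weights`, `links` and `texts` are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter


def context_set(doc, links):
    """Documents one hyperlink away from `doc` (assumed definition)."""
    # Out-links: documents that `doc` refers to.
    neighbours = set(links.get(doc, ()))
    # In-links: documents that refer to `doc`.
    neighbours.update(src for src, targets in links.items() if doc in targets)
    neighbours.discard(doc)
    return neighbours


def context_weights(doc, links, texts):
    """Weight each term of `doc` by how many context documents contain it.

    The log scaling is an illustrative non-linear choice, not the
    thesis's formula. Weights fall in the range [0, 1].
    """
    ctx = context_set(doc, links)
    if not ctx:
        return {}
    # Document frequency of each term within the context set.
    df = Counter()
    for other in ctx:
        df.update(set(texts[other].lower().split()))
    terms = set(texts[doc].lower().split())
    return {t: math.log(1 + df[t]) / math.log(1 + len(ctx)) for t in terms}


# Toy link graph and document texts (invented for this example).
links = {"a": ["b", "c"], "d": ["a"]}
texts = {
    "a": "web context",
    "b": "web links",
    "c": "web graph context",
    "d": "graph theory",
}
weights = context_weights("a", links, texts)
```

In this toy graph, "web" occurs in two of the three context documents of `a` and "context" in one, so "web" receives the higher weight. A real application would replace the whitespace tokeniser and would, as the abstract notes, combine the measure with a probabilistic model trained on annotated data.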
