Identifying and interpreting context on the Web : an application-driven approach
Citation:
Daniel Kelleher, 'Identifying and interpreting context on the Web : an application-driven approach', [thesis], Trinity College (Dublin, Ireland). School of Computer Science & Statistics, 2008, pp 217
Abstract:
This work describes the use of contextual information in Web document processing. Contextual
information is defined as the contents of a 'context set' of a document of interest on the Web.
The context set of a document is determined by the hyperlink structure of the Web around the
document of interest. This thesis suggests that, as in other text media, contextual information
on the Web is a vital component of the information content of a document, and should be taken
into account when interpreting or processing that information. Most existing hypertext document
processing applications either ignore hyperlinks, or use them in a restrictive manner, to identify a
particular form of contextual information. For example, many information retrieval applications
use hyperlinks as indicators of document prestige or authority, conferred by the referring document
on the referred document. Others use them to locate (or help to locate) similar content in order
to augment a document index or provide additional information about a document's relevance. A
related method is the clustering of Web documents, using the Web hyperlink structure to identify
clusters of related documents, either to generate an aggregate that can be used in document indexing, or to simplify a visual representation of the graph structure in order to aid browsing. These
approaches typically apply the information provided by hyperlinks to a particular application, such
as information retrieval or Web browsing. In contrast, this thesis proposes a flexible method for
the inclusion of hypertext contextual information in a number of document processing applications,
based on an adaptation of a term weighting measure that uses the frequency of a term in a set of documents. The resultant non-linear measure can be incorporated into Web applications using a probabilistic model trained on pre-annotated data suitable for the domain of the application.
In this work, the measure is implemented and evaluated on a number of Web document content processing applications. Specifically, three applications are presented: an adaptation of an existing automatic keyphrase extraction application, a Web document retrieval ranking algorithm, and a document collection homogeneity measure with a related homogeneous corpus generation
application.
Author: Kelleher, Daniel
Advisor: Luz, Saturnino
Qualification name: Doctor of Philosophy (Ph.D.)
Publisher: Trinity College (Dublin, Ireland). School of Computer Science & Statistics
Type of material: thesis
Availability: Full text available