Characterisation and Clustering of the Blogosphere

In recent years we have seen unprecedented growth in the amount of information on the World
Wide Web. A significant proportion of this new content can be described as "user
generated", i.e. it is contributed by novice users who use "Web 2.0" applications designed
to allow easy publication of information with little or no technical knowledge needed.

In our work, we focus on the phenomenon of the blogosphere - the collective term for
the 'public diaries' published by web users. The texts contained in weblogs, or 'blogs',
differ considerably from those found in other web pages in that they contain the personal
thoughts and perspectives of the author, presented in a completely unmediated
way. Blogs are presented in a reverse chronological sequence, with each entry being
timestamped. Furthermore, the blogosphere contains a large number of communities of
people with similar interests, who express and comment on each other's opinions.

With these new developments come great challenges and opportunities. Never before have such volumes of information, containing the personal perspectives of millions of people, been available in machine-processable form. New techniques and approaches are needed in order to manage and exploit this information.

Our interests concern the development of techniques which can improve the usability
of the information contained in the blogosphere by assigning the blog texts to groupings
which contain similar content. As the labels and number of such groups are not known in
advance, this task is known as clustering. Clustering of short texts is a difficult task in
general, and is made more challenging still by the informal nature of the writing that is
found in blogs.
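
As a rough illustration of grouping texts when the number of groups is not fixed beforehand, the sketch below uses a simple single-pass ("leader") clustering over word-count vectors with cosine similarity. This is not the method used in our work; the function names, the sample posts and the similarity threshold of 0.3 are all assumptions chosen for the example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lower-case a text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors (Counters)."""
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def leader_cluster(texts, threshold=0.3):
    """Assign each text to its most similar cluster, or start a new one.

    The number of clusters is not given in advance; it emerges from the
    similarity threshold (a hypothetical tuning parameter).
    """
    centroids = []  # one summed term-count vector per cluster
    labels = []     # cluster index for each input text
    for text in texts:
        vec = Counter(tokenize(text))
        best, best_sim = None, threshold
        for idx, centroid in enumerate(centroids):
            sim = cosine(vec, centroid)
            if sim >= best_sim:
                best, best_sim = idx, sim
        if best is None:
            # Nothing is similar enough: this text founds a new cluster.
            centroids.append(Counter(vec))
            labels.append(len(centroids) - 1)
        else:
            # Fold the text's counts into the chosen cluster's vector.
            centroids[best].update(vec)
            labels.append(best)
    return labels


# Made-up sample posts, purely for illustration.
posts = [
    "my cat slept on the sofa all day",
    "the cat chased a mouse around the sofa",
    "election results announced after the vote count",
    "voters queue as the election vote begins",
]
labels = leader_cluster(posts)
print(labels)
```

With so few words per text, common words such as "the" account for much of the similarity, which hints at why short, informal texts are hard to cluster. A more realistic pipeline would probably need stop-word handling and term weighting at the very least.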

Additionally, we consider approaches by which we can characterise particular features of
blog texts, such as the scope of the domain and the style of writing, in an unsupervised
