collective.classification (0.1)

by Olha Pelishok last modified 2010-08-09
Released on 2010-08-05 by G. Gozadinos under the GPL (GNU General Public License). Available for all platforms.
Software development stage: alpha
Content classification/clustering through language processing

collective.classification aims to provide a set of tools for automatic document classification. Currently it makes use of the Natural Language Toolkit and features a trainable document classifier based on Part Of Speech (POS) tagging, heavily influenced by topia.termextract. This is not a finished product and is intended to be used for experimentation and development.

What is this all about?

It's mostly about having fun! The package is at a very early experimental stage and eagerly awaits contributions. The tests give a good picture of what works and what does not. You might also be able to do some useful things with it: on a large site with a lot of content and tags (or subjects, in Plone lingo), it can be difficult to assign tags to new content. In this case, a trained classifier could provide useful suggestions to an editor responsible for tagging content.

How does it work?

At the moment the following types of utilities exist:

  • POS taggers: utilities for classifying the words in a document as Parts Of Speech. Two are provided at the moment, a Penn TreeBank tagger and a trigram tagger. Both can be trained on a language other than English, which is what we do here.
  • Term extractors: utilities responsible for extracting the important terms from a document. The extractor we use here assumes that only nouns matter and uses a POS tagger to find the nouns that occur most often in a document. For details, please look at the code and the tests.
  • Content classifiers: utilities that can tag content with predefined categories. Here, a naive Bayes classifier is used. The classifier looks at already tagged content, performs term extraction, and trains itself using the terms and tags as input. Then, for new content, it suggests tags according to the terms extracted from that content.
  • Clusterers: utilities that can group content by feature similarity without prior knowledge of how the content is classified. At the moment NLTK's k-means clusterer is used.
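To make the term extraction and classification steps more concrete, here is a minimal, standard-library-only sketch of the idea. It is not the package's API: the names extract_terms and NaiveBayesTagger are hypothetical, and a tiny hand-made noun list stands in for a trained POS tagger. It assumes a multinomial naive Bayes model with add-one smoothing, which may differ from what the package and NLTK actually do.

```python
import math
import re
from collections import Counter, defaultdict

# Hypothetical stand-in for a POS tagger: a tiny hand-made noun lexicon.
# A trained tagger would decide word by word what counts as a noun.
NOUNS = {"site", "content", "tags", "editor", "football", "match",
         "team", "goal", "server", "database"}


def extract_terms(text, min_count=1):
    """Return the nouns found in `text`, mapped to their frequencies."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(w for w in words if w in NOUNS)
    return {term: n for term, n in counts.items() if n >= min_count}


class NaiveBayesTagger:
    """Multinomial naive Bayes with add-one smoothing over extracted terms."""

    def __init__(self):
        self.tag_docs = Counter()                # tag assignments seen per tag
        self.term_counts = defaultdict(Counter)  # term frequencies per tag
        self.vocabulary = set()                  # all terms seen in training

    def train(self, text, tags):
        """Learn from one piece of already tagged content."""
        terms = extract_terms(text)
        for tag in tags:
            self.tag_docs[tag] += 1
            self.term_counts[tag].update(terms)
        self.vocabulary.update(terms)

    def suggest(self, text):
        """Return the known tags, ordered from most to least likely."""
        terms = extract_terms(text)
        total = sum(self.tag_docs.values())
        scores = {}
        for tag, n_docs in self.tag_docs.items():
            score = math.log(n_docs / total)  # log prior of the tag
            denom = sum(self.term_counts[tag].values()) + len(self.vocabulary)
            for term, n in terms.items():
                # smoothed log likelihood of the term given the tag
                score += n * math.log((self.term_counts[tag][term] + 1) / denom)
            scores[tag] = score
        return sorted(scores, key=scores.get, reverse=True)


tagger = NaiveBayesTagger()
tagger.train("The team scored a goal late in the football match.", ["sport"])
tagger.train("The server stores site content in a database.", ["it"])
suggestions = tagger.suggest("A new database server for the site.")
```

In this toy run the new text shares its nouns only with the second training document, so "it" is ranked ahead of "sport". An editor would see such a ranking as tag suggestions rather than as a final decision.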