Extending WEKA

August 22, 2011 by

Foreword

This article is part of John Salatas' BSc thesis, “Implementation of Artificial Neural Networks and Applications in Foreign Exchange Time Series Analysis and Forecasting” (Greek text), completed in May 2011 under the supervision of Asst. Prof. C. N. Anagnostopoulos (Cultural Technology and Communication Dpt., University of the Aegean).

1. Introduction to WEKA

The WEKA environment (Waikato Environment for Knowledge Analysis) [1] grew out of the need for an integrated computing environment that gives researchers easy access to many machine learning algorithms and also provides a programming environment in which they can implement new algorithms without having to deal with many programming details. WEKA already includes a large number of algorithms and is distributed under the GNU GPL v2 license. It is developed in the Java programming language and is available at WEKA’s website.

2. Extending WEKA

WEKA’s Application Programming Interface (API) makes it easy to embed WEKA in other Java applications and to extend it with new features. These may be additional machine learning algorithms, tools for data visualization, or even extensions of the Graphical User Interface (GUI) that support different workflows, such as the Time Series Analysis and Forecasting Environment [2].

These characteristics are complemented by the excellent technical support provided by the online user community through the relevant mailing list [3] and by the documentation available on the relevant wiki pages [4].

A good introduction to embedding WEKA in other applications can be found at [5], which describes the most basic and widely used components through a number of examples. The complete documentation (javadoc) of WEKA’s API can be found at [6].

2.1. 3rd Party Tools

Besides WEKA’s source code, in order to implement any extension, one may use a number of external 3rd party programming tools and libraries which are described below. The purpose of these tools is to automate many common programming tasks, such as unit testing or the source code’s management.

2.1.1. Unit Testing – The JUnit Library

For unit testing, WEKA uses the JUnit library. JUnit is a framework for creating automated test cases and is distributed for free under the Common Public License v1.0. A good introduction to writing and organizing these test cases is available at [7].

2.1.2. Source Code Management – Subversion

WEKA’s source code is also available through a software repository based on Apache Subversion. Apache Subversion enables developers to track modifications to the source code across the various stages of the software development cycle. This leads to more efficient collaboration between those involved in the development and thus to increased productivity, especially in large open source applications that involve many geographically distributed developers.

Apache Subversion is distributed for free under the Apache License version 2.0, and a good introduction to it is available at [8]. Finally, [9] gives brief instructions on how to use Apache Subversion with WEKA’s software repository.

2.1.3. Build Scripts – Apache Ant

The build process for a new WEKA extension package requires the Apache Ant build tool, which is also distributed for free, under the Apache License version 2.0. A good introduction to Apache Ant is available at [10].

2.1.4. Integrated Development Environments (IDE) – Netbeans and Eclipse

The tools described above are usually integrated with other programming tools (e.g. code editors and debuggers) in a single IDE. The two most popular IDEs for Java development are probably the following:

  • Netbeans, which is distributed for free under a dual license: Common Development and Distribution License (CDDL) v1.0 and GNU GPL v2.
  • Eclipse, which is also distributed for free under the Eclipse Public License (EPL) v1.0.

Both of these IDEs can be set up for the development of WEKA extensions according to the instructions provided at [11] and [12].

2.2. Implementation of new Classifiers and Clusterers

2.2.1. Implementation of new Classifier

All classifiers in WEKA should implement the interface weka.classifiers.Classifier. WEKA also provides a number of abstract classes that already implement this interface. These abstract classes, for version 3.7.3, are described in detail in [13]. The most basic of these are AbstractClassifier, which already implements several methods, and RandomizableClassifier, which inherits from AbstractClassifier and adds a parameter for the seed of a random number generator, as needed, for example, by classifiers that initialize random weights.

Properties

For an algorithm’s parameters to be accessible through WEKA’s Graphical User Interface (GUI), each parameter must be defined as a property, in conformance with the JavaBeans conventions [14], as follows [13]:

  • public void set<PropertyName>(<Type>) checks whether the supplied value is valid and only then updates the corresponding member variable. In any other case it should ignore the value and output a warning in the console or throw an IllegalArgumentException.
  • public <Type> get<PropertyName>() performs any necessary conversions of the internal value and returns it.
  • public String <propertyName>TipText() returns the help text that is available through the GUI. Should be the same as on the command-line. Note: everything after the first period “.” gets truncated from the tool tip that pops up in the GUI when hovering with the mouse cursor over the field in the GenericObjectEditor.
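The naming pattern above can be sketched in plain Java as follows. This is only an illustration: the class name LearningRateProperty, the parameter learningRate and its valid range are assumptions made for the example and do not come from WEKA. In a real classifier the three methods would be members of the classifier class itself.

```java
/** Hypothetical holder for one tunable parameter, following the naming pattern above. */
class LearningRateProperty {

    /** Internal member variable backing the property (hypothetical default). */
    private double learningRate = 0.3;

    /** Setter: validates first and only then updates the member variable. */
    public void setLearningRate(double value) {
        if (Double.isNaN(value) || value <= 0.0 || value > 1.0) {
            throw new IllegalArgumentException(
                "learningRate must be in (0, 1], got " + value);
        }
        learningRate = value;
    }

    /** Getter: returns the current value (no conversion is needed here). */
    public double getLearningRate() {
        return learningRate;
    }

    /** Tip text: the first sentence is the part that fits in the GUI tool tip. */
    public String learningRateTipText() {
        return "The step size used when updating weights. Must be in (0, 1].";
    }
}
```

Throwing an IllegalArgumentException is one of the two reactions to an invalid value mentioned above; printing a warning and ignoring the value would be the other.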

Furthermore, the following methods should be implemented for the algorithm’s parameters to be accessible through the command line [13]:

  • public String[] getOptions() which returns a string array of command-line options that resemble the current classifier setup. Supplying this array to the setOptions(String[]) method must result in the same configuration.
  • public Enumeration listOptions() returns a java.util.Enumeration of weka.core.Option objects. This enumeration is used to display the help on the command-line, hence it needs to return the Option objects of the superclass as well.
  • public void setOptions(String[] options) which parses the options that the classifier would receive from a command-line invocation. A parameter and  argument are always two elements in the string array.
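The round-trip requirement between getOptions() and setOptions(String[]) can be sketched as follows. The class OptionRoundTrip, the flags -L and -N and both parameters are hypothetical, and the parsing is done by hand so that the example runs without the WEKA library; WEKA itself offers helper methods for option parsing in weka.core.Utils.

```java
/** Hypothetical option holder; in WEKA these methods would live in the classifier itself. */
class OptionRoundTrip {

    private double learningRate = 0.3; // hypothetical parameter, flag -L
    private int epochs = 500;          // hypothetical parameter, flag -N

    /** Returns the current setup as flag/argument pairs. */
    public String[] getOptions() {
        return new String[] {
            "-L", Double.toString(learningRate),
            "-N", Integer.toString(epochs)
        };
    }

    /** Parses flag/argument pairs; flags that are absent leave the current values untouched. */
    public void setOptions(String[] options) {
        for (int i = 0; i + 1 < options.length; i += 2) {
            if (options[i].equals("-L")) {
                learningRate = Double.parseDouble(options[i + 1]);
            } else if (options[i].equals("-N")) {
                epochs = Integer.parseInt(options[i + 1]);
            } else {
                throw new IllegalArgumentException("Unknown option: " + options[i]);
            }
        }
    }

    public double getLearningRate() { return learningRate; }

    public int getEpochs() { return epochs; }
}
```

Feeding the array returned by getOptions() of one object into setOptions(String[]) of a fresh object should leave both with the same configuration, which is a convenient property to check in a unit test.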

Capabilities

The method public Capabilities getCapabilities() returns meta-information on what type of data the classifier can handle, in regards to attributes and class attributes.

Building the model

The method public void buildClassifier(Instances instances) builds the model from scratch with the provided dataset. Each subsequent call of this method must result in the same model being built. The buildClassifier method also tests whether the supplied data can be handled at all by the classifier, utilizing the capabilities returned by the getCapabilities() method:

public void buildClassifier(Instances data) throws Exception {
    // test data against capabilities
    getCapabilities().testWithFail(data);
    // remove instances with missing class value,
    // but don't modify original data
    data = new Instances(data);
    data.deleteWithMissingClass();
    // actual model generation
    // ...
}
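The requirement that repeated calls build the same model matters most for algorithms that use random numbers. A minimal sketch of one way to satisfy it, assuming a hypothetical class ReproducibleBuild that uses plain arrays instead of WEKA’s Instances, is to discard all previous state and re-create the random number generator from the seed on every call:

```java
import java.util.Random;

/** Hypothetical stand-in for a randomizable learner; plain arrays replace WEKA's Instances. */
class ReproducibleBuild {

    private int seed = 1;      // plays the role of the seed parameter
    private double[] weights;  // model state, rebuilt on every call

    public void setSeed(int value) { seed = value; }

    /** Rebuilds the model from scratch: old state is discarded and the generator re-seeded. */
    public void build(double[][] data) {
        if (data.length == 0) {
            throw new IllegalArgumentException("No training data");
        }
        Random random = new Random(seed);      // fresh generator on every call
        weights = new double[data[0].length];  // discard any previous model
        for (int i = 0; i < weights.length; i++) {
            weights[i] = random.nextDouble() - 0.5; // random initial weights
        }
        // actual training on 'data' would follow here
    }

    /** Returns a copy of the current weights. */
    public double[] getWeights() { return weights.clone(); }
}
```

Keeping the generator as a long-lived field that is never re-seeded would make the second call start from a different random state and therefore produce a different model.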

Instance classification

For the classification of an instance, one of the following two methods should be used [13]:

  • public double [] distributionForInstance(Instance instance) returns the class probabilities array of the prediction for the given weka.core.Instance object. If your classifier handles nominal class attributes, then you need to override this method.
  • public double classifyInstance(Instance instance) returns the classification or regression for the given weka.core.Instance object. In case of a nominal class attribute, this method returns the index of the class label that got predicted. You do not need to override this method in this case, as the weka.classifiers.AbstractClassifier superclass already determines the class label index based on the probabilities array that the distributionForInstance(Instance) method returns (it returns the index in the array with the highest probability; in case of ties the first one). For numeric class attributes, you need to override this method, as it has to return the regression value predicted by the model.
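The rule described above for nominal classes (highest probability wins, the first index wins on ties) can be sketched in a few lines. ArgMaxPrediction and classIndex are hypothetical names used only for this illustration; this is not WEKA’s own implementation.

```java
/** Hypothetical helper showing how a class index can be derived from a distribution. */
class ArgMaxPrediction {

    /** Returns the index of the largest probability; ties resolve to the lowest index. */
    static int classIndex(double[] distribution) {
        if (distribution.length == 0) {
            throw new IllegalArgumentException("Empty distribution");
        }
        int best = 0;
        for (int i = 1; i < distribution.length; i++) {
            if (distribution[i] > distribution[best]) { // strict '>' keeps the first on ties
                best = i;
            }
        }
        return best;
    }
}
```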

Other methods

Besides the above methods, there are several other methods that should be, or are highly recommended to be, implemented for every classifier:

  • public String toString() which is used for outputting the built model. This is not required, but it is useful for the user to see properties of the model. Decision trees normally output the tree, support vector machines the support vectors and rule-based classifiers the generated rules.
  • public static void main(String [] argv) executes the classifier from command-line. If your new algorithm is called MyClassifier, then use the following code as your main method:
    
    /**
    * Main method for executing this classifier.
    *
    * @param args the options, use "-h" to display options
    */
    public static void main(String[] args) {
        AbstractClassifier.runClassifier(new MyClassifier(), args);
    }

2.2.2. Implementation of new Clusterer

In general, the guidelines for implementing a new clusterer are similar to those described above for implementing a new classifier. All clusterers in WEKA should implement the interface weka.clusterers.Clusterer. WEKA also provides a number of abstract classes that already implement this interface. These abstract classes, for version 3.7.3, are described in detail in [13]. The most basic of these are AbstractClusterer, which already implements several methods, and RandomizableClusterer, which inherits from AbstractClusterer and adds a parameter for the seed of a random number generator, as needed, for example, by clusterers that require random initialization.

Properties

For an algorithm’s parameters to be accessible through WEKA’s Graphical User Interface (GUI), each parameter must be defined as a property, in conformance with the JavaBeans conventions [14], as follows [13]:

  • public void set<PropertyName>(<Type>) checks whether the supplied value is valid and only then updates the corresponding member variable. In any other case it should ignore the value and output a warning in the console or throw an IllegalArgumentException.
  • public <Type> get<PropertyName>() performs any necessary conversions of the internal value and returns it.
  • public String <propertyName>TipText() returns the help text that is available through the GUI. Should be the same as on the command-line. Note: everything after the first period “.” gets truncated from the tool tip that pops up in the GUI when hovering with the mouse cursor over the field in the GenericObjectEditor.

Furthermore, the following methods should be implemented for the algorithm’s parameters to be accessible through the command line [13]:

  • public String[] getOptions() which returns a string array of command-line options that resemble the current clusterer setup. Supplying this array to the setOptions(String[]) method must result in the same configuration.
  • public Enumeration listOptions() returns a java.util.Enumeration of weka.core.Option objects. This enumeration is used to display the help on the command-line, hence it needs to return the Option objects of the superclass as well.
  • public void setOptions(String[] options) which parses the options that the clusterer would receive from a command-line invocation. A parameter and argument are always two elements in the string array.

Capabilities

The method public Capabilities getCapabilities() returns meta-information on what type of data the clusterer can handle, in regards to attributes and class attributes.

Building the model

The method public void buildClusterer(Instances instances) builds the model from scratch with the provided dataset. Each subsequent call of this method must result in the same model being built. The buildClusterer method also tests whether the supplied data can be handled at all by the clusterer, utilizing the capabilities returned by the getCapabilities() method:

public void buildClusterer(Instances data) throws Exception {
    // test data against capabilities
    getCapabilities().testWithFail(data);
    // actual model generation
    // ...
}

Instance clustering

For the clustering of an instance, one of the following two methods should be used [13]:

  • public double [] distributionForInstance(Instance instance) returns the cluster membership for this weka.core.Instance object. The membership is a double array containing the probabilities for each cluster.
  • public int clusterInstance(Instance instance) returns the index of the cluster the provided Instance belongs to.
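The relationship between the two methods can be sketched with a simple centroid-based model. Everything here is an assumption made for illustration: the class CentroidMembership, the use of plain double arrays in place of weka.core.Instance, and the choice of normalized inverse squared distances as membership probabilities. A real clusterer would define its memberships according to its own model.

```java
/** Hypothetical centroid-based model; plain arrays replace WEKA's Instance objects. */
class CentroidMembership {

    private final double[][] centroids;

    CentroidMembership(double[][] centroids) {
        if (centroids.length == 0) {
            throw new IllegalArgumentException("No centroids");
        }
        this.centroids = centroids;
    }

    /** Membership probabilities: inverse squared distances, normalized to sum to 1. */
    double[] distributionForInstance(double[] x) {
        double[] result = new double[centroids.length];
        double sum = 0.0;
        for (int c = 0; c < centroids.length; c++) {
            double d2 = 0.0;
            for (int j = 0; j < x.length; j++) {
                double diff = x[j] - centroids[c][j];
                d2 += diff * diff;
            }
            result[c] = 1.0 / (d2 + 1e-9); // small constant avoids division by zero
            sum += result[c];
        }
        for (int c = 0; c < result.length; c++) {
            result[c] /= sum;
        }
        return result;
    }

    /** Index of the cluster with the highest membership probability. */
    int clusterInstance(double[] x) {
        double[] dist = distributionForInstance(x);
        int best = 0;
        for (int c = 1; c < dist.length; c++) {
            if (dist[c] > dist[best]) {
                best = c;
            }
        }
        return best;
    }
}
```

Deriving clusterInstance from distributionForInstance keeps the two answers consistent, so a model only has to define its membership probabilities once.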

Other methods

Besides the above methods, there are several other methods that should be, or are highly recommended to be, implemented for every clusterer:

  • public String toString() which should output some information on the generated model. Even though this is not required, it is rather useful for the user to get some feedback on the built model.
  • public static void main(String [] argv) executes the clusterer from command-line. If your new algorithm is called MyClusterer, then use the following code as your main method:
    
    /**
    * Main method for executing this clusterer.
    *
    * @param args the options, use "-h" to display options
    */
    public static void main(String[] args) {
        AbstractClusterer.runClusterer(new MyClusterer(), args);
    }

Finally, another method that must be implemented is public int numberOfClusters(), which should return the number of clusters that the model contains after it has been generated with the buildClusterer(Instances) method.

2.3. Anatomy of a Package

As of version 3.7.2, WEKA can automatically manage extensions that are available as packages. So, the easiest way to distribute an extension is to create a package, which is, briefly, a zip archive that contains all the resources required by the extension. A typical structure for a package is shown in the following image:

The Anatomy of a WEKA’s Package

Under the “src” folder are placed all the source code files, and under the “test” folder the source code for the unit tests. The “Description.props” file contains the metadata used by WEKA’s package manager. Finally, the file “build_package.xml” contains the package’s build script for Apache Ant [15].

3. Conclusion

This article provided a brief description of how to implement new algorithms in WEKA for data classification and clustering. We saw that by using WEKA a researcher can easily implement her own algorithms without other technical concerns, such as binding an algorithm to a GUI or even loading the data from a file or database, as these tasks and many others are handled transparently by the WEKA framework.

References

[1] M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, I. H. Witten, “The WEKA Data Mining Software: An Update”, SIGKDD Explorations, 2009, Volume 11, Issue 1, pp.10-18.

[2] “Time Series Analysis and Forecasting with Weka”, Pentaho Community
last access: 22/08/2011.

[3]  “Wekalist – Weka machine learning workbench list”
last access: 22/08/2011.

[4] “Pages”, Weka wiki
last access: 22/08/2011.

[5]  “Use WEKA in your Java code”, Weka wiki
last access: 22/08/2011.

[6] “Weka Javadoc”
last access: 22/08/2011.

[7]  K. Beck, E. Gamma, “JUnit Cookbook”
last access: 22/08/2011.

[8] B. Collins-Sussman, B. W. Fitzpatrick, C. M. Pilato, “Version Control with Subversion”, 2008.
last access: 22/08/2011.

[9] “Subversion repository”, Weka wiki
last access: 22/08/2011.

[10]  “Apache Ant 1.8.2 Manual”
last access: 22/08/2011.

[11] “Netbeans 6.0”, Weka wiki
last access: 22/08/2011.

[12] “Eclipse 3.4.x”, Weka wiki
last access: 22/08/2011.

[13] R. R. Bouckaert, E. Frank, M. Hall, R. Kirkby, P. Reutemann, A. Seewald, D. Scuse, “WEKA Manual for Version 3-7-3”, The University of Waikato, 2010.
last access: 22/08/2011.

[14] “JavaBeans Component Design Conventions”, The J2EE Tutorial, Sun Developer Network, 2002.
last access: 22/08/2011.

[15] “How are packages structured for the package management system?”, Weka wiki
last access: 22/08/2011.
