Tutorial: An Introduction to Machine Learning and Natural Language Processing Tools
- Nick Rizzolo, Mark Sammons, James Clarke, and Vivek Srikumar
Many of the technologies we rely on in our everyday lives depend on the ability to automatically handle natural language. Search engines determine the relevance of documents with respect to keywords. Spam detectors filter email messages based on their content. Automatic machine translators translate from one natural language to another. Systems such as these all use machine learning to exploit the information in large datasets and improve their performance with experience. In this tutorial, we introduce our suite of state-of-the-art NLP tools, focusing on our modeling language for general learning-based programs, the aptly named Learning Based Java (LBJava).
As motivation, consider the following application, which you will be able to implement by the end of the tutorial. Suppose you have a news feed through which you receive articles from the full spectrum of news sections: world news, politics, health, finance, sports, etc. Perhaps you'd like to filter these articles based on the appearance of people in them who are famous for different reasons. For example, they may be politicians, athletes, or corporate moguls. While a given type of famous person does tend to appear most commonly in a single news section, you'd like to see all news involving those types of people no matter the section. How can your news feed software automatically determine what a given person in the news is famous for?
Tuesday, 8/24, 9:30 - 10:45am, SC 1105
A classifier is simply a function that takes some object as input and produces a discrete output, classifying the object into one of a set of categories. In a traditional programming language, functions such as these must be hard-coded entirely in the syntax of the language. LBJava, on the other hand, allows the partial specification of a classifier whose definition can only be completed via interaction with data. When paired with different data, the same LBJava code results in a different classifier. As examples, we'll take a look at the well-known "20 Newsgroups" dataset, language identification, and spam detection.
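To make the idea of "same code, different data, different classifier" concrete, here is a minimal sketch in plain Python rather than LBJava. It is not how LBJava is written or implemented. The `train` function, the choice of a tiny naive Bayes learner with add-one smoothing, and the toy datasets are all our own assumptions, invented purely for illustration.

```python
import math
from collections import Counter, defaultdict


def train(examples):
    """Learn a classifier from (text, label) pairs.

    Assumption: a tiny multinomial naive Bayes with add-one smoothing
    stands in for whatever learning algorithm is actually chosen.
    """
    word_counts = defaultdict(Counter)  # label -> word -> count
    label_counts = Counter()            # label -> number of examples
    vocabulary = set()
    for text, label in examples:
        words = text.lower().split()
        word_counts[label].update(words)
        label_counts[label] += 1
        vocabulary.update(words)
    total_examples = sum(label_counts.values())

    def classify(text):
        """Return the label with the highest smoothed log-probability."""
        best_label, best_score = None, -math.inf
        for label in sorted(label_counts):
            total_words = sum(word_counts[label].values())
            score = math.log(label_counts[label] / total_examples)
            for word in text.lower().split():
                count = word_counts[label][word] + 1
                score += math.log(count / (total_words + len(vocabulary)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label

    return classify


# Toy data, invented for illustration only.
spam_data = [
    ("win cash now", "spam"),
    ("cheap pills win big", "spam"),
    ("meeting notes attached", "ham"),
    ("lunch meeting tomorrow", "ham"),
]
language_data = [
    ("the cat is on the table", "english"),
    ("where is the station", "english"),
    ("le chat est sur la table", "french"),
    ("où est la gare", "french"),
]

# The same training code, paired with different data,
# yields two different classifiers.
spam_detector = train(spam_data)
language_identifier = train(language_data)
```

Nothing in `train` mentions spam or languages. The categories and the behavior of each resulting function come entirely from the examples it is given.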
The Cognitive Computations Group has developed a suite of state-of-the-art NLP tools, many of which have online demos so you can try them out even before downloading them. We manage the application of these tools in experiments and NLP software using a service called the Curator. Time allowing, we'll begin discussing these tools during this first lecture.
Tuesday, 8/24, 5:00 - 6:30pm, SC 0220 (basement Linux lab)
Part III: Feature Engineering
Any machine learning algorithm must be told which facets (or features) of the data to incorporate in the learned function, which then weighs the importance of each. For example, when learning a spam detector, one option is to use as features the appearance of each possible word in the email's body. We may also want the learned function to consider the words in the subject line, or the values of the various other headers, or higher-level properties of the text. This part of the tutorial is a hands-on exploration of these ideas, in which you'll try to improve a classifier from Part I by engineering your own features.
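The feature choices described above can be sketched as small extractor functions. This is plain Python, not LBJava, and every name in it (`body_words`, `subject_words`, `shouting`, `extract_features`) is hypothetical. The email is represented as a dictionary with `subject` and `body` fields, which is our own simplifying assumption, and the sample message is invented.

```python
def body_words(email):
    """One feature per distinct word appearing in the body."""
    return {"body:" + word for word in email["body"].lower().split()}


def subject_words(email):
    """One feature per distinct word appearing in the subject line."""
    return {"subject:" + word for word in email["subject"].lower().split()}


def shouting(email):
    """A higher-level property: is the subject written entirely in capitals?"""
    return {"subject_all_caps"} if email["subject"].isupper() else set()


def extract_features(email, extractors):
    """Union the features produced by each extractor in the list."""
    features = set()
    for extractor in extractors:
        features |= extractor(email)
    return features


# Invented sample message, for illustration only.
email = {"subject": "FREE OFFER", "body": "Claim your free prize today"}

baseline = extract_features(email, [body_words])
engineered = extract_features(email, [body_words, subject_words, shouting])
```

Prefixing each feature with its source keeps `body:free` and `subject:free` distinct, so a learning algorithm can weigh a word in the subject line differently from the same word in the body.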
Thursday, 8/26, 9:30 - 10:45am, SC 2405 (note room change!!!)
Now that we have some expertise in learning-based programming, we'll discuss the implementation of a "fame classifier", capable of classifying newsworthy people by what they're famous for. We'll employ our Named Entity Recognizer to detect when a person is mentioned in the text, as well as our Part of Speech Tagger to help us engineer effective features. LBJava has abstracted away the details of learning and applying these two NLP tools so that we can focus on building the fame classifier and its associated application.
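As a rough illustration of how the two tools could feed the fame classifier, the sketch below takes tokens that already carry a part-of-speech tag and an entity label, finds the person mentions, and collects nearby nouns and verbs as features. It is plain Python and does not call the actual Named Entity Recognizer or Part of Speech Tagger. The `(word, tag, entity)` input format, the `PER` label, the function names, and the hand-tagged sentence about an invented person are all assumptions made for this example.

```python
def person_mentions(tokens):
    """Group consecutive tokens labeled PER into (start, end) index spans."""
    spans, start = [], None
    for i, (_, _, entity) in enumerate(tokens):
        if entity == "PER" and start is None:
            start = i
        elif entity != "PER" and start is not None:
            spans.append((start, i))
            start = None
    if start is not None:
        spans.append((start, len(tokens)))
    return spans


def context_features(tokens, span, window=3):
    """Collect nouns and verbs within `window` tokens of a person mention."""
    start, end = span
    lo = max(0, start - window)
    hi = min(len(tokens), end + window)
    features = set()
    for i in list(range(lo, start)) + list(range(end, hi)):
        word, pos, _ = tokens[i]
        # Penn Treebank noun tags start with NN, verb tags with VB.
        if pos.startswith("NN") or pos.startswith("VB"):
            features.add(pos[:2] + ":" + word.lower())
    return features


# Hand-tagged sentence about an invented person, standing in for tool output.
sentence = [
    ("Senator", "NNP", "O"),
    ("Jane", "NNP", "PER"),
    ("Doe", "NNP", "PER"),
    ("voted", "VBD", "O"),
    ("against", "IN", "O"),
    ("the", "DT", "O"),
    ("bill", "NN", "O"),
]

mentions = person_mentions(sentence)
features = context_features(sentence, mentions[0])
```

Features such as a nearby "senator" or "voted" are the kind of evidence a learned fame classifier could weigh toward a "politician" label. Which features actually help is an empirical question for the hands-on sessions.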
The Cognitive Computations Group has developed a suite of state-of-the-art NLP tools, many of which have online demos so you can try them out even before downloading them. We manage the application of these tools in experiments and NLP software using a service called the Curator. The suite includes, but is not limited to:
- Learning Based Java,
- the Part of Speech Tagger,
- the Chunker,
- the Named Entity Recognizer,
- the Coreference Resolution system,
- the Semantic Role Labeler,
- NESim: a word similarity metric based on named entities,
- WNSim: a word similarity metric based on WordNet,
- and the Curator.
In this section of the tutorial, we give an overview of these tools, describing ways in which they can be used to develop better NLP applications.
Thursday, 8/26, 5:00 - 6:30pm, SC 0220 (basement Linux lab)
This hands-on afternoon session is another opportunity for you to apply what you've learned to improve the fame classifier.