Tutorial: An Introduction to Machine Learning and Natural Language Processing Tools
- Mark Sammons, and Vivek Srikumar
Many of the technologies we rely on in our everyday lives depend on the ability to automatically handle natural language. Search engines determine the relevance of documents with respect to keywords. Spam detectors filter email messages based on their content. Machine translation systems translate text from one natural language to another. Systems such as these all use machine learning to leverage the information in large datasets and improve their performance with experience. In this tutorial, we introduce our suite of state-of-the-art NLP tools, focusing on:
- our modeling language for general learning-based programs, the aptly named Learning Based Java (LBJ);
- our NLP tool management system, Curator;
- our NLP data structure library, Edison.
As motivation, consider the following application, which you will be able to implement by the end of the tutorial. Suppose you have a news feed through which you receive articles from the full spectrum of news sections: world news, politics, health, finance, sports, etc. Perhaps you'd like to filter these articles based on the people who appear in them and what those people are famous for. For example, they may be politicians, athletes, or corporate moguls. While a given type of famous person does tend to appear most commonly in a single news section, you'd like to see all news involving those types of people no matter the section. How can your news feed software automatically determine what a given person in the news is famous for?
Thursday, 5/26, 3:30 - 4:30pm, SC 1111
A classifier is simply a function that takes some object as input and produces a discrete output, classifying the object into one of a set of categories. In a traditional programming language, functions such as these must be hard coded entirely in the syntax of the language. LBJ, on the other hand, allows the partial specification of a classifier whose definition can only be completed via interaction with data. When paired with different data, the same LBJ code results in a different classifier. As examples, we'll take a look at the well-known "20 Newsgroups" dataset, language identification, and spam detection.
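To make the idea of a classifier "completed by data" concrete, here is a minimal sketch in plain Python. It is not LBJ syntax and does not use any LBJ API; the `train` function, the Naive Bayes learner with add-one smoothing, and the toy datasets are all assumptions chosen for illustration. The point is only that one piece of code, paired with two different datasets, yields two different classifiers.

```python
from collections import Counter, defaultdict
import math


def train(examples):
    """Learn a bag-of-words Naive Bayes classifier from (text, label) pairs.

    Returns a function that maps a text to one of the labels seen in training.
    (Illustrative only; not how LBJ specifies or trains classifiers.)
    """
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for text, label in examples:
        label_counts[label] += 1
        for word in text.lower().split():
            word_counts[label][word] += 1
            vocabulary.add(word)

    total = sum(label_counts.values())

    def classify(text):
        best_label, best_score = None, float("-inf")
        for label in sorted(label_counts):
            # Log prior plus log likelihood of each known word,
            # with add-one smoothing over the training vocabulary.
            score = math.log(label_counts[label] / total)
            denom = sum(word_counts[label].values()) + len(vocabulary)
            for word in text.lower().split():
                if word in vocabulary:
                    score += math.log((word_counts[label][word] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label

    return classify


# The same code paired with different (toy, made-up) data
# produces different classifiers.
spam_detector = train([
    ("win money now", "spam"),
    ("cheap money offer", "spam"),
    ("meeting agenda attached", "ham"),
    ("lunch meeting tomorrow", "ham"),
])
language_identifier = train([
    ("the cat is on the table", "english"),
    ("where is the station", "english"),
    ("le chat est sur la table", "french"),
    ("où est la gare", "french"),
])

print(spam_detector("cheap money"))
print(language_identifier("la gare"))
```

The learner itself never mentions spam or languages; the categories come entirely from the labels in the training data, which is the sense in which the data completes the classifier's definition.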
The Cognitive Computations Group has developed a suite of state-of-the-art NLP tools, many of which have online demos so you can try them out even before downloading them. We manage the application of these tools in experiments and NLP software using a service called the Curator. Time allowing, we'll begin discussing these tools during this first lecture.
Friday, 5/27, 1:00 - 4:30pm, SC 2405
Applications that handle text documents could potentially make use of a broad array of NLP tools, such as those we introduced in the first part of this tutorial. However, tools from different sources may use different interfaces and different programming languages, making it difficult to integrate them directly into an end-user application. In this session we demonstrate how to combine Curator, Edison, and LBJ to apply a range of NLP tools to text at run-time, and how to incorporate machine learning components that use the output of these tools to solve a real NLP problem.
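The integration problem described above can be sketched as follows. This is a hypothetical illustration in plain Python, not the Curator or Edison API: `AnnotatedText`, `Span`, `annotate`, and the two toy tools are invented names. It assumes each tool's output can be expressed as labeled spans over a shared tokenization, so that a thin adapter per tool is enough to give the application a single interface.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

# Hypothetical shared format: (first token, one past last token, label).
Span = Tuple[int, int, str]


@dataclass
class AnnotatedText:
    """Tokens plus named layers ("views") of labeled spans over them."""
    tokens: List[str]
    views: Dict[str, List[Span]] = field(default_factory=dict)


# Two toy tools with deliberately incompatible interfaces.
def toy_capital_tagger(tokens: List[str]) -> List[str]:
    """Takes a token list; returns one tag per token."""
    return ["CAP" if tok[:1].isupper() else "O" for tok in tokens]


def toy_title_finder(text: str) -> Dict[str, List[int]]:
    """Takes a raw string; returns {title: [token positions]}."""
    titles = {"Senator", "Coach", "CEO"}
    found: Dict[str, List[int]] = {}
    for i, tok in enumerate(text.split()):
        if tok in titles:
            found.setdefault(tok, []).append(i)
    return found


# Adapters convert each tool's output into the shared Span format.
def capital_view(tokens: List[str]) -> List[Span]:
    tags = toy_capital_tagger(tokens)
    return [(i, i + 1, tag) for i, tag in enumerate(tags) if tag != "O"]


def title_view(tokens: List[str]) -> List[Span]:
    found = toy_title_finder(" ".join(tokens))
    return sorted((i, i + 1, t) for t, idxs in found.items() for i in idxs)


ANNOTATORS: Dict[str, Callable[[List[str]], List[Span]]] = {
    "capitalized": capital_view,
    "title": title_view,
}


def annotate(text: str, view_names: List[str]) -> AnnotatedText:
    """Run the requested annotators and collect their output as views."""
    doc = AnnotatedText(tokens=text.split())
    for name in view_names:
        doc.views[name] = ANNOTATORS[name](doc.tokens)
    return doc


doc = annotate("Senator Smith met Coach Jones", ["capitalized", "title"])
print(doc.views["title"])
```

With every tool's output in one format, a downstream learning component can read features from any view (for example, whether a name is preceded by a title span) without knowing which tool produced it.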