Tutorial: Machine Learning Tools in Natural Language Processing
This tutorial explores the use of SNoW and FEX, two of our core machine learning tools, to solve text processing problems. Specifically, we apply SNoW and FEX to context-sensitive spelling correction and named entity tagging. The tutorial also uses a number of our other standard tools (such as our Part-of-Speech Tagger) and custom scripts to preprocess and postprocess data for each task. Finally, it mentions ways to streamline the process using Perl and shell scripts, and the server modes of SNoW and FEX.
For each tutorial session below, you will find links to the software and data you need, script files that detail command line usage, and (where appropriate) helper scripts.
(The script files were recorded with the Unix script command, hence the suffix .script.)
Session 1: Context Sensitive Spelling
- sample text
- sentence segmentation tool (script: using the sentence splitter)
- word splitter (script: using the word splitter)
- Part-of-Speech Tagger (script: using the POS-tagger)
- FEX (script: using FEX (basic))
- preprocessing test materials
- Context Sensitive Spelling materials
- explanation of POS tags
Session 2: Named Entity Tagging with FEX and SNoW
- NE data (original form)
- NE data (desired form)
- Data reformatting script
- NE Tagging: postprocessing script: analyzing SNoW output
- NE Tagging: new data tools script: tagging new (raw) text
- Shallow Parser (Read Only) script: using the Shallow Parser
- chunk-to-column script
- Sample FEX script: NE Tagging
More details about SNoW, including its many useful tuning parameters and support for inference, can be found in the comprehensive SNoW user manual.
FEX's user manual is included in its distribution tarball. In addition to explaining FEX's scripting language and input formats, the manual gives details of other specialized FEX modes.
The following resources demonstrate FEX's document mode, which is not covered in this tutorial.
- Document Classification ( ppt ) ( pdf )
- Document Classifier Materials
- Document Classifier Processed Data
If you have questions about these materials, particularly if you are attending the current tutorial sessions, please contact me at firstname.lastname@example.org. Please also write if you find incorrectly labeled resources, errors, or broken links.
Under each session heading are links to slides in PowerPoint and PDF formats. NOTE: some slides will not display correctly in PDF format.
The resources to accompany each session are provided further down the page.
- Session 1
Text Processing and Feature Extraction
Introduction; Preprocessing; Feature Extraction ppt
Multi-class Classification with SNoW ppt pdf
- Session 2
Applying FEX and SNoW to Named Entity Tagging
Candidate Selection and Feature Extraction: Named Entity Tagging ppt
- Session 3
Learning Based Java
The Basics: ppt pdf
In the resources section below, see the User's Manual and the README in the "toy context sensitive spelling corrector" for more info.