edu.washington.cs.knowitall.extractor
Class HtmlSentenceExtractor

java.lang.Object
  extended by edu.washington.cs.knowitall.extractor.Extractor<java.lang.String,java.lang.String>
      extended by edu.washington.cs.knowitall.extractor.SentenceExtractor
          extended by edu.washington.cs.knowitall.extractor.HtmlSentenceExtractor

public class HtmlSentenceExtractor
extends SentenceExtractor

An Extractor class for extracting sentences (as String objects) from a String containing HTML. It is backed by an OpenNLP SentenceDetector object and uses the code in HtmlUtils to extract plain text from the HTML before sentence detection.
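
The two steps described above (reduce HTML to plain text, then split the text into sentences) can be illustrated with a minimal, JDK-only sketch. It is not this class's implementation: a crude tag-stripping regex stands in for HtmlUtils, java.text.BreakIterator stands in for the OpenNLP SentenceDetector, and the class name HtmlSentenceSketch and its helper methods are hypothetical.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical illustration only; not part of the library.
class HtmlSentenceSketch {

    // Assumption: a crude stand-in for HtmlUtils. Replaces every tag with a
    // space, then collapses runs of whitespace into a single space.
    static String stripTags(String html) {
        String noTags = html.replaceAll("(?s)<[^>]*>", " ");
        return noTags.replaceAll("\\s+", " ").trim();
    }

    // Assumption: BreakIterator stands in for the OpenNLP SentenceDetector.
    static List<String> splitSentences(String text) {
        List<String> sentences = new ArrayList<>();
        BreakIterator it = BreakIterator.getSentenceInstance(Locale.US);
        it.setText(text);
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String sentence = text.substring(start, end).trim();
            if (!sentence.isEmpty()) {
                sentences.add(sentence);
            }
        }
        return sentences;
    }

    // Mirrors the shape of extractCandidates: HTML in, candidate sentences out.
    static List<String> extractCandidates(String htmlBlock) {
        return splitSentences(stripTags(htmlBlock));
    }

    public static void main(String[] args) {
        String html = "<p>Cats sleep a lot. <b>Dogs</b> bark loudly.</p>";
        for (String sentence : extractCandidates(html)) {
            System.out.println(sentence);
        }
    }
}
```

A regex is not a robust way to handle real-world HTML (scripts, comments, entities); it is used here only to keep the sketch self-contained.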

Author:
afader

Constructor Summary
HtmlSentenceExtractor()
          Constructs a new HtmlSentenceExtractor object using the default OpenNLP SentenceDetector object, as returned by DefaultObjects.getDefaultSentenceDetector().
HtmlSentenceExtractor(opennlp.tools.sentdetect.SentenceDetector detector)
          Constructs a new HtmlSentenceExtractor object using the given OpenNLP SentenceDetector object.
 
Method Summary
protected  java.lang.Iterable<java.lang.String> extractCandidates(java.lang.String htmlBlock)
          Runs the OpenNLP SentenceDetector object on the given String source, and returns an Iterable object over the detected sentences.
static void main(java.lang.String[] args)
          Extracts sentences from HTML passed via standard input, or through a file given as an argument to the program.
 
Methods inherited from class edu.washington.cs.knowitall.extractor.SentenceExtractor
getSentenceDetector
 
Methods inherited from class edu.washington.cs.knowitall.extractor.Extractor
addMapper, compose, extract, extract, getMappers
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

HtmlSentenceExtractor

public HtmlSentenceExtractor(opennlp.tools.sentdetect.SentenceDetector detector)
Constructs a new HtmlSentenceExtractor object using the given OpenNLP SentenceDetector object.

Parameters:
detector - The OpenNLP SentenceDetector object used to split the extracted text into sentences.

HtmlSentenceExtractor

public HtmlSentenceExtractor()
                      throws java.io.IOException
Constructs a new HtmlSentenceExtractor object using the default OpenNLP SentenceDetector object, as returned by DefaultObjects.getDefaultSentenceDetector().

Throws:
java.io.IOException - if the default SentenceDetector object cannot be loaded.

Method Detail

extractCandidates

protected java.lang.Iterable<java.lang.String> extractCandidates(java.lang.String htmlBlock)
Description copied from class: SentenceExtractor
Runs the OpenNLP SentenceDetector object on the given String source, and returns an Iterable object over the detected sentences.

Overrides:
extractCandidates in class SentenceExtractor
Parameters:
htmlBlock - The source to extract from.
Returns:
An Iterable object over the candidate extractions.

main

public static void main(java.lang.String[] args)
                 throws java.lang.Exception
Extracts sentences from HTML passed via standard input, or through a file given as an argument to the program. Removes brackets from sentences using the BracketsRemover mapper class, and filters sentences using the SentenceEndFilter, SentenceStartFilter, and SentenceLengthFilter mapper classes. Prints the resulting sentences to standard output, one sentence per line.

Parameters:
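args - Optionally, the path of a file containing the HTML to process; if no argument is given, the HTML is read from standard input.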
Throws:
java.lang.Exception
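
The cleanup that main applies after sentence detection (bracket removal, then filtering on sentence end, sentence start, and length) can be sketched with plain JDK code. This page does not state the exact rules used by BracketsRemover, SentenceEndFilter, SentenceStartFilter, or SentenceLengthFilter, so every rule and threshold below is an illustrative assumption, and the class SentenceCleanupSketch is hypothetical.

```java
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical illustration only; the rules below are assumptions, not the
// behaviour of the library's mapper classes.
class SentenceCleanupSketch {

    // Assumed bracket removal: delete (...), [...] and {...} spans, then tidy
    // the whitespace left behind.
    static String removeBrackets(String s) {
        return s.replaceAll("\\([^)]*\\)|\\[[^\\]]*\\]|\\{[^}]*\\}", "")
                .replaceAll("\\s+", " ")
                .replaceAll("\\s+([.,;:!?])", "$1")
                .trim();
    }

    // Assumed start rule: first character is an upper-case letter or a digit.
    static boolean startsLikeSentence(String s) {
        return !s.isEmpty()
                && (Character.isUpperCase(s.charAt(0)) || Character.isDigit(s.charAt(0)));
    }

    // Assumed end rule: sentence ends with terminal punctuation.
    static boolean endsLikeSentence(String s) {
        return s.endsWith(".") || s.endsWith("?") || s.endsWith("!");
    }

    // Assumed length rule: whitespace-separated token count within bounds.
    static boolean hasReasonableLength(String s, int minTokens, int maxTokens) {
        String trimmed = s.trim();
        int tokens = trimmed.isEmpty() ? 0 : trimmed.split("\\s+").length;
        return tokens >= minTokens && tokens <= maxTokens;
    }

    // Applies the steps in the order the description of main lists them.
    static List<String> clean(List<String> sentences) {
        return sentences.stream()
                .map(SentenceCleanupSketch::removeBrackets)
                .filter(SentenceCleanupSketch::endsLikeSentence)
                .filter(SentenceCleanupSketch::startsLikeSentence)
                .filter(s -> hasReasonableLength(s, 3, 50)) // bounds are arbitrary
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> raw = List.of(
                "The cat (a tabby) sleeps all day.",
                "click here to subscribe",
                "Home | About",
                "Yes.");
        // Prints the surviving sentences, one per line.
        clean(raw).forEach(System.out::println);
    }
}
```

Filters of this kind are useful on web text because navigation labels, link captions, and other page furniture tend to lack sentence-like capitalization, punctuation, or length.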