Interface KeywordExtractor

    public interface KeywordExtractor
    
                        

    Implementations extract keywords from text.

    • Method Summary

      Modifier and Type     Method                                  Description
      abstract Set<String>  extractKeywords(String text)            Extracts keywords from the given text.
      Double                matchCountToScore(Integer matchCount)   Converts a match count to a similarity score between 0 and 1.
      abstract Set<String>  getKeywords()                           Returns all known keywords.

    • Method Detail

      • extractKeywords

         abstract Set<String> extractKeywords(String text)

        Extracts keywords from the given text.

        Parameters:
        text - the text to extract keywords from
        Returns:
        the set of extracted keywords
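
        As an illustration of the contract described above, here is a minimal sketch of one possible implementation: a fixed-vocabulary extractor that returns the known keywords found in the text. Everything in it is an assumption made for the example. The class names KeywordExtractorSketch and VocabularyKeywordExtractor are hypothetical, the nested interface is a local stand-in for the real one (whose package is not shown on this page), and the tokenization and lower-casing rules are choices of this sketch rather than documented behavior.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical example class; not part of the documented library.
final class KeywordExtractorSketch {

    // Local stand-in mirroring the two abstract methods listed on this page.
    interface KeywordExtractor {
        Set<String> extractKeywords(String text);
        Set<String> getKeywords();
    }

    // Assumed strategy: match whole tokens against a fixed, lower-case vocabulary.
    static final class VocabularyKeywordExtractor implements KeywordExtractor {
        private final Set<String> vocabulary;

        VocabularyKeywordExtractor(Set<String> lowerCaseVocabulary) {
            this.vocabulary = Collections.unmodifiableSet(new HashSet<>(lowerCaseVocabulary));
        }

        @Override
        public Set<String> extractKeywords(String text) {
            Set<String> found = new LinkedHashSet<>(); // keeps first-seen order
            if (text == null) {
                return found; // assumed: null text yields no keywords
            }
            // Split on anything that is not a letter or digit, then look up each token.
            for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
                if (vocabulary.contains(token)) {
                    found.add(token);
                }
            }
            return found;
        }

        @Override
        public Set<String> getKeywords() {
            return vocabulary;
        }
    }

    public static void main(String[] args) {
        KeywordExtractor extractor = new VocabularyKeywordExtractor(
                new HashSet<>(Arrays.asList("java", "spring", "kotlin")));
        // prints [spring, java]
        System.out.println(extractor.extractKeywords("Spring and Java, not Scala."));
    }
}
```

        A real implementation might stem words, match multi-word phrases, or use a different notion of "keyword" entirely; the interface only fixes the input and output types.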

      • matchCountToScore

         Double matchCountToScore(Integer matchCount)

        Converts a match count to a similarity score between 0 and 1.

        This default implementation raises the match ratio to the power 0.4, a nonlinear curve that is generous to partial matches while still giving diminishing returns for each additional match. For example, matching 2 of 15 keywords yields about 0.45 rather than the linear ratio of 0.13, reflecting that even partial matches can be quite valuable.

        The formula is: (matchCount / totalKeywords)^0.4, where totalKeywords is presumably the number of known keywords (see getKeywords()).

        This approach aligns with information retrieval principles where early matches are most significant, but avoids being overly harsh on documents that match only a subset of keywords.

        Parameters:
        matchCount - the number of keywords that matched
        Returns:
        a similarity score from 0.0 to 1.0
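
        The formula above can be checked with a short standalone sketch. It is not the library's code: the class name MatchScoreSketch is hypothetical, totalKeywords is passed as an explicit parameter (in the interface it is presumably derived from getKeywords()), and the handling of zero or out-of-range inputs is an assumption, since this page does not describe those cases.

```java
import java.util.Locale;

// Hypothetical example class; a standalone rendering of (matchCount / totalKeywords)^0.4.
final class MatchScoreSketch {

    // Exponent quoted in the method description.
    static final double EXPONENT = 0.4;

    static double matchCountToScore(int matchCount, int totalKeywords) {
        // Assumed guards: the page does not say how empty or negative inputs are treated.
        if (totalKeywords <= 0 || matchCount <= 0) {
            return 0.0;
        }
        // Cap the ratio at 1 so the score stays within the documented 0..1 range.
        double ratio = Math.min(1.0, (double) matchCount / totalKeywords);
        return Math.pow(ratio, EXPONENT);
    }

    public static void main(String[] args) {
        // 2 of 15 matched: linear ratio is about 0.13, score is about 0.45.
        System.out.println(String.format(Locale.ROOT, "%.2f", matchCountToScore(2, 15)));  // prints 0.45
        // All keywords matched.
        System.out.println(String.format(Locale.ROOT, "%.2f", matchCountToScore(15, 15))); // prints 1.00
    }
}
```

        Because the exponent is below 1, the curve rises steeply for the first few matches and flattens as the ratio approaches 1, which is the diminishing-returns behavior the description refers to.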