Class RobotsManager


  • public class RobotsManager
    extends org.apache.manifoldcf.core.database.BaseTable
    This class manages the database table into which we write robots.txt files for hosts. The data resides in the database, as well as in cache (up to a certain point). The result is a memory-limited, database-backed repository of robots.txt files that we can draw on.

    robotsdata
    Field            Type           Description
    hostname         VARCHAR(255)   Primary Key
    robotsdata       BLOB
    expirationtime   BIGINT
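
    The arrangement described above (a bounded in-memory cache in front of a database table) can be sketched roughly as follows. This is only an illustration and not the class's actual implementation: BoundedBackedCache, its methods, and the HashMap standing in for the robotsdata table are all hypothetical.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch: a size-limited LRU cache in front of a slower backing store. */
public class BoundedBackedCache {
  // Stand-in for the robotsdata table (assumption: host name -> robots.txt text).
  private final Map<String, String> backingStore = new HashMap<>();
  private final LinkedHashMap<String, String> cache;
  // Counts how often the backing store had to be consulted.
  int storeReads = 0;

  public BoundedBackedCache(final int limit) {
    // accessOrder = true makes the map evict the least recently used entry first.
    this.cache = new LinkedHashMap<String, String>(16, 0.75f, true) {
      @Override
      protected boolean removeEldestEntry(Map.Entry<String, String> eldest) {
        return size() > limit;
      }
    };
  }

  /** Writes to the backing store and invalidates any cached copy. */
  public void write(String host, String robots) {
    backingStore.put(host, robots);
    cache.remove(host);
  }

  /** Reads from the cache if possible, otherwise from the backing store. */
  public String read(String host) {
    String cached = cache.get(host);
    if (cached != null) {
      return cached;
    }
    storeReads++;
    String stored = backingStore.get(host);
    if (stored != null) {
      cache.put(host, stored);
    }
    return stored;
  }
}
```

    With a limit of 1, reading a second host evicts the first, so the next read of the first host goes back to the backing store.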


    • Constructor Summary

      Constructors 
      Constructor Description
      RobotsManager​(org.apache.manifoldcf.core.interfaces.IThreadContext tc, org.apache.manifoldcf.core.interfaces.IDBInterface database)
      Constructor.
    • Method Summary

      Modifier and Type Method Description
      java.lang.Boolean checkFetchAllowed​(java.lang.String userAgent, java.lang.String hostName, long currentTime, java.lang.String pathString, org.apache.manifoldcf.crawler.interfaces.IProcessActivity activities)
      Check whether a fetch is allowed, reading robots.txt data from the cache or from the database.
      void deinstall()
      Uninstall the manager.
      protected static boolean doesPathMatch​(java.lang.String path, int pathIndex, java.lang.String spec, int specIndex)
      Recursive method for matching specification to path.
      protected static boolean doesPathMatch​(java.lang.String path, java.lang.String spec)
      Check if path matches specification.
      protected static java.lang.String getRobotsKey​(java.lang.String hostName)
      Construct a key which represents an individual host name.
      void install()
      Install the manager.
      protected static java.lang.String makeReadable​(java.lang.String inputString)
      Convert a string from the robots file into a readable form that does NOT contain NUL characters (since PostgreSQL does not accept those).
      protected RobotsManager.RobotsData readRobotsData​(java.lang.String hostName, org.apache.manifoldcf.crawler.interfaces.IProcessActivity activities)
      Read robots data, if it exists.
      void writeRobotsData​(java.lang.String hostName, long expirationTime, java.io.InputStream data)
      Write robots.txt, replacing any existing row.
      • Methods inherited from class org.apache.manifoldcf.core.database.BaseTable

        addTableIndex, analyzeTable, beginTransaction, buildConjunctionClause, constructCountClause, constructDistinctOnClause, constructDoubleCastClause, constructOffsetLimitClause, constructRegexpClause, constructSubstringClause, endTransaction, findConjunctionClauseMax, getDatabaseCacheKey, getDBInterface, getMaxInClause, getMaxOrClause, getSleepAmt, getTableIndexes, getTableName, getTableSchema, getTransactionID, getWindowedReportMaxRows, makeTableKey, noteModifications, performAddIndex, performAlter, performCommit, performCreate, performDelete, performDrop, performInsert, performModification, performQuery, performQuery, performRemoveIndex, performUpdate, prepareRowForSave, readRow, reindexTable, signalRollback, sleepFor
      • Methods inherited from class java.lang.Object

        clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
    • Constructor Detail

      • RobotsManager

        public RobotsManager​(org.apache.manifoldcf.core.interfaces.IThreadContext tc,
                             org.apache.manifoldcf.core.interfaces.IDBInterface database)
                      throws org.apache.manifoldcf.core.interfaces.ManifoldCFException
        Constructor. Note that a RobotsManager handle is only useful within a specific thread context, so the calling connector logic must recreate the handle whenever the thread context changes.
        Parameters:
        tc - is the thread context.
        database - is the database handle.
        Throws:
        org.apache.manifoldcf.core.interfaces.ManifoldCFException
    • Method Detail

      • install

        public void install()
                     throws org.apache.manifoldcf.core.interfaces.ManifoldCFException
        Install the manager.
        Throws:
        org.apache.manifoldcf.core.interfaces.ManifoldCFException
      • deinstall

        public void deinstall()
                       throws org.apache.manifoldcf.core.interfaces.ManifoldCFException
        Uninstall the manager.
        Throws:
        org.apache.manifoldcf.core.interfaces.ManifoldCFException
      • checkFetchAllowed

        public java.lang.Boolean checkFetchAllowed​(java.lang.String userAgent,
                                                   java.lang.String hostName,
                                                   long currentTime,
                                                   java.lang.String pathString,
                                                   org.apache.manifoldcf.crawler.interfaces.IProcessActivity activities)
                                            throws org.apache.manifoldcf.core.interfaces.ManifoldCFException
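        Check whether a fetch is allowed, reading robots.txt data from the cache or from the database.
        Parameters:
        userAgent - is the user agent the check is made for.
        hostName - is the host for which the data is desired.
        currentTime - is the time of the check.
        pathString - is the path to check.
        Returns:
        null if the record needs to be fetched, true if fetch is allowed, false if it is not.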
        Throws:
        org.apache.manifoldcf.core.interfaces.ManifoldCFException
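
        Because the return type is a java.lang.Boolean rather than a boolean, callers have three outcomes to handle. A minimal sketch of that handling follows; FetchDecisionSketch and its action names are hypothetical, and the reading of false as "fetch not allowed" is an assumption, since the documentation above only spells out null and true.

```java
/** Hypothetical sketch of handling a three-valued Boolean result. */
public class FetchDecisionSketch {
  /**
   * Maps the result of a call such as
   *   robotsManager.checkFetchAllowed(userAgent, hostName, currentTime, pathString, activities)
   * to an action name. The action names are illustrative only.
   */
  static String decide(Boolean allowed) {
    if (allowed == null) {
      // No usable robots record: fetch robots.txt, store it (for example
      // with writeRobotsData), then check again.
      return "FETCH_ROBOTS_FIRST";
    }
    // Assumption: false means the path is disallowed for this user agent.
    return allowed.booleanValue() ? "FETCH_DOCUMENT" : "SKIP_DOCUMENT";
  }
}
```

        Testing for null before unboxing matters here: writing if (allowed) directly would throw a NullPointerException in the "needs to be fetched" case.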
      • writeRobotsData

        public void writeRobotsData​(java.lang.String hostName,
                                    long expirationTime,
                                    java.io.InputStream data)
                             throws org.apache.manifoldcf.core.interfaces.ManifoldCFException,
                                    java.io.IOException
        Write robots.txt, replacing any existing row.
        Parameters:
        hostName - is the host.
        expirationTime - is the time this data should expire.
        data - is the robots data stream. May be null.
        Throws:
        org.apache.manifoldcf.core.interfaces.ManifoldCFException
        java.io.IOException
      • getRobotsKey

        protected static java.lang.String getRobotsKey​(java.lang.String hostName)
        Construct a key which represents an individual host name.
        Parameters:
        hostName - is the name of the host.
        Returns:
        the cache key.
      • readRobotsData

        protected RobotsManager.RobotsData readRobotsData​(java.lang.String hostName,
                                                          org.apache.manifoldcf.crawler.interfaces.IProcessActivity activities)
                                                   throws org.apache.manifoldcf.core.interfaces.ManifoldCFException
        Read robots data, if it exists.
        Returns:
        null if no robots data exists for the host; otherwise the robots data.
        Throws:
        org.apache.manifoldcf.core.interfaces.ManifoldCFException
      • makeReadable

        protected static java.lang.String makeReadable​(java.lang.String inputString)
        Convert a string from the robots file into a readable form that does NOT contain NUL characters (since PostgreSQL does not accept those).
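
        One way such a conversion could work is to copy the input while dropping NUL characters. The sketch below is an assumption about the approach, not the method's actual code; the real method may filter more than NUL. MakeReadableSketch is a hypothetical name.

```java
/** Hypothetical sketch: strip NUL characters from a string. */
public class MakeReadableSketch {
  static String makeReadable(String inputString) {
    StringBuilder sb = new StringBuilder(inputString.length());
    for (int i = 0; i < inputString.length(); i++) {
      char c = inputString.charAt(i);
      // Keep everything except the NUL character.
      if (c != '\0') {
        sb.append(c);
      }
    }
    return sb.toString();
  }
}
```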
      • doesPathMatch

        protected static boolean doesPathMatch​(java.lang.String path,
                                               java.lang.String spec)
        Check if path matches specification.
      • doesPathMatch

        protected static boolean doesPathMatch​(java.lang.String path,
                                               int pathIndex,
                                               java.lang.String spec,
                                               int specIndex)
        Recursive method for matching specification to path.
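
        A recursive matcher of this shape can be sketched as below. The matching rules are assumptions based on common robots.txt conventions and are not confirmed by this page: the specification is treated as a path prefix, '*' matches any run of characters, and a trailing '$' anchors the match at the end of the path. PathMatchSketch is a hypothetical name.

```java
/** Hypothetical sketch of recursive robots.txt path matching. */
public class PathMatchSketch {
  /** Entry point: match from the start of both strings. */
  static boolean doesPathMatch(String path, String spec) {
    return doesPathMatch(path, 0, spec, 0);
  }

  /** Matches path[pathIndex..] against spec[specIndex..]. */
  static boolean doesPathMatch(String path, int pathIndex, String spec, int specIndex) {
    while (true) {
      if (specIndex == spec.length()) {
        // Spec exhausted: the spec is a prefix of the path, which counts as a match.
        return true;
      }
      char s = spec.charAt(specIndex);
      if (s == '$' && specIndex == spec.length() - 1) {
        // Trailing '$': the path must end exactly here.
        return pathIndex == path.length();
      }
      if (s == '*') {
        // Wildcard: try every possible length, including zero characters.
        for (int next = pathIndex; next <= path.length(); next++) {
          if (doesPathMatch(path, next, spec, specIndex + 1)) {
            return true;
          }
        }
        return false;
      }
      // Literal character: must be present and equal.
      if (pathIndex == path.length() || path.charAt(pathIndex) != s) {
        return false;
      }
      pathIndex++;
      specIndex++;
    }
  }
}
```

        Under these assumed rules, the spec "/*.php$" matches "/a/b.php" but not "/a/b.php?x=1", while "/*.php" matches both.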