Class Robots.Host
- java.lang.Object
-
- org.apache.manifoldcf.crawler.connectors.rss.Robots.Host
-
- Enclosing class:
- Robots
protected class Robots.Host extends java.lang.Object
This class maintains status for a given host. There's an instance of this class for each host in the robots cache.
-
-
Field Summary
Fields
Modifier and Type | Field | Description
protected int | checkingRobots | This will be set to nonzero if the robots structure is currently in use.
protected java.lang.String | hostName | Host name
protected long | invalidTime | Timestamp.
protected boolean | isValid | This flag describes whether or not the host record is valid yet.
protected int | port | Port
protected java.lang.String | protocol | Protocol
protected boolean | readingRobots | This will be set to "true" if the robots.txt for this host is in the process of being read.
protected java.util.ArrayList | records | This is the list of robots records for the host, or null if no robots.txt found.
-
Constructor Summary
Constructors
Constructor | Description
Host(java.lang.String protocol, int port, java.lang.String hostName) | Constructor.
-
Method Summary
All Methods
Modifier and Type | Method | Description
boolean | canBeFlushed(long currentTime) | Check if the current record can be flushed.
boolean | isFetchAllowed(org.apache.manifoldcf.core.interfaces.IThreadContext threadContext, java.lang.String throttleGroupName, long currentTime, java.lang.String pathString, java.lang.String userAgent, java.lang.String from, java.lang.String proxyHost, int proxyPort, java.lang.String proxyAuthDomain, java.lang.String proxyAuthUsername, java.lang.String proxyAuthPassword, org.apache.manifoldcf.crawler.interfaces.IProcessActivity activities, int connectionLimit) | Check a given path string against this host's robots file.
protected void | makeValid(org.apache.manifoldcf.core.interfaces.IThreadContext threadContext, java.lang.String throttleGroupName, long currentTime, java.lang.String userAgent, java.lang.String from, java.lang.String proxyHost, int proxyPort, java.lang.String proxyAuthDomain, java.lang.String proxyAuthUsername, java.lang.String proxyAuthPassword, java.lang.String hostName, org.apache.manifoldcf.crawler.interfaces.IProcessActivity activities, int connectionLimit) | Initialize the record.
protected void | parseRobotsTxt(java.io.BufferedReader r, java.lang.String hostName, org.apache.manifoldcf.crawler.interfaces.IProcessActivity activities) | Parse the robots.txt file using a reader.
-
-
-
Field Detail
-
protocol
protected java.lang.String protocol
Protocol
-
port
protected int port
Port
-
hostName
protected java.lang.String hostName
Host name
-
invalidTime
protected long invalidTime
Timestamp. This is the time that the cache record becomes invalid.
-
isValid
protected boolean isValid
This flag describes whether or not the host record is valid yet.
-
records
protected java.util.ArrayList records
This is the list of robots records for the host, or null if no robots.txt found.
-
readingRobots
protected boolean readingRobots
This will be set to "true" if the robots.txt for this host is in the process of being read.
-
checkingRobots
protected int checkingRobots
This will be set to nonzero if the robots structure is currently in use.
-
-
Method Detail
-
isFetchAllowed
public boolean isFetchAllowed(org.apache.manifoldcf.core.interfaces.IThreadContext threadContext, java.lang.String throttleGroupName, long currentTime, java.lang.String pathString, java.lang.String userAgent, java.lang.String from, java.lang.String proxyHost, int proxyPort, java.lang.String proxyAuthDomain, java.lang.String proxyAuthUsername, java.lang.String proxyAuthPassword, org.apache.manifoldcf.crawler.interfaces.IProcessActivity activities, int connectionLimit) throws org.apache.manifoldcf.agents.interfaces.ServiceInterruption, org.apache.manifoldcf.core.interfaces.ManifoldCFException
Check a given path string against this host's robots file.
- Parameters:
currentTime - is the current time in milliseconds since epoch.
pathString - is the path string to check.
- Returns:
- true if crawling is allowed, false otherwise.
- Throws:
org.apache.manifoldcf.agents.interfaces.ServiceInterruption
org.apache.manifoldcf.core.interfaces.ManifoldCFException
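The path check itself can be illustrated in isolation. The following is a minimal sketch, not the ManifoldCF implementation: the class `PathCheckSketch` and its methods are hypothetical, the rules are assumed to be plain path prefixes, and the "longest matching prefix wins, ties go to allow" rule is borrowed from RFC 9309 rather than confirmed by this page. The real method also handles fetching, throttling, and proxy settings, none of which appear here.

```java
import java.util.List;

// Hypothetical helper, for illustration only; not part of Robots.Host.
class PathCheckSketch {
    // Returns true if the path may be fetched under the given prefix rules.
    // Assumption: the longest matching prefix decides, and a tie favors "allow".
    static boolean isPathAllowed(String path, List<String> allows, List<String> disallows) {
        int bestAllow = longestPrefix(path, allows);
        int bestDisallow = longestPrefix(path, disallows);
        // With no matching rule both values are -1, so the path is allowed.
        return bestAllow >= bestDisallow;
    }

    // Length of the longest non-empty prefix that matches the path, or -1 if none match.
    private static int longestPrefix(String path, List<String> prefixes) {
        int best = -1;
        for (String p : prefixes) {
            if (!p.isEmpty() && path.startsWith(p) && p.length() > best) {
                best = p.length();
            }
        }
        return best;
    }
}
```

Under these assumptions, a rule set with `Disallow: /private/` and `Allow: /private/feed.xml` lets the feed through while blocking everything else under `/private/`.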
-
canBeFlushed
public boolean canBeFlushed(long currentTime)
Check if the current record can be flushed. This is not quite the same as whether the record is valid, since a not-yet-valid record still should not be flushed when there is activity going on with that record!
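The distinction between "valid" and "flushable" can be sketched as follows. This is an assumption about how the documented fields might combine, not the actual ManifoldCF code: `HostRecordSketch` is a hypothetical stand-in class, and the exact rule (idle, and either never loaded or expired) is a guess based on the field descriptions.

```java
// Hypothetical stand-in for a robots cache record; not the ManifoldCF source.
class HostRecordSketch {
    // These mirror the documented fields of Robots.Host.
    protected long invalidTime = -1L;
    protected boolean isValid = false;
    protected boolean readingRobots = false;
    protected int checkingRobots = 0;

    // Assumed rule: never flush a record that is in use; otherwise flush it
    // if it was never made valid or if its expiration time has passed.
    public synchronized boolean canBeFlushed(long currentTime) {
        if (readingRobots || checkingRobots > 0) {
            // Activity is in progress, so the record must stay in the cache
            // even though it may not be valid yet.
            return false;
        }
        return !isValid || currentTime >= invalidTime;
    }
}
```

In this sketch, a record whose robots.txt is still being read reports `false` even though `isValid` is also `false`, which is the case the description above calls out.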
-
makeValid
protected void makeValid(org.apache.manifoldcf.core.interfaces.IThreadContext threadContext, java.lang.String throttleGroupName, long currentTime, java.lang.String userAgent, java.lang.String from, java.lang.String proxyHost, int proxyPort, java.lang.String proxyAuthDomain, java.lang.String proxyAuthUsername, java.lang.String proxyAuthPassword, java.lang.String hostName, org.apache.manifoldcf.crawler.interfaces.IProcessActivity activities, int connectionLimit) throws org.apache.manifoldcf.agents.interfaces.ServiceInterruption, org.apache.manifoldcf.core.interfaces.ManifoldCFException
Initialize the record. This method reads the robots file on the specified protocol/host/port, and parses it according to the rules.
- Throws:
org.apache.manifoldcf.agents.interfaces.ServiceInterruption
org.apache.manifoldcf.core.interfaces.ManifoldCFException
-
parseRobotsTxt
protected void parseRobotsTxt(java.io.BufferedReader r, java.lang.String hostName, org.apache.manifoldcf.crawler.interfaces.IProcessActivity activities) throws java.io.IOException, org.apache.manifoldcf.core.interfaces.ManifoldCFException
Parse the robots.txt file using a reader. Is NOT expected to close the stream.
- Throws:
java.io.IOException
org.apache.manifoldcf.core.interfaces.ManifoldCFException
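A reader-based parse that leaves the stream open might look like the sketch below. It is illustrative only: `RobotsRecordSketch` and `RobotsParserSketch` are hypothetical names, the `hostName` and `activities` parameters of the real method are omitted, and the grouping rules (consecutive `User-agent` lines share one record, `#` starts a comment, unknown fields are ignored) are assumptions based on common robots.txt conventions rather than on the ManifoldCF source.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical record type: one group of user-agent names plus its rules.
class RobotsRecordSketch {
    final List<String> userAgents = new ArrayList<>();
    final List<String> disallows = new ArrayList<>();
    final List<String> allows = new ArrayList<>();
}

// Hypothetical parser, for illustration only.
class RobotsParserSketch {
    // Reads lines until end of stream. The reader is NOT closed here;
    // the caller that opened it stays responsible for closing it.
    static List<RobotsRecordSketch> parse(BufferedReader r) throws IOException {
        List<RobotsRecordSketch> records = new ArrayList<>();
        RobotsRecordSketch current = null;
        boolean lastWasAgent = false;
        String line;
        while ((line = r.readLine()) != null) {
            // Strip trailing comments, then surrounding whitespace.
            int hash = line.indexOf('#');
            if (hash >= 0) {
                line = line.substring(0, hash);
            }
            line = line.trim();
            int colon = line.indexOf(':');
            if (colon < 0) {
                continue; // blank or malformed line
            }
            String field = line.substring(0, colon).trim().toLowerCase(Locale.ROOT);
            String value = line.substring(colon + 1).trim();
            if (field.equals("user-agent")) {
                // A User-agent line that follows a rule line starts a new record.
                if (current == null || !lastWasAgent) {
                    current = new RobotsRecordSketch();
                    records.add(current);
                }
                current.userAgents.add(value.toLowerCase(Locale.ROOT));
            } else if (current != null && !value.isEmpty()) {
                if (field.equals("disallow")) {
                    current.disallows.add(value);
                } else if (field.equals("allow")) {
                    current.allows.add(value);
                }
            }
            lastWasAgent = field.equals("user-agent");
        }
        return records;
    }
}
```

Because the method never closes `r`, a caller can wrap the call in its own try-with-resources block and keep ownership of the underlying connection.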
-
-