What does PageReader do?

PageReader is the mergeflow component that downloads the web content referenced in web feeds (such as RSS and Atom feeds) that users have added to their mergeflow.net topic profiles.

What is mergeflow?

mergeflow is a "smart RSS feed reader". Beyond what traditional RSS feed readers offer, mergeflow clusters news by topic, discovers emerging topics, and shows the development of news topics over time. All this is done completely automatically, without any manual intervention.

Why does mergeflow download the page content?

We need a certain amount of text for each document to perform automatic internal analyses. If the web feed itself already contains enough text, PageReader does not download the feed item's URL.
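The decision described above can be pictured with a small sketch. This is only an illustration, not PageReader's actual code: the function names and the threshold of 100 words are hypothetical, since the amount of text mergeflow really requires is not stated here.

```python
import re

# Hypothetical threshold; the real amount of text mergeflow needs is not published.
MIN_WORDS = 100


def strip_tags(html: str) -> str:
    """Crudely remove HTML tags so that only the visible text is counted."""
    return re.sub(r"<[^>]+>", " ", html)


def needs_page_download(feed_item_text: str, min_words: int = MIN_WORDS) -> bool:
    """Return True if the feed item carries too little text on its own."""
    words = strip_tags(feed_item_text).split()
    return len(words) < min_words
```

With these assumptions, a feed item that contains only a short teaser would trigger a download of the linked page, while a full-text feed item would not.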

Is my content redistributed via mergeflow?

No. mergeflow users can only see references to your documents, consisting of the title and one or two lines of text as a preview, as in search engine results. To read the text, the user has to click on the link to the original page. Your page will always open in a full browser window without any limitations.

I do not offer a web feed. How did PageReader find my pages?

Our users often subscribe to search engine alerts or blog search engine result pages. For example, if one of your new or updated pages is added to a search engine index, it may appear in the alerts that this search engine sends to our users.

Does PageReader obey my robots.txt file?

Yes. We support this standard for robot exclusion. For example, if you do not want PageReader to retrieve content from the /archive directory, you can add lines like these to your robots.txt file:

User-agent: MergeFlow-PageReader
Disallow: /archive/

To explicitly allow access for some robots (Google and mergeflow in the example below) but exclude all others, your robots.txt could look like this:

User-agent: Googlebot
Disallow:

User-agent: MergeFlow-PageReader
Disallow:

User-agent: *
Disallow: /
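
If you would like to check how rules like the ones above are interpreted before publishing them, one option is Python's standard urllib.robotparser module. The following is a minimal sketch: the helper name is_allowed, the host example.com, and the agent name SomeOtherBot are placeholders chosen for illustration, and Python's parser is not necessarily identical to the one PageReader uses.

```python
from urllib.robotparser import RobotFileParser

# First example: keep PageReader out of /archive/ only.
ARCHIVE_RULES = """\
User-agent: MergeFlow-PageReader
Disallow: /archive/
"""

# Second example: allow Googlebot and PageReader, exclude all other robots.
ALLOWLIST_RULES = """\
User-agent: Googlebot
Disallow:

User-agent: MergeFlow-PageReader
Disallow:

User-agent: *
Disallow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Parse robots.txt text and report whether user_agent may fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    agent = "MergeFlow-PageReader"
    print(is_allowed(ARCHIVE_RULES, agent, "https://example.com/archive/old.html"))
    print(is_allowed(ARCHIVE_RULES, agent, "https://example.com/news/today.html"))
    print(is_allowed(ALLOWLIST_RULES, "SomeOtherBot", "https://example.com/"))
```

An empty Disallow line means that nothing is excluded for that robot, which is why the second rule set grants full access to the two named robots.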

Can PageReader cause high load on my server?

No. We try to keep traffic and server load as low as possible. PageReader supports HTTP compression and conditional GET, never sends concurrent requests to a single domain, and pauses between requests to the same domain. Please note that PageReader does not extract and follow links in downloaded pages. This means that PageReader does not behave like a search engine "crawler" or "spider", so it will usually fetch no more than a few recently published documents.
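
For readers who want to see what these techniques look like in practice, here is a rough sketch. It is not PageReader's actual implementation: the names conditional_headers and DomainThrottle are hypothetical, the delay value is chosen by the caller because PageReader's real pause is not stated here, and "domain" is approximated by the URL's host name.

```python
import time
from urllib.parse import urlsplit


def conditional_headers(etag=None, last_modified=None):
    """Build request headers for a compressed, conditional GET.

    The server can answer "304 Not Modified" with no body if the page is
    unchanged since the validators we cached from an earlier response.
    """
    headers = {"Accept-Encoding": "gzip"}
    if etag:
        headers["If-None-Match"] = etag
    if last_modified:
        headers["If-Modified-Since"] = last_modified
    return headers


class DomainThrottle:
    """Enforce a minimum pause between requests to the same host."""

    def __init__(self, min_delay_seconds, clock=time.monotonic):
        self.min_delay = min_delay_seconds  # hypothetical value, set by caller
        self.clock = clock
        self.last_request = {}  # host -> time of the most recent request

    def seconds_to_wait(self, url):
        """Return how long to sleep before the next request to url's host."""
        last = self.last_request.get(urlsplit(url).hostname)
        if last is None:
            return 0.0
        return max(0.0, self.min_delay - (self.clock() - last))

    def record(self, url):
        """Remember that a request to url's host was just sent."""
        self.last_request[urlsplit(url).hostname] = self.clock()
```

A sequential fetch loop would call seconds_to_wait, sleep for that long, send the request with the conditional headers, and then call record. Handling one URL per host at a time in such a loop also rules out concurrent requests to the same domain.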

Please note that there is also another mergeflow component, FeedFetcher, which retrieves web feeds (such as RSS and Atom feeds) from your site. FeedFetcher behaves like any other feed reader.

To learn more about mergeflow, please visit our website.