ATEE - Automatic Topic Extraction Engine

ATEE stands for Automatic Topic Extraction Engine, our own solution based on an adaptive machine learning algorithm for natural language processing.


Our experience with publishers shows that the topics/tags they provide are often of low quality. This is usually the result of:

  • no tag classification in their system 
  • using keywords as tags
  • the human factor, where authors assign a wrong or irrelevant value to a topic, misspell it, etc.

This directly influences the quality of data and insights we are able to provide.


ATEE helps publishers eliminate the factors listed above in order to produce meaningful data and insights.


How do you enable ATEE?


ATEE is a core feature, but it is disabled by default.
If you want to use it, please send us written consent and specify the domain for which you'd like ATEE switched on. After that, ATEE will be enabled at no additional cost.


Usage and presentation


Detected topics are treated in the same way as those specified by the publisher.
ATEE does not affect provided topics in any way. If the engine detects a topic that already exists, the detected topic is discarded.
The number of detected topics is limited to the 5 most relevant within a single post.
Detected topics are clearly distinguished from those sent through our tracker: each one has a laboratory flask icon beside its name.
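
The rules above can be illustrated with a minimal Python sketch. It is not ATEE's actual implementation. The DetectedTopic type, the relevance score, and the select_detected function are hypothetical names introduced only for this illustration. The sketch also assumes that a detected topic "already exists" when its name matches a provided topic ignoring case and surrounding whitespace, and that duplicates are discarded before the limit of 5 is applied. The engine's real matching and ordering may differ.

```python
from dataclasses import dataclass

MAX_DETECTED = 5  # limit per post, as stated in the text above


@dataclass(frozen=True)
class DetectedTopic:
    name: str
    relevance: float  # hypothetical score; higher means more relevant


def select_detected(provided, detected, limit=MAX_DETECTED):
    """Return the detected topics that would be shown next to the provided ones.

    Provided topics are never modified; they are only used for comparison.
    """
    # Assumption: names are compared ignoring case and surrounding whitespace.
    seen = {name.strip().lower() for name in provided}
    kept = []
    # Walk the detected topics from most to least relevant.
    for topic in sorted(detected, key=lambda t: t.relevance, reverse=True):
        key = topic.name.strip().lower()
        if key in seen:
            continue  # topic already exists, so the detected one is discarded
        seen.add(key)
        kept.append(topic)
        if len(kept) == limit:
            break
    return kept


provided = ["Politics", "Economy"]
detected = [
    DetectedTopic("politics", 0.9),   # duplicate of a provided topic
    DetectedTopic("Elections", 0.8),
    DetectedTopic("Budget", 0.7),
    DetectedTopic("Taxes", 0.6),
    DetectedTopic("Inflation", 0.5),
    DetectedTopic("Trade", 0.4),
    DetectedTopic("Jobs", 0.3),       # falls outside the limit of 5
]
selected = select_detected(provided, detected)
print([t.name for t in selected])
```

In this example the detected "politics" topic is dropped because the publisher already provided "Politics", and "Jobs" is dropped because only the 5 most relevant detected topics are kept.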


Technical requirements


Our crawler script is separate from our tracker; they are two different services.
The crawler uses the Apify web scraping and automation platform, which runs on AWS servers.
Crawler requests are sometimes blocked on the client's side, in which case the client has to allow our crawler to access their content. There are two ways to do this, and clients can use one or both of them:

  1. Filter by fixed IP range
    The crawler runs from a fixed range of IPs listed in this JSON file.
    Those IPs need to be whitelisted on the client's side so that our crawler can access the content and extract relevant information.
  2. Filter by User-Agent
    Our crawler identifies itself with this User-Agent string:
    Mozilla/5.0 (compatible; contentinsights.com data-extractor/1.0; +http://contentinsights.com)
    If possible, the client should allow all HTTP requests that carry this header.
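
As a rough illustration of the two filtering options above, the following Python sketch shows how a client-side check might decide whether a request comes from the crawler. It is only a sketch under stated assumptions: the function names and the require_both flag are hypothetical, and the IP ranges are placeholders taken from reserved documentation address blocks, not the crawler's real IPs. The real ranges come from the JSON file mentioned above, whose format is not described here, so the sketch does not attempt to parse it.

```python
import ipaddress

# User-Agent string quoted in the text above.
CRAWLER_USER_AGENT = (
    "Mozilla/5.0 (compatible; contentinsights.com data-extractor/1.0; "
    "+http://contentinsights.com)"
)

# Placeholder ranges from reserved documentation blocks, NOT the real crawler IPs.
EXAMPLE_RANGES = ["192.0.2.0/24", "198.51.100.17/32"]


def parse_ranges(ranges):
    """Turn CIDR strings into network objects once, up front."""
    return [ipaddress.ip_network(r) for r in ranges]


def is_crawler_request(remote_ip, user_agent, networks, require_both=False):
    """Return True if the request should be let through as crawler traffic.

    require_both=False: either the IP filter or the User-Agent filter passes.
    require_both=True: both filters must pass.
    """
    ip = ipaddress.ip_address(remote_ip)
    ip_ok = any(ip in net for net in networks)
    ua_ok = user_agent == CRAWLER_USER_AGENT
    return (ip_ok and ua_ok) if require_both else (ip_ok or ua_ok)


networks = parse_ranges(EXAMPLE_RANGES)
print(is_crawler_request("192.0.2.45", "curl/8.0", networks))
print(is_crawler_request("203.0.113.9", CRAWLER_USER_AGENT, networks))
print(is_crawler_request("203.0.113.9", CRAWLER_USER_AGENT, networks, require_both=True))
```

Since any client can send an arbitrary User-Agent header, requiring both checks is the stricter of the two configurations. In practice this kind of rule usually lives in a firewall, CDN, or web server configuration rather than in application code.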