
Server Side Collect

Input Data:

  • Bucket name: s3://weborama-<publisher_name>

  • Path: /seed_urls/<yyyymm>/<dd>/<hh>, for example: /seed_urls/202106/08/02/

  • Frequency: every hour.

  • File name: input_yyyymm-dd-hh.csv.gz

  • File format: compressed CSV

  • File content: one URL per row.

Each URL in the source file must include its protocol; URLs without a protocol will be dropped.
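
As an illustration of the input conventions above, here is a minimal Node.js sketch that builds the object key for a given hour and produces the compressed file body. The function names (seedUrlsKey, withProtocol, buildSeedFile) are hypothetical, and several points are assumptions not stated on this page: hours are taken in UTC, the leading slash of the path is dropped to form the S3 object key, and "protocol" is read as http:// or https://. The upload to S3 itself is left out.

```javascript
const zlib = require("zlib");

// Zero-pad a number to two digits (8 -> "08").
const pad2 = (n) => String(n).padStart(2, "0");

// Build the object key for the hour of `date`.
// Assumption: the hour is expressed in UTC.
function seedUrlsKey(date) {
  const yyyymm = `${date.getUTCFullYear()}${pad2(date.getUTCMonth() + 1)}`;
  const dd = pad2(date.getUTCDate());
  const hh = pad2(date.getUTCHours());
  return `seed_urls/${yyyymm}/${dd}/${hh}/input_${yyyymm}-${dd}-${hh}.csv.gz`;
}

// Keep only URLs that carry an explicit protocol.
// Assumption: only http:// and https:// count as valid protocols.
function withProtocol(urls) {
  return urls.filter((u) => /^https?:\/\//i.test(u));
}

// Produce the gzip-compressed CSV body, one URL per row.
function buildSeedFile(urls) {
  return zlib.gzipSync(withProtocol(urls).join("\n") + "\n");
}
```

Filtering locally before upload makes it visible which URLs would be dropped during ingestion because they lack a protocol.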

Client Side Collect

We deployed the Contextual Collect tag. Before a publisher can use it to collect URLs, you must first ask goldenfish-support@weborama.com to create a dedicated contextual client and provide you with its client ID.

You will need this ID for the following setup, which the client should implement on their websites:

<script async src="https://cstatic.weborama.com/bigsea/contextual/v1/weboctx.min.js"></script>
<script>
  // Command queue: works whether or not the library has loaded yet
  window.weboCtx = window.weboCtx || [];
  // COLLECT ID
  weboCtx.push(function() {
    this.collectURL({
      debug: false,
      clientID: <GoldenFishClientID>, // mandatory
      targetURL: document.URL // URL of the current page
    });
  });
</script>

Replace <GoldenFishClientID> with the publisher's Contextual Client account ID provided by the support team.

Once this setup is done with a publisher, inform me so that the GoldenFish backend team can start ingesting the collected URLs.

Rss Detector

We have created an RssDetector that takes a whitelist of domain names provided by a client and checks which of those domains have an RSS feed. Once a feed is confirmed, its validated URL is forwarded to the RssFetcherService so that any URLs published through the feed in the future will be profiled.

  • Input: a CSV file in S3
    {bucket}/{messageId}/input/

  • Output: three files in S3: a statistics file, a configuration file (the configuration to add to the RssFetcherService) and a file listing the hosts to unblacklist
    {bucket}/{messageId}/output/stats.txt
    {bucket}/{messageId}/output/configuration.yml
    {bucket}/{messageId}/output/blacklisted-hosts.txt
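
For scripts that need to locate these files, the path templates above can be expanded with a small helper. This is only a sketch: rssDetectorPaths and the keys of the object it returns are hypothetical names, and how the bucket and messageId values are obtained is not covered on this page.

```javascript
// Hypothetical helper: expand the documented RssDetector path templates
// for a given bucket and messageId.
function rssDetectorPaths(bucket, messageId) {
  const base = `${bucket}/${messageId}`;
  return {
    input: `${base}/input/`,
    stats: `${base}/output/stats.txt`,
    configuration: `${base}/output/configuration.yml`,
    blacklistedHosts: `${base}/output/blacklisted-hosts.txt`,
  };
}
```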
