Server Side Collect
Input Data:
bucket name: s3://weborama-<publisher_name>
path: /seed_urls/<yyyymm>/<dd>/<hh>. For example: /seed_urls/202106/08/02/
Frequency: every hour.
File name: input_yyyymm-dd-hh.csv.gz
File format: compressed CSV
File content: one URL per row.
Each URL must include its protocol (e.g. https://) in the source file; URLs without a protocol will be dropped.
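The hourly input file described above can be produced with a few lines of Node.js. This is a minimal sketch, not a reference implementation: the URLs and the output filename are placeholder examples matching the input_yyyymm-dd-hh.csv.gz naming scheme.

const fs = require("fs");
const zlib = require("zlib");

// One URL per row; every URL must include its protocol or it will be dropped.
const urls = [
  "https://www.example.com/article/1",
  "https://www.example.com/article/2",
];

const csv = urls.join("\n") + "\n";

// Compressed CSV, named after the hour it covers (here: 2021-06-08, hour 02).
fs.writeFileSync("input_202106-08-02.csv.gz", zlib.gzipSync(csv));

The resulting file would then be uploaded to the matching hourly path in the publisher's bucket, e.g. s3://weborama-<publisher_name>/seed_urls/202106/08/02/.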
Client Side Collect
We have deployed the Contextual Collect tag. Before a Publisher can use it to collect URLs, you must first ask goldenfish-support@weborama.com to create a dedicated contextual client and provide you with the client ID.
You will need this ID for the following setup, which should be implemented by the client on their websites:
<script async src="https://cstatic.weborama.com/bigsea/contextual/v1/weboctx.min.js"></script>
<script>
window.weboCtx = window.weboCtx || [];
// COLLECT ID
weboCtx.push(function() {
  this.collectURL({
    debug: false,
    clientID: <GoldenFishClientID>, // mandatory
    targetURL: document.URL
  });
});
</script>
The <GoldenFishClientID> placeholder should be replaced with the Publisher's Contextual Client account ID provided by the support team.
Once this setup is done with a publisher, please inform me so that the GoldenFish backend team can start ingesting the collected URLs.
Rss Detector
We have created an RssDetector that accepts a whitelist of domain names provided by a client and checks which of those domains expose an RSS feed. Once confirmed, the validated RSS feed URL is forwarded to the RssFetcherService so that any URLs published through the feed in the future will be profiled.
Input: a CSV file in S3
{bucket}/{messageId}/input/
Output: 3 files in S3: a statistics file, a configuration file (configuration to add to the RssFetcherService) and a file listing the hosts to remove from the blacklist
{bucket}/{messageId}/output/stats.txt
{bucket}/{messageId}/output/configuration.yml
{bucket}/{messageId}/output/blacklisted-hosts.txt
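To illustrate the feed-detection step, here is a hedged sketch (not the actual RssDetector implementation) of how a domain's RSS feed can be discovered: inspect the homepage HTML for an alternate link of type application/rss+xml or application/atom+xml. The function name findRssFeed and the regex-based parsing are simplifications for this example only.

// Returns the absolute feed URL advertised by the page, or null if none is found.
function findRssFeed(html, baseUrl) {
  const linkRe = /<link[^>]+type=["']application\/(?:rss|atom)\+xml["'][^>]*>/i;
  const match = linkRe.exec(html);
  if (!match) return null;
  const href = /href=["']([^"']+)["']/i.exec(match[0]);
  // Resolve a relative href (e.g. /feed.xml) against the domain's base URL.
  return href ? new URL(href[1], baseUrl).toString() : null;
}

// Example: a homepage advertising its feed via autodiscovery.
const page = '<html><head><link rel="alternate" type="application/rss+xml" href="/feed.xml"></head></html>';
console.log(findRssFeed(page, "https://www.example.com")); // https://www.example.com/feed.xml

A domain whose homepage yields a non-null result here would be the kind of host the RssDetector validates and hands over to the RssFetcherService.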