Collect

Server Side Collect

Input Data:

Bucket name: s3://weborama-<publisher_name>
Path: /seed_urls/<yyyymm>/<dd>/<hh>, for example /seed_urls/202106/08/02/
Frequency: every hour.
File name: input_yyyymm-dd-hh.csv.gz
File format: gzip-compressed CSV
File content: one URL per row.

Each URL in the source file must include its protocol (for example https://); URLs without a protocol will be dropped.
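As an illustration of the input conventions above, here is a minimal Python sketch that builds the hourly object key and the gzip-compressed CSV payload. The function names (seed_key, has_protocol, build_payload) are hypothetical, and treating only http and https as valid protocols is an assumption; the source only states that the protocol must be present. The upload itself, and whether the hour is expressed in UTC, are left out because they are not specified here.

```python
import csv
import gzip
import io
from datetime import datetime
from urllib.parse import urlsplit


def seed_key(ts: datetime) -> str:
    """Build the object key (path + file name) for the hour of `ts`."""
    return ts.strftime("seed_urls/%Y%m/%d/%H/input_%Y%m-%d-%H.csv.gz")


def has_protocol(url: str) -> bool:
    """Assumption: only http:// and https:// URLs count as having a protocol."""
    parts = urlsplit(url)
    return parts.scheme in ("http", "https") and bool(parts.netloc)


def build_payload(urls) -> bytes:
    """Return a gzip-compressed CSV with one URL per row.

    URLs without a protocol are skipped locally, since they would be
    dropped during ingestion anyway.
    """
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for url in urls:
        if has_protocol(url):
            writer.writerow([url])
    return gzip.compress(out.getvalue().encode("utf-8"))


key = seed_key(datetime(2021, 6, 8, 2))
payload = build_payload([
    "https://example.com/article-1",
    "example.com/missing-protocol",
    "http://example.org/",
])
```

With the timestamp above, key matches the example path from this section (seed_urls/202106/08/02/) followed by the file name input_202106-08-02.csv.gz, and the second URL is left out of the payload.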

Client Side Collect

We deployed the Contextual Collect tag. Before a Publisher can use it to collect URLs, you must first ask goldenfish-support@weborama.com to create a dedicated contextual client and provide you with its client ID.

You will need this ID for the following setup, which the client should implement on their websites:

<GoldenFishClientID> should be replaced by the Publisher's Contextual client account ID provided by the support team.

Once this setup is done with a publisher, inform me so that the GoldenFish backend team can start ingesting the collected URLs.

Rss Detector

We have created an RssDetector that takes a whitelist of domain names provided by a client and checks which of those domains have an RSS feed. Once a feed is confirmed, the URL of the domain's RSS feed is forwarded to the RssFetcherService. This service then captures any URLs published by the RSS feeds from that point on, so that they can be profiled.

For example, Disney UK provided us with a CSV containing a whitelist of domains. We ran the RssDetector on the file, and it identified ~39% of the domains as having an RSS feed. A new touchpoint, 128, was then created for Disney, along with a corresponding Owner in Contextual. The final stage is to update the RssFetcherService to include the RSS feeds from Disney.
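The detection step can be pictured with the sketch below. It is not the actual RssDetector, whose implementation is not described here; it shows one common approach, RSS autodiscovery, which looks in a page's HTML for a link tag with rel="alternate" and type="application/rss+xml". The names FeedLinkParser and find_feeds are hypothetical, and fetching each whitelisted domain's home page is assumed to happen elsewhere.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

RSS_TYPE = "application/rss+xml"


class FeedLinkParser(HTMLParser):
    """Collect RSS feed URLs advertised through <link rel="alternate"> tags."""

    def __init__(self, base_url: str):
        super().__init__()
        self.base_url = base_url
        self.feeds = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        # Attribute values can be None (e.g. <link rel>), so normalise them.
        attr = {name: (value or "") for name, value in attrs}
        rel_tokens = attr.get("rel", "").lower().split()
        is_rss = attr.get("type", "").lower() == RSS_TYPE
        if "alternate" in rel_tokens and is_rss and attr.get("href"):
            # Resolve relative hrefs against the page URL.
            self.feeds.append(urljoin(self.base_url, attr["href"]))


def find_feeds(base_url: str, html: str) -> list:
    """Return the RSS feed URLs declared in `html`, or an empty list."""
    parser = FeedLinkParser(base_url)
    parser.feed(html)
    return parser.feeds


page_with_feed = (
    '<html><head><title>News</title>'
    '<link rel="alternate" type="application/rss+xml" href="/feed.xml">'
    '</head><body></body></html>'
)
page_without_feed = "<html><head><title>Shop</title></head><body></body></html>"

feeds = find_feeds("https://example.com/", page_with_feed)
no_feeds = find_feeds("https://example.org/", page_without_feed)
```

In this sketch, a domain counts as having an RSS feed when find_feeds returns a non-empty list, and the returned URLs are what would be handed to the RssFetcherService. Sites that publish a feed without declaring it in their HTML would be missed by this approach.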