Server Side Collect
Input Data:
bucket name: s3://weborama-<publisher_name>
path: /seed_urls/<yyyymm>/<dd>/<hh>. For example: /seed_urls/202106/08/02/
Frequency: every hour.
File name: input_yyyymm-dd-hh.csv.gz
File format: compressed CSV
File content: one URL per row.
Each URL must include its protocol (e.g. https://) in the source file; URLs without a protocol will be dropped.
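The hourly input file described above can be produced with a few lines of Node.js. This is a minimal sketch, not a reference implementation: the URLs and the output filename are placeholder examples matching the input_yyyymm-dd-hh.csv.gz naming scheme.

const fs = require("fs");
const zlib = require("zlib");

// One URL per row; every URL must include its protocol or it will be dropped.
const urls = [
  "https://www.example.com/article/1",
  "https://www.example.com/article/2",
];

const csv = urls.join("\n") + "\n";

// Compressed CSV, named after the hour it covers (here: 2021-06-08, hour 02).
fs.writeFileSync("input_202106-08-02.csv.gz", zlib.gzipSync(csv));

The resulting file would then be uploaded to the matching hourly path in the publisher's bucket, e.g. s3://weborama-<publisher_name>/seed_urls/202106/08/02/.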
Client Side Collect
We have deployed the Contextual Collect tag. Before a Publisher can use it to collect URLs, you must first ask goldenfish-support@weborama.com to create a dedicated contextual client and provide you with the client ID.
You will need this ID for the following setup, which should be implemented by the client on their websites:
<script async src="https://cstatic.weborama.com/bigsea/contextual/v1/weboctx.min.js"></script>
<script>
window.weboCtx = window.weboCtx || [];
// COLLECT ID
weboCtx.push(function() {
  this.collectURL({
    debug: false,
    clientID: <GoldenFishClientID>, // mandatory
    targetURL: document.URL
  });
});
</script>
The <GoldenFishClientID> placeholder should be replaced with the Publisher's Contextual Client account ID provided by the support team.
Once this setup is done with a publisher, please inform me so that the GoldenFish backend team can start ingesting the collected URLs.
Rss Detector
We have created an RssDetector that accepts a whitelist of domain names provided by a client and checks which of those domains expose an RSS feed. Once confirmed, the validated RSS feed URL is forwarded to the RssFetcherService so that any URLs published through the feed in the future will be profiled.
Input: a CSV file in S3
{bucket}/{messageId}/input/
Output: 3 files in S3: a statistics file, a configuration file (configuration to add to the RssFetcherService) and a file listing the hosts to remove from the blacklist
{bucket}/{messageId}/output/stats.txt
{bucket}/{messageId}/output/configuration.yml
{bucket}/{messageId}/output/blacklisted-hosts.txt
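To illustrate the feed-detection step, here is a hedged sketch (not the actual RssDetector implementation) of how a domain's RSS feed can be discovered: inspect the homepage HTML for an alternate link of type application/rss+xml or application/atom+xml. The function name findRssFeed and the regex-based parsing are simplifications for this example only.

// Returns the absolute feed URL advertised by the page, or null if none is found.
function findRssFeed(html, baseUrl) {
  const linkRe = /<link[^>]+type=["']application\/(?:rss|atom)\+xml["'][^>]*>/i;
  const match = linkRe.exec(html);
  if (!match) return null;
  const href = /href=["']([^"']+)["']/i.exec(match[0]);
  // Resolve a relative href (e.g. /feed.xml) against the domain's base URL.
  return href ? new URL(href[1], baseUrl).toString() : null;
}

// Example: a homepage advertising its feed via autodiscovery.
const page = '<html><head><link rel="alternate" type="application/rss+xml" href="/feed.xml"></head></html>';
console.log(findRssFeed(page, "https://www.example.com")); // https://www.example.com/feed.xml

A domain whose homepage yields a non-null result here would be the kind of host the RssDetector validates and hands over to the RssFetcherService.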