Input Corpus
The goal of this feature is to decouple the original GoldenFish profiling from the URL fetching process it was built on.
The Input Corpus feature handles text-only documents without a fetching step, which shortens delivery time and avoids content fetching and paywall issues.
To set up an Input Corpus workflow for a new Client, an initial internal setup phase is necessary:
Creation of two new S3 buckets on AWS: one to store the Client's documents and one to deliver the Client's profiled documents
Creation of a dedicated BigSea touchpoint for this new Input Corpus Client
Creation of a new dedicated Input Corpus Owner on GoldenFish
Creation of a new GoldenFish Client Account linked to this new Owner and new Delivery Target
Info for Publishers and Input Corpus users:
Input Corpus Clients will have to push Input Corpus Data to a specific AWS S3 bucket:
E.g.:
bigsea-europe-prod-corpora-seed
or
weborama-myclient-corpora-seed
The files should be pushed to a dedicated folder:
E.g.: weborama-myclient-corpora-seed/telegramme/
File Format to respect:
Compressed JSON
input_corpus_telegramme_2021.gz
File naming for a daily delivery: input_corpus_telegramme_20220106.gz
File naming for an hourly delivery: input_corpus_telegramme_2022010608.gz
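The naming convention above can be sketched as a small helper. This is only an illustration: the function name `input_corpus_filename` and its parameters are hypothetical, and it assumes the date part is `yyyymmdd` for daily files and `yyyymmddhh` for hourly files, as the two examples suggest.

```python
from datetime import datetime


def input_corpus_filename(source: str, when: datetime, hourly: bool = False) -> str:
    """Build an input file name such as input_corpus_telegramme_20220106.gz.

    `source` is the folder/publisher label (e.g. "telegramme"); this helper
    is hypothetical and only mirrors the examples given in this page.
    """
    # Daily files carry yyyymmdd, hourly files carry yyyymmddhh.
    stamp = when.strftime("%Y%m%d%H" if hourly else "%Y%m%d")
    return f"input_corpus_{source}_{stamp}.gz"


if __name__ == "__main__":
    print(input_corpus_filename("telegramme", datetime(2022, 1, 6)))
    print(input_corpus_filename("telegramme", datetime(2022, 1, 6, 8), hourly=True))
```

The time zone used for the date and hour is not specified here, so the helper simply formats whatever `datetime` it receives.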
The file content:
{
"content_id": "12682665",
"title": "Le maire\u00a0de Belz a souhait\u00e9 ses voeux 2021 en vid\u00e9o\u00a0",
"content": "Covid oblige, impossible d\u2019organiser les v\u0153ux de la municipalit\u00e9 de Belz qui r\u00e9unissent des centaines de personnes aux Ast\u00e9ries. D\u00e8s lors, le maire, Bruno Goasmat, est pass\u00e9 au mode distanciel\u00a0avec des v\u0153ux en ligne.Apr\u00e8s un tour d\u2019horizon de la situation et des…",
"description": "bla bla",
"keywords": "\"Voeux\", \"Bruno Goasmat\""
}
Mandatory fields:
ID (content_id)
Title (title)
Content or Description (content or description)
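A minimal sketch of how a publisher might check the mandatory fields and produce the compressed file content before pushing it. Several points are assumptions, not confirmed by this page: the function names `missing_fields` and `encode_corpus` are hypothetical, a field counts as present only when it is non-empty, and the file holds one JSON object per line.

```python
import gzip
import json


def missing_fields(doc: dict) -> list:
    """Return the mandatory fields that are absent or empty in `doc`."""
    missing = [name for name in ("content_id", "title") if not doc.get(name)]
    # Either "content" or "description" is enough.
    if not (doc.get("content") or doc.get("description")):
        missing.append("content or description")
    return missing


def encode_corpus(docs: list) -> bytes:
    """Serialize documents to gzip-compressed JSON.

    Assumption: one JSON object per line. Raises ValueError on the first
    document that lacks a mandatory field.
    """
    lines = []
    for doc in docs:
        problems = missing_fields(doc)
        if problems:
            raise ValueError(
                f"document {doc.get('content_id')!r} is missing: {', '.join(problems)}"
            )
        # json.dumps escapes non-ASCII characters as \uXXXX by default,
        # which matches the style of the sample above.
        lines.append(json.dumps(doc))
    return gzip.compress("\n".join(lines).encode("utf-8"))


if __name__ == "__main__":
    payload = encode_corpus(
        [{"content_id": "12682665", "title": "Example", "description": "bla bla"}]
    )
    print(len(payload), "compressed bytes")
```

Rejecting the whole batch on the first invalid document is a design choice of this sketch; skipping invalid documents and logging them would be an equally reasonable alternative.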
Output Document Profiles:
Profiles will be delivered to a dedicated AWS S3 bucket:
bigsea-europe-prod-document-profiles-exported/telegramme/inputCorpus/
weborama-myclient-document-profiles-exported/inputCorpus/
The file path will be (one folder level per year, month, day and hour):
bigsea-europe-prod-document-profiles-exported/telegramme/inputCorpus/2021/11/23/14/
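The path layout can be sketched as follows, for example to work out which folder to list for a given hour. The helper name `profile_prefix` is hypothetical, and the year/month/day/hour reading of the path is inferred from the example above.

```python
from datetime import datetime


def profile_prefix(base: str, hour: datetime) -> str:
    """Return the delivery folder for the given hour.

    `base` is the bucket plus the fixed folder part, e.g.
    "bigsea-europe-prod-document-profiles-exported/telegramme/inputCorpus/".
    """
    # Append yyyy/mm/dd/hh/ to the base, avoiding a doubled slash.
    return f"{base.rstrip('/')}/{hour.strftime('%Y/%m/%d/%H')}/"


if __name__ == "__main__":
    base = "bigsea-europe-prod-document-profiles-exported/telegramme/inputCorpus/"
    print(profile_prefix(base, datetime(2021, 11, 23, 14)))
```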
The file format will be:
Compressed JSON files.
The file name format will be:
<yyyymmddhh>-doc-profiles-<owner>-<targetID>.json.gz
E.g.: 2021112314-doc-profiles-tlgic-36.json.gz
The file content:
{"content_id":"12807040","segments":[{"id":"Classical music & instruments_c30258","score":4,"ttl":2592000},{"id":"Halloween_c30115","score":1,"ttl":2592000},{"id":"Films_c30302","score":1,"ttl":2592000},{"id":"Cinemas_c30257","score":1,"ttl":2592000},{"id":"Popular Events_c30178","score":1,"ttl":2592000},{"id":"Art_c30309","score":3,"ttl":2592000},{"id":"Music_c30163","score":2,"ttl":2592000},{"id":"Family_c30295","score":3,"ttl":2592000},{"id":"Going out_c30317","score":2,"ttl":2592000},{"id":"Theatre_c30217","score":2,"ttl":2592000}],"id_type":"MoonFishLabel","lang":"fr"}
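A short sketch of how a consumer might read a delivered file name and a profile record. The names `parse_profile_filename` and `top_segments` are hypothetical. It assumes that the target ID is numeric (as in the example, 36) and that a higher score means a stronger match; neither is stated on this page.

```python
import json
import re

# <yyyymmddhh>-doc-profiles-<owner>-<targetID>.json.gz
# Assumption: targetID is made of digits only.
NAME_RE = re.compile(
    r"^(?P<stamp>\d{10})-doc-profiles-(?P<owner>.+)-(?P<target_id>\d+)\.json\.gz$"
)


def parse_profile_filename(name: str) -> dict:
    """Split a delivered file name into its stamp, owner and target ID."""
    match = NAME_RE.match(name)
    if match is None:
        raise ValueError(f"unexpected file name: {name!r}")
    return match.groupdict()


def top_segments(record_json: str, n: int = 3) -> list:
    """Return the n highest-scoring (segment id, score) pairs of one record."""
    record = json.loads(record_json)
    # sorted() is stable, so segments with equal scores keep their file order.
    ranked = sorted(record["segments"], key=lambda seg: seg["score"], reverse=True)
    return [(seg["id"], seg["score"]) for seg in ranked[:n]]


if __name__ == "__main__":
    print(parse_profile_filename("2021112314-doc-profiles-tlgic-36.json.gz"))
    sample = (
        '{"content_id":"12807040","segments":['
        '{"id":"Halloween_c30115","score":1,"ttl":2592000},'
        '{"id":"Classical music & instruments_c30258","score":4,"ttl":2592000},'
        '{"id":"Art_c30309","score":3,"ttl":2592000}],'
        '"id_type":"MoonFishLabel","lang":"fr"}'
    )
    print(top_segments(sample, n=2))
```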