
Input Corpus

The goal of this feature is to decouple GoldenFish's original profiling from the URL fetching process it was based on.
The Input Corpus feature handles text-only documents without a fetching step, which shortens delivery time and avoids content fetching and paywall issues.


To set up an Input Corpus workflow for a new client, an initial internal setup phase is necessary:

  • Creation of a new AWS S3 bucket to store the client's documents, plus a second AWS S3 bucket to deliver the client's profiled documents

  • Creation of a dedicated BigSea touchpoint for the new Input Corpus client

  • Creation of a new dedicated Input Corpus Owner on GoldenFish

  • Creation of a new GoldenFish Client Account linked to the new Owner and to a new Delivery Target


Information for publishers and Input Corpus users:


Input Corpus Clients will have to push Input Corpus Data to a specific AWS S3 bucket:

E.g.:

bigsea-europe-prod-corpora-seed

or

weborama-myclient-corpora-seed

The files should be pushed to a dedicated folder inside the bucket.

E.g.: weborama-myclient-corpora-seed/telegramme/
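As an illustration of the push step, here is a minimal Python sketch that splits a destination such as the one above into a bucket name and a key prefix, then uploads a local file. The helper names (`split_destination`, `object_key`, `upload`) are hypothetical. The upload relies on the third-party boto3 library and assumes AWS credentials with write access to the bucket are already configured.

```python
from pathlib import Path


def split_destination(destination: str) -> tuple[str, str]:
    """Split 'bucket/folder/' into (bucket, key prefix)."""
    bucket, _, prefix = destination.strip("/").partition("/")
    return bucket, f"{prefix}/" if prefix else ""


def object_key(destination: str, local_file: str) -> tuple[str, str]:
    """Return (bucket, key) for a local file pushed to the destination folder."""
    bucket, prefix = split_destination(destination)
    return bucket, prefix + Path(local_file).name


def upload(destination: str, local_file: str) -> None:
    """Upload the file with boto3 (assumption: credentials are configured)."""
    import boto3  # third-party; imported lazily so the helpers above work without it

    bucket, key = object_key(destination, local_file)
    boto3.client("s3").upload_file(local_file, bucket, key)
```

For example, `upload("weborama-myclient-corpora-seed/telegramme/", "input_corpus_telegramme_20220106.gz")` would store the file under the `telegramme/` prefix of the `weborama-myclient-corpora-seed` bucket.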

Required file format:

Compressed JSON (gzip), e.g.:

input_corpus_telegramme_2021.gz

File naming for a daily delivery (yyyymmdd suffix): input_corpus_telegramme_20220106.gz

File naming for an hourly delivery (yyyymmddhh suffix): input_corpus_telegramme_2022010608.gz
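The naming convention above can be sketched in Python as follows. The function name `corpus_file_name` is hypothetical, and the sketch assumes the middle part of the name is the client folder name (`telegramme` in the examples).

```python
from datetime import datetime


def corpus_file_name(folder: str, when: datetime, hourly: bool = False) -> str:
    """Build an Input Corpus file name for a daily or hourly delivery."""
    # Daily deliveries use yyyymmdd, hourly deliveries append the hour (hh).
    stamp = when.strftime("%Y%m%d%H" if hourly else "%Y%m%d")
    return f"input_corpus_{folder}_{stamp}.gz"


if __name__ == "__main__":
    delivery = datetime(2022, 1, 6, 8)
    print(corpus_file_name("telegramme", delivery))
    print(corpus_file_name("telegramme", delivery, hourly=True))
```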

The file content:

{
"content_id": "12682665",
"title": "Le maire\u00a0de Belz a souhait\u00e9 ses voeux 2021 en vid\u00e9o\u00a0",
"content": "Covid oblige, impossible d\u2019organiser les v\u0153ux de la municipalit\u00e9 de Belz qui r\u00e9unissent des centaines de personnes aux Ast\u00e9ries. D\u00e8s lors, le maire, Bruno Goasmat, est pass\u00e9 au mode distanciel\u00a0avec des v\u0153ux en ligne.Apr\u00e8s un tour d\u2019horizon de la situation et des….",
"description": "bla bla",
"keywords": "\"Voeux\", \"Bruno Goasmat\""
}

 

Mandatory fields:

  • ID (content_id)

  • Title (title)

  • Content (content) or Description (description)
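Before pushing a file, the mandatory fields can be checked with a short script. The sketch below is only an illustration: the function names are hypothetical, the field names are taken from the sample document above, and writing one JSON object per line inside the gzip file is an assumption about the expected layout, not a confirmed requirement.

```python
import gzip
import json


def missing_fields(doc: dict) -> list[str]:
    """Return the mandatory fields that are absent or empty in a document."""
    missing = [name for name in ("content_id", "title") if not doc.get(name)]
    # Either "content" or "description" is enough to satisfy the third rule.
    if not (doc.get("content") or doc.get("description")):
        missing.append("content or description")
    return missing


def write_corpus(docs: list[dict], path: str) -> int:
    """Write valid documents to a gzip file, one JSON object per line (assumed layout)."""
    written = 0
    with gzip.open(path, "wt", encoding="utf-8") as fh:
        for doc in docs:
            if missing_fields(doc):
                continue  # skip documents that lack a mandatory field
            fh.write(json.dumps(doc, ensure_ascii=False) + "\n")
            written += 1
    return written
```

`write_corpus` returns the number of documents actually written, which makes it easy to compare against the input size and spot rejected documents.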

Output Document Profiles:

The profiled documents will be delivered to a dedicated AWS S3 bucket, e.g.:

bigsea-europe-prod-document-profiles-exported/telegramme/inputCorpus/

weborama-myclient-document-profiles-exported/inputCorpus/

The file path will include the delivery date and hour (yyyy/mm/dd/hh), e.g.:

bigsea-europe-prod-document-profiles-exported/telegramme/inputCorpus/2021/11/23/14/

The file format will be:

Compressed JSON files.

 

The file name format will be:

<yyyymmddhh>-doc-profiles-<owner>-<targetID>.json.gz

E.g.: 2021112314-doc-profiles-tlgic-36.json.gz

Example of file content:

{"content_id":"12807040","segments":[{"id":"Classical music & instruments_c30258","score":4,"ttl":2592000},{"id":"Halloween_c30115","score":1,"ttl":2592000},{"id":"Films_c30302","score":1,"ttl":2592000},{"id":"Cinemas_c30257","score":1,"ttl":2592000},{"id":"Popular Events_c30178","score":1,"ttl":2592000},{"id":"Art_c30309","score":3,"ttl":2592000},{"id":"Music_c30163","score":2,"ttl":2592000},{"id":"Family_c30295","score":3,"ttl":2592000},{"id":"Going out_c30317","score":2,"ttl":2592000},{"id":"Theatre_c30217","score":2,"ttl":2592000}],"id_type":"MoonFishLabel","lang":"fr"}
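A consumer of these deliveries could locate and read a file along the following lines. This is a sketch under stated assumptions: the function names are hypothetical, the file is assumed to hold one JSON record per line like the record above, and ranking segments by `score` is just one possible use of the data.

```python
import gzip
import json
from datetime import datetime


def profile_location(base: str, when: datetime, owner: str, target_id: int) -> str:
    """Build '<base>/yyyy/mm/dd/hh/<yyyymmddhh>-doc-profiles-<owner>-<targetID>.json.gz'."""
    name = f"{when:%Y%m%d%H}-doc-profiles-{owner}-{target_id}.json.gz"
    return f"{base.rstrip('/')}/{when:%Y/%m/%d/%H}/{name}"


def read_profiles(path: str) -> list[dict]:
    """Read a downloaded delivery file (assumption: one JSON record per line)."""
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]


def top_segments(profile: dict, n: int = 3) -> list[tuple[str, int]]:
    """Return the n highest-scoring segments of one document profile."""
    # sorted() is stable, so segments with equal scores keep their original order.
    ranked = sorted(profile["segments"], key=lambda s: s["score"], reverse=True)
    return [(s["id"], s["score"]) for s in ranked[:n]]
```

Applied to the record above, `top_segments(record, 1)` returns the "Classical music & instruments_c30258" segment with a score of 4.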

 
