The goal of this feature is to decouple the original GoldenFish profiling from the URL-fetching process it was based on.
The Input Corpus feature handles text-only documents without a fetching step, which shortens delivery time and avoids the content-fetching and paywall issues.


To set up an Input Corpus workflow for a new Client, an initial internal setup phase is necessary:

  • Creation of a new AWS S3 bucket to store the Client’s documents, plus a new AWS S3 bucket to deliver the Client’s profiled documents

  • Creation of a dedicated BigSea touchpoint for this new Input Corpus Client

  • Creation of a new dedicated Input Corpus Owner on GoldenFish

  • Creation of a new GoldenFish Client Account linked to this new Owner and new Delivery Target


Information for Publishers and Input Corpus users:


Input Corpus Clients will have to push Input Corpus data to a specific AWS S3 bucket:

E.g.:

bigsea-europe-prod-corpora-seed

...

weborama-myclient-corpora-seed

The files should then be pushed to a dedicated folder:

E.g.: weborama-myclient-corpora-seed/telegramme/
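
The push step described above could be scripted along the following lines. This is only a sketch: the helper names `build_object_key` and `push_corpus_file` are hypothetical, and `s3_client` is assumed to behave like a boto3 S3 client (only its `upload_file(Filename, Bucket, Key)` method is used). Credentials and access rights to the bucket are assumed to be configured separately.

```python
import os


def build_object_key(folder: str, filename: str) -> str:
    """Join the dedicated folder and the file name into an S3 object key."""
    return f"{folder.strip('/')}/{filename}"


def push_corpus_file(s3_client, local_path: str, bucket: str, folder: str) -> str:
    """Upload one local file into the dedicated folder of the seed bucket.

    `s3_client` is assumed to expose upload_file(Filename, Bucket, Key),
    as a boto3 S3 client does. Returns the object key that was written.
    """
    key = build_object_key(folder, os.path.basename(local_path))
    s3_client.upload_file(local_path, bucket, key)
    return key


# Example call (requires boto3 and valid AWS credentials):
#   import boto3
#   push_corpus_file(
#       boto3.client("s3"),
#       "input_corpus_telegramme_2021.gz",
#       "weborama-myclient-corpora-seed",
#       "telegramme",
#   )
```

Passing the client in as a parameter keeps the helper easy to test with a stand-in object before pointing it at the real bucket.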

...

File Format to respect:

Compressed JSON

input_corpus_telegramme_2021.gz

...

{
"content_id": "12682665",
"title": "Le maire\u00a0de Belz a souhait\u00e9 ses voeux 2021 en vid\u00e9o\u00a0",
"content": "Covid oblige, impossible d\u2019organiser les v\u0153ux de la municipalit\u00e9 de Belz qui r\u00e9unissent des centaines de personnes aux Ast\u00e9ries. D\u00e8s lors, le maire, Bruno Goasmat, est pass\u00e9 au mode distanciel\u00a0avec des v\u0153ux en ligne.Apr\u00e8s un tour d\u2019horizon de la situation et des…",
"description": "bla bla",
"keywords": "\"Voeux\", \"Bruno Goasmat\""
}
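
As an illustration, a compressed JSON file of this kind could be produced with Python's standard `gzip` and `json` modules. The sketch below assumes one JSON object per line inside the archive; the page does not state how several documents are laid out in a single file, so that layout is an assumption, and `write_corpus_file` and `read_corpus_file` are hypothetical helper names.

```python
import gzip
import json


def write_corpus_file(path, documents):
    """Write documents as gzip-compressed JSON.

    Assumption: one JSON object per line; the expected layout for
    multi-document files is not specified on this page.
    """
    with gzip.open(path, "wt", encoding="utf-8") as handle:
        for document in documents:
            # json.dumps escapes non-ASCII characters as \uXXXX by default,
            # which matches the style of the sample document above.
            handle.write(json.dumps(document) + "\n")


def read_corpus_file(path):
    """Read the file back, mainly to check it before pushing it."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]
```

Reading the file back before the push is a cheap way to confirm that it decompresses and parses cleanly.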

Mandatory fields are:

  • ID

...

  • Title

...

  • Content or Description
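
The mandatory-field rule above can be checked before a file is pushed. The sketch below assumes that ID, Title, Content and Description correspond to the JSON keys `content_id`, `title`, `content` and `description` shown in the sample document, and that an empty or whitespace-only value counts as missing; `missing_mandatory_fields` is a hypothetical helper name.

```python
def _has_value(document, key):
    """True when the key is present with a non-empty, non-blank value."""
    return bool(str(document.get(key) or "").strip())


def missing_mandatory_fields(document):
    """Return the mandatory fields that are absent or empty.

    Assumed key mapping (taken from the sample document):
    ID -> content_id, Title -> title, Content -> content,
    Description -> description.
    """
    missing = []
    if not _has_value(document, "content_id"):
        missing.append("content_id")
    if not _has_value(document, "title"):
        missing.append("title")
    # Either content or description is enough to satisfy the rule.
    if not (_has_value(document, "content") or _has_value(document, "description")):
        missing.append("content or description")
    return missing
```

An empty list means the document satisfies the rule as stated here; whether the platform applies further checks is not covered on this page.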

Output Document Profiles:

...

bigsea-europe-prod-document-profiles-exported/telegramme/inputCorpus/2021/11/23/14/
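
To locate exported profiles programmatically, the prefix can be rebuilt from a timestamp. The sketch below assumes that the trailing path segments are year/month/day/hour, as the example `2021/11/23/14` suggests; the time zone used for the export is not stated here, and `output_prefix` is a hypothetical helper name.

```python
from datetime import datetime


def output_prefix(bucket: str, folder: str, when: datetime) -> str:
    """Build the export prefix for a given hour.

    Assumption: the path ends with year/month/day/hour segments,
    inferred from the example path above.
    """
    return f"{bucket}/{folder}/inputCorpus/{when.strftime('%Y/%m/%d/%H')}/"


prefix = output_prefix(
    "bigsea-europe-prod-document-profiles-exported",
    "telegramme",
    datetime(2021, 11, 23, 14),
)
```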

The file format will be:

Compressed JSON files.

The format name will be:

...