...

  • Bucket name: s3://weborama-adloox<publisher_name>.

  • Path: /seed_urls/<yyyymm>/<dd>/<hh>, for example: /seed_urls/202106/08/02/

  • Frequency: every hour.

  • File name: input_yyyymm-dd-hh.csv.gz

  • File format: compressed CSV

  • File content: one URL per row.
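
As an illustration of the naming scheme above, the following sketch derives the path and file name for a given hour. The function name `seed_url_key` is hypothetical, and two points are assumptions not stated on this page: that the file sits directly under the hourly path, and that the hour is expressed in UTC. The bucket name is left out because it depends on the publisher.

```python
from datetime import datetime, timezone


def seed_url_key(ts: datetime) -> str:
    """Build the object key for the hourly input file covering `ts`.

    Assumption: the file is stored directly under the hourly path.
    """
    prefix = ts.strftime("seed_urls/%Y%m/%d/%H")        # /seed_urls/<yyyymm>/<dd>/<hh>
    filename = ts.strftime("input_%Y%m-%d-%H.csv.gz")   # input_yyyymm-dd-hh.csv.gz
    return f"{prefix}/{filename}"


if __name__ == "__main__":
    # Assumption: timestamps are in UTC; the page does not specify a time zone.
    print(seed_url_key(datetime(2021, 6, 8, 2, tzinfo=timezone.utc)))
```

For 8 June 2021 at 02:00 this yields a key under /seed_urls/202106/08/02/, matching the example path above.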

Note

Each URL in the source file must include its protocol; otherwise the URL will be dropped.

E.g.:

https://ville-data.com/nombre-d-habitants/La-Chataigneraie-85-85059

https://www.passeportsante.net/fr/Actualites/Dossiers/DossierComplexe.aspx?doc=vegetaux-proteines-le-quinoa
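
Because URLs without a protocol are dropped, it may help to filter them out before uploading. The sketch below is one possible pre-check, not the check performed on the receiving side. The helper names `has_protocol` and `write_seed_file` are hypothetical, and accepting only `http` and `https` is an assumption.

```python
import gzip
from urllib.parse import urlsplit


def has_protocol(url: str) -> bool:
    """Return True if the URL starts with an http(s) scheme and has a host."""
    parts = urlsplit(url.strip())
    # Assumption: only http and https count as valid protocols.
    return parts.scheme in ("http", "https") and bool(parts.netloc)


def write_seed_file(urls, path):
    """Write the URLs that carry a protocol to a gzip-compressed file, one per row."""
    kept = [u.strip() for u in urls if has_protocol(u)]
    with gzip.open(path, "wt", encoding="utf-8", newline="") as fh:
        for url in kept:
            fh.write(url + "\n")
    return kept


if __name__ == "__main__":
    candidates = [
        "https://ville-data.com/nombre-d-habitants/La-Chataigneraie-85-85059",
        "ville-data.com/nombre-d-habitants/La-Chataigneraie-85-85059",  # no protocol
    ]
    print(write_seed_file(candidates, "input_202106-08-02.csv.gz"))
```

Whether a header row is expected is not stated here, so the sketch writes none.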

Output Data:

...

{"segments":[{"id":"c3315","score":4},{"id":"c0364","score":4}],"id_type":"MoonFish","lang":"fr","url":"ville-data.com/nombre-d-habitants/La-Chataigneraie-85-85059","url_type":"FullUrl"}

{"segments":[{"id":"c3322","score":3},{"id":"c0363","score":3},{"id":"c0002","score":1}],"id_type":"MoonFish","lang":"fr","url":"www.passeportsante.net/fr/Actualites/Dossiers/DossierComplexe.aspx?doc=vegetaux-proteines-le-quinoa","url_type":"FullUrl"}

{"segments":[{"id":"c3657","score":3},{"id":"c3660","score":3}],"id_type":"MoonFish","lang":"es","url":"http://okdiario.com/look/casa-real/cuenta-atras-cita-mas-incomoda-reyes-1185754/fotos/9","url_type":"FullUrl"}
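
A record like the ones above can be read with a standard JSON parser. The sketch below assumes the output holds one JSON object per line, which the samples suggest but this page does not state. The function name `parse_record` is hypothetical, and the field names are taken from the samples.

```python
import json


def parse_record(line: str):
    """Parse one output record into (url, lang, {segment_id: score})."""
    record = json.loads(line)
    scores = {seg["id"]: seg["score"] for seg in record["segments"]}
    return record["url"], record["lang"], scores


if __name__ == "__main__":
    # Sample record copied from the output examples.
    sample = (
        '{"segments":[{"id":"c3315","score":4},{"id":"c0364","score":4}],'
        '"id_type":"MoonFish","lang":"fr",'
        '"url":"ville-data.com/nombre-d-habitants/La-Chataigneraie-85-85059",'
        '"url_type":"FullUrl"}'
    )
    url, lang, scores = parse_record(sample)
    print(url, lang, scores)
```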

  • Frequency: every 2 hours.

...