
Create dataset gaceta_parlamentaria_mexico

Open albertvillanova opened this issue 4 years ago • 2 comments

  • uid: gaceta_parlamentaria_mexico
  • type: primary
  • description:
    • name: Gaceta Parlamentaria - Mexico
    • description: Contains the constitutive documents and legislative agendas approved by the Chamber of Deputies of Mexico.
    • homepage: http://gaceta.diputados.gob.mx/
    • validated: True
  • languages:
    • language_names:
      • Spanish
    • language_comments:
    • language_locations:
      • Latin America and the Caribbean
      • Mexico
    • validated: False
  • custodian:
    • name: Camara de Diputados
    • in_catalogue:
    • type: A government organization
    • location: Mexico
    • contact_name:
    • contact_email: [email protected]
    • contact_submitter: False
    • additional: http://www.diputados.gob.mx/
    • validated: False
  • availability:
    • procurement:
      • for_download: Yes - it has a direct download link or links
      • download_url: http://gaceta.diputados.gob.mx/
      • download_email:
    • licensing:
      • has_licenses: No
      • license_text:
      • license_properties:
      • license_list:
    • pii:
      • has_pii: No
      • generic_pii_likely:
      • generic_pii_list:
      • numeric_pii_likely:
      • numeric_pii_list:
      • sensitive_pii_likely:
      • sensitive_pii_list:
      • no_pii_justification_class: general knowledge not written by or referring to private persons
      • no_pii_justification_text:
    • validated: False
  • source_category:
    • category_type: collection
    • category_web:
    • category_media: Laws
    • validated: False
  • media:
    • category:
      • text
    • text_format:
      • .PDF
    • audiovisual_format:
    • image_format:
    • database_format:
    • text_is_transcribed: No
    • instance_type: post
    • instance_count: 1K<n<10K
    • instance_size: n>10,000
    • validated: False
  • fname: gaceta_parlamentaria_mexico.json
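As a rough illustration of how the entry above could be checked programmatically, the sketch below rebuilds a subset of its fields as a Python dict and lists the sections whose `validated` flag is still false. It assumes that `gaceta_parlamentaria_mexico.json` nests its fields the same way as the bullet list; the helper name `unvalidated_sections` is hypothetical and not part of any catalogue tooling.

```python
# Subset of the catalogue entry above, rebuilt by hand.
# Assumption: the JSON file nests fields the same way as the bullet list.
entry = {
    "uid": "gaceta_parlamentaria_mexico",
    "type": "primary",
    "description": {"name": "Gaceta Parlamentaria - Mexico", "validated": True},
    "languages": {"language_names": ["Spanish"], "validated": False},
    "custodian": {"name": "Camara de Diputados", "validated": False},
    "availability": {
        "procurement": {
            "for_download": "Yes - it has a direct download link or links"
        },
        "validated": False,
    },
    "source_category": {"category_type": "collection", "validated": False},
    "media": {"text_format": [".PDF"], "validated": False},
}


def unvalidated_sections(record):
    """Return names of top-level sections whose `validated` flag is False."""
    return [
        name
        for name, section in record.items()
        if isinstance(section, dict) and section.get("validated") is False
    ]


print(unvalidated_sections(entry))
```

In this entry only the `description` section is marked as validated, so every other section is reported.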

albertvillanova, Nov 23 '21 10:11

#self-assign

cakiki, Jan 29 '22 11:01

@albertvillanova done: https://huggingface.co/datasets/bigscience-catalogue-data/gaceta_parlamentaria_mexico

The data needs further processing (HTML / PDF / DOC).

I'd also appreciate it if a Spanish speaker could go through the "TODO" folder and confirm that it's mostly noise.

Scraping was based on this script
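One possible first step for the further processing mentioned above is to bucket the scraped files by format, so that each type can be passed to its own text extractor. This is a minimal sketch and not the scraping script referred to in this comment; the function name `group_by_format`, the extension mapping, and the example file names are all assumptions.

```python
from collections import defaultdict
from pathlib import PurePosixPath

# Assumed mapping from file extension to the formats named in the comment.
FORMATS = {
    ".html": "html",
    ".htm": "html",
    ".pdf": "pdf",
    ".doc": "doc",
    ".docx": "doc",
}


def group_by_format(paths):
    """Group file paths by format; unrecognised extensions go under 'other'."""
    groups = defaultdict(list)
    for path in paths:
        # Compare extensions case-insensitively (".PDF" and ".pdf" are the same).
        suffix = PurePosixPath(path).suffix.lower()
        groups[FORMATS.get(suffix, "other")].append(path)
    return dict(groups)


# Hypothetical file names, for illustration only.
example = ["a/index.html", "b/agenda.PDF", "c/acta.doc", "d/unknown.bin"]
print(group_by_format(example))
```

Files that end up under `other` would be natural candidates for manual review.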

cakiki, Jan 29 '22 21:01