data_tooling
Create dataset gaceta_parlamentaria_mexico
- uid: gaceta_parlamentaria_mexico
- type: primary
- description:
- name: Gaceta Parlamentaria - Mexico
- description: Contains the constitutive documents and legislative agendas approved by the Chamber of Deputies of Mexico.
- homepage: http://gaceta.diputados.gob.mx/
- validated: True
- languages:
- language_names:
- Spanish
- language_comments:
- language_locations:
- Latin America and the Caribbean
- Mexico
- validated: False
- language_names:
- custodian:
- name: Camara de Diputados
- in_catalogue:
- type: A government organization
- location: Mexico
- contact_name:
- contact_email: [email protected]
- contact_submitter: False
- additional: http://www.diputados.gob.mx/
- validated: False
- availability:
- procurement:
- for_download: Yes - it has a direct download link or links
- download_url: http://gaceta.diputados.gob.mx/
- download_email:
- licensing:
- has_licenses: No
- license_text:
- license_properties:
- license_list:
- pii:
- has_pii: No
- generic_pii_likely:
- generic_pii_list:
- numeric_pii_likely:
- numeric_pii_list:
- sensitive_pii_likely:
- sensitive_pii_list:
- no_pii_justification_class: general knowledge not written by or referring to private persons
- no_pii_justification_text:
- validated: False
- procurement:
- source_category:
- category_type: collection
- category_web:
- category_media: Laws
- validated: False
- media:
- category:
- text
- text_format:
- audiovisual_format:
- image_format:
- database_format:
- text_is_transcribed: No
- instance_type: post
- instance_count: 1K<n<10K
- instance_size: n>10,000
- validated: False
- category:
- fname: gaceta_parlamentaria_mexico.json
#self-assign
@albertvillanova done: https://huggingface.co/datasets/bigscience-catalogue-data/gaceta_parlamentaria_mexico
The data needs further processing: the scraped files are a mix of HTML, PDF, and DOC.
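As a possible first step for that processing, here is a minimal sketch that buckets the downloaded files by format so each type can go to its own extractor. The directory layout, the `FORMAT_BY_SUFFIX` mapping, and the `group_by_format` helper are assumptions made for illustration; they are not taken from the scraping script.

```python
from collections import defaultdict
from pathlib import Path

# Hypothetical mapping from file extension to format bucket;
# adjust it to whatever the scraped dump actually contains.
FORMAT_BY_SUFFIX = {
    ".html": "html",
    ".htm": "html",
    ".pdf": "pdf",
    ".doc": "doc",
    ".docx": "doc",
}


def group_by_format(root):
    """Walk `root` recursively and bucket file paths by document format."""
    groups = defaultdict(list)
    for path in sorted(Path(root).rglob("*")):
        if path.is_file():
            # Unknown extensions are collected under "other" for manual review.
            fmt = FORMAT_BY_SUFFIX.get(path.suffix.lower(), "other")
            groups[fmt].append(path)
    return dict(groups)


if __name__ == "__main__":
    import sys

    # Directory to scan; defaults to the current directory.
    root = sys.argv[1] if len(sys.argv) > 1 else "."
    for fmt, paths in group_by_format(root).items():
        print(f"{fmt}: {len(paths)} files")
```

Running it against the download directory prints a per-format file count, which gives a rough idea of how much work each extractor (HTML, PDF, DOC) would have.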
I'd also appreciate it if a Spanish speaker could go through the "TODO" folder and confirm that it's mostly noise.
Scraping was based on this script