dkan
Update MysqlImport.php
The detectDelimiter function in the code aims to automatically detect the delimiter used in a CSV file. This is useful because in CSV files, the delimiter (the character that separates fields) can vary depending on regional settings and language. For example, in the United States and other English-speaking countries, the most common delimiter is a comma (,), while in some European and Latin American countries, a semicolon (;) is used as the delimiter.
The detectDelimiter function works as follows:
1. It defines a set of possible delimiters to check: , (comma), ; (semicolon), and \t (tab).
2. It opens the CSV file and reads a limited number of lines (in this case, a maximum of 10 lines) for analysis.
3. For each possible delimiter, it computes the average number of columns in the analyzed lines by dividing the total number of columns across all lines by the number of lines.
4. The delimiter that produces the highest average number of columns is considered the most likely one and is returned as the detected delimiter.
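The steps above can be sketched roughly as follows. This is a minimal illustration of the averaging heuristic, not the code in this PR: the function name `detect_delimiter_sketch` and its parameters are assumptions made for the example, and it uses plain `fgetcsv()` rather than anything DKAN-specific.

```php
<?php

/**
 * Sketch only: pick the candidate delimiter with the highest average
 * column count. Names are hypothetical, not taken from MysqlImport.php.
 */
function detect_delimiter_sketch(string $path, int $maxLines = 10): string {
  $candidates = [',', ';', "\t"];
  $best = ',';
  $bestAverage = 0.0;

  foreach ($candidates as $delimiter) {
    $handle = fopen($path, 'r');
    if ($handle === FALSE) {
      throw new RuntimeException("Unable to open $path");
    }
    $lines = 0;
    $columns = 0;
    // Read at most $maxLines rows, parsed with the current candidate.
    while ($lines < $maxLines && ($row = fgetcsv($handle, 0, $delimiter, '"', '\\')) !== FALSE) {
      $columns += count($row);
      $lines++;
    }
    fclose($handle);

    // Average columns per row; on a tie the earlier candidate is kept.
    $average = $lines > 0 ? $columns / $lines : 0.0;
    if ($average > $bestAverage) {
      $bestAverage = $average;
      $best = $delimiter;
    }
  }

  return $best;
}
```

In this sketch, a file whose rows look like `name;city;age` averages three columns for the semicolon and one column for the comma and the tab, so the semicolon is selected.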
In summary, this function automatically determines the delimiter of a given CSV file. This is especially useful for CSV files generated under different regional settings or languages, because the import can proceed without anyone having to specify the delimiter manually.
However, to use a semicolon as the delimiter you need to edit the file docroot/core/lib/Drupal/Core/Database/Connection.php
and change the value of 'allow_delimiter_in_query' from false to true.
With that change, the front end and the database are structured correctly.
fixes [org/repo/issue#]
- [ ] Test coverage exists
- [ ] Documentation exists
QA Steps
To test, you can use the following harvest process. The data.json it points to contains only 3 datasets.
- drush dkan:harvest:register '{ "identifier": "50_datasets", "extract": { "type": "\\Harvest\\ETL\\Extract\\DataJson", "uri": "https://raw.githubusercontent.com/tridoxx/urlsdatosabiertos/main/medatapequeno.json" }, "transforms": [], "load": { "type": "\\Drupal\\harvest\\Load\\Dataset" } }'
- drush dkan:harvest:run 50_datasets
- drush queue:run datastore_import
- [x] Add manual QA steps in checklist format for a reviewer to perform to confirm that the feature or fix is working. Include as much detail as possible so that the reviewer doesn't lose time figuring out how to perform the steps.
@tridoxx interesting approach, and I think this has a lot of potential. Are you sure the highest number of columns is the best measure of which delimiter is correct? It seems like there are a lot of cases where this would not be true. Imagine a tab-separated file with only three columns, one of which is a long text field that often contains several commas. I would recommend also checking that the number of columns per row is identical; if it is not, we have clearly not identified the delimiter correctly.
Also, to merge this it would need to meet Drupal coding standards and contain tests for the new methods. Thanks!
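To make the suggested consistency check concrete, here is one possible sketch. It is an illustration under assumptions, not code from this PR or from DKAN: the function name `detect_consistent_delimiter_sketch` is hypothetical, and returning NULL when no candidate qualifies is a design choice made for the example.

```php
<?php

/**
 * Sketch only: accept a candidate delimiter only when every sampled row
 * yields the same column count, then prefer the one with the most columns.
 * Returns NULL when no candidate splits the rows consistently.
 */
function detect_consistent_delimiter_sketch(string $path, int $maxLines = 10): ?string {
  $best = NULL;
  // Require more than one column so a delimiter that never occurs is ignored.
  $bestColumns = 1;

  foreach ([',', ';', "\t"] as $delimiter) {
    $handle = fopen($path, 'r');
    if ($handle === FALSE) {
      throw new RuntimeException("Unable to open $path");
    }
    $counts = [];
    while (count($counts) < $maxLines && ($row = fgetcsv($handle, 0, $delimiter, '"', '\\')) !== FALSE) {
      $counts[] = count($row);
    }
    fclose($handle);

    // A single distinct value means all sampled rows agree on the column count.
    if ($counts !== [] && count(array_unique($counts)) === 1 && $counts[0] > $bestColumns) {
      $bestColumns = $counts[0];
      $best = $delimiter;
    }
  }

  return $best;
}
```

For the tab-separated case described above, a free-text column with a varying number of commas gives a different column count on each row when split on commas, so the comma is rejected and the tab, with three columns on every row, is selected. Returning NULL leaves the caller free to fall back to a default or to ask for an explicit delimiter.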
@tridoxx going to close this as we've not heard back from you in a while on this, feel free to re-open or submit a new PR.