datatrove

Freeing data processing from scripting madness by providing a set of platform-agnostic customizable pipeline processing blocks.

Results 69 datatrove issues

Hi, Let's say, I have a slurm cluster that contains 100 nodes, each node has 100 cores. Assuming I have 10000 tasks. This is my current code: ``` dist_executor =...

If the passed language is a string, it should be turned into a list like this `[languages]` rather than `list(languages)`. The latter will turn a string into a list of...
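
To illustrate the difference this issue describes, here is a minimal sketch. The helper name `normalize_languages` is hypothetical and is not taken from datatrove; it only shows how wrapping a string in `[...]` differs from calling `list(...)` on it.

```python
def normalize_languages(languages):
    """Hypothetical helper: always return a list of language codes."""
    # A bare string is one language, so wrap it instead of iterating over it.
    if isinstance(languages, str):
        return [languages]
    # Any other iterable (tuple, set, list) is copied into a new list.
    return list(languages)


# list() iterates over the string and splits it into characters.
print(list("en"))                        # ['e', 'n']
# Wrapping keeps the language code intact.
print(normalize_languages("en"))         # ['en']
print(normalize_languages(("en", "nl"))) # ['en', 'nl']
```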

I have a SLURM cluster with 50 nodes, each node having 96 CPU cores. I want to execute a job on the cluster, and the job is divided into 192...

In the Gopher filter, there's this filter ``` # that 80 % of words in a document contain at least one alphabetic character if ( self.max_non_alpha_words_ratio and sum([any((c.isalpha() for c...
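
The check named in the comment above can be sketched roughly as follows. This is not datatrove's implementation: the function names, the whitespace-based word splitting, and the `min_alpha_words_ratio` parameter are assumptions made for illustration.

```python
def alpha_word_ratio(words):
    """Fraction of words that contain at least one alphabetic character."""
    if not words:
        return 0.0
    return sum(any(c.isalpha() for c in w) for w in words) / len(words)


def passes_alpha_check(text, min_alpha_words_ratio=0.8):
    """Hypothetical check: keep a document only if enough words are alphabetic."""
    words = text.split()  # assumption: plain whitespace splitting
    return alpha_word_ratio(words) >= min_alpha_words_ratio


print(passes_alpha_check("the quick brown fox jumps"))  # True, all words are alphabetic
print(passes_alpha_check("1 2 3 4 word"))               # False, only 1 of 5 words qualifies
```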

Dutch is a relatively hard language to tokenize, especially when it comes to the possessive. From my testing in the past, I do prefer spaCy's tokenization though, as it does not...

### Summary Adds a `job_id_position` parameter to the `launch_slurm_job` method, allowing users to specify the position of the job ID in the `sbatch` command output. Defaults to `-1` if not...
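
A minimal sketch of the idea behind such a parameter, assuming the job ID is picked out of the whitespace-separated tokens of the `sbatch` output. The function `parse_job_id` is hypothetical and is not the code from this pull request.

```python
def parse_job_id(sbatch_output: str, job_id_position: int = -1) -> str:
    """Hypothetical helper: return the token at job_id_position in sbatch output."""
    tokens = sbatch_output.split()
    return tokens[job_id_position]


# Default sbatch output ends with the job ID, so -1 works.
print(parse_job_id("Submitted batch job 12345"))  # 12345
# Some setups append extra text, so the position has to be given explicitly.
print(parse_job_id("Submitted batch job 12345 on cluster main", job_id_position=3))  # 12345
```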

I was looking at dolma, and they have a nice filter that keeps only Creative Commons data. It might be worthwhile to add something similar to datatrove, too. https://github.com/allenai/dolma/blob/64886d9db15bd99acea9e28740ae20a510875dfb/python/dolma/taggers/licenses.py#L19