datatrove issues

Multi-node parallelism on slurm clusters

6

Hi, let's say I have a slurm cluster that contains 100 nodes, each with 100 cores, and I have 10000 tasks. This is my current code: ``` dist_executor =...

shizhediao

Fix languages listify bug

1

If the passed language is a string, it should be turned into a list like this `[languages]` rather than `list(languages)`. The latter will turn a string into a list of...

BramVanroy
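To illustrate the listify bug described in the entry above, here is a minimal sketch of the difference between `list(languages)` and `[languages]` when a single string is passed. The helper name `listify_languages` is hypothetical and is not taken from the datatrove codebase.

```python
def listify_languages(languages):
    """Hypothetical helper: wrap a single language string in a list."""
    if isinstance(languages, str):
        # A bare string must be wrapped, not iterated.
        return [languages]
    # Any other iterable (tuple, list, set) is converted as usual.
    return list(languages)


# list() iterates over the string and yields single characters.
print(list("nl"))
# Wrapping keeps the language code intact.
print(listify_languages("nl"))
print(listify_languages(("nl", "en")))
```

The `isinstance` check is needed because a string is itself an iterable, so `list()` splits it into characters instead of treating it as one item.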

Add several open-source text extraction libraries

garrethlee

SLURM cannot achieve cross-node parallelism

I have a SLURM cluster with 50 nodes, each node having 96 CPU cores. I want to execute a job on the cluster, and the job is divided into 192...

ShayDuane

Naming Gopher's "max_non_alpha_words_ratio"

In the Gopher filter, there's this filter ``` # that 80 % of words in a document contain at least one alphabetic character if ( self.max_non_alpha_words_ratio and sum([any((c.isalpha() for c...

BramVanroy
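The quantity described by the code comment in the entry above (the share of words containing at least one alphabetic character) can be sketched as follows. This is an illustration only: the function name `alpha_word_ratio` and the sample sentence are assumptions, and the snippet does not reproduce the truncated datatrove code.

```python
def alpha_word_ratio(words):
    """Fraction of words that contain at least one alphabetic character."""
    if not words:
        return 0.0
    # any(...) is True for a word as soon as one character is alphabetic.
    alpha_words = sum(any(c.isalpha() for c in word) for word in words)
    return alpha_words / len(words)


# Hypothetical sample: 4 of the 8 whitespace-separated tokens contain a letter.
words = "The price rose 12 % in 2023 !".split()
ratio = alpha_word_ratio(words)
print(ratio)
```

Because this ratio counts words that do contain letters, a threshold applied to it acts as a minimum on alphabetic words, which seems to be the naming question the issue raises.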

Use spaCy tokenizer for Dutch

3

Dutch is a relatively hard language to tokenize, especially when it comes to the possessive. From my testing in the past, I do prefer spaCy's tokenization though, as it does not...

BramVanroy

Add `job_id_position` Parameter to `launch_slurm_job` Method

### Summary Adds a `job_id_position` parameter to the `launch_slurm_job` method, allowing users to specify the position of the job ID in the `sbatch` command output. Defaults to `-1` if not...

StephenRebel
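As a rough sketch of what a `job_id_position` parameter could do, the snippet below picks the job ID out of `sbatch` output by token position. The function `parse_job_id` and the sample strings are assumptions for illustration; this is not the actual `launch_slurm_job` implementation.

```python
def parse_job_id(sbatch_output: str, job_id_position: int = -1) -> str:
    """Hypothetical helper: return the whitespace-separated token at the given position."""
    tokens = sbatch_output.split()
    return tokens[job_id_position]


# Typical output: the job ID is the last token, so the default of -1 works.
print(parse_job_id("Submitted batch job 12345"))

# Assumed variant where extra text follows the ID, so the position must be set explicitly.
print(parse_job_id("Submitted batch job 12345 on cluster gpu", job_id_position=3))
```

Defaulting to `-1` keeps the common case working, while clusters whose `sbatch` output appends extra text can point at a different token.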

[FEATURE] CC Filter

I was looking at dolma, and they have a nice filter to filter out CreativeCommons data only. It might be worthwhile to add something similar to datatrove, too. https://github.com/allenai/dolma/blob/64886d9db15bd99acea9e28740ae20a510875dfb/python/dolma/taggers/licenses.py#L19

BramVanroy

Video support for datatrove

WIP

guipenedo
