FastAI.jl
Textmodel integration
The following now works:
lm = FastText.LanguageModel(true)
classifier = FastText.TextClassifier(lm)
FastText.train_classifier!(classifier) # Would throw an error as I haven't fully enabled the model to work with FastAI's data container.
I think you've re-added some files from TextModels.jl that we don't need; could you remove those? 🙂
Sure! Will clean up in the next commit.
- Deleted the unnecessary files from TextModels.jl.
- Added a utility function for padding batches of sequence data (very naively implemented, as a starting point; a rough sketch follows the example output below).
julia> batches = FastText.load_batchseq(data, task)
julia> batches[1][1]
92-element Vector{Vector{Int64}}:
[25000, 25000, 25000, 25000, 25000, 25000, 25000, 25000]
[633779, 633779, 633779, 633779, 633779, 633779, 633779, 633779]
[2731, 34, 315, 354, 2087, 2209, 70, 1307]
[44047, 435, 633779, 633779, 6589, 633779, 633779, 205]
⋮
[0, 0, 0, 0, 0, 213, 0, 0]
[0, 0, 0, 0, 0, 25, 0, 0]
[0, 0, 0, 0, 0, 1778, 0, 0]
julia> batches[1][2]
8-element Vector{Int64}:
1
1
1
1
1
0
1
1
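For reference, here is a minimal sketch of what such a naive padding helper could look like (`pad_sequences` and `pad_idx` are hypothetical names for illustration, not the actual implementation in this PR):

# Hypothetical sketch: pad every tokenised document in a batch to the
# length of the longest one, then transpose into the timestep-major
# Vector{Vector{Int}} layout that recurrent Flux models consume.
function pad_sequences(docs::Vector{Vector{Int}}; pad_idx::Int = 0)
    maxlen = maximum(length, docs)
    padded = [vcat(doc, fill(pad_idx, maxlen - length(doc))) for doc in docs]
    # One vector per timestep, holding the token at that position for
    # every document in the batch.
    return [[padded[j][t] for j in eachindex(padded)] for t in 1:maxlen]
end

For example, pad_sequences([[1, 2, 3], [4, 5]]) returns [[1, 4], [2, 5], [3, 0]].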
- fastai's way of encoding the data doesn't remove stop words (and Jeremy recommends keeping them), so I removed the stop-word removal step.
- Added a vocab_size keyword argument to TextClassificationSingle.
- Added <unk> and <pad> to the vocabulary.
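To illustrate how <unk> and <pad> can be used at numericalisation time, a rough sketch (build_lookup and numericalize are hypothetical helpers, not this PR's API):

# Hypothetical illustration: reserve indices for the special tokens and
# fall back to <unk> for out-of-vocabulary words.
const SPECIAL_TOKENS = ["<unk>", "<pad>"]

function build_lookup(vocab::Vector{String})
    full = vcat(SPECIAL_TOKENS, vocab)
    return Dict(tok => i for (i, tok) in enumerate(full))
end

function numericalize(tokens::Vector{String}, lookup::Dict{String,Int})
    unk = lookup["<unk>"]
    return [get(lookup, tok, unk) for tok in tokens]
end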
Next:
- A batch loader for text data with padding length = max(sentence length in a batch); see the sketch below.
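Roughly what I have in mind for that loader (just a sketch assuming the documents are already numericalised; the real version would plug into the data container machinery):

# Sketch: split documents into batches and pad each batch only up to the
# longest document in that batch, rather than a global maximum length.
function load_dynamic_batches(docs::Vector{Vector{Int}}, batchsize::Int;
                              pad_idx::Int = 0)
    map(Iterators.partition(docs, batchsize)) do chunk
        maxlen = maximum(length, chunk)
        [vcat(d, fill(pad_idx, maxlen - length(d))) for d in chunk]
    end
end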
Should the vocab CSV files be checked in? I would've assumed they would be artifacts or DataDeps as well.
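If they end up as DataDeps, the registration could look roughly like this (the name, message, and URL below are placeholders, purely to show the shape of it):

using DataDeps

# Placeholder sketch of a DataDeps registration for the vocab files;
# none of these values are real.
register(DataDep(
    "fasttext-vocab",
    "Vocabulary files for the FastText language model integration.",
    "https://example.com/path/to/vocab.csv",
))

# Later, datadep"fasttext-vocab" resolves to the locally fetched directory.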
julia> data, blocks = load(datarecipes()["imdb"])
((mapobs(loadfile, ObsView(::MLDatasets.FileDataset{typeof(identity), String}, ::Vector{Int64})), mapobs(parentname, ObsView(::MLDatasets.FileDataset{typeof(identity), String}, ::Vector{Int64}))), (Paragraph(), Label{String}(["neg", "pos"])))
julia> task = TextClassificationSingle(blocks, data)
SupervisedTask(Paragraph -> Label{String})
julia> model = FastAI.taskmodel(task, FastText.LanguageModel(false, task))
#90 (generic function with 1 method)
julia> batches = FastText.load_batchseq(data, task)
WARNING: both Losses and NNlib export "ctc_loss"; uses of it in module Flux must be qualified
6250-element Vector{Tuple{Vector{Vector{Int64}}, Flux.OneHotArray{UInt32, 2, 1, 2, Vector{UInt32}}}}:
([[35, 35, 35, 35], [3, 3, 3, 9], [40, 18025, 15, 14], [224, 10, 3541, 3040], [737, 34, 24, 505], [49, 7, 809, 3], [4, 4, 221, 3836], [1927, 104, 4,
3], [7, 16, 629, 28440], [6, 351, 7, 17] … [2, 2, 2, 44], [2, 2, 2, 3], [2, 2, 2, 9839], [2, 2, 2, 17], [2, 2, 2, 1041], [2, 2, 2, 27], [2, 2, 2, 3], [2, 2, 2, 3836], [2, 2, 2, 3], [2, 2, 2, 28440]], [0 0 1 1; 1 1 0 0])
julia> using FluxTraining
julia> td, vd = splitobs(batches, at=0.9)
julia> using Flux
julia> learner = Learner(model, Flux.Losses.logitcrossentropy, callbacks=[Metrics(accuracy)]; data=(td, vd))
Learner()
julia> fit!(learner, 1)
Epoch 1 TrainingPhase(): 0%|█ | ETA: 4 days, 3:35:31
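The ETA above is for CPU training; with a GPU available, adding FluxTraining's ToGPU callback should move the model and each batch to the device. A sketch of that (untested in this session):

# Same Learner as above, but with ToGPU() so the model and batches are
# moved to the GPU before each step (requires a CUDA-capable setup).
learner = Learner(model, Flux.Losses.logitcrossentropy;
                  callbacks=[Metrics(accuracy), ToGPU()], data=(td, vd))
fit!(learner, 1)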
The changes have been merged into #258.