Text preprocessing for language models and classification models

In [1]:
%load_ext autoreload
%autoreload 2

%matplotlib inline
In [2]:
from lgm import *
In [3]:
# torch.cuda.get_device_name(torch.cuda.current_device())
In [4]:
path = untar_data(URLs.IMDB)
In [5]:
# path.ls()

We define a subclass of ItemList, called TextList, that reads the texts from the corresponding files.

  • ItemList is basically just a container that has length and can be indexed / sliced into.
  • TextList just adds a method from_files that recursively lists the files with the .txt extension under a path; their contents are read when the list is indexed.

Just in case there are some text log files, we restrict the ones we take to the training, test, and unsupervised folders.
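
To make the description above concrete, here is a minimal sketch of the two containers. It is not the lgm implementation: the get hook, the extension argument and the sorting of the file list are assumptions made for illustration; only the names ItemList, TextList, from_files and include come from the notebook.

```python
from pathlib import Path


class ItemList:
    """A thin container: it has a length and supports indexing and slicing."""

    def __init__(self, items):
        self.items = list(items)

    def __len__(self):
        return len(self.items)

    def get(self, item):
        # Hypothetical hook: subclasses turn a stored item into the actual data.
        return item

    def __getitem__(self, idx):
        if isinstance(idx, slice):
            return [self.get(item) for item in self.items[idx]]
        return self.get(self.items[idx])


class TextList(ItemList):
    """Stores file paths; a text is read only when its item is accessed."""

    @classmethod
    def from_files(cls, path, include=None, extension=".txt"):
        path = Path(path)
        # Restrict the search to the named top-level folders, if any are given.
        roots = [path / name for name in include] if include else [path]
        files = sorted(
            f
            for root in roots if root.is_dir()
            for f in root.rglob(f"*{extension}") if f.is_file()
        )
        return cls(files)

    def get(self, item):
        return Path(item).read_text(encoding="utf8")
```

With this design, stray .txt files outside the included folders are never listed, and building the list is cheap because no file is opened until it is indexed.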

In [6]:
textlist = TextList.from_files(path, include=['train', 'test', 'unsup'])

We should expect a total of 100,000 texts.

In [7]:
len(textlist.items)
Out[7]:
100000

Note that textlist contains only the paths to the files. The files are read "just in time" when we index or slice into the list.

In [8]:
# textlist

Here are the first 3 texts:

In [9]:
textlist[:3]
Out[9]:
["I think the cards were stacked against Webmaster, because right from the start there was this itchy feeling, like something was wrong but I couldn't quite put my finger on it. Then it hit me. Dubbed. For a little while, they managed most of the lines either as voice over or off screen, with just a little hint here and there, until it became painfully obvious. This is the kind of dubbing that grates on the nerves, with nothing even remotely funny about it. I hate dubbing, but at least, however misplaced, martial arts films badly dubbed tend to have a sense of humour about it.<br /><br />What I wanted was a film about a hacker doing actual hacking and stuff like that. Maybe like a reverse side of the table of the movie Hackers (being about the person trying to keep them out instead about the people trying to get in). What I got was some poorly written, nonsensical at times murder mystery with a ton of bad chase sequences, a supposedly inept hacker who was neutered without his little ego, and a director who obviously didn't know how to handle a camera. I just wanted to reach in there, grab the camera from the guy, and shoot the dang thing myself. The editing wasn't much better. The acting? Well, I guess if the lead guy didn't have such a bad script to work from, he'd be at least watchable. The main bad guy was OK, too, but pretty much everyone else was a joke. Dubbing didn't help, but the acting was pretty bad even taking that into consideration.<br /><br />Before I get into more bad, let's perk up to a few good things. Well, one or two. Despite the rudimentary graphics, I rather enjoyed the cyberworld stuff, what little there was of it, and would have much rather watched a movie mostly about, in, and around that than the tepid surroundings outside in the 'real world'. The falsifying thumbprint thing seemed kind of cool, but ended up being rather useless in the scheme of things. 
The heart gadget, which reminded me of Guillermo Del Toro's Cronos, was interesting, though as a plot device ripped directly from the pages of Escape from New York, it was horribly conceived in the long run. There's one point where the bad guy is unconscious and our hero is right there. Why didn't he try the bad guy's thumb print then (since the heart device was thumb activated)? Nonetheless, some interesting gadgets and cyber stuff, if only they could have been utilized better.<br /><br />Now, who here dares compare this movie to Blade Runner? Both films take place sometime in the future and there's some kind of off kilter type of romance in it, sort of, in both films. There's the investigation of a murder, but I've seen many a film with a murder at its center that are nothing like either Blade Runner or Webmaster. Identity, for instance. That's about where the similarities end. Period. There is no comparison, just as there wouldn't be between Blade Runner and Hackers. Ridley Scott is a brilliant director with a great mind for art direction and knows some fundamentals of film-making, like where to put the camera. Webmaster is cheap, and not just because of the budget. It's cheap because of bad writing, and because it more often than not takes the easy way out (like writing in a character who barely appeared in the first place to save the girl in the end to get her to Point B by herself without the Hero, so that he could do his thing. Very convenient. Or, the attempted sympathy factor for a character we have no reason to care for. Or, inane things like the car set up to stall them just enough for someone to get away.).<br /><br />It tries to be hip, it tries to be exploitive, it even tries a twist ending that's not the least bit surprising, and it tries to be thrilling. But a bunch of near identical chase sequences, bad writing, editing, horribly shot, bad acting, etc. does not a thrilling movie make.",
 'the boys were the most appealing things in the entire movie. the girls were lame and pathetic, i mean, how can they own their own clothing line, dolls, movies, producing studios, and not smell this bomb from far away? in order to gain some sort of responsibility, which i dont really see the sense in the punishment..., they are sent to paris, far far far away from home to live with the so-called strict grandfather who holds an important standing with paris. i cant really remember what he was, so who really cares? the detail doesnt help, the girls are sent to paris to learn something.. so what exactly do they learn when they meet two french boys and are able to manipulate the guy that supposed to watch them so they can meet these guys on scooters? the typical pre-teen movie, having all pre-teens wishing to misbehave and be able to afford the trip to paris or some far away country away from parents? i dont really like the olsens anyways, they never could really shake off the image of michelle, on full house... in case you didnt see that, then you were lucky from the start. (F F-)',
 "I would like to comment on how the girls are chosen. why is that their are always more white women chosen then their are black women. every episode their is always more white women then black one's. as if to say white women are better looking then black women. I would like for once see more black women then white. and it not just your show it's like that in a lot of shows always more white's. but i would have thought since you as the head honcho of the show you would see this yourself and have more black women on your show. but you are just like the rest trying to act like you are so fair and nice. you are just a big fony hypocrite."]

We randomly split the data into a training set (about 90%) and a validation set (about 10%).
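
As a rough sketch of how such a split can work, assume the splitter is called once per item and returns True when the item belongs in the validation set. The bodies below and the rng argument are illustrative assumptions, not the lgm code.

```python
import random
from functools import partial


def random_splitter(item, proportion_valid=0.1, rng=random):
    # The item itself is ignored: each one lands in validation with
    # probability proportion_valid, independently of the others.
    return rng.random() < proportion_valid


def split_by_func(items, is_valid):
    # Route every item to the train or validation list.
    train, valid = [], []
    for item in items:
        (valid if is_valid(item) else train).append(item)
    return train, valid


rng = random.Random(0)  # seeded only to make the example reproducible
train, valid = split_by_func(
    list(range(10_000)),
    partial(random_splitter, proportion_valid=0.1, rng=rng),
)
```

Because every item is assigned independently, the validation set is only approximately 10% of the data, which would explain sizes that are close to, but not exactly, a 90/10 split.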

In [10]:
splitdata = SplitData.split_by_func(textlist, partial(random_splitter, proportion_valid=0.1))
splitdata
Out[10]:
SplitData
Train: TextList (90012 items)
[/home/ady/Desktop/Dropbox/lgm_notebooks/data/imdb/test/neg/4225_2.txt, /home/ady/Desktop/Dropbox/lgm_notebooks/data/imdb/test/neg/9483_2.txt, /home/ady/Desktop/Dropbox/lgm_notebooks/data/imdb/test/neg/10083_2.txt, ...]
Path: /home/ady/Desktop/Dropbox/lgm_notebooks/data/imdb
Valid: TextList (9988 items)
[/home/ady/Desktop/Dropbox/lgm_notebooks/data/imdb/test/neg/6613_2.txt, /home/ady/Desktop/Dropbox/lgm_notebooks/data/imdb/test/neg/9339_3.txt, /home/ady/Desktop/Dropbox/lgm_notebooks/data/imdb/test/neg/7784_1.txt, ...]
Path: /home/ady/Desktop/Dropbox/lgm_notebooks/data/imdb

Tokenizing

We first need to tokenize the dataset, i.e., split each text into individual tokens. Tokens are basically words or punctuation marks, with some morphological analysis: "don't" is split into "do" and "n't", and so on. We will use a processor for this, in conjunction with the spaCy library.

Before tokenization, we clean up and process the texts in various ways:

  • we remove HTML markup
  • we identify character repetitions and replace them with a single occurrence of the character and the number of repetitions
  • we do the same for word repetitions.
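
The rules above can be sketched with a few regular expressions. This is an illustration under assumptions: the marker names _REP_ and _WREP_ and the repetition thresholds (four or more characters, three or more words) are made up here, since the notebook's output only shows _BOS_, _CAP_ and _UNK_. Turning <br /> into a newline does match the "\n\n" visible in the processed reviews further down.

```python
import html
import re

# Hypothetical marker names for character and word repetitions.
TK_REP, TK_WREP = "_REP_", "_WREP_"


def fix_html(text):
    # Line-break tags become newlines; any other tag is replaced by a space.
    text = re.sub(r"<br\s*/?>", "\n", text, flags=re.IGNORECASE)
    text = re.sub(r"</?[a-zA-Z][^>]*>", " ", text)
    return html.unescape(text)


def replace_char_rep(text):
    # "soooo" becomes "s", marker, "4", "o": one character plus a count.
    def _sub(match):
        char, rest = match.groups()
        return f" {TK_REP} {len(rest) + 1} {char} "
    return re.sub(r"(\S)(\1{3,})", _sub, text)


def replace_word_rep(text):
    # "far far far away" becomes marker, "3", "far", "away".
    def _sub(match):
        word, rest = match.groups()
        return f" {TK_WREP} {len(rest.split()) + 1} {word} "
    return re.sub(r"(\b\w+\W+)(\1{2,})", _sub, text)


def pre_process(text):
    # Apply the rules in order, before the text reaches the tokenizer.
    for rule in (fix_html, replace_char_rep, replace_word_rep):
        text = rule(text)
    return text
```

Encoding a repetition as a marker plus a count keeps the information that something was repeated while mapping "soooo" and "so" onto the same vocabulary entries.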

After tokenization, we post-process the tokens a little more:

  • we lower-case all-caps tokens and add a special token to mark that change
  • we lower-case capitalized tokens and mark that change
  • we add beginning-of-string BOS tokens; we will concatenate all texts and stream the result, so BOS tokens will mark the original text boundaries for the neural network model
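
A minimal sketch of these post-tokenization rules follows. The _BOS_ and _CAP_ markers appear in the notebook's output; the _UP_ name for the all-caps marker and the length conditions are assumptions, and the library's exact conditions may differ (in the processed review shown later, "OK" is lower-cased without a visible marker).

```python
BOS, CAP = "_BOS_", "_CAP_"  # both appear in the notebook's output
UP = "_UP_"                  # hypothetical name for the all-caps marker


def mark_case(tokens):
    out = []
    for tok in tokens:
        if len(tok) > 1 and tok.isupper():
            # All-caps token: emit the marker, then the lower-cased token.
            out += [UP, tok.lower()]
        elif len(tok) > 1 and tok[0].isupper() and tok[1:].islower():
            # Capitalized token: emit the marker, then the lower-cased token.
            out += [CAP, tok.lower()]
        else:
            # Everything else (including one-letter tokens like "I") is
            # lower-cased without a marker.
            out.append(tok.lower())
    return out


def post_process(tokens):
    # Prepend the beginning-of-string marker to every text.
    return [BOS] + mark_case(tokens)


example = post_process(["I", "think", "Webmaster", "is", "BAD", "."])
```

Lower-casing with markers lets "Webmaster" and "webmaster" share one vocabulary entry while the model can still learn what capitalization signals.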

Since tokenizing and applying the rules before and after tokenization takes a bit of time, we'll parallelize all this using ProcessPoolExecutor to go faster.
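
The parallelization pattern can be sketched as follows: cut the texts into chunks, hand each chunk to a worker process, and stitch the results back together in order. The function names, the whitespace tokenizer standing in for the real work, and the chunk size of 2000 are assumptions for illustration.

```python
from concurrent.futures import ProcessPoolExecutor


def chunked(items, chunk_size):
    # Yield consecutive slices of at most chunk_size items.
    for start in range(0, len(items), chunk_size):
        yield items[start:start + chunk_size]


def tokenize_chunk(texts):
    # Stand-in for the real work (pre-rules, spaCy tokenization, post-rules).
    return [text.split() for text in texts]


def parallel_tokenize(texts, chunk_size=2000, max_workers=4):
    chunks = list(chunked(texts, chunk_size))
    if max_workers < 2:
        # Serial fallback, handy for debugging.
        results = [tokenize_chunk(chunk) for chunk in chunks]
    else:
        with ProcessPoolExecutor(max_workers=max_workers) as executor:
            # executor.map returns results in the order of the chunks.
            results = list(executor.map(tokenize_chunk, chunks))
    # Flatten the per-chunk results back into one list of token lists.
    return [tokens for chunk in results for tokens in chunk]
```

The worker function must be defined at module level so that it can be pickled and sent to the worker processes. On platforms that start workers by spawning a fresh interpreter, the call to parallel_tokenize should sit under an if __name__ == "__main__": guard.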

In [11]:
proc_tok = TokenizeProcessor(max_workers=4)

Numericalizing

Once we have tokenized our texts, we replace each token by an individual number, i.e., we numericalize the texts. Again, we do this with a processor.
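
Here is a small sketch of what numericalization involves: build a vocabulary from token counts, map each token to its index, and map anything outside the vocabulary to the unknown token. In the notebook's output _UNK_ is 0, _BOS_ is 2 and _CAP_ is 7, which is what the ordering of the special tokens below reproduces; the other special names, max_vocab and min_freq are assumptions.

```python
from collections import Counter

# Specials come first so that their ids are fixed. Only _UNK_, _BOS_ and
# _CAP_ are taken from the notebook's output; the rest are placeholders.
SPECIALS = ["_UNK_", "_PAD_", "_BOS_", "_EOS_", "_REP_", "_WREP_", "_UP_", "_CAP_"]


def build_vocab(token_lists, max_vocab=60000, min_freq=2):
    counts = Counter(tok for tokens in token_lists for tok in tokens)
    # Keep the most frequent tokens that occur at least min_freq times.
    frequent = [
        tok for tok, n in counts.most_common(max_vocab)
        if n >= min_freq and tok not in SPECIALS
    ]
    return SPECIALS + frequent


def numericalize(tokens, token_to_id):
    # Tokens missing from the vocabulary map to id 0, i.e. _UNK_.
    return [token_to_id.get(tok, 0) for tok in tokens]


def deprocess(ids, vocab):
    # The reverse mapping: ids back to tokens.
    return [vocab[i] for i in ids]


docs = [["_BOS_", "the", "film", "."], ["_BOS_", "the", "film", "rocks"]]
vocab = build_vocab(docs)
token_to_id = {tok: i for i, tok in enumerate(vocab)}
ids = numericalize(docs[1], token_to_id)
```

Rare tokens are dropped from the vocabulary, so converting ids back to tokens recovers _UNK_ in their place rather than the original word.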

When we do language modeling, we will infer the labels from the text during training, so there's no need to label. The training loop expects labels, however, so we need to add dummy ones.
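
To see why no labels are needed, here is a sketch of how language-model targets can be derived from the text itself: the numericalized texts are concatenated into one stream, and the target at each position is the next token in that stream. The function name is hypothetical and batching is left out.

```python
def lm_inputs_and_targets(numericalized_texts):
    # Concatenate all texts into one stream; each text starts with its BOS id,
    # so the original boundaries survive the concatenation.
    stream = [idx for text in numericalized_texts for idx in text]
    # The target at each position is simply the next token in the stream.
    return stream[:-1], stream[1:]


inputs, targets = lm_inputs_and_targets([[2, 18, 122], [2, 8, 890]])
```

Here inputs and targets are the same stream shifted by one position, so the dummy label 0 attached to each text never plays a role in language modeling.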

In [12]:
proc_num = NumericalizeProcessor()

Actually processing the text

The label_by_func function from the lgm library takes a splitdata object, i.e., a dataset split into train and validation, and:

  • labels each of them by a function -- a dummy function in this case, which labels everything with $0$;
  • processes each of them with the provided processors; in this case, we tokenize and numericalize the independent variable, hence processor_x (processor_y would list processors for the dependent variable);
  • reassembles the labeled and processed train and validation sets into a new SplitData object
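
The three steps above can be sketched as follows. This is a simplified stand-in, not the lgm code: processors are modeled as plain callables that transform a whole list, and all names other than LabeledData are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class LabeledData:
    x: list
    y: list


def label_and_process(items, label_func, processors_x=(), processors_y=()):
    # Step 1: label every item with the labeling function.
    x, y = list(items), [label_func(item) for item in items]
    # Step 2: apply the processors in order; each transforms the whole list.
    for proc in processors_x:
        x = proc(x)
    for proc in processors_y:
        y = proc(y)
    return LabeledData(x, y)


def label_by_func_sketch(train_items, valid_items, label_func, processors_x=()):
    # Step 3: process both splits the same way and return them together.
    train = label_and_process(train_items, label_func, processors_x)
    valid = label_and_process(valid_items, label_func, processors_x)
    return train, valid


tokenize = lambda texts: [t.split() for t in texts]
count = lambda docs: [len(d) for d in docs]
train, valid = label_by_func_sketch(
    ["a b", "c"], ["d e f"], lambda x: 0, processors_x=[tokenize, count]
)
```
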
In [13]:
%time labeled_list = label_by_func(splitdata, lambda x: 0, processor_x = [proc_tok, proc_num])
100.00% [46/46 00:40<00:00]
100.00% [5/5 00:07<00:00]
CPU times: user 13.8 s, sys: 1.98 s, total: 15.8 s
Wall time: 57.2 s

Once the texts have been processed, they become lists of numbers, but we can still access the underlying raw data in x_obj (or y_obj for the targets, but we don't have any here).

In [14]:
labeled_list.train
Out[14]:
LabeledData
x: TextList (90012 items)
[[2, 18, 122, 8, 4051, 86, 17559, 467, 7, 48211, 10, 106, 229, 51, 8, 392, 54, 25, 19, 19652, 573, 10, 53, 158, 25, 379, 30, 18, 95, 35, 205, 301, 77, 3685, 34, 16, 9, 7, 115, 16, 614, 89, 9, 7, 2324, 9, 7, 28, 12, 139, 153, 10, 46, 1336, 110, 13, 8, 431, 372, 26, 584, 143, 55, 141, 278, 10, 27, 57, 12, 139, 2882, 148, 11, 54, 10, 383, 16, 900, 2290, 606, 9, 7, 19, 15, 8, 266, 13, 3507, 20, 17094, 34, 8, 5714, 10, 27, 178, 76, 2604, 171, 59, 16, 9, 18, 736, 3507, 10, 30, 45, 243, 10, 216, 8588, 10, 1618, 1653, 124, 953, 2324, 2440, 14, 41, 12, 291, 13, 1175, 59, 16, 9, 24, 7, 64, 18, 488, 25, 12, 31, 59, 12, 11985, 423, 868, 12073, 11, 539, 53, 20, 9, 7, 298, 53, 12, 7404, 510, 13, 8, 2337, 13, 8, 29, 7, 14778, 36, 129, 59, 8, 422, 282, 14, 409, 111, 61, 319, 59, 8, 100, 282, 14, 99, 17, 33, 9, 7, 64, 18, 209, 25, 65, 859, 433, 10, 5147, 45, 234, 590, 778, 27, 12, 6604, 13, 97, 1245, 834, 10, 12, 1500, 2792, 11985, 49, 25, 26211, 230, 40, 139, 3854, 10, 11, 12, 170, 49, 551, 87, 35, 140, 109, 14, 2816, 12, 381, 9, 18, 57, 488, 14, 2055, 17, 54, 10, 3906, 8, 381, 51, 8, 231, 10, 11, 1200, 8, 16662, 169, 564, 9, 7, 8, 823, 25, 35, 94, 146, 9, 7, 8, 137, 66, 7, 88, 10, 18, 497, 63, 8, 469, 231, 87, 35, 41, 160, 12, 97, 247, 14, 181, 51, 10, 39, 321, 43, 45, 243, 1694, 9, 7, 8, 304, 97, 231, 25, 600, 10, 117, 10, 30, 207, 94, 312, 346, 25, 12, 1004, 9, 7, 3507, 87, 35, 365, 10, 30, 8, 137, 25, 207, 97, 76, 651, 20, 104, 6091, 9, 24, 7, 182, 18, 99, 104, 69, 97, 10, 306, 22, 26212, 72, 14, 12, 191, 67, 198, 9, 7, 88, 10, 42, 55, 127, 9, 7, 481, 8, 16286, 3021, 10, 18, 270, 530, 8, 0, 539, 10, 64, 139, 54, 25, 13, 16, 10, 11, 73, 41, 94, 270, 320, 12, 29, 680, 59, 10, 17, 10, 11, 208, 20, 93, 8, 9131, 5811, 1003, 17, 8, 62, 164, 195, 62, 9, 7, 8, 42022, 0, 169, 486, 266, 13, 601, 10, 30, 1038, 72, 129, 270, 3182, 17, 8, 4193, 13, 198, 9, 7, 8, 509, 7768, 10, 80, 1594, 89, 13, 7, 19937, 7, 4611, 7, 8313, 22, 7, 52556, 10, 25, 236, 10, 173, 26, 12, 130, 2586, 3022, 2536, 
51, 8, 4997, 13, 7, 1028, 51, 7, 183, 7, 799, 10, 16, 25, 2200, 4318, 17, 8, 219, 512, 9, 7, 54, 22, 42, 244, 135, 8, 97, 231, 15, 7709, 11, 279, 571, 15, 229, 54, 9, 7, 155, 87, 35, 39, 371, 8, 97, 231, 22, 7975, 2583, 115, 36, 252, 8, 509, 2586, 25, 7975, 21732, 33, 66, 7, 2712, 10, 65, 236, 10214, 11, 17789, 539, 10, 63, 81, 46, 95, 41, 98, 9359, 146, 9, 24, 7, 166, 10, 49, 148, 10414, 1749, 19, 29, 14, 7, 3669, 7, 5051, 66, 7, 218, 124, 213, 290, 5092, 17, 8, 718, 11, 54, 22, 65, 266, 13, 141, 12849, 555, 13, 884, 17, 16, 10, 460, 13, 10, 17, 218, 124, 9, 7, 54, 22, 8, 3356, 13, 12, 590, 10, 30, 18, 161, 131, 128, 12, 31, 27, 12, 590, 45, 112, 2183, 20, 38, 178, 53, 372, 7, 3669, 7, 5051, 55, 7, 48211, 9, 7, 1968, 10, 28, 1978, 9, 7, 20, 22, 59, 135, 8, 4266, 149, 9, 7, 829, 9, 7, 54, 15, 74, 1872, 10, 57, 26, 54, 73, 35, 43, 223, 7, 3669, 7, 5051, 11, 7, 14778, 9, 7, 10839, 7, 1090, 15, 12, 545, 170, 27, 12, 101, 350, 28, 552, 468, 11, 698, 65, 34599, 13, 31, 23, 255, 10, 53, 135, 14, 301, 8, 381, 9, 7, 48211, 15, 699, 10, 11, 37, 57, 106, 13, 8, 354, 9, 7, 16, 22, 699, 106, 13, 97, 519, 10, 11, 106, 16, 69, 424, 93, 37, 325, 8, 751, 116, 61, 36, 53, 519, 17, 12, 123, 49, 1176, 1633, 17, 8, 107, 290, 14, 597, 8, 258, 17, 8, 149, 14, 99, 56, 14, 7, 244, 628, 47, 822, 230, 8, 7, 571, 10, 52, 20, 39, 95, 58, 40, 169, 9, 7, 70, 6651, 9, 7, 55, 10, 8, 3195, 2420, 2255, 28, 12, 123, 90, 41, 74, 313, 14, 477, 28, 9, 7, 55, 10, 4250, 198, 53, 8, 537, 292, 72, 14, 16101, 111, 57, 215, 28, 305, 14, 99, 262, 9, 33, 9, 24, 7, 16, 517, 14, 43, 2839, 10, 16, 517, 14, 43, 13968, 10, 16, 76, 517, 12, 981, 294, 20, 22, 37, 8, 243, 246, 1650, 10, 11, 16, 517, 14, 43, 3321, 9, 7, 30, 12, 757, 13, 744, 5622, 1245, 834, 10, 97, 519, 10, 823, 10, 2200, 342, 10, 97, 137, 10, 532, 9, 91, 37, 12, 3321, 29, 114, 9], [2, 8, 890, 86, 8, 110, 2560, 198, 17, 8, 464, 29, 9, 8, 547, 86, 877, 11, 1165, 10, 18, 403, 10, 109, 78, 46, 222, 79, 222, 4178, 369, 10, 4367, 10, 118, 10, 3728, 2545, 
10, 11, 37, 6299, 19, 1898, 51, 248, 262, 66, 17, 646, 14, 3183, 65, 460, 13, 4368, 10, 80, 18, 58, 1142, 83, 84, 8, 291, 17, 8, 4319, 92, 10, 46, 38, 1444, 14, 1492, 10, 248, 248, 248, 262, 51, 363, 14, 441, 27, 8, 52, 23, 462, 6919, 3240, 49, 1761, 48, 643, 2011, 27, 1492, 9, 18, 197, 1142, 83, 419, 64, 39, 25, 10, 52, 49, 83, 2076, 66, 8, 1566, 91, 1142, 365, 10, 8, 547, 38, 1444, 14, 1492, 14, 865, 158, 396, 52, 64, 632, 58, 46, 865, 68, 46, 920, 127, 714, 890, 11, 38, 496, 14, 8507, 8, 231, 20, 459, 14, 126, 111, 52, 46, 78, 920, 151, 456, 34, 58515, 66, 8, 758, 1832, 23, 1250, 29, 10, 283, 44, 1832, 23, 2113, 4817, 14, 52557, 11, 43, 496, 14, 4227, 8, 1292, 14, 1492, 55, 65, 248, 262, 661, 262, 51, 728, 66, 18, 58, 1142, 83, 53, 8, 52558, 3900, 10, 46, 133, 95, 83, 4464, 141, 8, 1508, 13, 2938, 10, 34, 389, 349, 92, 17, 440, 32, 87, 1142, 84, 20, 10, 115, 32, 86, 1880, 51, 8, 392, 9, 36, 2261, 28345, 33], [2, 18, 73, 53, 14, 949, 34, 109, 8, 547, 38, 2366, 9, 155, 15, 20, 79, 38, 235, 69, 482, 366, 2366, 115, 79, 38, 333, 366, 9, 190, 437, 79, 15, 235, 69, 482, 366, 115, 333, 42, 22, 9, 26, 63, 14, 157, 482, 366, 38, 146, 280, 115, 333, 366, 9, 18, 73, 53, 28, 302, 84, 69, 333, 366, 115, 482, 9, 11, 16, 37, 57, 147, 142, 16, 22, 53, 20, 17, 12, 188, 13, 289, 235, 69, 482, 22, 9, 30, 18, 73, 41, 217, 252, 32, 26, 8, 432, 24478, 13, 8, 142, 32, 73, 84, 19, 674, 11, 41, 69, 333, 366, 34, 147, 142, 9, 30, 32, 38, 57, 53, 8, 399, 282, 14, 506, 53, 32, 38, 52, 1252, 11, 360, 9, 32, 38, 57, 12, 220, 0, 15770, 9], ...]
Path: /home/ady/Desktop/Dropbox/lgm_notebooks/data/imdb
y: ItemList (90012 items)
[0, 0, 0, ...]
Path: /home/ady/Desktop/Dropbox/lgm_notebooks/data/imdb
In [15]:
labeled_list.train.x_obj(0)
Out[15]:
"_BOS_ i think the cards were stacked against _CAP_ webmaster , because right from the start there was this itchy feeling , like something was wrong but i could n't quite put my finger on it . _CAP_ then it hit me . _CAP_ dubbed . _CAP_ for a little while , they managed most of the lines either as voice over or off screen , with just a little hint here and there , until it became painfully obvious . _CAP_ this is the kind of dubbing that grates on the nerves , with nothing even remotely funny about it . i hate dubbing , but at least , however misplaced , martial arts films badly dubbed tend to have a sense of humour about it . \n\n _CAP_ what i wanted was a film about a hacker doing actual hacking and stuff like that . _CAP_ maybe like a reverse side of the table of the movie _CAP_ hackers ( being about the person trying to keep them out instead about the people trying to get in ) . _CAP_ what i got was some poorly written , nonsensical at times murder mystery with a ton of bad chase sequences , a supposedly inept hacker who was neutered without his little ego , and a director who obviously did n't know how to handle a camera . i just wanted to reach in there , grab the camera from the guy , and shoot the dang thing myself . _CAP_ the editing was n't much better . _CAP_ the acting ? _CAP_ well , i guess if the lead guy did n't have such a bad script to work from , he 'd be at least watchable . _CAP_ the main bad guy was ok , too , but pretty much everyone else was a joke . _CAP_ dubbing did n't help , but the acting was pretty bad even taking that into consideration . \n\n _CAP_ before i get into more bad , let 's perk up to a few good things . _CAP_ well , one or two . _CAP_ despite the rudimentary graphics , i rather enjoyed the _UNK_ stuff , what little there was of it , and would have much rather watched a movie mostly about , in , and around that than the tepid surroundings outside in the ' real world ' . 
_CAP_ the falsifying _UNK_ thing seemed kind of cool , but ended up being rather useless in the scheme of things . _CAP_ the heart gadget , which reminded me of _CAP_ guillermo _CAP_ del _CAP_ toro 's _CAP_ cronos , was interesting , though as a plot device ripped directly from the pages of _CAP_ escape from _CAP_ new _CAP_ york , it was horribly conceived in the long run . _CAP_ there 's one point where the bad guy is unconscious and our hero is right there . _CAP_ why did n't he try the bad guy 's thumb print then ( since the heart device was thumb activated ) ? _CAP_ nonetheless , some interesting gadgets and cyber stuff , if only they could have been utilized better . \n\n _CAP_ now , who here dares compare this movie to _CAP_ blade _CAP_ runner ? _CAP_ both films take place sometime in the future and there 's some kind of off kilter type of romance in it , sort of , in both films . _CAP_ there 's the investigation of a murder , but i 've seen many a film with a murder at its center that are nothing like either _CAP_ blade _CAP_ runner or _CAP_ webmaster . _CAP_ identity , for instance . _CAP_ that 's about where the similarities end . _CAP_ period . _CAP_ there is no comparison , just as there would n't be between _CAP_ blade _CAP_ runner and _CAP_ hackers . _CAP_ ridley _CAP_ scott is a brilliant director with a great mind for art direction and knows some fundamentals of film - making , like where to put the camera . _CAP_ webmaster is cheap , and not just because of the budget . _CAP_ it 's cheap because of bad writing , and because it more often than not takes the easy way out ( like writing in a character who barely appeared in the first place to save the girl in the end to get her to _CAP_ point b by herself without the _CAP_ hero , so that he could do his thing . _CAP_ very convenient . _CAP_ or , the attempted sympathy factor for a character we have no reason to care for . 
_CAP_ or , inane things like the car set up to stall them just enough for someone to get away . ) . \n\n _CAP_ it tries to be hip , it tries to be exploitive , it even tries a twist ending that 's not the least bit surprising , and it tries to be thrilling . _CAP_ but a bunch of near identical chase sequences , bad writing , editing , horribly shot , bad acting , etc . does not a thrilling movie make ."
In [16]:
labeled_list.train.x_obj(47)
Out[16]:
"_BOS_ _CAP_ _UNK_ _CAP_ ivory _CAP_ wayans was so funny in _CAP_ low _CAP_ down _CAP_ dirty _CAP_ shame that i had to see this one and it was one of the worst he has done and _CAP_ steven _CAP_ seagal did n't help much . _CAP_ it starts off with some odd religious killings that do n't make much sense to _CAP_ jim _CAP_ campbell ( _CAP_ keenan ) . _CAP_ he is surprised to see a new partner waiting for him to work by his side to crack the case but _CAP_ jack _CAP_ cole does n't seem to be who everyone thinks he is until _CAP_ jack 's ex wife is killed in one of those ritual killings that end up making him the suspect as well . _CAP_ it 's the same thing as all of his other movies : _CAP_ smoke past , cia involvement and now trying to be a normal cop . _CAP_ why does _CAP_ steven dress up like he is from a _CAP_ western movie ? _CAP_ and the prayer _UNK_ on top of that make things a little confusing ."
In [17]:
labeled_list.valid
Out[17]:
LabeledData
x: TextList (9988 items)
[[2, 7, 19, 29, 261, 486, 53, 12, 101, 338, 17, 1832, 23, 391, 9, 21, 7, 306, 22, 114, 12, 29, 59, 42, 13, 8, 857, 11, 110, 3266, 8293, 19142, 13, 8, 702, 962, 50, 7, 11, 306, 22, 196, 7, 1928, 7, 0, 26, 7, 3023, 7, 3248, 7, 3740, 50, 21, 7, 20, 22, 135, 19, 29, 436, 1999, 379, 9, 7, 155, 196, 48, 303, 49, 3225, 74, 8167, 13, 8, 145, 39, 22, 2382, 66, 7, 11, 115, 10, 155, 306, 19, 303, 494, 40, 123, 104, 37, 7, 3023, 7, 3740, 10, 30, 7, 1928, 7, 0, 17, 12, 811, 16369, 66, 7, 26, 18, 1846, 168, 19, 29, 34, 27852, 10, 18, 87, 35, 186, 564, 3759, 19, 145, 25, 177, 7, 3023, 7, 3740, 9, 7, 39, 87, 35, 185, 53, 108, 10, 706, 53, 108, 10, 506, 53, 108, 10, 55, 76, 1242, 53, 108, 9, 18, 95, 37, 99, 521, 19, 212, 10, 11, 40424, 10, 18, 95, 37, 390, 8, 29, 9, 7, 68, 7, 863, 7, 5222, 11, 7, 613, 7, 7457, 86, 196, 26, 8, 11539, 7, 9444, 7, 7146, 11, 8, 7, 6711, 7, 542, 10, 90, 87, 35, 477, 63, 46, 86, 1844, 1313, 4145, 13, 79, 316, 121, 106, 110, 13, 200, 85, 133, 76, 575, 13, 151, 351, 383, 90, 237, 8, 29, 9, 7, 30, 10, 27, 305, 26, 4775, 17, 511, 22, 1964, 26, 7, 3023, 7, 3740, 10, 32, 41, 14, 58, 146, 9, 7, 68, 7, 1901, 7, 3996, 25, 196, 26, 7, 9670, 10, 16, 25, 8, 187, 886, 9, 7, 30, 10, 7, 1901, 7, 3996, 113, 200, 284, 39, 17, 212, 25, 7, 9670, 9, 7, 0, 87, 35, 76, 371, 9, 7, 64, 254, 41, 98, 12, 101, 29, 25, 664, 7747, 47, 19, 9, 7, 1397, 8, 212, 20, 19, 29, 797, 14, 58, 28, 8, 675, 23, 3098, 672, 64, 21, 7, 1908, 7, 1963, 7, 1790, 21, 87, 28, 18364, 10, 16, 9680, 9, 7, 117, 139, 13, 8, 164, 7, 3023, 7, 3740, 11, 117, 94, 5276, 28, 8, 2184, 13, 1451, 1076, 664, 19, 29, 104, 12, 21, 946, 14, 494, 141, 332, 23, 116, 163, 8, 3661, 9, 21, 7, 8, 81, 67, 169, 59, 19, 15, 20, 16, 25, 727, 29, 9, 18, 87, 35, 41, 14, 479, 77, 268, 23, 4123, 308, 34, 19, 452, 13, 1223, 9], [2, 92, 63, 81, 7, 1010, 85, 2531, 262, 51, 16, 9, 7, 84, 10, 18, 122, 20, 19, 29, 60, 65, 1046, 9, 7, 88, 10, 8, 304, 123, 22, 886, 91, 10, 45, 243, 9, 7, 213, 61, 8, 241, 7, 4745, 7, 15253, 169, 10, 11, 32, 
161, 209, 8, 11164, 13, 12, 549, 29, 50, 7, 13, 286, 10, 32, 103, 1609, 69, 93, 332, 13, 8, 31, 10, 30, 10, 455, 88, 9, 7, 37, 20, 94, 13, 12, 1873, 9, 24, 7, 52, 10, 148, 16, 288, 96, 32, 213, 12, 758, 10, 20653, 10, 5537, 1569, 258, 36, 7, 10833, 7, 30030, 10, 49, 22, 177, 12, 549, 556, 33, 656, 138, 374, 29692, 12, 188, 10, 680, 143, 12, 21, 1551, 7688, 21, 36, 18, 167, 5273, 16, 22, 12, 1491, 3150, 51, 8, 29, 132, 29, 22, 53, 19, 242, 235, 4223, 20, 889, 4680, 33, 775, 7, 4745, 7, 15253, 9, 7, 572, 10, 13, 286, 10, 0, 507, 69, 61, 13, 136, 9, 7, 2356, 7, 7510, 7, 650, 22, 123, 10, 12, 3135, 2122, 49, 15, 57, 64, 0, 793, 36, 1214, 10, 18, 58, 35, 477, 64, 8, 123, 22, 164, 413, 15, 10, 18, 53, 8, 27036, 146, 33, 9, 7, 8, 127, 183, 374, 159, 14, 84, 7, 4745, 7, 15253, 36, 42, 14, 15634, 10, 42, 14, 114, 267, 13, 8, 0, 33, 10, 11, 46, 238, 61, 13, 16, 27, 40, 2729, 1654, 9, 7, 15446, 8098, 10, 11, 300, 528, 61, 2841, 17, 8, 149, 9, 24, 7, 63, 81, 7, 1010, 10, 55, 120, 686, 31, 1177, 28, 20, 522, 10, 87, 35, 41, 160, 12, 375, 681, 13, 745, 14, 3009, 344, 5305, 9, 7, 55, 298, 63, 745, 14, 3009, 344, 5305, 847, 12, 139, 69, 61, 13, 8, 118, 8083, 45, 111, 9, 7, 16, 22, 5818, 23, 11893, 625, 53, 19, 20, 114, 89, 69, 93, 12, 139, 7361, 49, 14, 43, 69, 5371, 27, 36, 12, 139, 31, 462, 62, 7, 324, 7, 393, 7, 750, 62, 293, 14, 350, 92, 33], [2, 12, 444, 13, 1823, 302, 1580, 19, 10, 554, 252, 7, 815, 7, 3041, 2733, 17, 16, 10, 16, 95, 35, 43, 97, 9, 379, 50, 7, 16, 22, 97, 26, 256, 78, 43, 9, 7, 54, 15, 52, 94, 14, 454, 45, 11, 16, 22, 37, 8, 642, 9, 7, 28, 1978, 10, 17, 42, 150, 7, 0, 2326, 104, 8, 2446, 11, 68, 39, 293, 61, 39, 15, 1084, 1630, 12, 352, 287, 5263, 50, 7, 135, 25, 21, 8, 170, 22, 21, 1179, 68, 39, 342, 20, 150, 66, 50, 66, 7, 261, 8, 187, 290, 7, 3041, 22, 25, 68, 39, 4000, 14, 58, 19, 505, 9, 7, 16, 22, 397, 12, 902, 42, 60, 14, 84, 160, 12, 503, 303, 159, 994, 17, 19, 358, 1412, 28, 12, 31, 9, 24, 7, 1271, 22, 259, 17, 19, 29, 78, 43, 462, 137, 9, 7, 696, 
7, 14188, 15, 813, 17, 57, 14, 723, 72, 8, 340, 1585, 148, 10, 30, 44, 39, 91, 15, 280, 15674, 17, 12, 70, 480, 11, 5304, 342, 150, 17, 65, 266, 13, 1522, 9, 7, 37, 14, 761, 8, 21, 7, 3039, 21, 10, 8, 258, 25, 52, 16236, 18, 488, 14, 1200, 56, 57, 14, 149, 56, 4778, 11, 1823, 26, 88, 9, 7, 63, 18, 95, 221, 19, 12, 2751, 10, 18, 73, 9, 7, 902, 20, 973, 91, 35, 2037, 148, 9, 7, 19, 397, 1870, 13, 12, 97, 1004, 55, 48, 2501, 908, 113, 57, 28, 267, 9, 7, 19, 154, 2711, 26, 12, 3143, 13, 109, 97, 628, 23, 118, 78, 99, 9], ...]
Path: /home/ady/Desktop/Dropbox/lgm_notebooks/data/imdb
y: ItemList (9988 items)
[0, 0, 0, ...]
Path: /home/ady/Desktop/Dropbox/lgm_notebooks/data/imdb
In [18]:
labeled_list.valid.x_obj(5)
Out[18]:
"_BOS_ _CAP_ congo is another multi - _UNK_ dollar adaptation of _CAP_ crichton 's works . _CAP_ like _CAP_ jurassic _CAP_ park , _CAP_ the _CAP_ lost _CAP_ world , _CAP_ sphere , etc , the film raped the book of its true meaning and essence . i 'll make this short and to the point . _CAP_ the scenery is beautiful . _CAP_ the actors , well it 's the best they can do . _CAP_ the script ? _CAP_ try _UNK_ hundreds of pages into an hour and half movie . _CAP_ you get a mess in the end but how neat of a mess is what counts and _CAP_ congo falls somewhere below that . _CAP_ there were some silly moments , like why did the killer gorillas decide to jump into the lava ? _CAP_ and _CAP_ amy , raised by humans , surrounded by humans , yet can intimidate dozens of killer apes around her ? _CAP_ what sort of twist of common sense is that ? _CAP_ which brings me to this . _CAP_ if there was an annoying character in every movie , _CAP_ amy ranks of one here . _CAP_ you see _CAP_ amy is this naive little female ape who can talk with a special backpack and harness strapped to her . _CAP_ neat idea , but it gets annoying after awhile hearing her talk . _CAP_ congo is worthwhile to see , and not deplorable , but certainly not a memorable film either ."

We can also convert numericalized text back by using the deprocess method associated with the numericalization processor proc_num.

In [19]:
print(proc_num.deprocess(labeled_list.train[0][0]))
['_BOS_', 'i', 'think', 'the', 'cards', 'were', 'stacked', 'against', '_CAP_', 'webmaster', ',', 'because', 'right', 'from', 'the', 'start', 'there', 'was', 'this', 'itchy', 'feeling', ',', 'like', 'something', 'was', 'wrong', 'but', 'i', 'could', "n't", 'quite', 'put', 'my', 'finger', 'on', 'it', '.', '_CAP_', 'then', 'it', 'hit', 'me', '.', '_CAP_', 'dubbed', '.', '_CAP_', 'for', 'a', 'little', 'while', ',', 'they', 'managed', 'most', 'of', 'the', 'lines', 'either', 'as', 'voice', 'over', 'or', 'off', 'screen', ',', 'with', 'just', 'a', 'little', 'hint', 'here', 'and', 'there', ',', 'until', 'it', 'became', 'painfully', 'obvious', '.', '_CAP_', 'this', 'is', 'the', 'kind', 'of', 'dubbing', 'that', 'grates', 'on', 'the', 'nerves', ',', 'with', 'nothing', 'even', 'remotely', 'funny', 'about', 'it', '.', 'i', 'hate', 'dubbing', ',', 'but', 'at', 'least', ',', 'however', 'misplaced', ',', 'martial', 'arts', 'films', 'badly', 'dubbed', 'tend', 'to', 'have', 'a', 'sense', 'of', 'humour', 'about', 'it', '.', '\n\n', '_CAP_', 'what', 'i', 'wanted', 'was', 'a', 'film', 'about', 'a', 'hacker', 'doing', 'actual', 'hacking', 'and', 'stuff', 'like', 'that', '.', '_CAP_', 'maybe', 'like', 'a', 'reverse', 'side', 'of', 'the', 'table', 'of', 'the', 'movie', '_CAP_', 'hackers', '(', 'being', 'about', 'the', 'person', 'trying', 'to', 'keep', 'them', 'out', 'instead', 'about', 'the', 'people', 'trying', 'to', 'get', 'in', ')', '.', '_CAP_', 'what', 'i', 'got', 'was', 'some', 'poorly', 'written', ',', 'nonsensical', 'at', 'times', 'murder', 'mystery', 'with', 'a', 'ton', 'of', 'bad', 'chase', 'sequences', ',', 'a', 'supposedly', 'inept', 'hacker', 'who', 'was', 'neutered', 'without', 'his', 'little', 'ego', ',', 'and', 'a', 'director', 'who', 'obviously', 'did', "n't", 'know', 'how', 'to', 'handle', 'a', 'camera', '.', 'i', 'just', 'wanted', 'to', 'reach', 'in', 'there', ',', 'grab', 'the', 'camera', 'from', 'the', 'guy', ',', 'and', 'shoot', 'the', 'dang', 'thing', 'myself', '.', 
'_CAP_', 'the', 'editing', 'was', "n't", 'much', 'better', '.', '_CAP_', 'the', 'acting', '?', '_CAP_', 'well', ',', 'i', 'guess', 'if', 'the', 'lead', 'guy', 'did', "n't", 'have', 'such', 'a', 'bad', 'script', 'to', 'work', 'from', ',', 'he', "'d", 'be', 'at', 'least', 'watchable', '.', '_CAP_', 'the', 'main', 'bad', 'guy', 'was', 'ok', ',', 'too', ',', 'but', 'pretty', 'much', 'everyone', 'else', 'was', 'a', 'joke', '.', '_CAP_', 'dubbing', 'did', "n't", 'help', ',', 'but', 'the', 'acting', 'was', 'pretty', 'bad', 'even', 'taking', 'that', 'into', 'consideration', '.', '\n\n', '_CAP_', 'before', 'i', 'get', 'into', 'more', 'bad', ',', 'let', "'s", 'perk', 'up', 'to', 'a', 'few', 'good', 'things', '.', '_CAP_', 'well', ',', 'one', 'or', 'two', '.', '_CAP_', 'despite', 'the', 'rudimentary', 'graphics', ',', 'i', 'rather', 'enjoyed', 'the', '_UNK_', 'stuff', ',', 'what', 'little', 'there', 'was', 'of', 'it', ',', 'and', 'would', 'have', 'much', 'rather', 'watched', 'a', 'movie', 'mostly', 'about', ',', 'in', ',', 'and', 'around', 'that', 'than', 'the', 'tepid', 'surroundings', 'outside', 'in', 'the', "'", 'real', 'world', "'", '.', '_CAP_', 'the', 'falsifying', '_UNK_', 'thing', 'seemed', 'kind', 'of', 'cool', ',', 'but', 'ended', 'up', 'being', 'rather', 'useless', 'in', 'the', 'scheme', 'of', 'things', '.', '_CAP_', 'the', 'heart', 'gadget', ',', 'which', 'reminded', 'me', 'of', '_CAP_', 'guillermo', '_CAP_', 'del', '_CAP_', 'toro', "'s", '_CAP_', 'cronos', ',', 'was', 'interesting', ',', 'though', 'as', 'a', 'plot', 'device', 'ripped', 'directly', 'from', 'the', 'pages', 'of', '_CAP_', 'escape', 'from', '_CAP_', 'new', '_CAP_', 'york', ',', 'it', 'was', 'horribly', 'conceived', 'in', 'the', 'long', 'run', '.', '_CAP_', 'there', "'s", 'one', 'point', 'where', 'the', 'bad', 'guy', 'is', 'unconscious', 'and', 'our', 'hero', 'is', 'right', 'there', '.', '_CAP_', 'why', 'did', "n't", 'he', 'try', 'the', 'bad', 'guy', "'s", 'thumb', 'print', 'then', '(', 'since', 
'the', 'heart', 'device', 'was', 'thumb', 'activated', ')', '?', '_CAP_', 'nonetheless', ',', 'some', 'interesting', 'gadgets', 'and', 'cyber', 'stuff', ',', 'if', 'only', 'they', 'could', 'have', 'been', 'utilized', 'better', '.', '\n\n', '_CAP_', 'now', ',', 'who', 'here', 'dares', 'compare', 'this', 'movie', 'to', '_CAP_', 'blade', '_CAP_', 'runner', '?', '_CAP_', 'both', 'films', 'take', 'place', 'sometime', 'in', 'the', 'future', 'and', 'there', "'s", 'some', 'kind', 'of', 'off', 'kilter', 'type', 'of', 'romance', 'in', 'it', ',', 'sort', 'of', ',', 'in', 'both', 'films', '.', '_CAP_', 'there', "'s", 'the', 'investigation', 'of', 'a', 'murder', ',', 'but', 'i', "'ve", 'seen', 'many', 'a', 'film', 'with', 'a', 'murder', 'at', 'its', 'center', 'that', 'are', 'nothing', 'like', 'either', '_CAP_', 'blade', '_CAP_', 'runner', 'or', '_CAP_', 'webmaster', '.', '_CAP_', 'identity', ',', 'for', 'instance', '.', '_CAP_', 'that', "'s", 'about', 'where', 'the', 'similarities', 'end', '.', '_CAP_', 'period', '.', '_CAP_', 'there', 'is', 'no', 'comparison', ',', 'just', 'as', 'there', 'would', "n't", 'be', 'between', '_CAP_', 'blade', '_CAP_', 'runner', 'and', '_CAP_', 'hackers', '.', '_CAP_', 'ridley', '_CAP_', 'scott', 'is', 'a', 'brilliant', 'director', 'with', 'a', 'great', 'mind', 'for', 'art', 'direction', 'and', 'knows', 'some', 'fundamentals', 'of', 'film', '-', 'making', ',', 'like', 'where', 'to', 'put', 'the', 'camera', '.', '_CAP_', 'webmaster', 'is', 'cheap', ',', 'and', 'not', 'just', 'because', 'of', 'the', 'budget', '.', '_CAP_', 'it', "'s", 'cheap', 'because', 'of', 'bad', 'writing', ',', 'and', 'because', 'it', 'more', 'often', 'than', 'not', 'takes', 'the', 'easy', 'way', 'out', '(', 'like', 'writing', 'in', 'a', 'character', 'who', 'barely', 'appeared', 'in', 'the', 'first', 'place', 'to', 'save', 'the', 'girl', 'in', 'the', 'end', 'to', 'get', 'her', 'to', '_CAP_', 'point', 'b', 'by', 'herself', 'without', 'the', '_CAP_', 'hero', ',', 'so', 'that', 
'he', 'could', 'do', 'his', 'thing', '.', '_CAP_', 'very', 'convenient', '.', '_CAP_', 'or', ',', 'the', 'attempted', 'sympathy', 'factor', 'for', 'a', 'character', 'we', 'have', 'no', 'reason', 'to', 'care', 'for', '.', '_CAP_', 'or', ',', 'inane', 'things', 'like', 'the', 'car', 'set', 'up', 'to', 'stall', 'them', 'just', 'enough', 'for', 'someone', 'to', 'get', 'away', '.', ')', '.', '\n\n', '_CAP_', 'it', 'tries', 'to', 'be', 'hip', ',', 'it', 'tries', 'to', 'be', 'exploitive', ',', 'it', 'even', 'tries', 'a', 'twist', 'ending', 'that', "'s", 'not', 'the', 'least', 'bit', 'surprising', ',', 'and', 'it', 'tries', 'to', 'be', 'thrilling', '.', '_CAP_', 'but', 'a', 'bunch', 'of', 'near', 'identical', 'chase', 'sequences', ',', 'bad', 'writing', ',', 'editing', ',', 'horribly', 'shot', ',', 'bad', 'acting', ',', 'etc', '.', 'does', 'not', 'a', 'thrilling', 'movie', 'make', '.']

Compare with the original text (assuming the first text made it into train, not validation):

In [20]:
print(textlist[0])
I think the cards were stacked against Webmaster, because right from the start there was this itchy feeling, like something was wrong but I couldn't quite put my finger on it. Then it hit me. Dubbed. For a little while, they managed most of the lines either as voice over or off screen, with just a little hint here and there, until it became painfully obvious. This is the kind of dubbing that grates on the nerves, with nothing even remotely funny about it. I hate dubbing, but at least, however misplaced, martial arts films badly dubbed tend to have a sense of humour about it.<br /><br />What I wanted was a film about a hacker doing actual hacking and stuff like that. Maybe like a reverse side of the table of the movie Hackers (being about the person trying to keep them out instead about the people trying to get in). What I got was some poorly written, nonsensical at times murder mystery with a ton of bad chase sequences, a supposedly inept hacker who was neutered without his little ego, and a director who obviously didn't know how to handle a camera. I just wanted to reach in there, grab the camera from the guy, and shoot the dang thing myself. The editing wasn't much better. The acting? Well, I guess if the lead guy didn't have such a bad script to work from, he'd be at least watchable. The main bad guy was OK, too, but pretty much everyone else was a joke. Dubbing didn't help, but the acting was pretty bad even taking that into consideration.<br /><br />Before I get into more bad, let's perk up to a few good things. Well, one or two. Despite the rudimentary graphics, I rather enjoyed the cyberworld stuff, what little there was of it, and would have much rather watched a movie mostly about, in, and around that than the tepid surroundings outside in the 'real world'. The falsifying thumbprint thing seemed kind of cool, but ended up being rather useless in the scheme of things. 
The heart gadget, which reminded me of Guillermo Del Toro's Cronos, was interesting, though as a plot device ripped directly from the pages of Escape from New York, it was horribly conceived in the long run. There's one point where the bad guy is unconscious and our hero is right there. Why didn't he try the bad guy's thumb print then (since the heart device was thumb activated)? Nonetheless, some interesting gadgets and cyber stuff, if only they could have been utilized better.<br /><br />Now, who here dares compare this movie to Blade Runner? Both films take place sometime in the future and there's some kind of off kilter type of romance in it, sort of, in both films. There's the investigation of a murder, but I've seen many a film with a murder at its center that are nothing like either Blade Runner or Webmaster. Identity, for instance. That's about where the similarities end. Period. There is no comparison, just as there wouldn't be between Blade Runner and Hackers. Ridley Scott is a brilliant director with a great mind for art direction and knows some fundamentals of film-making, like where to put the camera. Webmaster is cheap, and not just because of the budget. It's cheap because of bad writing, and because it more often than not takes the easy way out (like writing in a character who barely appeared in the first place to save the girl in the end to get her to Point B by herself without the Hero, so that he could do his thing. Very convenient. Or, the attempted sympathy factor for a character we have no reason to care for. Or, inane things like the car set up to stall them just enough for someone to get away.).<br /><br />It tries to be hip, it tries to be exploitive, it even tries a twist ending that's not the least bit surprising, and it tries to be thrilling. But a bunch of near identical chase sequences, bad writing, editing, horribly shot, bad acting, etc. does not a thrilling movie make.

Since the preprocessing takes time, we save the intermediate result using pickle. Don't use any lambda functions in your processors: lambdas cannot be pickled, so saving the result would fail.
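To see why lambdas are a problem, here is a minimal standalone sketch. It does not use the notebook's processors; `replace_br` and `can_pickle` are hypothetical names introduced only for illustration. pickle stores functions by reference to an importable name, which a lambda does not have, whereas a functools.partial over a module-level function such as re.sub can be saved and restored:

```python
import pickle
import re
from functools import partial

# Hypothetical text rule: turn HTML line breaks into newlines.
# A partial over a module-level function (re.sub) is pickled by reference.
replace_br = partial(re.sub, r"<br />", "\n")

def can_pickle(obj):
    """Return True if obj survives pickle.dumps, False otherwise."""
    try:
        pickle.dumps(obj)
        return True
    except (pickle.PicklingError, AttributeError):
        return False

print(can_pickle(replace_br))                            # the partial can be pickled
print(can_pickle(lambda t: t.replace("<br />", "\n")))   # the lambda cannot
```

If a processor needs a custom rule, defining it with def at module level (or wrapping an existing function with partial) keeps the whole labeled list picklable.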

In [21]:
import pickle
# pickle.dump(labeled_list, open(path/'labeled_list_lm.pkl', 'wb'))
In [22]:
# use a context manager so the file handle is closed after loading
with open(path/'labeled_list_lm.pkl', 'rb') as f:
    labeled_list = pickle.load(f)

Batching for language models

Converting our labeled_list to a DataBunch, that is, batching it through a dataloader, requires additional work:

  • we don't just want batches of IMDB reviews: we want to stream in all the texts concatenated and determine batches from that single concatenated text stream and the BPTT (backprop through time) hyperparameter.
  • we also have to prepare the targets, which are the next words in the text.

This is done with an LM_PreLoader class, which processes texts for language models (hence LM) so that they can be fed into dataloaders (hence PreLoader):

  • at the beginning of each epoch, we shuffle the texts (if shuffle=True) and create a big stream by concatenating all of them
  • we divide this big stream into batch_size smaller streams
  • we read these smaller streams in chunks of bptt length
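The three steps above can be sketched on a toy stream of integers. This is a simplified re-implementation for illustration, not the actual LM_PreLoader code: `lm_batches` is a hypothetical name, shuffling is left out, and how leftover tokens are handled is an assumption.

```python
def lm_batches(stream, batch_size, bptt):
    """Yield (x, y) batches from one long token stream (toy sketch)."""
    # Length of each of the batch_size sub-streams; one token is held
    # back so that the last input still has a "next token" target.
    n = (len(stream) - 1) // batch_size
    inputs = [stream[i * n:(i + 1) * n] for i in range(batch_size)]
    targets = [stream[i * n + 1:(i + 1) * n + 1] for i in range(batch_size)]
    # Read every sub-stream in consecutive chunks of bptt tokens.
    for start in range(0, n, bptt):
        x = [row[start:start + bptt] for row in inputs]
        y = [row[start:start + bptt] for row in targets]
        yield x, y

stream = list(range(21))  # stand-in for the concatenated, numericalized texts
(x1, y1), (x2, y2) = lm_batches(stream, batch_size=2, bptt=5)
print(x1)  # [[0, 1, 2, 3, 4], [10, 11, 12, 13, 14]]
print(y1)  # [[1, 2, 3, 4, 5], [11, 12, 13, 14, 15]]
print(x2)  # [[5, 6, 7, 8, 9], [15, 16, 17, 18, 19]]
```

Each row of x2 picks up exactly where the same row of x1 stopped, and every target row is its input row moved one token ahead. That is the property a recurrent language model relies on when it carries its hidden state from one batch to the next.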

Let's see how this works for the smaller validation set:

In [23]:
dl = DataLoader(LM_PreLoader(labeled_list.valid, shuffle=True), batch_size=64)

Let's check that it all works: x1, y1, x2 and y2 should all be of size batch_size by bptt. The text in each row of x1 should continue in the same row of x2. y1 and y2 should contain the same text as their x counterparts, shifted by one position, so that each target is the token that follows the corresponding input token.

In [24]:
iter_dl = iter(dl)
x1,y1 = next(iter_dl)
x2,y2 = next(iter_dl)
In [25]:
x1.size(),y1.size()
Out[25]:
(torch.Size([64, 70]), torch.Size([64, 70]))
In [26]:
print(proc_num.deprocess(x1[0]))
['_BOS_', '_CAP_', 'this', 'movie', 'twists', 'the', 'facts', 'of', '_CAP_', 'anne', 'and', '_CAP_', 'mary', "'s", 'lives', 'into', 'something', 'unrecognizable', '.', '_CAP_', 'to', 'make', '_CAP_', 'mary', '_CAP_', 'boleyn', ',', 'who', 'in', 'fact', 'was', 'a', 'rather', 'dim', 'and', 'foolish', 'creature', ',', 'and', 'make', 'her', 'the', '"', 'good', '"', 'sister', 'is', 'just', 'silly', '.', '_CAP_', 'it', 'is', '_CAP_', 'anne', 'who', 'was', 'in', 'fact', 'the', 'far', 'more', 'interesting', 'character', ',', 'and', 'that', 'is', 'why', 'it']
In [27]:
print(proc_num.deprocess(y1[0]))
['_CAP_', 'this', 'movie', 'twists', 'the', 'facts', 'of', '_CAP_', 'anne', 'and', '_CAP_', 'mary', "'s", 'lives', 'into', 'something', 'unrecognizable', '.', '_CAP_', 'to', 'make', '_CAP_', 'mary', '_CAP_', 'boleyn', ',', 'who', 'in', 'fact', 'was', 'a', 'rather', 'dim', 'and', 'foolish', 'creature', ',', 'and', 'make', 'her', 'the', '"', 'good', '"', 'sister', 'is', 'just', 'silly', '.', '_CAP_', 'it', 'is', '_CAP_', 'anne', 'who', 'was', 'in', 'fact', 'the', 'far', 'more', 'interesting', 'character', ',', 'and', 'that', 'is', 'why', 'it', 'is']
In [28]:
print(proc_num.deprocess(x2[0]))
['is', 'her', 'life', ',', 'and', 'not', '_CAP_', 'mary', "'s", ',', 'that', 'has', 'been', 'told', 'so', 'often', '.', '\n\n', '_CAP_', 'in', 'response', 'to', 'an', 'earlier', 'review', ',', 'i', 'fail', 'to', 'see', 'how', '_CAP_', 'anne', "'s", 'life', 'was', 'so', '"', 'criminal', '"', '...', 'to', 'me', 'it', "'s", '_CAP_', 'henry', 'who', 'was', 'the', 'real', 'criminal', '.', '_CAP_', 'whatever', '_CAP_', 'anne', "'s", 'motives', 'for', 'winning', 'the', 'king', 'and', 'withholding', 'her', 'affections', 'in', 'order', 'to']
In [29]:
print(proc_num.deprocess(y2[0]))
['her', 'life', ',', 'and', 'not', '_CAP_', 'mary', "'s", ',', 'that', 'has', 'been', 'told', 'so', 'often', '.', '\n\n', '_CAP_', 'in', 'response', 'to', 'an', 'earlier', 'review', ',', 'i', 'fail', 'to', 'see', 'how', '_CAP_', 'anne', "'s", 'life', 'was', 'so', '"', 'criminal', '"', '...', 'to', 'me', 'it', "'s", '_CAP_', 'henry', 'who', 'was', 'the', 'real', 'criminal', '.', '_CAP_', 'whatever', '_CAP_', 'anne', "'s", 'motives', 'for', 'winning', 'the', 'king', 'and', 'withholding', 'her', 'affections', 'in', 'order', 'to', 'gain']

Let's use a convenience function to do this quickly, for both the train and validation datasets.

In [30]:
# def get_lm_dls(train_ds, valid_ds, batch_size, bptt, **kwargs):
#     return (DataLoader(LM_PreLoader(train_ds, batch_size, bptt, shuffle=True),
#                        batch_size=batch_size, **kwargs),
#             DataLoader(LM_PreLoader(valid_ds, batch_size, bptt, shuffle=False),
#                        batch_size=2*batch_size, **kwargs))

# def lm_databunchify(splitdata, batch_size, bptt, **kwargs):
#     return DataBunch(*get_lm_dls(splitdata.train, splitdata.valid, batch_size, bptt, **kwargs))
In [31]:
batch_size = 64
bptt = 70
data = lm_databunchify(labeled_list, batch_size, bptt)

Optional: check out the docstring of the __getitem__ method of LM_PreLoader for a simple example of how batching for language models works.

In [32]:
# LM_PreLoader.__getitem__?

Batching for classification

When we tackle classification, gathering the data is a bit different: first we label our texts with the folder they come from, and then we need to apply padding to batch them together. To avoid mixing very long texts with very short ones, we also use a sampler to sort our samples by length, with a bit of randomness for the training set.

We label our data with CategoryProcessor, which converts the labels (levels of the dependent categorical variable) to integers.
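The idea behind this kind of label processing can be sketched in a few lines. The sketch is an assumption-laden illustration, not the CategoryProcessor source: `build_label_maps` is a hypothetical helper, and ordering the levels alphabetically is a choice made here (the notebook output below, with 'neg' mapped to 0 and 'pos' to 1, is consistent with it but does not prove it).

```python
def build_label_maps(labels):
    """Return (vocab, otoi): the unique labels and a label -> int mapping."""
    vocab = sorted(set(labels))  # assumption: levels are ordered alphabetically
    otoi = {label: i for i, label in enumerate(vocab)}
    return vocab, otoi

labels = ["pos", "neg", "neg", "pos"]
vocab, otoi = build_label_maps(labels)
encoded = [otoi[label] for label in labels]
print(vocab)    # ['neg', 'pos']
print(otoi)     # {'neg': 0, 'pos': 1}
print(encoded)  # [1, 0, 0, 1]
```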

In [33]:
proc_cat = CategoryProcessor()
In [34]:
textlist = TextList.from_files(path, include=['train', 'test'])
splitdata = SplitData.split_by_func(textlist, partial(grandparent_splitter, valid_name='test'))
labeled_list = label_by_func(splitdata, parent_labeler,
                             processor_x = [proc_tok, proc_num],
                             processor_y=proc_cat)
100.00% [13/13 00:12<00:00]
100.00% [13/13 00:12<00:00]
In [35]:
proc_cat.level_dict
Out[35]:
['neg', 'pos']
In [36]:
proc_cat.otoi
Out[36]:
{'neg': 0, 'pos': 1}
In [37]:
labeled_list.train
Out[37]:
LabeledData
x: TextList (25000 items)
[[2, 7, 19, 29, 15, 175, 7, 1613, 2312, 31, 17, 8, 369, 13, 7, 8, 7, 10530, 7, 2589, 9, 7, 37, 20, 20, 15, 2773, 97, 30, 28, 8, 212, 20, 110, 2312, 124, 3609, 7313, 11, 1798, 28, 8, 760, 46, 658, 14, 1646, 9, 7, 63, 32, 390, 12, 24194, 1132, 13, 136, 34, 8, 1788, 11, 8, 116, 8, 7, 11076, 78, 677, 12, 136, 10, 93, 402, 32, 227, 390, 19, 29, 9, 18, 157, 10, 597, 147, 308, 11, 824, 7, 8, 7, 1524, 11, 7, 8, 7, 24863, 55, 7, 8, 7, 1736, 9, 7, 68, 105, 7, 1613, 989, 865, 20, 558, 100, 157, 97, 690, 66, 7, 16, 25, 4593, 14, 84, 2805, 2403, 49, 38, 37, 1514, 14, 3947, 36, 3730, 66, 2805, 157, 97, 690, 66, 33, 11, 1068, 121, 18, 83, 95, 37, 2025, 14, 9, 7, 103, 10, 16, 73, 161, 98, 101, 63, 8, 29, 85, 636, 65, 34907, 7, 166, 20, 73, 43, 158, 18, 321, 53, 14, 968, 14, 84, 9, 7, 49, 25, 8, 46050, 49, 1129, 19, 16754, 21245, 23, 1596, 82, 13, 12, 29, 27, 7, 8, 7, 21651, 7, 9395, 11, 7, 8, 7, 10312, 66], [2, 7, 83, 12, 410, 29, 9, 7, 16, 22, 14, 43, 847, 10, 173, 9, 7, 707, 12, 375, 354, 96, 178, 44, 20, 4114, 10, 48, 556, 36, 63, 32, 78, 655, 64, 71, 91, 21, 137, 21, 33, 49, 235, 60, 589, 27, 1037, 17, 12, 3487, 150, 10, 12, 145, 17, 12, 21078, 2026, 242, 15519, 119, 2886, 10, 12, 3359, 17, 8, 1577, 10, 532, 9, 7, 30, 54, 38, 65, 1743, 803, 9, 7, 285, 8, 82, 15, 37, 183, 10, 28, 8, 110, 192, 10, 54, 22, 12, 191, 525, 20, 38, 35, 52, 25255, 9, 7, 28, 42, 10, 8, 333, 231, 91, 35, 719, 68, 39, 22, 2847, 36, 8, 107, 75, 33, 11, 39, 15, 35, 76, 42, 13, 8, 107, 382, 14, 719, 9, 7, 30, 20, 22, 1282, 9, 7, 69, 3451, 10, 54, 22, 12, 70, 236, 981, 2614, 7, 16814, 22, 4872, 11, 7, 4426, 199, 7, 1361, 20, 18, 87, 35, 84, 568, 9, 7, 68, 7, 1361, 594, 7, 16814, 39, 695, 64, 71, 87, 10, 18, 2448, 64, 39, 326, 11, 64, 7, 16814, 14504, 27, 9, 7, 30, 68, 8, 1510, 1947, 49, 39, 83, 25, 10, 18, 25, 3533, 780, 45, 8, 7021, 13, 8, 3410, 9, 7, 16, 95, 43, 106, 13, 77, 626, 13, 577, 27, 8, 504, 10, 55, 20, 16, 22, 12, 2083, 1025, 981, 9, 24, 7, 372, 116, 10, 8, 29, 22, 207, 97, 11, 58, 35, 126, 16, 
63, 54, 22, 256, 146, 34, 92, 7, 931, 32, 203, 17, 8, 1278, 28, 12, 699, 4995, 505, 9], [2, 18, 257, 37, 273, 49, 15, 519, 151, 24, 5588, 842, 28, 19, 29, 30, 1712, 89, 16, 0, 9, 18, 41, 131, 3328, 13, 201, 124, 11, 1243, 1428, 11, 19, 42, 15, 877, 16, 15, 81, 59, 0, 219, 11, 284, 89, 20, 15, 44, 18, 95, 213, 9, 7, 130, 15, 410, 10, 137, 15, 76, 458, 9, 7, 11, 54, 15, 74, 987, 45, 44, 9, 24, 7, 76, 8, 7, 599, 7, 23345, 124, 38, 146, 93, 19, 9, 512, 262, 51, 12000, 9, 18, 847, 14, 736, 8, 137, 10, 80, 78, 43, 6115, 17, 1491, 14, 398, 118, 10, 63, 8, 130, 15, 67, 9, 24, 7, 19, 85, 74, 2713, 10, 70, 139, 7, 624, 10, 24, 11, 12, 397, 7560, 196, 9, 24, 18, 320, 19, 27, 297, 102, 374, 24, 49, 18, 445, 38, 152, 694, 14, 89, 50, 24, 7, 46, 488, 89, 14, 11988, 8, 3772, 61, 8, 1824, 9, 18, 78, 37, 284, 272, 95, 41, 24, 388, 19, 4581, 12, 67, 737, 9], ...]
Path: /home/ady/Desktop/Dropbox/lgm_notebooks/data/imdb
y: ItemList (25000 items)
[0, 0, 0, ...]
Path: /home/ady/Desktop/Dropbox/lgm_notebooks/data/imdb
In [38]:
labeled_list.train.x_obj(0)
Out[38]:
"_BOS_ _CAP_ this movie is another _CAP_ christian propaganda film in the line of _CAP_ the _CAP_ omega _CAP_ code . _CAP_ not that that is necessarily bad but for the fact that most propaganda films sacrifice sincerity and realism for the message they wish to deliver . _CAP_ if you enjoy a styrofoam portrayal of life on the streets and the way the _CAP_ gospel can change a life , than perhaps you may enjoy this movie . i say , save your money and rent _CAP_ the _CAP_ cross and _CAP_ the _CAP_ switchblade or _CAP_ the _CAP_ mission . _CAP_ when will _CAP_ christian directors learn that sometimes people say bad words ? _CAP_ it was frustrating to see criminals depicted who are not allowed to swear ( huh ? criminals say bad words ? ) and flat characters i really could not relate to . _CAP_ also , it would 've been great if the movie had shown some t&a. _CAP_ now that would be something i 'd like to pay to see . _CAP_ who was the blockhead who compared this communion wafer - thin story of a movie with _CAP_ the _CAP_ boondock _CAP_ saints and _CAP_ the _CAP_ sting ?"
In [39]:
labeled_list.train.x_obj(47)
Out[39]:
'_BOS_ _CAP_ now please do n\'t start calling me names like , " unpatriotic " , " weirdo " and more . \n\n _CAP_ the very length of this movie ( 4 hours .. ! ! ! ) is its biggest mistake . _CAP_ no editing at all - seems like j.p. _CAP_ dutta fell in love with his project too much . _CAP_ even _CAP_ lagaan was 4 hours long - but it was entertaining and gave a message as well . \n\n _CAP_ it \'s based on true incidents and real people . _CAP_ kudos to it , but were the repetitive war scenes really needed ? _CAP_ on top of it the focus constantly shifted from one battalion / squadron to another and it was impossible to keep a track of them all . \n\n _CAP_ between the skirmishes , there were songs about loneliness , _UNK_ and related stuff . _CAP_ there were chummy conversations . _CAP_ in the beginning it gave some relief from the violence but became so monotonous later that one could even correctly predict nature of the forthcoming talk . \n\n _CAP_ why were the soldiers walking around as if they were lions in jungle , fully unaware that enemy was lurking somewhere near ? _CAP_ and when they were shot , it elicited sympathy but it seemed _UNK_ of them to be so cocksure of their safety in the first place . \n\n _CAP_ music was melodious and the lyrics were soulful but did not fit with the movie . _CAP_ better to listen to them on the soundtrack rather than in the movie . \n\n _CAP_ acting was the saving grace : _CAP_ from seasoned veterans like _CAP_ sanjay _CAP_ dutt and _CAP_ ajay _CAP_ devgan , to relative newbies like _CAP_ abhishek _CAP_ bachchan and _CAP_ akshaye _CAP_ khanna , everyone acted like a pro . _CAP_ manoj _CAP_ bajpai and _CAP_ ashutosh _CAP_ rana deserve a special mention for lightening up the mood whenever necessary . \n\n _CAP_ dialogues ranged from brilliant ( " _CAP_ from _CAP_ madhuri .. with _CAP_ love ! ! " ) to illogical / monotonous ( " _CAP_ pakistan se _UNK_ _UNK_ _CAP_ _UNK_ mein hain " ) . 
_CAP_ and the expletive spree consisting of all the _UNK_ , _UNK_ , _CAP_ cs and f - words was n\'t really required . \n\n loc _CAP_ kargil attempts to provide a fitting tribute to the brave _CAP_ indian soldiers , but tries too hard and ultimately fails . _CAP_ indian soldiers surely deserve a better tribute .'
In [40]:
labeled_list.valid
Out[40]:
LabeledData
x: TextList (25000 items)
[[2, 18, 122, 8, 4051, 86, 17559, 467, 7, 48211, 10, 106, 229, 51, 8, 392, 54, 25, 19, 19652, 573, 10, 53, 158, 25, 379, 30, 18, 95, 35, 205, 301, 77, 3685, 34, 16, 9, 7, 115, 16, 614, 89, 9, 7, 2324, 9, 7, 28, 12, 139, 153, 10, 46, 1336, 110, 13, 8, 431, 372, 26, 584, 143, 55, 141, 278, 10, 27, 57, 12, 139, 2882, 148, 11, 54, 10, 383, 16, 900, 2290, 606, 9, 7, 19, 15, 8, 266, 13, 3507, 20, 17094, 34, 8, 5714, 10, 27, 178, 76, 2604, 171, 59, 16, 9, 18, 736, 3507, 10, 30, 45, 243, 10, 216, 8588, 10, 1618, 1653, 124, 953, 2324, 2440, 14, 41, 12, 291, 13, 1175, 59, 16, 9, 24, 7, 64, 18, 488, 25, 12, 31, 59, 12, 11985, 423, 868, 12073, 11, 539, 53, 20, 9, 7, 298, 53, 12, 7404, 510, 13, 8, 2337, 13, 8, 29, 7, 14778, 36, 129, 59, 8, 422, 282, 14, 409, 111, 61, 319, 59, 8, 100, 282, 14, 99, 17, 33, 9, 7, 64, 18, 209, 25, 65, 859, 433, 10, 5147, 45, 234, 590, 778, 27, 12, 6604, 13, 97, 1245, 834, 10, 12, 1500, 2792, 11985, 49, 25, 26211, 230, 40, 139, 3854, 10, 11, 12, 170, 49, 551, 87, 35, 140, 109, 14, 2816, 12, 381, 9, 18, 57, 488, 14, 2055, 17, 54, 10, 3906, 8, 381, 51, 8, 231, 10, 11, 1200, 8, 16662, 169, 564, 9, 7, 8, 823, 25, 35, 94, 146, 9, 7, 8, 137, 66, 7, 88, 10, 18, 497, 63, 8, 469, 231, 87, 35, 41, 160, 12, 97, 247, 14, 181, 51, 10, 39, 321, 43, 45, 243, 1694, 9, 7, 8, 304, 97, 231, 25, 600, 10, 117, 10, 30, 207, 94, 312, 346, 25, 12, 1004, 9, 7, 3507, 87, 35, 365, 10, 30, 8, 137, 25, 207, 97, 76, 651, 20, 104, 6091, 9, 24, 7, 182, 18, 99, 104, 69, 97, 10, 306, 22, 26212, 72, 14, 12, 191, 67, 198, 9, 7, 88, 10, 42, 55, 127, 9, 7, 481, 8, 16286, 3021, 10, 18, 270, 530, 8, 0, 539, 10, 64, 139, 54, 25, 13, 16, 10, 11, 73, 41, 94, 270, 320, 12, 29, 680, 59, 10, 17, 10, 11, 208, 20, 93, 8, 9131, 5811, 1003, 17, 8, 62, 164, 195, 62, 9, 7, 8, 42022, 0, 169, 486, 266, 13, 601, 10, 30, 1038, 72, 129, 270, 3182, 17, 8, 4193, 13, 198, 9, 7, 8, 509, 7768, 10, 80, 1594, 89, 13, 7, 19937, 7, 4611, 7, 8313, 22, 7, 52556, 10, 25, 236, 10, 173, 26, 12, 130, 2586, 3022, 2536, 
51, 8, 4997, 13, 7, 1028, 51, 7, 183, 7, 799, 10, 16, 25, 2200, 4318, 17, 8, 219, 512, 9, 7, 54, 22, 42, 244, 135, 8, 97, 231, 15, 7709, 11, 279, 571, 15, 229, 54, 9, 7, 155, 87, 35, 39, 371, 8, 97, 231, 22, 7975, 2583, 115, 36, 252, 8, 509, 2586, 25, 7975, 21732, 33, 66, 7, 2712, 10, 65, 236, 10214, 11, 17789, 539, 10, 63, 81, 46, 95, 41, 98, 9359, 146, 9, 24, 7, 166, 10, 49, 148, 10414, 1749, 19, 29, 14, 7, 3669, 7, 5051, 66, 7, 218, 124, 213, 290, 5092, 17, 8, 718, 11, 54, 22, 65, 266, 13, 141, 12849, 555, 13, 884, 17, 16, 10, 460, 13, 10, 17, 218, 124, 9, 7, 54, 22, 8, 3356, 13, 12, 590, 10, 30, 18, 161, 131, 128, 12, 31, 27, 12, 590, 45, 112, 2183, 20, 38, 178, 53, 372, 7, 3669, 7, 5051, 55, 7, 48211, 9, 7, 1968, 10, 28, 1978, 9, 7, 20, 22, 59, 135, 8, 4266, 149, 9, 7, 829, 9, 7, 54, 15, 74, 1872, 10, 57, 26, 54, 73, 35, 43, 223, 7, 3669, 7, 5051, 11, 7, 14778, 9, 7, 10839, 7, 1090, 15, 12, 545, 170, 27, 12, 101, 350, 28, 552, 468, 11, 698, 65, 34599, 13, 31, 23, 255, 10, 53, 135, 14, 301, 8, 381, 9, 7, 48211, 15, 699, 10, 11, 37, 57, 106, 13, 8, 354, 9, 7, 16, 22, 699, 106, 13, 97, 519, 10, 11, 106, 16, 69, 424, 93, 37, 325, 8, 751, 116, 61, 36, 53, 519, 17, 12, 123, 49, 1176, 1633, 17, 8, 107, 290, 14, 597, 8, 258, 17, 8, 149, 14, 99, 56, 14, 7, 244, 628, 47, 822, 230, 8, 7, 571, 10, 52, 20, 39, 95, 58, 40, 169, 9, 7, 70, 6651, 9, 7, 55, 10, 8, 3195, 2420, 2255, 28, 12, 123, 90, 41, 74, 313, 14, 477, 28, 9, 7, 55, 10, 4250, 198, 53, 8, 537, 292, 72, 14, 16101, 111, 57, 215, 28, 305, 14, 99, 262, 9, 33, 9, 24, 7, 16, 517, 14, 43, 2839, 10, 16, 517, 14, 43, 13968, 10, 16, 76, 517, 12, 981, 294, 20, 22, 37, 8, 243, 246, 1650, 10, 11, 16, 517, 14, 43, 3321, 9, 7, 30, 12, 757, 13, 744, 5622, 1245, 834, 10, 97, 519, 10, 823, 10, 2200, 342, 10, 97, 137, 10, 532, 9, 91, 37, 12, 3321, 29, 114, 9], [2, 8, 890, 86, 8, 110, 2560, 198, 17, 8, 464, 29, 9, 8, 547, 86, 877, 11, 1165, 10, 18, 403, 10, 109, 78, 46, 222, 79, 222, 4178, 369, 10, 4367, 10, 118, 10, 3728, 2545, 
10, 11, 37, 6299, 19, 1898, 51, 248, 262, 66, 17, 646, 14, 3183, 65, 460, 13, 4368, 10, 80, 18, 58, 1142, 83, 84, 8, 291, 17, 8, 4319, 92, 10, 46, 38, 1444, 14, 1492, 10, 248, 248, 248, 262, 51, 363, 14, 441, 27, 8, 52, 23, 462, 6919, 3240, 49, 1761, 48, 643, 2011, 27, 1492, 9, 18, 197, 1142, 83, 419, 64, 39, 25, 10, 52, 49, 83, 2076, 66, 8, 1566, 91, 1142, 365, 10, 8, 547, 38, 1444, 14, 1492, 14, 865, 158, 396, 52, 64, 632, 58, 46, 865, 68, 46, 920, 127, 714, 890, 11, 38, 496, 14, 8507, 8, 231, 20, 459, 14, 126, 111, 52, 46, 78, 920, 151, 456, 34, 58515, 66, 8, 758, 1832, 23, 1250, 29, 10, 283, 44, 1832, 23, 2113, 4817, 14, 52557, 11, 43, 496, 14, 4227, 8, 1292, 14, 1492, 55, 65, 248, 262, 661, 262, 51, 728, 66, 18, 58, 1142, 83, 53, 8, 52558, 3900, 10, 46, 133, 95, 83, 4464, 141, 8, 1508, 13, 2938, 10, 34, 389, 349, 92, 17, 440, 32, 87, 1142, 84, 20, 10, 115, 32, 86, 1880, 51, 8, 392, 9, 36, 2261, 28345, 33], [2, 18, 73, 53, 14, 949, 34, 109, 8, 547, 38, 2366, 9, 155, 15, 20, 79, 38, 235, 69, 482, 366, 2366, 115, 79, 38, 333, 366, 9, 190, 437, 79, 15, 235, 69, 482, 366, 115, 333, 42, 22, 9, 26, 63, 14, 157, 482, 366, 38, 146, 280, 115, 333, 366, 9, 18, 73, 53, 28, 302, 84, 69, 333, 366, 115, 482, 9, 11, 16, 37, 57, 147, 142, 16, 22, 53, 20, 17, 12, 188, 13, 289, 235, 69, 482, 22, 9, 30, 18, 73, 41, 217, 252, 32, 26, 8, 432, 24478, 13, 8, 142, 32, 73, 84, 19, 674, 11, 41, 69, 333, 366, 34, 147, 142, 9, 30, 32, 38, 57, 53, 8, 399, 282, 14, 506, 53, 32, 38, 52, 1252, 11, 360, 9, 32, 38, 57, 12, 220, 0, 15770, 9], ...]
Path: /home/ady/Desktop/Dropbox/lgm_notebooks/data/imdb
y: ItemList (25000 items)
[0, 0, 0, ...]
Path: /home/ady/Desktop/Dropbox/lgm_notebooks/data/imdb
In [41]:
labeled_list.valid.x_obj(5)
Out[41]:
'_BOS_ _CAP_ this film truly was poor . i went to the theatre expecting something exciting , and instead was afforded the opportunity to hone my " guess the next plot twist before it happens " skills . _CAP_ seriously , the plot was written with an extra thick crayon so everyone could see . _CAP_ nothing was truly shocking . _CAP_ in fact , even the gore was met with such complete suspension of belief that it really did n\'t add up to much . \n\n _CAP_ the excessive wise cracking and cops talking shop at the crime scenes made it seem all the more phony . _CAP_ and the scene where _CAP_ lambert \'s character is struggling with the clues and reaches his " investigative epiphany " goes to great lengths to indicate the level of intellect expected from the audience - little . \n\n _CAP_ probably the most annoying aspect of the cinematography was the " x - _CAP_ files " treatment : _CAP_ every building in the film , whether it \'s the precinct building , or a house at noon , or a hospital , was suffering from a lack of any discernible lighting ( not to mention a lack of \' patients \' in the case of the hospital ) . i do n\'t recall a single scene when someone flipped on a light switch . _CAP_ it sure would have been nice . \n\n _CAP_ mr. _CAP_ lambert really is n\'t an _CAP_ oscar - grade actor , so i suppose you have to take this film for what it \'s worth . _CAP_ in the end , i \'ve reached the conclusion that the only thing that would make this film seem more entertaining is to watch it after watching " _CAP_ the _CAP_ warriors " . _CAP_ otherwise , you \'re left with an effort that is dull and unoriginal , and nowhere near the equal of films of the genre such as " _CAP_ silence of the _CAP_ lambs " .'
In [42]:
# import pickle
# pickle.dump(labeled_list, open(path/'labeled_list_clas.pkl', 'wb'))
In [43]:
# use a context manager so the file handle is closed after loading
with open(path/'labeled_list_clas.pkl', 'rb') as f:
    labeled_list = pickle.load(f)

Let's check that the labels seem consistent with the texts.

In [44]:
[(labeled_list.train.x_obj(i), labeled_list.train.y_obj(i)) for i in [10, 732, 12552]]
Out[44]:
[('_BOS_ _CAP_ gee , what a crappy movie this was ! i can not understand what people find so scary about " _CAP_ the _CAP_ grudge " . _CAP_ the director plays one trick ( i \'d have to admit a very good one , that is brought to life very stylized ) and then he repeats it for the rest of the movie over and over again . _CAP_ as a consequence i startled a few times in the first quarter of the movie , but once i knew the drill i practically fell asleep as _CAP_ the _CAP_ grudge grew more and more predictable by the minute . _CAP_ to conclude , i can say that there are a lot better movies in the genre to begin with , that the so - called predecessor " _CAP_ the _CAP_ ring " was way scarier and that buying a ticket for " _CAP_ the _CAP_ grudge " is a waste of money .',
  'neg'),
 ('_BOS_ because you can put it on fast forward and watch the inane story , without having to listen to banal dialogue , and be finished in 10 minutes max . _CAP_ come to think of it , even 10 minutes is too much to waste on _CAP_ enid - _CAP_ _UNK_ - meets - struggling - wanna - be - artists . _CAP_ vomit .',
  'neg'),
 ("_BOS_ i totally _UNK_ thought that this was a great movie for _UNK_ wells from _UNK_ island , and promise shown of a barely then known dana _UNK_ was _UNK_ and for that it can hardly be disregarded as meaningless _UNK_ it was n't scary and was n't meant to _UNK_ wo nt ruin the _UNK_ it was unusual the way that it was _UNK_ mean the kids characters were great and i did n't know what to expect in the _UNK_ basic plot also had a lot more to do with these kids than you say the fact that these kids were expert fishermen is very central to the plot especially _UNK_ also helps them out of a jam towards the _UNK_ also has the plus of not being overly _UNK_ think it clocks in at under 95 minutes",
  'pos')]

For the validation set, we will simply sort the samples by length, and we begin with the longest ones for memory reasons (it's better to always have the biggest tensors first).

For the training set, we want some kind of randomness on top of this:

  • we shuffle the texts and build megabatches of size 50 * batch_size
  • we sort each megabatch by length before splitting it into 50 minibatches; that way we get randomized batches of roughly the same length
  • then we make sure the biggest batch comes first and shuffle the order of the other batches
  • we also make sure the last batch stays at the end, because its size is probably smaller than batch_size
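The procedure above can be sketched with plain Python lists. This is a simplified illustration under stated assumptions, not the SortishSampler implementation: `sortish_indices`, `mega_mult` and `seed` are hypothetical names, and edge cases (such as the longest text landing in the final, smaller batch) are not handled specially.

```python
import random

def sortish_indices(lengths, batch_size, mega_mult=50, seed=None):
    """Return sample indices in 'sortish' order (toy sketch)."""
    rng = random.Random(seed)
    idxs = list(range(len(lengths)))
    rng.shuffle(idxs)
    # Sort each megabatch by decreasing length.
    mega = batch_size * mega_mult
    sorted_idxs = []
    for i in range(0, len(idxs), mega):
        chunk = idxs[i:i + mega]
        sorted_idxs.extend(sorted(chunk, key=lambda j: lengths[j], reverse=True))
    # Split into minibatches of neighbouring (similar-length) samples.
    batches = [sorted_idxs[i:i + batch_size]
               for i in range(0, len(sorted_idxs), batch_size)]
    if not batches:
        return []
    # Each batch starts with its longest sample, so the batch holding the
    # overall longest sample is the one with the largest first element.
    longest = max(range(len(batches)), key=lambda b: lengths[batches[b][0]])
    batches[0], batches[longest] = batches[longest], batches[0]
    # Shuffle the batches in between; first and last stay in place.
    middle = batches[1:-1]
    rng.shuffle(middle)
    batches[1:-1] = middle
    return [j for batch in batches for j in batch]

lengths = [12, 340, 57, 8, 1021, 95, 230, 44, 310, 19]
order = sortish_indices(lengths, batch_size=3, mega_mult=2, seed=0)
print([lengths[j] for j in order])
```

Whatever the seed, the printed lengths start with the longest text (1021 here), and each group of three consecutive values comes from neighbours in a sorted megabatch, so padding within a batch stays small.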

Padding: we add the padding token (id of 1) at the end of each sequence to make them all the same size when batching them. Note that we need padding at the end to be able to use PyTorch convenience functions that will let us ignore that padding (more on this later in the ULMFiT notebook).
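On plain Python lists, the padding step looks roughly like this. It is a toy sketch: `pad_batch` is a hypothetical helper that returns nested lists, whereas the library's pad_collate builds a tensor and also collects the labels.

```python
def pad_batch(seqs, pad_idx=1, pad_first=False):
    """Pad every sequence to the length of the longest one in the batch."""
    max_len = max(len(s) for s in seqs)
    padded = []
    for s in seqs:
        padding = [pad_idx] * (max_len - len(s))
        # Padding goes at the end by default, at the start if pad_first=True.
        padded.append(padding + list(s) if pad_first else list(s) + padding)
    return padded

batch = [[2, 7, 19, 29], [2, 18], [2, 7, 83]]
print(pad_batch(batch))
# [[2, 7, 19, 29], [2, 18, 1, 1], [2, 7, 83, 1]]
print(pad_batch(batch, pad_first=True))
# [[2, 7, 19, 29], [1, 1, 2, 18], [1, 2, 7, 83]]
```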

In [45]:
# def pad_collate(samples, pad_idx=1, pad_first=False):
#     # identify the longest document in the minibatch
#     max_len = max([len(sample[0]) for sample in samples])
#     # create rectangular tensor that can accommodate all documents
#     # in the batch up to that max_len, and fill it with padding.
#     results = torch.zeros(len(samples), max_len).long() + pad_idx
#     # take documents in the minibatch and put them in the tensor
#     # keeping padding either at the beginning or at the end
#     for i, sample in enumerate(samples):
#         if pad_first:
#             results[i, -len(sample[0]):] = LongTensor(sample[0])
#         else:
#             results[i, :len(sample[0]) ] = LongTensor(sample[0])
#     return results, tensor([sample[1] for sample in samples])
In [46]:
batch_size = 64
train_sampler = SortishSampler(labeled_list.train.x,
                               key=lambda t: len(labeled_list.train[int(t)][0]),
                               batch_size=batch_size)
train_dl = DataLoader(labeled_list.train, batch_size=batch_size,
                      sampler=train_sampler, collate_fn=pad_collate)

Let's look at one training batch:

In [47]:
iter_dl = iter(train_dl)
x, y = next(iter_dl)

We can see the padding at the end of every review except the first one, which is the longest in the batch:

In [48]:
x
Out[48]:
tensor([[    2,     7,  1148,  ...,    12, 15754,    24],
        [    2,     7,  4521,  ...,     1,     1,     1],
        [    2,     7,    65,  ...,     1,     1,     1],
        ...,
        [    2,    19,  1423,  ...,     1,     1,     1],
        [    2,     7,   175,  ...,     1,     1,     1],
        [    2,   402,    17,  ...,     1,     1,     1]])
In [49]:
x.size()
Out[49]:
torch.Size([64, 3310])
In [50]:
y
Out[50]:
tensor([1, 1, 0, 1, 1, 1, 1, 0, 1, 0, 1, 0, 1, 1, 1, 0, 1, 0, 0, 1, 1, 1, 0, 1,
        1, 1, 0, 0, 1, 1, 1, 1, 1, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1,
        1, 1, 1, 1, 0, 1, 1, 0, 1, 0, 1, 1, 1, 0, 0, 1])
In [51]:
y.size()
Out[51]:
torch.Size([64])

Let's look at the lengths of the documents in this batch. We can get their length by subtracting the number of padding tokens from the length of the tensor:

In [52]:
lengths = []
for i in range(x.size(0)):
    lengths.append(x.size(1) - (x[i]==1).sum().item())
print(lengths)
[3310, 2211, 1950, 1868, 1613, 1428, 1390, 1351, 1350, 1325, 1322, 1318, 1317, 1311, 1304, 1302, 1301, 1291, 1289, 1284, 1263, 1258, 1250, 1243, 1236, 1231, 1220, 1220, 1217, 1188, 1186, 1181, 1179, 1169, 1161, 1160, 1151, 1138, 1137, 1129, 1127, 1126, 1124, 1117, 1112, 1109, 1092, 1087, 1082, 1078, 1076, 1068, 1062, 1062, 1061, 1061, 1054, 1049, 1049, 1034, 1025, 1024, 1021, 1016]

This is the first batch so it has the longest movie review first. The last one is the shortest movie review in the batch.

If we look at the next batch, we see the lengths fall within a much narrower range:

In [53]:
x,y = next(iter_dl)
lengths = []
for i in range(x.size(0)):
    lengths.append(x.size(1) - (x[i]==1).sum().item())
print(lengths)
[359, 359, 358, 358, 357, 357, 356, 356, 356, 355, 355, 355, 355, 355, 355, 354, 354, 353, 353, 352, 352, 351, 351, 350, 350, 350, 350, 350, 349, 349, 349, 349, 348, 348, 347, 347, 347, 347, 346, 346, 345, 345, 344, 344, 344, 344, 343, 343, 342, 342, 342, 342, 342, 341, 340, 340, 340, 340, 339, 339, 339, 339, 339, 339]

And we add a convenience function:

In [54]:
# def get_clas_dls(train_ds, valid_ds, batch_size, **kwargs):
#     train_sampler = SortishSampler(train_ds.x,
#                                    key=lambda t: len(train_ds.x[t]),
#                                    batch_size=batch_size)
#     valid_sampler = SortSampler(valid_ds.x,
#                                 key=lambda t: len(valid_ds.x[t]))
#     return (DataLoader(train_ds, batch_size=batch_size, sampler=train_sampler,
#                        collate_fn=pad_collate, **kwargs),
#             DataLoader(valid_ds, batch_size=batch_size*2, sampler=valid_sampler,
#                        collate_fn=pad_collate, **kwargs))

# def clas_databunchify(splitdata, batch_size, **kwargs):
#     return DataBunch(*get_clas_dls(splitdata.train, splitdata.valid, batch_size, **kwargs))
In [55]:
batch_size = 64
bptt = 70
data = clas_databunchify(labeled_list, batch_size)