Unit_2/01-Introduction.en.srt
1
00:00:00,000 --> 00:00:05,000
Last unit, we learned about regular expressions and finite state machines.
2
00:00:05,000 --> 00:00:09,000
In this unit, we're going to put those concepts together
3
00:00:09,000 --> 00:00:15,000
to make a lexical analyzer--a program that reads in a web page or a bit of JavaScript
4
00:00:15,000 --> 00:00:20,000
and breaks it down into words, just like I might break an English sentence down into words.
5
00:00:20,000 --> 00:00:23,000
This is going to be a really important tool in our arsenal,
6
00:00:23,000 --> 99:59:59,000
and it's one of the first steps towards making a web browser.
Unit_2/02-Welcome_Back.en.srt
1
00:00:00,000 --> 00:00:02,000
Welcome back.
2
00:00:02,000 --> 00:00:07,000
In our last exciting episode we learned about regular expressions,
3
00:00:07,000 --> 00:00:14,000
a concise notation as a way to write down or denote or match a number of strings.
4
00:00:14,000 --> 00:00:20,000
This 4 through 7 in brackets corresponds to 4 different strings 4, 5, 6, and 7.
5
00:00:20,000 --> 00:00:24,000
We learned to write more complicated regular expressions like this one--
6
00:00:24,000 --> 00:00:28,000
"ba+"--this plus means one or more copies of a,
7
00:00:28,000 --> 00:00:35,000
yielding words like ba, baa, baaa, and eventually yielding my sheep.
8
00:00:35,000 --> 00:00:40,000
I assert that it's a sheep. You can tell because of the label. Those labels never lie.
9
00:00:40,000 --> 00:00:47,000
We also learned how you can use regular expressions in Python by importing,
10
00:00:47,000 --> 00:00:52,000
bringing in, the functions and data types from the regular expression library.
11
00:00:52,000 --> 00:00:58,000
An example of such a function was findall, which, given a sort of needle regular expression,
12
00:00:58,000 --> 00:01:03,000
would return all of the places in the haystack that it matched.
13
00:01:03,000 --> 00:01:09,000
We also learned that you could turn regular expressions into finite state machines.
14
00:01:09,000 --> 00:01:14,000
This finite state machine accepts the same language as our ba, baa, baaa
15
00:01:14,000 --> 00:01:16,000
regular expression from above.
16
00:01:16,000 --> 00:01:19,000
Starting in a start state, on a b we transition to the middle state,
17
00:01:19,000 --> 00:01:22,000
on an a we end up in the third state, which is an accepting state.
18
00:01:22,000 --> 00:01:25,000
You can tell by the double circle. Then there's a self loop back.
19
00:01:25,000 --> 99:59:59,000
That was last time.
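A minimal sketch of this recap, assuming Python's standard re module. The haystack strings, the state numbers 1, 2, 3, and the helper name accepts are made up for illustration and are not taken from the lecture slides.

```python
import re

# [4-7] matches exactly four one-character strings: 4, 5, 6, and 7.
digits = re.findall(r"[4-7]", "1234567890")
print(digits)   # ['4', '5', '6', '7']

# ba+ means a "b" followed by one or more copies of "a".
sheep = re.findall(r"ba+", "b ba baa baaa")
print(sheep)    # ['ba', 'baa', 'baaa']

# The same language as a finite state machine: state 1 is the start state,
# state 3 is the accepting state with a self loop on "a".
edges = {(1, "b"): 2, (2, "a"): 3, (3, "a"): 3}
accepting = {3}

def accepts(string, state=1):
    """Return True if the machine ends in an accepting state."""
    for ch in string:
        if (state, ch) not in edges:
            return False          # no transition: the machine rejects
        state = edges[(state, ch)]
    return state in accepting

print(accepts("baaa"))  # True
print(accepts("b"))     # False
```

The dictionary of edges is just one convenient encoding; any table from (state, character) to next state would do.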
Unit_2/03-Specification.en.srt
1
00:00:00,000 --> 00:00:06,000
Now, we're going to learn how to specify important parts of HTML and JavaScript,
2
00:00:06,000 --> 00:00:11,000
and, in an incredible surprise move, we're going to do this specification
3
00:00:11,000 --> 00:00:13,000
using regular expressions.
4
00:00:13,000 --> 00:00:17,000
Just a quick reminder, an outline of the overall project,
5
00:00:17,000 --> 00:00:24,000
we want to start with a web page and then break it down into important words.
6
00:00:24,000 --> 00:00:28,000
Maybe the less than and the greater than sign used for the tag are important words,
7
00:00:28,000 --> 00:00:32,000
the 1, the plus, and the 2, but we're largely ignoring this sort of white space
8
00:00:32,000 --> 00:00:35,000
or these new line characters.
9
00:00:35,000 --> 00:00:41,000
Then we want to take those words and diagram them into this cool tree-like structure.
10
00:00:41,000 --> 00:00:44,000
Of course, this tree is growing upside down, but that won't be a problem at all.
11
00:00:44,000 --> 99:59:59,000
Finally, we're going to interpret that tree to figure out what it means to get the result.
Unit_2/04-Html.en.srt
1
00:00:00,000 --> 00:00:04,000
HTML stands for the "hypertext markup language."
2
00:00:04,000 --> 00:00:07,000
Many of you may have some previous experience with HTML,
3
00:00:07,000 --> 00:00:09,000
but that's not necessary.
4
00:00:09,000 --> 00:00:13,000
HTML as we know it was invented by Tim Berners-Lee,
5
00:00:13,000 --> 00:00:18,000
a British computer scientist working in Switzerland around 1990.
6
00:00:18,000 --> 00:00:24,000
For our purposes, HTML just tells a web browser how to display a webpage.
7
00:00:24,000 --> 00:00:32,000
In fact, HTML is not all that different from using symbols like stars or underscores
8
00:00:32,000 --> 00:00:36,000
to emphasize text that you're writing to someone else.
9
00:00:36,000 --> 00:00:41,000
In HTML this emphasized plain text becomes "I,"
10
00:00:41,000 --> 00:00:47,000
and then this special punctuation that means let's do some bold now, "really,"
11
00:00:47,000 --> 00:00:51,000
this special punctuation that means I'm done with bold, let's go back to normal,
12
00:00:51,000 --> 00:00:54,000
and then "like you." The "b" stands for "bold."
13
00:00:54,000 --> 00:00:56,000
Let's go see how that pans out.
14
00:00:56,000 --> 00:01:01,000
Here in this particular window, I'm showing the raw HTML source on the left
15
00:01:01,000 --> 00:01:04,000
and how it might look in a web browser on the right.
16
00:01:04,000 --> 00:01:08,000
Here when I've added the bold tags around really, we see "I really like you."
17
00:01:08,000 --> 00:01:10,000
The "really" is rendered in bold.
18
00:01:10,000 --> 00:01:16,000
Other common approaches in HTML for emphasis are the use of underlines, the "u,"
19
00:01:16,000 --> 00:01:21,000
and italics, "i." Each one of these is called a "tag."
20
00:01:21,000 --> 00:01:26,000
This special syntax with the "b" in angle brackets and the "b" sort of similar angle brackets,
21
00:01:26,000 --> 00:01:29,000
the "u," the closing "u," the "i," the closing "i,"
22
00:01:29,000 --> 00:01:34,000
is a tag that's associated with that word, or that span of text,
23
00:01:34,000 --> 00:01:36,000
and tells the web browser how to render it.
24
00:01:36,000 --> 00:01:40,000
This part here, the left angle bracket and the right angle bracket--
25
00:01:40,000 --> 00:01:45,000
that's a starting tag, and this other part--the left angle bracket followed by a slash,
26
00:01:45,000 --> 00:01:49,000
the slash is super important, begins an end tag.
27
00:01:49,000 --> 00:01:51,000
That's a little complicated to say.
28
00:01:51,000 --> 00:01:56,000
Mark the start of an ending tag and tell you that the current tag is about to stop.
29
00:01:56,000 --> 00:01:59,000
Here's a beginning bold tag. Here's an ending bold tag.
30
00:01:59,000 --> 99:59:59,000
You can see that play out on the right. Only the word "really" is bolded.
Unit_2/05-Really.en.srt
1
00:00:00,000 --> 00:00:04,000
Let's check your knowledge of that with a multiple choice quiz.
2
00:00:04,000 --> 00:00:09,000
Here a number of times I've written the sentence "George Orwell was really Eric Blair."
3
00:00:09,000 --> 00:00:14,000
You might think I've written it, say, 1984 times, but really, just four.
4
00:00:14,000 --> 00:00:17,000
I hear if you repeat something like this enough, it becomes true.
5
00:00:17,000 --> 00:00:21,000
What I'd like you to do is mark in this multiple multiple choice quiz
6
00:00:21,000 --> 00:00:25,000
which of these will end up showing the word "really" in bold.
7
00:00:25,000 --> 99:59:59,000
They can show other words in bold, but I want to know that they'll show "really" in bold.
Unit_2/06-Really_Solution.en.srt
1
00:00:00,000 --> 00:00:02,000
Let's go through these together.
2
00:00:02,000 --> 00:00:04,000
This is well-formed HTML, we're beginning the bold tag.
3
00:00:04,000 --> 00:00:08,000
It ends after the word "really." This looks great.
4
00:00:08,000 --> 00:00:13,000
Unfortunately, in this next sentence we end the bold before we start it,
5
00:00:13,000 --> 00:00:16,000
and then we start it over here with Eric Blair.
6
00:00:16,000 --> 00:00:19,000
That's not going to work out well. I will show you in just a minute what that looks like.
7
00:00:19,000 --> 00:00:25,000
Here, we begin the bold tag and then we have lots of space and then the word "really."
8
00:00:25,000 --> 00:00:27,000
It turns out that is totally fine.
9
00:00:27,000 --> 00:00:31,000
Web browsers use the same sort of techniques we talked about in the last unit
10
00:00:31,000 --> 00:00:36,000
to break up sentences like these into words based on white space.
11
00:00:36,000 --> 00:00:38,000
All this extra space doesn't matter.
12
00:00:38,000 --> 00:00:42,000
Finally, down here we start bold at the beginning of the sentence,
13
00:00:42,000 --> 00:00:46,000
so all of these words--"George Orwell was really Eric Blair"--
14
00:00:46,000 --> 00:00:51,000
they're all bolded. Notably "really" was bolded as well, so this works out.
15
00:00:51,000 --> 00:00:53,000
Let's go see how this plays out.
16
00:00:53,000 --> 00:00:56,000
Here we have the first option--"George Orwell was really Eric Blair"--
17
00:00:56,000 --> 00:00:58,000
and really is definitely bolded.
18
00:00:58,000 --> 00:01:01,000
If I reverse these, it's harder to interpret.
19
00:01:01,000 --> 00:01:06,000
This bold tag closes nothing. It's ill-balanced. This makes me super unhappy.
20
00:01:06,000 --> 00:01:11,000
But this next one applies to "Eric Blair," and then falls off into the end of the universe.
21
00:01:11,000 --> 00:01:13,000
This isn't very good.
22
00:01:13,000 --> 00:01:15,000
I can put huge numbers of spaces here, and as we see,
23
00:01:15,000 --> 00:01:19,000
this does not influence the rendered web page at all.
24
00:01:19,000 --> 00:01:22,000
Then in this version I have the tags at the start and the end of the sentence,
25
00:01:22,000 --> 00:01:24,000
and the whole sentence is bolded.
26
00:01:24,000 --> 00:01:28,000
George Orwell is perhaps best known for writing 1984,
27
00:01:28,000 --> 99:59:59,000
and I hear that HTML has always been at war with JavaScript.
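A small sketch of why the extra spaces in the third quiz option do not matter, assuming Python's standard re module. The two sample strings and the helper name words are invented for illustration. A real browser does much more than this.

```python
import re

def words(text):
    """Break text into words: each word is a maximal run of non-whitespace."""
    return re.findall(r"\S+", text)

tight = "George Orwell was really Eric Blair"
loose = "George Orwell was          really\n   Eric Blair"

print(words(tight))
print(words(tight) == words(loose))  # True: the spacing is thrown away
```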
Unit_2/07-Tags.en.srt
1
00:00:00,000 --> 00:00:06,000
As we hinted before, this special syntax in HTML is called a tag.
2
00:00:06,000 --> 00:00:11,000
It's kind of like a price tag you might attach to a shirt or another item you're buying.
3
00:00:11,000 --> 00:00:15,000
This modifies nearby text and tells you how to interpret it,
4
00:00:15,000 --> 00:00:18,000
whether you can wash it or not in a machine, how much it costs,
5
00:00:18,000 --> 00:00:21,000
whether or not it should be bolded or underlined--that sort of thing.
6
00:00:21,000 --> 00:00:25,000
Another super common kind of tag is the anchor tag,
7
00:00:25,000 --> 00:00:29,000
which is used to add hyperlinks to webpages.
8
00:00:29,000 --> 00:00:33,000
In some sense this is the defining characteristic of what it means to be a webpage.
9
00:00:33,000 --> 00:00:38,000
Here I've written a fragment of HTML that includes such an anchor tag.
10
00:00:38,000 --> 00:00:43,000
It begins here, but unlike the relatively simple bold and underline tags,
11
00:00:43,000 --> 00:00:46,000
it has an argument.
12
00:00:46,000 --> 00:00:49,000
This means pretty much the same thing it did when we were talking
13
00:00:49,000 --> 00:00:51,000
about functions in Python or math.
14
00:00:51,000 --> 00:00:54,000
Here the argument of my sine function is pi.
15
00:00:54,000 --> 00:01:01,000
Here the argument or modifier for my anchor tag is href equals.
16
00:01:01,000 --> 00:01:05,000
This stands for hypertext reference--the target of this link.
17
00:01:05,000 --> 00:01:10,000
Here I've given a string that is a URL, a web address.
18
00:01:10,000 --> 00:01:12,000
Hypertext transfer protocol google.com.
19
00:01:12,000 --> 00:01:18,000
This text in the middle is often rendered in blue with an underline, although it doesn't have to be.
20
00:01:18,000 --> 00:01:23,000
Then over here we're ending the anchor tag. Let's see how this plays out.
21
00:01:23,000 --> 00:01:26,000
Here I have the old "Eric Blair was really George Orwell" text,
22
00:01:26,000 --> 00:01:31,000
but I've added a new sentence--"Click here for a link to a webpage."
23
00:01:31,000 --> 00:01:35,000
Right after the anchor starts, the text is rendered in a slightly different color.
24
00:01:35,000 --> 00:01:39,000
If we were to click on it, you can potentially see down in the lower left
25
00:01:39,000 --> 00:01:41,000
that it goes to google.com.
26
00:01:41,000 --> 00:01:45,000
Just to break this down, if this is a fragment of HTML,
27
00:01:45,000 --> 00:01:50,000
then the words "Click here and now" will all be drawn on the screen.
28
00:01:50,000 --> 00:01:54,000
This syntax marks the beginning of the anchor tag.
29
00:01:54,000 --> 00:01:59,000
This syntax, left angle bracket slash a right angle bracket,
30
00:01:59,000 --> 00:02:01,000
marks the end of the anchor tag.
31
00:02:01,000 --> 00:02:03,000
This part in here is the argument of the tag.
32
00:02:03,000 --> 00:02:07,000
It contains extra information for things that are more complicated
33
00:02:07,000 --> 99:59:59,000
than simple bold or underline.
Unit_2/08-Interpreting_Html.en.srt
1
00:00:00,000 --> 00:00:05,000
Here I've written a significantly more complicated fragment of HTML,
2
00:00:05,000 --> 00:00:10,000
and I would like you, gentle student, to help me interpret it.
3
00:00:10,000 --> 00:00:13,000
In this multiple-multiple choice quiz, check each box
4
00:00:13,000 --> 99:59:59,000
that corresponds to a word that will be displayed on the screen by the web browser.
Unit_2/09-Interpreting_Html_Solution.en.srt
1
00:00:00,000 --> 00:00:02,000
Let's go through it together.
2
00:00:02,000 --> 00:00:06,000
Href is actually one of the arguments to this anchor tag.
3
00:00:06,000 --> 00:00:08,000
It's not displayed.
4
00:00:08,000 --> 00:00:12,000
If the user clicks on this link, they'll go to Wikipedia.org,
5
00:00:12,000 --> 00:00:16,000
but they'll never see the href. This is not shown.
6
00:00:16,000 --> 00:00:22,000
Mary, on the other hand, is not the name of a tag or the argument to a tag.
7
00:00:22,000 --> 00:00:24,000
It will be shown.
8
00:00:24,000 --> 00:00:27,000
Similarly, Vindication will be shown.
9
00:00:27,000 --> 00:00:31,000
It'll be shown as part of a link and italicized, but it'll be there.
10
00:00:31,000 --> 00:00:37,000
Wikipedia is part of the argument to this anchor tag, so it will not be shown.
11
00:00:37,000 --> 00:00:43,000
Wrote, however, will be shown, and then this i, the italic tag, isn't shown.
12
00:00:43,000 --> 00:00:46,000
Instead, the text is actually slanted.
13
00:00:46,000 --> 00:00:49,000
Here we're taking a look at how it would render in a web browser.
14
00:00:49,000 --> 00:00:54,000
We can see Mary Wollstonecraft wrote A Vindication of the Rights of Women.
15
00:00:54,000 --> 99:59:59,000
Href, Wikipedia, and i are not printed on the screen.
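A rough sketch of the same distinction between displayed words and tag arguments, assuming Python's standard html.parser module. The fragment below is a shortened, hypothetical reconstruction of the quiz text (the exact on-screen HTML is not in the transcript), and the class name DisplayedWords is made up.

```python
from html.parser import HTMLParser

class DisplayedWords(HTMLParser):
    """Collect the words a browser would draw, and the link targets it would not."""

    def __init__(self):
        super().__init__()
        self.words = []   # text outside of tags: shown on screen
        self.links = []   # href arguments: used by the browser, never shown

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if tag == "a" and name == "href":
                self.links.append(value)

    def handle_data(self, data):
        self.words.extend(data.split())

# Hypothetical fragment in the spirit of the quiz.
fragment = 'Mary Wollstonecraft wrote <a href="http://wikipedia.org"><i>A Vindication</i></a>'

parser = DisplayedWords()
parser.feed(fragment)
parser.close()

print(parser.words)  # ['Mary', 'Wollstonecraft', 'wrote', 'A', 'Vindication']
print(parser.links)  # ['http://wikipedia.org']
```

Neither "href" nor "i" shows up in the word list, which matches the quiz answer.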
Unit_2/10-Taking_Html_Apart.en.srt
1
00:00:00,000 --> 00:00:04,000
Now that we understand how HTML works,
2
00:00:04,000 --> 00:00:09,000
we want to separate out these tags from the words that will be displayed on the screen.
3
00:00:09,000 --> 00:00:15,000
Breaking up words like this is actually a surprisingly common task in real life.
4
00:00:15,000 --> 00:00:22,000
For example, ancient Latin was often written or inscribed without spaces.
5
00:00:22,000 --> 00:00:26,000
This particular set of letters "SENATUSPOPULUSQUEROMANUS"
6
00:00:26,000 --> 00:00:30,000
is inscribed on the arch of Titus, which I've doodled over here as a column,
7
00:00:30,000 --> 00:00:33,000
but what can you do? Arches are apparently beyond my power.
8
00:00:33,000 --> 00:00:37,000
I know. It has just become an arch. Those labels never lie.
9
00:00:37,000 --> 00:00:41,000
Roman inscriptions like this were written without spaces,
10
00:00:41,000 --> 00:00:46,000
and it requires a bit of domain knowledge to know how to break this up.
11
00:00:46,000 --> 00:00:52,000
"Senate and the People of Rome." That inscription was made quite some time ago.
12
00:00:52,000 --> 00:00:57,000
Similarly, in many written Asian languages, they don't explicitly include spaces
13
00:00:57,000 --> 00:01:00,000
or punctuation between the various characters or glyphs.
14
00:01:00,000 --> 00:01:04,000
In this particular Japanese example, and both my handwriting and my stroke order
15
00:01:04,000 --> 00:01:08,000
are very, very poor--have pity--some amount of domain knowledge is required to break up
16
00:01:08,000 --> 00:01:14,000
"ano" from "yama"--"that mountain."
17
00:01:14,000 --> 00:01:18,000
Finally, even if you're not familiar with Asian languages or ancient Latin,
18
00:01:18,000 --> 00:01:21,000
you might have seen the same sort of thing in a much more modern guise,
19
00:01:21,000 --> 00:01:23,000
in text messaging.
20
00:01:23,000 --> 00:01:28,000
Some amount of domain knowledge is required to break this up into "I love you"
21
00:01:28,000 --> 00:01:31,000
even though no particular spaces are given.
22
00:01:31,000 --> 00:01:36,000
We will want to do the same thing for HTML to break it up into words
23
00:01:36,000 --> 00:01:40,000
like "Wollstonecraft" and "wrote" that will appear on the screen
24
00:01:40,000 --> 00:01:46,000
or this special left angle bracket slash maneuver that tells us that we're starting an end tag,
25
00:01:46,000 --> 00:01:48,000
this special word in the middle that tells us which tag it was,
26
00:01:48,000 --> 00:01:51,000
and then this closing right angle bracket.
27
00:01:51,000 --> 00:01:56,000
Once again, for this HTML fragment we want to break it up into this first word,
28
00:01:56,000 --> 00:02:02,000
the start of the closing tag, another word, the end of the closing tag,
29
00:02:02,000 --> 00:02:04,000
and then another word.
30
00:02:04,000 --> 00:02:07,000
We're going to need to do this to write our web browser.
31
00:02:07,000 --> 00:02:11,000
In order to interpret HTML and JavaScript, we're going to have to break sentences down
32
00:02:11,000 --> 00:02:15,000
into their component words to figure out what's going on.
33
00:02:15,000 --> 00:02:21,000
This process is called--dun, dun, dun, dun-- lexical analysis.
34
00:02:21,000 --> 00:02:26,000
Lexical here has the same root as "lexicon," like a dictionary.
35
00:02:26,000 --> 00:02:29,000
This means "to break something down into words."
36
00:02:29,000 --> 00:02:36,000
You'll be pleased to know that we're going to use regular expressions to solve this problem.
37
00:02:36,000 --> 00:02:39,000
Here I've written another one of those decompositions.
38
00:02:39,000 --> 00:02:45,000
We might have broken an HTML fragment down into these word-like objects,
39
00:02:45,000 --> 00:02:51,000
but this time you're going to help me out by doing the problem in reverse.
40
00:02:51,000 --> 00:02:58,000
So in this multiple multiple choice quiz, I'd like you to mark each one of these HTML fragments
41
00:02:58,000 --> 99:59:59,000
that would decompose into this sequence of five elements.
Unit_2/11-Taking_Html_Apart_Solution.en.srt
1
00:00:00,000 --> 00:00:04,000
Let's go through it together, and this first one starts with a left angle bracket,
2
00:00:04,000 --> 00:00:07,000
which looks super promising, but then it has this slash
3
00:00:07,000 --> 00:00:11,000
which we don't see reflected up here, so this one doesn't match,
4
00:00:11,000 --> 00:00:14,000
could not have produced this sequence of 5.
5
00:00:14,000 --> 00:00:17,000
Over here we have a left angle bracket, a b, a right angle bracket.
6
00:00:17,000 --> 00:00:19,000
Looking great. Salvador, looking good.
7
00:00:19,000 --> 00:00:22,000
Dali, looking great. Oh, yeah, this totally matches.
8
00:00:22,000 --> 00:00:25,000
Down here we have almost the same sentence, but there's no space
9
00:00:25,000 --> 00:00:28,000
between salvador and dali.
10
00:00:28,000 --> 00:00:32,000
This is very close, but instead of getting 2 separate words at the end,
11
00:00:32,000 --> 00:00:36,000
it would break down into just one word at the end, salvadordali,
12
00:00:36,000 --> 00:00:38,000
so this one doesn't match.
13
00:00:38,000 --> 00:00:42,000
Over here we start with a bold tag, have salvador and then dali,
14
00:00:42,000 --> 00:00:46,000
but then we have a few more characters that aren't shown in this list of 5,
15
00:00:46,000 --> 00:00:48,000
so this doesn't match exactly.
16
00:00:48,000 --> 00:00:51,000
Here we have salvador followed by the bold tag.
17
00:00:51,000 --> 00:00:56,000
That's getting the order wrong, and the order of this breakdown is really going to matter.
18
00:00:56,000 --> 00:00:59,000
We really need to know the order of words in a sentence.
19
00:00:59,000 --> 00:01:01,000
Super important, it is.
20
00:01:01,000 --> 00:01:05,000
Finally, over here we start with bold, and we have salvador dali again.
21
00:01:05,000 --> 00:01:08,000
This looks great. No problems there.
22
00:01:08,000 --> 00:01:10,000
Notice the spacing was a little different.
23
00:01:10,000 --> 00:01:14,000
Here we had a space between the bold tag and salvador.
24
00:01:14,000 --> 00:01:16,000
Here we had kind of a space over here.
25
00:01:16,000 --> 00:01:18,000
These spaces don't matter very much.
26
00:01:18,000 --> 00:01:23,000
Salvador Dali was a Spanish artist famous for his surrealist paintings,
27
00:01:23,000 --> 00:01:27,000
probably most famous for painting The Persistence of--
28
00:01:27,000 --> 99:59:59,000
I can't remember. Let's just go on.
Unit_2/12-Html_Structure.en.srt
1
00:00:00,000 --> 00:00:02,000
Since HTML is structured,
2
00:00:02,000 --> 00:00:08,000
we're going to want to break it up into words and punctuation and word-like elements,
3
00:00:08,000 --> 00:00:14,000
and we use the special term token to mean all of those.
4
00:00:14,000 --> 00:00:18,000
In general, a token can refer to a word, a string,
5
00:00:18,000 --> 00:00:20,000
numbers, punctuation.
6
00:00:20,000 --> 00:00:25,000
It's the smallest unit of the output of a lexical analysis.
7
00:00:25,000 --> 00:00:27,000
Remember, that's what we're currently working on.
8
00:00:27,000 --> 00:00:32,000
Mostly tokens do not refer to white space,
9
00:00:32,000 --> 00:00:38,000
which is just a formal way of referring to the spaces between words.
10
00:00:38,000 --> 00:00:42,000
We're going to be focusing on lexical analysis,
11
00:00:42,000 --> 00:00:45,000
a process whereby we break down a string, like a sentence
12
00:00:45,000 --> 00:00:50,000
or an utterance or a webpage, into a list of tokens.
13
00:00:50,000 --> 00:00:53,000
One string might contain many tokens
14
00:00:53,000 --> 00:00:58,000
in the same way that one sentence might contain many words.
15
00:00:58,000 --> 00:01:02,000
Here I've written 6 HTML tokens,
16
00:01:02,000 --> 00:01:07,000
given them names on the left and examples on the right.
17
00:01:07,000 --> 00:01:10,000
Now, the naming of tokens is a bit arbitrary.
18
00:01:10,000 --> 00:01:15,000
In general, though, tokens are given uppercase names
19
00:01:15,000 --> 00:01:18,000
to help us tell them apart from other words or variables.
20
00:01:18,000 --> 00:01:23,000
Here this left angle corresponds to an angle bracket
21
00:01:23,000 --> 00:01:27,000
facing left presumably--not quite sure how to draw that.
22
00:01:27,000 --> 00:01:29,000
The smaller end is to face.
23
00:01:29,000 --> 00:01:34,000
Left angle slash is a < followed by a /, division sign.
24
00:01:34,000 --> 00:01:39,000
The right angle bracket, > facing to the right.
25
00:01:39,000 --> 00:01:43,000
Here's the angle. Here's the face.
26
00:01:43,000 --> 00:01:45,000
The equal sign is just =.
27
00:01:45,000 --> 00:01:49,000
A string is going to have double quotes around it,
28
00:01:49,000 --> 00:01:55,000
and a word is anything else, welcome to my webpage, punctuation like that.
29
00:01:55,000 --> 00:02:00,000
Now, it turns out that the naming of tokens is not quite an arbitrary matter.
30
00:02:00,000 --> 00:02:02,000
You may think I'm as mad as a hatter.
31
00:02:02,000 --> 00:02:04,000
No, that's a different story.
32
00:02:04,000 --> 00:02:06,000
We're just going to go with these token names for now,
33
00:02:06,000 --> 00:02:09,000
but if you were designing a system from the ground up,
34
00:02:09,000 --> 99:59:59,000
you can rename them to be anything you like.
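A minimal sketch of a lexer for these six tokens, assuming Python's standard re module. The token names LANGLE, LANGLESLASH, RANGLE, EQUAL, STRING, and WORD follow the lecture, but every regular expression here, the TOKEN_SPEC table, and the tokenize helper are assumptions for illustration, not the course's definitions.

```python
import re

# Token names from the lecture; the patterns are illustrative guesses.
TOKEN_SPEC = [
    ("LANGLESLASH", r"</"),          # listed before LANGLE so "</" is not split
    ("LANGLE",      r"<"),
    ("RANGLE",      r">"),
    ("EQUAL",       r"="),
    ("STRING",      r'"[^"]*"'),     # double quotes around anything but a quote
    ("WORD",        r'[^\s<>="]+'),  # anything else, up to whitespace or punctuation
]
MASTER = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))

def tokenize(text):
    """Break a string into a list of (token name, matched text) pairs."""
    tokens = []
    pos = 0
    while pos < len(text):
        if text[pos].isspace():      # white space is skipped, not turned into tokens
            pos += 1
            continue
        match = MASTER.match(text, pos)
        if match is None:
            raise ValueError(f"cannot tokenize at position {pos}: {text[pos]!r}")
        tokens.append((match.lastgroup, match.group()))
        pos = match.end()
    return tokens

print(tokenize("<b>salvador dali"))
print(tokenize('<a href="http://google.com">Click</a>'))
```

The order of the table matters: trying LANGLE first would turn "</" into a LANGLE followed by a WORD starting with a slash.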
Unit_2/13-Specifying_Tokens.en.srt
1
00:00:00,000 --> 00:00:03,000
We're going to use regular expressions,
2
00:00:03,000 --> 00:00:07,000
which are very good at specifying sets of strings
3
00:00:07,000 --> 00:00:09,000
to specify tokens.
4
00:00:09,000 --> 00:00:13,000
Later on we'll want to match a bunch of different tokens
5
00:00:13,000 --> 00:00:17,000
from webpages or JavaScript, and this is how we write out
6
00:00:17,000 --> 00:00:20,000
token definitions in Python.
7
00:00:20,000 --> 00:00:25,000
The t tells us--and it tells the Python system--
8
00:00:25,000 --> 00:00:27,000
that we're declaring a token.
9
00:00:27,000 --> 00:00:30,000
The next letters are the name of the token.
10
00:00:30,000 --> 00:00:33,000
You either get to make this up yourself, or in the homework
11
00:00:33,000 --> 00:00:35,000
I'll tell you what I want it to be.
12
00:00:35,000 --> 00:00:40,000
Tokens are in some sense going to be functions of the text they match.
13
00:00:40,000 --> 00:00:42,000
More on this a bit later.
14
00:00:42,000 --> 00:00:44,000
Skip me for now.
15
00:00:44,000 --> 00:00:47,000
Next, we have a regular expression
16
00:00:47,000 --> 00:00:50,000
corresponding to this token,
17
00:00:50,000 --> 00:00:53,000
which in this case, for the right angle token,
18
00:00:53,000 --> 00:00:57,000
there's really only 1 string it can correspond to,
19
00:00:57,000 --> 00:01:01,000
so we've written out the regular expression that corresponds to a single string.
20
00:01:01,000 --> 00:01:05,000
And then here on the last line we're returning the text of the token unchanged.
21
00:01:05,000 --> 00:01:09,000
We could transform it, and you'll see us do that for more complicated tokens
22
00:01:09,000 --> 00:01:13,000
like numbers where maybe we'll want to change the string 1.2
23
00:01:13,000 --> 00:01:16,000
into the number 1.2.
24
00:01:16,000 --> 00:01:21,000
Now it's your chance to define your first token.
25
00:01:21,000 --> 00:01:24,000
What I would like you to do is write code
26
00:01:24,000 --> 00:01:27,000
in the style of the procedure I just showed you before
27
00:01:27,000 --> 00:01:30,000
for the LANGLESLASH token.
28
00:01:30,000 --> 00:01:33,000
The LANGLESLASH token is surprisingly important.
29
00:01:33,000 --> 00:01:36,000
We really need it to know when all of our tags end.
30
00:01:36,000 --> 00:01:40,000
Use the interpreter to define a procedure
31
00:01:40,000 --> 99:59:59,000
t_LANGLESLASH that matches it.
Unit_2/14-Specifying_Tokens_Solution.en.srt
1
00:00:00,000 --> 00:00:02,000
Let's go through a possible answer together.
2
00:00:02,000 --> 00:00:05,000
I have to name my procedure with t.
3
00:00:05,000 --> 00:00:08,000
That tells the interpreter that I'm defining a token.
4
00:00:08,000 --> 00:00:10,000
Now I give the name of the token, LANGLESLASH.
5
00:00:10,000 --> 00:00:12,000
That was given as part of the problem.
6
00:00:12,000 --> 00:00:14,000
All of our tokens are actually functions.
7
00:00:14,000 --> 00:00:17,000
We've been eliding that bit. We're still going to skip over it.
8
00:00:17,000 --> 00:00:21,000
Next I have to have a regular expression for the string that matches the token,
9
00:00:21,000 --> 00:00:24,000
and here again there's only 1 string that matches,
10
00:00:24,000 --> 99:59:59,000
and I'm going to return the token unchanged.
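The on-screen code is not part of this transcript, so here is a hedged sketch of what such definitions might look like. It assumes a PLY-style convention in which each token is a function whose name starts with t_ and whose docstring is the regular expression. The loop at the bottom is only there to show the pieces and is not part of any lexer library.

```python
import re

def t_RANGLE(token):
    r'>'
    # Only one string can match, so the token is returned unchanged.
    return token

def t_LANGLESLASH(token):
    r'</'
    # Same shape: a name after the t_ prefix, a regular expression, a return.
    return token

# The regular expression of each rule is simply its docstring.
for rule in (t_RANGLE, t_LANGLESLASH):
    name = rule.__name__[2:]          # drop the "t_" prefix
    print(name, "is specified by", repr(rule.__doc__))

print(bool(re.fullmatch(t_LANGLESLASH.__doc__, "</")))  # True
```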
Unit_2/15-Token_Values.en.srt
1
00:00:00,000 --> 00:00:04,000
It's not enough to know that a string contains a token,
2
00:00:04,000 --> 00:00:08,000
just like it's not enough to know that a sentence contains a verb.
3
00:00:08,000 --> 00:00:11,000
We need to know which one it is, and formally
4
00:00:11,000 --> 00:00:14,000
we refer to that as the value of the token.
5
00:00:14,000 --> 00:00:19,000
By default, the value of a token is the value of the string it matched.
6
00:00:19,000 --> 00:00:22,000
We can rebuild it, however. We have the technology.
7
00:00:22,000 --> 00:00:24,000
Let's see how that plays out.
8
00:00:24,000 --> 00:00:27,000
Here I've written a definition for a slightly more complicated token,
9
00:00:27,000 --> 00:00:32,000
a number, one or more copies of the digit 0-9,
10
00:00:32,000 --> 00:00:35,000
and now I would like you, our last, best hope for victory,
11
00:00:35,000 --> 00:00:37,000
to help me understand it.
12
00:00:37,000 --> 00:00:41,000
If the input text is 1368,
13
00:00:41,000 --> 00:00:45,000
what will the value of the token be?
14
00:00:45,000 --> 99:59:59,000
Check all that apply.
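A sketch of the difference between a token's default value and a rebuilt one, again assuming the PLY-style convention of t_ functions with the regular expression as the docstring. The Token class, the match_token helper, and the rule name t_DIGITS are hypothetical stand-ins, and this is not necessarily the definition shown on screen.

```python
import re
from dataclasses import dataclass

@dataclass
class Token:
    """Hypothetical stand-in for a lexer's token object."""
    type: str
    value: object

def t_DIGITS(token):
    r'[0-9]+'
    return token                      # default: value stays the matched string

def t_NUMBER(token):
    r'[0-9]+'
    token.value = int(token.value)    # rebuilt: the string becomes an integer
    return token

def match_token(rule, text):
    """Apply one rule to the whole of text; return its token, or None on no match."""
    match = re.fullmatch(rule.__doc__, text)
    if match is None:
        return None
    return rule(Token(rule.__name__[2:], match.group(0)))

print(match_token(t_DIGITS, "1368"))  # Token(type='DIGITS', value='1368')
print(match_token(t_NUMBER, "1368"))  # Token(type='NUMBER', value=1368)
```

The two rules match exactly the same strings; only the value they hand back differs.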