<p> tags that are not balanced by a corresponding </p>
in an HTML document.
"""
print ("report_unclosed_paras(%r)" % url)
global parser
parser = scrubby.html_parser(load_url(url).decode('utf8'))

# The markup_parser distinguishes several types of objects: tags,
# markup directives, running text, and comments. Each of these
# objects is represented by an instance of markup_object or one of
# its subclasses.

# Every object has a read-only .obj_type attribute that contains a
# string describing what the object is. Open tags (e.g., <p>)
# have an obj_type of "start_tag", which distinguishes them from
# close tags (e.g., </p>) and XML style self-delimiting tags
# (e.g., ...
               ... tag "
               "at line %(line_num)s, column %(col_num)s" %
               dict(line_num = ln, col_num = cn))

    # Of course, it's possible that no matching tag will be found.
    # When that happens, the parser tries a few simple rules to
    # limit the scope of tags. The partner of a correctly-matched
    # start tag should be an end tag with the same name; if it's not,
    # we're seeing an example of the parser's ad-hoc rules in action.
    elif (p.partner.obj_type != 'end_tag' or
          p.partner.tag_name.lower() != 'p'):
        pln, pcn = p.partner.linepos
        print ("Mismatched <p> tag at line %(line_num)s, column %(col_num)s\n"
               " implicitly closed by %(closer)s"
               " at line %(partner_line_num)s, column %(partner_col_num)s" %
               dict(line_num = ln, col_num = cn, partner_line_num = pln,
                    partner_col_num = pcn, closer = p.partner.source))
print ("
...
# ... <img> tag containing the comic is the first image found
# inside the "comic" division.
# So, first we'll find the earliest <div> ...
...
# ... The
# contents of a tag are defined as all those objects between the
# tag and its partner, not including the tag itself. The .first()
# and .find() methods of markup_tag are just wrappers for the
# parser's .first() and .find() methods, which supply the
# search_inside keyword argument in addition to whatever else you
# provide.
image = cdiv.first(tag_name = 'img')
# Now we can just extract the source and grab the image. Start
# and self tags behave like dictionaries with respect to their
# attributes: tag.keys() returns a list of attribute names defined
# on the tag, and tag['attrname'] returns a markup_attribute
# object that represents the attribute. The .value of an
# attribute object gives the text of the value as defined in the
# source file.
src = image['src'].value
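# A small sketch built only on the behaviour described above
# (tag.keys() and tag[name].value). The helper name attribute_values
# is hypothetical and not part of scrubby; it assumes nothing else
# about the library's API.
def attribute_values(tag):
    """Return a plain dict mapping attribute names to their source text."""
    return dict((name, tag[name].value) for name in tag.keys())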
print ("Loading: <%s>" % src)
image_data = load_url(src)
file_name = os.path.basename(src)
print ("Saving: %s" % file_name)
with open(file_name, 'wb') as fp:
    fp.write(image_data)
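
# Aside, not part of the original example: os.path.basename(src) keeps
# any query string or fragment that follows the path in a URL. Below is
# a minimal sketch of a stricter alternative using only the standard
# library (Python 3). The helper name url_file_name and its default
# argument are hypothetical, not part of scrubby.
import posixpath
from urllib.parse import urlsplit, unquote

def url_file_name(url, default = 'download'):
    """Return the last path component of url, ignoring query and fragment."""
    # urlsplit separates the path from any ?query and #fragment parts.
    path = unquote(urlsplit(url).path)
    # URL paths always use forward slashes, so posixpath is the right tool.
    return posixpath.basename(path) or default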
print ("