Discussion:
The joy of format-immutable, scrape-able text
oldernow
2024-04-24 14:14:44 UTC
Isn't it nice when you write a script to scrape and parse
online text, and whoever owns the site containing it keeps
the format the same for fairly long periods of time?

A couple years ago, I wrote a script called "mlb" that
scrapes a site showing American Major League Baseball
standings (it's not the official mlb.com site, which
is utterly useless for that, being a typical modern
JavaScript nightmare). And it still works! I present it
as a fairly simple example of how to leverage Lua for
such scraping, in case a Lua neophyte chances upon this post:

(Pardon the absence of comments and intervening empty
lines, but I no longer believe in such, because I generally
can't even figure out what my comments mean no matter
how hard I try to make them enduringly meaningful, and I
prefer being able to see more code at once to having parts
of it set apart by empty lines.)

(In a nutshell, it gets formatted web page output
via "elinks -dump", parses it, and presents just the
information I'm interested in via "less -X" (I can't
remember what the -X accomplishes, and the 'less' man page
didn't clarify what I apparently understood back when I
wrote the script).)

----------------------------------------------------------
#! /usr/bin/env lua
local show = false
local function remove_playoff_indicators(s)
  s = string.gsub(s, ' x%-', ' ')
  s = string.gsub(s, ' y%-', ' ')
  return s
end
local out = io.popen('less -X', 'w')
local handle = io.popen('elinks -dump https://www.baseball-reference.com/leagues/MLB-standings.shtml')
for line in handle:lines() do
  line = remove_playoff_indicators(line)
  if show then
    if string.match(line, 'Major League Baseball Detailed Standings') then
      break
    else
      line = string.gsub(line, '%[%d+%]', ' ')
      if not string.match(line, '^%s*$') and string.match(line, '^%s') then
        out:write(line .. '\n')
      end
    end
  else
    if string.match(line, 'East Division Table') then
      show = true
      line = string.gsub(line, '%[%d+%]', ' ')
      if not string.match(line, '^%s*$') and string.match(line, '^%s') then
        out:write(line .. '\n')
      end
    end
  end
end
handle:close()
out:close()
----------------------------------------------------------
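
For anyone who wants to try the filtering logic without hitting the network, here is a minimal sketch with the loop pulled into a function that takes a list of lines. The names clean and filter_standings and the sample lines are made up for illustration; they are not part of the script above, and real "elinks -dump" output may be spaced differently.

```lua
-- Hypothetical helper: strips the " x-" / " y-" playoff markers and the
-- "[123]"-style link references that elinks adds to dumped pages.
local function clean(line)
  line = line:gsub(' [xy]%-', ' ')
  line = line:gsub('%[%d+%]', ' ')
  return line
end

-- Hypothetical refactoring of the main loop: returns the lines the script
-- would have written to the pager, instead of writing them.
local function filter_standings(lines)
  local kept, show = {}, false
  for _, line in ipairs(lines) do
    line = clean(line)
    if not show and line:match('East Division Table') then
      -- The first division heading switches output on.
      show = true
    elseif show and line:match('Major League Baseball Detailed Standings') then
      -- Everything from the detailed standings onward is ignored.
      break
    end
    -- Keep only indented, non-blank lines once output is switched on.
    if show and line:match('^%s') and not line:match('^%s*$') then
      kept[#kept + 1] = line
    end
  end
  return kept
end

-- Made-up sample input, loosely shaped like a dumped page.
local sample = {
  'Skip this header',
  '   East Division Table',
  '   Tm W L W-L% GB',
  '   [12]New York Yankees 16 8 .667 --',
  '',
  'unindented noise',
  '   x-Baltimore Orioles 15 8 .652 0.5',
  '   Major League Baseball Detailed Standings',
  '   never reached',
}

local result = filter_standings(sample)
for _, line in ipairs(result) do
  print(line)
end
```

With this sample, four lines survive: the division heading, the column header, and the two team rows with their markers stripped. Feeding it the lines from io.popen('elinks -dump ...') instead of the sample table should behave like the original loop, though that is untested here.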

How the output in "less" looks at the moment:

----------------------------------------------------------
  East Division Table
  Tm                      W   L  W-L%    GB
  New York Yankees       16   8  .667    --
  Baltimore Orioles      15   8  .652   0.5
  Boston Red Sox         13  11  .542   3.0
  Toronto Blue Jays      13  11  .542   3.0
  Tampa Bay Rays         12  13  .480   4.5
  Central Division Table
  Tm                      W   L  W-L%    GB
  Cleveland Guardians    17   6  .739    --
  Detroit Tigers         14  10  .583   3.5
  Kansas City Royals     14  10  .583   3.5
  Minnesota Twins         9  13  .409   7.5
  Chicago White Sox       3  20  .130  14.0
  West Division Table
  Tm                      W   L  W-L%    GB
  Seattle Mariners       12  11  .522    --
  Texas Rangers          12  12  .500   0.5
  Los Angeles Angels     10  14  .417   2.5
  Oakland Athletics       9  15  .375   3.5
  Houston Astros          7  17  .292   5.5
  East Division Table
  Tm                      W   L  W-L%    GB
  Atlanta Braves         16   6  .727    --
  Philadelphia Phillies  15   9  .625   2.0
  New York Mets          12  11  .522   4.5
  Washington Nationals   10  12  .455   6.0
  Miami Marlins           6  19  .240  11.5
  Central Division Table
  Tm                      W   L  W-L%    GB
  Milwaukee Brewers      14   8  .636    --
  Chicago Cubs           14   9  .609   0.5
  Cincinnati Reds        13  10  .565   1.5
  Pittsburgh Pirates     13  11  .542   2.0
  St. Louis Cardinals    10  14  .417   5.0
  West Division Table
  Tm                      W   L  W-L%    GB
  Los Angeles Dodgers    14  11  .560    --
  San Diego Padres       13  13  .500   1.5
  Arizona Diamondbacks   12  13  .480   2.0
  San Francisco Giants   12  13  .480   2.0
  Colorado Rockies        6  18  .250   7.5
(END)
----------------------------------------------------------

Ain't that a thing of pure beauty to any "terminal first"
- or, better yet, "terminal only" - types out there?

Also, go Yankees and Brewers! :-)
--
oldernow
xyz001 at nym.hush.com
oldernow
2024-04-24 14:22:05 UTC
On 2024-04-24, oldernow <***@dev.null> wrote:

(NOTE: in the following, pretend <TAB> is having actually pressed
the tab key to input a tab character...)

Gosh DAMN it! Dumbo, here, did a vim ":%s/<TAB>/  /"
(substitute tab characters with two spaces) in the
initial post so that tabs wouldn't default to something
too big... but *should* have done an ":%s/<TAB>/  /g"
(see the 'g' at the end) so that *all* tab characters
(not just the first on a line) would have been substituted.

Sorry about that possibly making the Lua source code look
like more of a mess than intended!
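
The same first-match-versus-every-match distinction shows up in Lua's string.gsub, whose optional fourth argument caps the number of substitutions. A small sketch, assuming a made-up sample line containing two tabs:

```lua
-- Hypothetical sample line with two tab characters.
local line = '\tif show then\treturn end'

-- A limit of 1 replaces only the first tab, much like vim's :s without 'g'.
local first_only = line:gsub('\t', '  ', 1)

-- With no limit, every tab is replaced, much like vim's :s with 'g'.
local all = line:gsub('\t', '  ')

print(first_only)
print(all)
```

Note that gsub also returns the number of substitutions as a second value; assigning to a single variable, as above, simply drops it.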
--
oldernow
xyz001 at nym.hush.com
D
2024-04-24 21:55:41 UTC
Post by oldernow
Isn't it nice when you write a script to scrape and parse
online text, and whoever owns the site containing it keeps
the format the same for fairly long periods of time?
That is indeed a joy! My longest-lived sources of information online are
RSS feeds. I parse them with a script and generate emails, which I send
to myself for consumption in my favourite email client, alpine.

I am also anachronistic in that I like the old-school TV-text
pages which still exist in Sweden.

They do have an internet interface, so when I'm travelling I also have a
script that converts them to an email as well. ;)

When it comes to baseball, I thought that common sense and common truth
proclaimed that the White Sox are the best, right? At least that's what
I learned when living in Chicago, and I thought it was a lesson taught to
all Americans. ;)
oldernow
2024-04-25 12:42:06 UTC
Post by D
Post by oldernow
Isn't it nice when you write a script to scrape and parse
online text, and whoever owns the site containing it keeps
the format the same for fairly long periods of time?
That is indeed a joy! My longest-lived sources of
information online are RSS feeds. I parse them with a
script and generate emails, which I send to myself for
consumption in my favourite email client, alpine.
I am also anachronistic in the way that I like the old
school TV-text pages which still exist in Sweden.
They do have an internet interface, so when I'm travelling
I also have a script that converts them to an email as
well. ;)
I used to do stuff like that.

But I generally avoid "news", and when it comes to blogs
(and their gopher/gemini counterparts), I'm fine with
stumbling upon things in a sort of "how Berners-Lee
intended it" kind of way. I'm to the point of banning
bookmarks from my life, even, mostly because - as odd as
it may sound - there seems to be no greater death-knell
in my life for someone I enjoy reading, or for specific
articles I found particularly good, than to bookmark
them. For me, articles related by links are a better,
more natural/holistic form of bookmarking, and articles
disappearing - complete with that creating bad links
elsewhere - seems far more in accord with "real" life, and
stockpiling bookmarks seems more an artificial aberration
and somehow counterintuitively lowers the chances of my
ever reading things I've saved bookmarks to again.
Post by D
When it comes to baseball, I thought that common sense
and common truth proclaimed that the White Sox are the
best, right? At least that's what I learned when living
in Chicago, and I thought it was a lesson taught to all
Americans. ;)
I've rooted for the White Sox here and there.
--
oldernow
xyz001 at nym.hush.com
D
2024-04-27 21:41:19 UTC
Post by oldernow
Post by D
Post by oldernow
Isn't it nice when you write a script to scrape and parse
online text, and whoever owns the site containing it keeps
the format the same for fairly long periods of time?
That is indeed a joy! My longest-lived sources of
information online are RSS feeds. I parse them with a
script and generate emails, which I send to myself for
consumption in my favourite email client, alpine.
I am also anachronistic in the way that I like the old
school TV-text pages which still exist in Sweden.
They do have an internet interface, so when I'm travelling
I also have a script that converts them to an email as
well. ;)
I used to do stuff like that.
But I generally avoid "news", and when it comes to blogs
(and their gopher/gemini counterparts), I'm fine with
stumbling upon things in a sort of "how Berners-Lee
intended it" kind of way. I'm to the point of banning
bookmarks from my life, even, mostly because - as odd as
it may sound - there seems to be no greater death-knell
in my life for someone I enjoy reading, or for specific
articles I found particularly good, than to bookmark
them. For me, articles related by links are a better,
more natural/holistic form of bookmarking, and articles
disappearing - complete with that creating bad links
elsewhere - seems far more in accord with "real" life, and
stockpiling bookmarks seems more an artificial aberration
and somehow counterintuitively lowers the chances of my
ever reading things I've saved bookmarks to again.
I do have quite a nice and curated bookmark list. I use it most often to
keep track of books I want to buy, and potential gifts for myself and my
family.

Apart from that, I'd say that I only use about 1% regularly if at all. The
rest were bookmarked once, because of the enormously happy feeling in the
soul that now it is "forever" within my control. ;)

What tends to happen 99% of the time is informal bookmarking. That means I
use a site, it's stored in the history so pops up first, and I use it again
and again... until I don't.

So the bookmarks are a nice illusion.

As for authors, they do tend to become worse over time. Since you liked
them at one point in time, at one stage in their career, it is inevitable.
The author wants to develop and try new things, and that's when they stop
being interesting for me. Almost always. And very few authors resign
themselves to writing the same type of fiction all their lives.
Post by oldernow
Post by D
When it comes to baseball, I thought that common sense
and common truth proclaimed that the White Sox are the
best, right? At least that's what I learned when living
in Chicago, and I thought it was a lesson taught to all
Americans. ;)
I've rooted for the White Sox here and there.
Ahh... so there was some truth in it perhaps!
oldernow
2024-04-28 01:57:57 UTC
Post by D
I do have quite a nice and curated bookmark list. I use
it most often to keep track of books I want to buy, and
potential gifts for myself and my family.
Apart from that, I'd say that I only use about 1% regularly
if at all. The rest were bookmarked once, because of
the enormously happy feeling in the soul that now it is
"forever" within my control. ;)
What tends to happen 99% of the time is informal
bookmarking. That means I use a site, it's stored in
the history so pops up first, and I use it again and
again... until I don't.
So the bookmarks are a nice illusion.
As for authors, they do tend to become worse over
time. Since you liked them at one point in time, at one
stage in their career, it is inevitable. The author wants
to develop and try new things, and that's when they stop being
interesting for me. Almost always. And very few authors
resign themselves to writing the same type of fiction all
their lives.
I can't say they were my favorite authors, but I got
into John Updike and Joyce Carol Oates back in the 1990s,
and their writing worked absolute magick on my vocabulary.
--
oldernow
xyz001 at nym.hush.com
D
2024-04-28 09:47:46 UTC
Post by oldernow
Post by D
I do have quite a nice and curated bookmark list. I use
it most often to keep track of books I want to buy, and
potential gifts for myself and my family.
Apart from that, I'd say that I only use about 1% regularly
if at all. The rest were bookmarked once, because of
the enormously happy feeling in the soul that now it is
"forever" within my control. ;)
What tends to happen 99% of the time is informal
bookmarking. That means I use a site, it's stored in
the history so pops up first, and I use it again and
again... until I don't.
So the bookmarks are a nice illusion.
As for authors, they do tend to become worse over
time. Since you liked them at one point in time, at one
stage in their career, it is inevitable. The author wants
to develop and try new things, and that's when they stop being
interesting for me. Almost always. And very few authors
resign themselves to writing the same type of fiction all
their lives.
I can't say they were my favorite authors, but I got
into John Updike and Joyce Carol Oates back in the 1990s,
and their writing worked absolute magick on my vocabulary.
I've only read one Joyce Carol Oates book in my life and that was a
history of boxing. Can you recommend a good second one?
oldernow
2024-04-28 12:04:01 UTC
Post by D
Post by oldernow
I can't say they were my favorite authors, but I got
into John Updike and Joyce Carol Oates back in the 1990s,
and their writing worked absolute magick on my vocabulary.
I've only read one Joyce Carol Oates book in my life and
that was a history of boxing. Can you recommend a good
second one?
I liked them all. But my JCO reading indulgence was too
long ago to remember any favorite(s).

In a way, for me the stories took somewhat of a back seat
compared to the writing itself when it came to John Updike
and Joyce Carol Oates. I remember being in a perpetual
state of amazement at what could be done with English.
--
oldernow
xyz001 at nym.hush.com
D
2024-04-28 13:21:33 UTC
Post by oldernow
Post by D
Post by oldernow
I can't say they were my favorite authors, but I got
into John Updike and Joyce Carol Oates back in the 1990s,
and their writing worked absolute magick on my vocabulary.
I've only read one Joyce Carol Oates book in my life and
that was a history of boxing. Can you recommend a good
second one?
I liked them all. But my JCO reading indulgence was too
long ago to remember any favorite(s).
In a way, for me the stories took somewhat of a back seat
compared to the writing itself when it came to John Updike
and Joyce Carol Oates. I remember being in a perpetual
state of amazement at what could be done with English.
It would be interesting to give them a chance. Since I'm not a native
English speaker, maybe it would be incomprehensible to me or just plain
old "boring". ;)

I think I'll have to hunt around Anna's Archive to find me a good ebook or
two.
oldernow
2024-04-28 14:23:05 UTC
Post by D
Post by oldernow
In a way, for me the stories took somewhat of a back seat
compared to the writing itself when it came to John Updike
and Joyce Carol Oates. I remember being in a perpetual
state of amazement at what could be done with English.
It would be interesting to give them a chance. Since
I'm not a native English speaker, maybe it would be
incomprehensible to me or just plain old "boring". ;)
Let's put it this way: I *thought* I was a native English
speaker until I read them - complete with consulting a
dictionary frequently.
--
oldernow
xyz001 at nym.hush.com
D
2024-04-28 18:15:00 UTC
Post by oldernow
Post by D
Post by oldernow
In a way, for me the stories took somewhat of a back seat
compared to the writing itself when it came to John Updike
and Joyce Carol Oates. I remember being in a perpetual
state of amazement at what could be done with English.
It would be interesting to give them a chance. Since
I'm not a native English speaker, maybe it would be
incomprehensible to me or just plain old "boring". ;)
Let's put it this way: I *thought* I was a native English
speaker until I read them - complete with consulting a
dictionary frequently.
Wow! Now I'm curious!
oldernow
2024-04-29 13:21:25 UTC
Post by D
Post by oldernow
Post by D
Post by oldernow
In a way, for me the stories took somewhat of a back seat
compared to the writing itself when it came to John Updike
and Joyce Carol Oates. I remember being in a perpetual
state of amazement at what could be done with English.
It would be interesting to give them a chance. Since
I'm not a native English speaker, maybe it would be
incomprehensible to me or just plain old "boring". ;)
Let's put it this way: I *thought* I was a native English
speaker until I read them - complete with consulting a
dictionary frequently.
Wow! Now I'm curious!
Then I gots to hope you're not part feline, 'cuz I don't
want to find out the sharing ended ya! :-)
--
oldernow
xyz001 at nym.hush.com