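A Notepad I Can Read Anywhere

Notes from a certain systems engineer. Feel free to leave a comment if you have anything to say.

Stop Words Including HTML Special Characters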


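Introduction

While doing natural language processing on documents collected from the web, leftover HTML special characters (entity names such as nbsp or amp) kept getting in the way, so I built a stop list that includes them.*1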

Stop list

a
a's
aacute
able
about
above
according
accordingly
acirc
across
actually
acute
aelig
after
afterwards
again
against
agrave
ain't
alefsym
all
allow
allows
almost
alone
along
alpha
already
also
although
always
am
among
amongst
amp
an
and
ang
another
any
anybody
anyhow
anyone
anything
anyway
anyways
anywhere
apart
appear
appreciate
appropriate
are
aren't
aring
around
as
aside
ask
asking
associated
asymp
at
atilde
auml
available
away
awfully
b
bdquo
be
became
because
become
becomes
becoming
been
before
beforehand
behind
being
believe
below
beside
besides
best
beta
better
between
beyond
both
brief
brvbar
bull
but
by
c
c'mon
c's
came
can
can't
cannot
cant
cap
cause
causes
ccedil
cedil
cent
certain
certainly
changes
chi
circ
clearly
clubs
co
com
come
comes
concerning
cong
consequently
consider
considering
contain
containing
contains
copy
corresponding
could
couldn't
course
crarr
cup
curren
currently
d
dagger
darr
definitely
deg
delta
described
despite
diams
did
didn't
different
divide
do
does
doesn't
doing
don't
done
down
downwards
during
e
each
eacute
ecirc
edu
eg
egrave
eight
either
else
elsewhere
empty
emsp
enough
ensp
entirely
epsilon
equiv
especially
et
eta
etc
eth
euml
even
ever
every
everybody
everyone
everything
everywhere
ex
exactly
example
except
exist
f
far
few
fifth
first
five
fnof
followed
following
follows
for
forall
former
formerly
forth
four
frasl
from
further
furthermore
g
gamma
ge
get
gets
getting
given
gives
go
goes
going
gone
got
gotten
greetings
gt
h
had
hadn't
happens
hardly
harr
has
hasn't
have
haven't
having
he
he's
hearts
hellip
hello
help
hence
her
here
here's
hereafter
hereby
herein
hereupon
hers
herself
hi
him
himself
his
hither
hopefully
how
howbeit
however
i
i'd
i'll
i'm
i've
iacute
icirc
ie
iexcl
if
ignored
igrave
image
immediate
in
inasmuch
inc
indeed
indicate
indicated
indicates
infin
inner
insofar
instead
int
into
inward
iota
iquest
is
isin
isn't
it
it'd
it'll
it's
its
itself
iuml
j
just
k
kappa
keep
keeps
kept
know
known
knows
l
lambda
lang
laquo
larr
last
lately
later
latter
latterly
lceil
ldquo
le
least
less
lest
let
let's
lfloor
like
liked
likely
little
look
looking
looks
lowast
loz
lrm
lsaquo
lsquo
lt
ltd
m
macr
mainly
many
may
maybe
mdash
me
mean
meanwhile
merely
micro
middot
might
minus
more
moreover
most
mostly
mu
much
must
my
myself
n
nabla
name
namely
nbsp
nd
ndash
ne
near
nearly
necessary
need
needs
neither
never
nevertheless
new
next
ni
nine
no
nobody
non
none
noone
nor
normally
not
nothing
notin
novel
now
nowhere
nsub
ntilde
nu
o
oacute
obviously
ocirc
oelig
of
off
often
ograve
oh
ok
okay
old
oline
omega
omicron
on
once
one
ones
only
onto
oplus
or
ordf
ordm
oslash
other
others
otherwise
otilde
otimes
ought
ouml
our
ours
ourselves
out
outside
over
overall
own
p
para
part
particular
particularly
per
perhaps
permil
perp
phi
pi
piv
placed
please
plus
plusmn
possible
pound
presumably
prime
probably
prod
prop
provides
psi
q
que
quite
quot
qv
r
radic
rang
raquo
rarr
rather
rceil
rd
rdquo
re
real
really
reasonably
reg
regarding
regardless
regards
relatively
respectively
rfloor
rho
right
rlm
rsquo
s
said
same
saw
say
saying
says
sbquo
scaron
sdot
second
secondly
sect
see
seeing
seem
seemed
seeming
seems
seen
self
selves
sensible
sent
serious
seriously
seven
several
shall
she
should
shouldn't
shy
sigma
sigmaf
sim
since
six
so
some
somebody
somehow
someone
something
sometime
sometimes
somewhat
somewhere
soon
sorry
spades
specified
specify
specifying
still
sub
sube
such
sum
sup
supe
sure
szlig
t
t's
take
taken
tau
tell
tends
th
than
thank
thanks
thanx
that
that's
thats
the
their
theirs
them
themselves
then
thence
there
there's
thereafter
thereby
therefore
therein
theres
thereupon
these
theta
thetasym
they
they'd
they'll
they're
they've
think
thinsp
third
this
thorn
thorough
thoroughly
those
though
three
through
throughout
thru
thus
tilde
times
to
together
too
took
toward
towards
trade
tried
tries
truly
try
trying
twice
two
u
uacute
uarr
ucirc
ugrave
uml
un
under
unfortunately
unless
unlikely
until
unto
up
upon
upsih
upsilon
us
use
used
useful
uses
using
usually
uucp
uuml
v
value
various
very
via
viz
vs
w
want
wants
was
wasn't
way
we
we'd
we'll
we're
we've
weierp
welcome
well
went
were
weren't
what
what's
whatever
when
whence
whenever
where
where's
whereafter
whereas
whereby
wherein
whereupon
wherever
whether
which
while
whither
who
who's
whoever
whole
whom
whose
why
will
willing
wish
with
within
without
won't
wonder
would
wouldn't
x
xi
y
yacute
yen
yes
yet
you
you'd
you'll
you're
you've
your
yours
yourself
yourselves
yuml
z
zero
zeta
zwj
zwnj
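
As a rough illustration of how a list like this might be applied, here is a minimal sketch that filters tokens against it. The helper names (loadStopwords, tokenize, removeStopwords) and the sample sentence are assumptions made for this example and are not part of the original workflow. The tokenizer is deliberately naive and only keeps runs of ASCII letters and apostrophes.

```javascript
// Hypothetical helper: builds a Set from stop list text (one word per line).
function loadStopwords(text) {
  return new Set(
    text
      .split(/\r?\n/)
      .map((line) => line.trim().toLowerCase())
      .filter((line) => line.length > 0)
  );
}

// Hypothetical tokenizer: lowercases the input and keeps runs of letters and
// apostrophes, so contractions such as "don't" can match entries in the list.
function tokenize(text) {
  return text.toLowerCase().match(/[a-z']+/g) || [];
}

// Drops every token that appears in the stop word set.
function removeStopwords(text, stopwords) {
  return tokenize(text).filter((token) => !stopwords.has(token));
}

// A tiny excerpt of the list above, inlined to keep the example self-contained.
const stopwords = loadStopwords("a\nand\namp\nnbsp\nthe\nis\ndon't\n");

console.log(removeStopwords("Tom &amp; Jerry is the&nbsp;cartoon", stopwords));
// → ["tom", "jerry", "cartoon"]
```

Because unparsed entities such as &amp; and &nbsp; tokenize to amp and nbsp, having those names in the stop list removes them together with the ordinary English stop words.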

Reference sites

This stop word list combines the lists from the following two sites.
1. http://jmlr.org/papers/volume5/lewis04a/a11-smart-stop-list/english.stop
2. http://pst.co.jp/powersoft/html/index.php?f=3401
The steps I used to combine them are described below.

Notes on the merge procedure

From the HTML special character reference page (site 2), extract each string enclosed between & and ;.*2

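// XPaths of the rows in the two entity tables on the reference page
const xpaths = ["/html/body/div/div/div[3]/table/tbody/tr",
                "/html/body/div/div/div[4]/table/tbody/tr"];
for (const xpath of xpaths) {
  // ORDERED_NODE_SNAPSHOT_TYPE has the numeric value 7
  const nodes = document.evaluate(xpath, document, null,
                                  XPathResult.ORDERED_NODE_SNAPSHOT_TYPE, null);
  for (let i = 0; i < nodes.snapshotLength; i++) {
    // innerHTML serializes "&" as "&amp;", so a displayed "&name;" appears as "&amp;name;"
    const entity = nodes.snapshotItem(i).childNodes[1].innerHTML.match(/&amp;([a-zA-Z]+);/);
    if (entity && entity[1]) {
      console.log(entity[1]);
    }
  }
}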
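
To make the matching step easier to follow without a browser, here is a small sketch that applies an equivalent pattern to plain strings. The helper name extractEntityName and the sample cell contents are made up for illustration. The real page's markup may differ.

```javascript
// Equivalent to the pattern used in the snippet above: in a cell's innerHTML a
// literal "&" is serialized as "&amp;", so a displayed "&nbsp;" becomes "&amp;nbsp;".
const ENTITY_PATTERN = /&amp;([a-zA-Z]+);/;

// Hypothetical helper: returns the entity name found in one cell, or null.
function extractEntityName(cellInnerHtml) {
  const match = cellInnerHtml.match(ENTITY_PATTERN);
  return match ? match[1] : null;
}

// Made-up cell contents for illustration only.
const sampleCells = ["&amp;nbsp;", "<code>&amp;Aacute;</code>", "&amp;#160;", "no entity here"];

const names = sampleCells.map(extractEntityName).filter((name) => name !== null);
console.log(names);
// → ["nbsp", "Aacute"]
```

Note that this pattern only accepts letters, so numeric references such as &#160; and entity names containing digits (for example sup2) are skipped.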

Append the HTML special character names from 2. to the stop words from 1. (saved as stopwords.txt), then merge:

$ tr 'A-Z' 'a-z' < stopwords.txt | sort -u > stopwords_merge.txt
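
For readers who prefer to stay in JavaScript, the same lowercase, sort, and deduplicate steps could be sketched as follows. The function name mergeStopwords and the two sample arrays are assumptions for this example. Unlike the shell pipeline, this sketch also trims whitespace and drops empty entries.

```javascript
// Hypothetical helper mirroring the pipeline: lowercase, drop duplicates, sort.
function mergeStopwords(...lists) {
  const words = lists
    .flat()
    .map((word) => word.trim().toLowerCase())
    .filter((word) => word.length > 0);
  // Set removes duplicates; the default sort compares UTF-16 code units,
  // which for ASCII input gives the same order as byte-wise sorting.
  return [...new Set(words)].sort();
}

// Tiny made-up excerpts standing in for the two sources.
const smartList = ["a", "able", "about", "and"];
const entityNames = ["Aacute", "aacute", "amp", "AElig"];

console.log(mergeStopwords(smartList, entityNames));
// → ["a", "aacute", "able", "about", "aelig", "amp", "and"]
```

Lowercasing before deduplication matters here because the entity list contains pairs such as Aacute and aacute that collapse into a single entry.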

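*1: Admittedly, the obvious objection is that I should just parse the HTML properly in the first place, but I am leaving this here in case someone finds it useful.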

*2: For example, the amp in &amp;