Jump to content

Talk:String-searching algorithm

Page contents not supported in other languages.
From Wikipedia, the free encyclopedia

Untitled

[edit]

Moved from article

These descriptions are insufficient!

Couldm somebody expand this entry ?

The "Finite Automata" link is entirely unhelpful, redirection to a page about finite state machines that has no information on string searching. --Furrykef 05:53, 29 Jun 2004 (UTC)

Nasty encodings

[edit]

Some mention should be made of nasty character encodings that require special precautions when searching.......... Plugwash 11:35, 21 March 2006 (UTC)[reply]

Matching times

[edit]

Are the matching times listed in the comparison table factual at all? I see most of them have been marked as Θ(n). Is it just a stub? References? --Bisqwit 11:46, 26 May 2006 (UTC)[reply]

Also, Θ() notation is used incorrectly, e.g. naive string searching is O(nm) -- when the two strings are the same, running time is Θ(n).

DFA search image

[edit]

I've just fixed up the DFA image Image:DFA search mommy.svg so that it renders again. It should be useful here as a demonstration of the simple DFA-based algorithm, but I'm not sure where to put it. Please assist. Dcoetzee 03:57, 2 April 2007 (UTC)[reply]

[edit]

Hi,

  I would like to include the following link which has more algorithms for exact string search EXACT STRING MATCHING ALGORITHMS.
  Any other thoughts also welcome.

Thanks czar — Preceding unsigned comment added by Crxz0193 (talkcontribs) 06:37, 23 April 2012 (UTC)[reply]

O( (N/M + M) logM ) for alphabet size 2 and up, optimal average case.

[edit]

Alpha Skip Search was published at Combinatorial Pattern Matching 1998. The method appears to be optimal on average, for all alphabet sizes, and large M,N. Keys are sequences of logM letters. Preprocess the pattern into a lookup table with M-logM+1 keys of size logM (log base is alphabet size), encoding the key position in the pattern. Then skip through the text by length M-logM+1, building a logM sized key, and try to match the key in the lookup table. If a match is found, then try naively to match the pattern at the implied start position in the text. The expected run time of the algorithm is O((N/M + M)logM) for all alphabet sizes, including 2! Daniel Pehoushek 24.33.70.144 (talk) 18:45, 24 August 2014 (UTC)[reply]

Two-way string search algorithm

[edit]

This page seems to forget to mention the existence of the two-way string-matching algorithm, which provides a worst-case running time of O(n+m). This is the algorithm used by glibc's strstr() function, for example. EdSchouten (talk) 10:18, 20 June 2016 (UTC)[reply]

Reference to "vector_space"

[edit]

Have just corrected where intro said strings are "formally" "vectors", and linked to vector_space. Perhaps someone merely accidentally linked to the wrong sense of "vector"; but it's incorrect as well as unhelpful. It was never mentioned again in the article, never had a connection described, and was unsourced). I changed it to array, which is not without problems, but at least a common basic analogy, and I removed the "formal", because it the analogy does not approach formality (nor does it need to).

  • Character strings are sequences of characters, not numeric quantities. There is no intrinsic relationship between "A" and 1 (or 65), or between control characters, ligatures, diacritics, etc., etc. Character strings (not to mention DNA base-pair sequences) existed millennia before ASCII.
  • Even a *coded* character string in a particular coded character set, is a sequence of integers draw from an extremely small set (even for Unicode), whereas even the infinite set of integers do not constitute a [Field_(mathematics)|field], which is required to define a vector space. Characters do not support addition, subtraction, multiplication, and division:
    • What is "C" + 1? Not "D" if it's EBCDIC; and the integer 1 is not a character anyway, so we could only refer to "C" + "1", which may be a totally different thing)
    • What is "C" * "#"?
    • What is "Z"/" "?
  • Strings do not behave like a space (or lattice) of numbers. One does not typically do linear algebra on strings. The dot product of two strings is an unusual notion (dot products are important in information theory, on arrays of words; but not for "string searching" per se). Affine transforms would be very strange way to do case-folding or space-normalizaiton (both highly important in string searching), if possible at all.
  • In strings, the length is highly variable, and the length in and of itself is of little importance; in linear algebra the length of a vector is an extremely important intrinsic property. This again shows they are quite different concepts, and one cannot trivially be a "formal" model for another.
  • Saying "Formally" suggested the introduction of a mathematical model which covers the important properties of a certain thing. No such model is in evidence. The mathematical properties of vector spaces are not even a basically correct model of strings, and do not approach being adequate, much less conceptually helpful. — Preceding unsigned comment added by Sderose (talkcontribs) 20:28, 31 August 2017 (UTC)[reply]