Paving the way for better numismatic experience


mordehaus

1 hour ago, Severus Alexander said:

I would say that acsearch's image search function makes a good start on this.

As far as I know, all the auctions on biddr stay online. @SimonW, is that correct?  Just go to https://www.biddr.com/closed and narrow down by auction house using the dropdown.

I have never tried ACSearch's image search.  I would be happy to read an article or write-up about the experience, if anyone has tried it.

I have never seen a biddr auction go away.  My fear is that all of the archived data on biddr could go away, within the span of a few seconds, without warning.  I purchased coins on pecunum.com between 2013-2016.  That entire trove of data is gone.  (It was a major site, operated by A. J. Gatlin of CoinArchives fame.  I do not know if anyone purchased the rights to the archives of the company.)

With Biddr, some auctions (e.g. this one) have a link to download a PDF catalog.  (Many do not, but perhaps the browser could generate PDFs of each page of 100 lots.)  An archive of those catalogs would have a value of $0 as long as Biddr operates, but if Biddr ever goes away it could become important.

Of course, not every auction is on biddr.  Frank Robinson's catalogs go away when the next auction is loaded.  If someone sold me a coin and said it was ex-Frank Robinson Auction 80, it would be hard to verify even with the printed black and white catalogs.


2 minutes ago, Ed Snible said:

I have never seen a biddr auction go away.  My fear is that all of the archived data on biddr could go away, within the span of a few seconds, without warning.  I purchased coins on pecunum.com between 2013-2016.  That entire trove of data is gone.  (It was a major site, operated by A. J. Gatlin of CoinArchives fame.  I do not know if anyone purchased the rights to the archives of the company.)

I wondered if that's what you had in mind. Agreed! It is a concern. Although biddr and acsearch will be around a long time, I'm sure.

The acsearch image search isn't a machine learning or neural network type thing (which to my mind is what's really needed for this purpose), it's hard coded based on matching outlines.  It seems to work fairly well, and errs on the side of including coins that don't match, giving you a score for how well.  So the same coin will typically get a high score whereas a die match somewhat lower.  Hard to know how reliable it is, though, without starting with a large list of images not in the database but known to have matches in it.  I would certainly encourage you to try it out, though!

Since there aren't enough coins in this thread, here's one for which the acsearch image search worked, i.e. I found a previous auction record I wasn't aware of:

[image]

It was previously sold in Westfälische Auktionsgesellschaft, Auction 51 (2009), lot 9.


13 minutes ago, DonnaML said:

I don't suppose that by any chance it includes his Sale 86 from 1989? My Tiberius "tribute penny" denarius was apparently Lot 305 in that sale.

I understand your concern about the need to continue to build a database of online catalogues, and the problem of sales disappearing -- although anything that's been on biddr, numisbids, sixbids, etc. should still be available, as @Severus Alexander points out, and should be available as well at ACSearch to the extent it covers particular auction houses for the last one or two decades, and at CNG, Heritage, etc. for their own auctions. And of course the continuing addition of catalogues to online databases is a prerequisite to the ability to utilize those databases effectively.

My Cederlind catalogs start in 1994.  The ANS library in Manhattan has a copy of that catalog, in the open stacks.

https://donum.numismatics.org/cgi-bin/koha/opac-detail.pl?biblionumber=191411

I have several coins that appeared in CNG auctions that aren't in their archives.  CNG has a policy of removing unsolds.  Those unsolds get sold to dealers, and then I buy them, and fail to verify the provenance online.  Luckily PDFs are available of the printed catalogs, but unsolds from CNG e-auctions are not available from CNG.

The technology for doing word searches of the catalogs is known.  I know the Internet Archive applies OCR to everything.  That gets better every year, and allows full text searching.  (OCR of coin catalogs is a bit weak because most OCR programs expect monolingual text, not a mixture of English and Greek capitals.)

The problems are related to copyright and business models.  There are a lot of up-front costs related to scanning and writing software.  Someone with a better idea has to re-do all of the work already done by many other firms before they can start working on new ideas.


  • Benefactor
1 hour ago, Severus Alexander said:

I would say that acsearch's image search function makes a good start on this.

I admit that I haven't used ACSearch's image search feature, but I'm a bit suspect because their "normal" search is too basic. That's the major reason I haven't become a paying member as I just don't see the current functionality as valuable and I'm not going to pay 70 cents to test an image search feature to see if it's better. If you want to lure me as a paying customer, let me search a few times for free to verify how useful it is.

Speaking of basic search, the following would be highly useful.

  • Search by catalog reference. Right now this is a basic string search and mostly useless.
  • Be smart enough to know similar catalog references. Many coins are listed by multiple services. If I search for one, understand that the others may match too.
  • Understand debated attributions. For example, coins from Myos and Myndos often have identical catalog references (such as SNG Copenhagen 1022). Therefore, if I search for Myos I should receive Myndos and the other cities these are often attributed to.
  • Include highlights for known fakes. Right now these are often removed by auction houses but remain in search results. It would be great to keep them in the results, but put a red border or other designation on them. This could be crucial when searching for coins.

I should also note that I do not have a complex system that feeds into my web pages. That's simply overkill. I use an Excel spreadsheet. I know my own coins and can easily find the one I need with a search. My web site is all static content because I tell stories with each one and I can't autogenerate that content (well, I could - but it would suck).


8 minutes ago, kirispupis said:

I admit that I haven't used ACSearch's image search feature, but I'm a bit suspect because their "normal" search is too basic. That's the major reason I haven't become a paying member as I just don't see the current functionality as valuable and I'm not going to pay 70 cents to test an image search feature to see if it's better. If you want to lure me as a paying customer, let me search a few times for free to verify how useful it is.

Speaking of basic search, the following would be highly useful.

  • Search by catalog reference. Right now this is a basic string search and mostly useless.
  • Be smart enough to know similar catalog references. Many coins are listed by multiple services. If I search for one, understand that the others may match too.
  • Understand debated attributions. For example, coins from Myos and Myndos often have identical catalog references (such as SNG Copenhagen 1022). Therefore, if I search for Myos I should receive Myndos and the other cities these are often attributed to.
  • Include highlights for known fakes. Right now these are often removed by auction houses but remain in search results. It would be great to keep them in the results, but put a red border or other designation on them. This could be crucial when searching for coins.

I should also note that I do not have a complex system that feeds into my web pages. That's simply overkill. I use an Excel spreadsheet. I know my own coins and can easily find the one I need with a search. My web site is all static content because I tell stories with each one and I can't autogenerate that content (well, I could - but it would suck).

I think your demands are a bit too high given the huge scope of acsearch... there would be no way to automate the catalog reference functionality you mention, for example; and since the database is over 10 million items at this point, if it can't be automated, it can't be done.  That said, it would be possible to use the thesaurus function to equate catalog references.  Users can suggest equivalent terms for insertion into the thesaurus.

Searching is quite flexible, with wildcards, boolean functions, etc.

What I can tell you is that you're likely paying too much for your coins if you're not paying for acsearch to see the hammers.  It easily pays for itself IMO.


2 hours ago, Severus Alexander said:

I use Tap Forms for Mac as a database for my collection.  It doesn't have the sophisticated abilities that @Kaleun96 speaks of, but it's highly customizable and easy to use. You can easily create different "forms" for viewing the data, including one for printing out flip inserts.  Here's a screenshot:

[image]

That's a pretty nifty bit of software, $50 is perhaps a bit steep for many but at least it's a one-off cost and comes with a mobile app. Can you create inputs from formulas like in Excel/Sheets, e.g. a field is generated by the values of several other fields or anything like that?


23 minutes ago, kirispupis said:

I should also note that I do not have a complex system that feeds into my web pages. That's simply overkill. I use an Excel spreadsheet. I know my own coins and can easily find the one I need with a search. My web site is all static content because I tell stories with each one and I can't autogenerate that content (well, I could - but it would suck).

Not sure if you're replying to my comment addressed to you but as I talked about feeding data into our respective websites, I'll assume you might be. Anyway, I didn't say you had a complex system, I only said that you (we) have customised solutions for our own needs and that includes a secondary need of being able to upload that data to a website - even if that is manually copying it over as you suggest you're doing.

My point is not in the details of how you manage your collection, my point is that just because *you* don't see a need for it, doesn't mean that need doesn't exist. You're already likely in the top 5-10% of collectors when it comes to having something to manage your collection digitally, even with something as basic as a spreadsheet. If you're fine with managing data that way and copying it over to your website manually, then of course the value of such an app is minimal for you and there's no reason to change what you're currently doing. However, a lot of people may not be so keen on managing their collection in this way and might pay for the convenience of a user-friendly app/site that makes this easier and provides other features.

Also, I think you know this already as someone working in software but there are other reasons (besides "autogenerating stories") for not wanting to manually move data between systems. Whether it's overkill or not depends on the needs.


  • Benefactor
25 minutes ago, Severus Alexander said:

I think your demands are a bit too high given the huge scope of acsearch... there would be no way to automate the catalog reference functionality you mention, for example; and since the database is over 10 million items at this point, if it can't be automated, it can't be done.  That said, it would be possible to use the thesaurus function to equate catalog references.  Users can suggest equivalent terms for insertion into the thesaurus.

The features I mentioned wouldn't be overly difficult to implement with some decent engineers. Most coin descriptions follow similar formats, so it wouldn't be difficult to construct an AI model to pull the catalog references, city, and years from a listing.

From there, one can correlate the catalog listings. For example, if 95% of listings that mention BCD Boiotia 539 also mention SNG Cop 339, then they're likely the same type.
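As a rough illustration of that correlation idea (a sketch only, not something acsearch or CoinArchives is known to do), one could count how often two references appear in the same listing. It assumes the references have already been extracted from each listing. The function name `likely_same_type`, the `min_listings` cutoff, and the sample data (including the placeholder "Example Ref 1") are made up for this example.

```python
from collections import Counter
from itertools import permutations


def likely_same_type(listings, threshold=0.95, min_listings=3):
    """Return {(a, b): rate} where rate is the share of listings citing
    reference a that also cite reference b, kept only if rate >= threshold.

    min_listings is a hypothetical guard so that a reference seen only
    once or twice does not produce a spurious 100% match.
    """
    ref_counts = Counter()   # listings that cite each reference
    pair_counts = Counter()  # listings that cite both a and b
    for refs in listings:
        unique = set(refs)
        ref_counts.update(unique)
        pair_counts.update(permutations(sorted(unique), 2))
    result = {}
    for (a, b), together in pair_counts.items():
        rate = together / ref_counts[a]
        if ref_counts[a] >= min_listings and rate >= threshold:
            result[(a, b)] = rate
    return result


# Invented sample data: each inner list is the set of references in one listing.
sample = [
    ["BCD Boiotia 539", "SNG Cop 339"],
    ["BCD Boiotia 539", "SNG Cop 339"],
    ["BCD Boiotia 539", "SNG Cop 339", "Example Ref 1"],
    ["SNG Cop 339"],
]
matches = likely_same_type(sample)
```

Note that the relationship is directional: in the sample, every listing citing BCD Boiotia 539 also cites SNG Cop 339, but not the other way around.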

Speaking of that, year search would also be useful. For example, if I'm searching for Alexandria Troas, I personally don't care about the Roman coins - which are like locusts. I only want to see the Greek ones from before 250 BCE.
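A year filter could start from something as simple as the sketch below. It assumes the description contains a date range followed by an era, such as "301-281 BC"; the names `year_range` and `struck_before` are hypothetical, and real listings use many more date formats (e.g. "AD 198-217", "3rd century BC", ranges that cross from BC to AD) that this does not handle.

```python
import re

# Matches ranges like "301-281 BC", "301 to 281 BCE", "198-217 AD".
DATE_RANGE = re.compile(
    r"(\d{1,4})\s*(?:[-–]|to)\s*(\d{1,4})\s*(BCE|BC|CE|AD)\b",
    re.IGNORECASE,
)


def year_range(description):
    """Return (earliest, latest) as signed years (negative = BCE), or None."""
    match = DATE_RANGE.search(description)
    if match is None:
        return None
    sign = -1 if match.group(3).upper() in ("BC", "BCE") else 1
    first = sign * int(match.group(1))
    second = sign * int(match.group(2))
    return (min(first, second), max(first, second))


def struck_before(description, year):
    """True only if a date range was found and it ends before the given year."""
    found = year_range(description)
    return found is not None and found[1] < year


greek = "TROAS, Alexandreia. Circa 301-281 BC. AE"
roman = "TROAS, Alexandria. Caracalla, 198-217 AD. AE"
keep_greek = struck_before(greek, -250)
keep_roman = struck_before(roman, -250)
```

Listings with no recognizable date are excluded here; a real search would probably want to let the user decide whether to include them.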


  • Benefactor
13 minutes ago, Kaleun96 said:

Not sure if you're replying to my comment addressed to you but as I talked about feeding data into our respective websites, I'll assume you might be. Anyway, I didn't say you had a complex system, I only said that you (we) have customised solutions for our own needs and that includes a secondary need of being able to upload that data to a website - even if that is manually copying it over as you suggest you're doing.

Yes, I did take this to mean me. There are really two needs you discuss here. a) managing my coin collection, which I do with an Excel spreadsheet and b) copying data from my home system to my web site, which doesn't take place. There really isn't any content on my web site that comes from my Excel spreadsheet. Even the attribution is altered. I expect this to diverge more next year, when I make planned major changes to the site. In that case, I do expect to have some things that are data driven, but they will be driven from other sources that are separate from the spreadsheet I use just for keeping track of what coins I have.

The big point here is that both my website development and my coin website are driven by a balance of utility vs effort. I'm never going to automate something because it's cool to automate. I'm going to automate it because I calculate that the cost of automating it will save me time overall in the long run.

17 minutes ago, Kaleun96 said:

My point is not in the details of how you manage your collection, my point is that just because *you* don't see a need for it, doesn't mean that need doesn't exist. You're already likely in the top 5-10% of collectors when it comes to having something to manage your collection digitally, even with something as basic as a spreadsheet. If you're fine with managing data that way and copying it over to your website manually, then of course the value of such an app is minimal for you and there's no reason to change what you're currently doing. However, a lot of people may not be so keen on managing their collection in this way and might pay for the convenience of a user-friendly app/site that makes this easier and provides other features.

In past discussions we've had on this site, everyone has some way of managing their coins. Some have more basic systems and others more advanced. I read the OP's post as more of an advertisement that didn't seem to understand the reality of how most people collect. The best systems I've seen are from those who are already tightly meshed into the hobby, so they already know first-hand what the needs are. It then follows the basic principles of software development: put a V1 out there based on what you desperately need and then keep listening to your customers.

Keep in mind also that those of us who even have a website discussing our coins are a tiny fraction of the collectors out there, and there wouldn't be much of a business justification to supporting automated copying (which would be problematic given the lack of a standard).

23 minutes ago, Kaleun96 said:

Also, I think you know this already as someone working in software but there are other reasons (besides "autogenerating stories") for not wanting to manually move data between systems. Whether it's overkill or not depends on the needs.

Not sure why you keep harping on moving data. Yes. I work in the software industry and I have a patent on auto-generation of code. I personally don't auto-generate my content not because I don't believe in it but because it wasn't cost-effective for that solution. That may change in the near future, but this software is not going to help in the effort, nor is that relevant to the conversation.


1 hour ago, Kaleun96 said:

That's a pretty nifty bit of software, $50 is perhaps a bit steep for many but at least it's a one-off cost and comes with a mobile app. Can you create inputs from formulas like in Excel/Sheets, e.g. a field is generated by the values of several other fields or anything like that?

Yes, you can do calculations: https://www.tapforms.com/help-mac/5.3/en/topic/calculation . Scripts too, for more complex data manipulation: https://www.tapforms.com/help-mac/5.3/en/topic/scripts . I see the scripts can pull data from websites, so it might be possible to do some of the more difficult things you mentioned.  Personally I haven't used either of these capabilities, but it's nice to see the app has them.

54 minutes ago, kirispupis said:

The features I mentioned wouldn't be overly difficult to implement with some decent engineers. Most coin descriptions follow similar formats, so it wouldn't be difficult to construct an AI model to pull the catalog references, city, and years from a listing.

From there, one can correlate the catalog listings. For example, if 95% of listings that mention BCD Boiotia 539 also mention SNG Cop 339, then they're likely the same type.

Speaking of that, year search would also be useful. For example, if I'm searching for Alexandria Troas, I personally don't care about the Roman coins - which are like locusts. I only want to see the Greek ones from before 250 BCE.

I think you're right about the year of issue... it might not take too much labour to get that to work.  But I think you're underestimating the difficulty of incorporating catalog references. Since you work in a specialized area, catalog references are fairly uniform. But how to do this when you've got Sogdian coins referencing Mitchiner, Chinese coins referencing Hartill, Buyid coins referencing Album, etc. etc. Dealers are not at all consistent in how they refer to these catalogs, plus the records are riddled with errors. I would estimate that up to a quarter of the listings have some sort of cataloguing mistake in them! I don't think it would be possible to generate reliable results without actually consulting the catalogs, which is prohibitively time-expensive.

That said, I'm all ears if you have ideas for how to make this workable!  It would be great. 🙂

On your Alexandria, Troas example, I find that the "-" search function helps a lot eliminating unwanted results. (I often -commentaires to get rid of the cgb results, which include a lot of historical info creating false positives.) This particular example is tough, but it is possible to improve results quite a bit. The search string "(alexandria alexandreia) troas" gets 3874 hits, while "(alexandria alexandreia) -col -avg troas -bellinger -roman -provincial -pseud* -autonomous," gets only 879. Still quite a few Roman coins included, though.
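For anyone unsure how such a query narrows things down, here is a toy model of the matching logic. It is only my reading of the operators (terms in parentheses treated as alternatives, bare terms required, "-" terms excluded, "*" as a wildcard) and is not acsearch's actual implementation; the function `matches` and the sample descriptions are invented.

```python
import re
from fnmatch import fnmatchcase


def matches(description, any_of=(), required=(), excluded=()):
    """Toy query matcher: any_of models "(a b)", excluded models "-term"."""
    words = re.findall(r"\w+", description.lower())

    def present(pattern):
        # fnmatchcase gives "pseud*"-style wildcard matching on single words.
        return any(fnmatchcase(word, pattern.lower()) for word in words)

    if any_of and not any(present(p) for p in any_of):
        return False
    if not all(present(p) for p in required):
        return False
    return not any(present(p) for p in excluded)


query = dict(
    any_of=("alexandria", "alexandreia"),
    required=("troas",),
    excluded=("col", "avg", "bellinger", "roman", "provincial",
              "pseud*", "autonomous"),
)
greek_hit = matches("TROAS, Alexandreia. Circa 3rd century BC. AE", **query)
roman_hit = matches("Troas, Alexandria. Pseudo-autonomous issue. AE", **query)
```

This also shows why Roman coins still slip through: a minimally described Roman lot that uses none of the excluded words passes the filter.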

For pricing, the bulk-volume auction houses cause trouble with their extremely minimal descriptions.  If the coin you're researching is a common low value bronze, or something else they might have, it's always a good idea to try a search with minimal terms to see if there's a bulk auction price level.

Edited by Severus Alexander

46 minutes ago, kirispupis said:

The features I mentioned wouldn't be overly difficult to implement with some decent engineers. Most coin descriptions follow similar formats, so it wouldn't be difficult to construct an AI model to pull the catalog references, city, and years from a listing.

When I signed up for acsearch, more than 10 years ago, I clicked to agree to the terms.  I assume the terms I agreed to then are similar to the current terms, which say "2.1 Any intervention in the function or manipulation of the website is forbidden to the highest degree. This in particular includes ... the use of so called web scrapers, web robots and other software for the systematic collection of data and content ..."

Without the data it is hard to create and test software to analyze the data!

I have never asked Simon Wieland for permission to access his data.  I don't have time to work on this project seriously.  Perhaps someone could approach him to see if he would be willing to sell programmatic access to his non-price data to companies and non-profits doing research and new product design.

 


  • Benefactor
25 minutes ago, Severus Alexander said:

Yes, you can do calculations: https://www.tapforms.com/help-mac/5.3/en/topic/calculation . Scripts too, for more complex data manipulation: https://www.tapforms.com/help-mac/5.3/en/topic/scripts . I see the scripts can pull data from websites, so it might be possible to do some of the more difficult things you mentioned.  Personally I haven't used either of these capabilities, but it's nice to see the app has them.

I think you're right about the year of issue... it might not take too much labour to get that to work.  But I think you're underestimating the difficulty of incorporating catalog references. Since you work in a specialized area, catalog references are fairly uniform. But how to do this when you've got Sogdian coins referencing Mitchiner, Chinese coins referencing Hartill, Buyid coins referencing Album, etc. etc. Dealers are not at all consistent in how they refer to these catalogs, plus the records are riddled with errors. I would estimate that up to a quarter of the listings have some sort of cataloguing mistake in them! I don't think it would be possible to generate reliable results without actually consulting the catalogs, which is prohibitively time-expensive.

That said, I'm all ears if you have ideas for how to make this workable!  It would be great. 🙂

On your Alexandria, Troas example, I find that the "-" search function helps a lot eliminating unwanted results. (I often -commentaires to get rid of the cgb results, which include a lot of historical info creating false positives.) This particular example is tough, but it is possible to improve results quite a bit. The search string "(alexandria alexandreia) troas" gets 3874 hits, while "(alexandria alexandreia) -col -avg troas -bellinger -roman -provincial -pseud* -autonomous," gets only 879. Still quite a few Roman coins included, though.

The thing is, this would be a search and not an authoritative reference. You are absolutely correct that many listings attribute coins incorrectly, but it wouldn't be the goal of the search engine to fix that. For example, the bronzes attributed to Nektanebo II have been pretty definitively proven to not have been minted by him. However, I'm not insulted when a search engine returns them because that's not the engine's fault. Complicating the matter further is many of these coins were listed when that was believed the correct attribution.

Also, the algorithm doesn't need to really understand the reference. Sure, there are some such as RIC - where the format quoted often differs - that may need some massaging (though that can be automated too). Overall, I don't see it as a more difficult problem than others that AI models have tackled. It's certainly far easier than trying to match dies when the image quality from older catalogs is abysmal.

Another point I would make about ACSearch is that - at least for me - it's not overly helpful in ensuring I don't overpay. The thing is, what someone paid for a coin in 2003 doesn't help me very much. There are two reasons:

  1. Some coins I'm after are extremely rare, and the price I expect to pay is based more on how much competition I expect to receive than what they went for in the past. For some of those rare coins I know I won't receive much competition, but on others I'll have to pay through the nose. Collectors' goals change and what a rare coin for which only a couple copies exist went for five years ago isn't a great prediction of what it will receive now.
  2. Even well-known coins change dramatically in price from year to year. Realistically, I need to know what they sold for in the last six months. The free version of CoinArchives already tells me this.

  • Benefactor
30 minutes ago, Ed Snible said:

When I signed up for acsearch, more than 10 years ago, I clicked to agree to the terms.  I assume the terms I agreed to then are similar to the current terms, which say "2.1 Any intervention in the function or manipulation of the website is forbidden to the highest degree. This in particular includes ... the use of so called web scrapers, web robots and other software for the systematic collection of data and content ..."

Without the data it is hard to create and test software to analyze the data!

I have never asked Simon Wieland for permission to access his data.  I don't have time to work on this project seriously.  Perhaps someone could approach him to see if he would be willing to sell programmatic access to his non-price data to companies and non-profits doing research and new product design.

 

Yes, the data source remains the major issue. Even for ACSearch, an API into his data wouldn't be sufficient. The raw data itself would be necessary.

My hope with the suggestions is that the owners of ACSearch or CoinArchives pick them up, or someone else does. It really has to be a labor of love, since there aren't enough ancient coin collectors to justify the expense.


2 hours ago, kirispupis said:

The thing is, this would be a search and not an authoritative reference. You are absolutely correct that many listings attribute coins incorrectly, but it wouldn't be the goal of the search engine to fix that. For example, the bronzes attributed to Nektanebo II have been pretty definitively proven to not have been minted by him. However, I'm not insulted when a search engine returns them because that's not the engine's fault. Complicating the matter further is many of these coins were listed when that was believed the correct attribution

I'm not really talking about the Nektanebo situation, but the situation where the wrong RIC number is given, or the wrong SNG number, because the auction house has incorrectly copied another attribution.  The correlational strategy you mention is going to get seriously messed up by these kinds of mistakes, which are legion.  (I know this from running AMCC auctions.)  The algorithm would have to be resistant to extremely noisy data.  Not saying it can't be done, but it might require deep learning or something like that (which is what I think is needed for image matching, ultimately). That in addition to the fact that across the spectrum of coins we're dealing with thousands of different catalogs... yikes. I suppose some progress could be made with very limited scope. I guess you'd vote for Hellenistic coins around 300 BC plus or minus 50 years? 🙂 

2 hours ago, kirispupis said:

 

  1. Some coins I'm after are extremely rare, and the price I expect to pay is based more on how much competition I expect to receive than what they went for in the past. For some of those rare coins I know I won't receive much competition, but on others I'll have to pay through the nose. Collectors' goals change and what a rare coin for which only a couple copies exist went for five years ago isn't a great prediction of what it will receive now.
  2. Even well-known coins change dramatically in price from year to year. Realistically, I need to know what they sold for in the last six months. The free version of CoinArchives already tells me this.

Of course I agree about great rarities (where the rarity matters, historically or numismatically, of course; not talking about random flyspecking).  Previous price data is always of limited use there.

We're rather different collectors.  I'm a generalist who looks through all the thousands of lots of a Leu auction. (More fool me! 😆)  This means there are typically lots of coins I'm interested in, and I narrow them down by what I can get for a bargain.  I'm not willing to pay pandemic prices!  Everything back to 2017 or 18 and sometimes further is relevant to me.  This year I've landed coins for the cheapest price they've been sold for in a decade.  I definitely couldn't do this without a paid acsearch account! (@Curtisimo can back me up here, he calls me the Bargain Ninja. 😆)


  • Benefactor
5 hours ago, Severus Alexander said:

I'm not really talking about the Nektanebo situation, but the situation where the wrong RIC number is given, or the wrong SNG number, because the auction house has incorrectly copied another attribution.  The correlational strategy you mention is going to get seriously messed up by these kinds of mistakes, which are legion.  (I know this from running AMCC auctions.)  The algorithm would have to be resistant to extremely noisy data.  Not saying it can't be done, but it might require deep learning or something like that (which is what I think is needed for image matching, ultimately). That in addition to the fact that across the spectrum of coins we're dealing with thousands of different catalogs... yikes. I suppose some progress could be made with very limited scope. I guess you'd vote for Hellenistic coins around 300 BC plus or minus 50 years? 🙂 

The following is a rough outline of an algorithm that should work.

  1. Create a model to pull the full attribution from a listing. This should look something like "Doobie Bros 1989 Pl 10 num 5; Flipperdoodle SNG 95.2". This shouldn't be overly difficult to train and should have a high success rate because the syntax for the attribution is similar across listings. 
  2. Given the extracted attribution, create an AI model to pick apart the individual attributions. So, for the example above, we would have two. Doobie Bros 1989 Pl 10 Num 5 and Flipperdoodle SNG 95.2. Again, this doesn't need the most complex model.
  3. Normalize the above attributions. This may be the trickiest part so far because the normalized syntax will likely vary by catalog. It may be necessary to create a state machine (though they can probably be generated) to accomplish this. Normalization will be required on the search engine side too, to normalize the input before querying.
  4. You've now accomplished a basic attribution search. One can search for a specific type from Doobie Bros or Flipperdoodle, the input will be normalized, and then you don't even need a full text search. An index (yes, people seem afraid of them now) search may achieve what we need.
  5. Cross correlation is also not difficult. Because we have a collection of attributions for each listing, we just need to first index on the attributions (which we should have already done in the previous step) and then for each attribution find all other attributions that accompany it. We can then calculate a correlation between them. Given a threshold that we should be able to calculate from the data, we can then add a secondary index to pull all the records that match from a correlated attribution. Note that these operations can be done with non-relational databases too, for those who find SQL weird.
  6. Now, we've achieved most of what we were after. Attempting to pick out mis-attributed coins is possible if an image search is also available. Basically, for each attribution we calculate a similarity score of their images. Alternatively, we could create a boolean AI model that simply returns whether the images are similar. For the small number that are not, we could remove the attribution - or more likely index it separately so customers may decide whether to include them. Actually fixing the attribution is more complicated, since that would involve a very sophisticated image search that I don't believe exists today.
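Steps 2 to 4 of the outline above can be sketched in a few lines. This is a simplified illustration, not a tested design: it assumes step 1 has already produced the combined attribution string, that individual attributions are separated by semicolons, and that a single lowercase-and-trim rule is enough for normalization (in practice the rules would vary by catalog, as noted in step 3). All function names are hypothetical, and the catalogs are the made-up ones from the outline.

```python
import re
from collections import defaultdict


def split_attributions(text):
    """Step 2: split a combined attribution string on ';' separators."""
    return [part.strip() for part in text.split(";") if part.strip()]


def normalize(ref):
    """Step 3 (greatly simplified): lowercase, collapse whitespace,
    and trim stray punctuation at the ends."""
    return re.sub(r"\s+", " ", ref.lower()).strip(" .,")


def build_index(listings):
    """Step 4: map each normalized attribution to the ids of listings citing it."""
    index = defaultdict(set)
    for listing_id, attribution_text in listings.items():
        for ref in split_attributions(attribution_text):
            index[normalize(ref)].add(listing_id)
    return index


def search(index, query):
    """Normalize the query the same way, then do a plain index lookup."""
    return sorted(index.get(normalize(query), set()))


# Invented listings keyed by an arbitrary id.
listings = {
    1: "Doobie Bros 1989 Pl 10 num 5; Flipperdoodle SNG 95.2",
    2: "doobie bros 1989  pl 10 Num 5.",
    3: "Flipperdoodle SNG 95.3",
}
index = build_index(listings)
doobie_hits = search(index, "Doobie Bros 1989 Pl 10 Num 5")
flipper_hits = search(index, "Flipperdoodle SNG 95.2")
```

The same index is the starting point for step 5: the set of attributions per listing is exactly the input a co-occurrence count needs.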

In reality, I don't think many are going to be bothered by the mis-attributions as long as the attribution search (based on the normalized listing only) is accurate. At this point the algorithm is returning only those listings that were actually attributed as requested, and any errors are caused by the seller. 
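Steps 4 and 5 above can be sketched in a few lines. This is only an illustration under assumptions: the attributions are taken to be already extracted and normalized, the sample listings are made up, and `correlated_attributions`, `threshold` and `min_support` are hypothetical names. "Correlation" here is plain co-occurrence frequency, which is one simple way to read step 5, not the only one.

```python
from collections import Counter, defaultdict
from itertools import permutations


def correlated_attributions(listings, threshold=0.95, min_support=2):
    """Map each attribution to those that accompany it in at least
    `threshold` of the listings that mention it."""
    totals = Counter()  # attribution -> number of listings mentioning it
    pairs = Counter()   # (a, b) -> number of listings mentioning both
    for refs in listings:
        unique = set(refs)
        totals.update(unique)
        pairs.update(permutations(sorted(unique), 2))

    related = defaultdict(set)
    for (a, b), together in pairs.items():
        # Skip attributions seen too rarely to say anything about them.
        if totals[a] >= min_support and together / totals[a] >= threshold:
            related[a].add(b)
    return dict(related)


# Made-up sample data: each inner list is the attribution set of one listing.
listings = [
    ["Doobie Bros 1989 Pl 10 Num 5", "Flipperdoodle SNG 95.2"],
    ["Doobie Bros 1989 Pl 10 Num 5", "Flipperdoodle SNG 95.2"],
    ["Doobie Bros 1989 Pl 10 Num 5", "Flipperdoodle SNG 95.2", "Made-Up Catalog 7"],
    ["Flipperdoodle SNG 95.2"],
]
related = correlated_attributions(listings)
```

Note that the relation is directional: every Doobie Bros listing also cites Flipperdoodle, but only three of the four Flipperdoodle listings cite Doobie Bros, so only the first direction clears the threshold. The resulting mapping is what a secondary index would be built from.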


When did this thread become about acsearch? 🙂 I am happy to answer all questions and listen to ideas for improvements. I do not, however, want to hijack @mordehaus's thread about his new application, so I will keep my answer short. Please be assured that I have read all comments and ideas nonetheless. If you have further questions, ideas or feedback, please open a new thread and I will gladly jump on it.

One search operator that is very powerful and barely used is the proximity search operator described here (https://www.acsearch.info/howto.html). When I am searching for a certain catalog reference (e.g. Hadrian RIC 9), my search query looks like this: Hadrian "RIC 9"~2. This ensures that the search only returns results where RIC and 9 are no more than 2 words apart, which works amazingly well. If you add the denomination on top of that, you will almost certainly get mostly perfect matches (as long as the descriptions are correct). There are many other helpful search operators. Check out the How to page on acsearch.
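To make the idea of a proximity match concrete, here is a rough sketch of the check itself. It is not how acsearch implements the operator (how the site counts distance or treats word order is not stated here); `within_proximity` is a hypothetical helper that simply measures the gap between word positions.

```python
import re


def within_proximity(text, first, second, max_distance=2):
    """Return True if `first` and `second` occur as whole words
    no more than `max_distance` word positions apart."""
    words = re.findall(r"\w+", text.lower())
    first_positions = [i for i, w in enumerate(words) if w == first.lower()]
    second_positions = [i for i, w in enumerate(words) if w == second.lower()]
    return any(
        abs(i - j) <= max_distance
        for i in first_positions
        for j in second_positions
    )


# "RIC" and "9" are two positions apart here, so this counts as a match.
close = within_proximity("Hadrian. AR Denarius. RIC II 9.", "RIC", "9")
# Here the 9 belongs to an unrelated phrase further along the description.
far = within_proximity("RIC 123; struck in year 9 of the reign", "RIC", "9")
```

The second description would still come up in a plain search for Hadrian RIC 9, which is the kind of false hit the proximity operator filters out.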


23 hours ago, kirispupis said:

The features I mentioned wouldn't be overly difficult to implement with some decent engineers. Most coin descriptions follow similar formats, so it wouldn't be difficult to construct an AI model to pull the catalog references, city, and years from a listing.

From there, one can correlate the catalog listings. For example, if 95% of listings that mention BCD Boiotia 539 also mention SNG Cop 339, then they're likely the same type.

Speaking of that, year search would also be useful. For example, if I'm searching for Alexandria Troas, I personally don't care about the Roman coins - which are like locusts. I only want to see the Greek ones from before 250 BCE.

I think the issue here would not be the model so much as the data cleaning. For example, you can refer to SNG BnF Cilicia in so many different ways: for SNG you have SNG and Sylloge Nummorum Graecorum; for the first suffix you have Bibliothèque Nationale, BnF, France, Paris, BN (maybe others); and for Cilicia you probably use either Cilicia, 2, Vol. 2, or some combination of them. Not to mention all the different characters you may get between them, like: SNG (BNF) 2 1402, SNG Paris (2) 1302, SNG France 95 (now you have to look up which volume it's from based on the mint), and so on.

You could handle 95-99% of cases with some regex, but this is just for one publication; you'd need other regex for other publications, particularly non-SNG ones. A model may work, provided you've first gone and labelled a large chunk (thousands) of examples to train it on. But you would also need a pipeline that has gone through and identified the mints, so you can look up the SNG volume when it isn't provided. And to do that, you have to be able to tell that the volume is missing, which probably means assuming that a single set of integers following a known SNG publication "phrase" refers to the type number and not the volume number. You could do that by looking for a set of integers that is ended by a full-stop, new-line, or some other character or punctuation. But then you risk missing SNG France 2. 1403, which refers to Vol 2 but Type 1403. Okay, so then you allow for two sets of integers separated by one of a set of known punctuation marks, but doing this could introduce errors where the volume is not provided but the Type number is followed by a full-stop and some other, irrelevant, number. Perhaps this can be solved by knowing the number of volumes for each publication and only allowing the first set of integers (likely a single integer) to represent the volume: if the integer is out of that range then it must be the Type and not the Volume, and any set of integers following it is definitely not related... unless a range of Type numbers is given, such as SNG France 1403-1405. Okay, so we'll allow for hyphens and a select few other characters that may identify a range of Type numbers.
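For what it's worth, the rules in that paragraph fit in a fairly small regex plus a range check. The following is only a sketch for this one publication: `MAX_VOLUME` is a made-up placeholder (I have not checked how many volumes the series has), and `SNG_FRANCE` and `parse_sng_france` are hypothetical names.

```python
import re

# Assumed volume count, for illustration only.
MAX_VOLUME = 6

SNG_FRANCE = re.compile(
    r"""SNG\s*\(?\s*(?:BnF|BN|France|Paris)\s*\)?  # publication phrase
        [\s.,]*\(?(?P<first>\d+)\)?                # volume or type number
        (?:[\s.,]+(?P<second>\d+))?                # a second set of integers
        (?:\s*[-–]\s*(?P<end>\d+))?                # optional end of a range
    """,
    re.IGNORECASE | re.VERBOSE,
)


def parse_sng_france(text):
    """Return volume, type and optional range end, or None if no match."""
    m = SNG_FRANCE.search(text)
    if m is None:
        return None
    first = int(m.group("first"))
    second, end = m.group("second"), m.group("end")
    if second is not None and first <= MAX_VOLUME:
        # Two sets of integers and the first is a plausible volume number.
        return {"volume": first, "type": int(second),
                "type_end": int(end) if end else None}
    # Otherwise the first integer is the type; a trailing number is unrelated.
    type_end = int(end) if end and second is None else None
    return {"volume": None, "type": first, "type_end": type_end}


with_volume = parse_sng_france("SNG France 2. 1403")
no_volume = parse_sng_france("SNG France 1403. 12 specimens known")
type_range = parse_sng_france("SNG France 1403-1405")
```

The remaining ambiguity is the one the paragraph already points at: a type number small enough to pass for a volume, followed by an unrelated number, will be misread. When `volume` comes back as None, the mint-based lookup still has to happen somewhere else.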

Ok, we've solved that one (for this publication), so let's get back to identifying the volume, when not provided, based on the name of the mint or region. What happens if the description contains more than one mint/region name? What if it's not spelt correctly or uses an alternative spelling? Maybe not such a big issue compared to the previous one, but this is another pipeline/data flow that needs to have been set up, validated, and integrated with the main model.

Anyway, you can see where I'm going with this. In my experience, the difficulty with ML/AI is not the model but the data: validating, handling, and cleaning the data takes up the majority of the time, resources, and effort. Letting a model work out all the difficulties and nuances is probably the "easiest" way, but think how much data you would need to have labelled to get a good representation of all the major publications and their variations in spelling and formatting. I could see it being do-able for, say, just the SNG publications, but since there's little convention in publication attribution, it would take a lot of labelled data to have something that works well generally. I'm sure there's some cutting-edge NLP method that may avoid some of these issues, but then you're talking about hiring some expensive people.


23 hours ago, kirispupis said:

Yes, I did take this to mean me. There are really two needs you discuss here. a) managing my coin collection, which I do with an Excel spreadsheet and b) copying data from my home system to my web site, which doesn't take place. There really isn't any content on my web site that comes from my Excel spreadsheet. Even the attribution is altered. I expect this to diverge more next year, when I make planned major changes to the site. In that case, I do expect to have some things that are data driven, but they will be driven from other sources that are separate from the spreadsheet I use just for keeping track of what coins I have.

The big point here is that both my website development and my coin website are driven by a balance of utility vs effort. I'm never going to automate something because it's cool to automate. I'm going to automate it because I calculate that the cost of automating it will save me time overall in the long run.

That's all fine but, again, I don't want to get into the specifics of how and why you manage your collection. I was only mentioning that you already have some system and may have secondary uses for that system (apparently not, but that's not important) that may create friction and prevent you from using an off-the-shelf solution. Someone without those secondary uses has less friction when moving to a different system. There are of course varying degrees of friction that a "secondary use" creates: in my case it would be a huge amount, but in other cases not so much. SaaS products tend to be focussed on particular types of users, so my point was that you, I, and some others here may not initially be in that target user group for OP, and thus when we say "you're not solving a problem I have", it's not necessarily relevant to what OP might be trying to do.

Quote

In past discussions we've had on this site, everyone has some way of managing their coins. Some have more basic systems and others more advanced. I read the OP's post as more of an advertisement that didn't seem to understand the reality of how most people collect. The best systems I've seen are from those who are already tightly meshed into the hobby, so they already know first-hand what the needs are. It then follows the basic principles of software development: put a V1 out there based on what you desperately need and then keep listening to your customers.

Agree on putting a v1 out there, that was going to be my initial feedback in this thread but others covered it well enough.

Quote

Keep in mind also that those of us who even have a website discussing our coins are a tiny fraction of the collectors out there, and there wouldn't be much of a business justification to supporting automated copying (which would be problematic given the lack of a standard).

I'm not proposing that OP develop some automated syncing tool for coin data to a person's respective website. I don't imagine OP would initially want to target users who have existing websites, unless it's to convert them to whatever tool is being built. Later on it would be easy to expose an API so a user could use OP's tool as management software and fetch that data into their own website, where they might do other things with it. But would many use it? Probably not.

Quote

Not sure why you keep harping on moving data. Yes. I work in the software industry and I have a patent on auto-generation of code. I personally don't auto-generate my content not because I don't believe in it but because it wasn't cost-effective for that solution. That may change in the near future, but this software is not going to help in the effort, nor is that relevant to the conversation.

I'm not harping on about moving data; you were the one who got hung up on it. I only mentioned "feeding data to your website" as an example of something you might be doing. It was my mistake to assume that you might be doing this, but it was no slight on you, so I'm not sure why you took offense to it.

Re: auto-generation (not sure why auto-generation of code is relevant here?), I was only taking exception to your characterisation that syncing data to your website has no use because you write stories and autogenerating stories is a bad idea (I agree). But there are many genuine reasons you might want to sync data to a website, whether your only content is writing stories or not.


15 hours ago, kirispupis said:

The following is a rough outline of an algorithm that should work.

  1. Create a model to pull the full attribution from a listing. This should look something like "Doobie Bros 1989 Pl 10 num 5; Flipperdoodle SNG 95.2". This shouldn't be overly difficult to train and should have a high success rate because the syntax for the attribution is similar across listings. 
  2. Given the extracted attribution, create an AI model to pick apart the individual attributions. So, for the example above, we would have two. Doobie Bros 1989 Pl 10 Num 5 and Flipperdoodle SNG 95.2. Again, this doesn't need the most complex model.
  3. Normalize the above attributions. This may be the trickiest part so far because the normalized syntax will likely vary by catalog. It may be necessary to create a state machine (though they can probably be generated) to accomplish this. Normalization will be required on the search engine side too, to normalize the input before querying.
  4. You've now accomplished a basic attribution search. One can search for a specific type from Doobie Bros or Flipperdoodle, the input will be normalized, and then you don't even need a full text search. An index (yes, people seem afraid of them now) search may achieve what we need.
  5. Cross correlation is also not difficult. Because we have a collection of attributions for each listing, we just need to first index on the attributions (which we should have already done in the previous step) and then for each attribution find all other attributions that accompany it. We can then calculate a correlation between them. Given a threshold that we should be able to calculate from the data, we can then add a secondary index to pull all the records that match from a correlated attribution. Note that these operations can be done with non-relational databases too, for those who find SQL weird.
  6. Now, we've achieved most of what we were after. Attempting to pick out mis-attributed coins is possible if an image search is also available. Basically, for each attribution we calculate a similarity score of their images. Alternatively, we could create a boolean AI model that simply returns whether the images are similar. For the small number that are not, we could remove the attribution - or more likely index it separately so customers may decide whether to include them. Actually fixing the attribution is more complicated, since that would involve a very sophisticated image search that I don't believe exists today.

In reality, I don't think many are going to be bothered by the mis-attributions as long as the attribution search (based on the normalized listing only) is accurate. At this point the algorithm is returning only those listings that were actually attributed as requested, and any errors are caused by the seller. 

With regard to #1 and #2, I'm not sure I would call multilingual information extraction "not the most complex". Unless you intend to exclude all non-English auction houses, that is.


  • Benefactor
1 hour ago, Kaleun96 said:

I think the issue here would not be the model so much as the data cleaning. For example, you can refer to SNG BnF Cilicia in so many different ways: for SNG you have SNG and Sylloge Nummorum Graecorum; for the first suffix you have Bibliothèque Nationale, BnF, France, Paris, BN (maybe others); and for Cilicia you probably use either Cilicia, 2, Vol. 2, or some combination of them. Not to mention all the different characters you may get between them, like: SNG (BNF) 2 1402, SNG Paris (2) 1302, SNG France 95 (now you have to look up which volume it's from based on the mint), and so on.

You could handle 95-99% of cases with some regex, but this is just for one publication; you'd need other regex for other publications, particularly non-SNG ones. A model may work, provided you've first gone and labelled a large chunk (thousands) of examples to train it on. But you would also need a pipeline that has gone through and identified the mints, so you can look up the SNG volume when it isn't provided. And to do that, you have to be able to tell that the volume is missing, which probably means assuming that a single set of integers following a known SNG publication "phrase" refers to the type number and not the volume number. You could do that by looking for a set of integers that is ended by a full-stop, new-line, or some other character or punctuation. But then you risk missing SNG France 2. 1403, which refers to Vol 2 but Type 1403. Okay, so then you allow for two sets of integers separated by one of a set of known punctuation marks, but doing this could introduce errors where the volume is not provided but the Type number is followed by a full-stop and some other, irrelevant, number. Perhaps this can be solved by knowing the number of volumes for each publication and only allowing the first set of integers (likely a single integer) to represent the volume: if the integer is out of that range then it must be the Type and not the Volume, and any set of integers following it is definitely not related... unless a range of Type numbers is given, such as SNG France 1403-1405. Okay, so we'll allow for hyphens and a select few other characters that may identify a range of Type numbers.

Ok, we've solved that one (for this publication), so let's get back to identifying the volume, when not provided, based on the name of the mint or region. What happens if the description contains more than one mint/region name? What if it's not spelt correctly or uses an alternative spelling? Maybe not such a big issue compared to the previous one, but this is another pipeline/data flow that needs to have been set up, validated, and integrated with the main model.

Anyway, you can see where I'm going with this. In my experience, the difficulty with ML/AI is not the model but the data: validating, handling, and cleaning the data takes up the majority of the time, resources, and effort. Letting a model work out all the difficulties and nuances is probably the "easiest" way, but think how much data you would need to have labelled to get a good representation of all the major publications and their variations in spelling and formatting. I could see it being do-able for, say, just the SNG publications, but since there's little convention in publication attribution, it would take a lot of labelled data to have something that works well generally. I'm sure there's some cutting-edge NLP method that may avoid some of these issues, but then you're talking about hiring some expensive people.

An ML/AI model would be best suited here to pulling the catalog references out of the listing and dividing them into individual attributions. The normalization itself, I agree, would be ill suited to an ML/AI model. Other posts have mentioned regular expressions, but I suspect a form of state machine would be better suited. I've used them on similar projects with success.

As I mentioned earlier, the normalization of catalog records is the most time-consuming phase. As you mention, there are catalogs like SNG that have varying syntaxes. Other catalogs are more uniform, and I'm sure the majority could be handled by a straightforward state machine or regex. Some will need more time, but it's a solvable problem. I've created systems that dealt with phone numbers and addresses, both covering any country in the world. If those problems could be solved, so could this.
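A minimal sketch of what a state machine for this could look like, assuming a hand-written alias table per publication. Everything named here is hypothetical: `ALIASES` covers only the SNG France spellings listed in the quoted post, the canonical name "SNG France" and the dotted number format are arbitrary choices, and the sketch does not try to tell a volume from a type number.

```python
import re
from enum import Enum, auto


class State(Enum):
    NAME = auto()     # reading the words of the publication name
    NUMBERS = auto()  # reading volume/type numbers
    DONE = auto()     # trailing text such as "var." is ignored


# Hypothetical alias table: lower-cased name words -> canonical publication.
ALIASES = {
    ("sng", "bnf"): "SNG France",
    ("sng", "bn"): "SNG France",
    ("sng", "france"): "SNG France",
    ("sng", "paris"): "SNG France",
    ("sylloge", "nummorum", "graecorum", "france"): "SNG France",
}


def normalize(reference):
    """Return a canonical form such as 'SNG France 2.1402', or None."""
    state = State.NAME
    name, numbers = [], []
    # Tokens are runs of letters or runs of digits; punctuation is dropped.
    for token in re.findall(r"[^\W\d_]+|\d+", reference):
        if state is State.NAME:
            if token.isdigit():
                state = State.NUMBERS
                numbers.append(token)
            else:
                name.append(token.lower())
        elif state is State.NUMBERS:
            if token.isdigit():
                numbers.append(token)
            else:
                state = State.DONE
        if state is State.DONE:
            break
    canonical = ALIASES.get(tuple(name))
    if canonical is None or not numbers:
        return None
    return f"{canonical} {'.'.join(numbers)}"


variants = ["SNG (BNF) 2 1402", "SNG Paris (2) 1402",
            "Sylloge Nummorum Graecorum France 2, 1402 var."]
normalized = [normalize(v) for v in variants]
```

All three spellings collapse to the same key, which is what both the indexing side and the search input side need. Anything not in the alias table comes back as None and can be queued for a human to look at.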


  • Benefactor
54 minutes ago, velarfricative said:

With regard to #1 and #2, I'm not sure I would call multilingual information extraction "not the most complex". Unless you intend to exclude all non-English auction houses, that is.

The following is the original listing for my Ptolemy I. The RÉFÉRENCE OUVRAGE line (in bold in the original post) is the catalog reference we would need to pull. As you can see, it's pretty language agnostic.

Type : Tétradrachme
Date : c. 311-305 AC
Nom de l'atelier/ville : Alexandrie, Égypte
Métal : argent
Diamètre : 27,5 mm
Axe des coins : 1 h.
Poids : 14,60 g.
Degré de rareté : R1

"COMMENTAIRES SUR L'ÉTAT DE CONSERVATION : Monnaie centrée avec un petit manque à 6h. Très joli buste d’Alexandre ainsi qu’un revers bien détaillé. Patine grise de collection
RÉFÉRENCE OUVRAGE : Sv.162 (37 ex) - Cop.29 - GC.7750 var. - BMC.- - MP.6
AVERS
Titulature avers : ANÉPIGRAPHE.
Description avers : Buste cornu et diadémé d'Alexandre le Grand sous les traits de Zeus-Ammon à droite, coiffé de la dépouille d'éléphant avec l'égide.
REVERS
Description revers : Athéna Promachos ou Alkidemos marchant à droite, brandissant une javeline de la main droite et tenant un bouclier de la gauche ; dans le champ à gauche, un casque corinthien, un monogramme et un aigle sur un foudre tourné à droite.
Légende revers : ALEXANDROU/ (AU).
Traduction revers : (d’Alexandre).
COMMENTAIRE
Contremarque sur la joue d’Alexandre. Pour cette variété, Svoronos avait répertorié trente-sept exemplaires."

Here's another one from a German seller. It's definitely within the realm of possibility to create an ML model to retrieve these. Splitting them up is also doable.

Grade: schöne Tönung, VF+ | Abbreviations
Catalog: Le Rider Taf. 46, 29; SNG ANS 731–735
Material: Silver
Weight: 2.57 g
Philipp II., 359-336 v. Chr.
AR-Pempte (1/5 Tetradrachme), 318-317 v. Chr. (postum)
Amphipolis
Vs.: Kopf des Apollon mit Tänie n. r.
Rs.: Jüngling reitet n. r., unten seitlich gesehener Schild

Die Existenz dieses merkwürdigen duodezimal-inkompatiblen Nominals hatte finanztechnische Gründe: In Makedonien kursierten damals Geldstücke im schweren, attischen Standard (Alexander-Typen) sowie im leichten, thrakisch-makedonischen Standard (Philippos-Typen). 1 1/5 Philippos-Tetradrachmen ergaben eine Alexander-Tetradrachme.
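For these two listings, even a rule-based baseline gets at the reference line and splits it. To be clear, this is not the ML model described above, just a sketch that assumes the reference sits on a single labelled line. `REFERENCE_LABELS` and `extract_references` are hypothetical names, and the label list covers only the two sellers shown here.

```python
import re

# Labels seen in the two listings above; other sellers would need more entries.
REFERENCE_LABELS = ("RÉFÉRENCE OUVRAGE", "Catalog")


def extract_references(listing):
    """Find the labelled reference line and split it into attributions."""
    labels = "|".join(re.escape(label) for label in REFERENCE_LABELS)
    match = re.search(rf"^(?:{labels})\s*:\s*(.+)$", listing, re.MULTILINE)
    if match is None:
        return []
    # Separators are " - " in the French listing and ";" in the German one.
    parts = re.split(r"\s+-\s+|\s*;\s*", match.group(1))
    return [part.strip() for part in parts if part.strip()]


french = ("Poids : 14,60 g.\n"
          "RÉFÉRENCE OUVRAGE : Sv.162 (37 ex) - Cop.29 - GC.7750 var. - BMC.- - MP.6\n"
          "AVERS")
german = ("Grade: schöne Tönung, VF+\n"
          "Catalog: Le Rider Taf. 46, 29; SNG ANS 731–735\n"
          "Material: Silver")
french_refs = extract_references(french)
german_refs = extract_references(german)
```

The split keeps entries like "BMC.-" and "GC.7750 var." exactly as written; deciding what they mean is a job for the normalization step. Listings with no labelled line at all are where a trained model would have to take over.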

 
